
Ontology Based Information Retrieval

Department of Computer Science, CUSAT

COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY

COCHIN – 682022

2010

Seminar Report

On


Submitted By

Suja.S

In partial fulfillment of the requirement for the award of

Degree of Master of Technology (M.Tech)

In

Software Engineering



ABSTRACT

An ontology is a collection of concepts and their interrelationships, which together provide an abstract view of an application domain. In converting words to meaning, the key issue is to identify appropriate concepts that both describe and identify documents, as well as the language employed in user requests. In an ontology-based information retrieval process, the retrieval system conceptually interprets the meaning of the query, while the underlying domain ontology drives the conceptualisation process. In that way the retrieval process evolves from a query evaluation process into a highly interactive cooperation between a user and the retrieval system, in which the system tries to anticipate the user’s information need and to deliver the relevant content.

Keywords: Ontology, Information Retrieval, Conceptualisation



CONTENTS

1. Introduction

2. Ontology

2.1. Simple Definitions

2.2. Some of the Reasons for Developing an Ontology

2.3. Ontology Development

2.4. Building Ontologies

3. Information Retrieval Process

4. Ontology Based Information Retrieval

5. Limitations of the Conventional Search Systems

6. Why a Meaning Based Approach?

7. Case Study - HAKIA

7.1. QDEX

7.2. Hakia Ontosem

7.3. Semantic Rank Algorithm

8. Conclusion

9. References



1. Introduction

In an ontology-based information retrieval process, the retrieval system conceptually interprets the meaning of the query, while the underlying domain ontology drives the conceptualisation process. In that way the retrieval process evolves from a query evaluation process into a highly interactive cooperation between a user and the retrieval system, in which the system tries to anticipate the user’s information need and to deliver the relevant content.

Using an ontology makes it possible to define concepts and relations that represent knowledge about a particular document in domain-specific terms. In order to express the contents of a document explicitly, it is necessary to create links (associations) between the document and relevant parts of a domain model, i.e. links to those elements of the domain model that are relevant to the contents of the document. Because the ontology-based retrieval model uses a logic-based matching function, it benefits from the perfect precision and recall achieved in a logic-based retrieval structure.



2. Ontology

An ontology, an explicit formal specification of the terms in a domain and the relations among them (Gruber 1993), has been moving from the realm of Artificial-Intelligence laboratories to the desktops of domain experts. Ontologies have become common on the World-Wide Web. The WWW Consortium (W3C) is developing the Resource Description Framework, a language for encoding knowledge on Web pages to make it understandable to electronic agents searching for information. Many disciplines now develop standardized ontologies that domain experts can use to share and annotate information in their fields. Medicine, for example, has produced large, standardized, structured vocabularies such as SNOMED and the semantic network of the Unified Medical Language System.

An ontology defines a common vocabulary for researchers who need to share

information in a domain. It includes machine-interpretable definitions of basic concepts in the

domain and relations among them.

2.1 Simple Definitions

Three simple definitions are given below:

(1) Ontology is a term in philosophy and its meaning is "Theory of existence".

(2) A definition of an ontology in AI community is "An explicit representation of conceptualization".

(3) A definition of an ontology in KB community is "a theory of vocabulary/concepts

used as building artificial systems".

2.2 Some of the reasons for developing an ontology are:

a. To share common understanding of the structure of information among people or

software agents

Sharing common understanding of the structure of information among people or

software agents is one of the more common goals in developing ontologies. For

example, suppose several different Web sites contain medical information or provide

medical e-commerce services. If these Web sites share and publish the same

underlying ontology of the terms they all use, then computer agents can extract and



aggregate information from these different sites. The agents can use this aggregated

information to answer user queries or as input data to other applications.

b. To enable reuse of domain knowledge

Enabling reuse of domain knowledge was one of the driving forces behind the recent

surge in ontology research. For example, models for many different domains need to

represent the notion of time. This representation includes the notions of time intervals,

points in time, relative measures of time, and so on. If one group of researchers

develops such an ontology in detail, others can simply reuse it for their domains.

c. To make domain assumptions explicit

Making explicit domain assumptions underlying an implementation makes it possible

to change these assumptions easily if our knowledge about the domain changes.

Explicit specifications of domain knowledge are useful for new users who must learn

what terms in the domain mean.

d. To separate domain knowledge from the operational knowledge

Separating the domain knowledge from the operational knowledge is another common use of ontologies. We can describe a task of configuring a product from its components according to a required specification, and implement a program that does this configuration independently of the products and components themselves. We can then develop an ontology of PC components and characteristics and apply the algorithm to configure made-to-order PCs. We can also use the same algorithm to configure elevators if we “feed” an elevator component ontology to it.

e. To analyze domain knowledge

Analyzing domain knowledge is possible once a declarative specification of the terms

is available. Formal analysis of terms is extremely valuable when both attempting to

reuse existing ontologies and extending them.

2.3 Ontology Development

Developing an ontology is akin to defining a set of data and their structure for other programs to use. Problem-solving methods, domain-independent applications, and software agents use ontologies, and knowledge bases built from ontologies, as data. Ontology development is different from designing classes and relations in object-oriented programming. Object-oriented programming centers primarily around methods on classes: a programmer makes design decisions based on the operational properties of a class, whereas an ontology designer makes these decisions based on the structural properties of a class. As a result, a class structure and the relations among classes in an ontology are different from the structure for a similar domain in an object-oriented program.

An ontology is a formal explicit description of concepts in a domain of

discourse (classes (sometimes called concepts)), properties of each concept describing

various features and attributes of the concept (slots (sometimes called roles or properties)),

and restrictions on slots (facets (sometimes called role restrictions)). An ontology together

with a set of individual instances of classes constitutes a knowledge base.

Some fundamental rules in ontology design:

There is no one correct way to model a domain; there are always viable alternatives.

Ontology development is necessarily an iterative process.

Concepts in the ontology should be close to objects (physical or logical) and relationships in your domain of interest.

2.3.a. Ontology Components

Concepts - Set of entities within a domain

Relations - Interactions between concepts or concept properties.

Instances - Concrete examples of concepts in the domain

Axioms - Explicit rules to constrain the use of concepts.

Concepts

Concepts are sets of entities within a domain. Classes are the focus of most ontologies. Classes describe concepts in the domain. For example, a class of wines represents all wines. Specific wines are instances of this class. A class can have subclasses that represent concepts that are more specific than the superclass.

Fig. 1. Some concepts in the wine ontology

Relations

Slots describe properties of classes and instances. Some slots express relationships to other individuals; these are the relationships between individual members of the class and other items (e.g., the maker of a wine, representing a relationship between a wine and a winery, and the grape the wine is made from).

Fig. 2. Wine concepts and relationships

Instances

Concrete examples of concepts in the domain



Fig. 3. Concepts, relationships and instances

Axioms

Explicit rules to constrain the use of concepts

Fig. 4. Concepts, relationships, instances and axioms
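The four components above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Concept`, `Instance`, `maker_axiom_holds`, `ExampleWinery` and `ExampleRedWine` are assumptions made for this example and do not come from any ontology tool or from the wine ontology itself.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A class in the ontology, optionally with a more general superclass."""
    name: str
    parent: "Concept | None" = None

    def is_a(self, other):
        """True if this concept is `other` or one of its subclasses."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


@dataclass
class Instance:
    """A concrete individual; `slots` holds its relations and properties."""
    name: str
    concept: Concept
    slots: dict = field(default_factory=dict)


# Concepts and a subclass relation
wine = Concept("Wine")
red_wine = Concept("RedWine", parent=wine)
winery = Concept("Winery")

# Instances (hypothetical individuals) linked by the "maker" relation
example_winery = Instance("ExampleWinery", winery)
bottle = Instance("ExampleRedWine", red_wine,
                  {"maker": example_winery, "color": "red"})


def maker_axiom_holds(instance):
    """Illustrative axiom: every wine must have a maker that is a winery."""
    if not instance.concept.is_a(wine):
        return True  # the axiom only constrains wines
    maker = instance.slots.get("maker")
    return maker is not None and maker.concept.is_a(winery)


print(maker_axiom_holds(bottle))
```

Keeping the axiom as a separate function mirrors the separation in the text: concepts, relations and instances are data, while axioms are rules checked against that data.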

2.4. Building Ontologies

Six basic steps are involved in ontology development:

1) Ontology scope

2) Ontology capture

3) Ontology encoding

4) Ontology integration

5) Ontology evaluation

6) Ontology documentation

Step 1: Scope

1) Identify the range of intended users.

2) Determine the purpose of the ontology.

3) Determine what questions the ontology should answer.

4) Identify the user requirements for systems using the ontology.

Example: sample competency questions for the wine domain:

1) Is Bordeaux a red or white wine?

2) Which characteristics should I consider when choosing a wine?

3) What is the best choice of wine for grilled meat?

4) Which characteristics affect a wine's appropriateness for a dish?

Step 2: Capture

1) Identify the key concepts and relationships in the domain of interest.

2) Produce precise definitions for such concepts and relationships.

3) Identify the terms to refer to such concepts and relationships.

Wine example: Capture

Concept          Term
wine             wine
grape            grape
wine producer    winery
wine color       color
wine body        body
wine flavor      flavor
sugar content    sugar level



Step 3: Encoding

Explicit representation of the concepts captured in Step 2:

1) Meta-ontology: committing to the basic terms that will be used to specify the ontology (e.g. class, entity, relation)

2) Choosing a representation language that is capable of supporting the meta-ontology

3) Coding the ontology

Wine example: Encoding

The meta-ontology could be the frame ontology:

Frame - class

Slots - properties of the class

Facets - constraints on the class properties

Possible representation languages:

1) Resource Description Framework (RDF)

2) Web Ontology Language (OWL)

3) Ontolingua
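To make the frame / slot / facet vocabulary concrete, here is a minimal sketch of a frame-style encoding in plain Python rather than in RDF, OWL or Ontolingua. The facet names (`required`, `allowed_values`) and the allowed values themselves are assumptions chosen for illustration only.

```python
# A frame for the class "Wine": each slot carries facets that constrain it.
# Facet names and allowed values are illustrative assumptions.
wine_frame = {
    "frame": "Wine",
    "slots": {
        "maker": {"required": True},
        "color": {"allowed_values": {"red", "white", "rose"}},
        "sugar_level": {"allowed_values": {"dry", "off-dry", "sweet"}},
    },
}


def violations(frame, values):
    """Return a list of facet violations for one instance's slot values."""
    problems = []
    for slot, facets in frame["slots"].items():
        if slot not in values:
            if facets.get("required", False):
                problems.append(f"missing slot: {slot}")
            continue
        allowed = facets.get("allowed_values")
        if allowed is not None and values[slot] not in allowed:
            problems.append(f"{slot}={values[slot]!r} not allowed")
    return problems


print(violations(wine_frame, {"maker": "ExampleWinery", "color": "red"}))
print(violations(wine_frame, {"color": "blue"}))
```

A real representation language adds much more (namespaces, inheritance of slots, reasoning support); the sketch only shows how facets act as constraints on slot values.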

Step 4: Integration

Integration with existing ontologies:

1) Existing ontologies may be useful for building a new ontology.

2) The integration task is usually non-trivial; the task is easier when the available ontologies make all assumptions explicit.

3) A possible integration strategy is to identify synonyms in a given ontology and extend it where no suitable concept exists.

4) Existing terms may be used as-is, specialized, overridden, etc.



Step 5: Evaluation

General versus specific evaluation criteria:

General criteria - clarity, consistency and reusability; checking general criteria may be automated.

Specific criteria - checking the ontology against its purpose, user requirements and competency questions.

The ontology must be able to answer all given competency questions.

Step 6: Documentation

Documenting the ontology:

1) Effective knowledge sharing requires adequate documentation.

2) All assumptions should be documented, both about the main concepts and about the meta-ontology used for representing the concepts.

3) Documentation facilities in tools are simple but essential.

3. Information Retrieval Process

Information Retrieval (IR) is the science and technology concerned with the effective and efficient retrieval of information from an information repository for subsequent use by interested parties. The central problem in IR is the quest to find, within a large repository, a set of relevant information resources containing the information sought, thereby satisfying an information need that is usually expressed by a user as a query.

There are three basic processes an information retrieval system has to support: the representation of the content of the documents, the representation of the user’s information need, and the comparison of the two representations.

The processes are visualised in Fig. 5. In the figure, squared boxes represent data and rounded boxes represent processes. Representing the documents is usually called the indexing process. The process takes place off-line, that is, the end user of the information retrieval system is not directly involved. The indexing process may include the actual storage of the document in the system, but often documents are only stored partly, for instance only the title and the abstract, plus information about the actual location of the document. The process of representing the user’s information need is often referred to as the query formulation process. The resulting representation is the query.

Fig. 5. Information Retrieval Processes

The comparison of the query against the document representations is called the matching process. The matching process usually results in a ranked list of documents. Users will walk down this document list in search of the information they need. Ranked retrieval will hopefully put the relevant documents towards the top of the ranked list, minimising the time the user has to invest in reading documents. Simple but effective ranking algorithms use the frequency distribution of terms over documents, but also statistics over other information, such as the number of hyperlinks that point to the document. Ranking algorithms based on statistical approaches easily halve the time the user has to spend on reading documents.
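As an illustration of ranking by the frequency distribution of terms over documents, the sketch below scores documents with a plain TF-IDF weighting. TF-IDF is one common choice and is used here as an assumption; the text does not name a specific formula, and the sample documents are invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def rank(query, documents):
    """Return document indices ordered by descending TF-IDF score."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    scored = []
    for idx, tokens in enumerate(doc_tokens):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if df[term]:
                idf = math.log(n_docs / df[term])
                score += tf[term] * idf
        scored.append((score, idx))

    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored]


docs = [
    "wine made from red grape",
    "red car for sale",
    "white wine and red wine tasting",
]
print(rank("wine", docs))
```

Note that the term "red" occurs in every sample document, so its inverse document frequency is zero and it contributes nothing to the ranking; this is the sense in which the distribution of terms over documents, not just raw counts, drives the score.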



4. Ontology Based Information Retrieval

The ontology-based model for information retrieval redefines the task of IR as the extraction, from a given repository of information resources, of those resources r that, given a query q, make the formula O ⊢ r → q valid, where r and q are formulae of the chosen logic, “→” denotes the brand of logical implication formalized by the logic in question, and O is a set of logical sentences called the domain knowledge (ontology). A derivability relationship ⊢ holds between a set of formulae and a formula if there exists a finite sequence of inference rule applications that leads from the set of formulae to that formula.

For ontology-based IR, we have the following interpretation of the basic retrieval model presented in Fig. 6:

1) LRes = KB(O), i.e. a resource is modelled as a set of relation instances (facts) from the corresponding knowledge base. This set can be treated as a set of instance assertions; the relations (concepts) of which a fact is asserted to be an instance together constitute the description of the resource.

2) LQuery = Ω(O), i.e. a query is modelled as an ontology-based query Q(O); the intuitive meaning of this choice is that all resources represented by facts retrieved for query Q(O), i.e. the set of facts F(Q(O)), should be retrieved.

3) IR = I(O) ⊆ LRes, i.e. a repository (collection) of information resources represents the set of all concept instantiations.

4) M(I(O), Q(O)), the matching function between the repository and the given query, is implemented through logical inference defined by the logical language used for representing the ontology O.

Fig. 6. Basic Ontology Based Retrieval Model

Because the ontology-based retrieval model uses a logic-based matching function, it benefits from the perfect precision and recall achieved in a logic-based retrieval structure. The document nodes represent the instances from the knowledge base of the domain ontology (i.e. information resources from the ontology-based information repository), dk ∈ I(O). It is common to assume a one-to-one correspondence between documents and texts. This means that the dependency is complete: a text node is observed (tl = true) exactly when its parent document is observed (dl = true). Moreover, an ontology operates on the conceptual level, where the differences between a document and its text are abstracted away. Therefore, we consider document and text representation nodes as identical. They are instances from the ontology: il ∈ I(O).

Fig. 7. Basic inference network for ontology-based IR

The concept representation nodes represent semantic interpretations of the instances. These interpretations are defined through ontology relations, i.e. relation instances, rk ∈ KB(O). The representation concept nodes correspond to the expression of an information need. Since in ontology-based IR these primitive concepts are relation instances, query concepts (ci) have the same meaning as the representation concepts, so that each query concept has exactly one parent. According to the discussion presented in the previous section, the relevance of a request should be established on the level of the meaning of a query.

The retrieval mechanism M is realized as an inference process that uses several ontology axioms in order to evaluate a query. Since each axiom depicts a different representation of a query, a cascade of dependent axioms can build a network of intermediate query nodes (i.e. qi ∈ Ω(O)). For example, in the Institute example, the query for a researcher who works on topic X can be represented using two queries: (1) a query for a researcher who works in a project and (2) a query for projects about X. This means that the user’s information need can be better represented by using such a decomposition. However, a variety of relationships may exist among the axioms (rules) in an ontology. Indeed, the conclusion of one rule may act as a condition of other rules, and different rules may share common conditions. One can imagine a process in which all possible decompositions of a set of axioms are carried out. In this way we obtain a list of elementary query representations whose validity should be checked against the concrete knowledge base in a retrieval process.

A query returns a set of concept instances as an answer, but the relevance of these answers is defined on the level of the relation instances. The reason is that a concept instance is treated as an identifier of an object, whereas a relation instance represents a property of that object whose relevance for the query can be determined.
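The decomposition of the researcher query can be sketched as a tiny forward-chaining inference over relation instances. Everything in this sketch is an assumption made for illustration: the facts, the relation names (`worksIn`, `about`, `researchesTopic`) and the single rule are invented, and a real system would use the inference machinery of its ontology language instead.

```python
# Relation instances (facts) as (subject, relation, object) triples.
facts = {
    ("alice", "worksIn", "projectA"),
    ("bob", "worksIn", "projectB"),
    ("projectA", "about", "ontologies"),
    ("projectB", "about", "databases"),
}


def researches_topic_rule(known):
    """Axiom: worksIn(R, P) and about(P, T) imply researchesTopic(R, T)."""
    derived = set()
    for researcher, rel1, project in known:
        if rel1 != "worksIn":
            continue
        for subject, rel2, topic in known:
            if rel2 == "about" and subject == project:
                derived.add((researcher, "researchesTopic", topic))
    return derived


def closure(base_facts, rules):
    """Apply the rules repeatedly until no new facts can be derived."""
    known = set(base_facts)
    while True:
        new = set()
        for rule in rules:
            new |= rule(known)
        if new <= known:
            return known
        known |= new


def retrieve(topic):
    """Return the researchers for which researchesTopic(R, topic) is derivable."""
    kb = closure(facts, [researches_topic_rule])
    return sorted(s for s, rel, o in kb if rel == "researchesTopic" and o == topic)


print(retrieve("ontologies"))
```

The answer is a set of concept instances (researchers), while the evidence for each answer is the derived relation instance, which matches the distinction drawn in the preceding paragraph.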

5. Limitations of the Conventional Search Systems

Scalability is a Bottleneck

Popularity Ranking is limited

Second Search Problem

Scalability is a Bottleneck

Scalability of a search engine can be defined as the capability to respond to X search requests per second using an index that covers Y Web pages. The numbers X and Y for Google, for example, are somewhere around 3,000 and 8 billion, respectively. Each time a query hits the search engine, results are retrieved from the entire index via Boolean operations and statistical enhancement of the results. This method of search requires Z servers, where Z is a large number (in the thousands) at Google. Most search engines are set in a tight balance between X, Y, and Z. This tight balance restricts performing any advanced analysis on-the-fly over the entire index. In addition, an index system with such a tight balance has no room for semantically rich data.

Popularity Ranking is limited

To improve result relevancy beyond Boolean algebra, most search engines use

popularity scores to enrich indexed data. The simplest ranking method adds one data point

per record and does not severely disturb the scalability. Thus, results appear more relevant.

This seemingly harmless trade-off, however, has serious limits. Popularity algorithms based

on collecting votes (that is, link referrals), cannot collect enough votes to form an opinion

score for every possible query that can be asked against the pages indexed. One common

result is that less‐popular topics are not ranked properly. A second common result concerns

newly emerged Web pages citing highly relevant information. Not around long enough to be

referred to by others, these pages get passed over.

Second Search Problem

An inherent limitation of the inverted index method is that for any given query, search results point to the whole Web document as the destination. This means a second search is needed: the search inside the document. In stark contrast, precision search requires focus: pointing to a section of the text, or preferably to a sentence. Accordingly, precision search requires meaning-based methods that analyze a paragraph or sentence as a whole and display the results accordingly. With such a capability, the need for opening the document (the second search) will be eliminated most of the time. Evolving from "document-level" search to "sentence-level" search is one of the challenges that future search engines must meet.

6. Why a Meaning-Based Approach?

To understand why we need a meaning-based approach for more relevant search results, consider the question: how can an indexing system be limited if it captures every word on a Web page? Reflect on the following analogy. Assume that a Web page contains two circles and two lines as shown in Fig 1. An indexing system will record that this page has two lines and two circles. It can even capture the order in which these objects are lined up. What is missing from this capturing process is the human interpretation of this figure, such as "In bird's-eye view, the figure looks like someone with a large hat is riding a bicycle. Possibly Mexican or Chinese, the person must be in motion." This is the “meaning” that an indexing system, by definition, does not capture. As a result, search precision is severely compromised with the indexing method. A meaning-based method, however, would rightfully bring up the shape in Fig 1 as a possible answer to a query about riding bicycles. Similarly, let's consider a text version of the same analogy. Suppose a Web page has the following sentence: “Pain killers help treat mild headache”.

Word        Order    Document ID
Pain        1        X
Killers     2        X
help        3        X
treat       4        X
mild        5        X
headache    6        X

An indexing system would capture all the words above, their order, and the document ID as shown above. But, as in the previous example, there is more to it than a simple list. For example: Pain killer is a two-word combination term which represents a class of drugs such as Aspirin, Ibuprofen, Tylenol, and many others. Help is a modifying event that functions in this particular context like "may, can, or possibly," as a facilitator. Treat is a medical event related to heal, cure, or relieve. Mild is a modifier. Headache is a medical condition related to the head and includes different forms such as migraine, stiff neck, dullness, etc. The description above reflects the human thought process, which puts all these pieces together to infer the meaning of the sentence and of its parts. For example, in this context the word "killer" is not a murderer, "treat" is not a candy, and "help" is not a handout. Based on the interpretation above, let's consider the following variations:

Aspirin may relieve mild pain in the head.

Tylenol can possibly help migraine.

Ibuprofen may heal headache.

Many equally meaningful variations and inferences can be added to this list. Therefore, the mere collection of words "Pain killers help treat mild headache" is only one particular description of a larger set of information. This is not an exception; it applies to all sentences in natural languages.



Back of the envelope calculation

The sentence "Pain killers help treat mild headache" can be expressed in multiple ways. Here are the numbers of equivalent forms for each part: Pain killers (20), help treat (5), mild (3), headache (5). This means that the total number of combinations is 20 x 5 x 3 x 5 = 1500. By asking "what pain killers help treat mild migraine?" we are actually asking for 1500 different expressions of the same/relevant information.
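The arithmetic above is a product over the number of equivalent forms per part, and the variations themselves can be enumerated the same way. The sketch below reproduces the count of 1500 and then enumerates a much smaller set; the short variant lists are taken only from the words mentioned earlier in this section and are nowhere near the 20 / 5 / 3 / 5 forms assumed in the estimate.

```python
from itertools import product
from math import prod

# Counts of equivalent forms per part, as given in the text.
form_counts = {"pain killers": 20, "help treat": 5, "mild": 3, "headache": 5}
total = prod(form_counts.values())
print(total)

# A tiny illustrative subset of variants (assumed, incomplete lists).
variants = [
    ["pain killers", "aspirin", "ibuprofen", "tylenol"],
    ["help treat", "may relieve", "can possibly help", "may heal"],
    ["mild"],
    ["headache", "migraine", "pain in the head"],
]
expansions = [" ".join(parts) for parts in product(*variants)]
print(len(expansions))
print(expansions[0])
```

Even this small subset yields dozens of surface forms for one sentence, which is why matching on the literal word list alone misses most equivalent expressions.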

Challenges In Ontology Technology

Isolate scalability of the retrieval process from data storage problems.

Invent a storage system that grows only by a new technology

Make Search and acquisition of new data fast

7. Case Study: hakia

Hakia is an ontology-based semantic Internet search engine. The company has invented QDEXing technology, a new infrastructure that serves as an alternative to indexing and uses the SemanticRank algorithm, a solution mix from the disciplines of ontological semantics, fuzzy logic, computational linguistics, and mathematics. Founded in 2004, the company is privately held and based in New York City. Hakia was founded by Riza Berkan, a nuclear scientist by training with a specialization in artificial intelligence and fuzzy logic, and Pentti Kouri, a New York-based economist and venture capitalist. Professor Victor Raskin, a father of ontological semantics and noted international authority in the field of computational linguistics, serves as hakia’s scientific advisor.

The basic promise of the semantic search technology is to improve search

performance and to bring search experience as close as possible to the way humans interact in

real life. A search engine deploying semantic search technology must be able to understand text

and query using principles similar to that of the human brain. hakia's current market positioning is

centered around the capabilities of semantic technology that cannot be easily duplicated by the

conventional systems. For example, searching dynamic content (like news or emerging journal

articles), and searching credible databases represent a problem for popularity‐based search

engines due to lack of statistical sampling.



The search space is composed of documents, which are not restricted to any particular repository. NLP techniques are applied to the documents to extract knowledge. Hakia uses its own ontology to store the extracted knowledge and exploits it during the search process.

The Semantics Behind Hakia

• Ontological Semantics(OntoSem)

• Query Detection And Extraction(QDEX)

• Semantic Rank

7.1. QDEX (Query Detection And Extraction)

The underlying principle of the QDEX method is described by two questions: Given a sentence from a Web page, how many questions can be asked so that this sentence will be a potential answer? And what are these question sequences? A basic premise is that each question sequence found in this way is already equivalent to an intersection set in a conventional index, and therefore represents an answer node. Collecting all cases of the same sequence from other Web pages, then limiting the list to the best answer candidates via competition criteria, completes the QDEX method. With this approach, all the questions that can be asked by a user, and that would find answers in the analyzed Web pages, are already extracted (anticipated) and prepared by the QDEX algorithm before the user even comes to the search engine. This mosaic-like data structure is depicted in Fig. 8.

Fig. 8. QDEX Method



Indexing and QDEX methods differ as shown in Fig. 9, where the index table is on the left and QDEX lists are on the right. In this example, the blue colored QDEX list corresponds to a particular word combination (sequence) and red corresponds to another. How these word combinations are formed is described in the next section.

The first important advantage of this approach is that the task of extracting a result from a

large index on the fly is eliminated, replaced by the task of accessing a single small list (file) that

includes results already prioritized off‐line. Thus, QDEX is a highly distributed system where

each distributed item has no functional dependence on the rest of the system at any given time.

This system results in superb flexibility of data storage and facilitates its scalability.

The second important advantage is the growth phenomenon. An inverted index grows

linearly with the introduction of every new Web page, forever. The QDEX method grows only

with the introduction of new knowledge (unique sequences). If the existing knowledge is repeated

in a newly acquired Web page, it is subject to competition based on credibility criteria. The

QDEX idea poses several challenges that were taken on by the hakia team.

Fig.9. Comparison of Indexing and QDEXing methods
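The two advantages described above (lookup of one small precomputed list, and growth only with new sequences) can be sketched as follows. This is not hakia's algorithm: the way sequences are extracted is a trade secret, so the sketch uses adjacent word pairs (`bigrams`) purely as a hypothetical stand-in, and the page contents are invented.

```python
from collections import defaultdict


def bigrams(text):
    """Hypothetical stand-in for sequence extraction: adjacent word pairs."""
    words = text.lower().split()
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]


def build_sequence_lists(pages, extract=bigrams):
    """Map each extracted sequence to the list of page ids containing it."""
    lists = defaultdict(list)
    for page_id, text in pages.items():
        for sequence in set(extract(text)):
            lists[sequence].append(page_id)
    return lists


def lookup(lists, query):
    """Answer a query by opening a single precomputed list."""
    key = " ".join(query.lower().split())
    return lists.get(key, [])


pages = {
    1: "norse mythology aurora",
    2: "norse mythology valkyries",
    3: "norse mythology aurora",  # repeats knowledge already seen on page 1
}
lists = build_sequence_lists(pages)
print(lookup(lists, "Norse Mythology"))
print(len(lists))
```

Page 3 adds entries to existing lists but creates no new list, which illustrates the claim that the number of answer nodes grows with new sequences rather than with every new page. In the real system the lists are additionally capped and ordered by credibility criteria.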

Combinatory Explosion Problem

Let's assume a URL was crawled, its Web page was taken in, and each sentence on that page was analyzed. Suppose we have the following sentence with 8 significant words:

In Norse mythology the polar aurora represents the Ride of the Valkyries to War.

Looking at this sentence, a human might compose several queries: aurora; mythology; Valkyries; Norse Mythology; polar aurora; What is aurora?; What is Norse?; What is Norse Mythology?; What is polar aurora?; Who were the Valkyries?; In what mythology is aurora represented?; Ride of the Valkyries to War; What does aurora represent in mythology?; etc. In fact, there are about 50 questions/queries that someone could ask against this sentence. Eliminating the noise words from the list above and accounting for repetition yields 35 sequences of words comprising around 120 total words. Compared to the 8 words and their positions an indexing system would store, we have a much larger storage requirement here (about 8 times) for this particular example.

Each sequence in QDEX, however, will stand alone like an answer node, and will repeat in other places, so the total storage can be reduced from 8 times to a smaller factor. For example, the sequence "Norse Mythology" will have its own competing list, which will contain occurrences in other documents, so it will not increase the counter linearly. Therefore, all the repeating sequences (the permutations of these words that are statistically meaningful) will be retained in the system, and eventually the storage requirement will be manageable. The problem is how to create the 35 sequences from the given sentence. The combinatory explosion phenomenon starts here. Mathematically speaking, if we take 8 words and try all permutations, the total number of sequences will be around 500,000. The question is how to reduce 500,000 sequences to 35 meaningful ones, just as the human brain does. To any linguist, it would be clear that employing grammar or syntactic algorithms would not even get close to solving the problem. It has to be a meaning-based approach, either in its full-scale deployment such as OntoSem, or in its approximated versions using fuzzy logic.
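The text does not state which counting convention lies behind the figure of around 500,000. As a rough check under one assumed convention (ordered selections of 1 to 8 distinct words, without repetition), the size of the permutation space can be computed as below. Other conventions give different totals; under any of them the space is thousands of times larger than the 35 sequences a human would produce.

```python
from math import perm


def ordered_selections(n_words):
    """Count ordered sequences of length 1..n drawn from n distinct words."""
    return sum(perm(n_words, k) for k in range(1, n_words + 1))


count = ordered_selections(8)
print(count)
print(count // 35)  # how many candidate sequences per meaningful one
```

Whatever the exact convention, the point of the calculation is the ratio: exhaustive enumeration is not a workable way to find the meaningful sequences.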

Breeding Sequences

Given a sentence such as the one above, the generation of sequences by a computer

algorithm that mimics human‐like sequence‐generation is called "breeding" at hakia.

Fig.10. QDEX Sequence Breeding versus the human brain



QDEX sequence breeding follows a process analogous to how the human brain learns. The human brain understands the meaning of a text and stores a reasonable number of knowledge bits, which we call “answer nodes.” The sequence breeder of hakia comes up with a comparable number of meaningful sequences (answer nodes) without getting lost in the permutation space.

To accomplish the first-stage approximation to breeding, hakia has deployed a fuzzy logic solution. This approach uses a simple word identification method called "bag-of-words" (BOW), and uses fuzzy rules [5] to model the basic principles of meaning. hakia's current breeder can estimate up to 80% of the human-like sequences with an overshoot rate between 7 and 15. During the QDEX process, the overshooting sequences that are not very useful eventually drop out of the system if they do not repeat somewhere else. Deletion of QDEX sequences resembles the "forgetting" process in the human brain, which, according to one theory, happens during sleep through dreaming. Regardless of the validity of this theory, we call this process "QDEX dreaming."
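The pruning step described above can be sketched as a simple frequency filter. This is only an illustration: the function name `dream`, the `min_repeats` threshold, and the representation of a sequence as a tuple of words are assumptions made here, and hakia's actual breeding and deletion rules are not public.

```python
from collections import Counter
from typing import Iterable


def dream(
    candidates_per_doc: Iterable[Iterable[tuple[str, ...]]],
    min_repeats: int = 2,
) -> set[tuple[str, ...]]:
    """Keep only the candidate sequences that recur in at least `min_repeats` documents.

    `min_repeats` is a hypothetical threshold standing in for "repeats somewhere else".
    """
    counts: Counter = Counter()
    for doc_candidates in candidates_per_doc:
        # Count each sequence at most once per document.
        counts.update(set(doc_candidates))
    return {seq for seq, n in counts.items() if n >= min_repeats}


if __name__ == "__main__":
    docs = [
        [("norse", "mythology"), ("mythology", "the")],
        [("norse", "mythology"), ("viking", "age")],
    ]
    print(dream(docs))  # only ("norse", "mythology") survives
```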

The second stage of breeding involves OntoSem, and has two primary functions:

(1) validate the meaningfulness of QDEX sequences, (2) create equivalent forms of the same

sequences using different words.

Fig.11. Refinement of the QDEX sequences

The two primary functions, Breeding and Ontology, represent the critical difference between a

meaning‐based search engine and its index‐based cousins. Hakia’s intricate breeding process is a

trade secret.



QDEX Files

Each QDEX sequence, like "Norse Mythology," is stored in a file that contains the paragraph ID of the originating source. Since each QDEX file represents an answer node, the entries inside are limited to a reasonable number. If the analysis of a new document results in the same QDEX sequence, then the paragraph ID is inserted in the QDEX file at a position based on several "credibility" and "quality" criteria. These criteria are also hakia's trade secret. If the file is full, then the lowest-position entry drops out to make room for another entry that is deemed to be better. Consequently, you end up with a QDEX file for Norse Mythology that contains the best set of paragraph IDs refined from millions of documents.
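The behaviour of a bounded QDEX file can be sketched as a fixed-capacity ranked list. Since the credibility and quality criteria are not disclosed, the sketch below assumes a single numeric `score` per paragraph; the class name `QdexFile`, the `capacity` default, and the score itself are hypothetical.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class QdexFile:
    sequence: str
    capacity: int = 3  # hypothetical limit on entries per answer node
    # Entries are (-score, paragraph_id) tuples kept sorted, best score first.
    entries: list = field(default_factory=list)

    def insert(self, paragraph_id: str, score: float) -> bool:
        """Insert a paragraph by rank; return False if it did not make the cut."""
        entry = (-score, paragraph_id)
        bisect.insort(self.entries, entry)
        if len(self.entries) > self.capacity:
            dropped = self.entries.pop()  # the lowest-position entry drops out
            return dropped != entry
        return True

    def paragraph_ids(self) -> list:
        return [pid for _, pid in self.entries]


if __name__ == "__main__":
    f = QdexFile("Norse Mythology", capacity=2)
    f.insert("p1", 0.5)
    f.insert("p2", 0.9)
    f.insert("p3", 0.7)       # p1 drops out
    print(f.paragraph_ids())  # ['p2', 'p3']
```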

QDEX Storage

QDEX's unusual storage problem was solved by a unique approach. First, QDEX files are named by alphabetizing the words of the QDEX sequence. Second, the files are stored in a vast array of servers via hash+mode coding, which ensures even distribution and immediate destination coding based on file name. Because the QDEX method is data I/O intensive, a special three-way redundancy architecture was designed, called KUTS. The KUTS system has highly distributed back-end storage devices that constantly write new data on disks. A separate group of storage devices is designated to serve only, so that it does not interfere with the disk-write process. At scheduled intervals, the back-end is image-copied to the service storage devices.
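The naming and placement scheme can be sketched as follows. The sketch assumes that the coding step amounts to hashing the file name and reducing the result modulo the number of servers; that reading, the use of SHA-256, the underscore separator, and the function names are all assumptions made for illustration, not hakia's documented design.

```python
import hashlib


def qdex_file_name(sequence: str) -> str:
    """Name a QDEX file by alphabetizing the words of its sequence."""
    words = sorted(sequence.lower().split())
    return "_".join(words)


def server_for(file_name: str, num_servers: int) -> int:
    """Map a file name to a server index using a stable hash and a modulo."""
    digest = hashlib.sha256(file_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


if __name__ == "__main__":
    name = qdex_file_name("Norse Mythology")
    print(name)  # mythology_norse
    print(server_for(name, num_servers=16))
```

Because the server index is computed from the file name alone, a query that yields the same alphabetized sequence can be routed to the right server without a lookup table.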

7.2. Hakia Ontosem

“Ontological Semantics (OntoSem) is a formal and comprehensive linguistic theory of meaning in natural language. As such, it bears significantly on philosophy of language, mathematical logic, and cognitive science. It is a rapidly growing area of intense academic research and of active practical implementations, of which hakia.com is by far the leading one. OntoSem offers an advanced methodology and technology for natural language processing, the only one of its kind, so far, to access the full meaning of the text it handles. As such, it is also a set of well-developed and constantly improving resources, including a language-independent ontology of thousands of interrelated concepts; an ontology-based English lexicon of 100,000 word senses, and counting (plus, the lexicons for several other languages under construction); and an ontological parser which ‘translates’ every sentence of the text into its text meaning representation, approximating the complete understanding of the sentence by the native speaker.”

“hakia OntoSem is a modular, extensible, and adaptable toolset for government, business, education and research applications to enable developers to use the meaning of language and not just text string matching for applications, including:

• information retrieval, analysis, and distribution

• text summarization

• information assurance and security

• machine translation

• ontology support

• terminology standardization

• supply chain automation”

7.3. Semantic Rank Algorithm

Having solved retrieval scalability via a single file request, hakia adds a layer for on-the-fly analysis called SemanticRank, which is an independent module. SemanticRank analysis refines the results by locating the best sentences in the paragraph that match the query. This process uses syntactic, ontological, and morphological solutions.

Fig.12. SemanticRank Algorithm
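The idea of picking the best-matching sentence from a retrieved paragraph can be illustrated with a deliberately crude stand-in. The sketch below scores sentences by plain word overlap with the query; this is an assumption made for illustration only and does not reproduce the syntactic, ontological, and morphological analysis that SemanticRank is described as using. The function name `best_sentence` is hypothetical.

```python
import re


def best_sentence(paragraph: str, query: str) -> str:
    """Return the sentence sharing the most words with the query (first one wins ties)."""
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

    def overlap(sentence: str) -> int:
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        return len(words & query_terms)

    return max(sentences, key=overlap) if sentences else ""


if __name__ == "__main__":
    text = (
        "Enron was an energy company. "
        "Enron collapsed because of accounting fraud. "
        "It was based in Houston."
    )
    print(best_sentence(text, "Enron collapse fraud"))
```

Note that plain overlap does not match "collapse" with "collapsed"; handling such variants is where morphological processing would come in.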



A typical example is shown below. The query was "why did Enron collapse?" It is immediately visible that a typical result coming from hakia is focused on a sentence (highlighted) chosen by an on-the-fly NLP process that selects the best sentence. The result is maintained as one uninterrupted piece of text. The same query to Google brings results in the following format:

In Google, there is no sentence selection due to the lack of semantics, and the text snippet is often interrupted by an ellipsis.

Among the many capabilities of the SemanticRank algorithm, a key feature is query type detection. For example, if the query is a "why" question, then the SemanticRank algorithm can formulate a special QDEX file request that adds sense information such as "reason, cause." Accordingly, the question "why did Enron collapse?" is converted into a QDEX file request that has the correct sense. Hakia's SemanticRank algorithm can identify more than 60 different question types, almost all the possible types in the English language. Question type (and sense) detection also allows requesting more than a single QDEX file (fall-backs) to retrieve more than one answer node, each of which is deemed applicable.
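Query type detection with fall-backs can be sketched as a lookup on the leading question word. Only the mapping from "why" to "reason, cause" comes from the text; the other table entries, the stop-word list, the request string format, and the function name `build_requests` are hypothetical and stand in for the 60-plus question types mentioned above.

```python
import re

# Only "why" -> ("reason", "cause") is taken from the text; the rest are illustrative guesses.
QUESTION_SENSES = {
    "why": ("reason", "cause"),
    "when": ("time", "date"),
    "where": ("place", "location"),
}

# Hypothetical stop words dropped before building the file name.
STOPWORDS = {"a", "an", "the", "did", "do", "does", "is", "was"}


def build_requests(query: str) -> list[str]:
    """Return QDEX file requests: a sense-annotated one first, then a plain fall-back."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    qtype = words[0] if words and words[0] in QUESTION_SENSES else None
    terms = sorted(w for w in words if w != qtype and w not in STOPWORDS)
    base = "_".join(terms)  # alphabetized, as QDEX file names are
    if qtype is None:
        return [base]
    senses = "+".join(QUESTION_SENSES[qtype])
    return [f"{base}[{senses}]", base]


if __name__ == "__main__":
    print(build_requests("why did Enron collapse?"))
    # ['collapse_enron[reason+cause]', 'collapse_enron']
```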



8. Conclusion

Ontology-based IR is a growing technology that relies on semantics. At present, ontology development is not an open-source activity. Hakia intends to release the QDEX system under an open-source agreement in the future. The objective is to encourage scientists and developers to improve the QDEX method in their own way. Ontologies with semantics are used in the development of Semantic Web technology. Though it faces many practical challenges, it is expected to be fully developed in the coming years.



9. References

[1]. David Vallet, Miriam Fernández, and Pablo Castells - An Ontology-Based Information Retrieval Model - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.66.4633&rep

[2]. Natalya F. Noy and Deborah L. McGuinness - A Guide to Creating Your First Ontology - http://protege.stanford.edu/publications/ontology_development/ontology101-noy-mcguinness.html

[3]. M.Sc. Nenad Stojanovic - Ontology-based Information Retrieval: Methods and Tools for Cooperative Query Answering - White paper_ semantic _search_ technology.pdf
