Ontology Based Information Retrieval
Department of Computer Science, CUSAT 1
COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
COCHIN – 682022
2010
Seminar Report
On
Ontology Based Information Retrieval
Submitted By
Suja.S
In partial fulfillment of the requirement for the award of
Degree of Master of Technology (M.Tech)
In
Software Engineering
ABSTRACT
An ontology is a collection of concepts and their interrelationships that provides an abstract view of an application domain. In converting words to meaning, the key issue is to identify appropriate concepts that both describe and identify documents, as well as the language employed in user requests. In an ontology-based information retrieval process, the retrieval system conceptually interprets the meaning of the query, while the underlying domain ontology drives the conceptualisation process. In that way the retrieval process evolves from a query evaluation process into a highly interactive cooperation between a user and the retrieval system, in which the system tries to anticipate the user's information need and to deliver the relevant content.
Keywords: Ontology, Information Retrieval, Conceptualisation
CONTENTS
1. Introduction
2. Ontology
2.1. Simple Definitions
2.2. Some of the reasons for developing an ontology
2.3. Ontology Development
2.4. Building Ontologies
3. Information Retrieval Process
4. Ontology Based Information Retrieval
5. Limitations of the Conventional Search Systems
6. Why a Meaning Based Approach?
7. Case Study - HAKIA
7.1. QDEX
7.2. Hakia Ontosem
7.3. Semantic Rank Algorithm
8. Conclusion
9. References
1. Introduction
In an ontology-based information retrieval process, the retrieval system
conceptually interprets the meaning of the query, while the underlying domain ontology
drives the conceptualisation process. In that way the retrieval process evolves from a query
evaluation process into a highly interactive cooperation between a user and the retrieval
system, in which the system tries to anticipate the user's information need and to deliver the
relevant content.
Using an ontology makes it possible to define concepts and relations that represent
knowledge about a particular document in domain-specific terms. In order to express the
contents of a document explicitly, it is necessary to create links (associations) between the
document and the relevant parts of a domain model, i.e. links to those elements of the domain
model that are relevant to the contents of the document. Because the ontology-based retrieval
model uses a logic-based matching function, it benefits from the perfect precision and recall
achieved in a logic-based retrieval structure.
2. Ontology
The development of ontologies, that is, explicit formal specifications of the terms
in a domain and the relations among them (Gruber 1993), has been moving from the realm of
Artificial-Intelligence laboratories to the desktops of domain experts. Ontologies have become
common on the World-Wide Web. The WWW Consortium (W3C) is developing the Resource
Description Framework, a language for encoding knowledge on Web pages to make it
understandable to electronic agents searching for information. Many disciplines now develop
standardized ontologies that domain experts can use to share and annotate information in their
fields. Medicine, for example, has produced large, standardized, structured vocabularies such as
SNOMED and the semantic network of the Unified Medical Language System.
An ontology defines a common vocabulary for researchers who need to share
information in a domain. It includes machine-interpretable definitions of basic concepts in the
domain and relations among them.
2.1 Simple Definitions
Three simple definitions are given below:
(1) Ontology is a term in philosophy, and its meaning is "Theory of existence".
(2) A definition of an ontology in the AI community is "An explicit representation of conceptualization".
(3) A definition of an ontology in the KB community is "a theory of vocabulary/concepts
used as building artificial systems".
2.2 Some of the reasons for developing an ontology are:
a. To share common understanding of the structure of information among people or
software agents
Sharing common understanding of the structure of information among people or
software agents is one of the more common goals in developing ontologies. For
example, suppose several different Web sites contain medical information or provide
medical e-commerce services. If these Web sites share and publish the same
underlying ontology of the terms they all use, then computer agents can extract and
aggregate information from these different sites. The agents can use this aggregated
information to answer user queries or as input data to other applications.
b. To enable reuse of domain knowledge
Enabling reuse of domain knowledge was one of the driving forces behind the recent
surge in ontology research. For example, models for many different domains need to
represent the notion of time. This representation includes the notions of time intervals,
points in time, relative measures of time, and so on. If one group of researchers
develops such an ontology in detail, others can simply reuse it for their domains.
c. To make domain assumptions explicit
Making explicit domain assumptions underlying an implementation makes it possible
to change these assumptions easily if our knowledge about the domain changes.
Explicit specifications of domain knowledge are also useful for new users who must learn
what the terms in the domain mean.
d. To separate domain knowledge from the operational knowledge
Separating the domain knowledge from the operational knowledge is another common
use of ontologies. For example, we can describe a task of configuring a product from its
components according to a required specification and implement a program that does
this configuration independently of the products and components themselves. We can
then develop an ontology of PC components and characteristics and apply the
algorithm to configure made-to-order PCs. We can also use the same algorithm to
configure elevators if we "feed" an elevator component ontology to it.
e. To analyze domain knowledge
Analyzing domain knowledge is possible once a declarative specification of the terms
is available. Formal analysis of terms is extremely valuable when both attempting to
reuse existing ontologies and extending them.
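Reason (d) above can be illustrated with a minimal sketch, assuming a deliberately simplified notion of a component ontology (a mapping from component categories to options). The function `configure` and both example ontologies are hypothetical and serve only to show that the same operational code can run against different domain knowledge.

```python
from typing import Dict, List

# A "component ontology" here is just a mapping from a component category
# to the options available in that category, each with numeric characteristics.
# Both ontologies below are made-up illustrations, not real catalogues.
Ontology = Dict[str, List[Dict[str, object]]]

PC_ONTOLOGY: Ontology = {
    "cpu": [{"name": "cpu-basic", "speed": 2}, {"name": "cpu-fast", "speed": 4}],
    "disk": [{"name": "disk-small", "capacity": 256},
             {"name": "disk-large", "capacity": 1024}],
}

ELEVATOR_ONTOLOGY: Ontology = {
    "motor": [{"name": "motor-light", "load": 400},
              {"name": "motor-heavy", "load": 1000}],
    "cabin": [{"name": "cabin-4", "persons": 4}, {"name": "cabin-12", "persons": 12}],
}


def configure(ontology: Ontology, spec: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Pick, per category, the first option meeting every minimum in the spec.

    The function knows nothing about PCs or elevators: all domain knowledge
    comes from the ontology argument.
    """
    result = {}
    for category, options in ontology.items():
        minimums = spec.get(category, {})
        for option in options:
            if all(option.get(key, 0) >= value for key, value in minimums.items()):
                result[category] = option["name"]
                break
        else:
            raise ValueError(f"no {category} satisfies {minimums}")
    return result


pc = configure(PC_ONTOLOGY, {"cpu": {"speed": 3}, "disk": {"capacity": 200}})
lift = configure(ELEVATOR_ONTOLOGY, {"motor": {"load": 800}})
print(pc)    # {'cpu': 'cpu-fast', 'disk': 'disk-small'}
print(lift)  # {'motor': 'motor-heavy', 'cabin': 'cabin-4'}
```

Swapping the ontology argument is the only change needed to move from PCs to elevators, which is the point of keeping domain knowledge out of the algorithm.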
2.3 Ontology Development
Developing an ontology is akin to defining a set of data and their structure for other
programs to use. Problem-solving methods, domain-independent applications, and software
agents use ontologies and knowledge bases built from ontologies as data. Ontology
development is different from designing classes and relations in object-oriented
programming. Object-oriented programming centers primarily around methods on classes: a
programmer makes design decisions based on the operational properties of a class, whereas
an ontology designer makes these decisions based on the structural properties of a class. As a
result, a class structure and relations among classes in an ontology are different from the
structure for a similar domain in an object-oriented program.
An ontology is a formal explicit description of concepts in a domain of
discourse (classes (sometimes called concepts)), properties of each concept describing
various features and attributes of the concept (slots (sometimes called roles or properties)),
and restrictions on slots (facets (sometimes called role restrictions)). An ontology together
with a set of individual instances of classes constitutes a knowledge base.
Some fundamental rules in ontology design:
• There is no one correct way to model a domain; there are always viable alternatives.
• Ontology development is necessarily an iterative process.
• Concepts in the ontology should be close to objects (physical or logical) and
relationships in your domain of interest.
2.3.a.Ontology Components
Concepts - Set of entities within a domain
Relations - Interactions between concepts or concept properties.
Instances - Concrete examples of concepts in the domain
Axioms - Explicit rules to constrain the use of concepts.
Concepts
Concepts are the sets of entities within a domain. Classes are the focus of most
ontologies. Classes describe concepts in the domain. For example, a class of wines represents
all wines. Specific wines are instances of this class. A class can have subclasses that represent
concepts that are more specific than the superclass.
Fig.1.Some Concepts in the Wine ontology
Relations
Slots describe properties of classes and instances. Some slots express relationships
to other individuals: these are the relationships between individual members of the class and
other items (e.g., the maker of a wine, representing a relationship between a wine and a winery,
and the grape the wine is made from).
Fig.2.Wine concepts and relationship
Instances
Concrete examples of concepts in the domain
Fig.3. Concepts,Relationships and Instances
Axioms
Explicit rules to constrain the use of concepts
Fig.4.Concepts,Relationships,Instances and Axioms
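The four components can be sketched in a few lines of code. This is only an illustration under simplifying assumptions: the dictionaries, the individual names `ChateauExampleRed` and `ExampleEstate`, and the choice of domain/range constraints as the only kind of axiom are all hypothetical.

```python
# Concepts: class name -> superclass (None for a root class).
SUBCLASS_OF = {
    "Wine": None,
    "RedWine": "Wine",
    "WhiteWine": "Wine",
    "Winery": None,
    "Grape": None,
}

# Instances: individual -> class it belongs to.
INSTANCE_OF = {
    "ChateauExampleRed": "RedWine",   # hypothetical individual
    "ExampleEstate": "Winery",        # hypothetical individual
    "Merlot": "Grape",
}

# Relations: (subject, relation, object) triples between individuals.
RELATIONS = {
    ("ChateauExampleRed", "maker", "ExampleEstate"),
    ("ChateauExampleRed", "madeFromGrape", "Merlot"),
}

# Axioms: each relation constrains the classes of its subject and object.
AXIOMS = {
    "maker": ("Wine", "Winery"),
    "madeFromGrape": ("Wine", "Grape"),
}


def is_a(individual: str, concept: str) -> bool:
    """True if the individual belongs to the concept or one of its subclasses."""
    current = INSTANCE_OF.get(individual)
    while current is not None:
        if current == concept:
            return True
        current = SUBCLASS_OF.get(current)
    return False


def violations() -> list:
    """Return the relation triples that break a domain/range axiom."""
    return [
        (s, r, o)
        for (s, r, o) in sorted(RELATIONS)
        if not (is_a(s, AXIOMS[r][0]) and is_a(o, AXIOMS[r][1]))
    ]


print(is_a("ChateauExampleRed", "Wine"))  # True
print(violations())                       # []
```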
2.4. Building Ontologies
Six basic steps are involved in ontology development:
1) Ontology scope
2) Ontology capture
3) Ontology encoding
4) Ontology integration
5) Ontology evaluation
6) Ontology documentation
Step 1: Scope
1) Identify the range of intended users
2) Determine the purpose of the ontology
3) Determine what questions the ontology should answer (competency questions)
4) Identify the user requirements for systems using the ontology
Example: Sample competency questions for the wine domain:
1) Is Bordeaux a red or white wine?
2) Which characteristics should I consider when choosing a wine?
3) What is the best choice of wine for grilled meat?
4) Which characteristics affect a wine's appropriateness for a dish?
Step 2: Capture
1) Identify the key concepts and relationships in the domain of interest
2) Produce precise definitions for such concepts and relationships
3) Identify the terms to refer to such concepts and relationships
Wine Example: Capture
Concept          Term
wine             wine
grape            grape
wine producer    winery
wine color       color
wine body        body
wine flavor      flavor
sugar content    sugar level
Step 3: Encoding
Explicit representation of the concepts captured in Step 2:
1) Meta-ontology: committing to the basic terms that will be used to specify the ontology
(e.g. class, entity, relation)
2) Choosing a representation language that is capable of supporting the meta-ontology
3) Coding the ontology
Wine Example: Encoding
The meta-ontology could be the frame ontology:
Frame - class
Slots - properties of the class
Facets - constraints on the class properties
Possible Representation Languages:
1) Resource Description Framework (RDF)
2) Web Ontology Language (OWL)
3) Ontolingua
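As a rough illustration of the frame meta-ontology above, the following sketch encodes a Wine frame with slots and facets directly in code rather than in RDF, OWL, or Ontolingua. The classes `Slot` and `Frame`, the three facets chosen (value type, allowed values, maximum cardinality), and the slot values are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class Slot:
    """A property of a class, with facets restricting its values."""
    value_type: type                  # facet: type of each value
    allowed: Tuple[Any, ...] = ()     # facet: permitted values (empty = any)
    max_cardinality: int = 1          # facet: maximum number of values


@dataclass
class Frame:
    """A class in the frame meta-ontology: a name plus its slots."""
    name: str
    slots: Dict[str, Slot] = field(default_factory=dict)

    def check(self, instance: Dict[str, list]) -> list:
        """Return human-readable facet violations for one instance."""
        problems = []
        for slot_name, values in instance.items():
            slot = self.slots.get(slot_name)
            if slot is None:
                problems.append(f"unknown slot: {slot_name}")
                continue
            if len(values) > slot.max_cardinality:
                problems.append(f"too many values for {slot_name}")
            for value in values:
                if not isinstance(value, slot.value_type):
                    problems.append(f"wrong type for {slot_name}: {value!r}")
                elif slot.allowed and value not in slot.allowed:
                    problems.append(f"value not allowed for {slot_name}: {value!r}")
        return problems


wine = Frame("Wine", {
    "color": Slot(str, allowed=("red", "white", "rose")),
    "body": Slot(str, allowed=("light", "medium", "full")),
    "grape": Slot(str, max_cardinality=3),
})

print(wine.check({"color": ["red"], "body": ["full"]}))
# []
print(wine.check({"color": ["blue"], "body": ["light", "full"]}))
# ["value not allowed for color: 'blue'", 'too many values for body']
```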
Step 4: Integration
Integration with existing ontologies:
1) Existing ontologies may be useful for building a new ontology
2) The integration task is usually non-trivial; the task is easier when the available ontologies
make all assumptions explicit
3) A possible integration strategy: identify synonyms in a given ontology and extend it where no
suitable concept exists
4) Existing terms may be used as-is, specialized, overridden, etc.
Step 5: Evaluation
General versus specific evaluation criteria:
General criteria - clarity, consistency and reusability; checking general criteria may be automated
Specific criteria - checking the ontology against its purpose, user requirements and competency questions
The ontology must be able to answer all given competency questions.
Step 6: Documentation
Documenting the ontology:
1) Effective knowledge sharing requires adequate documentation
2) All assumptions should be documented, both about the main concepts and about the meta-ontology
used for representing the concepts
3) Documentation facilities in tools are simple but essential
3. Information Retrieval Process
Information Retrieval (IR) is the science and technology concerned with the
effective and efficient retrieval of information from an information repository for
subsequent use by interested parties. The central problem in IR is the quest to find, within a
large repository, a set of relevant information resources that contain the information sought,
thereby satisfying an information need that a user usually expresses as a query.
There are three basic processes an information retrieval system has to support:
the representation of the content of the documents, the representation of the user's
information need, and the comparison of the two representations.
The processes are visualised in Fig. 5. In the figure, squared boxes
represent data and rounded boxes represent processes. Representing the documents is usually
called the indexing process. The process takes place off-line, that is, the end user of the
information retrieval system is not directly involved. The indexing process may include the
actual storage of the document in the system, but often documents are only stored partly, for
instance only the title and the abstract, plus information about the actual location of the
document. The process of representing the user's information need is often referred to as the
query formulation process. The resulting representation is the query.
Fig.5.Information Retrieval Processes
The comparison of the query against the document representations is called the
matching process. The matching process usually results in a ranked list of documents. Users
will walk down this document list in search of the information they need. Ranked retrieval
will hopefully put the relevant documents towards the top of the ranked list, minimising the
time the user has to invest in reading documents. Simple but effective ranking algorithms
use the frequency distribution of terms over documents, but also statistics over other
information, such as the number of hyperlinks that point to the document. Ranking
algorithms based on statistical approaches easily halve the time the user has to spend on
reading documents.
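The indexing and matching processes described above can be sketched as follows. This is a minimal illustration, assuming a plain tf-idf weighting as the statistical ranking function; the toy documents and the function names `build_index` and `search` are made up for the example.

```python
import math
from collections import Counter, defaultdict

# A toy document collection; the texts are made up for illustration.
DOCS = {
    "d1": "red wine goes well with grilled meat",
    "d2": "white wine goes well with fish",
    "d3": "grilled fish and grilled vegetables",
}


def build_index(docs):
    """Indexing (off-line): map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return index


def search(index, n_docs, query):
    """Matching: score documents by summed tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Ranked list: best score first, ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_index(DOCS)
ranking = search(index, len(DOCS), "grilled meat")
print([doc_id for doc_id, _ in ranking])  # ['d1', 'd3']
```

Document d1 ranks first because it contains the rarer term "meat" as well as "grilled", even though d3 mentions "grilled" twice.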
4. Ontology Based Information Retrieval
The ontology-based model for information retrieval redefines the task of IR as the
extraction, from a given repository of information resources, of those resources r that, given a
query q, make the formula O ⊢ r → q valid. Here r and q are formulae of the chosen logic,
"→" denotes the brand of logical implication formalized by the logic in question, and O is
a set of logical sentences called the domain knowledge (ontology). The derivability
relationship ⊢ holds between a set of formulae and a formula if there exists a finite sequence
of applications of the inference rules that leads from the set of formulae to that formula.
For ontology-based IR, we have the following interpretation of the basic
retrieval model presented in Fig. 6:
1) LRes = KB(O), i.e. a resource is modelled as a set of relation instances (facts) from the
corresponding knowledge base. This set can be treated as a set of instance assertions; the
relations (concepts) of which a fact is asserted to be an instance then together constitute
the description of the resource.
2) LQuery = Ω(O), i.e. a query is modelled as an ontology-based query Q(O); the intuitive
meaning of this choice is that all resources represented by facts retrieved for query Q(O),
i.e. the set of facts F(Q(O)), should be retrieved.
3) IR = I(O) ⊆ LRes, i.e. a repository (collection) of information resources represents a set
of all concept instantiations.
4) M(I(O), Q(O)), the matching function between the repository and the given query, is
implemented through logical inference defined by the logical language used for
representing the ontology O.
Fig.6.Basic Ontology Based Retrieval Model
Because the ontology-based retrieval model uses a logic-based matching function, it
benefits from the perfect precision and recall achieved in a logic-based retrieval structure.
The document nodes represent the instances from the knowledge base of the domain
ontology (i.e. information resources from the ontology-based information repository), dk ∈ I(O).
It is common to assume a one-to-one correspondence between documents and texts. This means
that the dependency is complete: a text node is observed (tl = true) exactly when its parent
document is observed (dl = true). Moreover, an ontology operates on the conceptual level,
where these differences are abstracted. Therefore, we consider document and text
representation nodes as identical. They are instances from the ontology: il ∈ I(O).
Fig.7.Basic inference network for ontology-based IR
The concept representation nodes represent semantic interpretations of the instances.
These interpretations are defined through ontology relations, i.e. relation instances, rk ∈
KB(O). The representation concept nodes correspond to the expression of an information
need. Since in ontology-based IR these primitive concepts are relation instances, query
concepts (ci) have the same meaning as the representation concepts, so that each query
concept has exactly one parent. According to the discussion presented in the previous section,
the relevance of that request should be established on the level of the meaning of a query.
The retrieval mechanism M is realized as an inference process: it uses several ontology
axioms in order to evaluate a query. Since each axiom depicts a different representation of a
query, a cascade of dependent axioms can build a network of intermediate query nodes (i.e.
qi ∈ Ω(O)). For example, in the Institute example, the query for a researcher who works
on topic X can be represented using two queries: (1) a query for a researcher who works
in a project and (2) a query for projects about X. This means that the user's information need
can be better represented by using such a decomposition. However, a variety of relationships
may exist among the axioms (rules) in an ontology. Indeed, the conclusion of one rule may act
as a condition of other rules, and different rules may share common conditions. One can
imagine a process in which all possible decompositions of a set of axioms are carried out. In this
way we obtain a list of elementary query representations whose validity should be checked
against the concrete knowledge base in the retrieval process.
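The decomposition in the Institute example can be sketched as a small pattern matcher over relation instances. This is a simplified illustration, not the inference engine of any particular system: the relation names `worksIn` and `about`, the individuals, and the helper functions are all hypothetical.

```python
# Knowledge base KB(O): relation instances (facts). All names are hypothetical.
FACTS = {
    ("worksIn", "alice", "projectA"),
    ("worksIn", "bob", "projectB"),
    ("about", "projectA", "ontologies"),
    ("about", "projectB", "databases"),
}


def match(pattern, fact, binding):
    """Unify one pattern with one fact; variables start with '?'."""
    if pattern[0] != fact[0]:
        return None
    new_binding = dict(binding)
    for p, f in zip(pattern[1:], fact[1:]):
        if p.startswith("?"):
            # Bind the variable, or fail if it is already bound differently.
            if new_binding.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return new_binding


def solve(patterns, facts, binding=None):
    """Yield every variable binding that satisfies all patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    for fact in sorted(facts):
        extended = match(patterns[0], fact, binding)
        if extended is not None:
            yield from solve(patterns[1:], facts, extended)


def researchers_in_topic(topic, facts):
    """Axiom: worksIn(r, p) and about(p, topic) implies researchesIn(r, topic)."""
    sub_queries = [("worksIn", "?r", "?p"), ("about", "?p", topic)]
    return sorted({b["?r"] for b in solve(sub_queries, facts)})


print(researchers_in_topic("ontologies", FACTS))  # ['alice']
```

The single user query is answered by evaluating the two elementary sub-queries against the knowledge base and joining them on the shared project variable.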
A query returns a set of concept instances as its answer, but the relevance of these
answers is defined on the level of the relation instances. The reason is that a concept
instance is treated as an identifier of an object, whereas a relation instance represents a
property of that object whose relevance for the query can be determined.
5.Limitations of the Conventional Search Systems
Scalability is a Bottleneck
Popularity Ranking is limited
Second Search Problem
Scalability is a Bottleneck
Scalability of a search engine can be defined as the capability to respond to X
search requests per second using an index that covers Y Web pages. The
numbers X and Y for Google, for example, are somewhere around 3,000 and 8 billion,
respectively. Each time a query hits the search engine, results are retrieved from the entire
index via Boolean operations and statistical enhancement of the results. This method of search
requires Z servers, where Z is a large number (in the thousands) for Google. Most
search engines are set in a tight balance between X, Y, and Z. This tight balance restricts
performing any advanced analysis on-the-fly over the entire index. In addition, an index
system with such a tight balance has no room for semantically rich data.
Popularity Ranking is limited
To improve result relevancy beyond Boolean algebra, most search engines use
popularity scores to enrich indexed data. The simplest ranking method adds one data point
per record and does not severely disturb the scalability. Thus, results appear more relevant.
This seemingly harmless trade-off, however, has serious limits. Popularity algorithms based
on collecting votes (that is, link referrals) cannot collect enough votes to form an opinion
score for every possible query that can be asked against the pages indexed. One common
result is that less-popular topics are not ranked properly. A second common result concerns
newly emerged Web pages citing highly relevant information. Not around long enough to be
referred to by others, these pages get passed over.
Second Search Problem
An inherent limitation of the inverted index method is that for any given query,
search results point to the whole Web document as the destination. This means a second
search is needed: the search inside the document itself. In stark contrast, precision search
requires focus: pointing to a section of the text, or preferably to a sentence. Accordingly,
precision search requires meaning-based methods that analyze a paragraph or sentence as a
whole and display the results accordingly. With such a capability, the need for opening the
document (the second search) will be eliminated most of the time. Evolving from
"document-level" search to "sentence-level" search is one of the challenges that future search
engines must meet.
6.Why A Meaning‐Based Approach?
To understand why we need a meaning-based approach for more relevant search
results, consider the question: how can an indexing system be limited if it captures every word
on a Web page? Reflect on the following analogy. Assume that a Web page contains two circles
and two lines as shown in Fig 1. An indexing system will record that this page has two lines and
two circles. It can even capture the order in which these objects are lined up. What is missing
from this capturing process is the human interpretation of this figure, such as "In bird's-eye
view, the figure looks like someone with a large hat is riding a bicycle. Possibly Mexican or
Chinese, the person must be in motion." This is the "meaning" that an indexing system, by
definition, does not capture. As a result, search precision is severely compromised with the
indexing method. A meaning-based method, however, would rightfully bring up the shape in
Fig 1 as a possible answer to a query about riding bicycles. Similarly, let's consider a text
version of the same analogy. Suppose a Web page has the following sentence: "Pain killers help
treat mild headache".
Word Order ID
Pain 1 X
Killers 2 X
help 3 X
treat 4 X
mild 5 X
headache 6 X
An indexing system would capture all the words above, their order, and the document ID as
shown above. But, as in the previous example, there is more to it than a simple list. For
example: Pain killer is a two-word combination term which represents a class of drugs such
as Aspirin, Ibuprofen, Tylenol, and many others. Help is a modifying event that functions in
this particular context like "may, can, or possibly, as a facilitator." Treat is a medical event
related to heal, cure, or relieve. Mild is a modifier. Headache is a medical condition related to
the head and includes different forms such as migraine, stiff neck, dullness, etc. The description
above reflects the human thought process, which puts all these pieces together to infer the
meaning of the sentence and of its parts. For example, in this context the word "killer" is not a
murderer, "treat" is not a candy, and "help" is not a handout. Based on the interpretation
above, let's consider the following variations:
Aspirin may relieve mild pain in the head.
Tylenol can possibly help migraine.
Ibuprofen may heal headache.
Many equally meaningful variations and inferences can be added to this list. Therefore, a
mere collection of the words "Pain killers help treat mild headache" is only one particular
description of a larger set of information. This is not an exception; it applies to all
sentences in natural languages.
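A minimal sketch of this idea maps each sentence to the set of concepts it mentions, so that the variations collapse to the same representation. The hand-written `LEXICON` below is an assumption standing in for a real ontology, and the longest-phrase-first lookup is only one simple way to resolve multi-word terms.

```python
# Hand-written, illustrative mini-lexicon: surface phrase -> concept label.
# A real system would draw these equivalences from an ontology.
LEXICON = {
    "pain killers": "PAINKILLER", "painkillers": "PAINKILLER",
    "aspirin": "PAINKILLER", "tylenol": "PAINKILLER", "ibuprofen": "PAINKILLER",
    "treat": "TREAT", "relieve": "TREAT", "heal": "TREAT", "help": "TREAT",
    "headache": "HEADACHE", "migraine": "HEADACHE",
    "pain in the head": "HEADACHE",
}


def concepts(sentence: str) -> frozenset:
    """Map a sentence to the set of concepts it mentions (longest phrase first)."""
    text = " " + sentence.lower().rstrip(".") + " "
    found = set()
    for phrase in sorted(LEXICON, key=len, reverse=True):
        padded = " " + phrase + " "
        if padded in text:
            found.add(LEXICON[phrase])
            # Remove the matched phrase so its words are not matched again.
            text = text.replace(padded, " ")
    return frozenset(found)


SENTENCES = [
    "Pain killers help treat mild headache",
    "Aspirin may relieve mild pain in the head.",
    "Tylenol can possibly help migraine.",
    "Ibuprofen may heal headache.",
]

signatures = {concepts(s) for s in SENTENCES}
print(len(signatures))  # 1
```

All four sentences share one concept signature, whereas a word-level index would treat them as almost unrelated strings.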
Back of the envelope calculation
The sentence "Pain killers help treat mild headache" can be expressed in multiple
ways. Here are the numbers of equivalent forms for each part: Pain killers (20), help treat (5),
mild (3), headache (5). The total number of combinations is therefore 20 x 5 x 3 x 5 = 1500.
By asking "what pain killers help treat mild migraine?" we are actually asking for 1500
different expressions of the same or relevant information.
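The arithmetic can be checked in a couple of lines. The per-part counts are the ones quoted in the text; the small word lists used to enumerate actual variants are hypothetical and only show how the product arises.

```python
import math
from itertools import product

# Number of equivalent forms per part of the sentence, as quoted in the text.
EQUIVALENT_FORMS = {"pain killers": 20, "help treat": 5, "mild": 3, "headache": 5}

total = math.prod(EQUIVALENT_FORMS.values())
print(total)  # 1500

# The same counting on a tiny, made-up vocabulary: 2 x 2 x 2 = 8 variants.
subjects = ["aspirin", "ibuprofen"]
verbs = ["treats", "relieves"]
objects = ["headache", "migraine"]
variants = [" ".join(words) for words in product(subjects, verbs, objects)]
print(len(variants))  # 8
```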
Challenges In Ontology Technology
Isolate scalability of the retrieval process from data storage problems.
Invent a storage system that grows only by a new technology
Make Search and acquisition of new data fast
7. Case Study: hakia
Hakia is an ontology-based semantic Internet search engine. The company has
invented QDEXing technology, a new infrastructure that serves as an alternative to indexing
and uses the SemanticRank algorithm, a solution mix from the disciplines of ontological
semantics, fuzzy logic, computational linguistics, and mathematics. Founded in 2004, the
company is privately held and based in New York City. Hakia was founded by Riza Berkan, a
nuclear scientist by training with a specialization in artificial intelligence and fuzzy logic, and
Pentti Kouri, a New York-based economist and venture capitalist. Professor Victor Raskin, a
father of ontological semantics and noted international authority in the field of computational
linguistics, serves as hakia's scientific advisor.
The basic promise of the semantic search technology is to improve search
performance and to bring search experience as close as possible to the way humans interact in
real life. A search engine deploying semantic search technology must be able to understand text
and query using principles similar to that of the human brain. hakia's current market positioning is
centered around the capabilities of semantic technology that cannot be easily duplicated by the
conventional systems. For example, searching dynamic content (like news or emerging journal
articles), and searching credible databases represent a problem for popularity‐based search
engines due to lack of statistical sampling.
The search space is composed of documents, which are not restricted to any
particular repository. NLP techniques are applied to the documents to extract knowledge. Hakia
uses its own ontology to store the extracted knowledge and exploits it during the search
process.
The Semantics Behind Hakia
• Ontological Semantics(OntoSem)
• Query Detection And Extraction(QDEX)
• Semantic Rank
7.1. QDEX (Query Detection And Extraction)
The underlying principle of the QDEX method is described by two questions: Given
a sentence from a Web page, how many questions can be asked so that this sentence will be a
potential answer? And what are these question sequences? A basic premise is that each
question sequence found in this way is already equivalent to an intersection set in a
conventional index, and therefore represents an answer node. Collecting all cases of the same
sequence from other Web pages, then limiting the list to the best answer candidates via
competition criteria, completes the QDEX method. With this approach, all the questions that
can be asked by a user, and that would find answers in the analyzed Web pages, are already
extracted (anticipated) and prepared by the QDEX algorithm before the user even comes to
the search engine. This mosaic-like data structure is depicted in Fig. 8.
Fig.8.QDEX Method
Indexing and QDEX methods differ as shown in Fig. 9, where the index table is on the left
and the QDEX lists are on the right. In this example, the blue colored QDEX list corresponds to a
particular word combination (sequence) and the red one corresponds to another. How these word
combinations are formed is described in the next section.
The first important advantage of this approach is that the task of extracting a result from a
large index on the fly is eliminated, replaced by the task of accessing a single small list (file) that
includes results already prioritized off‐line. Thus, QDEX is a highly distributed system where
each distributed item has no functional dependence on the rest of the system at any given time.
This system results in superb flexibility of data storage and facilitates its scalability.
The second important advantage is the growth phenomenon. An inverted index grows
linearly with the introduction of every new Web page, forever. The QDEX method grows only
with the introduction of new knowledge (unique sequences). If the existing knowledge is repeated
in a newly acquired Web page, it is subject to competition based on credibility criteria. The
QDEX idea poses several challenges that were taken on by the hakia team.
Fig.9. Comparison of Indexing and QDEXing methods
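The difference in query-time work between the two methods can be sketched as follows. This is not hakia's implementation, whose details are not given in this report: the toy pages are made up, and adjacent word pairs are used as a stand-in for the meaning-based sequences that QDEX would actually store.

```python
from collections import defaultdict

# Toy corpus of (url, sentence) pairs; URLs and sentences are made up.
PAGES = [
    ("http://example.org/a", "norse mythology explains the polar aurora"),
    ("http://example.org/b", "norse mythology has many gods"),
    ("http://example.org/c", "the polar aurora is caused by solar wind"),
]

# Conventional inverted index: term -> set of URLs, intersected at query time.
inverted = defaultdict(set)
for url, sentence in PAGES:
    for term in sentence.split():
        inverted[term].add(url)


def index_lookup(query):
    """Intersect the posting sets of all query terms on the fly."""
    postings = [inverted.get(term, set()) for term in query.split()]
    return sorted(set.intersection(*postings)) if postings else []


# QDEX-like store: sequence -> small list prepared off-line.
# The "sequences" here are simply adjacent word pairs (an assumption).
qdex = defaultdict(list)
for url, sentence in PAGES:
    words = sentence.split()
    for first, second in zip(words, words[1:]):
        qdex[f"{first} {second}"].append(url)


def qdex_lookup(query):
    """Fetch one precomputed list; no intersection at query time."""
    return qdex.get(query, [])


print(index_lookup("norse mythology"))
# ['http://example.org/a', 'http://example.org/b']
print(qdex_lookup("norse mythology"))
# ['http://example.org/a', 'http://example.org/b']
```

Both lookups return the same pages here, but the second one reads a single small list, and that list only gains a new key when a page contributes a sequence that has not been seen before.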
Combinatory Explosion Problem
Let's assume a URL was crawled, its Web page was taken in, and each sentence
on that page was analyzed. Suppose we have the following sentence with 8 significant
words:
In Norse mythology the polar aurora represents the Ride of the Valkyries to War.
Looking at this sentence, a human might compose several queries:
aurora; mythology; Valkyries; Norse Mythology; polar aurora; What is aurora?; What is
Norse?; What is Norse Mythology?; What is polar aurora?; Who were the Valkyries?; In what
mythology is aurora represented?; Ride of the Valkyries to War; What does aurora represent in
mythology?; etc.
In fact, there are about 50 questions/queries that someone could ask against
this sentence. Eliminating the noise words from the list above and accounting for repetition
yields 35 sequences of words comprising around 120 total words. Compared to the 8 words
and their positions that an indexing system would store, we have a much larger storage
requirement here (about 8 times) for this particular example.
Each sequence in QDEX, however, will stand alone like an answer node and will
repeat in other places, so the total storage can shrink from 8 times to a smaller multiple.
For example, the sequence "Norse Mythology" will have its own competing list, which will
contain occurrences in other documents, so it will not increase the counter linearly.
Therefore, all the repeating sequences (the permutations of these words that are statistically
meaningful) will be retained in the system, and eventually the storage requirement will be
manageable. The problem is how to create the 35 sequences from the given sentence. The
combinatory explosion phenomenon starts here. Mathematically speaking, if we take 8 words
and try all permutations, the total number of sequences will be around 500,000. The question
is how to reduce 500,000 sequences to 35 meaningful ones, just as the human brain does. To
any linguist, it would be clear that employing grammar or syntactic algorithms would not
even get close to solving the problem. It has to be a meaning-based approach, either in its
full-scale deployment such as OntoSem, or in its approximated versions using fuzzy logic.
Breeding Sequences
Given a sentence such as the one above, the generation of sequences by a computer
algorithm that mimics human‐like sequence‐generation is called "breeding" at hakia.
Fig.10. QDEX Sequence Breeding versus the human brain
QDEX sequence breeding follows a process analogous to how the human brain learns. The
human brain understands the meaning of a text and stores a reasonable number of knowledge
bits, which we call “answer nodes.” The sequence breeder of hakia comes up with a
comparable number of meaningful sequences (answer nodes) without getting lost in the
permutation space.
To accomplish the first-stage approximation to breeding, hakia has deployed a
fuzzy logic solution. This approach uses a simple word identification method called
"bag-of-words" (BOW), and uses fuzzy rules [5] to model the basic principles of meaning.
hakia's current breeder can estimate up to 80% of the human-like sequences with an
overshoot rate between 7 and 15. During the QDEX process, the overshooting sequences that
are not very useful eventually drop out from the system if they do not repeat somewhere
else. Deletion of QDEX sequences resembles the "forgetting" process in the human brain,
theoretically happening during sleep via the dreaming process. Regardless of the validity of
this theory, we call this process "QDEX dreaming".
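The drop-out of non-repeating sequences can be pictured with a small sketch. hakia's actual breeding and deletion criteria are not described here, so everything below is an assumption for illustration: the function name `dream`, the `min_repeats` threshold, and the sample sequences are all hypothetical.

```python
from collections import Counter


def dream(sequence_counts: Counter, min_repeats: int = 2) -> Counter:
    """Keep only the sequences seen at least min_repeats times (assumed rule)."""
    return Counter(
        {seq: n for seq, n in sequence_counts.items() if n >= min_repeats}
    )


# Hypothetical bred sequences from two documents.
bred = Counter()
for doc_sequences in [
    ["norse mythology", "polar aurora", "mythology polar war"],
    ["norse mythology", "polar aurora"],
]:
    bred.update(doc_sequences)

kept = dream(bred)  # the overshooting sequence that never repeats is dropped
print(sorted(kept))
```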
The second stage of breeding involves OntoSem, and has two primary functions:
(1) validate the meaningfulness of QDEX sequences, (2) create equivalent forms of the same
sequences using different words.
Fig.11. Refinement of the QDEX sequences
The two primary functions, Breeding and Ontology, represent the critical difference between a
meaning‐based search engine and its index‐based cousins. Hakia’s intricate breeding process is a
trade secret.
QDEX Files
Each QDEX sequence, like "Norse Mythology," is stored in a file that contains the
paragraph ID of the originating source. Since each QDEX file represents an answer node, the
entries inside are limited to a reasonable number. If the analysis of a new document results in the
same QDEX sequence, then the paragraph ID is inserted in the QDEX file in a place based on
several "credibility" and "quality" criteria. These criteria are also hakia's trade secret. If the file is
full, then the lowest-position entry drops out to make room for another entry that is deemed to
be better. Consequently, you end up with a QDEX file for Norse Mythology that contains the best
set of paragraph IDs refined from millions of documents.
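The behaviour of a QDEX file described above (a bounded, ranked list of paragraph IDs where the lowest entry drops out) can be sketched as follows. This is only an illustration: the class name `QdexFile`, the capacity of 3, and the single numeric score standing in for the undisclosed "credibility" and "quality" criteria are all assumptions.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class QdexFile:
    sequence: str
    capacity: int = 3  # assumed limit; the real limit is not stated
    # Entries are (negated score, paragraph ID), so the best entry sorts first.
    entries: list = field(default_factory=list)

    def insert(self, paragraph_id: str, score: float) -> bool:
        """Insert by rank; return False if the new entry did not make the cut."""
        entry = (-score, paragraph_id)
        bisect.insort(self.entries, entry)
        if len(self.entries) > self.capacity:
            dropped = self.entries.pop()  # lowest-position entry drops out
            return dropped != entry
        return True

    def paragraph_ids(self) -> list:
        return [pid for _, pid in self.entries]


qdex = QdexFile("Norse Mythology", capacity=3)
for pid, score in [("p1", 0.4), ("p2", 0.9), ("p3", 0.6), ("p4", 0.7)]:
    qdex.insert(pid, score)
best = qdex.paragraph_ids()
print(best)
```

Keeping the list sorted at insertion time means that a read returns the best paragraph IDs without any further ranking work.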
QDEX Storage
QDEX’s unusual storage problem was solved by a unique approach. First, QDEX
files are named by alphabetizing the words of the QDEX sequence. Second, the files are
stored in a vast array of servers via hash+mode coding, which ensures even distribution and
immediate destination coding based on file name. Because the QDEX method is data I/O
intensive, a special three-way redundancy architecture was designed, called KUTS. The
KUTS system has highly distributed back‐end storage devices that constantly write new data
on disks. A group of storage devices are designated to serve only, and not to interfere with
the disk‐write process. At scheduled intervals, the back‐end is image‐copied to the service
storage devices.
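The naming and placement scheme can be sketched as below. It assumes that "hash+mode coding" means hashing the file name and taking the result modulo the number of servers; the SHA-256 hash, the underscore separator, lower-casing, and the server count of 16 are likewise assumptions, not details from hakia.

```python
import hashlib


def qdex_file_name(sequence: str) -> str:
    """Name a QDEX file by alphabetizing the lower-cased words of its sequence."""
    return "_".join(sorted(sequence.lower().split()))


def server_for(file_name: str, num_servers: int) -> int:
    """Map a file name to a server index: stable hash, then modulo (assumed)."""
    digest = hashlib.sha256(file_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


name = qdex_file_name("Norse Mythology")
same = qdex_file_name("mythology norse")  # word order does not change the name
server = server_for(name, num_servers=16)
print(name, server)
```

Because the server index is computed from the file name alone, a query can be routed to its destination without consulting a central lookup table.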
7.2. Hakia OntoSem
“Ontological Semantics (OntoSem) is a formal and comprehensive linguistic theory of
meaning in natural language. As such, it bears significantly on philosophy of language,
mathematical logic, and cognitive science. It is a rapidly growing area of intense academic
research and of active practical implementations, of which hakia.com is by far the leading
one. OntoSem offers an advanced methodology and technology for natural language
processing, the only one of its kind, so far, to access the full meaning of the text it handles. As
such, it is also a set of well-developed and constantly improving resources, including a
language-independent ontology of thousands of interrelated concepts; an ontology-based
English lexicon of 100,000 word senses, and counting (plus, the lexicons for several other
languages under construction); and an ontological parser which ‘translates’ every sentence of
the text into its text meaning representation, approximating the complete understanding of the
sentence by the native speaker.” “hakia OntoSem is a modular, extensible, and adaptable
toolset for government, business, education and research applications to enable developers to
use the meaning of language and not just text string matching for applications, including:
• information retrieval, analysis, and distribution
• text summarization
• information assurance and security
• machine translation
• ontology support
• terminology standardization
• supply chain automation”
7.3. Semantic Rank Algorithm
Having solved retrieval scalability via single file request, hakia adds a layer for
on-the-fly analysis called SemanticRank, which is an independent module. SemanticRank
analysis refines the results by locating the best sentences in the paragraph that match the
query. This process uses syntactic, ontological, and morphological solutions.
Fig.12. SemanticRank Algorithm
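The idea of picking the best-matching sentence from a retrieved paragraph can be illustrated with a deliberately simplified sketch. It uses plain word overlap as a stand-in for the syntactic, ontological, and morphological analysis mentioned above, so it is not the SemanticRank algorithm itself; the function name and the sample paragraph are assumptions.

```python
import re


def words(text: str) -> set:
    """Lower-cased set of word tokens in a text."""
    return set(re.findall(r"\w+", text.lower()))


def best_sentence(paragraph: str, query: str) -> str:
    """Return the sentence sharing the most words with the query (overlap only)."""
    query_words = words(query)
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return max(sentences, key=lambda s: len(query_words & words(s)))


paragraph = (
    "Auroras appear near the poles. "
    "In Norse mythology the polar aurora represents the Ride of the Valkyries to War."
)
selected = best_sentence(paragraph, "What does aurora represent in mythology?")
print(selected)
```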
A typical example is shown below. The query was "why did Enron collapse?"
It is immediately visible that a typical result coming from hakia is focused on a
sentence (highlighted) by an on-the-fly NLP process that selects the best sentence. The result is
maintained as one uninterrupted piece of text. The same query to Google brings results in the
following format:
In Google, there is no sentence selection due to lack of semantics, and the text
snippet is often interrupted by an ellipsis.
Among the many capabilities of the SemanticRank algorithm, a key feature is
query type detection. For example, if the query is a “why” question, then the SemanticRank
algorithm can formulate a special QDEX file request that adds sense information such as
"reason, cause". Accordingly, the question "why did Enron collapse?" is converted into a QDEX
file request that has the correct sense. Hakia's SemanticRank algorithm can identify more than 60
different question types, almost all the possible types in the English language. Question type (and
sense) detection also allows requesting more than a single QDEX file (fall-backs) to retrieve
more than one answer node, each of which is deemed applicable.
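Query type detection with fall-back requests might look like the sketch below. Only the "why" to "reason, cause" mapping comes from the text; the noise-word list, the function name `build_requests`, and the alphabetized underscore-joined file names are assumptions made for illustration.

```python
# Hypothetical sense table; only the "why" row is taken from the text.
QUESTION_SENSES = {
    "why": ["reason", "cause"],
}
# Assumed noise words to discard before building a request.
NOISE_WORDS = {"did", "do", "does", "is", "are", "was", "were", "the", "a", "an"}


def build_requests(query: str) -> list:
    """Return QDEX file names to request: sense-specific first, then a fall-back."""
    words = query.lower().rstrip("?").split()
    if not words:
        return []
    senses = QUESTION_SENSES.get(words[0], [])
    if senses:
        words = words[1:]  # the question word is replaced by its sense terms
    content = [w for w in words if w not in NOISE_WORDS]
    requests = ["_".join(sorted(content + [sense])) for sense in senses]
    requests.append("_".join(sorted(content)))  # plain fall-back without sense
    return requests


requests = build_requests("why did Enron collapse?")
print(requests)
```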
8. Conclusion
Ontology-based IR is a growing technology that relies on semantics. At present,
ontology development is not an open-source activity. Hakia intends to release the QDEX
system under an open-source agreement in the future. The objective is to encourage scientists and
developers to improve the QDEX method in their own way. Ontologies with semantics are
used in the development of semantic web technology. Though it faces many practical challenges,
it is expected to mature in the coming years.
9. References
[1]. David Vallet, Miriam Fernández, and Pablo Castells - An Ontology-Based Information
Retrieval Model - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.66.4633&rep
[2]. Natalya F. Noy and Deborah L. McGuinness - A Guide to Creating Your First
Ontology - http://protege.stanford.edu/publications/ontology_development/ontology101-noy-mcguinness.html
[3]. M.Sc. Nenad Stojanovic - Ontology-based Information Retrieval: Methods and Tools for
Cooperative Query Answering - White paper_ semantic _search_ technology.pdf