jhu-hlt-2004 © n.j. belkin 1
Information Retrieval: A Quick Overview
Nicholas J. Belkin, [email protected]
http://scils.rutgers.edu/~belkin/belkin.html
The IR Situation
• A person (the user) recognizes that her/his knowledge is inadequate for resolving some problem / achieving some goal (a problematic situation)
• In order to resolve the problematic situation, the user has recourse to some knowledge resource external to her/himself
The IR Situation (2)
• The user engages with the knowledge resource through some intermediary
• The three components, user, knowledge resource, intermediary, and their interactions with one another, together constitute the information retrieval system
IR Systems
• The goal of an IR system is that the user’s problematic situation is appropriately resolved
• This goal is accomplished by facilitating effective interaction of the user with appropriate information objects (elements of the knowledge resource)
Relevance
• An indicator, or measure, of the appropriateness of an information object to a user’s problematic situation
• Topical relevance - The information object is about the same topic as the problematic situation
• Situational relevance - The information object is useful in resolving the problematic situation
What IR Systems Try to Do
• Predict, on the basis of some information about the user, and information about the knowledge resource, what information objects are likely to be the most appropriate for the user to interact with, at any particular time
How IR Systems Try to Do This
• Represent the user’s information problem (the query)
• Represent (surrogate) and organize (classify) the contents of the knowledge resource
• Compare query to surrogates (predict relevance)
• Present results to the user for interaction/judgment
How IR Differs from DBMS
• No “right” answer
• Probabilistic (predictive), not deterministic
• Unstructured, or only partially structured information (e.g. text, images)
Why IR is Difficult
• People cannot specify what they don’t know (Anomalous State of Knowledge), so representation of information problem is inherently uncertain
• Information objects can be about many things, so representation of aboutness is inherently incomplete
Why IR is Difficult (2)
• Relevance is a relation between the person and the information object(s), and is dependent upon user’s interpretation, so prediction of relevance (or appropriateness) is inherently uncertain
Evaluation of IR Systems
• Traditional goal of IR is to retrieve all and only the relevant IOs (information objects) in response to a query
• All is measured by recall: the proportion of relevant IOs in the collection which are retrieved
• Only is measured by precision: the proportion of retrieved IOs which are relevant
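These two definitions can be made concrete with a small sketch; the document IDs below are made up for illustration:

```python
def recall_precision(relevant, retrieved):
    """Compute recall and precision for one query.

    relevant  -- set of IDs of all relevant IOs in the collection
    retrieved -- set of IDs of IOs the system returned
    """
    hits = relevant & retrieved                # relevant IOs that were retrieved
    recall = len(hits) / len(relevant)         # "all": fraction of relevant IOs retrieved
    precision = len(hits) / len(retrieved)     # "only": fraction of retrieved IOs that are relevant
    return recall, precision

# Hypothetical example: 4 relevant docs in the collection,
# the system returns 5 docs, of which 3 are relevant.
r, p = recall_precision({1, 2, 3, 4}, {2, 3, 4, 8, 9})
# r = 0.75, p = 0.6
```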
Other Functions of IR Systems
• IR is concerned not only with supporting “specified searching”
• People engage in many kinds of interactions with IR systems, e.g. “browsing”, “evaluating”, “comparing”, “extracting”
• People have many different IR-related tasks, e.g. question-answering, finding one or a few “good” IOs, constructing a “useful” portal
Other Evaluation Measures
• To evaluate IR support for different tasks, different measures are required
• Relevance may not be the only criterion according to which measures are constructed
• Support for different kinds of behaviors may require different kinds of measures
Evaluation of What?
• Effectiveness – recall, precision, accuracy of answer, “satisfaction”
• Usability – learnability, error rates
• Performance – time, cognitive effort
Evaluation Problems
• Realistic IR is interactive; traditional IR methods and measures are based on non-interactive situations
• Evaluating interactive IR requires human subjects; the normal mode of evaluation is comparison between two systems (no gold standard or benchmarks); cannot compare a subject’s searching on the same task in two systems
• Major tradeoffs between number of subjects and number of tasks; realism and control
A Traditional View of IR (you’ll see this again)

[Figure: flow diagram. The USER PROBLEM and the TEXTS each undergo REPRESENTATION, yielding a QUERY and document SURROGATES; query and surrogates meet in a COMPARISON, whose RESULTS are presented for the user’s JUDGMENT, leading either to MODIFICATION of the query (and another comparison) or to the END of the search.]
IR as Support for Interaction with Information

[Figure: three identical panels repeated along a Time axis, all situated within overall goals, environment, situation. In each panel the USER (goals, tasks, knowledge, problem, uses) engages in INTERACTION (judgment, use, search, interpretation, modification) with INFORMATION (type, medium, mode, level), supported by COMPARISON, REPRESENTATION, PRESENTATION, VISUALIZATION, and NAVIGATION.]
The User as the Central Actor in the IR System
• The goal of IR is to help the user resolve the problematic situation
• This is done by supporting interaction with appropriate IOs
• The user in the system is the only actor that can judge appropriateness
• The user’s interactions determine the type of support provided
Interaction as the Central Process of IR
• Accepting the user as the central actor implies accepting the user’s interactions with information as the central process
• All other IR processes can be interpreted as being in support of the user’s current (or future) interactions with information
• This suggests specific IR system design choices and problems
How Interaction Has Been Accounted For
• Relevance feedback
– Automatically moving the initial query toward the “ideal” query
– Term reweighting and query expansion
• Support for query modification
– Display of “good” and “bad” terms
– Thesauri
– Inter-document relations
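The query-modification idea can be sketched with a Rocchio-style update (a standard formulation, not specific to these slides): the query vector moves toward judged-relevant documents and away from non-relevant ones. The weights alpha, beta, gamma and the toy vectors are illustrative:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback; vectors are dicts mapping term -> weight."""
    new_q = {t: alpha * w for t, w in query.items()}
    for doc in relevant:                      # move toward relevant docs
        for t, w in doc.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w / len(relevant)
    for doc in nonrelevant:                   # move away from non-relevant docs
        for t, w in doc.items():
            new_q[t] = new_q.get(t, 0.0) - gamma * w / len(nonrelevant)
    # negative weights are conventionally dropped
    return {t: w for t, w in new_q.items() if w > 0}

# Hypothetical feedback: one relevant and one non-relevant document
q = rocchio({"jaguar": 1.0},
            relevant=[{"jaguar": 1.0, "cat": 1.0}],
            nonrelevant=[{"jaguar": 1.0, "car": 1.0}])
# "cat" enters the query (expansion); "car" is suppressed
```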
Personalization in IR
• Taking account of user goals, situation, context for
– tailoring the interaction
– tailoring the retrieval results
• TREC HARD track is a first attempt at evaluating use of context
IR Models
• Exact match models
– String matching
– Boolean
• Best match (partial match) models
– Vector space
– Probabilistic
– Logic (plausible inference)
– Language modeling
Exact Match IR
• Goal of EM IR is to retrieve the set of information objects which match the user’s query specification
• Assumptions of EM IR
– IOs are completely representable
– Information problems are specifiable
– Relevance is determinable and binary
Exact Match IR
• Retrieves IOs that contain specified string or Boolean combination of strings
• Supported by inverted file organization (or signatures)
• Enhanced by wild-cards, proximity searching
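A minimal sketch of the inverted-file organization mentioned above, supporting exact-match Boolean AND over a toy collection (documents and terms are hypothetical):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Exact match: return docs containing ALL query terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {1: "information retrieval systems",
        2: "database systems",
        3: "information systems design"}
idx = build_inverted_index(docs)
result = boolean_and(idx, "information", "systems")
# result == {1, 3}
```

Because the answer is a set, not a ranking, this also illustrates the “no inherent document ranking” disadvantage noted on the next slide.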
Exact Match IR
• Advantages
– Efficient
– Boolean queries capture some aspects of information problem structure
• Disadvantages
– Not effective
– Difficult to write effective queries
– No inherent document ranking
Best Match IR
• All types based on the assumption that IR is an uncertain process
• Models differ by what they ascribe the uncertainty to, and by how they respond to that uncertainty
Vector Space IR
• Words represent concepts or topics
• These can be construed as dimensions of a “concept space”
• IOs are about the topics represented by their words
• IOs can be represented as vectors in the concept space
• Queries can be specified and represented as are IOs
Vector Space IR
• Goal of IR is to present the user with IOs most similar to query, in order of similarity
• Similarity is defined as closeness in the concept (vector) space
• Uncertainty in IR is in the degree of match between IO and query, arises from uncertainty in representation of each
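Closeness in the vector space is conventionally measured by cosine similarity; a minimal sketch with made-up term-weight vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical query and document vectors (e.g. term frequencies)
query = {"retrieval": 1.0, "models": 1.0}
doc_a = {"retrieval": 2.0, "models": 1.0, "evaluation": 1.0}
doc_b = {"database": 3.0, "models": 1.0}

# Rank documents by similarity to the query, most similar first
ranked = sorted([("a", doc_a), ("b", doc_b)],
                key=lambda d: cosine_similarity(query, d[1]), reverse=True)
# doc_a ranks above doc_b for this query
```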
Vector Space Model
• Advantages
– Straightforward ranking
– Simple query formulation (bag of words)
– Intuitively appealing
– Effective
• Disadvantages
– Unstructured queries
– Effective calculations and parameters must be empirically determined
Probabilistic Model
• Uncertainty in IR arises from uncertainty in the relevance relationship, in the representation of the information problem, and in the representation of IOs
• Result of these uncertainties can be represented as probabilities of relevance of an IO to an information problem, given the available evidence
Probabilistic IR
• Goal of IR is to present to the user the IOs in order of their probability of relevance to the information problem (the Probability Ranking Principle)
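A minimal illustration of ranking by probability of relevance, in the classic binary-independence style: under the term-independence assumption, per-term log-odds of relevance simply add. The per-term weights below are made up, standing in for estimates from feedback or training data:

```python
# Hypothetical per-term log-odds of relevance (would normally be estimated)
term_log_odds = {"retrieval": 1.2, "probabilistic": 0.9, "the": 0.0}

def relevance_score(doc_terms):
    """Under term independence, log-odds of relevance add across matching terms."""
    return sum(term_log_odds.get(t, 0.0) for t in doc_terms)

docs = {"d1": {"probabilistic", "retrieval"},
        "d2": {"the", "retrieval"}}

# Present IOs in decreasing order of estimated relevance (PRP)
ranking = sorted(docs, key=lambda d: relevance_score(docs[d]), reverse=True)
# ranking puts d1 before d2
```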
Probabilistic IR
• Advantages
– Straightforward relevance ranking
– Simple query formulation
– Sound mathematical/theoretical model
– Effective
• Disadvantages
– Unrealistic assumptions (term independence)
– Probabilities difficult to estimate
Plausible Inference IR
• Uncertainty in IR arises from uncertainty in relevance relationship, uncertainty in representation of information problem, uncertainty in representation of IOs
• This implies that IR can be no more than a process of plausible inference of relevance of an IO to an information problem
Plausible Inference IR
• In logical implicature version, IO and information problem should be represented in a logical formalism which allows plausible inference
• In multiple sources of evidence version, as much evidence as possible about relationship between IO and information problem should be used to estimate probability of relevance (induction)
Plausible Inference IR
• In logic version, goal of IR is to present to the user those IOs from which the query is most plausibly inferred, in order of plausibility
• In sources of evidence version, goal of IR is to present to the user those IOs which are believed most likely to be relevant, in the order of strength of belief
Plausible Inference IR
• Advantages
– Relevance ranking
– Strong formalisms
– Structured queries possible
– Effective (multiple sources of evidence)
• Disadvantages
– Complex, difficult to implement
– Weights for evidence empirically determined
Language Modeling for IR
• Assumes that IOs and expressions of information problems are of the same type
• Uncertainty in IR is due to uncertainty in representations of IOs and information problems
• Goal is to present to the user IOs in order of the probability of the IO being generated by the language model of the information problem (or vice versa), or by the similarity of the language model of the IO to that of the information problem
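A sketch of the query-likelihood version: score each document by the probability that its smoothed unigram model generates the query. The collection statistics and the interpolation weight lam are illustrative (Jelinek-Mercer smoothing with the collection model):

```python
import math

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(query | doc LM), smoothed by interpolation with the collection model."""
    doc_len = sum(doc.values())
    coll_len = sum(collection.values())
    score = 0.0
    for term in query:
        p_doc = doc.get(term, 0) / doc_len        # maximum-likelihood doc estimate
        p_coll = collection.get(term, 0) / coll_len
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

# Hypothetical term counts
collection = {"information": 10, "retrieval": 5, "database": 5}
doc1 = {"information": 2, "retrieval": 2}
doc2 = {"database": 3, "information": 1}
q = ["information", "retrieval"]
# doc1 scores higher than doc2 for this query
```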
Language Modeling for IR
• Most common type is statistical unigram model, based on observed word frequencies, smoothed in various ways
• The Kullback-Leibler divergence (often called a “distance”, though it is not symmetric) measures the difference between two probability distributions:

KL({p_i}, {q_i}) = Σ_i p_i log₂(p_i / q_i)
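A small numeric sketch of the KL measure for two unigram distributions; the word probabilities are made up, with the document model smoothed by hand so that every query term has nonzero probability:

```python
import math

def kl_divergence(p, q):
    """KL({p_i},{q_i}) = sum_i p_i * log2(p_i / q_i); assumes q[t] > 0 wherever p[t] > 0."""
    return sum(pi * math.log2(pi / q[t]) for t, pi in p.items() if pi > 0)

# Hypothetical query and document language models over a tiny vocabulary
query_lm = {"retrieval": 0.5, "information": 0.5}
doc_lm   = {"retrieval": 0.4, "information": 0.4, "systems": 0.2}  # smoothed

d = kl_divergence(query_lm, doc_lm)  # smaller divergence -> better match
```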
Advantages of Language Modeling
• Attempts to do away with the concept of relevance
• Computationally tractable, intuitively appealing
Problems with Language Modeling
• Assumption of equivalence between IO and information problem representation is unrealistic
• Very simple models of language
• Choosing a method of smoothing is difficult, and in general, ad hoc
Problems in Best Match IR
• For most best match IR models to work well, queries should be long
– the bag-of-words approach depends upon many words in order to disambiguate meaning
• Reasons for retrieval and ranking are not easily understood
Overcoming Problems in Best Match IR
• Enhance short queries through query expansion based on pseudo-relevance feedback or other methods
• Default exact match searching for short queries
• Encourage longer queries/problem statements through interface design
Some Takeaway Messages
• IR supports a human activity
• IR is inherently interactive, and the IR system inevitably involves the user as the central actor
• Representation and comparison techniques for text-based IR seem to have plateaued
• Improved IR will come from improved support for all types of interactions with information, and especially with personalization
• Big research issue: how to represent and use situation and context