jhu-hlt-2004 © n.j. belkin 1
Information Retrieval: A Quick Overview
Nicholas J. Belkin, [email protected]
http://scils.rutgers.edu/~belkin/belkin.html
The IR Situation
• A person (the user) recognizes that her/his knowledge is inadequate for resolving some problem / achieving some goal (a problematic situation)
• In order to resolve the problematic situation, the user has recourse to some knowledge resource external to her/himself
The IR Situation (2)
• The user engages with the knowledge resource through some intermediary
• The three components, user, knowledge resource, intermediary, and their interactions with one another, together constitute the information retrieval system
IR Systems
• The goal of an IR system is that the user’s problematic situation is appropriately resolved
• This goal is accomplished by facilitating effective interaction of the user with appropriate information objects (elements of the knowledge resource)
Relevance
• An indicator, or measure, of the appropriateness of an information object to a user’s problematic situation
• Topical relevance - The information object is about the same topic as the problematic situation
• Situational relevance - The information object is useful in resolving the problematic situation
What IR Systems Try to Do
• Predict, on the basis of some information about the user, and information about the knowledge resource, what information objects are likely to be the most appropriate for the user to interact with, at any particular time
How IR Systems Try to Do This
• Represent the user’s information problem (the query)
• Represent (surrogate) and organize (classify) the contents of the knowledge resource
• Compare query to surrogates (predict relevance)
• Present results to the user for interaction/judgment
How IR Differs from DBMS
• No “right” answer
• Probabilistic (predictive), not deterministic
• Unstructured, or only partially structured information (e.g. text, images)
Why IR is Difficult
• People cannot specify what they don’t know (Anomalous State of Knowledge), so representation of information problem is inherently uncertain
• Information objects can be about many things, so representation of aboutness is inherently incomplete
Why IR is Difficult (2)
• Relevance is a relation between the person and the information object(s), and is dependent upon user’s interpretation, so prediction of relevance (or appropriateness) is inherently uncertain
Evaluation of IR Systems
• Traditional goal of IR is to retrieve all and only the relevant IOs (information objects) in response to a query
• All is measured by recall: the proportion of relevant IOs in the collection which are retrieved
• Only is measured by precision: the proportion of retrieved IOs which are relevant
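These two definitions can be made concrete with a small sketch; the document IDs below are made up for illustration:

```python
def recall_precision(relevant, retrieved):
    """Compute recall and precision for one query.

    relevant  -- set of IDs of all relevant IOs in the collection
    retrieved -- set of IDs of IOs the system returned
    """
    hits = relevant & retrieved                # relevant IOs that were retrieved
    recall = len(hits) / len(relevant)         # "all": fraction of relevant IOs retrieved
    precision = len(hits) / len(retrieved)     # "only": fraction of retrieved IOs that are relevant
    return recall, precision

# Hypothetical example: 4 relevant docs in the collection,
# the system returns 5 docs, of which 3 are relevant.
r, p = recall_precision({1, 2, 3, 4}, {2, 3, 4, 8, 9})
# r = 0.75, p = 0.6
```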
Other Functions of IR Systems
• IR is concerned not only with supporting “specified searching”
• People engage in many kinds of interactions with IR systems, e.g. “browsing”, “evaluating”, “comparing”, “extracting”
• People have many different IR-related tasks, e.g. question-answering, finding one or a few “good” IOs, constructing a “useful” portal
Other Evaluation Measures
• To evaluate IR support for different tasks, different measures are required
• Relevance may not be the only criterion according to which measures are constructed
• Support for different kinds of behaviors may require different kinds of measures
Evaluation of What?
• Effectiveness – recall, precision, accuracy of answer, “satisfaction”
• Usability – learnability, error rates
• Performance – time, cognitive effort
Evaluation Problems
• Realistic IR is interactive; traditional IR methods and measures are based on non-interactive situations
• Evaluating interactive IR requires human subjects; the normal mode of evaluation is comparison between two systems (no gold standard or benchmarks); cannot compare a subject’s searching on the same task in two systems
• Major tradeoffs between number of subjects and number of tasks; realism and control
A Traditional View of IR (you’ll see this again)

[Figure: flow diagram. The USER PROBLEM and the TEXTS each undergo REPRESENTATION, yielding a QUERY and document SURROGATES; query and surrogates meet in a COMPARISON, whose RESULTS are presented for the user’s JUDGMENT, leading either to MODIFICATION of the query (and another comparison) or to the END of the search.]
IR as Support for Interaction with Information

[Figure: three identical panels repeated along a Time axis, all situated within overall goals, environment, situation. In each panel the USER (goals, tasks, knowledge, problem, uses) engages in INTERACTION (judgment, use, search, interpretation, modification) with INFORMATION (type, medium, mode, level), supported by COMPARISON, REPRESENTATION, PRESENTATION, VISUALIZATION, and NAVIGATION.]
The User as the Central Actor in the IR System
• The goal of IR is to help the user resolve the problematic situation
• This is done by supporting interaction with appropriate IOs
• The user in the system is the only actor that can judge appropriateness
• The user’s interactions determine the type of support provided
Interaction as the Central Process of IR
• Accepting the user as the central actor implies accepting the user’s interactions with information as the central process
• All other IR processes can be interpreted as being in support of the user’s current (or future) interactions with information
• This suggests specific IR system design choices and problems
How Interaction Has Been Accounted For
• Relevance feedback
– Automatically moving the initial query toward the “ideal” query
– Term reweighting and query expansion
• Support for query modification
– Display of “good” and “bad” terms
– Thesauri
– Inter-document relations
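The query-modification idea can be sketched with a Rocchio-style update (a standard formulation, not specific to these slides): the query vector moves toward judged-relevant documents and away from non-relevant ones. The weights alpha, beta, gamma and the toy vectors are illustrative:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback; vectors are dicts mapping term -> weight."""
    new_q = {t: alpha * w for t, w in query.items()}
    for doc in relevant:                      # move toward relevant docs
        for t, w in doc.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w / len(relevant)
    for doc in nonrelevant:                   # move away from non-relevant docs
        for t, w in doc.items():
            new_q[t] = new_q.get(t, 0.0) - gamma * w / len(nonrelevant)
    # negative weights are conventionally dropped
    return {t: w for t, w in new_q.items() if w > 0}

# Hypothetical feedback: one relevant and one non-relevant document
q = rocchio({"jaguar": 1.0},
            relevant=[{"jaguar": 1.0, "cat": 1.0}],
            nonrelevant=[{"jaguar": 1.0, "car": 1.0}])
# "cat" enters the query (expansion); "car" is suppressed
```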
Personalization in IR
• Taking account of user goals, situation, context for
– tailoring the interaction
– tailoring the retrieval results
• TREC HARD track is a first attempt at evaluating use of context
IR Models
• Exact match models
– String matching
– Boolean
• Best match (partial match) models
– Vector space
– Probabilistic
– Logic (plausible inference)
– Language modeling
Exact Match IR
• Goal of EM IR is to retrieve the set of information objects which match the user’s query specification
• Assumptions of EM IR
– IOs are completely representable
– Information problems are specifiable
– Relevance is determinable and binary
Exact Match IR
• Retrieves IOs that contain specified string or Boolean combination of strings
• Supported by inverted file organization (or signatures)
• Enhanced by wild-cards, proximity searching
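A minimal sketch of the inverted-file organization mentioned above, supporting exact-match Boolean AND over a toy collection (documents and terms are hypothetical):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Exact match: return docs containing ALL query terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {1: "information retrieval systems",
        2: "database systems",
        3: "information systems design"}
idx = build_inverted_index(docs)
result = boolean_and(idx, "information", "systems")
# result == {1, 3}
```

Because the answer is a set, not a ranking, this also illustrates the “no inherent document ranking” disadvantage noted on the next slide.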
Exact Match IR
• Advantages
– Efficient
– Boolean queries capture some aspects of information problem structure
• Disadvantages
– Not effective
– Difficult to write effective queries
– No inherent document ranking
Best Match IR
• All types based on the assumption that IR is an uncertain process
• Models differ by what they ascribe the uncertainty to, and by how they respond to that uncertainty
Vector Space IR
• Words represent concepts or topics
• These can be construed as dimensions of a “concept space”
• IOs are about the topics represented by their words
• IOs can be represented as vectors in the concept space
• Queries can be specified and represented as are IOs
Vector Space IR
• Goal of IR is to present the user with IOs most similar to query, in order of similarity
• Similarity is defined as closeness in the concept (vector) space
• Uncertainty in IR is in the degree of match between IO and query, arises from uncertainty in representation of each
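Closeness in the vector space is conventionally measured by cosine similarity; a minimal sketch with made-up term-weight vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical query and document vectors (e.g. term frequencies)
query = {"retrieval": 1.0, "models": 1.0}
doc_a = {"retrieval": 2.0, "models": 1.0, "evaluation": 1.0}
doc_b = {"database": 3.0, "models": 1.0}

# Rank documents by similarity to the query, most similar first
ranked = sorted([("a", doc_a), ("b", doc_b)],
                key=lambda d: cosine_similarity(query, d[1]), reverse=True)
# doc_a ranks above doc_b for this query
```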
Vector Space Model
• Advantages
– Straightforward ranking
– Simple query formulation (bag of words)
– Intuitively appealing
– Effective
• Disadvantages
– Unstructured queries
– Effective calculations and parameters must be empirically determined
Probabilistic Model
• Uncertainty in IR arises from uncertainty in the relevance relationship, in the representation of the information problem, and in the representation of IOs
• Result of these uncertainties can be represented as probabilities of relevance of an IO to an information problem, given the available evidence
Probabilistic IR
• Goal of IR is to present to the user the IOs in order of their probability of relevance to the information problem (the Probability Ranking Principle)
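A minimal illustration of ranking by probability of relevance, in the classic binary-independence style: under the term-independence assumption, per-term log-odds of relevance simply add. The per-term weights below are made up, standing in for estimates from feedback or training data:

```python
# Hypothetical per-term log-odds of relevance (would normally be estimated)
term_log_odds = {"retrieval": 1.2, "probabilistic": 0.9, "the": 0.0}

def relevance_score(doc_terms):
    """Under term independence, log-odds of relevance add across matching terms."""
    return sum(term_log_odds.get(t, 0.0) for t in doc_terms)

docs = {"d1": {"probabilistic", "retrieval"},
        "d2": {"the", "retrieval"}}

# Present IOs in decreasing order of estimated relevance (PRP)
ranking = sorted(docs, key=lambda d: relevance_score(docs[d]), reverse=True)
# ranking puts d1 before d2
```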
Probabilistic IR
• Advantages
– Straightforward relevance ranking
– Simple query formulation
– Sound mathematical/theoretical model
– Effective
• Disadvantages
– Unrealistic assumptions (term independence)
– Probabilities difficult to estimate
Plausible Inference IR
• Uncertainty in IR arises from uncertainty in relevance relationship, uncertainty in representation of information problem, uncertainty in representation of IOs
• This implies that IR can be no more than a process of plausible inference of relevance of an IO to an information problem
Plausible Inference IR
• In logical implicature version, IO and information problem should be represented in a logical formalism which allows plausible inference
• In multiple sources of evidence version, as much evidence as possible about relationship between IO and information problem should be used to estimate probability of relevance (induction)
Plausible Inference IR
• In logic version, goal of IR is to present to the user those IOs from which the query is most plausibly inferred, in order of plausibility
• In sources of evidence version, goal of IR is to present to the user those IOs which are believed most likely to be relevant, in the order of strength of belief
Plausible Inference IR
• Advantages
– Relevance ranking
– Strong formalisms
– Structured queries possible
– Effective (multiple sources of evidence)
• Disadvantages
– Complex, difficult to implement
– Weights for evidence empirically determined
Language Modeling for IR
• Assumes that IOs and expressions of information problems are of the same type
• Uncertainty in IR is due to uncertainty in representations of IOs and information problems
• Goal is to present to the user IOs in order of the probability of the IO being generated by the language model of the information problem (or vice versa), or by the similarity of the language model of the IO to that of the information problem
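A sketch of the query-likelihood version: score each document by the probability that its smoothed unigram model generates the query. The collection statistics and the interpolation weight lam are illustrative (Jelinek-Mercer smoothing with the collection model):

```python
import math

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(query | doc LM), smoothed by interpolation with the collection model."""
    doc_len = sum(doc.values())
    coll_len = sum(collection.values())
    score = 0.0
    for term in query:
        p_doc = doc.get(term, 0) / doc_len        # maximum-likelihood doc estimate
        p_coll = collection.get(term, 0) / coll_len
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

# Hypothetical term counts
collection = {"information": 10, "retrieval": 5, "database": 5}
doc1 = {"information": 2, "retrieval": 2}
doc2 = {"database": 3, "information": 1}
q = ["information", "retrieval"]
# doc1 scores higher than doc2 for this query
```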
Language Modeling for IR
• Most common type is statistical unigram model, based on observed word frequencies, smoothed in various ways
• The Kullback-Leibler divergence (often called a “distance”, though it is not symmetric) measures the difference between two probability distributions:

KL({p_i}, {q_i}) = Σ_i p_i log₂(p_i / q_i)
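A small numeric sketch of the KL measure for two unigram distributions; the word probabilities are made up, with the document model smoothed by hand so that every query term has nonzero probability:

```python
import math

def kl_divergence(p, q):
    """KL({p_i},{q_i}) = sum_i p_i * log2(p_i / q_i); assumes q[t] > 0 wherever p[t] > 0."""
    return sum(pi * math.log2(pi / q[t]) for t, pi in p.items() if pi > 0)

# Hypothetical query and document language models over a tiny vocabulary
query_lm = {"retrieval": 0.5, "information": 0.5}
doc_lm   = {"retrieval": 0.4, "information": 0.4, "systems": 0.2}  # smoothed

d = kl_divergence(query_lm, doc_lm)  # smaller divergence -> better match
```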
Advantages of Language Modeling
• Attempts to do away with the concept of relevance
• Computationally tractable, intuitively appealing
Problems with Language Modeling
• Assumption of equivalence between IO and information problem representation is unrealistic
• Very simple models of language
• Choosing a method of smoothing is difficult, and in general, ad hoc
Problems in Best Match IR
• For most best match IR models to work well, queries should be long
– the bag-of-words approach depends upon many words in order to disambiguate meaning
• Reasons for retrieval and ranking are not easily understood
Overcoming Problems in Best Match IR
• Enhance short queries through query expansion based on pseudo-relevance feedback or other methods
• Default exact match searching for short queries
• Encourage longer queries/problem statements through interface design
Some Takeaway Messages
• IR supports a human activity
• IR is inherently interactive, and the IR system inevitably involves the user as the central actor
• Representation and comparison techniques for text-based IR seem to have plateaued
• Improved IR will come from improved support for all types of interactions with information, and especially with personalization
• Big research issue: how to represent and use situation and context