jhu-hlt-2004 © n.j. belkin 1

Information Retrieval: A Quick Overview

Nicholas J. Belkin

nick@belkin.rutgers.edu

http://scils.rutgers.edu/~belkin/belkin.html

The IR Situation

• A person (the user) recognizes that her/his knowledge is inadequate for resolving some problem / achieving some goal (a problematic situation)

• In order to resolve the problematic situation, the user has recourse to some knowledge resource external to her/himself

The IR Situation (2)

• The user engages with the knowledge resource through some intermediary

• The three components, user, knowledge resource, intermediary, and their interactions with one another, together constitute the information retrieval system

IR Systems

• The goal of an IR system is that the user’s problematic situation is appropriately resolved

• This goal is accomplished by facilitating effective interaction of the user with appropriate information objects (elements of the knowledge resource)

Relevance

• An indicator, or measure, of the appropriateness of an information object to a user’s problematic situation

• Topical relevance - The information object is about the same topic as the problematic situation

• Situational relevance - The information object is useful in resolving the problematic situation

What IR Systems Try to Do

• Predict, on the basis of some information about the user, and information about the knowledge resource, what information objects are likely to be the most appropriate for the user to interact with, at any particular time

How IR Systems Try to Do This

• Represent the user’s information problem (the query)

• Represent (surrogate) and organize (classify) the contents of the knowledge resource

• Compare query to surrogates (predict relevance)

• Present results to the user for interaction/judgment

How IR Differs from DBMS

• No “right” answer

• Probabilistic (predictive), not determinative

• Unstructured, or only partially structured information (e.g. text, images)

Why IR is Difficult

• People cannot specify what they don’t know (Anomalous State of Knowledge), so representation of information problem is inherently uncertain

• Information objects can be about many things, so representation of aboutness is inherently incomplete

Why IR is Difficult (2)

• Relevance is a relation between the person and the information object(s), and is dependent upon user’s interpretation, so prediction of relevance (or appropriateness) is inherently uncertain

Evaluation of IR Systems

• Traditional goal of IR is to retrieve all and only the relevant information objects (IOs) in response to a query

• All is measured by recall: the proportion of relevant IOs in the collection which are retrieved

• Only is measured by precision: the proportion of retrieved IOs which are relevant
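The two definitions above can be written down directly. The following is a minimal sketch, assuming each IO has a hashable identifier and that relevance judgments are available as a plain set; the function name `precision_recall` is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: identifiers of the IOs the system returned
    relevant:  identifiers of the IOs judged relevant in the collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant IOs that were retrieved
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data: 4 IOs retrieved, 3 relevant in the collection, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Retrieving more IOs can only keep recall the same or raise it, while precision typically falls, which is why the two measures are reported together.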

Other Functions of IR Systems

• IR is concerned not only with supporting “specified searching”

• People engage in many kinds of interactions with IR systems, e.g. “browsing”, “evaluating”, “comparing”, “extracting”

• People have many different IR-related tasks, e.g. question-answering, finding one or a few “good” IOs, constructing a “useful” portal

Other Evaluation Measures

• To evaluate IR support for different tasks, different measures are required

• Relevance may not be the only criterion according to which measures are constructed

• Support for different kinds of behaviors may require different kinds of measures

Evaluation of What?

• Effectiveness
  – recall, precision, accuracy of answer, “satisfaction”

• Usability
  – learnability, error rates

• Performance
  – time, cognitive effort

Evaluation Problems

• Realistic IR is interactive; traditional IR methods and measures are based on non-interactive situations

• Evaluating interactive IR requires human subjects; the normal mode of evaluation is comparison between two systems (no gold standard or benchmarks); cannot compare a subject’s searching on the same task in two systems

• Major tradeoffs between number of subjects and number of tasks; realism and control

A Traditional View of IR (you’ll see this again)

[Figure: flow diagram. Box labels: USER PROBLEM, TEXTS, REPRESENTATION (twice), QUERY, SURROGATES, COMPARISON, RESULTS, JUDGMENT, MODIFICATION, END]

IR as Support for Interaction with Information

[Figure: the same panel repeated three times along a Time axis, framed by “Overall goals, environment, situation”. Panel labels: USER (goals, tasks, knowledge, problem, uses); INTERACTION (judgment, use, search, interpretation, modification); INFORMATION (type, medium, mode, level); surrounding labels COMPARISON, REPRESENTATION, PRESENTATION, VISUALIZATION, NAVIGATION]

The User as the Central Actor in the IR System

• The goal of IR is to help the user resolve the problematic situation

• This is done by supporting interaction with appropriate IOs

• The user in the system is the only actor that can judge appropriateness

• The user’s interactions determine the type of support provided

Interaction as the Central Process of IR

• Accepting the user as the central actor implies accepting the user’s interactions with information as the central process

• All other IR processes can be interpreted as being in support of the user’s current (or future) interactions with information

• This suggests specific IR system design choices and problems

How Interaction Has Been Accounted For

• Relevance feedback
  – Automatically moving the initial query toward the “ideal” query
  – Term reweighting and query expansion

• Support for query modification
  – Display of “good” and “bad” terms
  – Thesauri
  – Inter-document relations
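One well-known way to move a query toward the “ideal” query is Rocchio-style feedback. The slides do not name a specific method, so the sketch below is an illustrative stand-in: the function name `rocchio` and the default values of `alpha`, `beta`, and `gamma` are assumptions, and queries and documents are assumed to be term-to-weight mappings.

```python
from collections import defaultdict


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted and expanded query (Rocchio-style feedback).

    query:       dict mapping term -> weight
    relevant:    list of dicts for documents the user judged relevant
    nonrelevant: list of dicts for documents the user judged non-relevant
    alpha, beta, gamma: illustrative mixing weights, not values from the slides
    """
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Pull the query toward the centroid of the relevant documents.
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    # Push the query away from the centroid of the non-relevant documents.
    for doc in nonrelevant:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(nonrelevant)
    # Terms whose weight ends up non-positive are dropped.
    return {t: w for t, w in new_query.items() if w > 0}


updated = rocchio(
    {"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "cat": 1.0}, {"jaguar": 1.0, "habitat": 1.0}],
    nonrelevant=[{"jaguar": 1.0, "car": 1.0}],
)
print(sorted(updated))
```

Existing terms are reweighted (term reweighting) and new terms such as "cat" enter the query (query expansion), which matches the two sub-points on the slide.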

Personalization in IR

• Taking account of user goals, situation, context for
  – tailoring the interaction
  – tailoring the retrieval results

• TREC HARD track is a first attempt at evaluating use of context

IR Models

• Exact match models
  – String matching
  – Boolean

• Best (partial) match models
  – Vector space
  – Probabilistic
  – Logic (plausible inference)
  – Language modeling

Exact Match IR

• Goal of EM IR is to retrieve the set of information objects which match the user’s query specification

• Assumptions of EM IR
  – IOs are completely representable
  – Information problems are specifiable
  – Relevance is determinable and binary

Exact Match IR

• Retrieves IOs that contain specified string or Boolean combination of strings

• Supported by inverted file organization (or signatures)

• Enhanced by wild-cards, proximity searching
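An inverted file maps each term to the set of IOs that contain it, so a Boolean query reduces to set operations. The following is a minimal sketch, assuming whitespace tokenization and lowercasing; the function names are hypothetical, and wild-cards and proximity operators are left out.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Ids of documents containing every term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, *terms):
    """Ids of documents containing at least one term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


docs = {
    1: "information retrieval systems rank documents",
    2: "database systems return exact answers",
    3: "boolean retrieval returns an unranked set",
}
index = build_inverted_index(docs)
print(boolean_and(index, "retrieval", "systems"))
print(boolean_or(index, "boolean", "database"))
```

The result is an unordered set: a document either matches or it does not, so there is nothing in the model itself to rank by.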

Exact Match IR

• Advantages
  – Efficient
  – Boolean queries capture some aspects of information problem structure

• Disadvantages
  – Not effective
  – Difficult to write effective queries
  – No inherent document ranking

Best Match IR

• All types based on the assumption that IR is an uncertain process

• Models differ by what they ascribe the uncertainty to, and by how they respond to that uncertainty

Vector Space IR

• Words represent concepts or topics

• These can be construed as dimensions of a “concept space”

• IOs are about the topics represented by their words

• IOs can be represented as vectors in the concept space

• Queries can be specified and represented as are IOs

Vector Space IR

• Goal of IR is to present the user with IOs most similar to query, in order of similarity

• Similarity is defined as closeness in the concept (vector) space

• Uncertainty in IR is in the degree of match between IO and query, arises from uncertainty in representation of each
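Closeness in the concept space is often measured by the cosine of the angle between two vectors. The sketch below assumes raw term frequencies as weights and whitespace tokenization; both are simplifications chosen for illustration (practical systems usually apply a weighting such as tf-idf), and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent a text as a sparse vector of raw term frequencies."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (similarity, doc_id) pairs, most similar first."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(text)), doc_id) for doc_id, text in docs.items()]
    return sorted(scored, reverse=True)


docs = {
    "a": "vector space model",
    "b": "boolean exact match",
    "c": "vector model of concept space space",
}
for score, doc_id in rank("vector space", docs):
    print(f"{doc_id}: {score:.3f}")
```

Because the query is represented the same way as the IOs, the same `tf_vector` function serves both, and the output is a ranked list rather than a set.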

Vector Space Model

• Advantages
  – Straightforward ranking
  – Simple query formulation (bag of words)
  – Intuitively appealing
  – Effective

• Disadvantages
  – Unstructured queries
  – Effective calculations and parameters must be empirically determined

Probabilistic Model

• Uncertainty in IR arises from uncertainty in the relevance relationship, in the representation of the information problem, and in the representation of IOs

• Result of these uncertainties can be represented as probabilities of relevance of an IO to an information problem, given the available evidence

Probabilistic IR

• Goal of IR is to present to the user the IOs in order of their probability of relevance to the information problem (the Probability Ranking Principle)

Probabilistic IR

• Advantages
  – Straightforward relevance ranking
  – Simple query formulation
  – Sound mathematical/theoretical model
  – Effective

• Disadvantages
  – Unrealistic assumptions (term independence)
  – Probabilities difficult to estimate
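The term-independence assumption is what lets a document's score be a simple sum of per-term weights. One standard instance is the binary independence model with Robertson/Sparck Jones weights; the slides do not name it, so the sketch below is an illustrative stand-in, with hypothetical function names and made-up collection statistics.

```python
import math


def rsj_weight(N, n, R=0, r=0):
    """Robertson/Sparck Jones term weight with 0.5 smoothing.

    N: documents in the collection
    n: documents containing the term
    R: known relevant documents
    r: known relevant documents containing the term
    With R = r = 0 (no relevance information) the weight is idf-like.
    """
    odds_relevant = (r + 0.5) / (R - r + 0.5)
    odds_nonrelevant = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(odds_relevant / odds_nonrelevant)


def bim_score(query_terms, doc_terms, doc_freq, N):
    """Sum the weights of query terms present in the document.

    Summing is only justified under the term-independence assumption.
    """
    return sum(
        rsj_weight(N, doc_freq[t])
        for t in set(query_terms)
        if t in doc_terms and t in doc_freq
    )


# Toy statistics, invented for illustration only.
N = 100
doc_freq = {"retrieval": 40, "probabilistic": 5}
docs = {
    "d1": {"probabilistic", "retrieval"},
    "d2": {"retrieval"},
    "d3": {"boolean"},
}
query = ["probabilistic", "retrieval"]
ranking = sorted(docs, key=lambda d: bim_score(query, docs[d], doc_freq, N), reverse=True)
print(ranking)
```

Sorting by this score orders the IOs by estimated probability of relevance, which is the ranking the Probability Ranking Principle asks for; the hard part in practice is estimating `R` and `r`.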

Plausible Inference IR

• Uncertainty in IR arises from uncertainty in relevance relationship, uncertainty in representation of information problem, uncertainty in representation of IOs

• This implies that IR can be no more than a process of plausible inference of relevance of an IO to an information problem

Plausible Inference IR

• In logical implicature version, IO and information problem should be represented in a logical formalism which allows plausible inference

• In multiple sources of evidence version, as much evidence as possible about relationship between IO and information problem should be used to estimate probability of relevance (induction)

Plausible Inference IR

• In logic version, goal of IR is to present to the user those IOs from which the query is most plausibly inferred, in order of plausibility

• In sources of evidence version, goal of IR is to present to the user those IOs which are believed most likely to be relevant, in the order of strength of belief

Plausible Inference IR

• Advantages
  – Relevance ranking
  – Strong formalisms
  – Structured queries possible
  – Effective (multiple sources of evidence)

• Disadvantages
  – Complex, difficult to implement
  – Weight for evidence empirically determined

Language Modeling for IR

• Assumes that IOs and expressions of information problems are of the same type

• Uncertainty in IR is due to uncertainty in representations of IOs and information problems

• Goal is to present to the user IOs in order of the probability of the IO being generated by the language model of the information problem (or vice versa), or by the similarity of the language model of the IO to that of the information problem

Language Modeling for IR

• Most common type is statistical unigram model, based on observed word frequencies, smoothed in various ways

• The Kullback-Leibler divergence (often loosely called a distance) measures the difference between two probability distributions; it is not symmetric

$$\mathrm{KL}(\{p_i\},\{q_i\}) = \sum_i p_i \log_2 \frac{p_i}{q_i}$$
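To make the unigram model and the KL comparison concrete, here is a minimal sketch. It assumes additive (add-k) smoothing over a fixed shared vocabulary, which is only one of the many smoothing choices the slide alludes to; the function names and the parameter `k` are hypothetical.

```python
import math
from collections import Counter


def unigram_model(text, vocabulary, k=1.0):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary) + k * len(vocabulary)
    return {w: (counts[w] + k) / total for w in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q[w] > 0 wherever p[w] > 0."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)


docs = {
    "d1": "language model for retrieval",
    "d2": "boolean exact match retrieval",
}
query = "language model"

# Shared vocabulary, so every distribution is defined over the same events.
vocabulary = sorted(set(" ".join(list(docs.values()) + [query]).lower().split()))
query_model = unigram_model(query, vocabulary)

# Smaller divergence from the query model means a better match.
ranking = sorted(
    docs,
    key=lambda d: kl_divergence(query_model, unigram_model(docs[d], vocabulary)),
)
print(ranking)
```

Smoothing is what keeps every probability non-zero, so the logarithm is always defined; without it, a single query word missing from an IO would make the divergence infinite.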

Advantages of Language Modeling

• Attempts to do away with the concept of relevance

• Computationally tractable, intuitively appealing

Problems with Language Modeling

• Assumption of equivalence between IO and information problem representation is unrealistic

• Very simple models of language

• Choosing a method of smoothing is difficult, and in general, ad hoc

Problems in Best Match IR

• For most best match IR models to work well, queries should be long
  – bag of words approach depends upon many words in order to disambiguate meaning

• Reasons for retrieval and ranking are not easily understood

Overcoming Problems in Best Match IR

• Enhance short queries through query expansion based on pseudo-relevance feedback or other methods

• Default exact match searching for short queries

• Encourage longer queries/problem statements through interface design
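Pseudo-relevance feedback treats the top-ranked IOs from an initial search as if they were relevant and borrows terms from them. The sketch below assumes the simplest possible selection rule, raw frequency in the top documents; the function name `expand_query` and its parameters are hypothetical, and real systems use more careful term scoring.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=2, n_terms=2, stopwords=frozenset()):
    """Append the n_terms most frequent new terms from the top_k ranked documents.

    query_terms: list of lowercased query terms
    ranked_docs: document texts, best match first (from an initial retrieval run)
    """
    counts = Counter()
    for text in ranked_docs[:top_k]:
        counts.update(
            t for t in text.lower().split()
            if t not in stopwords and t not in query_terms
        )
    new_terms = [t for t, _ in counts.most_common(n_terms)]
    return list(query_terms) + new_terms


ranked_docs = [
    "jaguar cat habitat rainforest cat",
    "jaguar cat prey habitat",
    "jaguar car dealer",
]
print(expand_query(["jaguar"], ranked_docs))
```

No user judgments are needed, which is the appeal for short queries; the risk is that if the top-ranked IOs are off-topic, the added terms pull the query further away.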

Some Takeaway Messages

• IR supports a human activity

• IR is inherently interactive, and the IR system inevitably involves the user as the central actor

• Representation and comparison techniques for text-based IR seem to have plateaued

• Improved IR will come from improved support for all types of interactions with information, and especially with personalization

• Big research issue: how to represent and use situation and context