
Upload: khuong

Post on 24-Jan-2016


Page 1

Information Retrieval and Vector Space Model

Presented by Jun Miao

York University


Page 3

What is Information Retrieval?

[Image not preserved] = IR?

IR: retrieve information that is relevant to your need
• Search engine
• Question answering
• Information extraction
• Information filtering
• Information recommendation

Page 4

In the old days…

• The term "information retrieval" may have been coined by Calvin Mooers.
• Early IR applications were used in libraries.
• Set-based retrieval: the system partitions the corpus into two subsets of documents, those it considers relevant to the search query and those it does not.

Page 5

Nowadays

• Ranked retrieval: the system responds to a search query by ranking all documents in the corpus based on its estimate of their relevance to the query.
◦ A free-form query expresses the user’s information need.
◦ Documents are ranked by decreasing likelihood of relevance.
◦ Many studies have found it superior to set-based retrieval.
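The difference between set-based and ranked retrieval can be sketched in a few lines. This is only an illustration: the tiny `docs` collection and the toy relevance estimate (the number of distinct query terms a document contains) are assumptions made for the example, not something taken from the slides.

```python
def set_based(docs, query_terms):
    """Partition the corpus: return only the docs that contain every query term."""
    q = set(query_terms)
    return {doc_id for doc_id, terms in docs.items() if q <= set(terms)}


def ranked(docs, query_terms):
    """Order ALL docs by a toy relevance estimate: distinct query terms matched."""
    q = set(query_terms)
    scores = {doc_id: len(q & set(terms)) for doc_id, terms in docs.items()}
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical three-document corpus, already tokenized.
docs = {
    "d1": ["gold", "shipment", "fire"],
    "d2": ["silver", "truck", "delivery"],
    "d3": ["gold", "truck", "shipment"],
}

print(set_based(docs, ["gold", "truck"]))  # {'d3'}
print(ranked(docs, ["gold", "truck"]))     # ['d3', 'd1', 'd2']
```

The set-based call returns a yes/no partition, while the ranked call returns every document in order of estimated relevance.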

Page 6

An Information Retrieval Process (borrowed from Prof. Nie’s slides)

[Figure: diagram with the labels "Info. need", "Query", "IR system", "Retrieval", "Document collection", and "Answer list"]

Page 7

Inside an IR system

[Figure not preserved]

Page 8

Indexing Document

[Figure not preserved]

Page 9

Lexical Analysis

• What counts as a word or token in the indexing scheme?
• A big topic

Page 10

Stop List

• Function words do not bear useful information for IR: of, not, to, or, in, about, with, I, be, …
• Stop list: contains stop words, which are not used as index terms
◦ Prepositions
◦ Articles
◦ Pronouns
◦ Some adverbs and adjectives
◦ Some frequent words (e.g. document)
• The removal of stop words usually improves IR effectiveness.
• A few “standard” stop lists are commonly used.
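Stop-word removal can be sketched as a simple filter. The `STOP_WORDS` set below contains only the example words listed on the slide; a real system would load one of the standard lists, and the sample sentence is made up for illustration.

```python
# Only the example words from the slide; NOT a complete stop list.
STOP_WORDS = {"of", "not", "to", "or", "in", "about", "with", "i", "be"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "information about retrieval of documents in libraries".split()
print(remove_stop_words(tokens))
# ['information', 'retrieval', 'documents', 'libraries']
```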

Page 11

Stemming

• Reason:
◦ Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them
• Stemming:
◦ Removing some endings of words

dancer, dancers, dance, danced, dancing → dance

Page 12

Stemming (Cont’d)

Two main methods:

• Linguistic/dictionary-based stemming
◦ high stemming accuracy
◦ high implementation and processing costs and higher coverage
• Porter-style stemming
◦ lower stemming accuracy
◦ lower implementation and processing costs and lower coverage
◦ usually sufficient for IR
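The idea of rule-based suffix stripping can be illustrated with a deliberately crude sketch. This is not the Porter algorithm: the suffix list, its order, and the minimum stem length are assumptions chosen only so that the word forms from the earlier stemming example collapse to one stem. Like many rule-based stemmers, it produces a truncated stem ("danc") rather than a dictionary word.

```python
# Hypothetical suffix list, checked in this order (longer suffixes first).
SUFFIXES = ("ers", "ing", "er", "ed", "es", "e", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied


for w in ["dancer", "dancers", "dance", "danced", "dancing"]:
    print(w, "->", toy_stem(w))  # every form maps to 'danc'
```

A rule list this small makes many mistakes on real text, which is the accuracy trade-off the slide attributes to the cheaper, rule-based approach.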

Page 13

Flat file indexing

Each document is represented by a set of weighted keywords (terms):

D1 → {(t1, w1), (t2, w2), …}

e.g.
D1 → {(comput, 0.2), (architect, 0.3), …}
D2 → {(comput, 0.1), (network, 0.5), …}

Page 14

Inverted Index

[Figure not preserved]
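An inverted index maps each term to the documents that contain it, which is the reverse of the flat-file layout. A minimal sketch, assuming the flat file is held as a dictionary of per-document term weights (the function name `invert` and the data layout are illustrative choices, with the weights taken from the flat file example):

```python
from collections import defaultdict


def invert(flat_index):
    """Turn {doc: {term: weight}} into {term: [(doc, weight), ...]}."""
    inverted = defaultdict(list)
    for doc_id, weighted_terms in flat_index.items():
        for term, weight in weighted_terms.items():
            inverted[term].append((doc_id, weight))
    return dict(inverted)


flat_index = {
    "D1": {"comput": 0.2, "architect": 0.3},
    "D2": {"comput": 0.1, "network": 0.5},
}
inverted = invert(flat_index)
print(inverted["comput"])  # [('D1', 0.2), ('D2', 0.1)]
```

With this layout, answering a query only requires reading the posting lists of the query terms instead of scanning every document.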

Page 15

Query Analysis

• Parse query
• Clean stop words
• Stemming
• Get terms
• Adjacent operations: connect related terms together

Page 16

Models

Matching score model
◦ Document D = a set of weighted keywords
◦ Query Q = a set of non-weighted keywords
◦ $R(D, Q) = \sum_i w(t_i, D)$, where $t_i$ is in Q.
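The matching score is just a sum of the document's weights over the query terms. A small sketch, assuming a document is stored as a term-to-weight dictionary and that a term missing from the document contributes 0 (both are assumptions for illustration):

```python
def matching_score(doc_weights, query_terms):
    """R(D, Q): sum of w(t, D) over the distinct terms t of the query."""
    return sum(doc_weights.get(t, 0.0) for t in set(query_terms))


d1 = {"comput": 0.2, "architect": 0.3}
d2 = {"comput": 0.1, "network": 0.5}
query = ["comput", "network"]

print(matching_score(d1, query))  # 0.2
print(matching_score(d2, query))  # 0.6
```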

Page 17

Models (Cont’d)

• Boolean Model
• Vector Space Model
• Probability Model
• Language Model
• Neural Network Model
• Fuzzy Set Model
• …

Page 18

tf*idf weighting scheme

• tf = term frequency
◦ frequency of a term/keyword in a document
◦ The higher the tf, the higher the importance (weight) for the document.
• df = document frequency
◦ number of documents containing the term
◦ distribution of the term
• idf = inverse document frequency
◦ the unevenness of term distribution in the corpus
◦ the specificity of a term to a document
◦ idf = log(d/df), where d = total number of documents
◦ The more evenly a term is distributed, the less specific it is to a document.

weight(t, D) = tf(t, D) * idf(t)
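The weighting formula can be computed directly from token lists. In this sketch the logarithm base (10, matching the worked example later in the deck), the raw-count tf, and the two toy documents are assumptions; the slide itself only writes log(d/df).

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: tf * idf}} for docs given as {doc_id: [tokens]}."""
    d = len(docs)                       # total number of documents
    df = Counter()                      # document frequency of each term
    for tokens in docs.values():
        df.update(set(tokens))          # count each term once per document
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)            # raw term frequency in this document
        weights[doc_id] = {t: tf[t] * math.log10(d / df[t]) for t in tf}
    return weights


# Hypothetical two-document corpus.
docs = {
    "D1": ["retrieval", "model", "retrieval"],
    "D2": ["vector", "model"],
}
weights = tf_idf(docs)
print(weights["D1"])  # 'model' occurs in every document, so its weight is 0
```

A term that appears in every document gets idf = log(1) = 0, which is the "evenly distributed, not specific" case described on the slide.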

Page 19

Evaluation

• A result list is returned for a query.
• What is its performance?

[Figure: diagram of the relevant set and the retrieved set; their overlap is the retrieved relevant documents]

Page 20

Metrics often used (together):

Precision = retrieved relevant docs / retrieved docs

Recall = retrieved relevant docs / relevant docs
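Both metrics follow from set intersections. A minimal sketch, with made-up document ids and a convention (an assumption) of returning 0.0 when a denominator would be zero:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)    # retrieved relevant docs
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result: 4 docs retrieved, 5 docs relevant, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)  # 0.5 0.4
```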

Page 21

Precision-Recall Trade-off

[Figure: plot with a Precision axis and a Recall axis, each running to 1.0]

• Usually, more precision means less recall, and more recall means less precision.
• Return all documents: recall = 1, but precision is very low.

Page 22

For Ranked List

Consider two result lists from two IR systems, S1 and S2, for the same query:

[Figure: ranked lists 1 and 2 with the relevant documents marked]

Which one is better?

Page 23

Average Precision

$AP = \frac{1}{n} \sum_i \frac{R(x_i)}{P(x_i)}$, where $x_i \in$ set of retrieved relevant documents

• $P(x_i)$: rank of $x_i$ in the retrieved list
• $R(x_i)$: rank of $x_i$ in the list of retrieved relevant documents
• n: number of retrieved relevant documents

List 1 (relevant documents at ranks 1, 3, 6, 9, and 10):

AP1 = ((1/1)+(2/3)+(3/6)+(4/9)+(5/10))/5 = 0.622

Page 24

Average Precision (Cont’d)

List 2 (relevant documents at ranks 1, 2, 3, 5, and 6):

AP2 = ((1/1)+(2/2)+(3/3)+(4/5)+(5/6))/5 = 0.927

S2 is better than S1.
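The two AP values can be reproduced from the ranks of the relevant documents, read off the terms of the two sums (1, 3, 6, 9, 10 for list 1 and 1, 2, 3, 5, 6 for list 2). This sketch follows the definition used in the slides, which divides by the number of retrieved relevant documents; the function name is an illustrative choice.

```python
def average_precision(relevant_ranks):
    """AP from the 1-based ranks at which relevant documents were retrieved."""
    ranks = sorted(relevant_ranks)
    if not ranks:
        return 0.0
    # The i-th relevant document (1-based) found at position `rank`
    # contributes i / rank, i.e. the precision at that position.
    return sum((i + 1) / rank for i, rank in enumerate(ranks)) / len(ranks)


ap1 = average_precision([1, 3, 6, 9, 10])
ap2 = average_precision([1, 2, 3, 5, 6])
print(round(ap1, 3), round(ap2, 3))  # 0.622 0.927
```

List 2 scores higher because its relevant documents sit closer to the top of the ranking.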

Page 25

Evaluating over multiple queries

Mean Average Precision (MAP): the arithmetic mean of the average precisions over all queries

5 queries (topics) and 2 IR systems:

      AP1   AP2   AP3   AP4   AP5   MAP
S1    0.7   0.8   0.9   0.3   0.5   0.64
S2    0.9   0.9   0.2   0.3   0.4   0.54

S1 is better than S2.
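MAP is a plain arithmetic mean, so the table values are easy to check. The per-query AP values below are copied from the table; only the function name is introduced here.

```python
def mean_average_precision(ap_values):
    """Arithmetic mean of the per-query average precisions."""
    return sum(ap_values) / len(ap_values)


s1 = [0.7, 0.8, 0.9, 0.3, 0.5]
s2 = [0.9, 0.9, 0.2, 0.3, 0.4]
print(round(mean_average_precision(s1), 2))  # 0.64
print(round(mean_average_precision(s2), 2))  # 0.54
```

S2 wins on two of the five queries and ties on one, but its poor result on the third query pulls its mean below that of S1.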

Page 26

Other Measurements

• Precision@N
• R-Precision
• F-measure
• E-measure
• …
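Three of these measures can be sketched under their usual definitions, which the slide does not spell out and which are therefore assumptions here: Precision@N is the fraction of the top N results that are relevant, R-Precision is the precision at rank R where R is the number of relevant documents, and the (balanced) F-measure is the harmonic mean of precision and recall. The ranking and relevance judgments are made up.

```python
def precision_at_n(ranking, relevant, n):
    """Fraction of the top-n ranked docs that are relevant."""
    return sum(1 for d in ranking[:n] if d in relevant) / n


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant docs."""
    r = len(relevant)
    return precision_at_n(ranking, relevant, r) if r else 0.0


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical ranking and relevance judgments.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}

print(precision_at_n(ranking, relevant, 2))  # 0.5
print(r_precision(ranking, relevant))        # 2 of the top 3 are relevant
print(f_measure(0.5, 0.4))
```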

Page 27

Problem

When a collection contains a very large number of documents, recall is hard to calculate, because it requires knowing every relevant document in the collection.

Page 28

Pooling

Step 1. Take the top N documents from the results of each IR system to form a document pool.

Step 2. Experts check the pool and label each document as relevant or non-relevant for each topic.
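Step 1 amounts to a union of the top-N prefixes of each system's ranking. A sketch with two hypothetical systems and made-up document ids; Step 2 is human judgment and is not modeled.

```python
def build_pool(runs, n):
    """Union of the top-n documents of every system's ranked list."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:n])
    return pool


# Hypothetical ranked results of two systems for one topic.
runs = {
    "S1": ["d3", "d1", "d7", "d2"],
    "S2": ["d1", "d4", "d3", "d9"],
}
print(sorted(build_pool(runs, 2)))  # ['d1', 'd3', 'd4']
```

Only the pooled documents are judged, so the assessors' workload depends on N and the number of systems rather than on the size of the collection.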

Page 29

Difficulties in text IR

• Vocabulary mismatch
◦ Synonymy: e.g. car vs. automobile
◦ Polysemy: e.g. table
• Queries are ambiguous; they are only a partial specification of the user’s need.
• Content representation may be inadequate and incomplete.
• The user is the ultimate judge, but we don’t know how the judge judges…
◦ The notion of relevance is imprecise, context- and user-dependent.

Page 30

Difficulties in web IR

• No stable document collection (spider, crawler)
• Invalid documents, duplication, etc.
• Huge number of documents (partial collection)
• Multimedia documents
• Great variation in document quality
• Multilingual problem
• …

Page 31

NLP in IR

• Simple methods: stop words, stemming
• Higher-level processing: chunking, parsing, word sense disambiguation
• Research on using NLP in IR needs more attention.

Page 32

Popular systems

• SMART: http://ftp.cs.cornell.edu/pub/smart/
• Terrier: http://ir.dcs.gla.ac.uk/terrier/
• Okapi: http://www.soi.city.ac.uk/~andym/OKAPIPACK/index.html
• Lemur: http://www-2.cs.cmu.edu/~lemur/
• etc.

Page 33

Conference and Journal

Conferences
• SIGIR
• TREC
• CLEF
• WWW
• ECIR
• …

Journals
• ACM Transactions on Information Systems (TOIS)
• ACM Transactions on Asian Language Information Processing (TALIP)
• Information Processing & Management (IP&M)
• Information Retrieval


Page 35

Idea

• Convert documents and queries into vectors, and use a Similarity Coefficient (SC) to measure their similarity.
• Presented by Gerard Salton et al. in 1975 and implemented in the SMART IR system.
• Premise: all terms are independent.

Page 36

Construct Vector

• Each dimension corresponds to a separate term.
• Wi,j = weight of term j in document or query i

Page 37

Doc-Term Matrix

N documents and M terms:

      D1     D2     D3     …   Dn
T1    W1,1   W2,1   W3,1   …   Wn,1
T2    W1,2   W2,2   W3,2   …   Wn,2
T3    W1,3   W2,3   W3,3   …   Wn,3
…     …      …      …      …   …
Tm    W1,m   W2,m   W3,m   …   Wn,m

Page 38

Three Key Problems

1. Term selection
2. Term weighting
3. Similarity Coefficient calculation

Page 39

Term Selection

• Terms represent the content of documents.
• Term purification
◦ Stemming
◦ Stop list
◦ Only choose nouns

Page 40

Term Weight

• Boolean weight: 1 = appears, 0 = does not appear
• Term frequency:
◦ tf
◦ 1 + log(tf)
◦ 1 + log(1 + log(tf))
• Inverse document frequency
• tf*idf

Page 41

Term Weight (Cont’d)

Document length

• Two opinions:
◦ Longer documents contain more terms.
◦ Longer documents have more information.
• Punish long documents and compensate short documents.
• Pivoted normalization: 1 - b + b * doclen / avgdoclen, with b in (0, 1)
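The pivoted normalization factor equals 1 for a document of average length, grows for longer documents, and shrinks for shorter ones. In this sketch the factor is the formula from the slide; dividing a raw term weight by it and the value b = 0.5 are assumptions made for illustration, since the slide gives only the factor and the range of b.

```python
def pivoted_normalizer(doclen, avgdoclen, b=0.5):
    """1 - b + b * doclen / avgdoclen, with b restricted to (0, 1)."""
    if not 0 < b < 1:
        raise ValueError("b must lie strictly between 0 and 1")
    return 1 - b + b * doclen / avgdoclen


def normalized_weight(raw_weight, doclen, avgdoclen, b=0.5):
    """Assumed usage: divide the raw weight by the length normalizer."""
    return raw_weight / pivoted_normalizer(doclen, avgdoclen, b)


print(pivoted_normalizer(100, 100))      # 1.0  (average-length document)
print(pivoted_normalizer(200, 100))      # 1.5  (long document is punished)
print(pivoted_normalizer(50, 100))       # 0.75 (short document is compensated)
print(normalized_weight(3.0, 200, 100))  # 2.0
```

Values of b closer to 1 make the correction stronger; values closer to 0 leave the raw weights almost unchanged.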

Page 42

Similarity Coefficient Calculation

Here $a_i$ and $b_i$ are the weights of term i in the two vectors being compared, D and Q.

Dot product: $SC(D, Q) = \sum_i (a_i * b_i)$

Cosine: $SC(D, Q) = \frac{\sum_i (a_i * b_i)}{\sqrt{\sum_i a_i^2 * \sum_i b_i^2}}$

Dice: $SC(D, Q) = \frac{2 \sum_i (a_i * b_i)}{\sum_i a_i^2 + \sum_i b_i^2}$

Jaccard: $SC(D, Q) = \frac{\sum_i (a_i * b_i)}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i (a_i * b_i)}$

[Figure: vectors D and Q plotted on the term axes t1 and t2]
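The four coefficients differ only in how the dot product is normalized. A sketch over plain Python lists, using the standard weighted-vector forms of the dot product, cosine, Dice, and Jaccard coefficients; the sample vectors and the choice to return 0.0 for an all-zero denominator are assumptions.

```python
import math


def dot(a, b):
    """Sum of the products of corresponding weights."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def dice(a, b):
    denom = dot(a, a) + dot(b, b)
    return 2 * dot(a, b) / denom if denom else 0.0


def jaccard(a, b):
    denom = dot(a, a) + dot(b, b) - dot(a, b)
    return dot(a, b) / denom if denom else 0.0


# Hypothetical weight vectors over three terms.
d = [1.0, 2.0, 0.0]
q = [1.0, 0.0, 1.0]
for sc in (dot, cosine, dice, jaccard):
    print(sc.__name__, round(sc(d, q), 4))
```

Unlike the raw dot product, the three normalized forms do not grow simply because a vector has larger weights, which matters when documents differ a lot in length.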

Page 43

Example

Q: “gold silver truck”

• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
• Document frequency of the jth term: dfj
• Inverse document frequency: idfj = log10(n / dfj)

tf*idf is used as the term weight here.

Page 44

Example (Cont’d)

Id   Term       df   idf
1    a          3    0
2    arrived    2    0.176
3    damaged    1    0.477
4    delivery   1    0.477
5    fire       1    0.477
6    gold       2    0.176
7    in         3    0
8    of         3    0
9    silver     1    0.477
10   shipment   2    0.176
11   truck      2    0.176

Page 45

Example (Cont’d)

tf*idf is used here.

SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) + (0.477)(0) + (0)(0.176) + (0.176)(0) = 0.031

SC(Q, D2) = 0.486

SC(Q, D3) = 0.062

The ranking would be D2, D3, D1.

• This SC uses the dot product.
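The three scores can be reproduced end to end. This sketch assumes whitespace tokenization of lowercased text, raw counts for tf, and the wording of D1 with "damaged" as in the term table; the variable names are illustrative.

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

tokenized = {d: text.split() for d, text in docs.items()}
n = len(tokenized)

# Document frequency and idf = log10(n / df) for every term in the corpus.
df = Counter(t for tokens in tokenized.values() for t in set(tokens))
idf = {t: math.log10(n / df[t]) for t in df}


def weights(tokens):
    """tf*idf weights for one document or query (unknown terms get idf 0)."""
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}


q_vec = weights(query.split())
scores = {}
for d, tokens in tokenized.items():
    d_vec = weights(tokens)
    # Dot product over the query terms; absent terms contribute 0.
    scores[d] = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())

ranking = sorted(scores, key=scores.get, reverse=True)
print({d: round(s, 3) for d, s in scores.items()})
# {'D1': 0.031, 'D2': 0.486, 'D3': 0.062}
print(ranking)  # ['D2', 'D3', 'D1']
```

D2 ranks first mainly because "silver" is rare in the corpus (df = 1) and occurs twice in D2, so it carries the largest tf*idf weight.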

Page 46

Advantages of VSM

• Fairly cheap to compute
• Yields decent effectiveness
• Very popular: SMART is one of the most commonly used academic prototypes

Page 47

Disadvantages of VSM

• No theoretical foundation
• Weights in the vectors are very arbitrary
• Assumes term independence
• Sparse matrix
