8/28/97 Information Organization and Retrieval

IR Implementation Issues, Web Crawlers and Web Search Engines

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval


Review

• Boolean Retrieval

• Ranked Retrieval

• Vector Space Model

[Figure: diagram of the retrieval process, with components labeled Information need, text input, Parse, Query, Collections, Pre-process, Index, and Rank or Match]


Boolean Model

[Figure: Venn diagram of three terms t1, t2, t3 over documents D1 to D11. The eight regions m1 to m8 are the minterms, each a conjunction of t1, t2, t3 in plain or negated form.]


Boolean Searching

“Measurement of the width of cracks in prestressed concrete beams”

Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete

[Figure: Venn diagram of the four concepts Cracks, Beams, Width measurement, and Prestressed concrete]

Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
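The relaxed query above is the OR of every 3-concept AND drawn from the 4 query concepts. A minimal sketch of that idea follows; the document names and their concept sets are hypothetical, and `matches_relaxed` is an illustrative helper, not part of any system named in these slides.

```python
from itertools import combinations

QUERY = ["cracks", "beams", "width_measurement", "prestressed_concrete"]

def matches_formal(doc_concepts, query_concepts):
    """Strict conjunction: every query concept must be present."""
    return set(query_concepts) <= set(doc_concepts)

def matches_relaxed(doc_concepts, query_concepts, k):
    """OR over all AND-combinations of k query concepts."""
    doc = set(doc_concepts)
    return any(set(combo) <= doc for combo in combinations(query_concepts, k))

# Hypothetical documents, each reduced to the concepts it is indexed under.
docs = {
    "doc_a": {"cracks", "beams", "width_measurement", "prestressed_concrete"},
    "doc_b": {"cracks", "beams", "prestressed_concrete"},
    "doc_c": {"cracks", "beams"},
}

formal_hits = sorted(d for d, c in docs.items() if matches_formal(c, QUERY))
relaxed_hits = sorted(d for d, c in docs.items() if matches_relaxed(c, QUERY, 3))
print(formal_hits)
print(relaxed_hits)
```

Lowering `k` widens the result set further, which is the trade-off between null output and overload.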


Boolean Problems

• Disjunctive (OR) queries lead to information overload

• Conjunctive (AND) queries lead to reduced, and commonly zero, results

• Conjunctive queries imply reduction in Recall


Advantages and Disadvantages of the Boolean Model

Advantages

• Complete expressiveness for any identifiable subset of collection

• Exact and simple to program

• The whole panoply of Boolean Algebra available

Disadvantages

• Complex query syntax is often misunderstood (if understood at all)

• Problems of Null output and Information Overload

• Output is not ordered in any useful fashion


Boolean Extensions

• Fuzzy Logic
– Adds weights to each term/concept
– ta AND tb is interpreted as MIN(w(ta),w(tb))
– ta OR tb is interpreted as MAX(w(ta),w(tb))

• Proximity/Adjacency operators
– Interpreted as additional constraints on Boolean AND

• TOPIC system
– Uses various weighted forms of Boolean logic and proximity information in calculating RSVs
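The fuzzy-logic interpretation of AND and OR can be sketched in a few lines. The term weights below are hypothetical values chosen only to show how the MIN and MAX rules combine.

```python
def fuzzy_and(*weights):
    """ta AND tb is interpreted as MIN(w(ta), w(tb))."""
    return min(weights)

def fuzzy_or(*weights):
    """ta OR tb is interpreted as MAX(w(ta), w(tb))."""
    return max(weights)

# Hypothetical weights of three terms in one document.
w = {"cracks": 0.9, "beams": 0.4, "concrete": 0.7}

score_and = fuzzy_and(w["cracks"], w["beams"])
score_or = fuzzy_or(w["cracks"], w["beams"])
# cracks AND (beams OR concrete)
score_nested = fuzzy_and(w["cracks"], fuzzy_or(w["beams"], w["concrete"]))
print(score_and, score_or, score_nested)
```

Because every document gets a score rather than a yes/no answer, the output can be ordered, which plain Boolean matching cannot do.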


Vector Space Model

• Documents are represented as vectors in term space
– Terms are usually stems
– Documents represented by binary vectors of terms

• Queries represented the same as documents

• Query and Document weights are based on length and direction of their vector

• A vector distance measure between the query and documents is used to rank retrieved documents


Documents in Vector Space

[Figure: documents D1 to D11 plotted as points in the three-dimensional term space with axes t1, t2, t3]


Vector Space Documents and Queries

docs  t1  t2  t3  RSV=Q.Di
D1    1   0   1   4
D2    1   0   0   1
D3    0   1   1   5
D4    1   0   0   1
D5    1   1   1   6
D6    1   1   0   3
D7    0   1   0   2
D8    0   1   0   2
D9    0   0   1   3
D10   0   1   1   5
D11   1   0   1   4
Q     1   2   3
      q1  q2  q3

[Figure: documents D1 to D11 plotted in the term space with axes t1, t2, t3]
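The RSV column is the dot product of the query vector with each binary document vector. A minimal sketch over a subset of the rows, with `rsv` as an illustrative helper name:

```python
QUERY = (1, 2, 3)  # query weights q1, q2, q3

# A subset of the binary document vectors (t1, t2, t3).
DOCS = {
    "D1": (1, 0, 1),
    "D2": (1, 0, 0),
    "D3": (0, 1, 1),
    "D5": (1, 1, 1),
    "D7": (0, 1, 0),
    "D9": (0, 0, 1),
}

def rsv(query, doc):
    """Retrieval status value: dot product Q . Di."""
    return sum(q * d for q, d in zip(query, doc))

scores = {name: rsv(QUERY, vec) for name, vec in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```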


Similarity Measures

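Simple matching (coordination level match):

$$|Q \cap D|$$

Dice’s Coefficient:

$$\frac{2\,|Q \cap D|}{|Q| + |D|}$$

Jaccard’s Coefficient:

$$\frac{|Q \cap D|}{|Q \cup D|}$$

Cosine Coefficient:

$$\frac{|Q \cap D|}{|Q|^{1/2} \, |D|^{1/2}}$$

Overlap Coefficient:

$$\frac{|Q \cap D|}{\min(|Q|, |D|)}$$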
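For binary (set-valued) queries and documents, the five measures can be sketched as set operations. This assumes Q and D are sets of terms; the example sets are hypothetical.

```python
import math

def simple_matching(q, d):
    """Coordination level match: number of shared terms."""
    return len(q & d)

def dice(q, d):
    return 2 * len(q & d) / (len(q) + len(d))

def jaccard(q, d):
    return len(q & d) / len(q | d)

def cosine(q, d):
    return len(q & d) / math.sqrt(len(q) * len(d))

def overlap(q, d):
    return len(q & d) / min(len(q), len(d))

# Hypothetical query and document term sets.
q = {"t1", "t2", "t3"}
d = {"t2", "t3", "t4", "t5"}

for measure in (simple_matching, dice, jaccard, cosine, overlap):
    print(measure.__name__, round(measure(q, d), 3))
```

All but simple matching normalize the shared-term count by the sizes of the two sets, so long documents do not win on length alone.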


Vector Space with Term Weights and Cosine Matching

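[Figure: Q, D1, and D2 drawn as vectors on axes Term A and Term B, each scaled from 0 to 1.0, with the angles between Q and the two documents marked 1 and 2]

$$D_i = (d_{i1}, w_{d_{i1}};\ d_{i2}, w_{d_{i2}};\ \dots;\ d_{it}, w_{d_{it}})$$

$$Q = (q_1, w_{q_1};\ q_2, w_{q_2};\ \dots;\ q_t, w_{q_t})$$

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2 \, \sum_{j=1}^{t} (w_{d_{ij}})^2}}$$

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

$$sim(Q, D_2) = \frac{(0.4 \times 0.2) + (0.8 \times 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2]\,[(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} = 0.98$$

$$sim(Q, D_1) = \frac{0.56}{\sqrt{0.58}} = 0.74$$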
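The cosine match with term weights can be checked numerically. A minimal sketch using the Q, D1, and D2 weights from this slide; `cosine_sim` is an illustrative helper name.

```python
import math

def cosine_sim(q, d):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(qw * dw for qw, dw in zip(q, d))
    norm = math.sqrt(sum(w * w for w in q) * sum(w * w for w in d))
    return dot / norm if norm else 0.0

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

sim_d1 = cosine_sim(Q, D1)
sim_d2 = cosine_sim(Q, D2)
print(sim_d1, sim_d2)
```

D2 scores higher than D1 because its direction is closer to that of Q, even though D1 has the larger weight on the first term.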


Problems with Vector Space

• There is no real theoretical basis for the assumption of a term space
– it is more for visualization than having any real basis
– most similarity measures work about the same regardless of model

• Terms are not really orthogonal dimensions
– Terms are not independent of all other terms


Today

• Probabilistic Retrieval (Introduction)

• Processing Ranked Queries (the role of inverted files)

• Web Crawlers - Distributed indexing of the WWW

• Probabilistic Retrieval (Details)


Probabilistic Retrieval

• Goes back to 1960’s (Maron and Kuhns)

• Robertson’s “Probability Ranking Principle”
– Retrieved documents should be ranked in decreasing probability that they are relevant to the user’s query.
– How to estimate these probabilities?

• Several methods (Model 1, Model 2, Model 3) with different emphases on how estimates are done.


Probabilistic Models: Some Notation

• D = All present and future documents

• Q = All present and future queries

• (Di,Qj) = A document query pair

• x = class of similar documents, $x \subseteq D$

• y = class of similar queries, $y \subseteq Q$

• Relevance is a relation:

$$R = \{(D_i, Q_j) \mid D_i \in D,\ Q_j \in Q,\ \text{document } D_i \text{ is judged relevant by the user submitting } Q_j\}$$


Probabilistic Models

• Model 1 -- Probabilistic Indexing, P(R|y,Di)

• Model 2 -- Probabilistic Querying, P(R|Qj,x)

• Model 3 -- Merged Model, P(R| Qj, Di)

• Model 0 -- P(R|y,x)

• Probabilities are estimated based on prior usage or relevance estimation



• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query

• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)

• Relies on accurate estimates of probabilities for accurate results


Vector and Probabilistic Models

• Support “natural language” queries

• Treat documents and queries the same

• Support relevance feedback searching

• Support ranked retrieval

• Differ primarily in theoretical basis and in how the ranking is calculated
– Vector assumes relevance
– Probabilistic relies on relevance judgments or estimates


Web Search Engines

• Most include some version of Vector Space or extended Boolean

• Some offer both “ranked” and Boolean, but not together.

• Some engines (such as those based on the original WAIS) are little more than coordination-level matching for ranked retrieval.


Web Search Engines

• Some engines use added natural language processing techniques to identify concepts
– Lycos based on work by Michael Mauldin at CMU
– Excite’s “concept-based” search may be a development of Latent Semantic Indexing

• Some search engines use Probabilistic methods (with proprietary extensions)
– Inktomi/HotBot uses a form of SLR.


Web Search Engines

• Exact algorithms are not available for commercial WWW search engines

• Many search engines appear to be hybrids offering both ranked and Boolean elements


Web Search Conclusions

• Web Search engines are stretching the performance limits of ranked retrieval algorithms

• Most Web search engines today attempt to combine the best features of ranked and Boolean searching

• There is still a long way to go before All and Only the Relevant web pages are retrieved in response to your query


Web Crawlers

• How do the web search engines get all of the items they index?

• How do you store millions of words from hundreds of sites so that you can find them quickly (and efficiently)?


Depth-First Crawling

[Figure: link graph of pages on Sites 1, 2, 3, 5, and 6]

Visit order:

Site  Page
1     1
1     2
1     4
1     6
1     3
1     5
3     1
5     1
6     1
5     2
2     1
2     2
2     3


Breadth First

[Figure: link graph of pages on Sites 1, 2, 3, 5, and 6]

Visit order:

Site  Page
1     1
2     1
1     2
1     6
1     3
2     2
2     3
1     4
3     1
1     5
5     1
5     2
6     1
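The two crawl strategies differ only in which discovered page is fetched next: the most recently found (a stack) or the earliest found (a queue). A minimal sketch follows; the link graph is hypothetical and is not the graph from the figures, and a real crawler would fetch and parse pages instead of reading a dictionary.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "site1/page1": ["site1/page2", "site2/page1"],
    "site1/page2": ["site1/page3"],
    "site2/page1": ["site2/page2"],
    "site1/page3": [],
    "site2/page2": ["site1/page1"],
}

def crawl_depth_first(start, links):
    """Follow each chain of links to its end before backing up."""
    visited, order, stack = set(), [], [start]
    while stack:
        page = stack.pop()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        # Push in reverse so the first link on the page is explored first.
        stack.extend(reversed(links.get(page, [])))
    return order

def crawl_breadth_first(start, links):
    """Visit all pages one link away, then two links away, and so on."""
    visited, order, queue = {start}, [], deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return order

dfs_order = crawl_depth_first("site1/page1", LINKS)
bfs_order = crawl_breadth_first("site1/page1", LINKS)
print(dfs_order)
print(bfs_order)
```

Both versions keep a visited set, so the link from site2/page2 back to site1/page1 does not cause a loop.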


Inverted Files

• We have seen “Vector files” conceptually; an Inverted File is a vector file “inverted” so that rows become columns and columns become rows

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0


How Are Inverted Files Created

• Documents are parsed to extract words (or stems) and these are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term      Doc #
now       1
is        1
the       1
time      1
for       1
all       1
good      1
men       1
to        1
come      1
to        1
the       1
aid       1
of        1
their     1
country   1
it        2
was       2
a         2
dark      2
and       2
stormy    2
night     2
in        2
the       2
country   2
manor     2
the       2
time      2
was       2
past      2
midnight  2


How Inverted Files are Created

• After all documents have been parsed the inverted file is sorted

Term      Doc #
a         2
aid       1
all       1
and       2
come      1
country   1
country   2
dark      2
for       1
good      1
in        2
is        1
it        2
manor     2
men       1
midnight  2
night     2
now       1
of        1
past      2
stormy    2
the       1
the       1
the       2
the       2
their     1
time      1
time      2
to        1
to        1
was       2
was       2


How Inverted Files are Created

• Multiple term entries for a single document are merged and frequency information added

Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2


How Inverted Files are Created

• The file is split into a Dictionary and a Postings file

Dictionary:

Term      N docs  Tot Freq
a         1       1
aid       1       1
all       1       1
and       1       1
come      1       1
country   2       2
dark      1       1
for       1       1
good      1       1
in        1       1
is        1       1
it        1       1
manor     1       1
men       1       1
midnight  1       1
night     1       1
now       1       1
of        1       1
past      1       1
stormy    1       1
the       2       4
their     1       1
time      2       2
to        1       2
was       1       2

Postings (in dictionary order, one row per term and document):

Doc #  Freq
2      1
1      1
1      1
2      1
1      1
1      1
2      1
2      1
1      1
1      1
2      1
1      1
2      1
2      1
1      1
2      1
2      1
1      1
1      1
2      1
2      1
1      2
2      2
1      1
1      1
2      1
1      2
2      2
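The four steps (parse, sort, merge with frequencies, split into dictionary and postings) can be sketched on the two example documents. This is an in-memory illustration under simple assumptions: lowercased alphabetic tokens, no stemming, and Python dictionaries standing in for the two files.

```python
import re
from collections import Counter

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}

def parse(docs):
    """Step 1: emit one (term, doc_id) pair per word occurrence."""
    return [(term, doc_id)
            for doc_id, text in docs.items()
            for term in re.findall(r"[a-z]+", text.lower())]

def build_inverted_file(docs):
    pairs = sorted(parse(docs))   # Step 2: sort by term, then document ID
    merged = Counter(pairs)       # Step 3: merge duplicates, keep frequency
    dictionary, postings = {}, {}
    for (term, doc_id), freq in sorted(merged.items()):
        # Step 4: postings hold (doc, freq); dictionary holds the totals.
        postings.setdefault(term, []).append((doc_id, freq))
        n_docs, total = dictionary.get(term, (0, 0))
        dictionary[term] = (n_docs + 1, total + freq)
    return dictionary, postings

dictionary, postings = build_inverted_file(DOCS)
print(dictionary["the"], postings["the"])
print(dictionary["country"], postings["country"])
```

On disk, the dictionary entry for a term would also carry a pointer to where that term's postings start; the shared dictionary key plays that role here.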


Inverted files

• Permit fast search for individual terms

• The search result for each term is a list of document IDs (and optionally, frequency and/or positional information)

• These lists can be used to solve Boolean queries:
– country: d1, d2
– manor: d2
– country and manor: d2
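Solving a Boolean query over posting lists amounts to merging sorted lists of document IDs. A minimal sketch using the two lists from the example above; `intersect` and `union` are illustrative helper names.

```python
# Posting lists of document IDs, kept in ascending order.
POSTINGS = {"country": [1, 2], "manor": [2]}

def intersect(a, b):
    """AND: walk two sorted lists in step, keeping IDs found in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: every ID that appears in either list, in ascending order."""
    return sorted(set(a) | set(b))

and_result = intersect(POSTINGS["country"], POSTINGS["manor"])
or_result = union(POSTINGS["country"], POSTINGS["manor"])
print(and_result, or_result)
```

The merge touches each posting once, so the cost of an AND grows with the lengths of the two lists rather than with the size of the collection.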


Inverted Files

• Lots of alternative implementations
– E.g.: Cheshire builds within-document frequency using a hash table during parsing
– Document IDs and frequency info are stored in a B-tree index keyed by the term.

• See the chapter on inverted files in the reader for other implementations.


Probabilistic Models (Again)

• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query

• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)

• Relies on accurate estimates of probabilities for accurate results


Probabilistic Models: Logistic Regression

• Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables.

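Log odds of relevance is a linear function of attributes:

$$\log O(R \mid q_i, d_j, t_k) = c_0 + c_1 v_1 + c_2 v_2 + \dots + c_n v_n$$

Term contributions summed:

$$\log O(R \mid q_i, d_j) = \sum_{k=1}^{m} \left[ \log O(R \mid q_i, d_j, t_k) - \log O(R) \right]$$

Probability of Relevance is inverse of log odds:

$$P(R \mid q_i, d_j) = \frac{1}{1 + e^{-\log O(R \mid q_i, d_j)}}$$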
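The three steps (linear log odds per term, summing term contributions, converting log odds to a probability) can be sketched as below. This assumes each term's contribution is its log odds minus the prior log odds log O(R); the coefficients, prior, and attribute values are hypothetical placeholders, not fitted values.

```python
import math

def term_log_odds(coeffs, values):
    """log O(R | q, d, t) = c0 + c1*v1 + ... + cn*vn."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], values))

def combined_log_odds(per_term, prior_log_odds):
    """Sum each term's contribution relative to the prior log odds."""
    return sum(lo - prior_log_odds for lo in per_term)

def probability_from_log_odds(log_odds):
    """Logistic transform: P = 1 / (1 + e^(-log odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds))

COEFFS = [-2.0, 0.5, 1.5]                # hypothetical c0, c1, c2
PRIOR_LOG_ODDS = -3.0                    # hypothetical log O(R)
TERM_VALUES = [[1.0, 0.2], [0.4, 0.9]]   # hypothetical v1, v2 per matching term

per_term = [term_log_odds(COEFFS, v) for v in TERM_VALUES]
total = combined_log_odds(per_term, PRIOR_LOG_ODDS)
p = probability_from_log_odds(total)
print(per_term, total, p)
```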


Probabilistic Models: Logistic Regression attributes

Average Absolute Query Frequency:

$$X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$$

Query Length:

$$X_2 = \sqrt{QL}$$

Average Absolute Document Frequency:

$$X_3 = \frac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$$

Document Length:

$$X_4 = \sqrt{DL}$$

Average Inverse Document Frequency:

$$X_5 = \frac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}$$

Inverse Document Frequency:

$$IDF = \frac{N - n_{t_j}}{n_{t_j}}$$

Number of Terms in common between query and document -- logged:

$$X_6 = \log M$$


Probabilistic Models: Logistic Regression

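Probability of relevance is based on Logistic regression from a sample set of documents to determine values of the coefficients. At retrieval the log odds of relevance are estimated from the 6 X attribute measures shown previously:

$$\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$$

The probability estimate then follows from the logistic transform:

$$P(R \mid Q, D) = \frac{1}{1 + e^{-\log O(R \mid Q, D)}}$$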
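A sketch of scoring one query-document pair with six attributes follows. It assumes the attribute definitions X1 = mean log query frequency, X2 = square root of query length, X3 = mean log document frequency, X4 = square root of document length, X5 = mean log of (N - n)/n, and X6 = log of the number of shared terms M, and it treats the linear combination as log odds that are passed through the logistic transform. The coefficients and the input counts are hypothetical, not fitted values.

```python
import math

def lr_attributes(query_tf, doc_tf, query_len, doc_len, doc_freq, n_docs):
    """Return [X1..X6] for one query-document pair, or None if no shared terms.

    query_tf, doc_tf: term -> frequency in the query / in the document
    doc_freq: term -> number of documents containing the term (must be < n_docs)
    """
    common = [t for t in query_tf if t in doc_tf]
    m = len(common)
    if m == 0:
        return None
    x1 = sum(math.log(query_tf[t]) for t in common) / m
    x2 = math.sqrt(query_len)
    x3 = sum(math.log(doc_tf[t]) for t in common) / m
    x4 = math.sqrt(doc_len)
    x5 = sum(math.log((n_docs - doc_freq[t]) / doc_freq[t]) for t in common) / m
    x6 = math.log(m)
    return [x1, x2, x3, x4, x5, x6]

def probability_of_relevance(coeffs, xs):
    """Linear combination of attributes as log odds, then logistic transform."""
    log_odds = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], xs))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients c0..c6; real values come from regression
# on a sample of documents with relevance judgments.
COEFFS = [-3.5, 0.4, -0.2, 0.3, -0.05, 0.5, 1.0]

xs = lr_attributes(
    query_tf={"country": 1, "manor": 1},
    doc_tf={"country": 2, "manor": 1},
    query_len=4,
    doc_len=9,
    doc_freq={"country": 4, "manor": 1},
    n_docs=11,
)
p = probability_of_relevance(COEFFS, xs)
print(xs, p)
```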



Advantages

• Strong theoretical basis

• In principle should supply the best predictions of relevance given available information

• Can be implemented similarly to Vector

Disadvantages

• Relevance information is required -- or is “guesstimated”

• Important indicators of relevance may not be terms -- though terms only are usually used

• Optimally requires on-going collection of relevance information
