Source: zhaojin/cs3245_2019/w13.pdf

Introduction to Information Retrieval

CS3245 – Information Retrieval

Search Advertising, Duplicate Detection and Revision

31

CS3245 – Information Retrieval

Last Time
Chapter 20: Crawling
Chapter 21: Link Analysis

Ch 11-12

2

CS3245 – Information Retrieval

Today
Exam Matters
Revision
Where to go from here

3

Chapter 19: Search Advertising, Duplicate Detection

CS3245 – Information Retrieval

Web search overview

4

Ch 19

CS3245 – Information Retrieval

IR on the web vs. IR in general
On the web, search is not just a nice feature. Search is a key enabler of the web: content creation, financing, interest aggregation, etc. → look at search ads (Search Advertising)
The web is a chaotic and uncoordinated document collection → lots of duplicates – need to detect duplicates (Duplicate Detection)

5

Ch 19

CS3245 – Information Retrieval

Without search engines, the web wouldn't work
Without search, content is hard to find.
→ Without search, there is no incentive to create content. Why publish something if nobody will read it? Why publish something if I don't get ad revenue from it?
Somebody needs to pay for the web: servers, web infrastructure, content creation. A large part today is paid by search ads. Search pays for the web.

6

Ch 19

CS3245 – Information Retrieval

Interest aggregation
Unique feature of the web: a small number of geographically dispersed people with similar interests can find each other.
Elementary school kids with hemophilia
People interested in translating R5RS Scheme into relatively portable C (open source project)
Search engines are a key enabler for interest aggregation.

7

Ch 19

CS3245 – Information Retrieval

SEARCH ADVERTISING

8

Sec 19.3

CS3245 – Information Retrieval

1st Generation of Search Ads: Goto (1996)

9

Sec 19.3

CS3245 – Information Retrieval

1st Generation of search ads: Goto (1996)
Buddy Blake bid the maximum ($0.38) for this search. He paid $0.38 to Goto every time somebody clicked on the link.
Pages were simply ranked according to bid – revenue maximization for Goto.
No separation of ads/docs. Only one result list.
Upfront and honest: no relevance ranking, but Goto did not pretend there was any.

10

Sec 193

CS3245 – Information Retrieval

2nd generation of search ads: Google (2000)

11

SogoTrade appears in search results. SogoTrade appears in ads.
Do search engines rank advertisers higher than non-advertisers? All major search engines claim "no".

Sec 19.3

CS3245 – Information Retrieval

Do ads influence editorial content?
Similar problem at newspapers / TV channels: a newspaper is reluctant to publish harsh criticism of its major advertisers. The line often gets blurred at newspapers / on TV.
No known case of this happening with search engines yet.
Leads to the job of white and black hat search engine optimization (organic) and search engine marketing (paid).

12

Sec 193

CS3245 – Information Retrieval

How are ads ranked?
Advertisers bid for keywords – sale by auction.
Open system: anybody can participate and bid on keywords.
Advertisers are only charged when somebody clicks on their ad (i.e., CPC).
How does the auction determine an ad's rank and the price paid for the ad?
The basis is a second price auction, but with twists.
For the bottom line, this is perhaps the most important research area for search engines – computational advertising. Squeezing an additional fraction of a cent from each ad means billions of additional revenue for the search engine.

13

Sec 19.3

CS3245 – Information Retrieval

How are ads ranked?
First cut: according to bid price – à la Goto. Bad idea: open to abuse.
Example: query [husband cheating] → ad for divorce lawyer. We don't want to show nonrelevant ads.
Instead: rank based on bid price and relevance.
Key measure of ad relevance: click-through rate = CTR = clicks per impression.
Result: a nonrelevant ad will be ranked low, even if this decreases search engine revenue short-term.
Hope: overall acceptance of the system and overall revenue are maximized if users get useful information.

14

Sec 193

CS3245 – Information Retrieval

Google's second price auction
bid: maximum bid for a click by the advertiser.
CTR: click-through rate: when an ad is displayed, what percentage of the time do users click on it? CTR is a measure of relevance.
ad rank: bid × CTR. This trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is.
rank: rank in the auction.
paid: second price auction price paid by the advertiser.

15

Sec 19.3

CS3245 – Information Retrieval

Google's second price auction
Second price auction: the advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) – related to the Vickrey Auction.
price1 × CTR1 = bid2 × CTR2 (this will result in rank1 = rank2)
price1 = bid2 × CTR2 / CTR1

p1 = bid2 × CTR2 / CTR1 = 3.00 × 0.03 / 0.06 = 1.50
p2 = bid3 × CTR3 / CTR2 = 1.00 × 0.08 / 0.03 = 2.67
p3 = bid4 × CTR4 / CTR3 = 4.00 × 0.01 / 0.08 = 0.50
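A small Python sketch (illustrative only, not Google's actual mechanism) that reproduces these prices. The bid/CTR table itself did not survive extraction, so the values below are reconstructed to be consistent with the numbers above.

```python
# Hypothetical advertisers, with bids and CTRs chosen to reproduce the prices above.
advertisers = [
    ("A", 4.00, 0.010),
    ("B", 3.00, 0.030),
    ("C", 2.00, 0.060),
    ("D", 1.00, 0.080),
]

# Rank by ad rank = bid x CTR, highest first.
ranked = sorted(advertisers, key=lambda a: a[1] * a[2], reverse=True)

for pos, (name, bid, ctr) in enumerate(ranked):
    if pos + 1 < len(ranked):
        _, next_bid, next_ctr = ranked[pos + 1]
        # Second price: pay the minimum needed to keep this position (ignoring the +1 cent).
        price = next_bid * next_ctr / ctr
    else:
        price = 0.0  # lowest-ranked ad has nobody below it
    print(f"rank {pos + 1}: {name} bid=${bid:.2f} CTR={ctr:.2f} pays ${price:.2f}/click")
```

Running this prints ranks C, B, D, A with prices 1.50, 2.67, 0.50 and 0.00 per click, matching the worked example.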

16

Sec 193

CS3245 – Information Retrieval

Keywords with high bids
Top 10 Most Expensive Google Keywords: Insurance, Loans, Mortgage, Attorney, Credit, Lawyer, Donate, Degree, Hosting, Claim

17

Sec 19.3

http://www.wordstream.com/articles/most-expensive-keywords

CS3245 – Information Retrieval

Search ads: A win-win-win
The search engine company gets revenue every time somebody clicks on an ad.
The user only clicks on an ad if they are interested in the ad. Search engines punish misleading and nonrelevant ads. As a result, users are often satisfied with what they find after clicking on an ad.
The advertiser finds new customers in a cost-effective way.

18

Sec 193

CS3245 – Information Retrieval

Not a win-win-win: Keyword arbitrage
Buy a keyword on Google, then redirect traffic to a third party that is paying much more than you are paying Google. E.g., redirect to a page full of ads.
This rarely makes sense for the user.
(Ad) spammers keep inventing new tricks; the search engines need time to catch up with them: Adversarial Information Retrieval.

19

Sec 19.3

CS3245 – Information Retrieval

Not a win-win-win: Violation of trademarks
Example: geico. During part of 2005, the search term "geico" on Google was bought by competitors. Geico lost this case in the United States. Louis Vuitton lost a similar case in Europe.
See http://support.google.com/adwordspolicy/answer/6118?rd=1: It's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site.

20

Sec 193

CS3245 – Information Retrieval

DUPLICATE DETECTION

21

Sec 19.6

CS3245 – Information Retrieval

Duplicate detection
The web is full of duplicated content, more so than many other collections.
Exact duplicates: easy to detect; use a hash/fingerprint (e.g., MD5).
Near-duplicates: more common on the web, difficult to eliminate.
For the user, it's annoying to get a search result with near-identical documents.
Marginal relevance is zero: even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate.

22

Sec 196

CS3245 – Information Retrieval

Near-duplicates: Example

23

Sec 19.6

CS3245 – Information Retrieval

Detecting near-duplicates
Compute similarity with an edit-distance measure.
We want "syntactic" (as opposed to semantic) similarity. True semantic similarity (similarity in content) is too difficult to compute.
We do not consider documents near-duplicates if they have the same content but express it with different words.
Use a similarity threshold θ to make the call "is / isn't a near-duplicate".
E.g., two documents are near-duplicates if similarity > θ = 80%.

24

Sec 196

CS3245 – Information Retrieval

Recall: Jaccard coefficient
A commonly used measure of the overlap of two sets. Let A and B be two sets. Jaccard coefficient:
JACCARD(A, B) = |A ∩ B| / |A ∪ B|
JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅.
A and B don't have to be the same size.
Always assigns a number between 0 and 1.

25

Sec 19.6

CS3245 – Information Retrieval

Jaccard coefficient: Example
Three documents:
d1: "Jack London traveled to Oakland"
d2: "Jack London traveled to the city of Oakland"
d3: "Jack traveled from Oakland to London"
Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?
J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0.
Note: very sensitive to dissimilarity.
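A short Python sketch (not from the slides) that builds the bigram shingle sets and verifies both coefficients:

```python
def shingles(text, n=2):
    """Return the set of word n-grams (shingles) of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    return len(a & b) / len(a | b)

d1 = "Jack London traveled to Oakland"
d2 = "Jack London traveled to the city of Oakland"
d3 = "Jack traveled from Oakland to London"

s1, s2, s3 = shingles(d1), shingles(d2), shingles(d3)
print(jaccard(s1, s2))  # 3/8 = 0.375
print(jaccard(s1, s3))  # 0.0
```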

26

Sec 196

CS3245 – Information Retrieval

A document as a set of shingles
A shingle is simply a word n-gram. Shingles are used as features to measure the syntactic similarity of documents.
For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose }.
We define the similarity of two documents as the Jaccard coefficient of their shingle sets.

27

Sec 19.6

CS3245 – Information Retrieval

Fingerprinting
We can map shingles to a large integer space [1..2^m] (e.g., m = 64) by fingerprinting.
We use s_k to refer to the shingle's fingerprint in 1..2^m.
This doesn't directly help us – we are just converting strings to large integers.
But it will help us compute an approximation to the actual Jaccard coefficient quickly.

28

Sec 196

CS3245 – Information Retrieval

Documents as sketches
The number of shingles per document is large; it is difficult to compare them exhaustively.
To make it fast, we use a sketch: a subset of the shingles of a document.
The size of a sketch is, say, n = 200, and it is defined by a set of permutations π1 … π200. Each πi is a random permutation on 1..2^m.
The sketch of d is defined as:
< min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) >
(a vector of 200 numbers)

29

Sec 19.6

CS3245 – Information Retrieval

30

Deriving a sketch element: a permutation of the original hashes
[Figure: the hashed shingles s_k of document 1 and document 2, before and after applying a permutation π.]
We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?" In this case, permutation π says d1 ≈ d2.

Sec 19.6

30

CS3245 – Information Retrieval

Proof that J(S(d_i), S(d_j)) ≅ P(x_i = x_j)
Let C00 = # of rows in A where both entries are 0; define C11, C10, and C01 likewise.
J(S_i, S_j) is then equivalent to C11 / (C10 + C01 + C11).
P(x_i = x_j) is then equivalent to C11 / (C10 + C01 + C11).

31

[Figure: a 0/1 matrix A with columns S_i and S_j, plus example random permutations π(1), π(2), π(3), …, π(n=200) of its rows; in this example J(S_i, S_j) = 2/5, and each permutation yields either x_i ≠ x_j or x_i = x_j.]

We view a matrix A with 1 column per set of hashes. Element A_ij = 1 if element i in set S_j is present.
Permutation π(n) is a random reordering of the rows in A.
x_k is the first non-zero entry in π(d_k), i.e., the first shingle present in document k.

Sec 19.6

CS3245 – Information Retrieval

Estimating Jaccard
Thus, the proportion of successful permutations is the Jaccard coefficient.
Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s).
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial.
Estimator of the probability of success: the proportion of successes in n Bernoulli trials (n = 200).
Our sketch is based on a random selection of permutations.
Thus, to compute Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200.
k/n = k/200 estimates J(d1, d2).

32
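A minimal MinHash sketch in Python, with simple random hash functions standing in for the permutations π_i (an illustration of the estimator, not production code; the shingle sets are the bigrams of d1 and d2 from the earlier Jaccard example):

```python
import random

def minhash_sketch(shingle_set, n_hashes=200, m=2**31 - 1, master_seed=42):
    """One sketch entry per 'permutation': the minimum hashed shingle value."""
    rnd = random.Random(master_seed)  # same seed -> same hash functions for every document
    coeffs = [(rnd.randrange(1, m), rnd.randrange(0, m)) for _ in range(n_hashes)]
    return [min((a * hash(s) + b) % m for s in shingle_set) for a, b in coeffs]

def estimate_jaccard(sk1, sk2):
    """Proportion of 'successful' permutations, i.e. positions where the minima agree."""
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

s1 = {("jack", "london"), ("london", "traveled"), ("traveled", "to"), ("to", "oakland")}
s2 = {("jack", "london"), ("london", "traveled"), ("traveled", "to"), ("to", "the"),
      ("the", "city"), ("city", "of"), ("of", "oakland")}

sk1, sk2 = minhash_sketch(s1), minhash_sketch(s2)
print(estimate_jaccard(sk1, sk2))   # close to the true J(d1, d2) = 0.375
```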

Sec 196

CS3245 – Information Retrieval

Shingling: Summary
Input: N documents.
Choose the n-gram size for shingling, e.g., n = 5.
Pick 200 random permutations, represented as hash functions.
Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document.
Compute pairwise similarities.
Take the transitive closure of documents with similarity > θ.
Index only one document from each equivalence class.

33

Sec 196

CS3245 – Information Retrieval

Summary
Search Advertising: a type of crowdsourcing. Ask advertisers how much they want to spend; ask searchers how relevant an ad is. Auction mechanics.
Duplicate Detection: represent documents as shingles. Calculate an approximation to the Jaccard coefficient by using random trials.

34

CS3245 – Information Retrieval

EXAM MATTERS

35

CS3245 – Information Retrieval

36

Subscribe via the NUS EduRec System by 20 May 2019.
Get Exam Results earlier via SMS on 3 Jun 2019.
For more information:
https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html

CS3245 – Information Retrieval

Exam Format

37

Date: 27 Apr (1pm)
Venue: SR3 (COM1-02-12)
Open book

CS3245 – Information Retrieval

Exam Topics
Anything covered in lecture (through slides) and the corresponding sections in the textbook.
For sample questions, refer to tutorials, the midterm, and past year exam papers.
If in doubt, ask on the forum.

38

CS3245 – Information Retrieval

Exam Topic Distribution
Focus on topics covered from Weeks 5 to 12.
Q1 is true/false questions on various topics (20 marks).
Q2-Q9 are calculation / essay questions, topic specific (8~12 marks each). Don't forget your calculator.

39

CS3245 – Information Retrieval

40

REVISION

CS3245 – Information Retrieval

Week 1: Ngram Models
A language model is created based on a collection of text and used to assign a probability to a sequence of words.
Unigram LM: bag of words.
Ngram LM: use n-1 tokens as the context to predict the nth token. Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can be used to do information retrieval, as introduced in Week 11.

41

CS3245 – Information Retrieval

Unigram LM with Add-1 smoothing
Not used in practice, but the most basic to understand.
Idea: add 1 count to all entries in the LM, including those that are not seen.
E.g., assuming |V| = 11 (the total # of word types in both lines):

Q2 (By Probability): "I don't want"
P(Aerosmith): 0.11 × 0.11 × 0.11 ≈ 1.3E-3
P(LadyGaga): 0.15 × 0.05 × 0.15 ≈ 1.1E-3
Winner: Aerosmith

42

Aerosmith counts: I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1; Total Count: 18
LadyGaga counts: I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2; Total Count: 20
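A small Python sketch of add-1 scoring over these counts (the function and variable names are illustrative, not from the slides):

```python
aerosmith = {"I": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
             "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}   # total 18
ladygaga  = {"I": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
             "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}   # total 20

V = len(set(aerosmith) | set(ladygaga))   # 11 word types across both lines

def add_one_prob(query, counts):
    """P(query) under a unigram LM with add-1 smoothing."""
    total = sum(counts.values())
    p = 1.0
    for w in query.split():
        p *= (counts.get(w, 0) + 1) / (total + V)
    return p

q = "I don't want"
print(add_one_prob(q, aerosmith), add_one_prob(q, ladygaga))  # Aerosmith scores higher
```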

CS3245 – Information Retrieval

Week 2: Basic (Boolean) IR
Basic inverted indexes: in-memory dictionary and on-disk postings.
Key characteristic: sorted order for postings.
Boolean query processing: intersection by linear-time "merging".
Simple optimizations by expected size.

43

Ch 1

CS3245 – Information Retrieval

Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Doc frequency information is also stored. Why frequency?

Sec 1.2

44

CS3245 – Information Retrieval

The merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

Brutus: 2 4 8 16 32 64 128
Caesar: 1 2 3 5 8 13 21 34
Brutus AND Caesar → 2 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.
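A sketch of the linear-time intersection ("merge") of two sorted postings lists, reproducing the Brutus AND Caesar example:

```python
def intersect(p1, p2):
    """Intersect two docID lists sorted in increasing order, in O(x+y) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```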

Sec 13

45

CS3245 – Information Retrieval

Week 3: Postings lists and Choosing terms
Faster merging of postings lists: skip pointers.
Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.

46

Ch 2

CS3245 – Information Retrieval

Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results.
How to do it? And where do we place skip pointers?

47

Postings: 2 4 8 41 48 64 128 (with skip pointers to 41 and 128)
Postings: 1 2 3 8 11 17 21 31 (with skip pointers to 11 and 31)
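A sketch of intersection with skip pointers. The representation of skips as a position-to-position dict, and their placement, are assumptions made here for illustration:

```python
def intersect_with_skips(p1, p2, skips1, skips2):
    """Intersect sorted docID lists; skipsX maps a position to its skip-target position."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            # Follow skips on p1 only while the skip target does not overshoot p2[j].
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                while i in skips1 and p1[skips1[i]] <= p2[j]:
                    i = skips1[i]
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                while j in skips2 and p2[skips2[j]] <= p1[i]:
                    j = skips2[j]
            else:
                j += 1
    return answer

# Hypothetical skip placement on the postings from the figure above.
p1 = [2, 4, 8, 41, 48, 64, 128]
p2 = [1, 2, 3, 8, 11, 17, 21, 31]
skips1 = {0: 3, 3: 6}   # 2 -> 41, 41 -> 128
skips2 = {0: 4, 4: 7}   # 1 -> 11, 11 -> 31
print(intersect_with_skips(p1, p2, skips1, skips2))  # [2, 8]
```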

Sec 23

CS3245 – Information Retrieval

Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words.
For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term.
Two-word phrase query processing is now immediate.

48

Sec 2.4.1

CS3245 – Information Retrieval

Positional index example
For phrase queries, we use a merge algorithm recursively at the document level.
Now we need to deal with more than just equality.

49

<be: 993427; 1: <7, 18, 33, 72, 86, 231>; 2: <3, 149>; 4: <17, 191, 291, 430, 434>; 5: <363, 367>; …>

Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 2.4.2

CS3245 – Information Retrieval

Choosing terms
Definitely language specific.
In English we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.

50

CS3245 – Information Retrieval

Week 4: The dictionary and tolerant retrieval
Data structures for the dictionary: hash tables, trees.
Learning to be tolerant:
1. Wildcards: general trees, Permuterm, Ngrams redux
2. Spelling Correction: edit distance, Ngrams re-redux
3. Phonetic: Soundex

51

CS3245 – Information Retrieval

Hash Table
Each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree: O(1).
Cons (not very tolerant):
No easy way to find minor variants: judgment / judgement.
No prefix search.
If the vocabulary keeps growing, we need to occasionally do the expensive operation of rehashing everything.

52

Sec 3.1

CS3245 – Information Retrieval

Wildcard queries
How can we handle *'s in the middle of a query term? E.g., co*tion.
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wildcard queries so that the * always occurs at the end.
This gives rise to the Permuterm index.
Alternative: K-gram index with postfiltering.

53

Sec 3.2

CS3245 – Information Retrieval

Spelling correction
Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q.
How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.

54
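A standard dynamic-programming sketch of Levenshtein edit distance (illustrative, unit costs):

```python
def edit_distance(s, t):
    """Levenshtein distance with insert/delete/replace each costing 1."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(edit_distance("judgment", "judgement"))  # 1
```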

Sec 332

CS3245 – Information Retrieval

Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing. Merge sort is effective for disk-based sorting (avoid seeks).
Single-Pass In-Memory Indexing: no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur).
Distributed indexing using MapReduce.
Dynamic indexing: multiple indices, logarithmic merge.

55

CS3245 – Information Retrieval

BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes. These are generated as we parse docs.
Must now sort 100M 8-byte records by termID.
Define a Block as ~10M such records: we can easily fit a couple into memory, and will have 10 such blocks for our collection.
Basic idea of the algorithm: accumulate postings for each block, sort, and write to disk; then merge the blocks into one long sorted order.

Sec 4.2

56

CS3245 – Information Retrieval

SPIMI: Single-pass in-memory indexing
Key idea 1: Generate separate dictionaries for each block – no need to maintain a term-termID mapping across blocks.
Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block.
These separate indices can then be merged into one big index.

Sec 4.3

57

CS3245 – Information Retrieval

Distributed Indexing: MapReduce Data flow

58

[Figure: MapReduce data flow. A Master assigns splits of the input to Parsers (map phase), which write segment files partitioned into term ranges (a-b, c-d, …, y-z); Inverters (reduce phase) each collect one term range from all segment files and write the final postings.]

CS3245 – Information Retrieval

Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both and merge the results.
Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs output on a search result by this invalidation bit-vector.
Periodically re-index into one main index. Assuming T = total # of postings and n = size of the auxiliary index, we touch each posting up to floor(T/n) times.
Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 4.5

59

CS3245 – Information Retrieval

Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
Either write merge Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2.
… and so on (loop for log levels).

Sec 4.5

60

CS3245 – Information Retrieval

Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws.
Compression to make the index smaller and faster:
Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding.
Postings compression: gap encoding.

61

CS3245 – Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k = 4.
Need to store term lengths (1 extra byte).

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

Sec 5.2

CS3245 – Information Retrieval

Front coding
Sorted words commonly have a long common prefix – store differences only. Used for the last k-1 terms in a block of k.
8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion
(encodes "automat"; the digits give the extra length beyond "automat")
Begins to resemble general string compression.

Information Retrieval 63

Sec 5.2

CS3245 – Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID:
computer: 33, 47, 154, 159, 202, …
Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …
Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec 5.3

CS3245 – Information Retrieval

Variable Byte Encoding: Example
docIDs: 824, 829, 215406
gaps: 5, 214577
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings are stored as the byte concatenation 00000110 10111000 10000101 00001101 00001100 10110001.
Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.

Sec 5.3

512 + 256 + 32 + 16 + 8 = 824
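A small Python sketch of VB encoding that reproduces the byte stream above (the first "gap" is the first docID, 824, itself):

```python
def vb_encode_number(n):
    """VB-encode one gap: 7 payload bits per byte; the high bit marks the final byte."""
    bytes_out = []
    while True:
        bytes_out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_out[-1] += 128  # set the termination bit on the last byte
    return bytes_out

def vb_encode(gaps):
    return [b for g in gaps for b in vb_encode_number(g)]

gaps = [824, 5, 214577]
print([format(b, "08b") for b in vb_encode(gaps)])
# ['00000110', '10111000', '10000101', '00001101', '00001100', '10110001']
```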

CS3245 – Information Retrieval

Week 7: Vector space ranking
1. Represent the query as a weighted tf-idf vector.
2. Represent each document as a weighted tf-idf vector.
3. Compute the cosine similarity score for the query vector and each document vector.
4. Rank documents with respect to the query by score.
5. Return the top K (e.g., K = 10) to the user.

66

CS3245 – Information Retrieval

tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight.
Best known weighting scheme in IR. (Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.)
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.

67

w_(t,d) = (1 + log10 tf_(t,d)) × log10(N / df_t)

Sec 6.2.2

CS3245 – Information Retrieval

Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents".
Key idea 2: Rank documents according to their proximity to the query in this space.
proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity
Motivation: we want to rank more relevant documents higher than less relevant documents.

68

Sec 6.3

CS3245 – Information Retrieval

Cosine for length-normalized vectors
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
cos(q, d) = q · d = Σ_i q_i d_i, for length-normalized q and d.

Information Retrieval 69
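A toy Python sketch of tf-idf weighting with cosine scoring on length-normalized sparse vectors. The collection statistics below are made up for illustration; this applies the same tf-idf weighting on both query and document sides rather than any particular SMART variant:

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, N):
    """w = (1 + log10 tf) * log10(N/df), then length-normalize."""
    tf = Counter(tokens)
    w = {t: (1 + math.log10(c)) * math.log10(N / df[t]) for t, c in tf.items() if df.get(t)}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

def cosine(q, d):
    """Dot product of two length-normalized sparse vectors."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

N = 1000                                        # hypothetical collection size
df = {"best": 50, "car": 100, "insurance": 20}  # hypothetical document frequencies
query = tfidf_vector(["best", "car", "insurance"], df, N)
doc = tfidf_vector(["car", "insurance", "car", "insurance", "best", "car"], df, N)
print(round(cosine(query, doc), 3))
```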

CS3245 – Information Retrieval

Information Retrieval 70

Sec 6.4

tf-idf weighting has many variants

CS3245 – Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute:
Approximating the actual correct results.
Skipping unnecessary documents.
In actual data: dealing with zones and fields, and query term proximity.
Resources for today: IIR 7, 6.1

71

CS3245 – Information Retrieval

Recap: Computing cosine scores

72

Sec 6.3.3

CS3245 – Information Retrieval

Generic approach
Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but it has many docs from among the top K. Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 7.1.1

CS3245 – Information Retrieval

Heuristics
Index elimination: high-idf query terms only; docs containing many query terms.
Champion lists: generalized to high-low lists and tiered indexes.
Impact-ordered postings: early termination; idf-ordered terms.
Cluster pruning.

74

Sec 7.1.1

CS3245 – Information Retrieval

Net score
Consider a simple total score combining cosine relevance and authority:
net-score(q, d) = g(d) + cosine(q, d)
Can use some other linear combination than an equal weighting; indeed, any function of the two "signals" of user happiness.
Now we seek the top K docs by net score.

75

Sec 7.1.4

CS3245 – Information Retrieval

Parametric Indices
Fields:
Year = 1601 is an example of a field.
Field or parametric index: postings for each field value. Sometimes build range trees (B-trees), e.g., for dates.
A field query is typically treated as a conjunction (the doc must be authored by shakespeare).
Zones:
A zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, …
Build inverted indexes on zones as well to permit querying.

76

Sec 6.1

CS3245 – Information Retrieval

Week 9: IR Evaluation
How do we know if our results are any good?
Benchmarks; precision and recall; composite measures; test collections.
A/B Testing.

77

Ch 8

CS3245 – Information Retrieval

A precision-recall curve
[Figure: a precision-recall curve, with precision on the y-axis and recall on the x-axis, both from 0.0 to 1.0.]

78

Sec 8.4

CS3245 – Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 8.5

Kappa measure: an agreement measure among judges, designed for categorical judgments; it corrects for chance agreement.
Kappa (K) = [P(A) - P(E)] / [1 - P(E)]
P(A): proportion of the time the judges agree.
P(E): what agreement would be by chance.
Gives 0 for chance agreement, 1 for total agreement.
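A small Python sketch of the kappa computation for binary relevance judgments, using pooled marginals for P(E) as in IIR 8.5 (the judgment lists are made up):

```python
def kappa(judge1, judge2):
    """Kappa agreement of two equal-length lists of 0/1 relevance judgments."""
    assert len(judge1) == len(judge2)
    n = len(judge1)
    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n
    # Chance agreement P(E) from the pooled marginals of both judges.
    p_rel = (sum(judge1) + sum(judge2)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

j1 = [1, 1, 0, 1, 0, 0, 1, 1]
j2 = [1, 0, 0, 1, 0, 1, 1, 1]
print(round(kappa(j1, j2), 3))   # about 0.467
```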

CS3245 – Information Retrieval

A/B testing
Purpose: test a single innovation.
Prerequisite: you have a large system with many users.
Have most users use the old system; divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result.
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.

80

Sec 8.6.3

CS3245 – Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR
Chapter 10.3: XML IR
Basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR.
Chapter 9: Query Refinement
1. Relevance Feedback (document level): Explicit RF – Rocchio (1971); when does it work?; variants: implicit and blind.
2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation.

81

CS3245 – Information Retrieval

Sec 9.2.2

82

Rocchio (1971)
In practice:
Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. (Different from Cr and Cnr, as we only get judgments from a few documents.)
α, β, γ = weights (hand-chosen or set empirically).
Popularized in the SMART system (Salton).
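The Rocchio formula itself did not survive extraction; the standard form it refers to (as popularized in SMART) is:

```latex
\vec{q}_m = \alpha\,\vec{q}_0
          + \beta\,\frac{1}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j
          - \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j
```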

CS3245 – Information Retrieval

Query Expansion
In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents.
In query expansion, users give additional input (good / bad search term) on words or phrases.

Sec 9.2.2

83

CS3245 – Information Retrieval

Co-occurrence Thesaurus
The simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix.
w_ij = (normalized) weight for (t_i, d_j).
For each t_i, pick the terms with high values in C.
(In NLTK. Did you forget?)

Sec 9.2.3

84

CS3245 – Information Retrieval

Main idea: lexicalized subtrees
Aim: have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.

85

[Figure: an XML document Book with Title "Microsoft" and Author "Bill Gates", decomposed into its lexicalized subtrees, e.g. Author/"Bill Gates", Title/"Microsoft", Author/"Bill", Author/"Gates", Book/Title/"Microsoft" ("with words").]

CS3245 – Information Retrieval

Context resemblance
A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:
|c_q| and |c_d| are the number of nodes in the query path and the document path, respectively.
c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes.
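The CR function itself did not survive extraction; the standard definition (IIR Ch. 10), consistent with the worked example on the next slide, is:

```latex
\mathrm{CR}(c_q, c_d) =
\begin{cases}
\frac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\
0 & \text{otherwise}
\end{cases}
```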

86

CS3245 – Information Retrieval

SimNoMerge example

87

Query: <c1, t>, e.g., author/"Bill"

Inverted index (dictionary and postings):
<c1, t> (e.g., author/"Bill"): <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> (e.g., title/"Bill"): <d2, 0.25> <d3, 0.1> <d12, 0.9>
<c3, t> (e.g., book/author/firstname/"Bill"): <d3, 0.7> <d6, 0.8> <d9, 0.5>

Context resemblance: CR(c1, c1) = 1.0; CR(c1, c2) = 0.0; CR(c1, c3) = 0.60.
OK to ignore Query vs. <c2, t> since CR = 0.0; remaining: Query vs. <c1, t> and Query vs. <c3, t>.
If w_q = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each term is Context Resemblance × Query Term Weight × Document Term Weight; all weights have been normalized).
This example is slightly different from the book.

CS3245 – Information Retrieval

INEX relevance assessments
The relevance-coverage combinations are quantized by a quantization function Q.
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval; Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of the quantized grades of the components in A.

88

CS3245 – Information Retrieval

Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval; Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model; BestMatch25 (Okapi)
Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 – Information Retrieval

90

Binary Independence Model (BIM)
Traditionally used with the PRP.
Assumptions:
"Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g., document d is represented by the vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but it works in practice – a naïve assumption).

Sec 11.3

90

CS3245 – Information Retrieval

Deriving a Ranking Function for Query Terms
So it boils down to computing the RSV. We can rank documents using the log odds ratios c_t for the terms in the query.
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 - u_t):
c_t = log [ p_t (1 - u_t) / ( u_t (1 - p_t) ) ]
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so operationally we sum c_t quantities in accumulators for query terms appearing in documents, just as in the VSM.

91

Sec 11.3.1

CS3245 – Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:
tf_tq: term frequency in the query q.
k3: tuning parameter controlling term frequency scaling of the query.
No length normalization of queries (because retrieval is being done with respect to a single fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, with b = 0.75.
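A Python sketch of the document-side BM25 score with the usual k1 and b parameters (notation follows IIR 11.4.3; the numbers in the example call are made up):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N, k1=1.2, b=0.75):
    """Sum over query terms of idf x saturated, length-normalized tf."""
    score = 0.0
    for t in query_terms:
        if t not in doc_tf or t not in df:
            continue
        idf = math.log10(N / df[t])
        tf = doc_tf[t]
        norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
        score += idf * (k1 + 1) * tf / (norm + tf)
    return score

print(bm25_score(["car", "insurance"],
                 doc_tf={"car": 3, "insurance": 1}, doc_len=120,
                 avg_doc_len=100, df={"car": 100, "insurance": 20}, N=1000))
```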

Sec 1143

92

CS3245 – Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case you build an information retrieval scheme in the exact same way.
The difference for probabilistic IR is that, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 11.4.1

93

CS3245 – Information Retrieval

Using language models in IR
Each document is treated as (the basis for) a language model.
Given a query q, rank documents based on P(d|q) = P(q|d) P(d) / P(q):
P(q) is the same for all documents, so we can ignore it.
P(d) is the prior – often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g., those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 12.2

CS3245 – Information Retrieval

Mixture model: Summary
What we model: the user has a document in mind and generates the query from this document.
The equation represents the probability that the document that the user had in mind was in fact this one.
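The equation did not survive extraction; the standard linear interpolation of the document model M_d with the collection model M_c (λ is the mixing weight), as in IIR 12.2.2, is:

```latex
P(q \mid d) = \prod_{t \in q} \bigl(\lambda\,P(t \mid M_d) + (1-\lambda)\,P(t \mid M_c)\bigr)
```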

95

Sec 1222

CS3245 – Information Retrieval

Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 – Information Retrieval

Crawling
Politeness: need to crawl only the allowed pages, and space out the requests to a site over a longer period (hours, days).
Duplicates: need to integrate duplicate detection.
Scale: need to use a distributed approach.
Spam and spider traps: need to integrate spam detection.

Information Retrieval 97

Sec 20.1

CS3245 – Information Retrieval

Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994.
Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 20.2

CS3245 – Information Retrieval

99

Pagerank algorithm

Sec 21.2.2

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0)).
2. After one step, we're at xA.
3. After two steps, we're at xA^2, then xA^3, and so on.
4. "Eventually" means: for large k, xA^k = a.
Algorithm: multiply x by increasing powers of A until the product looks stable.
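A small power-iteration sketch in Python (using numpy; the 3-page transition matrix is hypothetical, with teleporting already folded in so that every row sums to 1):

```python
import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    """Power iteration: repeatedly multiply a distribution x by the transition matrix A."""
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                       # start with any distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A               # one step of the random walk
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

A = np.array([[0.10, 0.60, 0.30],
              [0.40, 0.10, 0.50],
              [0.45, 0.45, 0.10]])
print(pagerank(A))   # the steady-state vector a; its entries sum to 1
```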

CS3245 – Information Retrieval

100

Pagerank summary
Preprocessing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query; rank them by their pagerank. Order is query-independent.

Sec 21.2.2

CS3245 – Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 – Information Retrieval

Learning Objectives
You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use.
NLTK – a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 – Information Retrieval

103

Photo credits: http://www.flickr.com/photos/aperezdc

[Word cloud of related areas:] Adversarial IR, Boolean IR / VSM, Prob IR, Computational Advertising, Recommendation Systems, Digital Libraries, Natural Language Processing, Question Answering, IR Theory, Distributed IR, Geographic IR, Social Network IR, Query Analysis

CS3245 – Information Retrieval

104

Opportunities in IR

CS3245 – Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCH ADVERTISING
  • 1st Generation of Search Ads: Goto (1996)
  • 1st Generation of search ads: Goto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATE DETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index Compression: Dictionary-as-a-String and Blocking
  • Front coding
  • Postings Compression: Postings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • A/B testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robots.txt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 2: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Last TimeChapter 20 Crawling

Chapter 21 Link Analysis

Ch 11-12

2

CS3245 ndash Information Retrieval

Today Exam Matters

Revision

Where to go from here

3

Chapter 19 Search Advertising Duplicate Detection

CS3245 ndash Information Retrieval

Web search overview

4

Ch 19

CS3245 ndash Information Retrieval

IR on the web vs IR in general On the web search is not just a nice feature Search is a key enabler of the web content creation

financing interest aggregation etc rarr look at search ads (Search Advertising)

The web is a chaotic and uncoordinated document collection rarr lots of duplicates ndash need to detect duplicates (Duplicate Detection)

5

Ch 19

CS3245 ndash Information Retrieval

Without search engines the web wouldnrsquot work Without search content is hard to find rarr Without search there is no incentive to create

content Why publish something if nobody will read it Why publish something if I donrsquot get ad revenue from

it Somebody needs to pay for the web Servers web infrastructure content creation A large part today is paid by search ads Search pays

for the web6

Ch 19

CS3245 ndash Information Retrieval

Interest aggregation Unique feature of the web A small number of

geographically dispersed people with similar interests can find each other Elementary school kids with hemophilia People interested in translating R5R5 Scheme into

relatively portable C (open source project) Search engines are a key enabler for interest

aggregation

7

Ch 19

CS3245 ndash Information Retrieval

SEARCHADVERTISING

8

Sec 193

CS3245 ndash Information Retrieval

1st Generation of Search AdsGoto (1996)

9

Sec 193

CS3245 ndash Information Retrieval

1st Generation of search adsGoto (1996)

Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the

link Pages were simply ranked according to bid ndash revenue

maximization for Goto No separation of adsdocs Only one result list Upfront and honest No relevance ranking

but Goto did not pretend there was any10

Sec 193

CS3245 ndash Information Retrieval

2nd generation of search ads Google (2000)

11

SogoTrade appearsin searchresults

SogoTrade appearsin ads

Do search enginesrank advertisershigher thannon-advertisers

All major searchengines claim ldquonordquo

Sec 193

CS3245 ndash Information Retrieval

Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major

advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet

Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)

12

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your

ad (ie CPC)

How does the auction determine an adrsquos rank and the price paidfor the ad

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term? co*tion

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive

The solution: transform wild-card queries so that the * always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32
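A tiny sketch of how permuterm rotations could be generated (illustrative; the function name is mine):

def permuterm_rotations(term):
    """All rotations of term + '$'; each rotation is indexed and points back to term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']

A query like co*tion is rotated so the * ends up at the end (tion$co*) and is then answered by a prefix search over the indexed rotations.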

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:

1. Edit distance (Levenshtein distance)  2. Weighted edit distance  3. ngram overlap

Context-sensitive spelling correction: Try all possible resulting phrases with one word

"corrected" at a time and suggest the one with the most hits
54

Sec 332
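A standard dynamic-programming sketch of Levenshtein edit distance (unit costs; a weighted version would replace the 1s with per-operation costs):

def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

print(edit_distance("judgment", "judgement"))   # 1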

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing: No global dictionary - Generate separate dictionary for each block; Don't sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2: Don't sort. Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

[Figure: MapReduce data flow. A master assigns input splits to parsers (map phase); each parser writes segment files partitioned by term range (a-b, c-d, …, y-z); each inverter (reduce phase) collects one term-range partition and writes its postings.]

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain "big" main index; New docs go into "small" (in-memory) auxiliary index; Search across both, merge results. Deletions: Invalidation bit-vector for deleted docs; Filter docs output on a search result by this invalidation

bit-vector. Periodically re-index into one main index. Assuming T = total # of postings and n = size of the auxiliary

index, we touch each posting up to floor(T/n) times. Alternatively, we can apply Logarithmic merge, in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6 Index Compression: Collection and vocabulary statistics: Heaps'

and Zipf's laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking. Store pointers to every kth term string. Example below: k=4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers

Lose 4 bytes on term lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix;

store differences only. Used for the last k-1 terms in a block of k:
8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a 1◊e 2◊ic 3◊ion

Encodes "automat"; extra length beyond "automat"

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry. We store the list of docs containing a term in

increasing order of docID: computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:  824                 829        215406
gaps:                        5          214577
VB code: 00000110 10111000   10000101   00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
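A small Python sketch of VB encoding over gaps; it reproduces the byte string above (function names are mine):

def vb_encode_number(n):
    """Variable byte code for one gap: 7 payload bits per byte, high bit marks the last byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128
    return out

def vb_encode(docids):
    """Encode a sorted docID list as VB-coded gaps."""
    stream, last = [], 0
    for d in docids:
        stream.extend(vb_encode_number(d - last))
        last = d
    return stream

print([format(b, "08b") for b in vb_encode([824, 829, 215406])])
# ['00000110', '10111000', '10000101', '00001101', '00001100', '10110001']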

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g., K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection
67

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space; they are "mini-documents". Key idea 2: Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

cos(q, d) = q · d for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search Systems: Making the Vector Space Model more efficient to

compute: approximating the actual correct results; skipping unnecessary documents. In actual data: dealing with zones and fields, query

term proximity

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
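The algorithm on the original slide is an image; below is a hedged term-at-a-time sketch of cosine scoring with accumulators (the data layout, e.g. postings as (docID, weight) pairs, is my assumption):

import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_lengths, K=10):
    """query_weights: term -> w(t,q); postings: term -> [(docID, w(t,d)), ...]."""
    scores = defaultdict(float)
    for term, wq in query_weights.items():
        for doc, wd in postings.get(term, []):
            scores[doc] += wq * wd              # accumulate partial dot products
    for doc in scores:
        scores[doc] /= doc_lengths[doc]         # length-normalize by |d|
    return heapq.nlargest(K, scores.items(), key=lambda item: item[1])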

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K but has

many docs from among the top K. Return the top K docs in A

Think of A as pruning non-contenders

The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q,d) = g(d) + cosine(q,d)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text, e.g., Title, Abstract, References, …

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: precision-recall curve, with precision on the y-axis (0.0 to 1.0) and recall on the x-axis (0.0 to 1.0)]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A) - P(E)] / [1 - P(E)]; P(A): proportion of the time the judges agree; P(E): what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement
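A quick worked example with hypothetical numbers (not from the slides): if the judges agree on 90% of the judgments, P(A) = 0.9, and chance agreement works out to P(E) = 0.6, then Kappa = (0.9 - 0.6) / (1 - 0.6) = 0.75.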

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new

system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3 XML IR: Basic XML concepts, Challenges in XML IR, Vector space model

for XML IR, Evaluation of XML IR

Chapter 9 Query Refinement: 1. Relevance Feedback

Document Level

Explicit RF: Rocchio (1971). When does it work?

Variants: Implicit and Blind

2. Query Expansion (Term Level)

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments

from a few documents. α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
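The update formula itself is an image on the original slide; the standard Rocchio form it refers to is:

q_m = α·q_0 + β·(1/|D_r|)·Σ_{d_j ∈ D_r} d_j - γ·(1/|D_nr|)·Σ_{d_j ∈ D_nr} d_j

i.e., move the query vector toward the centroid of the known relevant documents and away from the centroid of the known irrelevant ones.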

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: Simplest way to compute one is based on term-term similarities in C = A·A^T, where A is the term-document matrix; w_ij = (normalized) weight for (t_i, d_j)

For each t_i, pick terms with high values in C

[Figure: the M × N term-document matrix A, with rows t_i and columns d_j]

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How? Map XML documents to lexicalized subtrees

85

[Figure: an example XML document, a Book with Title "Microsoft" and Author "Bill Gates", is decomposed into its lexicalized subtrees ("with words"), e.g. Author/Bill, Author/Gates, Title/Microsoft, Book/Title/Microsoft; each such subtree becomes one dimension of the vector space]

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cd in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86
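The CR definition itself is an image on the slide; the function from IIR Chapter 10, which also matches the CR(c1, c3) = 0.60 value in the next example, is:

CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|)   if c_q matches c_d,   and 0 otherwise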

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>   (e.g., author/"Bill")

Inverted index (dictionary → postings):
<c1, t>  (e.g., author/"Bill"):                  <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t>  (e.g., title/"Bill"):                   <d2, 0.25> <d3, 0.1> <d12, 0.09>
<c3, t>  (e.g., book/author/firstname/"Bill"):   <d3, 0.7> <d6, 0.8> <d9, 0.5>

This example is slightly different from the book. All weights have been normalized.

CR(c1, c1) = 1.0

CR(c1, c2) = 0.0

CR(c1, c3) = 0.60

OK to ignore Query vs <c2, t> since CR = 0.0; compare Query vs <c1, t> and Query vs <c3, t>

Each contribution is Context Resemblance × Query Term Weight × Document Term Weight:
if wq = 1.0 then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IR
Chapter 11: 1. Probabilistic Approach to Retrieval (Basic Probability Theory) 2. Probability Ranking Principle 3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation

Independence: no association between terms (not true, but works in practice: a naïve assumption)

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios for the terms in the query, c_t

The odds ratio is the ratio of two odds:

(i) the odds of the term appearing if relevant ( p_t / (1 - p_t) )

and

(ii) the odds of the term appearing if nonrelevant ( u_t / (1 - u_t) )

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents

c_t functions as a term weight, so that

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
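The two formulas on the original slide are images; in the book's notation (a reconstruction from IIR Chapter 11, not copied from the slide) they are:

c_t = log [ p_t (1 - u_t) ] / [ u_t (1 - p_t) ]

RSV_d = Σ_{t ∈ q ∩ d} c_t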

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the

query. No length normalization of queries

(because retrieval is being done with respect to a single fixed query). The above tuning parameters should be set by optimization on a

development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
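The BM25 scoring formula is shown only as an image on the slide; the usual long-query form (a reconstruction from IIR Chapter 11, with k1, k3, b and lengths L_d, L_ave as defined there) is:

RSV_d = Σ_{t ∈ q} log(N / df_t) · [ (k1 + 1)·tf_td / (k1·((1 - b) + b·(L_d / L_ave)) + tf_td) ] · [ (k3 + 1)·tf_tq / (k3 + tf_tq) ]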

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and

'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in

the exact same way. Difference for probabilistic IR: at the end, you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents, so ignore. P(d) is the prior: often treated as the same for all d

But we can give a prior to "high-quality" documents, e.g., those with high static quality score g(d) (cf. Section 7.1.4)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
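The equation referred to is an image on the slide; the usual query-likelihood mixture (a reconstruction from IIR Chapter 12, with λ the mixture weight, M_d the document LM and M_c the collection LM) is:

P(d | q) ∝ P(d) · Π_{t ∈ q} ( λ·P(t | M_d) + (1 - λ)·P(t | M_c) )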

CS3245 ndash Information Retrieval

Week 12 Web Search: Chapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access

to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a. 1. Start with any distribution (say x = (1, 0, …, 0)) 2. After one step, we're at xA 3. After two steps at xA^2, then xA^3, and so on 4. "Eventually" means: for large k, xA^k = a

Algorithm multiply x by increasing powers of A until the product looks stable
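A small power-iteration sketch in Python (the transition matrix is a made-up 3-page example, assumed to already include teleporting so that rows sum to 1):

import numpy as np

def power_iteration(A, tol=1e-10):
    """Multiply x by increasing powers of A until the product looks stable."""
    x = np.zeros(A.shape[0])
    x[0] = 1.0                        # any starting distribution, e.g. (1, 0, ..., 0)
    while True:
        x_next = x @ A
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next

A = np.array([[0.05, 0.90, 0.05],
              [0.45, 0.10, 0.45],
              [0.05, 0.90, 0.05]])    # illustrative transition matrix, not from the slides
print(power_iteration(A))             # the steady-state distribution a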

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future: Python, one of the easiest and more straightforward

programming languages to use; NLTK, a good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldn't work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Google's second price auction
  • Google's second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary & Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105



those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: Given the graph of links, build matrix A. From it, compute a. The pagerank a_i is a scaled number between 0 and 1.

Query processing: Retrieve pages meeting the query. Rank them by their pagerank. Order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives
You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 4: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Web search overview

4

Ch 19

CS3245 ndash Information Retrieval

IR on the web vs IR in general On the web search is not just a nice feature Search is a key enabler of the web content creation

financing interest aggregation etc rarr look at search ads (Search Advertising)

The web is a chaotic and uncoordinated document collection rarr lots of duplicates ndash need to detect duplicates (Duplicate Detection)

5

Ch 19

CS3245 ndash Information Retrieval

Without search engines the web wouldnrsquot work Without search content is hard to find rarr Without search there is no incentive to create

content Why publish something if nobody will read it Why publish something if I donrsquot get ad revenue from

it Somebody needs to pay for the web Servers web infrastructure content creation A large part today is paid by search ads Search pays

for the web6

Ch 19

CS3245 ndash Information Retrieval

Interest aggregation Unique feature of the web A small number of

geographically dispersed people with similar interests can find each other Elementary school kids with hemophilia People interested in translating R5R5 Scheme into

relatively portable C (open source project) Search engines are a key enabler for interest

aggregation

7

Ch 19

CS3245 ndash Information Retrieval

SEARCHADVERTISING

8

Sec 193

CS3245 ndash Information Retrieval

1st Generation of Search AdsGoto (1996)

9

Sec 193

CS3245 ndash Information Retrieval

1st Generation of search adsGoto (1996)

Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the

link Pages were simply ranked according to bid ndash revenue

maximization for Goto No separation of adsdocs Only one result list Upfront and honest No relevance ranking

but Goto did not pretend there was any10

Sec 193

CS3245 ndash Information Retrieval

2nd generation of search ads Google (2000)

11

SogoTrade appearsin searchresults

SogoTrade appearsin ads

Do search enginesrank advertisershigher thannon-advertisers

All major searchengines claim ldquonordquo

Sec 193

CS3245 ndash Information Retrieval

Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major

advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet

Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)

12

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your

ad (ie CPC)

How does the auction determine an adrsquos rank and the price paidfor the ad

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 5: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

IR on the web vs IR in general On the web search is not just a nice feature Search is a key enabler of the web content creation

financing interest aggregation etc rarr look at search ads (Search Advertising)

The web is a chaotic and uncoordinated document collection rarr lots of duplicates ndash need to detect duplicates (Duplicate Detection)

5

Ch 19

CS3245 ndash Information Retrieval

Without search engines the web wouldnrsquot work Without search content is hard to find rarr Without search there is no incentive to create

content Why publish something if nobody will read it Why publish something if I donrsquot get ad revenue from

it Somebody needs to pay for the web Servers web infrastructure content creation A large part today is paid by search ads Search pays

for the web6

Ch 19

CS3245 ndash Information Retrieval

Interest aggregation Unique feature of the web A small number of

geographically dispersed people with similar interests can find each other Elementary school kids with hemophilia People interested in translating R5R5 Scheme into

relatively portable C (open source project) Search engines are a key enabler for interest

aggregation

7

Ch 19

CS3245 ndash Information Retrieval

SEARCHADVERTISING

8

Sec 193

CS3245 ndash Information Retrieval

1st Generation of Search AdsGoto (1996)

9

Sec 193

CS3245 ndash Information Retrieval

1st Generation of search adsGoto (1996)

Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the

link Pages were simply ranked according to bid ndash revenue

maximization for Goto No separation of adsdocs Only one result list Upfront and honest No relevance ranking

but Goto did not pretend there was any10

Sec 193

CS3245 ndash Information Retrieval

2nd generation of search ads Google (2000)

11

SogoTrade appearsin searchresults

SogoTrade appearsin ads

Do search enginesrank advertisershigher thannon-advertisers

All major searchengines claim ldquonordquo

Sec 193

CS3245 ndash Information Retrieval

Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major

advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet

Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)

12

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your

ad (ie CPC)

How does the auction determine an adrsquos rank and the price paidfor the ad

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability): "I don't want"
P(Aerosmith) = 0.11 × 0.11 × 0.11 ≈ 1.3E-3
P(LadyGaga) = 0.15 × 0.05 × 0.15 ≈ 1.1E-3
Winner: Aerosmith

42

Add-1 counts, Aerosmith line: I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1 (Total Count: 18)
Add-1 counts, LadyGaga line: I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2 (Total Count: 20)
|V| = 11 = total # of word types in both lines
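A small Python check of this example. The two training lines are my reconstruction from the count tables above, so treat the exact strings as assumptions; the counts and totals match the slide.

```python
from collections import Counter

def add_one_unigram_lm(text, vocab):
    """P(w) = (count(w) + 1) / (N + |V|), i.e. add-1 smoothing."""
    counts = Counter(text.split())
    denom = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / denom for w in vocab}

aerosmith = "i don't want to close my eyes"                # assumed training line
ladygaga = "i want your love and i want your revenge"      # assumed training line
vocab = set(aerosmith.split()) | set(ladygaga.split())     # |V| = 11

lm_a = add_one_unigram_lm(aerosmith, vocab)
lm_g = add_one_unigram_lm(ladygaga, vocab)

p_a = p_g = 1.0
for w in "i don't want".split():
    p_a *= lm_a[w]
    p_g *= lm_g[w]
print(p_a, p_g)  # ~1.4e-3 vs ~1.1e-3 -> Aerosmith wins
                 # (the slide rounds 2/18 to 0.11 per term, giving 1.3e-3)
```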

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Doc frequency information is also stored.
Why frequency?

Sec 12

44

CS3245 ndash Information Retrieval

The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries.

[Figure: merging two postings lists
Brutus: 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
Caesar: 1 -> 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34
Intersection: 2 -> 8]

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.
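A minimal Python version of the merge (a sketch, not the exact pseudocode shown in lecture):

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1          # advance the list with the smaller docID
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]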

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

47

[Figure: postings lists augmented with skip pointers, e.g.
2 -> 4 -> 8 -> 41 -> 48 -> 64 -> 128, with skip pointers to 41 and 128
1 -> 2 -> 3 -> 8 -> 11 -> 17 -> 21 -> 31, with skip pointers to 11 and 31]
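One way to code the skip-aware merge in Python, assuming evenly spaced skips of length sqrt(L) (a common heuristic; the exact placement policy is a design choice, not fixed by the slide):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect sorted postings lists, taking a skip when it does not overshoot."""
    s1 = max(1, int(math.sqrt(len(p1))))   # assumed skip spacing for p1
    s2 = max(1, int(math.sqrt(len(p2))))   # assumed skip spacing for p2
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1                    # take the skip pointer
            else:
                i += 1
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]
```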

Sec 23

CS3245 ndash Information Retrieval

Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words.
For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term. Two-word phrase query-processing is now immediate.

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example
For phrase queries, we use a merge algorithm recursively at the document level.
Now we need to deal with more than just equality.

49

<be: 993427;
  1: 7, 18, 33, 72, 86, 231;
  2: 3, 149;
  4: 17, 191, 291, 430, 434;
  5: 363, 367; …>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary: Hash, Trees

Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic: Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table
Each vocabulary term is hashed to an integer.
Pros: Lookup is faster than for a tree: O(1).
Cons: No easy way to find minor variants (judgment / judgement). No prefix search. If the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries
How can we handle *'s in the middle of a query term? E.g. co*tion.
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wild-card queries so that the * always occurs at the end.
This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering.

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction
Isolated word correction: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. ngram overlap
Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits.
54
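For reference, a standard dynamic-programming Levenshtein distance in Python (unit costs; a weighted edit distance would only change the cost terms):

```python
def edit_distance(s1, s2):
    """Levenshtein distance with unit insert/delete/substitute costs."""
    m, n = len(s1), len(s2)
    # dp[i][j] = distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[m][n]

print(edit_distance("cat", "dog"))             # 3
print(edit_distance("judgment", "judgement"))  # 1
```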

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing: No global dictionary - generate a separate dictionary for each block. Don't sort postings - accumulate postings as they occur.

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1: Generate separate dictionaries for each block; no need to maintain a term-termID mapping across blocks.

Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index
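A toy Python sketch of the per-block step (illustrative only; real SPIMI streams term-docID pairs from the parser and writes each block's index to disk before the final merge):

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """Build one block's inverted index: term -> postings list.
    Postings are accumulated as they occur; terms are sorted once, at block end."""
    index = defaultdict(list)
    for term, doc_id in token_stream:
        postings = index[term]              # dictionary entry created on first sight
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)         # append in order of occurrence (no sort)
    return dict(sorted(index.items()))      # sort terms once before writing the block

block = [("brutus", 1), ("caesar", 1), ("caesar", 2), ("brutus", 4), ("calpurnia", 2)]
print(spimi_invert(block))
# {'brutus': [1, 4], 'caesar': [1, 2], 'calpurnia': [2]}
```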

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing: MapReduce Data flow

58

[Figure: MapReduce data flow. The master assigns splits of the collection to parsers in the map phase; each parser writes its output into term-partitioned segment files (a-b, c-d, …, y-z); in the reduce phase, one inverter per partition reads the corresponding segment files and writes the final postings.]

CS3245 ndash Information Retrieval

Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both, merge results.
Deletions: invalidation bit-vector for deleted docs. Filter docs output on a search result by this invalidation bit-vector.
Periodically re-index into one main index. Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times.
Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
Either write the merged Z1 to disk as I1 (if no I1), or merge it with I1 to form Z2.
… etc. (Loop for log levels.)

Sec 45

60 Information Retrieval

CS3245 ndash Information Retrieval

Week 6 Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k = 4.
Need to store term lengths (1 extra byte):

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers.
Lose 4 bytes on term lengths.

Sec 52

CS3245 ndash Information Retrieval

Front coding
Sorted words commonly have a long common prefix; store differences only. Used for the last k-1 terms in a block of k:
8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a1◊e2◊ic3◊ion

(Encodes "automat" once; each entry then stores only the extra length beyond "automat".)
Begins to resemble general string compression.

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID:
  computer: 33, 47, 154, 159, 202, …
Consequence: it suffices to store gaps:
  33, 14, 107, 5, 43, …
Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:  824                 829        215406
gaps:                        5          214577
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
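A compact Python version of gap encoding plus variable byte encoding, reproducing the bytes above (a sketch; as in the lecture, the continuation bit is set on the last byte of each gap):

```python
def vb_encode_number(n):
    """Variable byte encode one gap; high bit marks the final byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                          # set continuation bit on the last byte
    return out

def vb_encode(docids):
    """Encode a docID list as gaps, then VB encode each gap."""
    encoded, prev = [], 0
    for d in docids:
        encoded.extend(vb_encode_number(d - prev))
        prev = d
    return encoded

print(" ".join(format(b, "08b") for b in vb_encode([824, 829, 215406])))
# 00000110 10111000 10000101 00001101 00001100 10110001
```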

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector.
2. Represent each document as a weighted tf-idf vector.
3. Compute the cosine similarity score for the query vector and each document vector.
4. Rank documents with respect to the query by score.
5. Return the top K (e.g. K = 10) to the user.

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w(t,d) = (1 + log10 tf(t,d)) × log10(N / df(t))

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.
67
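In code, a minimal sketch of that weight (base-10 logs, zero weight for absent terms):

```python
import math

def tf_idf_weight(tf, df, N):
    """w(t, d) = (1 + log10 tf(t, d)) * log10(N / df(t)); 0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

print(tf_idf_weight(tf=3, df=100, N=1_000_000))  # (1 + log10 3) * 4 ≈ 5.91
```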

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents".
Key idea 2: Rank documents according to their proximity to the query in this space.
proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity
Motivation: Want to rank more relevant documents higher than less relevant documents.

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σi qi di,  for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results; skipping unnecessary documents.
In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72
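The algorithm figure for this slide did not survive extraction; below is a Python sketch of the term-at-a-time accumulator scheme it depicts (my reconstruction; the postings, weights and lengths here are hypothetical illustration data):

```python
import heapq

def cosine_scores(query_weights, postings, doc_length, K=10):
    """Term-at-a-time scoring with accumulators, then top-K selection.
    query_weights: term -> w(t, q); postings: term -> [(docID, w(t, d)), ...];
    doc_length: docID -> Euclidean length of the document vector."""
    scores = {}                                        # one accumulator per doc
    for term, wq in query_weights.items():
        for doc, wd in postings.get(term, []):
            scores[doc] = scores.get(doc, 0.0) + wq * wd
    for doc in scores:
        scores[doc] /= doc_length[doc]                 # length-normalize
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])

postings = {"brutus": [(1, 1.2), (4, 0.5)], "caesar": [(1, 0.8), (2, 1.0)]}
doc_length = {1: 2.0, 2: 1.5, 4: 1.0}
print(cosine_scores({"brutus": 1.0, "caesar": 1.0}, postings, doc_length, K=2))
# [(1, 1.0), (2, 0.666...)]
```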

Sec 633

CS3245 ndash Information Retrieval

Generic approach
Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score
Consider a simple total score combining cosine relevance and authority:

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting; indeed, any function of the two "signals" of user happiness.
Now we seek the top K docs by net score.

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve, with precision on the y-axis and recall on the x-axis, both ranging from 0.0 to 1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure
Agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement.
Kappa (K) = [P(A) - P(E)] / [1 - P(E)]
P(A): proportion of the time the judges agree; P(E): what the agreement would be by chance.
Gives 0 for chance agreement, 1 for total agreement.
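A quick Python check of the formula on some hypothetical binary judgments (not from the lecture). Note this variant computes P(E) from per-judge marginals, i.e. Cohen's kappa; the textbook's worked example pools the marginals across both judges, which gives a slightly different P(E).

```python
def kappa(judge1, judge2):
    """Kappa for two judges' binary relevance judgments (1 = relevant)."""
    n = len(judge1)
    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n       # P(A)
    p1_yes, p2_yes = sum(judge1) / n, sum(judge2) / n
    p_chance = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)        # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

j1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical judgments
j2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
print(round(kappa(j1, j2), 3))        # 0.583: fair-to-good agreement
```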

CS3245 ndash Information Retrieval

AB testing
Purpose: Test a single innovation.
Prerequisite: You have a large system with many users.
Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation.
Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback, Query Expansion and XML IR

Chapter 10.3 XML IR: basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR.

Chapter 9 Query Refinement:
1. Relevance Feedback (document level): Explicit RF - Rocchio (1971); when does it work? Variants: Implicit and Blind.
2. Query Expansion (term level): manual thesaurus; Automatic Thesaurus Generation.

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:
Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.
α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton)
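The formula itself was an image on the slide; for reference, the standard Rocchio update (IIR 9.1.1) is:

```latex
\vec{q}_m \;=\; \alpha\,\vec{q}_0
\;+\; \beta\,\frac{1}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j
\;-\; \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j
```

Negative term weights in the modified query are usually clipped to 0.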

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix and wij = (normalized) weight for (ti, dj).
For each ti, pick the terms with high values in C.

[Figure: the M × N term-document matrix A, with row ti and column dj highlighted]

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
Aim: to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.

85

[Figure: an XML document with a Book root, a Title ("Microsoft") and an Author ("Bill Gates"), decomposed into its lexicalized subtrees "with words", e.g. Microsoft, Bill, Gates, Title-Microsoft, Author-Bill, Author-Gates, Book-Title-Microsoft, …]

CS3245 ndash Information Retrieval

Context resemblance
A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:
|cq| and |cd| are the number of nodes in the query path and document path, respectively.
cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
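The definition shown on the slide was lost in extraction; it is the context resemblance function from IIR 10.3:

```latex
\mathrm{CR}(c_q, c_d) =
\begin{cases}
\dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\[4pt]
0 & \text{otherwise}
\end{cases}
```

For example, with cq = author/"Bill" (2 nodes) and cd = book/author/firstname/"Bill" (4 nodes), CR = 3/5 = 0.6, which matches the value used on the next slide.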

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>   (e.g. c1 = author/"Bill")

Inverted index (dictionary → postings; all weights have been normalized):
<c1, t> → <d1, 0.5>, <d4, 0.1>, <d9, 0.2>    (e.g. c1 = author/"Bill")
<c2, t> → <d2, 0.25>, <d3, 0.1>, <d12, 0.9>  (e.g. c2 = title/"Bill")
<c3, t> → <d3, 0.7>, <d6, 0.8>, <d9, 0.5>    (e.g. c3 = book/author/firstname/"Bill")

This example is slightly different from the book.

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.60

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each term is Context Resemblance × Query Term Weight × Document Term Weight)

OK to ignore Query vs <c2, t> since CR = 0.0.
Remaining comparisons: Query vs <c1, t>; Query vs <c3, t>.

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval; Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model; BestMatch25 (Okapi)
Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM)
Traditionally used with the PRP.
Assumptions:
"Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d is represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice; a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV
We can rank documents using the log odds ratios for the terms in the query, ct:

ct = log [ pt (1 - ut) ] / [ ut (1 - pt) ]

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, pt / (1 - pt), and (ii) the odds of the term appearing if non-relevant, ut / (1 - ut).
ct = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and ct is positive if it is more likely to appear in relevant documents.
ct functions as a term weight, so that RSVd = Σ ct, summed over the query terms present in d.
Operationally, we sum ct quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:
tf(t,q): term frequency in the query q. k3: tuning parameter controlling term frequency scaling of the query.
No length normalization of queries (because retrieval is being done with respect to a single, fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.
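For completeness, the full Okapi BM25 scoring function these parameters belong to (shown as an image on the slide; this is the form in IIR 11.4.3, where Ld is the document length and Lave the average document length):

```latex
RSV_d = \sum_{t \in q} \log\!\frac{N}{\mathrm{df}_t}
\cdot \frac{(k_1 + 1)\,\mathrm{tf}_{td}}
       {k_1\big((1-b) + b\,(L_d / L_{ave})\big) + \mathrm{tf}_{td}}
\cdot \frac{(k_3 + 1)\,\mathrm{tf}_{tq}}{k_3 + \mathrm{tf}_{tq}}
```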

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case, you build an information retrieval scheme in the exact same way.
Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR
Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q):
P(q) is the same for all documents, so ignore it.
P(d) is the prior; often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one
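The equation referred to here is the linearly interpolated (mixture) query-likelihood model of IIR 12.2.2, where Md is the document language model, Mc the collection language model, and 0 < λ < 1:

```latex
P(q \mid d) \;\propto\; P(d)\,\prod_{t \in q}
\big(\lambda\, P(t \mid M_d) + (1-\lambda)\, P(t \mid M_c)\big)
```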

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994.
Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0)).
2. After one step, we're at xA.
3. After two steps, at xA², then xA³, and so on.
4. "Eventually" means: for large k, xAᵏ = a.
Algorithm: multiply x by increasing powers of A until the product looks stable.
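A small Python sketch of that power iteration. A is assumed to be the teleport-adjusted transition matrix (building it from the link graph is taken as already done); the 3-page matrix below is a made-up example.

```python
def pagerank_power_iteration(A, tol=1e-10, max_iter=1000):
    """Multiply x by A until the distribution stops changing; A[i][j] = P(i -> j)."""
    n = len(A)
    x = [1.0] + [0.0] * (n - 1)            # any starting distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        nxt = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, x)) < tol:
            break
        x = nxt
    return x

# Toy 3-page web, already smoothed with teleporting (each row sums to 1).
A = [[0.02, 0.49, 0.49],
     [0.49, 0.02, 0.49],
     [0.49, 0.49, 0.02]]
print(pagerank_power_iteration(A))   # ~[1/3, 1/3, 1/3] by symmetry
```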

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 6: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Without search engines the web wouldnrsquot work Without search content is hard to find rarr Without search there is no incentive to create

content Why publish something if nobody will read it Why publish something if I donrsquot get ad revenue from

it Somebody needs to pay for the web Servers web infrastructure content creation A large part today is paid by search ads Search pays

for the web6

Ch 19

CS3245 ndash Information Retrieval

Interest aggregation Unique feature of the web A small number of

geographically dispersed people with similar interests can find each other Elementary school kids with hemophilia People interested in translating R5R5 Scheme into

relatively portable C (open source project) Search engines are a key enabler for interest

aggregation

7

Ch 19

CS3245 ndash Information Retrieval

SEARCHADVERTISING

8

Sec 193

CS3245 ndash Information Retrieval

1st Generation of Search AdsGoto (1996)

9

Sec 193

CS3245 ndash Information Retrieval

1st Generation of search adsGoto (1996)

Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the

link Pages were simply ranked according to bid ndash revenue

maximization for Goto No separation of adsdocs Only one result list Upfront and honest No relevance ranking

but Goto did not pretend there was any10

Sec 193

CS3245 ndash Information Retrieval

2nd generation of search ads Google (2000)

11

SogoTrade appearsin searchresults

SogoTrade appearsin ads

Do search enginesrank advertisershigher thannon-advertisers

All major searchengines claim ldquonordquo

Sec 193

CS3245 ndash Information Retrieval

Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major

advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet

Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)

12

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your

ad (ie CPC)

How does the auction determine an adrsquos rank and the price paidfor the ad

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25, A Nonbinary Model: If the query is long, we might also use similar weighting for query terms.

tftq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.
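For reference, the full BM25 scoring function with the query-term factor discussed above, as given in IIR Sec 11.4.3 (L_d is the document length, L_ave the average document length):

RSV_d = \sum_{t \in q} \log\frac{N}{df_t} \cdot
        \frac{(k_1+1)\,tf_{t,d}}{k_1\bigl((1-b) + b \cdot L_d / L_{ave}\bigr) + tf_{t,d}} \cdot
        \frac{(k_3+1)\,tf_{t,q}}{k_3 + tf_{t,q}}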

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way.

Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q) = P(q|d) P(d) / P(q).

P(q) is the same for all documents, so ignore it.
P(d) is the prior – often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.

So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.
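For reference, the basic query likelihood model from IIR Sec 12.2, which the mixture model on the next slide smooths (L_d is the document length):

P(q|M_d) = \prod_{t \in q} \hat{P}(t|M_d), \qquad
\hat{P}_{mle}(t|M_d) = \frac{tf_{t,d}}{L_d}

A query term missing from d makes the whole product zero, which is what motivates mixing the document model with a collection model.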

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model: Summary

What we model: The user has a document in mind and generates the query from this document.

The equation represents the probability that the document that the user had in mind was in fact this one.
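A minimal Python sketch of query-likelihood ranking with the usual linear mixture of a document model and the collection model; the λ = 0.5 setting, the function name, and the toy documents are illustrative assumptions:

import math
from collections import Counter

def lm_score(query, doc_tokens, collection_tokens, lam=0.5):
    """log P(q|d) under a mixture of document and collection unigram models:
    P(t|d) = lam * tf_td / L_d + (1 - lam) * cf_t / T"""
    doc_tf, col_tf = Counter(doc_tokens), Counter(collection_tokens)
    L_d, T = len(doc_tokens), len(collection_tokens)
    score = 0.0
    for t in query:
        p = lam * doc_tf[t] / L_d + (1 - lam) * col_tf[t] / T
        if p == 0.0:              # term absent from the entire collection
            return float("-inf")
        score += math.log(p)
    return score

docs = {"d1": "click go the shears boys click click click".split(),
        "d2": "metal here metal shears click here".split()}
collection = [t for toks in docs.values() for t in toks]
for d, toks in docs.items():
    print(d, lm_score("shears click".split(), toks, collection))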

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12: Web Search

Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)

Duplicates: need to integrate duplicate detection

Scale: need to use a distributed approach

Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.
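A crawler can check such rules with Python's standard library robot parser; a brief sketch (the URLs are placeholders):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # placeholder site
rp.read()   # fetch and parse once, then keep the parser cached per host

# Ask before fetching each page, and honor the answer.
print(rp.can_fetch("MyCrawler", "http://www.example.com/yoursite/temp/page.html"))
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))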

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0))
2. After one step, we're at xA
3. After two steps at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable.
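A minimal Python sketch of this power iteration; the 3-page transition matrix and the tolerance are illustrative assumptions:

# Power iteration: compute xA, xA^2, ... until the distribution stops changing.
def pagerank(A, tol=1e-10, max_iter=1000):
    n = len(A)
    x = [1.0] + [0.0] * (n - 1)          # start with any distribution
    for _ in range(max_iter):
        x_next = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(x, x_next)) < tol:
            return x_next                # steady state a reached
        x = x_next
    return x

# Toy transition matrix A (rows sum to 1; teleporting already folded in).
A = [[0.10, 0.80, 0.10],
     [0.45, 0.10, 0.45],
     [0.10, 0.80, 0.10]]
print(pagerank(A))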

CS3245 ndash Information Retrieval

100

Pagerank summary:
Pre-processing: Given the graph of links, build matrix A. From it, compute a. The pagerank ai is a scaled number between 0 and 1.

Query processing: Retrieve pages meeting the query. Rank them by their pagerank. Order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103 Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldn't work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search Ads: Goto (1996)
  • 1st Generation of search ads: Goto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Google's second price auction
  • Google's second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

[Table fragment from the dictionary compression slides – columns: Freq., Postings ptr., Term ptr.; sample Freq. values 33, 29, 44, 126, 7]

Page 8: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

SEARCHADVERTISING

8

Sec 193

CS3245 ndash Information Retrieval

1st Generation of Search AdsGoto (1996)

9

Sec 193

CS3245 ndash Information Retrieval

1st Generation of search adsGoto (1996)

Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the

link Pages were simply ranked according to bid ndash revenue

maximization for Goto No separation of adsdocs Only one result list Upfront and honest No relevance ranking

but Goto did not pretend there was any10

Sec 193

CS3245 ndash Information Retrieval

2nd generation of search ads Google (2000)

11

SogoTrade appearsin searchresults

SogoTrade appearsin ads

Do search enginesrank advertisershigher thannon-advertisers

All major searchengines claim ldquonordquo

Sec 193

CS3245 ndash Information Retrieval

Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major

advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet

Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)

12

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your

ad (ie CPC)

How does the auction determine an adrsquos rank and the price paidfor the ad

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability): "I don't want"
P(Aerosmith) = 0.11 × 0.11 × 0.11 = 1.3E-3
P(LadyGaga) = 0.15 × 0.05 × 0.15 = 1.1E-3
Winner: Aerosmith

42

Line 1 counts (Aerosmith): I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1 – Total Count: 18

Line 2 counts (Lady Gaga): I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2 – Total Count: 20

|V| = total # of word types in both lines (= 11)
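A small Python sketch of add-1 smoothed unigram LM scoring over counts like those above (the dictionaries mirror the two tables and |V| = 11 as stated; the function names are illustrative, and the exact values may differ from the Q2 figures above, which appear to be unsmoothed):

aerosmith = {"i": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
             "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}
ladygaga  = {"i": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
             "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}
V = len(set(aerosmith) | set(ladygaga))          # |V| = 11

def add1_prob(word, counts):
    # Add-1 (Laplace) smoothing: every word type gets one extra count.
    return (counts.get(word, 0) + 1) / (sum(counts.values()) + V)

def score(query, counts):
    p = 1.0
    for w in query.split():
        p *= add1_prob(w, counts)
    return p

q = "i don't want"
print(score(q, aerosmith), score(q, ladygaga))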

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps: Dictionary & Postings. Multiple term entries in a single document are merged. Split into Dictionary and Postings. Doc frequency information is also stored. Why frequency?

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Brutus AND Caesar: 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID.
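A Python sketch of this linear-time merge (docIDs assumed already sorted; the lists reuse the Brutus/Caesar example above):

def intersect(p1, p2):
    # Walk both sorted postings lists in lockstep: O(x + y).
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]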

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

[Figure: postings list 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers to 41 and 128, and postings list 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers to 11 and 31.]

Sec 23

CS3245 ndash Information Retrieval

Biword indexes: Index every consecutive pair of terms in the text as a phrase (a bigram model using words). For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen". Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary: Hash, Trees

Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic – Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table: Each vocabulary term is hashed to an integer. Pros: lookup is faster than for a tree: O(1). Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term? E.g., co*tion.

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive.

The solution: transform wildcard queries so that the * always occurs at the end.

This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering.

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction: Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap

Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits.
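A Python sketch of alternative 1, the plain dynamic-programming Levenshtein distance:

def edit_distance(s, t):
    # dp[i][j] = edit distance between s[:i] and t[:j].
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

print(edit_distance("judgment", "judgement"))   # 1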

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing: No global dictionary – generate a separate dictionary for each block. Don't sort postings – accumulate postings as they occur.

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

[Figure: MapReduce data flow – a master assigns segment files (splits) to parsers in the map phase; each parser writes its output into partitions a-b, c-d, …, y-z; in the reduce phase, one inverter per partition collects that partition from all parsers and writes the final postings.]

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both, merge results. Deletions: invalidation bit-vector for deleted docs; filter docs output on a search result by this invalidation bit-vector. Periodically re-index into one main index. Assuming T = total # of postings and n = size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge: Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk. If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1. Either write merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2, … etc.

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6: Index Compression. Collection and vocabulary statistics: Heaps' and Zipf's laws.

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix – store differences only. Used for the last k-1 terms in a block of k:

8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion

(The * marks the end of the shared prefix "automat"; each number before a ◊ gives the extra length beyond "automat", followed by the differing suffix.)

Begins to resemble general string compression.
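A Python sketch of front coding one sorted block; the '*' and '◊' output markers follow the slide's notation, while the function itself is illustrative:

def front_code(block):
    # Encode a sorted block: first term in full (with * after the common
    # prefix), remaining terms as <extra length> ◊ <suffix beyond the prefix>.
    prefix = block[0]
    for term in block[1:]:
        n = 0
        while n < min(len(prefix), len(term)) and prefix[n] == term[n]:
            n += 1
        prefix = prefix[:n]
    out = [f"{len(block[0])}{prefix}*{block[0][len(prefix):]}"]
    for term in block[1:]:
        suffix = term[len(prefix):]
        out.append(f"{len(suffix)}\u25ca{suffix}")
    return "".join(out)

print(front_code(["automata", "automate", "automatic", "automation"]))
# 8automat*a1◊e2◊ic3◊ion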

Sec 52

CS3245 ndash Information Retrieval

Postings Compression. Postings file entry: We store the list of docs containing a term in increasing order of docID: computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope: most gaps can be encoded/stored with far fewer than 20 bits.

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example:

docIDs:  824                 829        215406
gaps:                        5          214577
VB code: 00000110 10111000   10000101   00001101 00001100 10110001

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable.

For a small gap (5), VB uses a whole byte.

(Check: 00000110 10111000 decodes to 6×128 + 56 = 512+256+32+16+8 = 824.)
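A Python sketch of VB encoding and decoding (gaps computed from the docIDs above); the helper names are illustrative:

def vb_encode_number(n):
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128          # set the continuation bit on the last byte
    return bytes_

def vb_encode(gaps):
    out = []
    for g in gaps:
        out.extend(vb_encode_number(g))
    return out

def vb_decode(stream):
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

docids = [824, 829, 215406]
gaps = [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]  # [824, 5, 214577]
encoded = vb_encode(gaps)
print([f"{b:08b}" for b in encoded])
print(vb_decode(encoded))      # [824, 5, 214577]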

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g., K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

w(t,d) = (1 + log10 tf(t,d)) × log10(N / df(t))

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors: Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents". Key idea 2: Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69
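A small Python sketch tying the Week 7 steps together: tf-idf weights as in the formula above, length normalization, and ranking by dot product. The toy collection and the choice to idf-weight both query and documents are illustrative assumptions.

import math
from collections import Counter

docs = {1: "new york times", 2: "new york post", 3: "los angeles times"}
N = len(docs)
df = Counter(t for text in docs.values() for t in set(text.split()))

def tfidf_vector(text):
    # w(t,d) = (1 + log10 tf) * log10(N / df_t), then length-normalize.
    tf = Counter(text.split())
    w = {t: (1 + math.log10(c)) * math.log10(N / df[t])
         for t, c in tf.items() if t in df}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

def cosine(q, d):
    # Dot product of two length-normalized sparse vectors.
    return sum(w * d.get(t, 0.0) for t, w in q.items())

query = tfidf_vector("new new times")
scores = sorted(((cosine(query, tfidf_vector(text)), doc)
                 for doc, text in docs.items()), reverse=True)
print(scores)   # top K = the first K entries of this list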

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems. Making the Vector Space Model more efficient to compute: approximating the actual correct results; skipping unnecessary documents. In actual data: dealing with zones and fields, and query term proximity.

Resources: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.

Think of A as pruning non-contenders.

The same approach can also be used for other (non-cosine) scoring functions.

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve; precision (0.0–1.0) on the y-axis, recall (0.0–1.0) on the x-axis.]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (κ) = [P(A) − P(E)] / [1 − P(E)]. P(A) – proportion of the time the judges agree; P(E) – what the agreement would be by chance.

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness. Probably the evaluation methodology that large search engines trust most.

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9: Query Refinement
1. Relevance Feedback (Document Level)
   Explicit RF – Rocchio (1971); when does it work?
   Variants: Implicit and Blind
2. Query Expansion (Term Level)

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents. α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton)
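For reference, the Rocchio update that "in practice" refers to (the standard form with Dr and Dnr, as in IIR 9.1.1) is:

\vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j

i.e., move the query vector toward the centroid of the known relevant documents and away from the centroid of the known non-relevant ones.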

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: The simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the term-document matrix and w(i,j) = (normalized) weight for (ti, dj).

For each ti, pick terms with high values in C.

[Figure: term-document matrix A with M rows (terms ti) and N columns (documents dj).]

Sec 923

In NLTK Did you forget

84
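A numpy sketch of this construction; the toy term-document matrix and term list are illustrative assumptions:

import numpy as np

# Toy term-document matrix A: M terms x N documents (weights assumed normalized).
terms = ["car", "auto", "insurance", "best"]
A = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.0, 0.1],
              [0.1, 0.9, 0.8],
              [0.0, 0.6, 0.7]])

C = A @ A.T                     # term-term similarity matrix C = A A^T

for i, t in enumerate(terms):
    # For each term t_i, pick the other term with the highest value in row i.
    j = max((j for j in range(len(terms)) if j != i), key=lambda j: C[i, j])
    print(t, "->", terms[j])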

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

[Figure: an XML document with structure Book → (Title: "Microsoft", Author: "Bill Gates") is mapped to lexicalized subtrees "with words", e.g. Author→Bill, Author→Gates, Title→Microsoft, Book→Title→Microsoft, and so on; each such subtree becomes one dimension of the vector space.]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR.

|cq| and |cd| are the number of nodes in the query path and document path, respectively.

cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
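The context resemblance function referred to above is (as in IIR 10.3):

CR(c_q, c_d) = \begin{cases} \dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\ 0 & \text{otherwise} \end{cases}

For example, author/"Bill" against book/author/firstname/"Bill" gives (1 + 2)/(1 + 4) = 0.6, the value used in the next example.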

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t> (e.g., author/"Bill"), with query term weight wq = 1.0.

Inverted index (dictionary → postings):
<c1, t> (e.g., author/"Bill") → <d1, 0.5>, <d4, 0.1>, <d9, 0.2>
<c2, t> (e.g., title/"Bill") → <d2, 0.25>, <d3, 0.1>, <d12, 0.9>
<c3, t> (e.g., book/author/firstname/"Bill") → <d3, 0.7>, <d6, 0.8>, <d9, 0.5>

(This example is slightly different from the book; all weights have been normalized.)

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.60

OK to ignore Query vs <c2, t> since CR = 0.0; evaluate Query vs <c1, t> and Query vs <c3, t>.

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each term is Context Resemblance × Query Term Weight × Document Term Weight)

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval / Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): Traditionally used with the PRP.

Assumptions:
"Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g., document d is represented by a vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: We can rank documents using the log odds ratios for the terms in the query, ct.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, pt / (1 − pt), and (ii) the odds of the term appearing if nonrelevant, ut / (1 − ut).

ct = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and ct is positive if it is more likely to appear in relevant documents.

ct functions as a term weight, so that operationally we sum ct quantities in accumulators for query terms appearing in documents, just like in VSM.
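Concretely, the log odds-ratio term weight described above is (as in IIR 11.3.1):

c_t = \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t} = \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)}

and the retrieval status value of a document is RSV_d = \sum_{t \in q \cap d} c_t.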

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model. If the query is long, we might also use similar weighting for query terms.

tf(t,q): term frequency in the query q. k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single, fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.
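A Python sketch of the BM25 score with the query-side factor described above; the toy collection and the choice k3 = 1.5 (within the stated range) are illustrative assumptions:

import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, b=0.75, k3=1.5):
    # RSV(q,d) = sum over query terms of idf * document tf factor * query tf factor.
    N = len(docs)
    L_ave = sum(len(d.split()) for d in docs) / N
    tf_d = Counter(doc.split())
    tf_q = Counter(query.split())
    L_d = sum(tf_d.values())
    score = 0.0
    for t, qtf in tf_q.items():
        df = sum(1 for d in docs if t in d.split())
        if df == 0 or t not in tf_d:
            continue
        idf = math.log10(N / df)
        doc_factor = ((k1 + 1) * tf_d[t]) / (k1 * ((1 - b) + b * L_d / L_ave) + tf_d[t])
        query_factor = ((k3 + 1) * qtf) / (k3 + qtf)
        score += idf * doc_factor * query_factor
    return score

docs = ["jack london traveled to oakland",
        "jack traveled from oakland to london",
        "the city of york"]
print(bm25_score("jack oakland", docs[0], docs))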

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g., those with a high static quality score g(d) (cf. Section 7.1.4).

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95
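A Python sketch of ranking by query likelihood with the mixture (Jelinek-Mercer) smoothing this model uses, P(t|d) ≈ λ·P(t|Md) + (1−λ)·P(t|Mc); the toy documents and λ = 0.5 are illustrative assumptions:

from collections import Counter

docs = {1: "click go the shears boys click click click",
        2: "click click",
        3: "metal here",
        4: "metal shears click here"}
lam = 0.5

coll = Counter(" ".join(docs.values()).split())
coll_len = sum(coll.values())

def p_mixture(term, doc_counts, doc_len):
    p_doc  = doc_counts.get(term, 0) / doc_len      # P(t | Md), document model
    p_coll = coll[term] / coll_len                  # P(t | Mc), collection model
    return lam * p_doc + (1 - lam) * p_coll

def query_likelihood(query, doc):
    counts = Counter(doc.split())
    dl = sum(counts.values())
    p = 1.0
    for t in query.split():
        p *= p_mixture(t, counts, dl)
    return p

ranked = sorted(docs, key=lambda d: query_likelihood("shears click", docs[d]), reverse=True)
print(ranked)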

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98
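Python's standard library already parses this protocol; a short sketch using urllib.robotparser (the example.com URLs and the "SomeBot" user agent are placeholders):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Parse the example rules directly; equivalently, set_url(...) + read()
# would fetch a live robots.txt (which a crawler should cache per site).
rp.parse("""
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
""".splitlines())

print(rp.can_fetch("SomeBot", "http://www.example.com/yoursite/temp/page.html"))        # False
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))   # True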

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a.
1. Start with any distribution (say x = (1, 0, …, 0)).
2. After one step, we're at xA.
3. After two steps, at xA², then xA³, and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.
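A numpy sketch of this power iteration; the row-stochastic transition matrix A below (teleportation already folded in) is an illustrative assumption:

import numpy as np

# Row-stochastic transition matrix (teleportation already folded in).
A = np.array([[0.02, 0.02, 0.88, 0.02, 0.02, 0.04],
              [0.02, 0.45, 0.45, 0.02, 0.02, 0.04],
              [0.31, 0.02, 0.31, 0.31, 0.02, 0.03],
              [0.02, 0.02, 0.02, 0.45, 0.45, 0.04],
              [0.02, 0.02, 0.02, 0.02, 0.02, 0.90],
              [0.02, 0.02, 0.02, 0.31, 0.31, 0.32]])

x = np.array([1.0, 0, 0, 0, 0, 0])      # start with any distribution
for _ in range(100):                     # multiply by increasing powers of A
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-10):
        break
    x = x_next

print(x)   # steady-state vector a: the PageRank scores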

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future: Python – one of the easiest and more straightforward programming languages to use; NLTK – a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 ndash Information Retrieval

103 Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105



Page 9: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

1st Generation of Search AdsGoto (1996)

9

Sec 193

CS3245 ndash Information Retrieval

1st Generation of search adsGoto (1996)

Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the

link Pages were simply ranked according to bid ndash revenue

maximization for Goto No separation of adsdocs Only one result list Upfront and honest No relevance ranking

but Goto did not pretend there was any10

Sec 193

CS3245 ndash Information Retrieval

2nd generation of search ads Google (2000)

11

SogoTrade appearsin searchresults

SogoTrade appearsin ads

Do search enginesrank advertisershigher thannon-advertisers

All major searchengines claim ldquonordquo

Sec 193

CS3245 ndash Information Retrieval

Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major

advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet

Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)

12

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your

ad (ie CPC)

How does the auction determine an adrsquos rank and the price paidfor the ad

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. The difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q). By Bayes' rule, P(d|q) = P(q|d) P(d) / P(q):
- P(q) is the same for all documents, so ignore it
- P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4)
- P(q|d) is the probability of q given d

So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
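A small sketch of query-likelihood scoring with the Jelinek-Mercer mixture described above, i.e. P(q|d) ≈ ∏_t ( λ P(t|Md) + (1 − λ) P(t|Mc) ); the toy collection and λ = 0.5 are illustrative assumptions.

import math
from collections import Counter

docs = {
    "d1": "click go the shears boys click click click".split(),
    "d2": "metal here shining all click here".split(),
}
collection = [t for words in docs.values() for t in words]
coll_counts, coll_len = Counter(collection), len(collection)

def query_log_likelihood(query, doc, lam=0.5):
    """log P(q|d) under a mixture of the document LM and the collection LM."""
    words = docs[doc]
    doc_counts, doc_len = Counter(words), len(words)
    score = 0.0
    for t in query:
        p_doc = doc_counts[t] / doc_len        # P(t | Md), maximum likelihood
        p_coll = coll_counts[t] / coll_len     # P(t | Mc), smooths unseen terms
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

for d in docs:
    print(d, query_log_likelihood(["click", "shears"], d))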

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
- Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
- Duplicates: need to integrate duplicate detection
- Scale: need to use a distributed approach
- Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201
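One possible sketch of the politeness requirement above (spacing out requests per host); the helper name and the 30-second delay are assumptions for illustration.

import time

last_fetch = {}                 # host -> time of the most recent request
MIN_DELAY = 30.0                # seconds to wait between requests to the same host

def polite_wait(host):
    """Block until enough time has passed since the last request to this host."""
    now = time.time()
    wait = last_fetch.get(host, 0.0) + MIN_DELAY - now
    if wait > 0:
        time.sleep(wait)
    last_fetch[host] = time.time()

# usage: call polite_wait("www.example.com") before every fetch from that host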

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202
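A sketch of how a crawler might honour such rules using Python's standard library robots.txt parser; the URLs are placeholders.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")   # placeholder site
rp.read()                                          # fetch and parse once, then cache

# Check individual URLs against the cached rules before fetching them.
print(rp.can_fetch("searchengine", "https://www.example.com/yoursite/temp/page.html"))
print(rp.can_fetch("*", "https://www.example.com/index.html"))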

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable.
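A minimal power-iteration sketch of this procedure; the 3-page transition matrix is a made-up example with teleporting already folded in.

import numpy as np

# Toy transition matrix A (rows sum to 1, teleportation already included).
A = np.array([[0.05, 0.90, 0.05],
              [0.45, 0.10, 0.45],
              [0.90, 0.05, 0.05]])

x = np.array([1.0, 0.0, 0.0])        # start with any distribution
for _ in range(100):                 # multiply by increasing powers of A
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-10):   # stop when the product looks stable
        break
    x = x_next

print(x)                             # steady-state vector a: the PageRank scores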

CS3245 ndash Information Retrieval

100

Pagerank summary:
Pre-processing: Given the graph of links, build matrix A. From it, compute a. The pagerank ai is a scaled number between 0 and 1.
Query processing: Retrieve pages meeting the query. Rank them by their pagerank. Order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
- Python – one of the easiest and more straightforward programming languages to use
- NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Areas of IR shown on the slide: Adversarial IR, Boolean IR / VSM, Prob. IR, Computational Advertising, Recommendation Systems, Digital Libraries, Natural Language Processing, Question Answering, IR Theory, Distributed IR, Geographic IR, Social Network IR, Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105



Page 11: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

2nd generation of search ads Google (2000)

11

SogoTrade appearsin searchresults

SogoTrade appearsin ads

Do search enginesrank advertisershigher thannon-advertisers

All major searchengines claim ldquonordquo

Sec 193

CS3245 ndash Information Retrieval

Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major

advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet

Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)

12

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your

ad (ie CPC)

How does the auction determine an adrsquos rank and the price paidfor the ad

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

47

Postings: 2 → 4 → 8 → 41 → 48 → 64 → 128, with skip pointers to 41 and 128
Postings: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31, with skip pointers to 11 and 31

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase: a bigram model using words. For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen"

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic – Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table
Each vocabulary term is hashed to an integer. Pros: Lookup is faster than for a tree, O(1)

Cons No easy way to find minor variants

judgment / judgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term, e.g. co*tion?

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive.

The solution: transform wild-card queries so that the * always occurs at the end.

This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering

53

Sec 32
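A minimal sketch (not from the slides) of the permuterm idea: index every rotation of term$, then rotate the wildcard query so the * lands at the end and do a prefix lookup. The tiny vocabulary is an illustrative assumption.

```python
def permuterm_keys(term):
    """All rotations of term$; each rotation points back to the term."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def rotate_query(wildcard):
    """Rotate q1*q2 so the * is at the end: q2$q1* -> prefix lookup on q2$q1."""
    q1, q2 = wildcard.split("*", 1)
    return q2 + "$" + q1

vocab = ["coordination", "cooperation", "cotton", "collection"]
index = {key: t for t in vocab for key in permuterm_keys(t)}

prefix = rotate_query("co*tion")        # -> "tion$co"
matches = {t for key, t in index.items() if key.startswith(prefix)}
print(matches)                          # {'coordination', 'cooperation', 'collection'}
```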

CS3245 ndash Information Retrieval

Spelling correction
Isolated word correction: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. ngram overlap

Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time and suggest the one with most hits
54

Sec 332
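For reference, a minimal dynamic-programming implementation of Levenshtein edit distance (a sketch, not taken from the slides):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]   # dp[i][j]: distance a[:i] vs b[:j]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(edit_distance("judgment", "judgement"))  # 1
```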

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing: No global dictionary - generate a separate dictionary for each block. Don't sort postings - accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI: Blocked sort-based Indexing (Sorting with fewer disk seeks). 8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1: Generate separate dictionaries for each block – no need to maintain term-termID mapping across blocks

Key idea 2: Don't sort. Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57
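A minimal Python sketch of the SPIMI-invert idea for a single block (illustrative only; real SPIMI writes each block's index to disk and merges the block indexes afterwards):

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """One block's inverted index: no termID mapping, no sorting of postings
    while indexing - postings are simply accumulated as they occur."""
    index = defaultdict(list)                 # term -> postings list of docIDs
    for doc_id, term in token_stream:
        postings = index[term]
        if not postings or postings[-1] != doc_id:   # collapse repeats within a doc
            postings.append(doc_id)
    # terms are sorted only once, when the block is written out
    return {term: index[term] for term in sorted(index)}

tokens = [(1, "new"), (1, "home"), (2, "home"), (2, "sales"), (2, "top"), (3, "home")]
print(spimi_invert(tokens))   # {'home': [1, 2, 3], 'new': [1], 'sales': [2], 'top': [2]}
```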

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

(Diagram: splits of the collection are assigned by the master to parsers in the map phase; the parsers write segment files partitioned by term range (a-b, c-d, …, y-z); inverters in the reduce phase each collect one term-range partition and write out its postings.)

CS3245 ndash Information Retrieval

Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both, merge results.
Deletions: Invalidation bit-vector for deleted docs. Filter docs output on a search result by this invalidation bit-vector.
Periodically re-index into one main index. Assuming T total # of postings and n as size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write to disk as I0, or merge with I0 (if I0 already exists) as Z1.
Either write merged Z1 to disk as I1 (if no I1), or merge with I1 to form Z2.
… etc.

Sec 45

60Information Retrieval

Loop for log levels

CS3245 ndash Information Retrieval

Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking. Store pointers to every kth term string. Example below: k=4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers

Lose 4 bytes on term lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding
Sorted words commonly have a long common prefix – store differences only. Used for last k-1 terms in a block of k:
8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion
(* marks the end of the encoded prefix "automat"; each number gives the extra length beyond "automat")

Information Retrieval 63

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID:
computer: 33, 47, 154, 159, 202, …
Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …
Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:   824                  829        215406
gaps:     –                    5          214577
VB code:  00000110 10111000    10000101   00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
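A minimal Python sketch of VB encoding/decoding of gaps (not from the slides), using the convention above that the continuation bit is set on the last byte of each number:

```python
def vb_encode_number(n):
    """Variable-byte encode one gap: 7 payload bits per byte,
    high bit set only on the final byte of the number."""
    bytes_out = []
    while True:
        bytes_out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_out[-1] += 128              # continuation bit on the last byte
    return bytes_out

def vb_decode(byte_stream):
    numbers, n = [], 0
    for b in byte_stream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
code = [b for g in gaps for b in vb_encode_number(g)]
print([format(b, "08b") for b in code])   # matches the bit pattern above
print(vb_decode(code))                    # [824, 5, 214577]
```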

CS3245 ndash Information Retrieval

Week 7: Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_t,d = (1 + log10 tf_t,d) × log10(N / df_t)

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf
Increases with the number of occurrences within a document
Increases with the rarity of the term in the collection
67

Sec 622
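A minimal sketch of the whole pipeline in Python: (1 + log10 tf) × idf weights, length normalization, and cosine ranking. The three toy documents and the query are illustrative assumptions, not from the slides.

```python
import math
from collections import Counter

docs = ["new home sales top forecasts",
        "home sales rise in july",
        "increase in home sales in july"]
N = len(docs)
df = Counter(t for d in docs for t in set(d.split()))   # document frequencies

def tfidf_vector(text):
    tf = Counter(text.split())
    vec = {t: (1 + math.log10(tf[t])) * math.log10(N / df[t])
           for t in tf if df.get(t)}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}         # length-normalize

def cosine(q, d):
    return sum(w * d.get(t, 0.0) for t, w in q.items())  # dot product

query = tfidf_vector("home sales july")
for score, doc in sorted(((cosine(query, tfidf_vector(d)), d) for d in docs),
                         reverse=True):
    print(round(score, 3), doc)
```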

CS3245 ndash Information Retrieval

Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents".
Key idea 2: Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
cos(q, d) = q · d, for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results, skipping unnecessary documents.
In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach
Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting

Indeed, any function of the two "signals" of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value. Sometimes build range (B-tree) trees (e.g. for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

(Figure: a precision-recall curve; x-axis Recall from 0.0 to 1.0, y-axis Precision from 0.0 to 1.0.)

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: Agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement.
Kappa (K) = [P(A) - P(E)] / [1 - P(E)]
P(A) – proportion of time judges agree
P(E) – what agreement would be by chance
Gives 0 for chance agreement, 1 for total agreement
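A minimal worked computation of kappa in Python, with P(E) taken from the pooled marginals of the two judges (a sketch; the 2×2 judgment counts are illustrative assumptions, not from the slides):

```python
def kappa(table):
    """table[i][j] = number of docs judge 1 rated i and judge 2 rated j
    (0 = nonrelevant, 1 = relevant)."""
    n = sum(sum(row) for row in table)
    p_agree = (table[0][0] + table[1][1]) / n
    p1_rel = (table[1][0] + table[1][1]) / n      # judge 1 says relevant
    p2_rel = (table[0][1] + table[1][1]) / n      # judge 2 says relevant
    p_rel = (p1_rel + p2_rel) / 2                 # pooled marginal
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

# 300 rel/rel, 70 nonrel/nonrel, 20 + 10 disagreements (assumed counts)
print(round(kappa([[70, 20], [10, 300]]), 3))     # 0.776
```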

CS3245 ndash Information Retrieval

A/B testing
Purpose: Test a single innovation
Prerequisite: You have a large system with many users
Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on first result. Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most
80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR

Chapter 10.3 XML IR: Basic XML concepts, Challenges in XML IR, Vector space model for XML IR, Evaluation of XML IR

Chapter 9: Query Refinement
1. Relevance Feedback (Document Level)
   Explicit RF – Rocchio (1971): When does it work?
   Variants: Implicit and Blind
2. Query Expansion (Term Level)
   Manual thesaurus
   Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.
α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
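The Rocchio formula itself appears only as an image on the slide; as a hedged sketch, the standard form q_m = α·q + β·(centroid of Dr) − γ·(centroid of Dnr), with negative components clipped to 0, looks like this in Python (the vectors and weights below are illustrative assumptions):

```python
import numpy as np

def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of the known relevant docs and
    away from the centroid of the known irrelevant docs."""
    q = alpha * np.asarray(query, dtype=float)
    if rel_docs:
        q += beta * np.mean(rel_docs, axis=0)
    if nonrel_docs:
        q -= gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q, 0.0)          # negative term weights are set to 0

query  = [1.0, 0.0, 0.0]                         # weights over a 3-term vocabulary
rel    = [[0.9, 0.4, 0.0], [0.8, 0.6, 0.0]]      # judged relevant
nonrel = [[0.0, 0.1, 0.9]]                       # judged nonrelevant
print(rocchio(query, rel, nonrel))               # [1.6375 0.36   0.    ]
```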

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix (rows t_i, columns d_j) and w_ij = (normalized) weight for (t_i, d_j).
For each t_i, pick terms with high values in C.

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How? Map XML documents to lexicalized subtrees

85

(Figure: an XML document, Book with Title "Microsoft" and Author "Bill Gates", is decomposed "with words" into lexicalized subtrees such as Microsoft, Bill Gates, Title–Microsoft, Author–Bill, Author–Gates, Book–Title–Microsoft, …)

CS3245 ndash Information Retrieval

Context resemblance
A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes
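For reference (the slide shows it only as an image), the context resemblance function as given in IIR Section 10.3 is CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise; this is consistent with CR(c1, c3) = 0.60 in the example on the next slide.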

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

query: <c1, t>   (e.g. author/"Bill")

Inverted index (dictionary → postings):
  <c1, t> (e.g. author/"Bill"): <d1, 0.5>, <d4, 0.1>, <d9, 0.2>
  <c2, t> (e.g. title/"Bill"): <d2, 0.25>, <d3, 0.1>, <d12, 0.09>
  <c3, t> (e.g. book/author/firstname/"Bill"): <d3, 0.7>, <d6, 0.8>, <d9, 0.5>

This example is slightly different from the book. All weights have been normalized.

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.60

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each term is Context Resemblance × Query Term Weight × Document Term Weight)

OK to ignore Query vs <c2, t> since CR = 0.0; only Query vs <c1, t> and Query vs <c3, t> contribute.

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11
1. Probabilistic Approach to Retrieval: Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)
Chapter 12
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors. E.g. document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.

Independence: no association between terms (not true, but works in practice – naïve assumption)

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios for the terms in the query, c_t.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 − u_t).

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that the retrieval status value RSV_d is the sum of the c_t of the query terms present in the document. Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms.
tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
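The BM25 formula itself is shown as an image on the slide; as a hedged sketch, the usual document-side scoring function with parameters k1 and b (the k3 query-side weighting is omitted, as for a short query) can be written as:

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N, k1=1.2, b=0.75):
    """Sum over query terms of idf * ((k1+1)*tf) / (k1*(1 - b + b*L_d/L_avg) + tf)."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        idf = math.log10(N / df[t])
        norm = k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf
        score += idf * (k1 + 1) * tf / norm
    return score

# toy collection statistics (assumed): 1000 docs, average length 100 tokens
df = {"home": 400, "sales": 300, "july": 50}
doc_tf = {"home": 3, "sales": 2, "july": 1}
print(round(bm25_score(["home", "sales", "july"], doc_tf, doc_len=120,
                       avg_doc_len=100, df=df, N=1000), 3))
```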

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way.
Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents, so ignore. P(d) is the prior – often treated as the same for all d.
But we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
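As a hedged sketch of the standard Jelinek-Mercer mixture used here, P(q|d) = Π_t [λ·P(t|M_d) + (1−λ)·P(t|M_c)]; the two-document collection and λ = 0.5 below are illustrative assumptions.

```python
from collections import Counter

docs = {"d1": "xyzzy begins as information on".split(),
        "d2": "information retrieval information on the web".split()}
collection = [t for d in docs.values() for t in d]
coll_tf = Counter(collection)
lam = 0.5                                        # mixture weight (assumed)

def p_query_given_doc(query, doc_tokens):
    tf = Counter(doc_tokens)
    p = 1.0
    for t in query:
        p_doc  = tf[t] / len(doc_tokens)         # P(t | M_d)
        p_coll = coll_tf[t] / len(collection)    # P(t | M_c)
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

query = "information retrieval".split()
for d, tokens in docs.items():
    print(d, p_query_given_doc(query, tokens))   # d2 scores higher
```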

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202
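A minimal sketch of honoring such a file with Python's standard library (urllib.robotparser), parsing the example above from a literal string instead of fetching a real URL:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("mycrawler", "http://example.com/yoursite/temp/page.html"))     # False
print(rp.can_fetch("searchengine", "http://example.com/yoursite/temp/page.html"))  # True
print(rp.can_fetch("mycrawler", "http://example.com/public/index.html"))           # True
```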

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1 0 … 0))
2. After one step, we're at xA
3. After two steps at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
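A minimal power-iteration sketch in Python (the 3-page link graph and teleport rate are illustrative assumptions, not from the slides):

```python
import numpy as np

def pagerank(links, alpha=0.85, tol=1e-10):
    """Power iteration on the teleport-adjusted transition matrix A."""
    n = links.shape[0]
    out_deg = links.sum(axis=1, keepdims=True)
    P = np.where(out_deg > 0, links / np.where(out_deg == 0, 1, out_deg), 1.0 / n)
    A = alpha * P + (1 - alpha) / n           # teleport with probability 1 - alpha
    x = np.full(n, 1.0 / n)
    while True:
        x_next = x @ A
        if np.abs(x_next - x).sum() < tol:    # steady state: x A^k stops changing
            return x_next
        x = x_next

# links[i][j] = 1 if page i links to page j (assumed toy web graph)
links = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [0, 1, 0]], dtype=float)
print(pagerank(links).round(3))               # pagerank vector a, entries sum to 1
```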

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldn't work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search Ads: Goto (1996)
  • 1st Generation of search ads: Goto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Google's second price auction
  • Google's second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di), S(dj)) ≅ P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary & Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index Compression: Dictionary-as-a-String and Blocking
  • Front coding
  • Postings Compression: Postings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105



functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103 Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 13: Search Advertising, Duplicate Detection and Revision

CS3245 ndash Information Retrieval

How are ads ranked? Advertisers bid for keywords – sale by auction. Open system: anybody can participate and bid on keywords. Advertisers are only charged when somebody clicks on their ad (i.e. CPC).

How does the auction determine an ad's rank and the price paid for the ad?

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction: The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) – related to the Vickrey Auction.

price1 × CTR1 = bid2 × CTR2 (this will result in rank1 = rank2)

price1 = bid2 × CTR2 / CTR1

p1 = bid2 × CTR2/CTR1 = 3.00 × 0.03/0.06 = 1.50; p2 = bid3 × CTR3/CTR2 = 1.00 × 0.08/0.03 = 2.67; p3 = bid4 × CTR4/CTR3 = 4.00 × 0.01/0.08 = 0.50

16

Sec 193
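The same arithmetic as the worked example above, as a small sketch. Note that the top advertiser's own bid never enters the price formula, so bid1 is left as a placeholder.

    def gsp_prices(bids, ctrs):
        """Generalized second-price: advertiser i pays the minimum needed to keep
        its rank (ignoring the extra cent): price_i = bid_{i+1} * CTR_{i+1} / CTR_i.
        Inputs are already sorted by ad rank = bid * CTR."""
        return [round(bids[i + 1] * ctrs[i + 1] / ctrs[i], 2)
                for i in range(len(bids) - 1)]

    bids = [None, 3.00, 1.00, 4.00]   # bid1 is never used in the prices below
    ctrs = [0.06, 0.03, 0.08, 0.01]   # CTR1..CTR4
    print(gsp_prices(bids, ctrs))     # [1.5, 2.67, 0.5]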

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win: Violation of trademarks. Example: geico. During part of 2005, the search term "geico" on Google was bought by competitors. Geico lost this case in the United States; Louis Vuitton lost a similar case in Europe.

http://support.google.com/adwordspolicy/answer/6118?rd=1: It's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site.

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATE DETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detection: The web is full of duplicated content – more so than many other collections.

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

E.g., two documents are near-duplicates if similarity > θ = 80%.

24

Sec 196

CS3245 ndash Information Retrieval

Recall the Jaccard coefficient: a commonly used measure of the overlap of two sets. Let A and B be two sets; the Jaccard coefficient is JACCARD(A, B) = |A ∩ B| / |A ∪ B|.

JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅. A and B don't have to be the same size. Always assigns a number between 0 and 1.

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0. Note: very sensitive to dissimilarity.

26

Sec 196
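A short sketch that reproduces the example above from 2-gram shingles; the whitespace tokenization and lowercasing are simplifying assumptions.

    def shingles(text, n=2):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0

    d1 = "Jack London traveled to Oakland"
    d2 = "Jack London traveled to the city of Oakland"
    d3 = "Jack traveled from Oakland to London"
    print(jaccard(shingles(d1), shingles(d2)))   # 0.375 = 3/8
    print(jaccard(shingles(d1), shingles(d3)))   # 0.0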

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting: We can map shingles to a large integer space [1..2^m] (e.g. m = 64) by fingerprinting. We use s_k to refer to the shingle's fingerprint in 1..2^m.

This doesn't directly help us – we are just converting strings to large integers.

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches: The number of shingles per document is large – difficult to exhaustively compare. To make it fast, we use a sketch, a subset of the shingles of a document. The size of a sketch is, say, n = 200, and is defined by a set of permutations π1, …, π200. Each πi is a random permutation on 1..2^m.

The sketch of d is defined as < min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) > (a vector of 200 numbers).

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?" In this case, permutation π says d1 ≈ d2.

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(d_i), S(d_j)) ≅ P(x_i = x_j)

Let C00 = number of rows in A where both entries are 0; define C11, C10, C01 likewise.

J(S_i, S_j) is then equivalent to C11 / (C10 + C01 + C11); P(x_i = x_j) is likewise equivalent to C11 / (C10 + C01 + C11).

31

(Figure: an example 0/1 matrix A, one row per shingle hash and one column per shingle set S_i, S_j; for the rows shown, J(S_i, S_j) = 2/5. Under random permutations π(1), π(2), π(3), …, π(n=200), some permutations give x_i ≠ x_j and some give x_i = x_j.)

We view a matrix A with 1 column per set of hashes. Element A_ij = 1 if element i in set S_j is present. Permutation π(n) is a random reordering of the rows in A. x_k is the first non-zero entry in π(d_k), i.e. the first shingle present in document k.

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient. Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s).

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial.

Estimator of the probability of success: the proportion of successes in n Bernoulli trials (n = 200).

Our sketch is based on a random selection of permutations.

Thus, to compute Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200:

k/n = k/200 estimates J(d1, d2). 32

Sec 196
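A minimal sketch of the estimation procedure just described; true random permutations are simulated with random hash functions, which is the usual practical stand-in, and the two shingle sets are the Jack London example from earlier.

    import random

    def sketch(shingle_set, hashers):
        """MinHash sketch: one minimum per simulated permutation."""
        return [min(h(s) for s in shingle_set) for h in hashers]

    def estimate_jaccard(sk1, sk2):
        agree = sum(1 for a, b in zip(sk1, sk2) if a == b)
        return agree / len(sk1)          # k successes out of n trials

    random.seed(42)
    M = 2 ** 64
    def make_hasher():
        a, b = random.randrange(1, M), random.randrange(M)
        return lambda s: (a * hash(s) + b) % M
    hashers = [make_hasher() for _ in range(200)]   # stands in for pi_1..pi_200

    s1 = {"jack london", "london traveled", "traveled to", "to oakland"}
    s2 = {"jack london", "london traveled", "traveled to", "to the",
          "the city", "city of", "of oakland"}
    print(estimate_jaccard(sketch(s1, hashers), sketch(s2, hashers)))  # close to 3/8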

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via the NUS EduRec System by 20 May 2019 to get exam results earlier via SMS on 3 Jun 2019.

For more information: https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability): "I don't want"
P(Aerosmith): 0.11 × 0.11 × 0.11 = 1.3E-3
P(LadyGaga): 0.15 × 0.05 × 0.15 = 1.1E-3
Winner: Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines
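A small sketch of the add-1 computation using the count tables above, assuming the first table is the Aerosmith model and the second the Lady Gaga model (as the declared winner suggests); the exact probabilities differ slightly from the slide's rounded per-term values.

    def add_one_prob(term, counts, total, vocab_size):
        return (counts.get(term, 0) + 1) / (total + vocab_size)

    def score(query, counts, total, vocab_size=11):
        p = 1.0
        for t in query.split():
            p *= add_one_prob(t, counts, total, vocab_size)
        return p

    aerosmith = {"i": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
                 "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}   # total 18
    ladygaga  = {"i": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
                 "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}   # total 20
    q = "i don't want"
    print(score(q, aerosmith, 18), score(q, ladygaga, 20))   # Aerosmith scores higher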

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Intersection: 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID.

Sec 13

45
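A sketch of the linear merge described above, reproducing the Brutus/Caesar intersection.

    def intersect(p1, p2):
        """Linear-time merge of two docID-sorted postings lists: O(x + y)."""
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    brutus = [2, 4, 8, 16, 32, 64, 128]
    caesar = [1, 2, 3, 5, 8, 13, 21, 34]
    print(intersect(brutus, caesar))   # [2, 8]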

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

47

(Example: postings 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers to 41 and 128; postings 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers to 11 and 31.)

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; …>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary: Hash, Trees

Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic – Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term, e.g. co*tion?

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!

The solution: transform wild-card queries so that the * always occurs at the end. This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering.

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332
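A standard dynamic-programming sketch of edit distance, the first of the "closest word" alternatives listed above.

    def edit_distance(s, t):
        """Levenshtein distance with unit costs."""
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[m][n]

    print(edit_distance("kitten", "sitting"))   # 3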

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix – store differences only. Used for the last k-1 terms in a block of k:

8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a1◊e2◊ic3◊ion

("8automat*" encodes the shared prefix "automat"; each number gives the extra length beyond "automat".)

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID: computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs: 824, 829, 215406
gaps: —, 5, 214577
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
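A sketch of VB encoding and decoding that reproduces the byte strings above (continuation bit set on the last byte of each gap).

    def vb_encode_number(n):
        bytes_ = []
        while True:
            bytes_.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        bytes_[-1] += 128            # set the continuation bit on the final byte
        return bytes_

    def vb_encode(gaps):
        out = []
        for g in gaps:
            out.extend(vb_encode_number(g))
        return out

    def vb_decode(bytestream):
        numbers, n = [], 0
        for b in bytestream:
            if b < 128:
                n = 128 * n + b
            else:
                numbers.append(128 * n + (b - 128))
                n = 0
        return numbers

    gaps = [824, 5, 214577]
    print([format(b, "08b") for b in vb_encode(gaps)])
    # ['00000110', '10111000', '10000101', '00001101', '00001100', '10110001']
    print(vb_decode(vb_encode(gaps)))   # [824, 5, 214577]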

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.

Increases with the number of occurrences within a document.

Increases with the rarity of the term in the collection. 67

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Sec 622
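The weight formula above as a one-function sketch; the example term statistics are hypothetical numbers chosen only to show the call.

    import math

    def tf_idf_weight(tf_td, df_t, N):
        """w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
        if tf_td == 0 or df_t == 0:
            return 0.0
        return (1 + math.log10(tf_td)) * math.log10(N / df_t)

    # e.g. a term occurring 3 times in a doc and in 100 of N = 1,000,000 docs:
    print(tf_idf_weight(3, 100, 1_000_000))   # (1 + log10 3) * 4 ≈ 5.91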

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): cos(q, d) = q · d = Σ_i q_i d_i,

for length-normalized q and d.

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
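The slide above refers to the CosineScore procedure; here is a sketch of the usual accumulator version, with data-structure names that are assumptions rather than the course's exact code.

    def cosine_scores(query_weights, postings, doc_lengths, k=10):
        """Term-at-a-time scoring with accumulators, then length-normalize and take top K.
        postings[t] is a list of (docID, tf-idf weight) pairs; doc_lengths[d] is |d|."""
        scores = {}
        for t, w_tq in query_weights.items():
            for d, w_td in postings.get(t, []):
                scores[d] = scores.get(d, 0.0) + w_tq * w_td
        for d in scores:
            scores[d] /= doc_lengths[d]
        return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]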

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruning non-contenders.

The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711

(Figure: the contender set A sits between the top K documents and the full collection of N documents.)

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

(Figure: a precision-recall curve, with precision on the y-axis and recall on the x-axis, both ranging from 0.0 to 1.0.)

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: an agreement measure among judges. Designed for categorical judgments; corrects for chance agreement.

Kappa (K) = [P(A) − P(E)] / [1 − P(E)]. P(A) – proportion of the time judges agree; P(E) – what agreement would be by chance.

Gives 0 for chance agreement, 1 for total agreement.
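A small sketch of the computation; P(E) here is estimated from each judge's own marginals (Cohen's variant), whereas the textbook's worked example pools the marginals of both judges. The judgment lists are made-up binary relevance calls.

    def kappa(judgments_a, judgments_b):
        """Kappa for two judges making binary relevant(1)/nonrelevant(0) calls."""
        n = len(judgments_a)
        p_agree = sum(a == b for a, b in zip(judgments_a, judgments_b)) / n
        pa = sum(judgments_a) / n            # judge A's proportion of 'relevant'
        pb = sum(judgments_b) / n            # judge B's proportion of 'relevant'
        p_chance = pa * pb + (1 - pa) * (1 - pb)
        return (p_agree - p_chance) / (1 - p_chance)

    a = [1, 1, 0, 1, 0, 0, 1, 1]
    b = [1, 0, 0, 1, 0, 1, 1, 1]
    print(kappa(a, b))   # 0 would mean chance-level agreement, 1 total agreement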

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

q_m = α q_0 + β (1/|Dr|) Σ_{d_j ∈ Dr} d_j − γ (1/|Dnr|) Σ_{d_j ∈ Dnr} d_j

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.

α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton).
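A sketch of the Rocchio update on sparse dictionary vectors; the α, β, γ defaults and the toy query/document vectors are hypothetical choices, not values from the slides.

    def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
        """Rocchio relevance feedback on sparse term -> weight vectors (dicts)."""
        qm = {t: alpha * w for t, w in q0.items()}
        for docs, coeff in ((rel_docs, beta), (nonrel_docs, -gamma)):
            if not docs:
                continue
            for d in docs:                         # add/subtract the centroid of docs
                for t, w in d.items():
                    qm[t] = qm.get(t, 0.0) + coeff * w / len(docs)
        return {t: w for t, w in qm.items() if w > 0}   # negative weights usually dropped

    q0 = {"cheap": 1.0, "flights": 1.0}
    rel = [{"cheap": 0.5, "flights": 0.8, "airfare": 0.6}]
    nonrel = [{"cheap": 0.9, "hotels": 0.7}]
    print(rocchio(q0, rel, nonrel))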

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: The simplest way to compute one is based on term-term similarities in C = AA^T, where A is the M × N term-document matrix (M terms t_i, N documents d_j) and w_ij = the (normalized) weight for (t_i, d_j).

For each t_i, pick the terms with high values in C.

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

(Figure: an XML document with root Book and children Title ("Microsoft") and Author ("Bill Gates") is mapped to its lexicalized subtrees, e.g. Microsoft, Bill Gates, Title–Microsoft, Author–Bill, Author–Gates, Author–"Bill Gates", Book–Title–Microsoft, and so on – the tree structure "with words".)

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR: CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise.

|cq| and |cd| are the number of nodes in the query path and document path, respectively. cq matches cd iff we can transform cq into cd by inserting additional nodes.

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t> (e.g. author/"Bill"). Inverted index dictionary: <c1, t> (e.g. author/"Bill"), <c2, t> (e.g. title/"Bill"), <c3, t> (e.g. book/author/firstname/"Bill"), with postings:
<c1, t>: <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t>: (its postings can be ignored below, since CR = 0.0)
<c3, t>: <d3, 0.7> <d6, 0.8> <d9, 0.5>
(This example is slightly different from the book. All weights have been normalized.)

Context resemblance: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60. It is OK to ignore Query vs <c2, t> since CR = 0.0.

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5 (context resemblance × query term weight × document term weight).

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing the RSV. We can rank documents using the log odds ratios for the terms in the query, c_t.

The odds ratio is the ratio of two odds:

(i) the odds of the term appearing if relevant (p_t / (1 − p_t)), and

(ii) the odds of the term appearing if nonrelevant (u_t / (1 − u_t)).

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q∩d} c_t. Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91


Page 14: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing: 2nd simplest approach
Maintain "big" main index; new docs go into "small" (in memory) auxiliary index
Search across both, merge results
Deletions: Invalidation bit-vector for deleted docs; filter docs output on a search result by this invalidation bit-vector
Periodically re-index into one main index
Assuming T = total # of postings and n = size of auxiliary index, we touch each posting up to floor(T/n) times
Alternatively we can apply logarithmic merge, in which each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge: Idea: maintain a series of indexes, each twice as large as the previous one
Keep smallest (Z0) in memory; larger ones (I0, I1, …) on disk
If Z0 gets too big (> n), write to disk as I0, or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2
… etc.

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws

Compression to make index smaller, faster
Dictionary compression for Boolean indexes: Dictionary string, blocks, front coding

Postings compression: Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k=4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers

Lose 4 bytes on term lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have long common prefix – store differences only
Used for last k-1 terms in a block of k:
8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a1◊e2◊ic3◊ion

(Encodes "automat"; extra length beyond "automat")

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID:
computer: 33, 47, 154, 159, 202 …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43 …

Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:  824                 829       215406
gaps:                        5         214577
VB code: 00000110 10111000   10000101  00001101 00001100 10110001

65

Postings stored as the byte concatenation:
00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable

For a small gap (5), VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
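A sketch of VB encoding/decoding that reproduces the byte strings above: gaps relative to the previous docID, 7 payload bits per byte, and the high bit set on the last byte of each gap.

```python
# Sketch: Variable Byte (VB) encoding of docID gaps, matching the example above.
def vb_encode_number(n):
    bytes_out = []
    while True:
        bytes_out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_out[-1] += 128          # set the high bit on the last byte
    return bytes_out

def vb_encode(doc_ids):
    stream, prev = [], 0
    for d in doc_ids:
        stream.extend(vb_encode_number(d - prev))   # encode the gap
        prev = d
    return stream

def vb_decode(stream):
    doc_ids, n, prev = [], 0, 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte
        else:
            n = 128 * n + (byte - 128)
            prev += n             # gap -> absolute docID
            doc_ids.append(prev)
            n = 0
    return doc_ids

encoded = vb_encode([824, 829, 215406])
print(" ".join(f"{b:08b}" for b in encoded))
# 00000110 10111000 10000101 00001101 00001100 10110001
print(vb_decode(encoded))   # [824, 829, 215406]
```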

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g., K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection

67

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Sec 622
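A direct transcription of the weighting formula above (log base 10); the function name and example numbers are illustrative only.

```python
# Sketch: the tf-idf weight from the formula above.
import math

def tf_idf_weight(tf_td, df_t, N):
    """w(t,d) = (1 + log10 tf) * log10(N / df); zero if the term is absent."""
    if tf_td == 0 or df_t == 0:
        return 0.0
    return (1 + math.log10(tf_td)) * math.log10(N / df_t)

# Term occurring 3 times in a doc, appearing in 100 of 1,000,000 docs:
print(tf_idf_weight(3, 100, 1_000_000))   # ~5.91
```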

CS3245 ndash Information Retrieval

Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents"
Key idea 2: Rank documents according to their proximity to the query in this space

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_i q_i d_i,  for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute
Approximating the actual correct results
Skipping unnecessary documents
In actual data: dealing with zones and fields, query term proximity

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
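The algorithm figure referenced on this slide is not reproduced in the transcript; below is a minimal term-at-a-time sketch in its spirit, assuming postings of (docID, normalized weight) and pre-normalized query weights (illustrative structures, not the course skeleton).

```python
# Sketch: term-at-a-time cosine scoring with accumulators and a top-K selection.
import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, K=10):
    """query_weights: term -> query weight (already length-normalized).
       postings: term -> list of (docID, length-normalized doc weight)."""
    scores = defaultdict(float)                 # accumulators
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_tq * w_td       # add this term's contribution
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])

postings = {"cat": [(1, 0.8), (2, 0.3)], "dog": [(2, 0.9)]}
print(cosine_scores({"cat": 0.7, "dog": 0.7}, postings, K=2))
# doc 2 scores ~0.84, doc 1 scores ~0.56
```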

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N
A does not necessarily contain the top K, but has many docs from among the top K
Return the top K docs in A

Think of A as pruning non-contenders

The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics: Index Elimination: High-idf query terms only; Docs containing many query terms

Champion lists: Generalized to high-low lists and tiered index

Impact ordering postings: Early termination, idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q,d) = g(d) + cosine(q,d)

Can use some other linear combination than an equal weighting

Indeed, any function of the two "signals" of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: Year = 1601 is an example of a field
Field or parametric index: postings for each field value
Sometimes build range (B-tree) trees (e.g., for dates)
Field query typically treated as conjunction (doc must be authored by Shakespeare)

Zone: A zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, …
Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9: IR Evaluation
How do we know if our results are any good? Benchmarks, Precision and Recall, Composite measures, Test collection

A/B Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: precision-recall curve – Precision (y-axis, 0.0 to 1.0) plotted against Recall (x-axis, 0.0 to 1.0).]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: Agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement.

Kappa (K) = [P(A) - P(E)] / [1 - P(E)]
P(A) – proportion of time judges agree
P(E) – what agreement would be by chance

Gives 0 for chance agreement, 1 for total agreement
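A small worked sketch of the kappa computation for two judges making binary relevance judgments; the judgments are made up for illustration, and P(E) is taken from the judges' pooled marginals (one common convention).

```python
# Sketch: kappa for two judges over binary (relevant / nonrelevant) judgments.
def kappa(judge_a, judge_b):
    n = len(judge_a)
    p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n   # P(A)
    # Chance agreement estimated from the judges' pooled marginals.
    p_rel = (judge_a.count(1) + judge_b.count(1)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2                      # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

# Hypothetical judgments on 10 documents (1 = relevant, 0 = nonrelevant):
a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
print(round(kappa(a, b), 3))   # 0.583
```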

CS3245 ndash Information Retrieval

A/B testing
Purpose: Test a single innovation
Prerequisite: You have a large system with many users
Have most users use old system
Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation
Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on first result
Now we can directly see if the innovation does improve user happiness
Probably the evaluation methodology that large search engines trust most

80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3: XML IR: Basic XML concepts, Challenges in XML IR, Vector space model for XML IR, Evaluation of XML IR

Chapter 9: Query Refinement
1. Relevance Feedback

Document Level

Explicit RF – Rocchio (1971). When does it work?

Variants: Implicit and Blind

2. Query Expansion: Term Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j

D_r = set of known relevant doc vectors
D_nr = set of known irrelevant doc vectors
Different from C_r and C_nr, as we only get judgments from a few documents
α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
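A numpy sketch of the practical Rocchio update above; the α, β, γ values and the vectors are illustrative only.

```python
# Sketch: Rocchio relevance feedback update on tf-idf vectors.
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    q = alpha * q0
    if relevant:
        q += beta * np.mean(relevant, axis=0)       # centroid of D_r
    if nonrelevant:
        q -= gamma * np.mean(nonrelevant, axis=0)   # centroid of D_nr
    return np.maximum(q, 0)   # negative weights are usually clipped to 0

q0 = np.array([1.0, 0.0, 0.5])
dr = [np.array([0.8, 0.4, 0.0]), np.array([0.6, 0.6, 0.2])]
dnr = [np.array([0.0, 0.9, 0.9])]
print(rocchio(q0, dr, dnr))   # reweighted query vector
```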

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents

In query expansion, users give additional input (good/bad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the term-document matrix
w_ij = (normalized) weight for (t_i, d_j)

For each t_i, pick terms with high values in C

[Figure: the M × N term-document matrix A, with row t_i and column d_j.]

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
Aim to have each dimension of the vector space encode a word together with its position within the XML tree

How? Map XML documents to lexicalized subtrees

85

[Figure: an XML document Book(Title: "Microsoft", Author: "Bill Gates") is decomposed into lexicalized subtrees "with words": Microsoft, Bill, Gates, Title–Microsoft, Author–Bill, Author–Gates, Book–Title–Microsoft, …]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:

CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|) if c_q matches c_d, and 0 otherwise

|c_q| and |c_d| are the number of nodes in the query path and document path, respectively

c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>   (e.g., author/"Bill")

Inverted index (dictionary → postings):
<c1, t>  →  <d1, 0.5>  <d4, 0.1>  <d9, 0.2>      (e.g., author/"Bill")
<c2, t>  →  <d2, 0.25> <d3, 0.1>  <d12, 0.09>    (e.g., title/"Bill")
<c3, t>  →  <d3, 0.7>  <d6, 0.8>  <d9, 0.5>      (e.g., book/author/firstname/"Bill")

This example is slightly different from the book

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.6

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(Context Resemblance × Query Term Weight × Document Term Weight; all weights have been normalized)

OK to ignore Query vs <c2, t> since CR = 0.0
Query vs <c1, t>; Query vs <c3, t>

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of the quantized relevance values Q over the components in A.

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11
1. Probabilistic Approach to Retrieval: Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions:
"Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – naïve assumption)

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: We can rank documents using the log odds ratios for the terms in the query, c_t:

c_t = log [ (p_t / (1 − p_t)) / (u_t / (1 − u_t)) ]

The odds ratio is the ratio of two odds:
(i) the odds of the term appearing if relevant ( p_t / (1 − p_t) ), and
(ii) the odds of the term appearing if nonrelevant ( u_t / (1 − u_t) )

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents

c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q ∩ d} c_t

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms

tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
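The BM25 formula itself is not reproduced on this slide; the sketch below gives the standard document-side scoring function with the usual k1 and b parameters, for reference (names and numbers are illustrative).

```python
# Sketch: Okapi BM25 score of a document for a query (document-side weighting).
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N,
               k1=1.2, b=0.75):
    """doc_tf: term -> tf in this document; df: term -> document frequency."""
    score = 0.0
    for t in query_terms:
        if t not in doc_tf or t not in df:
            continue
        idf = math.log10(N / df[t])
        tf = doc_tf[t]
        norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)  # length normalization
        score += idf * (k1 + 1) * tf / (norm + tf)
        # For a long query, each term would additionally be multiplied by
        # (k3 + 1) * tf_tq / (k3 + tf_tq), as described above.
    return score

doc_tf = {"information": 3, "retrieval": 2}
print(bm25_score(["information", "retrieval"], doc_tf, doc_len=120,
                 avg_doc_len=100, df={"information": 500, "retrieval": 300},
                 N=10_000))
```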

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: In either case you build an information retrieval scheme in the exact same way
Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR
Each document is treated as (the basis for) a language model
Given a query q, rank documents based on P(d|q)

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, so ignore
P(d) is the prior – often treated as the same for all d
But we can give a prior to "high-quality" documents, e.g., those with high static quality score g(d) (cf. Section 7.1.4)
P(q|d) is the probability of q given d
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model: Summary

P(q|d) ∝ Π_{t ∈ q} ( λ P(t|M_d) + (1 − λ) P(t|M_c) )

What we model: The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
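A sketch of query-likelihood scoring with the linear mixture above; λ and the counts are illustrative, and the component probabilities are simple maximum-likelihood estimates.

```python
# Sketch: query-likelihood ranking with Jelinek-Mercer (mixture) smoothing.
import math

def lm_score(query_terms, doc_counts, doc_len, coll_counts, coll_len, lam=0.5):
    """log P(q|d) with P(t|Md) mixed with the collection model P(t|Mc)."""
    log_p = 0.0
    for t in query_terms:
        p_doc = doc_counts.get(t, 0) / doc_len        # P(t | Md), MLE
        p_coll = coll_counts.get(t, 0) / coll_len     # P(t | Mc), MLE
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0:
            return float("-inf")   # term unseen everywhere
        log_p += math.log(p)
    return log_p

doc = {"click": 2, "through": 1}
collection = {"click": 50, "through": 40, "rate": 30}
print(lm_score(["click", "rate"], doc, doc_len=10,
               coll_counts=collection, coll_len=10_000))
```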

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling
Politeness: need to crawl only the allowed pages and space out the requests for a site over a longer period (hours, days)

Duplicates: need to integrate duplicate detection

Scale: need to use a distributed approach

Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994
Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
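A minimal power-iteration sketch matching the steps above, for a small hypothetical transition matrix A (teleporting assumed to be folded in already).

```python
# Sketch: power iteration x, xA, xA^2, ... until the distribution stabilizes.
import numpy as np

def pagerank_power_iteration(A, tol=1e-10, max_iter=1000):
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                 # start with any distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A         # one step of the Markov chain
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

# Toy 3-page transition matrix (rows sum to 1, teleporting already included):
A = np.array([[0.10, 0.80, 0.10],
              [0.45, 0.10, 0.45],
              [0.10, 0.80, 0.10]])
print(pagerank_power_iteration(A).round(3))   # steady-state distribution a
```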

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: Given graph of links, build matrix A. From it compute a. The pagerank a_i is a scaled number between 0 and 1

Query processing: Retrieve pages meeting query. Rank them by their pagerank. Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives
You have learnt how to build your own search engine

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 16: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detection: The web is full of duplicated content, more so than many other collections

Exact duplicates: Easy to detect: use hash/fingerprint (e.g., MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates: Compute similarity with an edit-distance measure. We want "syntactic" (as opposed to semantic) similarity. True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call "is/isn't a near-duplicate"

E.g., two documents are near-duplicates if similarity > θ = 80%

24

Sec 196

CS3245 ndash Information Retrieval

Recall: Jaccard coefficient. A commonly used measure of overlap of two sets. Let A and B be two sets; the Jaccard coefficient is JACCARD(A, B) = |A ∩ B| / |A ∪ B|

JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅. A and B don't have to be the same size. Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example: Three documents: d1: "Jack London traveled to Oakland"

d2: "Jack London traveled to the city of Oakland"

d3: "Jack traveled from Oakland to London"

Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?

J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0. Note: very sensitive to dissimilarity

26

Sec 196
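A small Python sketch of the computation above, building bigram shingles and taking the set overlap:

def shingles(text, n=2):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

d1 = shingles("Jack London traveled to Oakland")
d2 = shingles("Jack London traveled to the city of Oakland")
d3 = shingles("Jack traveled from Oakland to London")
print(jaccard(d1, d2), jaccard(d1, d3))   # 0.375 and 0.0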

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents. For example, for n = 3, "a rose is a rose is a rose"

would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose }

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting: We can map shingles to a large integer space

[1 .. 2^m] (e.g., m = 64) by fingerprinting. We use sk to refer to the shingle's fingerprint in [1 .. 2^m]

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches: The number of shingles per document is large,

difficult to exhaustively compare. To make it fast, we use a sketch, a subset of the

shingles of a document. The size of a sketch is, say, n = 200 and is defined by a

set of permutations π1, …, π200. Each πi is a random permutation on [1 .. 2^m]

The sketch of d is defined as < min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) >

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element: a permutation of the original hashes

[Figure: the hash sets of document 1 and document 2, each reduced to one sketch element under permutation π]

We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?" In this case, permutation π says d1 ≈ d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di), S(dj)) ≅ P(xi = xj)

Let C00 = # of rows in A where both entries are 0; define C11, C10, C01 likewise

J(Si, Sj) is then equivalent to C11 / (C10 + C01 + C11); P(xi = xj) is then equivalent to C11 / (C10 + C01 + C11)

31

[Figure: the incidence matrix A with its two columns Si and Sj, and four random row permutations π(1), π(2), π(3), π(n=200); under two of the permutations xi ≠ xj and under two xi = xj. In this example J(Si, Sj) = 2/5.]

We view a matrix A: 1 column per set of hashes. Element Aij = 1 if element i in

set Sj is present. Permutation π(n) is a random

reordering of the rows in A. xk is the first non-zero entry

in π(dk), i.e., the first shingle present in document k

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard: Thus, the proportion of successful permutations is the

Jaccard coefficient. Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success: proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus, to compute Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200

k/n = k/200 estimates J(d1, d2)

32

Sec 196
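An illustrative Python sketch of this estimator (not the production form: true random permutations are replaced by random linear hash functions, a common shortcut; the shingle fingerprints are toy values):

import random

def sketch(shingle_ids, seeds, m=2**61):
    # one "permutation" per (a, b) seed pair: pi(s) = (a*s + b) mod m
    return [min((a * s + b) % m for s in shingle_ids) for a, b in seeds]

def estimated_jaccard(sk1, sk2):
    return sum(1 for x, y in zip(sk1, sk2) if x == y) / len(sk1)

random.seed(0)
seeds = [(random.randrange(1, 2**61), random.randrange(2**61)) for _ in range(200)]
s1 = {1, 2, 3, 4, 5}          # fingerprints of d1's shingles (toy values)
s2 = {1, 2, 3, 7, 8}          # fingerprints of d2's shingles (toy values)
print(estimated_jaccard(sketch(s1, seeds), sketch(s2, seeds)))
# close to the true Jaccard 3/7 ≈ 0.43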

CS3245 ndash Information Retrieval

Shingling Summary: Input: N documents. Choose n-gram size for shingling, e.g., n = 5. Pick 200 random permutations, represented as hash

functions. Compute N sketches: the 200 × N matrix shown on

the previous slide, one row per permutation, one column per document

Compute pairwise similarities. Transitive closure of documents with similarity > θ. Index only one document from each equivalence

class

33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is true/false questions on various topics (20 marks)

Q2–Q9 are calculation / essay questions, topic specific (8~12 marks each). Don't forget your calculator!

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

e.g., assuming |V| = 11

Q2 (By Probability) "I don't want":
P(Aerosmith) = 0.11 × 0.11 × 0.11 = 1.3E-3
P(LadyGaga) = 0.15 × 0.05 × 0.15 = 1.1E-3
Winner: Aerosmith

42

I 2, eyes 2; don't 2, your 1; want 2, love 1; to 2, and 1; close 2, revenge 1; my 2 — Total Count 18

I 3, eyes 1; don't 1, your 3; want 3, love 2; to 1, and 2; close 1, revenge 2; my 1 — Total Count 20

|V| = Total # of word types in both lines
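A short Python sketch of the calculation above. It assumes (as the totals 18 and 20 suggest) that the table entries already include the +1 of add-1 smoothing, so each probability is simply count / total; the first table is taken to be the Aerosmith line and the second the Lady Gaga line.

def query_probability(counts, total, query_terms):
    p = 1.0
    for w in query_terms:
        p *= counts[w] / total     # add-1 already folded into the counts
    return p

aerosmith = {"i": 2, "don't": 2, "want": 2}    # relevant entries, total = 18
ladygaga  = {"i": 3, "don't": 1, "want": 3}    # relevant entries, total = 20
q = ["i", "don't", "want"]
print(query_probability(aerosmith, 18, q))     # ≈ 1.37e-3 (the 1.3E-3 above)
print(query_probability(ladygaga, 20, q))      # ≈ 1.1e-3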

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Intersection: 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID

Sec 13

45
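A minimal Python sketch of this linear-time intersection over docID-sorted postings lists:

def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]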

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

2 → 4 → 8 → 41 → 48 → 64 → 128, with skip pointers to 41 and 128
1 → 2 → 3 → 8 → 11 → 17 → 21 → 31, with skip pointers to 11 and 31

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase: a bigram model using words. For example, the text "Friends, Romans,

Countrymen" would generate the biwords: "friends romans", "romans countrymen"

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table: Each vocabulary term is hashed to an integer. Pros: Lookup is faster than for a tree: O(1)

Cons: No easy way to find minor variants:

judgment / judgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term? co*tion

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!

The solution: transform wild-card queries so that the * always occurs at the end

This gives rise to the Permuterm index. Alternative: k-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction: Isolated word correction: Given a lexicon and a character sequence Q, return the

words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:

1. Edit distance (Levenshtein distance)  2. Weighted edit distance  3. n-gram overlap

Context-sensitive spelling correction: Try all possible resulting phrases with one word

"corrected" at a time and suggest the one with the most hits

54

Sec 332
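A plain dynamic-programming sketch in Python of the first of those notions, Levenshtein edit distance:

def edit_distance(s, t):
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("judgement", "judgment"))   # 1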

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain a "big" main index. New docs go into a "small" (in memory) auxiliary index. Search across both, merge results. Deletions: Invalidation bit-vector for deleted docs. Filter docs output on a search result by this invalidation

bit-vector. Periodically re-index into one main index. Assuming T = total # of postings and n = size of the auxiliary

index, we touch each posting up to floor(T/n) times. Alternatively, we can apply logarithmic merge, in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge: Idea: maintain a series of indexes, each twice as large

as the previous one. Keep the smallest (Z0) in memory. Larger ones (I0, I1, …) on disk. If Z0 gets too big (> n), write to disk as I0,

or merge with I0 (if I0 already exists) as Z1

Either write/merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2

… etc. (loop for log levels)

Sec 45

60Information Retrieval


CS3245 ndash Information Retrieval

Week 6 Index Compression: Collection and vocabulary statistics: Heaps'

and Zipf's laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking. Store pointers to every kth term string. Example below: k = 4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers;
lose 4 bytes on term lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix –

store differences only. Used for the last k-1 terms in a block of k:
8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a1◊e2◊ic3◊ion

(Encodes "automat"; the numbers give the extra length beyond "automat")

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry. We store the list of docs containing a term in

increasing order of docID: computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:  824                 829        215406
gaps:                        5          214577
VB code: 00000110 10111000   10000101   00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
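A small Python sketch of VB encoding/decoding as used above: 7 payload bits per byte, with the high (continuation) bit set on the last byte of each gap.

def vb_encode_number(n):
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128            # set the high bit of the final byte
    return bytes_

def vb_decode(bytestream):
    numbers, n = [], 0
    for b in bytestream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
stream = [b for g in gaps for b in vb_encode_number(g)]
print(" ".join(format(b, "08b") for b in stream))
# 00000110 10111000 10000101 00001101 00001100 10110001
print(vb_decode(stream))         # [824, 5, 214577]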

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g., K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection

67

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): cos(q, d) = q · d

for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69
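A hedged Python sketch tying the last two slides together: (1 + log10 tf) document weights, idf-weighted query, length normalization, then cosine as a plain dot product. The exact weighting variant and the collection numbers below are illustrative only.

import math

def tf_weight(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf_weight(N, df):
    return math.log10(N / df)

def length_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {t: x / norm for t, x in vec.items()} if norm else vec

def cosine(q, d):
    # q and d are already length-normalized: cosine = dot product
    return sum(w * d.get(t, 0.0) for t, w in q.items())

N = 1000                                             # illustrative collection size
doc   = length_normalize({"car": tf_weight(3), "insurance": tf_weight(1)})
query = length_normalize({"car": idf_weight(N, 100), "insurance": idf_weight(N, 50)})
print(cosine(query, doc))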

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search Systems: Making the Vector Space Model more efficient to

compute: Approximating the actual correct results; Skipping unnecessary documents. In actual data: dealing with zones and fields, query

term proximity

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruning non-contenders

The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: precision-recall curve; precision on the y-axis vs. recall on the x-axis, both from 0.0 to 1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: Agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement

Kappa (K) = [P(A) - P(E)] / [1 - P(E)]. P(A) – proportion of time judges agree; P(E) – what agreement would be by chance

Gives 0 for chance agreement, 1 for total agreement
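A one-line Python transcription of that formula, with illustrative agreement numbers:

def kappa(p_agree, p_chance):
    return (p_agree - p_chance) / (1 - p_chance)

# e.g. judges agree on 80% of judgments; chance agreement would be 60%
print(kappa(0.80, 0.60))   # 0.5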

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new

system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion

(OEC) like clickthrough on first result. Now we can directly see if the innovation does improve user

happiness. Probably the evaluation methodology that large search

engines trust most

80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments

from a few documents. α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
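For reference, the Rocchio update the slide refers to (in the textbook's form), followed by a small numpy sketch; the α, β, γ values and vectors below are illustrative, not from the slide.

q_m = α·q_0 + β·(1/|Dr|)·Σ_{d_j ∈ Dr} d_j − γ·(1/|Dnr|)·Σ_{d_j ∈ Dnr} d_j

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    qm = alpha * q0
    if len(rel_docs):
        qm = qm + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        qm = qm - gamma * np.mean(nonrel_docs, axis=0)
    return np.clip(qm, 0, None)   # negative term weights are usually dropped

q0  = np.array([1.0, 0.0, 1.0])
dr  = np.array([[0.8, 0.4, 0.0], [0.6, 0.6, 0.2]])
dnr = np.array([[0.0, 0.9, 0.1]])
print(rocchio(q0, dr, dnr))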

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: The simplest way to compute one is based on term-term similarities in C = AA^T, where A is the term-document matrix: w_ij = (normalized) weight for (t_i, d_j)

For each t_i, pick terms with high values in C

[Figure: the M × N term-document matrix A, rows t_i, columns d_j]

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

[Figure: an XML document (a Book with Title "Microsoft" and Author "Bill Gates") mapped to its lexicalized subtrees, each subtree keeping the words at its leaves ("with words")]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path

cd in a document is the following context resemblance function CR:

|cq| and |cd| are the number of nodes in the query path and document path, respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes
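The function itself (as defined in the textbook, IIR Ch. 10):

CR(cq, cd) = (1 + |cq|) / (1 + |cd|)   if cq matches cd,   and 0 otherwise

For example, for cq = author/"Bill" (2 nodes) and cd = book/author/firstname/"Bill" (4 nodes), CR = 3/5 = 0.6, which is the 0.60 used in the next slide's example.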

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

query: <c1, t>

Inverted index dictionary: <c1, t>, <c2, t>, <c3, t>

postings:
<c1, t> → <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> → <d2, 0.25> <d3, 0.1> <d12, 0.9>
<c3, t> → <d3, 0.7> <d6, 0.8> <d9, 0.5>

(This example is slightly different from the book; all weights have been normalized.)

CR(c1, c1) = 1.0    CR(c1, c2) = 0.0    CR(c1, c3) = 0.60

(e.g., the query context and <c1, t> are author/"Bill", <c2, t> is title/"Bill", <c3, t> is book/author/firstname/"Bill")

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(Context Resemblance × Query Term Weight × Document Term Weight)

OK to ignore Query vs. <c2, t> since CR = 0.0; compute Query vs. <c1, t> and Query vs. <c3, t>

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): Traditionally used with the PRP

Assumptions: "Binary" (equivalent to Boolean): documents and

queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation

"Independence": no association between terms (not true, but works in practice – naïve assumption)

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios c_t for the terms in the query:

c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]

The odds ratio is the ratio of two odds:

(i) the odds of the term appearing if relevant, p_t / (1 − p_t), and

(ii) the odds of the term appearing if nonrelevant, u_t / (1 − u_t)

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents

c_t functions as a term weight, so that RSV_d = Σ_{t : x_t = q_t = 1} c_t

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model. If the query is long, we might also use similar weighting for query

terms:

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the

query. No length normalization of queries

(because retrieval is being done with respect to a single fixed query). The above tuning parameters should be set by optimization on a

development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
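For reference, the full BM25 formula with query-term weighting (in the textbook's notation; L_d is the document length and L_ave the average document length):

RSV_d = Σ_{t ∈ q} log(N / df_t) · [ (k1 + 1)·tf_td ] / [ k1·((1 − b) + b·(L_d / L_ave)) + tf_td ] · [ (k3 + 1)·tf_tq ] / [ k3 + tf_tq ]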

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and

'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in

the exact same way. Difference: for probabilistic IR, at the end you score

queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access

to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1 0 … 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
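A small power-iteration sketch in Python; the 3-page transition matrix A below is illustrative (rows sum to 1), not from the lecture.

import numpy as np

A = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.1, 0.5],
              [0.5, 0.4, 0.1]])

x = np.array([1.0, 0.0, 0.0])       # start with any distribution
for _ in range(100):                # multiply by increasing powers of A
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-12):
        break
    x = x_next
print(x)                            # the steady-state vector a (the pagerank)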

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 17: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

query: <c1, t>

Inverted index (dictionary -> postings):
<c1, t> -> <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> -> <d2, 0.25> <d3, 0.1> <d12, 0.09>
<c3, t> -> <d3, 0.7> <d6, 0.8> <d9, 0.5>

This example is slightly different from book

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.60

If wq = 1.0, then sim(q, d9) = (1.0 x 1.0 x 0.2) + (0.6 x 1.0 x 0.5) = 0.5
(each product is Context Resemblance x Query Term Weight x Document Term Weight; all weights have been normalized)

Example contexts: the query and c1 are author/"Bill", c2 is title/"Bill", c3 is book/author/firstname/"Bill"

OK to ignore Query vs <c2, t> since CR = 0.0; compare Query vs <c1, t> and Query vs <c3, t>
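A compact sketch (illustrative, not the book's pseudocode) of how this SimNoMerge-style score can be accumulated from such an index:

# context-term postings: context path -> {docID: normalized term weight}
index = {
    "c1": {"d1": 0.5, "d4": 0.1, "d9": 0.2},
    "c2": {"d2": 0.25, "d3": 0.1, "d12": 0.09},
    "c3": {"d3": 0.7, "d6": 0.8, "d9": 0.5},
}
# context resemblance of the query context against each index context
cr = {"c1": 1.0, "c2": 0.0, "c3": 0.6}
w_q = 1.0   # query term weight

scores = {}
for context, postings in index.items():
    if cr[context] == 0.0:          # contexts with CR = 0 can be skipped
        continue
    for doc, w_d in postings.items():
        scores[doc] = scores.get(doc, 0.0) + cr[context] * w_q * w_d

print(scores["d9"])   # 1.0*1.0*0.2 + 0.6*1.0*0.5 = 0.5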

CS3245 ndash Information Retrieval

INEX relevance assessments: The relevance-coverage combinations are quantized by a quantization function Q

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of Q(rel(c), cov(c)) over the components c in A

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR

Chapter 11
1. Probabilistic Approach to Retrieval / Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): Traditionally used with the PRP

Assumptions:
Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g. document d represented by vector x = (x1, ..., xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice – a naïve assumption)

Sec 11.3

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios for the terms in the query, c_t:

c_t = log [ (p_t / (1 - p_t)) / (u_t / (1 - u_t)) ]

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 - u_t)

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents

c_t functions as a term weight, so that RSV_d = the sum of c_t over the query terms t appearing in document d

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 11.3.1

91
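A toy sketch (not from the slides) of the "sum c_t in accumulators" step, assuming the per-term weights c_t have already been estimated:

# assumed precomputed log odds ratio weights c_t for the query terms
c = {"jaguar": 2.1, "car": 0.7, "the": 0.0}

docs = {
    "d1": {"the", "jaguar", "car"},
    "d2": {"the", "car"},
    "d3": {"the", "zoo", "jaguar"},
}

query = ["jaguar", "car"]
scores = {d: 0.0 for d in docs}
for t in query:
    for d, terms in docs.items():
        if t in terms:
            scores[d] += c[t]        # accumulate RSV_d = sum of c_t

print(sorted(scores.items(), key=lambda kv: -kv[1]))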

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model. If the query is long, we might also use similar weighting for query terms

tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 11.4.3

92
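The scoring formula this slide extends is not reproduced in the transcript; the sketch below follows the standard Okapi BM25 form including the k3 query-term component described above (treat it as illustrative rather than the exact lecture formulation):

import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, df, N,
               k1=1.5, b=0.75, k3=1.5):
    """Okapi BM25 with query-term weighting (useful for long queries).

    query_tf: {term: tf in query}; doc_tf: {term: tf in doc}
    df: {term: document frequency}; N: number of documents
    """
    score = 0.0
    for t, tf_tq in query_tf.items():
        tf_td = doc_tf.get(t, 0)
        if tf_td == 0:
            continue
        idf = math.log(N / df[t])
        doc_part = ((k1 + 1) * tf_td) / (
            k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)   # no query length norm
        score += idf * doc_part * query_part
    return score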

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case you build an information retrieval scheme in the exact same way. The difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

Sec 11.4.1

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q); by Bayes' rule, P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d

But we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4)

P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 12.2

CS3245 ndash Information Retrieval

Mixture model Summary

What we model: The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 12.2.2
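The mixture equation itself is not in the transcript; the usual linearly smoothed query-likelihood form is P(q|d) = Π_{t in q} (λ P(t|M_d) + (1 − λ) P(t|M_c)). A small sketch under that assumption, with made-up counts:

import math

def query_log_likelihood(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """log P(q|d) with Jelinek-Mercer smoothing against the collection model."""
    score = 0.0
    for t in query:
        p_doc = doc_tf.get(t, 0) / doc_len          # P(t | M_d)
        p_coll = coll_tf.get(t, 0) / coll_len       # P(t | M_c)
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

doc_tf = {"click": 2, "shears": 1}
coll_tf = {"click": 100, "shears": 5, "the": 5000}
print(query_log_likelihood(["click", "shears"], doc_tf, 10, coll_tf, 100000))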

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling

Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)

Duplicates: need to integrate duplicate detection

Scale: need to use a distributed approach

Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 20.1

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling
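A minimal sketch of honouring (and caching) robots.txt with Python's standard library; the host and user-agent values are placeholders:

from urllib import robotparser

_robots_cache = {}   # host -> parsed robots.txt, fetched once per site

def allowed(url, agent="searchengine", host="http://www.example.com"):
    if host not in _robots_cache:
        rp = robotparser.RobotFileParser(host + "/robots.txt")
        rp.read()                      # fetch and parse once, then reuse
        _robots_cache[host] = rp
    return _robots_cache[host].can_fetch(agent, url)

if allowed("http://www.example.com/yoursite/temp/page.html"):
    pass   # fetch the page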

Information Retrieval 98

Sec 20.2

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 21.2.2

Regardless of where we start, we eventually reach the steady state a.
1. Start with any distribution (say x = (1, 0, ..., 0))
2. After one step, we're at xA
3. After two steps, at xA^2, then xA^3 and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
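A NumPy sketch of exactly this power iteration; the 3-page transition matrix A below is a made-up example (teleportation assumed already folded into A):

import numpy as np

# row-stochastic transition matrix A of the random surfer (toy example)
A = np.array([[0.02, 0.49, 0.49],
              [0.49, 0.02, 0.49],
              [0.88, 0.06, 0.06]])

x = np.array([1.0, 0.0, 0.0])      # any starting distribution
for _ in range(100):               # multiply by A until x stops changing
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-10):
        break
    x = x_next

print(x)   # steady-state vector a: the PageRank of each page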

CS3245 ndash Information Retrieval

100

Pagerank summary. Pre-processing: Given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1

Query processing: Retrieve pages meeting the query, rank them by their pagerank. Order is query-independent

Sec 21.2.2

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

[Figure: a map of areas related to this module's topics: Adversarial IR, Boolean IR/VSM, Probabilistic IR, Computational Advertising, Recommendation Systems, Digital Libraries, Natural Language Processing, Question Answering, IR Theory, Distributed IR, Geographic IR, Social Network IR, Query Analysis]

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105



Page 18: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 19: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57
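A sketch of the core of SPIMI-Invert for a single block (assumes a token_stream of (term, docID) pairs, with documents arriving in docID order so postings end up sorted):

```python
from collections import defaultdict

def spimi_invert_block(token_stream):
    """token_stream yields (term, docID) pairs for one block."""
    index = defaultdict(list)               # term -> postings list, built on demand
    for term, doc_id in token_stream:
        postings = index[term]              # new dictionary entry created as needed
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)         # accumulate postings as they occur, no sorting
    # terms are sorted only when the block is written to disk
    return sorted(index.items())
```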

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both and merge results. Deletions: keep an invalidation bit-vector for deleted docs and filter the docs output on a search result by this invalidation bit-vector. Periodically re-index into one main index. Assuming T = total number of postings and n = size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge. Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk. If Z0 gets too big (> n), write it to disk as I0, or merge with I0 (if I0 already exists) as Z1.

Either write Z1 to disk as I1 (if there is no I1), or merge with I1 to form Z2, … etc.

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers.

Lose 4 bytes on term lengths.

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix – store differences only. Used for the last k-1 terms in a block of k:

8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion

(* marks the end of the encoded prefix "automat"; the digit before each ◊ gives the length of the extra suffix beyond "automat".)

Information Retrieval 63

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression – Postings file entry: We store the list of docs containing a term in increasing order of docID, e.g. computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:  824                 829        215406
gaps:                        5          214577
VB code: 00000110 10111000   10000101   00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable.

For a small gap (5), VB uses a whole byte.

Sec 53

512+256+32+16+8 = 824
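A sketch of VB encoding for a gap list, reproducing the bytes in the example above (continuation bit set on the last byte of each gap):

```python
def vb_encode_number(n):
    """VB-encode a single positive integer into a list of byte values."""
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128            # set the continuation bit on the final byte
    return bytes_

def vb_encode(gaps):
    return [b for g in gaps for b in vb_encode_number(g)]

print(vb_encode([824, 5, 214577]))
# [6, 184, 133, 13, 12, 177] == 00000110 10111000 10000101 00001101 00001100 10110001
```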

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight.

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.

Increases with the number of occurrences within a document. Increases with the rarity of the term in the collection.

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

67

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_i q_i d_i,  for length-normalized q and d.

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results, skipping unnecessary documents. In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
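The scoring pseudocode referenced on this slide did not survive extraction; here is a Python sketch of term-at-a-time cosine scoring with accumulators, using one common weighting choice (log-tf for documents, idf for the query). postings, df, doc_len and N are assumed in-memory structures, not the course's actual index format:

```python
import math
import heapq
from collections import defaultdict

def cosine_scores(query_terms, postings, df, doc_len, N, K=10):
    """postings[t] is a list of (docID, tf) pairs; df[t] the document frequency;
    doc_len[d] the document's vector length; N the collection size."""
    scores = defaultdict(float)                      # per-document accumulators
    for t in query_terms:
        if t not in postings:
            continue
        w_tq = math.log10(N / df[t])                 # idf weight for the query term
        for doc_id, tf in postings[t]:
            w_td = 1 + math.log10(tf)                # log tf weight for the document
            scores[doc_id] += w_tq * w_td
    for doc_id in scores:
        scores[doc_id] /= doc_len[doc_id]            # length normalization
    return heapq.nlargest(K, scores.items(), key=lambda x: x[1])
```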

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.

Think of A as pruning non-contenders.

The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: Year = 1601 is an example of a field. A field or parametric index has postings for each field value. Sometimes we build range trees (B-trees), e.g. for dates. A field query is typically treated as a conjunction (doc must be authored by shakespeare).

Zones: A zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References, … Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

(Figure: a precision-recall curve, with precision on the y-axis and recall on the x-axis, both ranging from 0.0 to 1.0.)

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A) − P(E)] / [1 − P(E)], where P(A) = proportion of the time the judges agree, and P(E) = what the agreement would be by chance.

Gives 0 for chance agreement 1 for total agreement
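A tiny sketch of the computation (the 0.8 / 0.5 figures are made-up illustration values):

```python
def kappa(p_agree, p_chance):
    """Kappa = [P(A) - P(E)] / [1 - P(E)]."""
    return (p_agree - p_chance) / (1 - p_chance)

# e.g. judges agree 80% of the time, chance agreement would be 50%:
print(kappa(0.8, 0.5))   # 0.6
```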

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3: XML IR – Basic XML concepts, Challenges in XML IR, Vector space model for XML IR, Evaluation of XML IR

Chapter 9: Query Refinement
1. Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.

α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton)
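The Rocchio update itself, in the "in practice" form with the symbols defined above (restated here since the equation did not survive extraction):

q_m = α · q_0 + β · (1/|Dr|) · Σ_{d_j ∈ Dr} d_j − γ · (1/|Dnr|) · Σ_{d_j ∈ Dnr} d_j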

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: The simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix and w_ij = the (normalized) weight for (t_i, d_j).

For each t_i, pick the terms with high values in C.

Sec 923

In NLTK Did you forget

84
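A numpy sketch of this computation; A, vocab and term_index are assumed inputs (the M × N weight matrix, a row-index-to-term list, and the row of the term of interest):

```python
import numpy as np

def co_occurrence_neighbors(A, vocab, term_index, k=5):
    """Return the k terms most similar to vocab[term_index] under C = A A^T."""
    C = A @ A.T                           # term-term similarity matrix (M x M)
    row = C[term_index].copy()
    row[term_index] = -np.inf             # ignore the term itself
    top = np.argsort(-row)[:k]
    return [vocab[i] for i in top]
```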

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

(Figure: an XML document with a Book node whose Title is "Microsoft" and whose Author is "Bill Gates" is mapped to its lexicalized subtrees "with words", e.g. Microsoft, Bill, Gates, Bill Gates, Title–Microsoft, Author–Bill Gates, Book–Title–Microsoft, …)

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

|cq| and |cd| are the number of nodes in the query path and document path, respectively.

cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
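The CR function referred to above (restated here since the formula did not survive extraction):

CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise.

E.g. a 2-node query path matching a 4-node document path gives (1 + 2) / (1 + 4) = 0.6, as in the example on the next slide.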

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>, e.g. author/"Bill"

Inverted index dictionary: <c1, t> (e.g. author/"Bill"), <c2, t> (e.g. title/"Bill"), <c3, t> (e.g. book/author/firstname/"Bill"), with postings such as <d1, 0.5>; <d4, 0.1> <d9, 0.2>; <d2, 0.25> <d3, 0.1> <d12, 0.09>; <d3, 0.7> <d6, 0.8> <d9, 0.5>. Each posting is <docID, normalized document term weight>. (This example is slightly different from the book.)

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.6

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each product is Context Resemblance × Query Term Weight × Document Term Weight; all weights have been normalized)

OK to ignore Query vs. <c2, t> since CR = 0.0; the contributions come from Query vs. <c1, t> and Query vs. <c3, t>.

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval; Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model; BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence: no association between terms (not true, but works in practice – naïve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing the RSV. We can rank documents using the log odds ratios for the terms in the query, c_t.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t).

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that summing it over the query terms present in a document gives that document's RSV. Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
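The term weight referred to above (restated here since the formula did not survive extraction):

c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ] = log [ p_t / (1 − p_t) ] + log [ (1 − u_t) / u_t ]

and the retrieval status value of a document is RSV_d = Σ_{t ∈ q, x_t = 1} c_t.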

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single, fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec 1143

92
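For reference, the full BM25 scoring function that this long-query weighting extends (restated here since the equation did not survive extraction):

RSV_d = Σ_{t ∈ q} log(N / df_t) · [ (k1 + 1) tf_td ] / [ k1 ((1 − b) + b (L_d / L_ave)) + tf_td ] · [ (k3 + 1) tf_tq ] / [ k3 + tf_tq ]

where L_d is the document length and L_ave the average document length in the collection.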

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
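The equation summarized above (query likelihood with a Jelinek-Mercer mixture of document and collection models):

P(d | q) ∝ P(d) · Π_{t ∈ q} [ λ P(t | M_d) + (1 − λ) P(t | M_c) ]

where M_d is the document's language model, M_c the collection model, and 0 < λ < 1.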

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1 0 … 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³ and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm multiply x by increasing powers of A until the product looks stable
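A numpy sketch of this power iteration (A is assumed to be the row-stochastic transition matrix with teleportation already folded in):

```python
import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    """Multiply a start distribution by increasing powers of A until it looks stable."""
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                       # start with x = (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A               # one step of the walk
        if np.abs(x_next - x).sum() < tol:
            return x_next            # stable: this is the steady state a
        x = x_next
    return x
```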

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103. Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105



BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 21: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches
The number of shingles per document is large; difficult to exhaustively compare.
To make it fast, we use a sketch: a subset of the shingles of a document.
The size of a sketch is, say, n = 200 and is defined by a set of permutations π1, …, π200.
Each πi is a random permutation on 1..2^m.
The sketch of d is defined as

  ⟨ min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) ⟩

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element: a permutation of the original hashes.

We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?"
In this case, permutation π says d1 ≈ d2.

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di), S(dj)) ≅ P(xi = xj)

We view a matrix A with one column per set of hashes: element A_ij = 1 if element i is present in set S_j.
Permutation π(n) is a random reordering of the rows in A.
x_k is the first non-zero entry in π(d_k), i.e., the first shingle present in document k.

Let C00 = # of rows in A where both entries are 0; define C11, C10, C01 likewise.
J(S_i, S_j) is then equivalent to C11 / (C10 + C01 + C11).
P(x_i = x_j) is then also equivalent to C11 / (C10 + C01 + C11).

[Figure: an example matrix A with columns S_i, S_j (here J(S_i, S_j) = 2/5), and four random permutations π(1), π(2), π(3), π(n=200); permutations whose first non-zero rows agree give x_i = x_j, the others give x_i ≠ x_j.]

31

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard
Thus, the proportion of successful permutations is the Jaccard coefficient.
Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s).
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial.
Estimator of the probability of success: the proportion of successes in n Bernoulli trials (n = 200).
Our sketch is based on a random selection of permutations.
Thus, to compute Jaccard, count the number k of successful permutations for ⟨d1, d2⟩ and divide by n = 200:
  k/n = k/200 estimates J(d1, d2).
32
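A small Python sketch of the estimator. Random permutations of the fingerprint space are simulated here with random affine hash functions, an illustrative assumption rather than the exact construction on the slides; shingles(), d1 and d2 are from the earlier sketch.

import random

random.seed(0)
m = 64
MASK = (1 << m) - 1
# n = 200 "permutations", simulated as random affine hashes (illustrative assumption)
hash_funcs = [(random.randrange(1, 1 << m), random.randrange(1 << m)) for _ in range(200)]

def sketch(shingle_set):
    fingerprints = [hash(s) & MASK for s in shingle_set]
    return [min(((a * f + b) & MASK) for f in fingerprints) for (a, b) in hash_funcs]

def estimate_jaccard(sk1, sk2):
    # proportion of "successful" permutations: positions where the minima agree
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

# e.g., estimate_jaccard(sketch(shingles(d1)), sketch(shingles(d2)))  -> noisy estimate of J(d1, d2)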

Sec 196

CS3245 ndash Information Retrieval

Shingling: Summary
Input: N documents.
Choose the n-gram size for shingling, e.g., n = 5.
Pick 200 random permutations, represented as hash functions.
Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document.
Compute pairwise similarities.
Take the transitive closure of documents with similarity > θ.
Index only one document from each equivalence class (a union-find sketch follows).
33
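For the last two steps, a hypothetical union-find sketch: group documents whose estimated similarity exceeds θ (transitive closure) and keep one representative per equivalence class. estimate_jaccard() is the helper from the previous sketch.

def near_duplicate_classes(sketches, theta=0.8):
    n = len(sketches)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if estimate_jaccard(sketches[i], sketches[j]) > theta:
                parent[find(i)] = find(j)   # union: builds the transitive closure

    classes = {}
    for i in range(n):
        classes.setdefault(find(i), []).append(i)
    # index only one document (e.g. the first) from each equivalence class
    return [members[0] for members in classes.values()]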

Sec 196

CS3245 ndash Information Retrieval

Summary
Search Advertising: a type of crowdsourcing – ask advertisers how much they want to spend, ask searchers how relevant an ad is. Auction mechanics.
Duplicate Detection: represent documents as shingles; calculate an approximation to the Jaccard by using random trials.

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Get Exam Results earlier via SMS on 3 Jun 2019.
Subscribe via NUS EduRec System by 20 May 2019.
For more information:
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleasehtml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution
Focus on topics covered from Weeks 5 to 12.
Q1 is true/false questions on various topics (20 marks).
Q2–Q9 are calculation / essay questions, topic specific (8–12 marks each). Don't forget your calculator!

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1: Ngram Models
A language model is created based on a collection of text and used to assign a probability to a sequence of words.
Unigram LM: bag of words.
Ngram LM: use n−1 tokens as the context to predict the nth token.
Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can be used to do information retrieval, as introduced in Week 11.

CS3245 ndash Information Retrieval

Unigram LM with Add-1 smoothing
Not used in practice, but most basic to understand.
Idea: add 1 count to all entries in the LM, including those that are not seen.

E.g., assuming |V| = 11 (the total # of word types in both lines):

Q2 (By Probability): "I don't want"
P(Aerosmith): .11 · .11 · .11 = 1.3E-3
P(LadyGaga): .15 · .05 · .15 = 1.1E-3
Winner: Aerosmith

Counts, line 1 (Total Count: 18): I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1
Counts, line 2 (Total Count: 20): I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2

42
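A minimal sketch of the add-1 estimate using the counts above; the dictionaries below just restate the two tables, and which song is which is left unlabeled here.

def add_one_prob(word, counts, vocab_size):
    total = sum(counts.values())
    return (counts.get(word, 0) + 1) / (total + vocab_size)

lm1 = {"I": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
       "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}   # total 18
lm2 = {"I": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
       "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}   # total 20

V = 11
query = ["I", "don't", "want"]
p1 = p2 = 1.0
for w in query:
    p1 *= add_one_prob(w, lm1, V)
    p2 *= add_one_prob(w, lm2, V)
print(p1, p2)   # the LM with the larger product wins the query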

CS3245 ndash Information Retrieval

Week 2: Basic (Boolean) IR
Basic inverted indexes: in-memory dictionary and on-disk postings.
Key characteristic: sorted order for postings.
Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size.

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Doc frequency information is also stored. (Why frequency?)

Sec 12

44

CS3245 ndash Information Retrieval

The merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

  Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
  Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
  Intersection: 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.
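A sketch of the linear-time merge on the two sorted postings lists above:

def intersect(p1, p2):
    # both lists sorted by docID; O(x + y)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]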

Sec 13

45

CS3245 ndash Information Retrieval

Week 3: Postings lists and Choosing terms
Faster merging of postings lists: skip pointers.
Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results.
How to do it? And where do we place skip pointers?

[Figure: postings list 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers to 41 and 128; postings list 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers to 11 and 31.]

47

Sec 23

CS3245 ndash Information Retrieval

Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words.
For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term.
Two-word phrase query processing is now immediate.

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

  <be: 993427;
    1: 7, 18, 33, 72, 86, 231;
    2: 3, 149;
    4: 17, 191, 291, 430, 434;
    5: 363, 367, …>

For phrase queries, we use a merge algorithm recursively at the document level.
Now we need to deal with more than just equality.

Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?

49

Sec 242

CS3245 ndash Information Retrieval

Choosing terms
Definitely language specific!
In English we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.

50

CS3245 ndash Information Retrieval

Week 4: The dictionary and tolerant retrieval
Data structures for the dictionary: hash tables, trees.
Learning to be tolerant:
1. Wildcards: general trees, permuterm, n-grams redux
2. Spelling correction: edit distance, n-grams re-redux
3. Phonetic – Soundex
51

CS3245 ndash Information Retrieval

Hash Table
Each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree: O(1).
Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, we need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries
How can we handle *'s in the middle of a query term? E.g., co*tion
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wild-card queries so that the * always occurs at the end.
This gives rise to the Permuterm index.
Alternative: k-gram index with postfiltering.

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction
Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q.
How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits.
54
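A standard dynamic-programming sketch of Levenshtein distance with unit costs (weighted edit distance would just change the cost terms):

def edit_distance(a, b):
    # dp[i][j] = cost of transforming a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / copy
    return dp[len(a)][len(b)]

print(edit_distance("judgment", "judgement"))  # 1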

Sec 332

CS3245 ndash Information Retrieval

Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing – merge sort is effective for disk-based sorting (avoid seeks).
Single-Pass In-Memory Indexing: no global dictionary – generate a separate dictionary for each block; don't sort postings – accumulate postings as they occur.
Distributed indexing using MapReduce.
Dynamic indexing: multiple indices, logarithmic merge.

CS3245 ndash Information Retrieval

BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID).
As terms are of variable length, create a dictionary to map terms to 4-byte termIDs; these are generated as we parse docs.
Must now sort 100M 8-byte records by termID.
Define a block as ~10M such records: can easily fit a couple into memory; will have 10 such blocks for our collection.
Basic idea of the algorithm: accumulate postings for each block, sort, write to disk; then merge the blocks into one long sorted order.

CS3245 ndash Information Retrieval

SPIMI: Single-pass in-memory indexing
Key idea 1: generate separate dictionaries for each block – no need to maintain a term–termID mapping across blocks.
Key idea 2: don't sort; accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block.
These separate indices can then be merged into one big index.

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing: MapReduce Data flow

[Figure: input split into segment files; parsers (map phase), assigned by a master, emit (term, docID) pairs partitioned into key ranges a-b, c-d, …, y-z; inverters (reduce phase) collect each key-range partition and write its postings.]

58

CS3245 ndash Information Retrieval

Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index; search across both and merge results.
Deletions: keep an invalidation bit-vector for deleted docs; filter the docs output on a search result by this invalidation bit-vector.
Periodically re-index into one main index.
Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to ⌊T/n⌋ times.
Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.

CS3245 ndash Information Retrieval

Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one.
Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge with I0 (if I0 already exists) as Z1.
Either write merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2; … and so on, looping over the log levels.

Sec 45

60

CS3245 ndash Information Retrieval

Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws.
Compression to make the index smaller and faster.
Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding.
Postings compression: gap encoding.

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k = 4.
Need to store term lengths (1 extra byte):

  …7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

62

Sec 52

CS3245 ndash Information Retrieval

Front coding
Sorted words commonly have a long common prefix – store differences only.
Used for the last k−1 terms in a block of k:

  8automata8automate9automatic10automation
  → 8automat*a1◊e2◊ic3◊ion

("automat" is encoded once; each later entry stores only the extra length beyond "automat".)
Begins to resemble general string compression.

Information Retrieval 63

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID:
  computer: 33, 47, 154, 159, 202, …
Consequence: it suffices to store gaps:
  33, 14, 107, 5, 43, …
Hope: most gaps can be encoded/stored with far fewer than 20 bits.
64

CS3245 ndash Information Retrieval

Variable Byte Encoding: Example
  docIDs:  824                 829        215406
  gaps:                        5          214577
  VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

Postings stored as the byte concatenation:
  00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.
(Check: 00000110 10111000 decodes to 6 × 128 + 56 = 512 + 256 + 32 + 16 + 8 = 824.)
65
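A sketch of VB encoding and decoding of a gap list: 7 payload bits per byte, with the continuation bit set on the last byte of each gap, matching the slide.

def vb_encode_number(n):
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                    # set continuation bit on the last byte
    return out

def vb_encode(gaps):
    stream = []
    for g in gaps:
        stream.extend(vb_encode_number(g))
    return stream

def vb_decode(stream):
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]               # first "gap" is the first docID itself
encoded = vb_encode(gaps)
print([format(b, '08b') for b in encoded])   # matches the bytes above
print(vb_decode(encoded) == gaps)            # True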

CS3245 ndash Information Retrieval

Week 7: Vector space ranking
1. Represent the query as a weighted tf-idf vector.
2. Represent each document as a weighted tf-idf vector.
3. Compute the cosine similarity score for the query vector and each document vector.
4. Rank documents with respect to the query by score.
5. Return the top K (e.g., K = 10) to the user.

66

CS3245 ndash Information Retrieval

tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight:

  w(t,d) = (1 + log10 tf(t,d)) × log10(N / df(t))

Best known weighting scheme in IR. (Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.)
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.
67
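As a one-function sketch of this weighting:

from math import log10

def tf_idf_weight(tf_td, df_t, N):
    # w(t, d) = (1 + log10 tf) * log10(N / df); zero if the term is absent
    return (1 + log10(tf_td)) * log10(N / df_t) if tf_td > 0 else 0.0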

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors
Key idea 1: do the same for queries – represent them as vectors in the space; they are "mini-documents".
Key idea 2: rank documents according to their proximity to the query in this space.
proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity
Motivation: we want to rank more relevant documents higher than less relevant documents.

68

Sec 63

CS3245 ndash Information Retrieval

Cosine for length-normalized vectors
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

  cos(q, d) = q · d = Σ_i q_i d_i,   for length-normalized q and d

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results; skipping unnecessary documents.
In actual data: dealing with zones and fields, query term proximity.
Resources for today: IIR 7, 6.1
71

CS3245 ndash Information Retrieval

Recap Computing cosine scores
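The slide's pseudocode was an image in the original deck and is not in this transcript; a Python sketch of the usual term-at-a-time scoring with accumulators follows, with an assumed index layout (postings maps a term to (docID, weight) pairs and query term weights are taken as 1).

import heapq

def cosine_scores(query_terms, postings, doc_lengths, K=10):
    # postings: term -> list of (docID, tf-idf weight); doc_lengths: docID -> vector length
    scores = {}
    for t in query_terms:
        for doc_id, w_td in postings.get(t, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_td   # accumulate w(t,q) * w(t,d), query weight assumed 1
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]                  # length normalization
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])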

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach
Find a set A of contenders, with K < |A| ≪ N.
A does not necessarily contain the top K, but has many docs from among the top K.
Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.
73

CS3245 ndash Information Retrieval

Heuristics
Index elimination: high-idf query terms only; docs containing many query terms.
Champion lists: generalized to high-low lists and tiered indexes.
Impact-ordered postings: early termination, idf-ordered terms.
Cluster pruning.

74

Sec 711

CS3245 ndash Information Retrieval

Net score
Consider a simple total score combining cosine relevance and authority:

  net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting; indeed, any function of the two "signals" of user happiness.
Now we seek the top K docs by net score.

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices
Fields: "Year = 1601" is an example of a field. A field or parametric index has postings for each field value; sometimes we build range (B-tree) trees (e.g., for dates). A field query is typically treated as a conjunction (the doc must be authored by Shakespeare).
Zones: a zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References. Build inverted indexes on zones as well, to permit querying.
76

Sec 61

CS3245 ndash Information Retrieval

Week 9: IR Evaluation
How do we know if our results are any good?
Benchmarks; precision and recall; composite measures; test collections; A/B testing.

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve; precision on the y-axis vs. recall on the x-axis, both from 0.0 to 1.0.]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: an agreement measure among judges, designed for categorical judgments; corrects for chance agreement.

  Kappa (K) = [P(A) − P(E)] / [1 − P(E)]

P(A) – proportion of the time judges agree.
P(E) – what agreement would be by chance.
Gives 0 for chance agreement, 1 for total agreement.
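A quick sketch with made-up numbers:

def kappa(p_agree, p_chance):
    # K = [P(A) - P(E)] / [1 - P(E)]
    return (p_agree - p_chance) / (1 - p_chance)

print(kappa(0.8, 0.5))   # illustrative values only -> 0.6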

CS3245 ndash Information Retrieval

A/B testing
Purpose: test a single innovation.
Prerequisite: you have a large system with many users.
Have most users use the old system; divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result.
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR
Chapter 10.3: XML IR – basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR.
Chapter 9: Query Refinement
1. Relevance Feedback (document level): explicit RF – Rocchio (1971); when does it work?; variants: implicit and blind.
2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation.
81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

  q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j

D_r = set of known relevant doc vectors; D_nr = set of known irrelevant doc vectors.
Different from C_r and C_nr, as we only get judgments from a few documents.
α, β, γ = weights (hand-chosen or set empirically).
Popularized in the SMART system (Salton).
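A numpy sketch of the update; the default weights below are illustrative hand-chosen values, not values from the lecture.

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # q0: query vector; rel_docs / nonrel_docs: 2D arrays of judged doc vectors
    q = alpha * np.asarray(q0, dtype=float)
    if len(rel_docs):
        q += beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q -= gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q, 0.0)   # negative term weights are usually clipped to zero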

CS3245 ndash Information Retrieval

Query Expansion
In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents.
In query expansion, users give additional input (good/bad search term) on words or phrases.

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
The simplest way to compute one is based on the term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix and w_ij = (normalized) weight for (t_i, d_j).
For each t_i, pick the terms with high values in C.

Sec 923

In NLTK Did you forget

84
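A numpy sketch with a toy 3×3 term-document matrix; the weights are made up for illustration.

import numpy as np

# A: M x N term-document matrix of (normalized) weights -- toy data
A = np.array([[0.5, 0.0, 0.7],
              [0.4, 0.9, 0.0],
              [0.0, 0.8, 0.6]])
C = A @ A.T                       # term-term similarity matrix
for i, row in enumerate(C):
    ranked = [j for j in np.argsort(-row) if j != i]
    print(f"term {i}: most related terms {ranked[:2]}")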

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
Aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.

[Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is decomposed into its lexicalized subtrees ("with words"): Microsoft, Bill Gates, Title/Microsoft, Author/Bill, Author/Gates, Book/Title/Microsoft, Book/Author/Bill-Gates, …]

85

CS3245 ndash Information Retrieval

Context resemblance
A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:

  CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|)  if c_q matches c_d,  0 otherwise

|c_q| and |c_d| are the number of nodes in the query path and document path, respectively.
c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes.
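A sketch of CR as defined above, treating a path as a list of node labels (the worked example on the next slide uses slightly different numbers):

def matches(cq, cd):
    # cq matches cd iff cq can be turned into cd by inserting extra nodes (subsequence test)
    it = iter(cd)
    return all(node in it for node in cq)

def context_resemblance(cq, cd):
    if matches(cq, cd):
        return (1 + len(cq)) / (1 + len(cd))
    return 0.0

print(context_resemblance(["author"], ["book", "author", "firstname"]))  # (1+1)/(1+3) = 0.5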

86

CS3245 ndash Information Retrieval

SimNoMerge example
(This example is slightly different from the book.)

Query: <c1, t>   e.g., author/"Bill"

Inverted index (all weights have been normalized):
  <c1, t>  →  <d1, 0.5>  <d4, 0.1>  <d9, 0.2>     e.g., author/"Bill"
  <c2, t>  →  <d2, 0.25> <d3, 0.1>  <d12, 0.9>    e.g., title/"Bill"
  <c3, t>  →  <d3, 0.7>  <d6, 0.8>  <d9, 0.5>     e.g., book/author/firstname/"Bill"

Context resemblance: CR(c1, c1) = 1.0; CR(c1, c2) = 0.0; CR(c1, c3) = 0.60.
OK to ignore query vs. <c2, t> since CR = 0.0.

If w_q = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(context resemblance × query term weight × document term weight, summed over query vs. <c1, t> and query vs. <c3, t>).

87

CS3245 ndash Information Retrieval

INEX relevance assessments
The relevance-coverage combinations are quantized with a quantization function Q.
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval; the quantization function Q instead allows us to grade each component as partially relevant.
The number of relevant components in a retrieved set A of components can then be computed as

  #(relevant items retrieved) = Σ_{c ∈ A} Q(rel(c), cov(c))

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval; Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model; BestMatch25 (Okapi)
Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM)
Traditionally used with the PRP.
Assumptions:
"Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g., document d is represented by the vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

Deriving a Ranking Function for Query Terms
So it boils down to computing the RSV. We can rank documents using the log odds ratios c_t for the terms in the query:

  c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 − u_t).
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents; c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that

  RSV_d = Σ_{t ∈ q ∩ d} c_t

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in the VSM.

Information Retrieval

Sec 1131

91
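A minimal sketch of the resulting scoring, with p_t and u_t supplied as dictionaries of per-term probabilities (an illustrative interface, not the lecture's notation):

from math import log

def c_t(p_t, u_t):
    # log odds ratio: log [ p(1-u) / (u(1-p)) ]
    return log(p_t * (1 - u_t) / (u_t * (1 - p_t)))

def rsv(doc_terms, query_terms, p, u):
    # sum c_t over query terms that appear in the document
    return sum(c_t(p[t], u[t]) for t in query_terms if t in doc_terms)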

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:

  RSV_d = Σ_{t ∈ q} log(N/df_t) · [ (k1 + 1) tf_td / ( k1((1−b) + b·L_d/L_ave) + tf_td ) ] · [ (k3 + 1) tf_tq / ( k3 + tf_tq ) ]

tf_tq: term frequency in the query q.
k3: tuning parameter controlling term frequency scaling of the query.
No length normalization of queries (because retrieval is being done with respect to a single, fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 are between 1.2 and 2, and b = 0.75.
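A sketch of the full scoring function under these assumptions; the parameter defaults are picked from the ranges quoted above.

from math import log10

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, N, df,
               k1=1.5, k3=1.5, b=0.75):
    # RSV_d = sum over query terms of idf * document tf factor * query tf factor
    score = 0.0
    for t, tf_tq in query_tf.items():
        if t not in doc_tf or t not in df:
            continue
        idf = log10(N / df[t])
        tf_td = doc_tf[t]
        doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
        score += idf * doc_part * query_part
    return score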

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case you build an information retrieval scheme in the exact same way.
Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR
Each document is treated as (the basis for) a language model.
Given a query q, rank documents based on P(d|q) ∝ P(q|d) P(d):
P(q) is the same for all documents, so ignore it.
P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g., those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model: Summary

  P(d|q) ∝ P(d) · Π_{t ∈ q} ( (1 − λ) P(t|M_c) + λ P(t|M_d) )

What we model: the user has a document in mind and generates the query from this document.
The equation represents the probability that the document that the user had in mind was, in fact, this one.

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days).
Duplicates: need to integrate duplicate detection.
Scale: need to use a distributed approach.
Spam and spider traps: need to integrate spam detection.

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994.
Example:

  User-agent: *
  Disallow: /yoursite/temp/

  User-agent: searchengine
  Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0)).
2. After one step we're at xA.
3. After two steps at xA², then xA³, and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.
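A numpy sketch of this power iteration on a toy transition matrix with teleportation already folded in (the numbers are made up):

import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    # A: row-stochastic transition matrix (teleportation already included)
    N = A.shape[0]
    x = np.zeros(N)
    x[0] = 1.0                    # start with any distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A            # one step of the random walk
        if np.abs(x_next - x).sum() < tol:
            return x_next         # product looks stable: steady state a
        x = x_next
    return x

# toy two-page example (illustrative numbers only)
A = np.array([[0.05, 0.95],
              [0.95, 0.05]])
print(pagerank(A))                # approximately [0.5, 0.5]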

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query and rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives
You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use.
NLTK – a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCH ADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATE DETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 22: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q). By Bayes' rule, P(d|q) = P(q|d) P(d) / P(q).

P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d.

But we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4).

P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model: Summary

What we model: The user has a document in mind and generates the query from this document. The query is scored by the mixture P(q|d) ∝ Π_{t∈q} ( λ P(t|M_d) + (1 − λ) P(t|M_c) ), combining the document model M_d with the collection model M_c.

The equation represents the probability that the document that the user had in mind was, in fact, this one.

95

Sec 1222
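A minimal sketch of scoring with this mixture (Jelinek-Mercer style smoothing); the count dictionaries and names below are assumptions made for illustration:

import math

def log_query_likelihood(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    # log P(q|d) with P(t|M_d) smoothed by the collection model P(t|M_c)
    log_p = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len      # P(t | M_d)
        p_col = coll_tf.get(t, 0) / coll_len    # P(t | M_c)
        p = lam * p_doc + (1 - lam) * p_col
        if p == 0.0:
            return float("-inf")                # term never seen, even in the collection
        log_p += math.log(p)
    return log_p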

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days).
Duplicates: need to integrate duplicate detection.
Scale: need to use a distributed approach.
Spam and spider traps: need to integrate spam detection.

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202
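In Python, the standard-library urllib.robotparser module can perform this check; the URLs and crawler name below are placeholders:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # placeholder site
rp.read()   # fetch once and cache per site, as the slide suggests

allowed = rp.can_fetch("MyCrawler", "http://www.example.com/yoursite/temp/page.html")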

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0)).
2. After one step, we're at xA.
3. After two steps, we're at xA^2, then xA^3, and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.
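A sketch of this power iteration in Python/NumPy, assuming A is already the teleport-adjusted, row-stochastic transition matrix (function and parameter names are illustrative):

import numpy as np

def pagerank_power_iteration(A, tol=1e-10, max_iter=1000):
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                     # any starting distribution works
    for _ in range(max_iter):
        x_next = x @ A             # one step: multiply by A
        if np.abs(x_next - x).sum() < tol:
            break                  # the product "looks stable"
        x = x_next
    return x_next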

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: Given the graph of links, build matrix A. From it, compute a. The pagerank a_i is a scaled number between 0 and 1.

Query processing: Retrieve pages meeting the query. Rank them by their pagerank. Order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103 Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldn't work
  • Interest aggregation
  • SEARCH ADVERTISING
  • 1st Generation of Search Ads: Goto (1996)
  • 1st Generation of search ads: Goto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Google's second price auction
  • Google's second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATE DETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di), S(dj)) ≅ P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index Compression: Dictionary-as-a-String and Blocking
  • Front coding
  • Postings Compression: Postings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robots.txt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

[Embedded table fragment from the dictionary compression slides: columns Freq, Postings ptr, Term ptr]

Page 24: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: "Year = 1601" is an example of a field. A field or parametric index has postings for each field value. Sometimes build range trees (B-trees), e.g., for dates. A field query is typically treated as a conjunction (doc must be authored by shakespeare).

Zones: A zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, … Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation
How do we know if our results are any good?
Benchmarks
Precision and Recall
Composite measures
Test collection

A/B Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve, with Precision on the y-axis and Recall on the x-axis, both ranging from 0.0 to 1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: an agreement measure among judges, designed for categorical judgments; corrects for chance agreement.

Kappa (K) = [P(A) - P(E)] / [1 - P(E)], where P(A) is the proportion of the time the judges agree and P(E) is what the agreement would be by chance.

Gives 0 for chance agreement, 1 for total agreement.
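
A small Python sketch of this for two judges making binary relevance calls; it estimates P(E) from each judge's own marginals (Cohen's variant), whereas the book's worked example pools the marginals, but the formula above is applied the same way:

def kappa(judgments_a, judgments_b):
    n = len(judgments_a)
    p_agree = sum(a == b for a, b in zip(judgments_a, judgments_b)) / n   # P(A)
    pa = sum(judgments_a) / n          # P(judge A says relevant)
    pb = sum(judgments_b) / n          # P(judge B says relevant)
    p_chance = pa * pb + (1 - pa) * (1 - pb)                              # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]     # 1 = relevant, 0 = non-relevant (toy judgments)
b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
print(round(kappa(a, b), 3))           # about 0.583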

CS3245 ndash Information Retrieval

A/B testing
Purpose: Test a single innovation.
Prerequisite: You have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness.

Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3 XML IR: Basic XML concepts; Challenges in XML IR; Vector space model for XML IR; Evaluation of XML IR

Chapter 9 Query Refinement:
1. Relevance Feedback (Document Level)
   Explicit RF – Rocchio (1971): When does it work?
   Variants: Implicit and Blind
2. Query Expansion (Term Level)
   Manual thesaurus
   Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.

α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
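
The Rocchio update itself appears as an equation in the book, q_m = α·q_0 + β·centroid(Dr) − γ·centroid(Dnr); below is a small numpy sketch of it (the weights 1.0/0.75/0.15 and the toy vectors are illustrative defaults, not values from the slide):

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    q_m = alpha * q0
    if rel_docs:
        q_m = q_m + beta * np.mean(rel_docs, axis=0)      # centroid of known relevant docs
    if nonrel_docs:
        q_m = q_m - gamma * np.mean(nonrel_docs, axis=0)  # centroid of known irrelevant docs
    return np.maximum(q_m, 0.0)                           # negative term weights are usually clipped

q0 = np.array([1.0, 0.0, 0.0, 0.0])                       # toy 4-term vocabulary
rel = [np.array([0.8, 0.6, 0.0, 0.0])]
nonrel = [np.array([0.0, 0.0, 0.9, 0.4])]
print(rocchio(q0, rel, nonrel))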

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents.

In query expansion, users give additional input (good/bad search term) on words or phrases.

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = A·A^T, where A is the M × N term-document matrix with rows t_i and columns d_j, and w_ij = (normalized) weight for (t_i, d_j).

For each t_i, pick terms with high values in C.

Sec 923

In NLTK Did you forget

84
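
A toy numpy sketch of this computation (the terms and weights are made up; we build C = A·A^T and read off each term's highest-scoring neighbour):

import numpy as np

terms = ["car", "auto", "insurance", "best"]
A = np.array([                    # M x N term-document matrix of (normalized) weights
    [0.9, 0.0, 0.7],
    [0.8, 0.1, 0.6],
    [0.0, 0.9, 0.8],
    [0.1, 0.8, 0.0],
])
C = A @ A.T                       # term-term similarity matrix
np.fill_diagonal(C, 0.0)          # ignore self-similarity
for i, t in enumerate(terms):
    print(t, "->", terms[int(np.argmax(C[i]))])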

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees. Aim to have each dimension of the vector space encode a word together with its position within the XML tree.

How? Map XML documents to lexicalized subtrees.

85

[Figure: an XML document Book → (Title → Microsoft, Author → Bill Gates) decomposed "with words" into lexicalized subtrees such as Author–Bill, Author–Gates, Title–Microsoft, Book–Title–Microsoft, and the full Book subtree]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise

|cq| and |cd| are the number of nodes in the query path and document path, respectively.

cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
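
A tiny Python sketch of CR as defined above; treating the word itself as a node of the path reproduces the CR values used in the example on the next slide:

def context_resemblance(cq, cd):
    # cq matches cd iff cq is a subsequence of cd (cd = cq with extra nodes inserted)
    it = iter(cd)
    matches = all(node in it for node in cq)
    return (1 + len(cq)) / (1 + len(cd)) if matches else 0.0

c1 = ["author", "Bill"]
c2 = ["title", "Bill"]
c3 = ["book", "author", "firstname", "Bill"]
print(context_resemblance(c1, c1))   # 1.0
print(context_resemblance(c1, c2))   # 0.0
print(context_resemblance(c1, c3))   # 0.6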

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>, e.g., author/"Bill"

Inverted index (dictionary of context-term pairs, postings of <docID, normalized document term weight>):
<c1, t> (e.g., author/"Bill") → <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> (e.g., title/"Bill") → <d2, 0.25> <d3, 0.1> <d12, 0.9>
<c3, t> (e.g., book/author/firstname/"Bill") → <d3, 0.7> <d6, 0.8> <d9, 0.5>

This example is slightly different from the book. All weights have been normalized.

CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60

Each matching posting contributes Context Resemblance × Query Term Weight × Document Term Weight. It is OK to ignore the query vs <c2, t> since CR = 0.0; only query vs <c1, t> and query vs <c3, t> contribute.

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of the Q values of the components in A.

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval: Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions:
Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice – naïve assumption).


Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios c_t for the terms in the query.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t):

c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d = Σ c_t, summed over the query terms present in document d.

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
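
A small Python sketch of this scoring (the p_t and u_t estimates are made-up numbers; in practice they come from relevance judgments or the approximations discussed in the book):

import math

def term_weights(p, u):
    # c_t = log( p_t (1 - u_t) / ( u_t (1 - p_t) ) )
    return {t: math.log(p[t] * (1 - u[t]) / (u[t] * (1 - p[t]))) for t in p}

def rsv(doc_terms, c):
    # sum the c_t accumulators for query terms present in the document
    return sum(w for t, w in c.items() if t in doc_terms)

p = {"jaguar": 0.6, "car": 0.5}    # P(term present | relevant)
u = {"jaguar": 0.1, "car": 0.4}    # P(term present | non-relevant)
c = term_weights(p, u)
print(round(rsv({"jaguar", "car", "the"}, c), 3))   # about 3.008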

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:
tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values of k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec 1143

92
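
The full BM25 formula appears as an equation in the book; the sketch below follows that standard form, including the query-side k3 factor described above (all counts and the example call are illustrative):

import math

def bm25(query, doc_tf, doc_len, avg_doc_len, df, N, k1=1.5, b=0.75, k3=1.5):
    q_tf = {}
    for t in query:
        q_tf[t] = q_tf.get(t, 0) + 1
    score = 0.0
    for t, tf_tq in q_tf.items():
        if t not in df or t not in doc_tf:
            continue
        idf = math.log(N / df[t])
        tf_td = doc_tf[t]
        doc_part = (k1 + 1) * tf_td / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        query_part = (k3 + 1) * tf_tq / (k3 + tf_tq)    # no length normalization of queries
        score += idf * doc_part * query_part
    return score

print(round(bm25(["car", "insurance"], {"car": 3, "insurance": 1},
                 doc_len=120, avg_doc_len=100,
                 df={"car": 5000, "insurance": 2000}, N=100000), 3))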

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. The difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR
Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q):

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, so ignore it.
P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g., those with high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.

So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
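
The equation referred to (an image on the slide) is the book's mixture model, P(d|q) ∝ P(d) · Π_{t in q} [ λ·P(t|M_d) + (1 − λ)·P(t|M_c) ]; a minimal Python sketch of the query-likelihood part, with toy counts:

def query_likelihood(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    p = 1.0
    for t in query:
        p_doc = doc_tf.get(t, 0) / doc_len       # P(t | M_d), MLE from the document
        p_coll = coll_tf.get(t, 0) / coll_len    # P(t | M_c), MLE from the collection
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

doc = {"frog": 2, "said": 1, "toad": 1, "likes": 1, "that": 1, "dog": 1}   # toy document counts
coll = {"frog": 10, "said": 30, "toad": 5, "likes": 20, "that": 100, "dog": 15}
print(query_likelihood(["frog", "toad"], doc, doc_len=7, coll_tf=coll, coll_len=1000))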

CS3245 ndash Information Retrieval

Week 12 Web Search
Chapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling
Politeness: need to crawl only the allowed pages and space out the requests for a site over a longer period (hours, days)
Duplicates: need to integrate duplicate detection
Scale: need to use a distributed approach
Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps, at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable.
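
A short numpy sketch of this power iteration (the 3-page transition matrix, with teleporting already folded in, is made up for illustration):

import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                          # start with any distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A                  # one step of the random walk
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next
    return x

A = np.array([[0.02, 0.49, 0.49],       # row-stochastic transition matrix
              [0.49, 0.02, 0.49],
              [0.49, 0.49, 0.02]])
print(pagerank(A).round(3))             # steady-state vector a; entries sum to 1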

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: Given the graph of links, build matrix A. From it, compute a. The pagerank a_i is a scaled number between 0 and 1.

Query processing: Retrieve pages meeting the query. Rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 25: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 26: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

[Figure: MapReduce data flow. Segment files (splits of the collection) go to parsers in the map phase; the master assigns term partitions (a-b, c-d, …, y-z) to inverters in the reduce phase, which write out the postings.]

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector. Periodically re-index into one main index. Assuming T = total # of postings and n = size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one. Keep smallest (Z0) in memory; larger ones (I0, I1, …) on disk. If Z0 gets too big (> n), write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers.

Lose 4 bytes on term lengths.

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix; store differences only. Used for last k-1 terms in a block of k:
8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a 1◊e 2◊ic 3◊ion

(automat* encodes the common prefix automat; each number is the extra length beyond automat)

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID: computer: 33, 47, 154, 159, 202 …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43 …

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example

docIDs:   824                  829        215406
gaps:                          5          214577
VB code:  00000110 10111000    10000101   00001101 00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
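A sketch of VB encoding and decoding in Python; run on the gaps above, it reproduces exactly these byte patterns:

def vb_encode_number(n):
    """Variable byte encode one gap: 7 payload bits per byte, high bit set only on the final byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                                  # mark the last byte
    return bytes(out)

def vb_decode(stream):
    """Decode a concatenation of VB-encoded numbers."""
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte                      # continuation byte
        else:
            numbers.append(128 * n + (byte - 128))  # final byte of this number
            n = 0
    return numbers

encoded = b"".join(vb_encode_number(g) for g in [824, 5, 214577])
print(encoded.hex(' '))       # 06 b8 85 0d 0c b1  (the bit patterns shown above)
print(vb_decode(encoded))     # [824, 5, 214577]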

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

w(t,d) = (1 + log10 tf(t,d)) × log10(N / df(t))

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_i q_i d_i    for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results, skipping unnecessary documents.
In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
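A rough Python sketch of the term-at-a-time cosine scoring recapped here (a sketch, assuming postings already carry tf-idf weights and doc_length gives each document's vector length):

import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_length, k=10):
    """query_weights: term -> w(t,q); postings: term -> list of (docID, w(t,d)); doc_length: docID -> |d|."""
    scores = defaultdict(float)
    for term, wq in query_weights.items():
        for doc, wd in postings.get(term, []):
            scores[doc] += wq * wd                          # accumulate dot-product contributions
    for doc in scores:
        scores[doc] /= doc_length[doc]                      # normalize by document length (query length is constant)
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])   # top K (docID, score) pairs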

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| ≪ N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.

Think of A as pruning non-contenders.

The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q,d) = g(d) + cosine(q,d)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve, with recall on the x-axis and precision on the y-axis, both running from 0.0 to 1.0.]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A) − P(E)] / [1 − P(E)], where P(A) = proportion of time judges agree, P(E) = what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement
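A worked computation as a sketch (the counts are illustrative, in the style of the textbook example):

# 400 doc/query pairs: both judges say relevant 300 times, both non-relevant 70 times,
# only judge 1 says relevant 20 times, only judge 2 says relevant 10 times.
both_rel, both_non, only_1, only_2 = 300, 70, 20, 10
total = both_rel + both_non + only_1 + only_2                  # 400
p_a = (both_rel + both_non) / total                            # observed agreement = 0.925
p_rel = (2 * both_rel + only_1 + only_2) / (2 * total)         # pooled marginal P(relevant) = 0.7875
p_e = p_rel ** 2 + (1 - p_rel) ** 2                            # chance agreement ≈ 0.665
kappa = (p_a - p_e) / (1 - p_e)
print(round(kappa, 3))                                         # ≈ 0.776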

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

q_m = α q_0 + β · (1/|Dr|) Σ_{dj ∈ Dr} dj − γ · (1/|Dnr|) Σ_{dj ∈ Dnr} dj

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.
α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
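A possible numpy sketch of the Rocchio update; the weight values and vectors are illustrative only:

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of relevant docs and away from the non-relevant centroid."""
    q = alpha * q0
    if len(rel_docs):
        q = q + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q = q - gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q, 0)                        # negative term weights are usually clipped to zero

q0  = np.array([1.0, 0.0, 0.5])                    # original query vector
dr  = np.array([[0.8, 0.4, 0.0], [0.6, 0.6, 0.2]]) # known relevant doc vectors
dnr = np.array([[0.0, 0.9, 0.9]])                  # known non-relevant doc vectors
print(rocchio(q0, dr, dnr))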

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the M × N term-document matrix and w_ij = (normalized) weight for (t_i, d_j).

For each t_i, pick terms with high values in C

Sec 923

In NLTK Did you forget

84
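A toy numpy sketch of the idea: with rows of A length-normalized, C = AAᵀ holds cosine-like term-term similarities (the matrix below is made up):

import numpy as np

terms = ["ship", "boat", "voyage", "sweet"]
A = np.array([[1.0, 1.0, 1.0, 0.0],     # hypothetical term-document matrix, rows = terms, columns = docs
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0]])
A = A / np.linalg.norm(A, axis=1, keepdims=True)   # normalize rows
C = A @ A.T                                        # term-term similarity matrix
for i, t in enumerate(terms):
    best = np.argsort(-C[i])[1]                    # most similar other term (index 0 is the term itself)
    print(t, "->", terms[best])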

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

[Figure: an XML document Book(Title: Microsoft, Author: Bill Gates) and its lexicalized subtrees "with words", e.g. Author, Bill, Gates, Microsoft, Author-Bill, Author-Gates, Title-Microsoft, Book-Title-Microsoft, …; each such subtree becomes a dimension of the vector space.]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise

|cq| and |cd| are the number of nodes in the query path and document path, respectively.

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>   (e.g. author/"Bill")

Inverted index (dictionary and postings):
<c1, t> (e.g. author/"Bill"):                  <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> (e.g. title/"Bill"):                   <d2, 0.25> <d3, 0.1> <d12, 0.9>
<c3, t> (e.g. book/author/firstname/"Bill"):   <d3, 0.7> <d6, 0.8> <d9, 0.5>

This example is slightly different from the book.

Context resemblance: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60
(all weights have been normalized)

OK to ignore query vs <c2, t> since CR = 0.0. For query vs <c1, t> and query vs <c3, t>:
if wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(Context Resemblance × Query Term Weight × Document Term Weight)

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR

Chapter 11:
1. Probabilistic Approach to Retrieval (Basic Probability Theory)
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: We can rank documents using the log odds ratios c_t for the terms in the query.

The odds ratio is the ratio of two odds:
(i) the odds of the term appearing if relevant, p_t / (1 − p_t), and
(ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t):

c_t = log [ p_t / (1 − p_t) ] − log [ u_t / (1 − u_t) ]

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q∩d} c_t.

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single, fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
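A rough Python sketch of document-side BM25 scoring (the k3 query-frequency factor discussed above is omitted, as for short queries); the statistics, the simplified idf and the parameter values are illustrative:

import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N, k1=1.5, b=0.75):
    """doc_tf: term -> tf in this doc; df: term -> document frequency; N: number of documents."""
    score = 0.0
    for t in query_terms:
        if t not in doc_tf or t not in df:
            continue
        idf = math.log(N / df[t])                      # simplified idf variant
        tf = doc_tf[t]
        tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * tf_norm
    return score

print(bm25_score(["caesar", "brutus"], doc_tf={"caesar": 3, "brutus": 1}, doc_len=120,
                 avg_doc_len=100, df={"caesar": 40, "brutus": 25}, N=1000))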

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
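The mixture being summarized is P(q|d) ≈ Π_t [ λ P(t|Md) + (1 − λ) P(t|Mc) ]; a sketch of scoring it in log space with maximum-likelihood estimates (the counts below are made up):

import math

def query_log_likelihood(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """log P(q|d) under a Jelinek-Mercer mixture of the document and collection language models."""
    log_p = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len            # P(t | Md), maximum likelihood
        p_coll = coll_tf.get(t, 0) / coll_len         # P(t | Mc), maximum likelihood
        log_p += math.log(lam * p_doc + (1 - lam) * p_coll)
    return log_p

doc = {"revenue": 3, "down": 1}
collection = {"revenue": 500, "down": 800, "up": 900}
print(query_log_likelihood(["revenue", "down"], doc, doc_len=50, coll_tf=collection, coll_len=100000))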

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202
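Python's standard library can interpret rules like the example above; a small sketch with urllib.robotparser, feeding it the text directly instead of fetching it:

from urllib.robotparser import RobotFileParser

robots_lines = """User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_lines)
print(rp.can_fetch("mycrawler", "http://example.com/yoursite/temp/page.html"))      # False: blocked by the * record
print(rp.can_fetch("searchengine", "http://example.com/yoursite/temp/page.html"))   # True: empty Disallow allows everything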

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
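A small numpy sketch of this power iteration on a made-up 3-page transition matrix (teleporting already folded into A, so rows sum to 1):

import numpy as np

A = np.array([[0.05, 0.90, 0.05],     # hypothetical transition probability matrix
              [0.45, 0.10, 0.45],
              [0.05, 0.90, 0.05]])

x = np.array([1.0, 0.0, 0.0])         # start with any distribution
for _ in range(100):                  # multiply by increasing powers of A
    x_next = x @ A
    if np.allclose(x_next, x):        # the product looks stable: steady state reached
        break
    x = x_next
print(np.round(x, 3))                 # PageRank vector a, here [0.25 0.5 0.25]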

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 28: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45
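A small sketch of that linear-time intersection (function and variable names are my own, not from the slides):

```python
def intersect(p1, p2):
    # Walk both sorted postings lists in lockstep; O(len(p1) + len(p2)).
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```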

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results.
How to do it? And where do we place skip pointers?
[Figure: postings list 2 4 8 41 48 64 128 with skip pointers to 41 and 128; postings list 1 2 3 8 11 17 21 31 with skip pointers to 11 and 31]

Sec 23

CS3245 ndash Information Retrieval

Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words.
For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term.
Two-word phrase query-processing is now immediate.

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example
For phrase queries, we use a merge algorithm recursively, at the document level.
Now need to deal with more than just equality.
<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>
Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Week 4: The dictionary and tolerant retrieval
Data Structures for the Dictionary: Hash, Trees
Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic – Soundex
51

CS3245 ndash Information Retrieval

Hash Table
Each vocabulary term is hashed to an integer.
Pros: Lookup is faster than for a tree: O(1)
Cons: No easy way to find minor variants (judgment/judgement); no prefix search; if vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries
How can we handle *'s in the middle of a query term, e.g. co*tion?
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wild-card queries so that the * always occurs at the end.
This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering.

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction
Isolated word correction: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q.
How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. ngram overlap
Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits.
54

Sec 332
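A compact sketch of the (unweighted) Levenshtein edit distance by dynamic programming, as a reminder of the first alternative above:

```python
def edit_distance(a, b):
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(a)][len(b)]

print(edit_distance("judgment", "judgement"))  # 1
```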

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing: No global dictionary - Generate separate dictionary for each block; Don't sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2: Don't sort. Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing: MapReduce Data flow
58
[Figure: splits of the collection are assigned by the master to parsers in the map phase; parsers write segment files partitioned by term range (a-b, c-d, …, y-z); inverters in the reduce phase each collect one partition and write its postings]

CS3245 ndash Information Retrieval

Dynamic Indexing: 2nd simplest approach
Maintain "big" main index; new docs go into "small" (in memory) auxiliary index. Search across both, merge results.
Deletions: Invalidation bit-vector for deleted docs. Filter docs output on a search result by this invalidation bit-vector.
Periodically re-index into one main index. Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times.
Alternatively, we can apply Logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one.
Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write to disk as I0, or merge with I0 (if I0 already exists) as Z1.
Either write merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2.
… etc. (Loop for log levels)
Sec 45
60

CS3245 ndash Information Retrieval

Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k=4.
Need to store term lengths (1 extra byte)
62
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
Save 9 bytes on 3 pointers
Lose 4 bytes on term lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding
Sorted words commonly have a long common prefix – store differences only.
Used for the last k-1 terms in a block of k:
8automata8automate9automatic10automation
63
→ 8automat*a1◊e2◊ic3◊ion
("8automat*" encodes automat; the numbers give the extra length beyond automat)
Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID:
computer: 33, 47, 154, 159, 202 …
Consequence: it suffices to store gaps: 33, 14, 107, 5, 43 …
Hope: most gaps can be encoded/stored with far fewer than 20 bits

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs: 824, 829, 215406
gaps: -, 5, 214577
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001
65
Postings stored as the byte concatenation 00000110 10111000 10000101 00001101 00001100 10110001
Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.
Sec 53
512+256+32+16+8 = 824
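A small sketch of VB encoding/decoding as described above (continuation bit set on the last byte of each gap; the names are my own):

```python
def vb_encode_number(n):
    # Split n into 7-bit groups; set the high bit on the final byte.
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128
    return bytes_

def vb_encode(gaps):
    return [b for g in gaps for b in vb_encode_number(g)]

def vb_decode(bytestream):
    numbers, n = [], 0
    for b in bytestream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

encoded = vb_encode([824, 5, 214577])
print([f"{b:08b}" for b in encoded])  # the six bytes shown on the slide
print(vb_decode(encoded))             # [824, 5, 214577]
```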

CS3245 ndash Information Retrieval

Week 7: Vector space ranking
1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

CS3245 ndash Information Retrieval

tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight:
w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)
Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.
67

Sec 622
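That weight as a one-liner in code, sketched with log base 10 and a weight of 0 when the term is absent (names are my own):

```python
import math

def tf_idf_weight(tf, df, N):
    # w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term does not occur
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

print(tf_idf_weight(tf=3, df=100, N=1_000_000))  # rare term, a few occurrences
```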

CS3245 ndash Information Retrieval

Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents".
Key idea 2: Rank documents according to their proximity to the query in this space.
proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

Cosine for length-normalized vectors
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
cos(q, d) = q · d, for length-normalized q and d

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results; skipping unnecessary documents.
In actual data: dealing with zones and fields, query term proximity.
Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach
Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K.
Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.
73
Sec 711
[Figure: the collection of N docs, the contender set A, and the top K docs]

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score
Consider a simple total score combining cosine relevance and authority:
net-score(q,d) = g(d) + cosine(q,d)
Can use some other linear combination than an equal weighting. Indeed, any function of the two "signals" of user happiness.
Now we seek the top K docs by net score.

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text, e.g. Title, Abstract, References, …

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve
[Figure: a precision-recall curve, with precision on the y-axis and recall on the x-axis, both ranging from 0.0 to 1.0]
78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement
79
Sec 85
Kappa measure: Agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement.
Kappa (K) = [P(A) - P(E)] / [1 - P(E)]
P(A) – proportion of time judges agree
P(E) – what agreement would be by chance
Gives 0 for chance agreement, 1 for total agreement.
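A tiny helper for that formula, sketched with made-up binary relevance judgments and with P(E) computed from the pooled marginals as in the textbook's treatment:

```python
def kappa(judge1, judge2):
    n = len(judge1)
    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n
    # P(E): chance agreement from the pooled proportion of "relevant" judgments
    p_rel = (sum(judge1) + sum(judge2)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

j1 = [1, 1, 0, 1, 0, 0, 1, 1]   # hypothetical judgments, judge 1
j2 = [1, 0, 0, 1, 0, 1, 1, 1]   # hypothetical judgments, judge 2
print(round(kappa(j1, j2), 3))  # about 0.467 for this toy data
```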

CS3245 ndash Information Retrieval

A/B testing
Purpose: Test a single innovation. Prerequisite: You have a large system with many users.
Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation.
Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on first result.
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR
Chapter 10.3 XML IR: Basic XML concepts; Challenges in XML IR; Vector space model for XML IR; Evaluation of XML IR
Chapter 9 Query Refinement:
1. Relevance Feedback (Document Level): Explicit RF – Rocchio (1971); When does it work?; Variants: Implicit and Blind
2. Query Expansion (Term Level): Manual thesaurus; Automatic Thesaurus Generation
81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)
In practice:
Dr = set of known relevant doc vectors
Dnr = set of known irrelevant doc vectors
(Different from Cr and Cnr, as we only get judgments from a few documents)
α, β, γ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
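The equation on the slide does not survive the extraction; the standard Rocchio formula it refers to (from IIR ch. 9), written with the symbols defined above, is:

q_m = α q_0 + β (1/|Dr|) Σ_{d_j ∈ Dr} d_j − γ (1/|Dnr|) Σ_{d_j ∈ Dnr} d_j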

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix (rows t_i, columns d_j) and w_ij = (normalized) weight for (t_i, d_j).
For each t_i, pick terms with high values in C.

Sec 923

In NLTK Did you forget

84
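A sketch of that computation with numpy (the matrix here is a made-up toy; in practice A would be built from the index):

```python
import numpy as np

# Toy term-document matrix A (M terms x N docs), already weighted/normalized.
A = np.array([
    [0.9, 0.0, 0.7, 0.1],   # t0
    [0.8, 0.1, 0.6, 0.0],   # t1
    [0.0, 0.9, 0.0, 0.8],   # t2
])
C = A @ A.T                  # term-term similarity matrix C = A A^T

ti = 0
neighbors = np.argsort(-C[ti])                 # terms most similar to t0 first
print([int(t) for t in neighbors if t != ti])  # candidate thesaurus entries
```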

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
Aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.
85
[Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is mapped to its lexicalized subtrees "with words", e.g. Bill, Gates, Microsoft, Author–Bill, Author–Gates, Title–Microsoft, Book–Title–Microsoft, …]

CS3245 ndash Information Retrieval

Context resemblance
A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:
|cq| and |cd| are the number of nodes in the query path and document path, respectively.
cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
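The CR formula itself is lost in the extraction; the standard definition from IIR ch. 10, consistent with the values in the next example (e.g. CR(c1,c3) = 3/5 = 0.6), is:

CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise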

CS3245 ndash Information Retrieval

SimNoMerge example
87
[Figure: a query context <c1, t> (e.g. author/"Bill") is matched against an inverted index whose dictionary holds contexts <c1, t>, <c2, t> (e.g. title/"Bill"), <c3, t> (e.g. book/author/firstname/"Bill"), each with a postings list of <doc, weight> pairs; all weights have been normalized. This example is slightly different from the book.]
CR(c1,c1) = 1.0
CR(c1,c2) = 0.0
CR(c1,c3) = 0.60
OK to ignore Query vs <c2, t> since CR = 0.0.
If wq = 1.0, then sim(q,d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(Context Resemblance × Query Term Weight × Document Term Weight, summed over Query vs <c1,t> and Query vs <c3,t>)

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11: 1. Probabilistic Approach to Retrieval, Basic Probability Theory; 2. Probability Ranking Principle; 3. Binary Independence Model, BestMatch25 (Okapi)
Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM)
Traditionally used with the PRP.
Assumptions:
"Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g. document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV
We can rank documents using the log odds ratios for the terms in the query, c_t.
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 - u_t):
c_t = log [ p_t (1 - u_t) ] / [ u_t (1 - p_t) ]
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q ∩ d} c_t.
Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:
tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.

Sec 1143

92
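The weighting formula referred to above is lost in the extraction; the standard Okapi BM25 form with query-term weighting (IIR eq. 11.34), using the parameters named on the slide, is:

RSV_d = Σ_{t ∈ q} log(N / df_t) · [ (k1 + 1) tf_td / ( k1 ((1 − b) + b (L_d / L_ave)) + tf_td ) ] · [ (k3 + 1) tf_tq / (k3 + tf_tq) ]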

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case, you build an information retrieval scheme in the exact same way.
Difference: for probabilistic IR, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR
Each document is treated as (the basis for) a language model.
Given a query q, rank documents based on P(d|q):
P(q) is the same for all documents, so ignore. P(d) is the prior – often treated as the same for all d.
But we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
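The mixture-model equation this slide summarizes does not survive the extraction; the standard query-likelihood mixture from IIR ch. 12 (λ the smoothing weight, M_d the document LM, M_c the collection LM) is:

P(d | q) ∝ P(d) · Π_{t ∈ q} [ λ P(t | M_d) + (1 − λ) P(t | M_c) ]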

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994.
Example:
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps, at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a
Algorithm: multiply x by increasing powers of A until the product looks stable.
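A sketch of that power iteration with numpy (A here is a made-up 2-state transition matrix, assumed already teleport-adjusted; the loop is exactly the "multiply by increasing powers" recipe):

```python
import numpy as np

A = np.array([[0.1, 0.9],     # hypothetical Markov transition matrix
              [0.7, 0.3]])    # rows sum to 1

x = np.array([1.0, 0.0])      # start with any distribution
for _ in range(100):
    nxt = x @ A               # one more step of the chain
    if np.allclose(nxt, x):   # product looks stable -> steady state reached
        break
    x = nxt
print(x)                      # the steady-state (PageRank) vector a for this toy chain
```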

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives
You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – A good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldn't work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Google's second price auction
  • Google's second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di),S(dj)) ≅ P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Page 30: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary
Search Advertising: a type of crowdsourcing. Ask advertisers how much they want to spend; ask searchers how relevant an ad is. Auction mechanics.
Duplicate Detection: represent documents as shingles. Calculate an approximation to the Jaccard coefficient by using random trials.

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via the NUS EduRec System by 20 May 2019.
Get Exam Results earlier via SMS on 3 Jun 2019.
For more information: https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution
Focus on topics covered from Weeks 5 to 12.

Q1 is true/false questions on various topics (20 marks).

Q2–Q9 are calculation / essay questions, topic specific (8~12 marks each). Don't forget your calculator!

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1: Ngram Models
A language model is created based on a collection of text and used to assign a probability to a sequence of words.
Unigram LM: bag of words. Ngram LM: use n−1 tokens as the context to predict the nth token.
Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can be used to do information retrieval, as introduced in Week 11.

41

CS3245 ndash Information Retrieval

Unigram LM with Add-1 smoothing
Not used in practice, but most basic to understand. Idea: add 1 count to all entries in the LM, including those that are not seen.

e.g., assuming |V| = 11 (the total number of word types in both lines):

Aerosmith LM (counts after add-1 smoothing; Total Count 18):
I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1

LadyGaga LM (counts after add-1 smoothing; Total Count 20):
I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2

Q2 (By Probability): "I don't want"
P(Aerosmith) = .11 × .11 × .11 ≈ 1.3E-3
P(LadyGaga) = .15 × .05 × .15 ≈ 1.1E-3
Winner: Aerosmith

42
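A small Python sketch that reproduces the slide's arithmetic, treating the table entries as the already-smoothed counts; the dictionaries are transcribed from the tables, and an unseen query word would simply contribute its add-1 count of 1.

# Counts as given in the two tables above (already include the +1 smoothing).
aerosmith = {'i': 2, "don't": 2, 'want': 2, 'to': 2, 'close': 2, 'my': 2,
             'eyes': 2, 'your': 1, 'love': 1, 'and': 1, 'revenge': 1}   # total 18
ladygaga  = {'i': 3, "don't": 1, 'want': 3, 'to': 1, 'close': 1, 'my': 1,
             'eyes': 1, 'your': 3, 'love': 2, 'and': 2, 'revenge': 2}   # total 20

def prob(counts, query):
    total = sum(counts.values())
    p = 1.0
    for w in query.lower().split():
        p *= counts.get(w, 1) / total   # an unseen word keeps its smoothed count of 1
    return p

query = "i don't want"
print('Aerosmith', prob(aerosmith, query))   # ~1.3E-3
print('LadyGaga ', prob(ladygaga, query))    # ~1.1E-3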

CS3245 ndash Information Retrieval

Week 2: Basic (Boolean) IR
Basic inverted indexes: in-memory dictionary and on-disk postings.
Key characteristic: sorted order for postings.
Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size.

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Doc frequency information is also stored. Why frequency?

Sec 12

44

CS3245 ndash Information Retrieval

The merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Intersection: 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.

34

Sec 13

45
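A direct Python rendering of the merge on the example lists; representing postings as plain sorted lists of docIDs is a simplification.

def intersect(p1, p2):
    # Linear merge of two postings lists sorted by docID: O(x + y).
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]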

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

[Figure: postings 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers to 41 and 128; postings 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers to 11 and 31]

47

Sec 23

CS3245 ndash Information Retrieval

Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words.
For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term. Two-word phrase query-processing is now immediate.

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries, we use a merge algorithm recursively, at the document level. Now we need to deal with more than just equality.

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; …>

Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?

49

Sec 242
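A sketch of the positional extension for a two-word phrase query; the 'to' postings below are made up for illustration, and only the 'be' postings come from the slide.

def phrase_docs(pp1, pp2):
    # pp1, pp2: dicts mapping docID -> sorted positions of two adjacent query terms.
    hits = []
    for doc in sorted(set(pp1) & set(pp2)):
        positions2 = set(pp2[doc])
        if any(pos + 1 in positions2 for pos in pp1[doc]):
            hits.append(doc)
    return hits

to_postings = {1: [3, 17], 2: [1], 4: [190, 429], 5: [362]}    # illustrative
be_postings = {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
               4: [17, 191, 291, 430, 434], 5: [363, 367]}
print(phrase_docs(to_postings, be_postings))   # docs where "to be" occurs: [1, 4, 5]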

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary: Hash, Trees

Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic – Soundex

51

Week 4: The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table
Each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree: O(1).
Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries
How can we handle *'s in the middle of a query term? E.g., co*tion
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wild-card queries so that the * always occurs at the end.
This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering.

53

Sec 32
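A small Python sketch of the permuterm idea; the helper names are illustrative.

def permuterm_keys(term):
    # All rotations of term$; each rotation points back to the original term.
    augmented = term + '$'
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def wildcard_to_key(query):
    # Rotate a query with one '*' so the '*' ends up at the end, e.g. co*tion -> tion$co*
    prefix, suffix = query.split('*')
    return suffix + '$' + prefix + '*'

print(permuterm_keys('hello'))      # hello$, ello$h, llo$he, lo$hel, o$hell, $hello
print(wildcard_to_key('co*tion'))   # 'tion$co*': look up terms whose rotation starts with tion$co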

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits.

54

Sec 332
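A standard dynamic-programming sketch of Levenshtein (edit) distance in Python, with unit costs for insert, delete and substitute.

def edit_distance(s, t):
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # substitute (or copy)
    return dp[m][n]

print(edit_distance('cats', 'fast'))   # 3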

CS3245 ndash Information Retrieval

Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing. Merge sort is effective for disk-based sorting (avoid seeks).
Single-Pass In-Memory Indexing: no global dictionary - generate a separate dictionary for each block; don't sort postings - accumulate postings as they occur.
Distributed indexing using MapReduce. Dynamic indexing: multiple indices, logarithmic merge.

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57
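A compressed sketch of SPIMI-Invert for one block; a real implementation also watches available memory, and the written blocks are merged afterwards (names are illustrative).

from collections import defaultdict

def spimi_invert(token_stream, out_path):
    # One SPIMI block: build a fresh dictionary of postings lists, then sort terms and write.
    index = defaultdict(list)
    for term, doc_id in token_stream:
        postings = index[term]               # new terms are added on the fly, no termID mapping
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)          # postings accumulate as they occur
    with open(out_path, 'w') as f:
        for term in sorted(index):
            f.write(term + ' ' + ' '.join(map(str, index[term])) + '\n')
    return out_path

block = [('caesar', 1), ('brutus', 1), ('caesar', 2), ('noble', 2), ('brutus', 4)]
spimi_invert(iter(block), 'block0.txt')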

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

[Figure: MapReduce data flow. Segment files feed Parsers in the Map phase; the Master assigns tasks; the parsers write partitions a-b, c-d, …, y-z; Inverters in the Reduce phase read one partition each and produce the postings.]

CS3245 ndash Information Retrieval

Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both and merge the results.
Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs output on a search result by this bit-vector.
Periodically re-index into one main index. Assuming T is the total number of postings and n the size of the auxiliary index, we touch each posting up to floor(T/n) times.
Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
Either write merge Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2.
… and so on, looping over the log levels.

Sec. 4.5

60 Information Retrieval
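A sketch of the cascade in Python, using dictionaries as stand-in indexes and the number of terms as a stand-in for the posting budget n; this is illustrative only.

def merge(a, b):
    # Merge two toy {term: sorted list of docIDs} indexes.
    out = dict(a)
    for term, plist in b.items():
        out[term] = sorted(set(out.get(term, [])) | set(plist))
    return out

def maybe_flush(z0, disk, n):
    # disk[i] holds I_i (an index of size ~ n * 2**i) or None.
    # Once Z0 exceeds the budget n, push it down the levels, merging as we go.
    if len(z0) <= n:
        return z0, disk
    carry, level = z0, 0
    while True:
        if level == len(disk):
            disk.append(None)
        if disk[level] is None:
            disk[level] = carry                 # write Z_level to disk as I_level
            break
        carry = merge(disk[level], carry)       # I_level + Z_level -> Z_(level+1)
        disk[level] = None
        level += 1
    return {}, disk                             # fresh, empty in-memory index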

CS3245 ndash Information Retrieval

Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws.
Compression to make the index smaller and faster:
Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding.
Postings compression: gap encoding.

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k = 4.
Need to store term lengths (1 extra byte):

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

62

Sec 52

CS3245 ndash Information Retrieval

Front coding
Sorted words commonly have a long common prefix – store differences only. Used for the last k−1 terms in a block of k:

8automata8automate9automatic10automation → 8automat*a1◊e2◊ic3◊ion

"automat" is encoded once; each following term stores only its extra length beyond "automat".
Begins to resemble general string compression.

Information Retrieval 63

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID:
computer: 33, 47, 154, 159, 202, …
Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …
Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding: Example

docIDs:  824                829        215406
gaps:                       5          214577
VB code: 00000110 10111000  10000101   00001101 00001100 10110001

Postings stored as the byte concatenation 00000110 10111000 10000101 00001101 00001100 10110001.
Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.
(Check: 512 + 256 + 32 + 16 + 8 = 824.)

65

Sec. 5.3
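A Python sketch of VB encoding/decoding that reproduces the bytes above.

def vb_encode_number(n):
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128          # set the continuation bit on the last byte
    return bytes_

def vb_encode(numbers):
    out = []
    for n in numbers:
        out.extend(vb_encode_number(n))
    return out

def vb_decode(bytestream):
    numbers, n = [], 0
    for byte in bytestream:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

print([format(b, '08b') for b in vb_encode([824, 5, 214577])])
print(vb_decode(vb_encode([824, 5, 214577])))   # [824, 5, 214577]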

CS3245 ndash Information Retrieval

Week 7: Vector space ranking
1. Represent the query as a weighted tf-idf vector.
2. Represent each document as a weighted tf-idf vector.
3. Compute the cosine similarity score for the query vector and each document vector.
4. Rank documents with respect to the query by score.
5. Return the top K (e.g., K = 10) to the user.

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf × idf.
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.

67

Sec. 6.2.2
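The same weight as a small Python function (log base 10; weight 0 for an absent term).

import math

def tf_idf_weight(tf, df, N):
    # w(t,d) = (1 + log10 tf) * log10(N / df), and 0 if the term is absent.
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# e.g. a term occurring 3 times in a doc, appearing in 100 of 1,000,000 docs:
print(tf_idf_weight(tf=3, df=100, N=1_000_000))   # (1 + 0.477) * 4 ≈ 5.91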

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

Cosine for length-normalized vectors
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
cos(q, d) = q · d, for length-normalized q and d.

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results; skipping unnecessary documents.
In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
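The algorithm figure did not survive extraction; here is a hedged term-at-a-time sketch with accumulators, assuming a postings dict and precomputed document lengths (the names are illustrative, not from the textbook code).

def cosine_scores(query_terms, postings, doc_lengths, K=10):
    # query_terms: term -> query weight; postings: term -> list of (docID, document weight);
    # doc_lengths: docID -> vector length. Accumulate, normalize, then pick the top K.
    scores = {}
    for term, w_tq in query_terms.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * w_td
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:K]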

CS3245 ndash Information Retrieval

Generic approach
Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but it has many docs from among the top K. Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.

73

Sec. 7.1.1

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score
Consider a simple total score combining cosine relevance and authority:

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve; precision on the y-axis and recall on the x-axis, both running from 0.0 to 1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Kappa measure: an agreement measure among judges, designed for categorical judgments; it corrects for chance agreement.

Kappa (κ) = [P(A) − P(E)] / [1 − P(E)]

P(A) – proportion of the time the judges agree
P(E) – what agreement would be by chance
Gives 0 for chance agreement, 1 for total agreement.

Information Retrieval 79

Sec. 8.5
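A small sketch of the computation for two judges with binary relevance labels, using pooled marginals for P(E) (one common convention; others exist).

def kappa(judgments_a, judgments_b):
    pairs = list(zip(judgments_a, judgments_b))
    n = len(pairs)
    p_agree = sum(a == b for a, b in pairs) / n
    # Pooled marginals: chance that both say "relevant" or both say "nonrelevant".
    p_rel = (sum(judgments_a) + sum(judgments_b)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
b = [1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
print(kappa(a, b))   # raw agreement 0.8, corrected for chance to ~0.58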

CS3245 ndash Information Retrieval

A/B testing
Purpose: test a single innovation. Prerequisite: you have a large system with many users.
Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result.
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.

80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3: XML IR. Basic XML concepts, challenges in XML IR, a vector space model for XML IR, evaluation of XML IR.

Chapter 9: Query Refinement
1. Relevance Feedback (document level): explicit RF – Rocchio (1971); when does it work?; variants: implicit and blind.
2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation.

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

q_m = α q_0 + β (1/|Dr|) Σ_{dj ∈ Dr} d_j − γ (1/|Dnr|) Σ_{dj ∈ Dnr} d_j

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. These differ from Cr and Cnr, as we only get judgments from a few documents.
α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton)
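A sketch of the update in Python; the values α = 1.0, β = 0.75, γ = 0.15 are illustrative defaults, not prescribed by the slide.

import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr); negative weights clipped to 0.
    qm = alpha * q0
    if len(relevant):
        qm = qm + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        qm = qm - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(qm, 0.0)

q0  = np.array([1.0, 0.0, 0.0, 1.0])
dr  = np.array([[0.9, 0.1, 0.0, 0.8], [0.7, 0.0, 0.2, 0.9]])
dnr = np.array([[0.0, 0.9, 0.8, 0.1]])
print(rocchio(q0, dr, dnr))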

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
The simplest way to compute one is based on term-term similarities in C = AA^T, where A is the M × N term-document matrix (rows t_i, columns d_j) and w_ij is the (normalized) weight for (t_i, d_j).
For each t_i, pick the terms with high values in C.

In NLTK: did you forget?

Sec. 9.2.3

84

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
Aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees ("with words").

85

[Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is mapped to its lexicalized subtrees, e.g. Microsoft, Bill Gates, Title/Microsoft, Author/Bill, Author/Gates, Book/Title/Microsoft, …]

CS3245 ndash Information Retrieval

Context resemblance
A simple measure of the similarity of a path cq in a query and a path cd in a document is the context resemblance function CR (see the formula after this slide).
|cq| and |cd| are the number of nodes in the query path and document path, respectively.
cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
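The function itself did not survive extraction; its standard form (IIR 10.3) is:

\mathrm{CR}(c_q, c_d) = \begin{cases} \dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\ 0 & \text{otherwise} \end{cases}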

CS3245 ndash Information Retrieval

SimNoMerge example

Query: <c1, t>, e.g. author/"Bill" with query term weight wq.

Inverted index (dictionary → postings):
<c1, t> (e.g. author/"Bill")                → <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> (e.g. title/"Bill")                 → <d2, 0.25> <d3, 0.1> <d12, 0.09>
<c3, t> (e.g. book/author/firstname/"Bill") → <d3, 0.7> <d6, 0.8> <d9, 0.5>

This example is slightly different from the book. All weights have been normalized.

Query vs <c1, t>: CR(c1, c1) = 1.0
Query vs <c2, t>: CR(c1, c2) = 0.0 (OK to ignore, since CR = 0.0)
Query vs <c3, t>: CR(c1, c3) = 0.60

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each product is Context Resemblance × Query Term Weight × Document Term Weight)

87

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval; Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model; BestMatch25 (Okapi)
Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM)
Traditionally used with the PRP.
Assumptions:
"Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g., document d is represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV
We can rank documents using the log odds ratios for the terms in the query, c_t.
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 − u_t):

c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q ∩ d} c_t.
Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.

Sec 1143

92
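A sketch of the full scoring function in Python; this is one common formulation of BM25 with query-term weighting, and the exact idf variant differs across presentations.

import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, N, df, k1=1.5, b=0.75, k3=1.5):
    # query_terms: term -> tf in the query; doc_tf: term -> tf in the document;
    # N: collection size; df: term -> document frequency.
    score = 0.0
    for term, tf_tq in query_terms.items():
        tf_td = doc_tf.get(term, 0)
        if tf_td == 0 or term not in df:
            continue
        idf = math.log10(N / df[term])
        doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
        score += idf * doc_part * query_part
    return score

q = {'information': 1, 'retrieval': 2}
print(bm25_score(q, {'information': 3, 'retrieval': 1}, doc_len=120,
                 avg_doc_len=100, N=10000, df={'information': 800, 'retrieval': 200}))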

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
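The equation did not survive extraction; the mixture referred to here is the Jelinek-Mercer form used in IIR 12.2.2, with λ as the smoothing weight:

P(q \mid d) \propto P(d) \prod_{t \in q} \bigl( \lambda\, P(t \mid M_d) + (1 - \lambda)\, P(t \mid M_c) \bigr)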

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994.
Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution, say x = (1, 0, …, 0).
2. After one step, we're at xA.
3. After two steps, at xA^2, then xA^3 and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.
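A sketch of this power iteration in Python, assuming A is already the teleport-adjusted, row-stochastic transition matrix; the 2×2 matrix is a toy example.

import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    # Multiply x by increasing powers of A until the product stops changing.
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                       # start with any distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

A = np.array([[0.1, 0.9],            # toy 2-page web, teleportation already folded in
              [0.5, 0.5]])
print(pagerank(A))                    # steady-state vector a ≈ [0.357, 0.643]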

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use.
NLTK – a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 ndash Information Retrieval

103  Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 31: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 32: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Estimating Jaccard: Thus the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success: proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus, to compute Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200

k/n = k/200 estimates J(d1, d2)

32
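To make the estimator concrete, here is a rough Python sketch; the shingle size, the use of Python's built-in hash with a random seed as a stand-in for a random permutation, and the example strings are illustrative assumptions, not the lecture's exact setup.

import random

def shingles(text, n=3):
    # word n-grams as the shingle set (n = 3 is an illustrative choice)
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def sketch(shingle_set, seeds):
    # one min-hash value per "permutation"; each permutation is simulated
    # by hashing the shingle together with a random seed
    return [min(hash((seed, s)) for s in shingle_set) for seed in seeds]

def estimate_jaccard(sk1, sk2):
    # proportion of successful permutations (matching minima) = k / n
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

random.seed(0)
seeds = [random.getrandbits(32) for _ in range(200)]   # n = 200 Bernoulli trials
d1 = shingles("the quick brown fox jumps over the lazy dog again and again")
d2 = shingles("the quick brown fox jumped over a lazy dog again and again")
print(estimate_jaccard(sketch(d1, seeds), sketch(d2, seeds)))   # approximates J(d1, d2)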

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary: Input: N documents. Choose n-gram size for shingling, e.g., n = 5. Pick 200 random permutations, represented as hash functions

Compute N sketches: 200 × N matrix shown on the previous slide, one row per permutation, one column per document

Compute pairwise similarities. Transitive closure of documents with similarity > θ. Index only one document from each equivalence class

33

Sec 196

CS3245 ndash Information Retrieval

Summary: Search Advertising: a type of crowdsourcing. Ask advertisers how much they want to spend; ask searchers how relevant an ad is. Auction mechanics

Duplicate Detection: Represent documents as shingles. Calculate an approximation to the Jaccard by using random trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation / essay questions, topic specific (8~12 marks each). Don't forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability): "I don't want"
P(Aerosmith) = 0.11 × 0.11 × 0.11 = 1.3E-3
P(LadyGaga) = 0.15 × 0.05 × 0.15 = 1.1E-3
Winner: Aerosmith

42

Add-1 smoothed counts, Aerosmith line (Total Count 18):
I 2 | don't 2 | want 2 | to 2 | close 2 | my 2 | eyes 2 | your 1 | love 1 | and 1 | revenge 1

Add-1 smoothed counts, LadyGaga line (Total Count 20):
I 3 | don't 1 | want 3 | to 1 | close 1 | my 1 | eyes 1 | your 3 | love 2 | and 2 | revenge 2

|V| = 11 = total # of word types in both lines
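A minimal Python sketch of the same calculation; the two token lists below are hypothetical stand-ins for the lyric lines behind the tables, chosen only so that the totals (18 and 20, with |V| = 11) come out as on the slide.

from collections import Counter

def add_one_unigram_lm(tokens, vocab):
    # P(w) = (count(w) + 1) / (len(tokens) + |V|)
    counts = Counter(tokens)
    denom = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / denom for w in vocab}

def query_probability(lm, query_tokens):
    p = 1.0
    for w in query_tokens:
        p *= lm[w]
    return p

line_a = "i dont want to close my eyes".split()                # hypothetical tokens
line_b = "i want your love and i want your revenge".split()    # hypothetical tokens
vocab = set(line_a) | set(line_b)                              # |V| = 11
lm_a = add_one_unigram_lm(line_a, vocab)
lm_b = add_one_unigram_lm(line_b, vocab)
query = "i dont want".split()
print(query_probability(lm_a, query), query_probability(lm_b, query))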

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing: Intersection by linear time "merging"

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequency information is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

Brutus:  2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar:  1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Merged (Brutus AND Caesar):  2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID

Sec 13

45
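The linear-time merge, as a short Python sketch (the Brutus/Caesar lists are the ones in the figure above):

def intersect(p1, p2):
    # walk the two sorted postings lists simultaneously: O(x + y)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]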

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

47

[Figure: two postings lists with skip pointers, 2 → 4 → 8 → 41 → 48 → 64 → 128 with skips to 41 and 128, and 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skips to 11 and 31]

Sec 23
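A hedged sketch of how the merge can follow skip pointers; the sqrt-spaced placement and the dict representation of skips are illustrative choices, not the only option.

import math

def add_skips(postings):
    # place evenly spaced skip pointers, roughly sqrt(len) apart: index -> target index
    step = int(math.sqrt(len(postings))) or 1
    return {i: i + step for i in range(0, len(postings) - step, step)}

def intersect_with_skips(p1, p2):
    s1, s2 = add_skips(p1), add_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            # take a skip only if it does not overshoot the other pointer
            i = s1[i] if i in s1 and p1[s1[i]] <= p2[j] else i + 1
        else:
            j = s2[j] if j in s2 and p2[s2[j]] <= p1[i] else j + 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128], [1, 2, 3, 8, 11, 17, 21, 31]))   # [2, 8]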

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase: bigram model using words. For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans", "romans countrymen"

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; …>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50
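As a rough sketch of that pipeline in Python with NLTK (the toolkit used in this module); it assumes the usual NLTK data packages (punkt, stopwords) have been downloaded, and the exact choices here (Porter stemming, dropping non-alphabetic tokens) are illustrative.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def choose_terms(text):
    tokens = nltk.word_tokenize(text)                     # tokenization
    tokens = [t.lower() for t in tokens if t.isalpha()]   # normalization (case folding)
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop]         # stop word removal
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]              # stemming

print(choose_terms("Friends, Romans, countrymen, lend me your ears"))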

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table: Each vocabulary term is hashed to an integer. Pros: Lookup is faster than for a tree: O(1)

Cons: No easy way to find minor variants:

judgment / judgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term? co*tion

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive

The solution: transform wild-card queries so that the * always occurs at the end

This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering

53

Sec 32
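A small Python sketch of the permuterm idea; the dictionary terms below are made up for illustration.

def permuterm_keys(term):
    # every rotation of term + '$' points back to the original term
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm_index(terms):
    index = {}
    for term in terms:
        for key in permuterm_keys(term):
            index[key] = term
    return index

def wildcard_lookup(index, query):
    # co*tion  ->  rotate so the * is at the end: tion$co*  ->  prefix search on 'tion$co'
    pre, post = query.split("*", 1)
    prefix = post + "$" + pre
    return sorted({t for key, t in index.items() if key.startswith(prefix)})

idx = build_permuterm_index(["cognition", "coalition", "caution", "cotton"])
print(wildcard_lookup(idx, "co*tion"))   # ['coalition', 'cognition']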

CS3245 ndash Information Retrieval

Spelling correction: Isolated word correction: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:

1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. ngram overlap

Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time and suggest the one with most hits

54

Sec 332
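The classic dynamic-programming recurrence for Levenshtein distance, as a short Python sketch with unit costs:

def edit_distance(a, b):
    # d[i][j] = edit distance between a[:i] and b[:j]
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or copy
    return d[m][n]

print(edit_distance("judgment", "judgement"))   # 1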

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing: No global dictionary - Generate separate dictionary for each block. Don't sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2: Don't sort. Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain "big" main index. New docs go into "small" (in memory) auxiliary index. Search across both, merge results. Deletions: Invalidation bit-vector for deleted docs. Filter docs output on a search result by this invalidation bit-vector. Periodically re-index into one main index

Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively we can apply logarithmic merge, in which each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6 Index Compression: Collection and vocabulary statistics: Heaps' and Zipf's laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking: Store pointers to every kth term string. Example below: k=4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers

Lose 4 bytes on term lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have long common prefix – store differences only. Used for last k-1 terms in a block of k

8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a1◊e2◊ic3◊ion

("8automat*" encodes automat; each digit gives the extra length beyond automat)

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression, Postings file entry: We store the list of docs containing a term in increasing order of docID: computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:  824                  829        215406
gaps:                         5          214577
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable

For a small gap (5), VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
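The same example run through a small Python sketch of gap computation plus VB encoding and decoding:

def vb_encode_number(n):
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128          # set the continuation bit on the final byte
    return bytes_

def vb_encode(gaps):
    out = []
    for g in gaps:
        out.extend(vb_encode_number(g))
    return out

def vb_decode(bytestream):
    numbers, n = [], 0
    for b in bytestream:
        if b < 128:
            n = 128 * n + b            # continuation byte
        else:
            n = 128 * n + (b - 128)    # final byte of this gap
            numbers.append(n)
            n = 0
    return numbers

def to_gaps(docids):
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

gaps = to_gaps([824, 829, 215406])                 # [824, 5, 214577]
encoded = vb_encode(gaps)
print([format(b, "08b") for b in encoded])         # matches the bit patterns above
print(vb_decode(encoded))                          # [824, 5, 214577]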

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g., K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection

67

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space: they are "mini-documents". Key idea 2: Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

cos(q, d) = q · d = Σ_i q_i d_i   for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69
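A compact Python sketch putting the two slides together: tf-idf weights, length normalization, and cosine as a dot product. The three toy documents are illustrative, and idf is applied to both query and document here purely for brevity (the weighting variants on the next slide differ).

import math
from collections import Counter

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}
N = len(docs)
df = Counter(t for d in docs.values() for t in set(d.split()))   # document frequencies

def tfidf_vector(text):
    tf = Counter(text.split())
    w = {t: (1 + math.log10(c)) * math.log10(N / df.get(t, 1)) for t, c in tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0       # length normalization
    return {t: x / norm for t, x in w.items()}

def cosine(v1, v2):
    return sum(v1[t] * v2.get(t, 0.0) for t in v1)                # dot product of unit vectors

query = tfidf_vector("home sales july")
scores = {d: cosine(query, tfidf_vector(text)) for d, text in docs.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))              # ranked by score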

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search Systems: Making the Vector Space Model more efficient to compute: Approximating the actual correct results; Skipping unnecessary documents. In actual data: dealing with zones and fields, query term proximity

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A

Think of A as pruning non-contenders

The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711

[Figure: the contender set A sits between the top K results and the full collection of N docs]

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting

Indeed, any function of the two "signals" of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a typical sawtooth precision-recall curve; x-axis: Recall from 0.0 to 1.0, y-axis: Precision from 0.0 to 1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (κ) = [P(A) - P(E)] / [1 - P(E)], where P(A) is the proportion of the time the judges agree and P(E) is what the agreement would be by chance

Gives 0 for chance agreement 1 for total agreement
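A quick Python sketch of the calculation for two judges making binary relevance judgments; the pooled-marginals estimate of P(E) is one common convention, and the judgment lists are made up.

def kappa(judgments_a, judgments_b):
    n = len(judgments_a)
    p_agree = sum(a == b for a, b in zip(judgments_a, judgments_b)) / n
    # chance agreement estimated from the judges' pooled marginal statistics
    p_rel = (sum(judgments_a) + sum(judgments_b)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]   # judge A: 1 = relevant, 0 = non-relevant
b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]   # judge B
print(round(kappa(a, b), 3))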

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on first result. Now we can directly see if the innovation does improve user happiness. Probably the evaluation methodology that large search engines trust most

80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents

α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
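A short numpy sketch of the Rocchio update described above; the α, β, γ values and the toy vectors are illustrative, not prescribed ones.

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # move the query toward the centroid of known relevant docs and away
    # from the centroid of known non-relevant docs (all tf-idf vectors)
    q0 = np.asarray(q0, dtype=float)
    centroid_r = np.mean(rel_docs, axis=0) if len(rel_docs) else 0.0
    centroid_nr = np.mean(nonrel_docs, axis=0) if len(nonrel_docs) else 0.0
    qm = alpha * q0 + beta * centroid_r - gamma * centroid_nr
    return np.maximum(qm, 0.0)   # negative term weights are usually clipped to 0

q0 = [1.0, 0.0, 0.0]
dr = [[0.9, 0.4, 0.0], [0.8, 0.5, 0.1]]
dnr = [[0.0, 0.1, 0.9]]
print(rocchio(q0, dr, dnr))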

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: Simplest way to compute one is based on term-term similarities in C = AA^T, where A is the term-document matrix. w_ij = (normalized) weight for (t_i, d_j)

For each t_i, pick terms with high values in C

[Figure: M × N term-document matrix A, rows t_i, columns d_j]

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

[Figure: an XML document with a Book element containing Title "Microsoft Bill Gates" and Author "Bill Gates", mapped to its lexicalized subtrees, e.g., Author/"Bill", Author/"Gates", Title/"Microsoft", Book/Title/"Microsoft", Book/Author/"Bill Gates"; labeled "With words"]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR (as in IIR ch. 10):

CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1, c1) = 1.0

CR(c1, c2) = 0.0

CR(c1, c3) = 0.60

if wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5

All weights have been normalized

e.g., author/"Bill"

e.g., author/"Bill"

e.g., title/"Bill"

e.g., book/author/firstname/"Bill". Context Resemblance, Query Term Weight, Document Term Weight

OK to ignore Query vs <c2, t> since CR = 0.0

Query vs <c1, t>; Query vs <c3, t>

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation

Independence: no association between terms (not true, but works in practice – a naïve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: We can rank documents using the log odds ratios c_t for the terms in the query

The odds ratio is the ratio of two odds:

(i) the odds of the term appearing if relevant ( p_t / (1 − p_t) ), and

(ii) the odds of the term appearing if nonrelevant ( u_t / (1 − u_t) )

c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents

c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q ∩ d} c_t

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tf_{tq}: term frequency in the query q. k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
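For concreteness, a hedged Python sketch of one common BM25 formulation including the long-query factor described above; the idf variant, parameter defaults and data structures are assumptions, not the slide's exact definitions.

import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N,
               k1=1.2, b=0.75, k3=1.5):
    # doc_tf: term -> tf in the document; df: term -> document frequency
    qtf = {}
    for t in query_terms:
        qtf[t] = qtf.get(t, 0) + 1
    score = 0.0
    for t, tfq in qtf.items():
        if t not in doc_tf or t not in df:
            continue
        idf = math.log10(N / df[t])
        tfd = doc_tf[t]
        doc_part = ((k1 + 1) * tfd) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tfd)
        query_part = ((k3 + 1) * tfq) / (k3 + tfq)   # only matters for long queries
        score += idf * doc_part * query_part
    return score

print(bm25_score(["revenue", "down"], {"revenue": 2, "down": 1}, 8, 10, {"revenue": 5, "down": 3}, 20))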

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to "high-quality" documents, e.g., those with high static quality score g(d) (cf. Section 7.1.4)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
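A minimal Python sketch of query-likelihood ranking with the Jelinek-Mercer mixture; λ, the toy documents, and the use of their concatenation as the collection model are illustrative assumptions.

from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    # P(q|d) = product over query terms of ( λ P(t|Md) + (1-λ) P(t|Mc) )
    doc_tokens, coll_tokens = doc.split(), collection.split()
    doc_tf, coll_tf = Counter(doc_tokens), Counter(coll_tokens)
    p = 1.0
    for t in query.split():
        p_doc = doc_tf[t] / len(doc_tokens)
        p_coll = coll_tf[t] / len(coll_tokens)
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

d1 = "xyzzy reports a profit but revenue is down"
d2 = "quorus narrows quarter loss but revenue decreases further"
collection = d1 + " " + d2          # stand-in for the whole collection model
q = "revenue down"
print(query_likelihood(q, d1, collection), query_likelihood(q, d2, collection))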

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a
1. Start with any distribution (say x = (1 0 … 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm multiply x by increasing powers of A until the product looks stable
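The power iteration in a few lines of Python/numpy; the 3-page transition matrix below (teleportation already folded in, rows summing to 1) is a made-up example.

import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    # multiply x by increasing powers of A until the product looks stable
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                       # start with any distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

A = np.array([[0.02, 0.49, 0.49],
              [0.49, 0.02, 0.49],
              [0.02, 0.02, 0.96]])
print(pagerank(A))                   # the steady-state vector a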

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103

Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 33: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 34: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Get exam results earlier via SMS on 3 Jun 2019: subscribe via the NUS EduRec System by 20 May 2019.
For more information: https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics: anything covered in lecture (through the slides) and the corresponding sections in the textbook.
For sample questions, refer to the tutorials, the midterm, and past year exam papers.
If in doubt, ask on the forum.

38

CS3245 ndash Information Retrieval

Exam Topic Distribution: focus on topics covered from Weeks 5 to 12.
Q1 is true/false questions on various topics (20 marks).
Q2-Q9 are calculation / essay questions, each topic specific (8~12 marks each). Don't forget your calculator.

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models: a language model is created based on a collection of text and used to assign a probability to a sequence of words.
Unigram LM: bag of words. Ngram LM: use the previous n-1 tokens as the context to predict the nth token.
Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can be used to do information retrieval, as introduced in Week 11.

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing: not used in practice, but the most basic to understand. Idea: add 1 to the count of every entry in the LM, including those that are not seen.
e.g., assuming |V| = 11:
Q2 (By Probability), "I don't want":
P(Aerosmith) ≈ 0.11 × 0.11 × 0.11 ≈ 1.3E-3
P(LadyGaga) ≈ 0.15 × 0.05 × 0.15 ≈ 1.1E-3
Winner: Aerosmith

42

Aerosmith (counts after add-1): I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1; Total Count 18
Lady Gaga (counts after add-1): I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2; Total Count 20
(|V| = 11 = total # of word types in both lines)
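A small Python sketch of this computation. It reads the two tables above as counts that already include the add-1 smoothing (they sum to 18 and 20), which is an interpretation of the slide, not something it states explicitly.

# Smoothed (add-1) counts, taken straight from the revision tables above.
counts = {
    "Aerosmith": {"I": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
                  "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1},
    "LadyGaga":  {"I": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
                  "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2},
}

def prob(query, table):
    total = sum(table.values())        # 18 and 20 respectively
    p = 1.0
    for w in query.split():
        p *= table[w] / total          # add-1 already applied to the counts
    return p

query = "I don't want"
for artist, table in counts.items():
    # ~1.37e-03 vs ~1.13e-03 -> Aerosmith wins
    # (the slide rounds each probability to two decimals first, giving 1.3E-3 and 1.1E-3)
    print(artist, prob(query, table))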

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR: basic inverted indexes, with an in-memory dictionary and on-disk postings.
Key characteristic: sorted order for postings.
Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size.

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps, Dictionary & Postings: multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Doc frequency information is also stored. Why frequency?

Sec 12

44

CS3245 ndash Information Retrieval

The merge: walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
Brutus: 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
Caesar: 1 -> 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34
Intersection: 2 -> 8
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID.

Sec 13

45
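A small Python sketch of that linear-time AND merge (a plausible rendering, not the exact pseudocode from the slides):

def intersect(p1, p2):
    # Both postings lists must be sorted by docID.
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]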

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms: faster merging of postings lists with skip pointers.
Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

47

Example: 2 -> 4 -> 8 -> 41 -> 48 -> 64 -> 128, with skip pointers from 2 to 41 and from 41 to 128;
1 -> 2 -> 3 -> 8 -> 11 -> 17 -> 21 -> 31, with skip pointers from 1 to 11 and from 11 to 31.

Sec 23

CS3245 ndash Information Retrieval

Biword indexes: index every consecutive pair of terms in the text as a phrase (a bigram model using words). For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, ...>
Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms: definitely language specific.
In English we worry about tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.

50

CS3245 ndash Information Retrieval

Week 4 The dictionary and tolerant retrieval
Data structures for the dictionary: hash, trees.
Learning to be tolerant:
1. Wildcards: general trees, Permuterm, ngrams redux
2. Spelling correction: edit distance, ngrams re-redux
3. Phonetic: Soundex
51

CS3245 ndash Information Retrieval

Hash Table: each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree, O(1).
Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: how can we handle *'s in the middle of a query term, e.g. co*tion?
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wild-card queries so that the * always occurs at the end.
This gives rise to the Permuterm index. Alternative: k-gram index with postfiltering.

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction. Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. ngram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits.
54

Sec 332
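A compact Python sketch of the unweighted Levenshtein edit distance by dynamic programming, as a refresher; unit costs for insert, delete and substitute are an assumption matching the basic formulation.

def edit_distance(s, t):
    # dp[i][j] = edits needed to turn s[:i] into t[:j]
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(edit_distance("judgement", "judgment"))   # 1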

CS3245 ndash Information Retrieval

Week 5 Index construction. Sort-based indexing: Blocked Sort-Based Indexing; merge sort is effective for disk-based sorting (avoid seeks).
Single-Pass In-Memory Indexing: no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur).
Distributed indexing using MapReduce. Dynamic indexing: multiple indices, logarithmic merge.

55

CS3245 ndash Information Retrieval

BSBI, Blocked sort-based Indexing (sorting with fewer disk seeks): 8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to 4-byte termIDs.
These records are generated as we parse docs. Must now sort 100M 8-byte records by termID. Define a block as ~10M such records; we can easily fit a couple into memory, and will have 10 such blocks for our collection.
Basic idea of the algorithm: accumulate postings for each block, sort, write to disk; then merge the blocks into one long sorted order.

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1: generate separate dictionaries for each block; no need to maintain a term-termID mapping across blocks.
Key idea 2: don't sort; accumulate postings in postings lists as they occur.

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57
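A toy Python sketch of the SPIMI idea under the two key ideas above (per-block dictionary built on the fly, postings accumulated unsorted, terms sorted only when the block is written out); the in-memory layout and block-size test are simplifying assumptions.

from collections import defaultdict

def spimi_invert(token_stream, max_postings=1_000_000):
    # token_stream yields (term, docID) pairs; returns one block's index.
    index = defaultdict(list)          # per-block dictionary, no global termIDs
    n = 0
    for term, doc_id in token_stream:
        index[term].append(doc_id)     # postings are not sorted while indexing
        n += 1
        if n >= max_postings:          # block full: stop and let caller write it out
            break
    return sorted(index.items())       # sort terms once, when writing the block

block = spimi_invert(iter([("caesar", 1), ("brutus", 1), ("caesar", 2)]))
print(block)   # [('brutus', [1]), ('caesar', [1, 2])]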

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

(Figure: MapReduce data flow. Segment files feed parsers in the map phase; each parser writes partitioned output for key ranges a-b, c-d, ..., y-z; inverters in the reduce phase each collect one key range and write the postings; a master coordinates the assignments.)

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index; search across both and merge results. Deletions: keep an invalidation bit-vector for deleted docs and filter the docs output on a search result by it. Periodically re-index into one main index.
Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge. Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, ...) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
Either write the merged Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2.
... and so on.

Sec 45

60Information Retrieval

(Loop for log levels)
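A rough Python sketch of that cascading merge loop (the in-memory threshold, the dict-of-sorted-lists representation, and the merge routine are all simplifying assumptions):

def merge(a, b):
    # Indexes represented as dicts {term: sorted list of docIDs}.
    out = dict(a)
    for term, postings in b.items():
        out[term] = sorted(set(out.get(term, [])) | set(postings))
    return out

class LogarithmicIndexer:
    def __init__(self, n=2):
        self.n = n            # capacity of the in-memory index Z0 (in docs, for simplicity)
        self.z0 = {}          # in-memory index Z0
        self.disk = []        # disk[i] holds I_i, or None if that level is empty
        self.z0_docs = 0

    def add(self, doc_id, terms):
        for t in terms:
            self.z0.setdefault(t, []).append(doc_id)
        self.z0_docs += 1
        if self.z0_docs >= self.n:
            self._flush()

    def _flush(self):
        z, level = self.z0, 0
        while level < len(self.disk) and self.disk[level] is not None:
            z = merge(z, self.disk[level])   # merge with I_level and cascade upward
            self.disk[level] = None
            level += 1
        if level == len(self.disk):
            self.disk.append(None)
        self.disk[level] = z                  # write as I_level
        self.z0, self.z0_docs = {}, 0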

CS3245 ndash Information Retrieval

Week 6 Index Compression: collection and vocabulary statistics (Heaps' and Zipf's laws).
Compression to make the index smaller and faster. Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding.
Postings compression: gap encoding.

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking. Store pointers to every kth term string (example below, k=4). Need to store term lengths (1 extra byte).

62

...7systile9syzygetic8syzygial6syzygy11szaibelyite...
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

Sec 52

CS3245 ndash Information Retrieval

Front coding: sorted words commonly have a long common prefix, so store differences only (used for the last k-1 terms in a block of k).
8automata8automate9automatic10automation -> 8automat*a1◊e2◊ic3◊ion
("8automat*" encodes automat; each number gives the extra length beyond automat, and ◊ marks the omitted prefix.)
Begins to resemble general string compression.
Information Retrieval 63

Sec 52
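A small Python sketch of front coding a sorted block of terms; the *, ◊ markers follow the example above, while the exact on-disk byte layout is an assumption.

import os

def front_code(terms):
    # Front-code a sorted block: full first term, then prefix-stripped rest.
    first = terms[0]
    prefix = first
    for t in terms[1:]:
        prefix = os.path.commonprefix([prefix, t])
    out = [f"{len(first)}{prefix}*{first[len(prefix):]}"]
    for t in terms[1:]:
        suffix = t[len(prefix):]
        out.append(f"{len(suffix)}\u25ca{suffix}")   # ◊ marks the omitted shared prefix
    return "".join(out)

print(front_code(["automata", "automate", "automatic", "automation"]))
# 8automat*a1◊e2◊ic3◊ion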

CS3245 ndash Information Retrieval

Postings Compression, postings file entry: we store the list of docs containing a term in increasing order of docID, e.g. computer: 33, 47, 154, 159, 202, ...
Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, ...
Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs: 824, 829, 215406
gaps: 5, 214577 (the first docID, 824, is encoded directly)
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001
Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.

Sec 53

512+256+32+16+8 = 824
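A Python sketch of VB encoding and decoding consistent with the example above (7 payload bits per byte, continuation bit set on the last byte of each number):

def vb_encode_number(n):
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128          # set the high bit on the last (terminating) byte
    return bytes_

def vb_encode(gaps):
    out = []
    for g in gaps:
        out.extend(vb_encode_number(g))
    return out

def vb_decode(stream):
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

code = vb_encode([824, 5, 214577])
print([f"{b:08b}" for b in code])   # 00000110 10111000 10000101 00001101 00001100 10110001
print(vb_decode(code))              # [824, 5, 214577]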

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:
w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)
Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.
Increases with the number of occurrences within a document; increases with the rarity of the term in the collection.
67

Sec 622
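In Python, the weight above can be computed directly (log base 10; treating the weight as 0 when the term is absent is the usual convention, made explicit here as an assumption):

import math

def tf_idf_weight(tf, df, N):
    # w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

print(tf_idf_weight(tf=3, df=50, N=1_000_000))   # ≈ 6.35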

CS3245 ndash Information Retrieval

Queries as vectors. Key idea 1: do the same for queries, representing them as vectors in the space; they are "mini-documents". Key idea 2: rank documents according to their proximity to the query in this space.
proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity
Motivation: want to rank more relevant documents higher than less relevant documents.

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): cos(q, d) = q · d = Σ_i q_i d_i, for length-normalized q and d.

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search Systems: making the Vector Space Model more efficient to compute; approximating the actual correct results; skipping unnecessary documents. In actual data: dealing with zones and fields, and query term proximity.
Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
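The CosineScore pseudocode itself did not survive the export; below is a term-at-a-time Python sketch of the usual accumulator-based computation (precomputed document lengths and a fixed K are assumed inputs):

import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_length, K=10):
    # query_weights: {term: w_{t,q}}; postings: {term: [(docID, w_{t,d}), ...]}
    scores = defaultdict(float)                 # one accumulator per document
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_tq * w_td       # term-at-a-time accumulation
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]    # length normalization
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])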

CS3245 ndash Information Retrieval

Generic approach: find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics. Index elimination: high-idf query terms only; docs containing many query terms.
Champion lists: generalized to high-low lists and tiered indexes.
Impact-ordered postings: early termination, idf-ordered terms.
Cluster pruning.

74

Sec 711

CS3245 ndash Information Retrieval

Net score: consider a simple total score combining cosine relevance and authority:
net-score(q,d) = g(d) + cosine(q,d)
Can use some other linear combination than an equal weighting; indeed, any function of the two "signals" of user happiness.
Now we seek the top K docs by net score.

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: Year = 1601 is an example of a field. A field or parametric index has postings for each field value. Sometimes we build range trees (B-trees), e.g. for dates. A field query is typically treated as a conjunction (doc must be authored by shakespeare).
Zones: a zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References, ... Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation: how do we know if our results are any good? Benchmarks; precision and recall; composite measures; test collections; A/B testing.

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

(Figure: a precision-recall curve, with precision on the y-axis and recall on the x-axis, both from 0.0 to 1.0.)

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: an agreement measure among judges, designed for categorical judgments; corrects for chance agreement.
Kappa (K) = [P(A) - P(E)] / [1 - P(E)], where P(A) is the proportion of the time the judges agree and P(E) is what the agreement would be by chance.
Gives 0 for chance agreement, 1 for total agreement.
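A quick Python rendering of the formula; the two-judge judgment counts in the example are made up for illustration.

def kappa(p_agree, p_chance):
    # K = [P(A) - P(E)] / [1 - P(E)]
    return (p_agree - p_chance) / (1 - p_chance)

# Hypothetical pooled judgments from two judges over 400 doc-query pairs:
# both say relevant 300 times, both say non-relevant 70 times.
p_a = (300 + 70) / 400
# Chance agreement from the marginals (say each judge marks ~80% relevant):
p_e = 0.8 * 0.8 + 0.2 * 0.2
print(round(kappa(p_a, p_e), 3))   # ≈ 0.766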

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.
Have most users use the old system; divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3 XML IR: basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR.
Chapter 9 Query Refinement:
1. Relevance Feedback (document level): explicit RF, Rocchio (1971), when does it work; variants: implicit and blind.
2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation.

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.
α, β, γ = weights (hand-chosen or set empirically).
Popularized in the SMART system (Salton).
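The Rocchio update formula itself was lost in the export; for reference it is q_m = α·q_0 + β·(1/|Dr|)·Σ_{d in Dr} d - γ·(1/|Dnr|)·Σ_{d in Dnr} d. A small NumPy sketch under that formulation (the example vectors and the default weights are illustrative):

import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    q = alpha * np.asarray(q0, dtype=float)
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q -= gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q, 0.0)    # negative term weights are usually clipped to 0

q0  = [1.0, 0.0, 0.5]
Dr  = [[0.8, 0.2, 0.0], [0.6, 0.0, 0.4]]
Dnr = [[0.0, 0.9, 0.1]]
print(rocchio(q0, Dr, Dnr))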

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: the simplest way to compute one is based on term-term similarities in C = AA^T, where A is the M x N term-document matrix and w_ij = (normalized) weight for (t_i, d_j).
For each t_i, pick the terms with the highest values in C.

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea, lexicalized subtrees: aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.

85

(Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is decomposed into lexicalized subtrees "with words", e.g. Title-Microsoft, Author-Bill, Author-Gates, Author-Bill Gates, Book/Title-Microsoft, Book/Author-Bill Gates, ...)

CS3245 ndash Information Retrieval

Context resemblance: a simple measure of the similarity of a path cq in a query and a path cd in a document is the context resemblance function CR.
|cq| and |cd| are the number of nodes in the query path and document path, respectively.
cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
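The CR formula itself was dropped in the export; in the textbook's formulation it is CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise. A short Python sketch under that assumption, with the matching test simplified to node-subsequence containment:

def context_resemblance(cq, cd):
    # cq, cd: lists of node labels along the path (word node included)
    def matches(q, d):
        it = iter(d)
        return all(node in it for node in q)   # q is a subsequence of d
    return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0

print(context_resemblance(["author", "Bill"], ["author", "Bill"]))                      # 1.0
print(context_resemblance(["author", "Bill"], ["book", "author", "firstname", "Bill"])) # 0.6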

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>, e.g. author/"Bill".
Inverted index (dictionary -> postings):
<c1, t> (e.g. author/"Bill") -> <d1, 0.5>, <d4, 0.1>, <d9, 0.2>
<c2, t> (e.g. title/"Bill") -> <d2, 0.25>, <d3, 0.1>, <d12, 0.9>
<c3, t> (e.g. book/author/firstname/"Bill") -> <d3, 0.7>, <d6, 0.8>, <d9, 0.5>
(This example is slightly different from the book. All weights have been normalized.)
CR(c1, c1) = 1.0; CR(c1, c2) = 0.0; CR(c1, c3) = 0.60
OK to ignore Query vs <c2, t> since CR = 0.0.
If wq = 1.0, then sim(q, d9) = (1.0 x 1.0 x 0.2) + (0.6 x 1.0 x 0.5) = 0.5
(each term is Context Resemblance x Query Term Weight x Document Term Weight; the first term is Query vs <c1, t>, the second is Query vs <c3, t>)

CS3245 ndash Information Retrieval

INEX relevance assessments: the relevance-coverage combinations are quantized into graded values.
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of the quantized grades over A.

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IR. Chapter 11:
1. Probabilistic Approach to Retrieval, Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)
Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): traditionally used with the PRP.
Assumptions. Binary (equivalent to Boolean): documents and queries are represented as binary term incidence vectors, e.g. document d represented by vector x = (x1, ..., xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice; a naïve assumption).

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: we can rank documents using the log odds ratios for the terms in the query, c_t.
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 - u_t).
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that RSV_d = Σ c_t over the query terms present in d.
Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25, a nonbinary model: if the query is long, we might also use similar weighting for query terms.
tf_tq: term frequency in the query q. k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.

Sec 1143

92
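The BM25 formula itself did not survive the export; below is a Python sketch of the standard document-side scoring with parameters k1 and b (idf is simplified to log10(N/df), and the length statistics are assumed to be precomputed):

import math

def bm25_score(query_terms, tf, doc_len, avg_doc_len, df, N, k1=1.5, b=0.75):
    # tf: {term: frequency in this document}; df: {term: document frequency}
    score = 0.0
    for t in query_terms:
        if t not in tf or t not in df:
            continue
        idf = math.log10(N / df[t])
        norm_tf = (tf[t] * (k1 + 1)) / (tf[t] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm_tf
    return score

print(bm25_score(["caesar", "brutus"],
                 tf={"caesar": 3, "brutus": 1}, doc_len=120, avg_doc_len=100,
                 df={"caesar": 40, "brutus": 25}, N=10_000))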

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: the difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. The difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q).
P(q) is the same for all documents, so ignore it. P(d) is the prior, often treated as the same for all d; but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d. So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0)).
2. After one step, we're at xA.
3. After two steps we're at xA^2, then xA^3, and so on.
4. "Eventually" means: for large k, xA^k = a.
Algorithm: multiply x by increasing powers of A until the product looks stable.
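A small Python sketch of this power iteration on a toy transition matrix A (the 2-state matrix and the tolerance are illustrative; A is assumed to already include teleportation, so it is row-stochastic):

def pagerank(A, tol=1e-10, max_iters=1000):
    n = len(A)
    x = [1.0] + [0.0] * (n - 1)          # start with any distribution
    for _ in range(max_iters):
        x_next = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]  # x := xA
        if max(abs(a - b) for a, b in zip(x, x_next)) < tol:  # product looks stable
            return x_next
        x = x_next
    return x

A = [[0.1, 0.9],     # row-stochastic transition probabilities
     [0.3, 0.7]]
print(pagerank(A))   # steady state a = (0.25, 0.75)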

CS3245 ndash Information Retrieval

100

Pagerank summary. Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query and rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: you have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future: Python, one of the easiest and most straightforward programming languages to use; and NLTK, a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 35: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 36: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps: Dictionary & Postings. Multiple term entries in a single document are merged. Split into Dictionary and Postings. Doc frequency information is also stored. Why frequency?

Sec 12

44

CS3245 ndash Information Retrieval

The merge: Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

Brutus: 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
Caesar: 1 -> 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34
Intersection: 2 -> 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID.

Sec 13

45
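A minimal Python sketch of this linear-time merge (postings are assumed to be plain lists of docIDs in increasing order; names are illustrative):

def intersect(p1, p2):
    # walk both sorted postings lists in lockstep: O(x + y)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]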

CS3245 ndash Information Retrieval

Week 3: Postings lists and Choosing terms. Faster merging of postings lists: skip pointers. Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.

Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

47

[Figure: two postings lists, 2 4 8 41 48 64 128 and 1 2 3 8 11 17 21 31, augmented with skip pointers (e.g. skips landing on 41 and 128 in the first list and on 11 and 31 in the second)]

Sec 23

CS3245 ndash Information Retrieval

Biword indexes: Index every consecutive pair of terms in the text as a phrase (a bigram model using words). For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".

Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries, we use a merge algorithm recursively, at the document level. Now we need to deal with more than just equality.

49

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, ...>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242
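A rough sketch of the positional merge for a two-word phrase. The index layout {term: {docID: [positions]}} and the function name are assumptions made for illustration, not the structure used in the assignments:

def phrase_docs(index, term1, term2):
    # keep docs where term2 occurs at position p+1 for some occurrence p of term1
    hits = []
    common = set(index.get(term1, {})) & set(index.get(term2, {}))
    for doc in sorted(common):
        pos2 = set(index[term2][doc])
        if any(p + 1 in pos2 for p in index[term1][doc]):
            hits.append(doc)
    return hits

idx = {"to": {1: [4, 9], 4: [5]},
       "be": {1: [5, 17], 2: [3], 4: [17]}}
print(phrase_docs(idx, "to", "be"))   # [1]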

CS3245 ndash Information Retrieval

Choosing terms: Definitely language-specific. In English we worry about: tokenization, stop word removal, normalization (e.g. case folding), lemmatization and stemming.

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary: Hash, Trees

Learning to be tolerant: 1. Wildcards: general trees, Permuterm, Ngrams redux. 2. Spelling correction: edit distance, Ngrams re-redux. 3. Phonetic: Soundex.

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table: Each vocabulary term is hashed to an integer. Pros: lookup is faster than for a tree, O(1). Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, we need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term, e.g. co*tion?

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive.

The solution: transform wildcard queries so that the * always occurs at the end. This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering.

53

Sec 32
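A tiny sketch of the Permuterm idea: index every rotation of term + '$', and rotate the query so that the wildcard ends up at the end (illustrative only):

def permuterm_keys(term):
    # all rotations of term + '$' (end-of-word marker)
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(permuterm_keys("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']

# A query X*Y is rotated to Y$X* so the * is at the end:
# co*tion  ->  tion$co*  (a prefix lookup for "tion$co" in the permuterm index)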

CS3245 ndash Information Retrieval

Spelling correction. Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives: 1. edit distance (Levenshtein distance), 2. weighted edit distance, 3. ngram overlap.

Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.

54

Sec 332
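A standard dynamic-programming sketch of Levenshtein distance (unit costs; a weighted edit distance would simply replace the 1's with per-operation costs):

def edit_distance(s, t):
    # dp[i][j] = cost of transforming s[:i] into t[:j]
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                              # delete
                           dp[i][j - 1] + 1,                              # insert
                           dp[i - 1][j - 1] + (s[i - 1] != t[j - 1]))     # substitute
    return dp[m][n]

print(edit_distance("sunday", "saturday"))   # 3
print(edit_distance("cat", "cart"))          # 1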

CS3245 ndash Information Retrieval

Week 5: Index construction. Sort-based indexing: Blocked Sort-Based Indexing; merge sort is effective for disk-based sorting (avoid seeks). Single-Pass In-Memory Indexing: no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur). Distributed indexing using MapReduce. Dynamic indexing: multiple indices, logarithmic merge.

55

CS3245 ndash Information Retrieval

BSBI: Blocked sort-based indexing (sorting with fewer disk seeks). 8-byte (4+4) records (termID, docID); as terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes. These are generated as we parse docs. Must now sort 100M 8-byte records by termID. Define a block as ~10M such records; we can easily fit a couple into memory and will have 10 such blocks for our collection.

Basic idea of the algorithm: accumulate postings for each block, sort, write to disk; then merge the blocks into one long sorted order.

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI: Single-pass in-memory indexing.

Key idea 1: generate separate dictionaries for each block (no need to maintain the term-termID mapping across blocks). Key idea 2: don't sort; accumulate postings in postings lists as they occur.

With these two ideas we can generate a complete inverted index for each block. These separate indices can then be merged into one big index.

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

[Figure: MapReduce data flow for distributed indexing. Splits of the collection feed parsers in the map phase; parsers emit segment files partitioned by term range (a-b, c-d, ..., y-z); in the reduce phase, each inverter collects one term-range partition and writes the final postings; a master node assigns the tasks.]

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index; search across both and merge results. Deletions: keep an invalidation bit-vector for deleted docs, and filter docs output on a search result by this invalidation bit-vector. Periodically re-index into one main index. Assuming T total postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge. Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, ...) on disk. If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1. Either write the merged Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2, ... etc.

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6: Index Compression. Collection and vocabulary statistics: Heaps' and Zipf's laws. Compression to make the index smaller and faster. Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding. Postings compression: gap encoding.

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking. Store pointers to every kth term string (example below: k = 4). Need to store term lengths (1 extra byte).

62

...7systile9syzygetic8syzygial6syzygy11szaibelyite...

Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix; store differences only. Used for the last k-1 terms in a block of k.

8automata8automate9automatic10automation

Information Retrieval 63

-> 8automat*a1⋄e2⋄ic3⋄ion (encodes automat*; each number gives the extra length beyond automat)

Begins to resemble general string compression.

Sec 52
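A rough sketch of front coding one block of k sorted terms, along the lines of the automat example above (the ⋄ marker is written as '|' in the code; the exact on-disk format is an assumption for illustration):

import os

def front_code_block(terms):
    # first term is written in full, with '*' closing the shared prefix;
    # the remaining k-1 terms store only the length and suffix after that prefix
    prefix = os.path.commonprefix(terms)
    first = terms[0]
    out = [f"{len(first)}{prefix}*{first[len(prefix):]}"]
    for t in terms[1:]:
        suffix = t[len(prefix):]
        out.append(f"{len(suffix)}|{suffix}")
    return "".join(out)

print(front_code_block(["automata", "automate", "automatic", "automation"]))
# 8automat*a1|e2|ic3|ion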

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry. We store the list of docs containing a term in increasing order of docID: computer: 33, 47, 154, 159, 202, ... Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, ... Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example

docIDs: 824, 829, 215406
gaps: 824 (the first entry is the docID itself), 5, 214577
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable. For a small gap (5), VB uses a whole byte.

Sec 53

512+256+32+16+8 = 824
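A sketch of VB encoding and decoding following the convention above (7 payload bits per byte; the continuation bit is set to 1 on the last byte of each number):

def vb_encode_number(n):
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128          # mark the final byte of this number
    return bytes_

def vb_encode(numbers):
    return [b for n in numbers for b in vb_encode_number(n)]

def vb_decode(bytestream):
    numbers, n = [], 0
    for b in bytestream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

code = vb_encode([824, 5, 214577])
print([format(b, "08b") for b in code])
# ['00000110', '10111000', '10000101', '00001101', '00001100', '10110001']
print(vb_decode(code))   # [824, 5, 214577]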

CS3245 ndash Information Retrieval

Week 7: Vector space ranking

1. Represent the query as a weighted tf-idf vector.
2. Represent each document as a weighted tf-idf vector.
3. Compute the cosine similarity score for the query vector and each document vector.
4. Rank documents with respect to the query by score.
5. Return the top K (e.g. K = 10) to the user.

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf. The weight increases with the number of occurrences within a document, and increases with the rarity of the term in the collection.

67

Sec 622
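For instance, with the weighting above (numbers chosen for illustration, not from the slides): if a term occurs tf_{t,d} = 10 times in a document and appears in df_t = 1,000 out of N = 1,000,000 documents, then w_{t,d} = (1 + log10 10) × log10(1,000,000 / 1,000) = 2 × 3 = 6.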

CS3245 ndash Information Retrieval

Queries as vectors. Key idea 1: Do the same for queries; represent them as vectors in the space (they are "mini-documents"). Key idea 2: Rank documents according to their proximity to the query in this space.

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation: we want to rank more relevant documents higher than less relevant documents.

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_i q_i d_i, for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems. Making the Vector Space Model more efficient to compute: approximating the actual correct results; skipping unnecessary documents. In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
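As a revision aid, a hedged Python sketch of the usual term-at-a-time accumulator scheme behind this slide; the index layout {term: [(docID, tf), ...]}, the doc_lengths mapping, and the simple tf-idf weighting used on both query and document sides are assumptions for illustration:

import heapq
import math

def cosine_scores(query_terms, index, doc_lengths, N, K=10):
    scores = {}                                    # one accumulator per candidate doc
    for t in set(query_terms):
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log10(N / len(postings))
        w_tq = query_terms.count(t) * idf          # query term weight
        for doc, tf in postings:
            w_td = (1 + math.log10(tf)) * idf      # document term weight
            scores[doc] = scores.get(doc, 0.0) + w_tq * w_td
    for doc in scores:
        scores[doc] /= doc_lengths[doc]            # length normalization
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])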

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A. Think of A as pruning non-contenders. The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics: index elimination (high-idf query terms only; docs containing many query terms); champion lists (generalized to high-low lists and a tiered index); impact-ordered postings (early termination, idf-ordered terms); cluster pruning.

74

Sec 711

CS3245 ndash Information Retrieval

Net score: Consider a simple total score combining cosine relevance and authority:

net-score(q, d) = g(d) + cosine(q, d)

We can use some other linear combination than an equal weighting; indeed, any function of the two "signals" of user happiness. Now we seek the top K docs by net score.

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: Year = 1601 is an example of a field. A field or parametric index has postings for each field value. Sometimes we build range trees (B-trees), e.g. for dates. A field query is typically treated as a conjunction (doc must be authored by shakespeare).

Zones: A zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References, ... Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9: IR Evaluation. How do we know if our results are any good? Benchmarks; precision and recall; composite measures; test collections; A/B testing.

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve, with precision (y-axis, 0.0 to 1.0) plotted against recall (x-axis, 0.0 to 1.0)]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: an agreement measure among judges, designed for categorical judgments; it corrects for chance agreement.

Kappa (K) = [P(A) - P(E)] / [1 - P(E)], where P(A) is the proportion of the time the judges agree and P(E) is what the agreement would be by chance.

Gives 0 for chance agreement, 1 for total agreement.
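A quick check of the kappa formula on a hypothetical 2x2 table of judgments (the counts are made up for illustration; P(E) uses pooled marginals):

# rows: judge 1 (relevant / non-relevant); columns: judge 2
table = [[300, 20],
         [10, 70]]
total = sum(sum(row) for row in table)                     # 400 judged pairs
p_agree = (table[0][0] + table[1][1]) / total              # P(A) = 370/400
# pooled marginal probability of "relevant", used for chance agreement
p_rel = (sum(table[0]) + table[0][0] + table[1][0]) / (2 * total)
p_chance = p_rel ** 2 + (1 - p_rel) ** 2                   # P(E)
kappa = (p_agree - p_chance) / (1 - p_chance)
print(round(kappa, 3))                                     # ~0.776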

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result. Now we can directly see whether the innovation does improve user happiness. Probably the evaluation methodology that large search engines trust most.

80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR

Chapter 10.3 XML IR: basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR.

Chapter 9 Query Refinement: 1. Relevance Feedback (document level): explicit RF, Rocchio (1971), when does it work?, variants: implicit and blind. 2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation.

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors (different from Cr and Cnr, as we only get judgments from a few documents); α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton)
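The standard Rocchio update is q_m = α·q_0 + β·(1/|Dr|)·Σ_{d in Dr} d - γ·(1/|Dnr|)·Σ_{d in Dnr} d. A small numpy sketch under that assumption (the vectors and the weights α = 1.0, β = 0.75, γ = 0.15 are illustrative):

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    q = alpha * q0
    if len(rel_docs):
        q = q + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q = q - gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q, 0.0)       # negative term weights are usually clipped to 0

q0 = np.array([1.0, 0.0, 0.5])                       # original query vector
Dr = np.array([[0.8, 0.2, 0.0], [0.6, 0.4, 0.1]])    # known relevant doc vectors
Dnr = np.array([[0.0, 0.9, 0.0]])                    # known irrelevant doc vectors
print(rocchio(q0, Dr, Dnr))                          # [1.525 0.09 0.5375]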

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents.

In query expansion, users give additional input (good / bad search term) on words or phrases.

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: The simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix (rows t_i, columns d_j) and w_ij is the (normalized) weight for (t_i, d_j). For each t_i, pick the terms with high values in C.

Sec 923

In NLTK Did you forget

84
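A minimal numpy sketch of this construction: build the term-document matrix A, compute C = A·Aᵀ, and read off each term's nearest neighbour (the toy matrix and terms are illustrative):

import numpy as np

terms = ["car", "automobile", "banana"]
# rows = terms, columns = documents (weights would normally be normalized tf-idf)
A = np.array([[0.7, 0.9, 0.0, 0.1],
              [0.6, 0.8, 0.0, 0.0],
              [0.0, 0.0, 0.9, 0.8]])
C = A @ A.T                        # term-term similarity matrix
np.fill_diagonal(C, 0.0)           # ignore self-similarity
for i, t in enumerate(terms):
    print(t, "->", terms[int(np.argmax(C[i]))])
# car -> automobile, automobile -> car (banana has no strong neighbour in this toy data)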

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees. Aim to have each dimension of the vector space encode a word together with its position within the XML tree. How? Map XML documents to lexicalized subtrees.

85

[Figure: an XML document, Book with Title "Microsoft" and Author "Bill Gates", decomposed into lexicalized subtrees "with words", e.g. Microsoft, Bill, Gates, Title/Microsoft, Author/Bill, Author/Gates, Author/Bill Gates, Book/Title/Microsoft, ...]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR. |cq| and |cd| are the number of nodes in the query path and document path, respectively. cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
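The standard definition of the CR function (IIR Ch. 10) is CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise. For example, counting the word node as part of the path, CR(author/"Bill", book/author/firstname/"Bill") = (1 + 2) / (1 + 4) = 0.6, which is the value used in the SimNoMerge example below.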

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>, e.g. author/"Bill".

Inverted index (dictionary -> postings):
<c1, t> (e.g. author/"Bill") -> <d1, 0.5>, <d4, 0.1>, <d9, 0.2>
<c2, t> (e.g. title/"Bill") -> <d2, 0.25>, <d3, 0.1>, <d12, 0.9>
<c3, t> (e.g. book/author/firstname/"Bill") -> <d3, 0.7>, <d6, 0.8>, <d9, 0.5>

(This example is slightly different from the book. All weights have been normalized.)

CR(c1, c1) = 1.0; CR(c1, c2) = 0.0; CR(c1, c3) = 0.60. OK to ignore Query vs <c2, t> since CR = 0.0.

If wq = 1.0, then sim(q, d9) = (Query vs <c1, t>) + (Query vs <c3, t>) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5, where each product is Context Resemblance × Query Term Weight × Document Term Weight.

CS3245 ndash Information Retrieval

INEX relevance assessments: The relevance-coverage combinations are quantized as follows. This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval; the quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum Σ_{c in A} Q(rel(c)).

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR

Chapter 11: 1. Probabilistic Approach to Retrieval / Basic Probability Theory; 2. Probability Ranking Principle; 3. Binary Independence Model, BestMatch25 (Okapi).

Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): Traditionally used with the PRP.

Assumptions: "Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors; e.g. document d is represented by the vector x = (x1, ..., xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation. "Independence": no association between terms (not true, but works in practice; a naive assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing the RSV. We can rank documents using the log odds ratios for the terms in the query, c_t:

c_t = log [ (p_t / (1 - p_t)) / (u_t / (1 - u_t)) ] = log [ p_t (1 - u_t) / (u_t (1 - p_t)) ]

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 - u_t).

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d = Σ_{t in q∩d} c_t. Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in the VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model. If the query is long, we might also use similar weighting for query terms. tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single, fixed query). The above tuning parameters should be set by optimization on a development test collection; experiments have shown reasonable values of k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec 1143

92
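A hedged sketch of BM25 scoring with the long-query weighting described above, i.e. RSV_d = Σ_t log(N/df_t) · ((k1+1)·tf_td) / (k1·((1-b) + b·L_d/L_ave) + tf_td) · ((k3+1)·tf_tq) / (k3 + tf_tq); all inputs below are toy values for illustration:

import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, N, df,
               k1=1.2, b=0.75, k3=1.5):
    score = 0.0
    for t, tf_tq in query_tf.items():
        if t not in doc_tf or t not in df:
            continue
        idf = math.log10(N / df[t])
        tf_td = doc_tf[t]
        doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
        score += idf * doc_part * query_part
    return score

print(bm25_score(query_tf={"brutus": 2, "caesar": 1},
                 doc_tf={"brutus": 1, "caesar": 3, "antony": 5},
                 doc_len=120, avg_doc_len=100, N=100000,
                 df={"brutus": 300, "caesar": 500, "antony": 200}))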

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. The difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q) = P(q|d) P(d) / P(q). P(q) is the same for all documents, so we can ignore it. P(d) is the prior, often treated as the same for all d; but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4). P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model: Summary. What we model: the user has a document in mind and generates the query from this document. The equation represents the probability that the document that the user had in mind was, in fact, this one.

95

Sec 1222
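A sketch of query-likelihood scoring with the mixture model, P(q|d) ≈ Π_t (λ·P(t|Md) + (1-λ)·P(t|Mc)), mixing the document model with the collection model; the toy documents and λ = 0.5 are purely illustrative:

import math

def lm_mixture_score(query, doc_tokens, collection_tokens, lam=0.5):
    doc_len, coll_len = len(doc_tokens), len(collection_tokens)
    score = 0.0
    for t in query:
        p_doc = doc_tokens.count(t) / doc_len            # P(t | Md)
        p_coll = collection_tokens.count(t) / coll_len   # P(t | Mc)
        score += math.log(lam * p_doc + (1 - lam) * p_coll)   # log to avoid underflow
    return score

d1 = "click go the shears boys click click click".split()
d2 = "metal here in click metal shears click here".split()
collection = d1 + d2
q = ["shears", "click"]
print(lm_mixture_score(q, d1, collection))   # higher score: d1 has more clicks
print(lm_mixture_score(q, d2, collection))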

CS3245 ndash Information Retrieval

Week 12: Web Search. Chapter 20: Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling. Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days). Duplicates: need to integrate duplicate detection. Scale: need to use a distributed approach. Spam and spider traps: need to integrate spam detection.

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a.
1. Start with any distribution (say x = (1, 0, ..., 0)).
2. After one step, we're at xA.
3. After two steps we're at xA², then xA³, and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.
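A compact power-iteration sketch of this algorithm; A is assumed to be the teleport-adjusted transition matrix (rows sum to 1), and the 3-page matrix below is a toy example:

import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    x = np.zeros(A.shape[0])
    x[0] = 1.0                               # start with any distribution, say (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A                       # one step of the chain
        if np.abs(x_next - x).sum() < tol:   # the product "looks stable"
            return x_next
        x = x_next
    return x

A = np.array([[0.05, 0.90, 0.05],
              [0.45, 0.10, 0.45],
              [0.05, 0.90, 0.05]])
print(pagerank(A))   # the steady-state vector a; entries sum to 1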

CS3245 ndash Information Retrieval

100

Pagerank summary. Pre-processing: given the graph of links, build the matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1. Query processing: retrieve pages meeting the query and rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine. In addition, you have picked up skills that are useful for your future: Python, one of the easiest and most straightforward programming languages to use; and NLTK, a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 ndash Information Retrieval

103 Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 37: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 38: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction: Isolated word correction. Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:

1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. ngram overlap

Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time and suggest the one with most hits
54
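A compact dynamic-programming sketch of Levenshtein edit distance (unit costs; weighted edit distance only changes the cost terms):

def edit_distance(a, b):
    # dp[i][j] = minimum number of edits to turn a[:i] into b[:j]
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[m][n]

print(edit_distance("judgment", "judgement"))   # 1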

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

[Diagram: MapReduce data flow. The Master assigns splits of the collection to Parsers (map phase); parsers write segment files partitioned by term range (a-b, c-d, …, y-z); each Inverter (reduce phase) collects one term-range partition and writes its postings.]

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain "big" main index. New docs go into "small" (in-memory) auxiliary index. Search across both, merge results. Deletions: Invalidation bit-vector for deleted docs. Filter docs output on a search result by this invalidation bit-vector. Periodically re-index into one main index.

Assuming T = total # of postings and n = size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively we can apply logarithmic merge, in which each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge. Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk. If Z0 gets too big (> n), write it to disk as I0, or merge with I0 (if I0 already exists) as Z1.

Either write Z1 to disk as I1 (if there is no I1), or merge with I1 to form Z2

… etc.

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6 Index Compression: Collection and vocabulary statistics: Heaps' and Zipf's laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking. Store pointers to every kth term string. Example below: k = 4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers

Lose 4 bytes on term lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix, so store differences only (used for the last k-1 terms in a block of k):
8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a1◊e2◊ic3◊ion

Encodes automat*; extra length beyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry. We store the list of docs containing a term in increasing order of docID:
computer: 33, 47, 154, 159, 202, …
Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …
Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:  824, 829, 215406
gaps:    824, 5, 214577   (the first entry is just the first docID)
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable

For a small gap (5), VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
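A sketch of VB encoding and decoding gaps, using the byte layout shown above (7 payload bits per byte, continuation bit set on the last byte of each number):

def vb_encode_number(n):
    # Split n into 7-bit chunks; mark the final byte by adding 128
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128
    return chunks

def vb_decode(stream):
    numbers, n = [], 0
    for b in stream:
        if b < 128:
            n = 128 * n + b            # continuation byte
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
code = [b for g in gaps for b in vb_encode_number(g)]
print([format(b, "08b") for b in code])   # the six bytes shown on the slide
print(vb_decode(code))                    # [824, 5, 214577]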

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g., K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection
67

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)
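A direct transcription of the weighting formula into code (base-10 logs, as above; a small sketch, not a library call):

import math

def tf_idf_weight(tf, df, N):
    # w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

print(tf_idf_weight(tf=3, df=100, N=1_000_000))   # roughly (1 + 0.477) * 4 = 5.9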

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d, for length-normalized q and d
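A small sketch showing that, after length-normalization, cosine similarity reduces to a plain dot product:

import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine(q, d):
    # For length-normalized vectors, cos(q, d) = q . d
    return sum(qi * di for qi, di in zip(normalize(q), normalize(d)))

print(cosine([1.0, 2.0, 0.0], [2.0, 1.0, 1.0]))   # roughly 0.73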

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search Systems: Making the Vector Space Model more efficient to compute
Approximating the actual correct results
Skipping unnecessary documents
In actual data: dealing with zones and fields, query term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A

Think of A as pruning non-contenders

The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q,d) = g(d) + cosine(q,d)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: precision-recall curve, with Precision on the y-axis and Recall on the x-axis, both from 0.0 to 1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: Agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement

Kappa (K) = [P(A) - P(E)] / [1 - P(E)]
P(A): proportion of time judges agree
P(E): what agreement would be by chance

Gives 0 for chance agreement, 1 for total agreement
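A small sketch of the kappa computation from a 2x2 table of two judges' relevant / non-relevant calls (the counts below are illustrative):

def kappa(both_rel, both_nonrel, only_a_rel, only_b_rel):
    total = both_rel + both_nonrel + only_a_rel + only_b_rel
    p_agree = (both_rel + both_nonrel) / total
    # Chance agreement P(E) from each judge's marginal probability of saying "relevant"
    p_rel_a = (both_rel + only_a_rel) / total
    p_rel_b = (both_rel + only_b_rel) / total
    p_chance = p_rel_a * p_rel_b + (1 - p_rel_a) * (1 - p_rel_b)
    return (p_agree - p_chance) / (1 - p_chance)

print(kappa(both_rel=300, both_nonrel=70, only_a_rel=20, only_b_rel=10))   # roughly 0.78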

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness

Probably the evaluation methodology that large search engines trust most
80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3 XML IR: Basic XML concepts, Challenges in XML IR, Vector space model for XML IR, Evaluation of XML IR

Chapter 9 Query Refinement:
1. Relevance Feedback (Document Level)
   Explicit RF: Rocchio (1971); When does it work?
   Variants: Implicit and Blind
2. Query Expansion (Term Level)
   Manual thesaurus
   Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

Dr = set of known relevant doc vectors
Dnr = set of known irrelevant doc vectors
(Different from Cr and Cnr, as we only get judgments from a few documents)
α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
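For reference, the Rocchio update the slide refers to (standard form, as in IIR 9.1.1):

q_m = α · q_0 + β · (1/|Dr|) · Σ_{dj ∈ Dr} dj  -  γ · (1/|Dnr|) · Σ_{dj ∈ Dnr} dj

i.e., move the query vector toward the centroid of the known relevant documents and away from the centroid of the known non-relevant ones.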

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: Simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix (rows t_i, columns d_j) and w_ij = (normalized) weight for (t_i, d_j)

For each t_i, pick terms with high values in C

Sec 923

In NLTK! Did you forget?
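A tiny numpy sketch of the term-term similarity matrix C = A·Aᵀ (toy matrix; normalization omitted):

import numpy as np

# Rows = terms t_i, columns = documents d_j (an M x N term-document matrix A)
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

C = A @ A.T   # C[i, j] = co-occurrence similarity of term i and term j
for i in range(C.shape[0]):
    # Highest off-diagonal entries in row i suggest thesaurus candidates for term i
    print(i, np.argsort(-C[i]))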

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

[Figure: an XML document with root Book and children Title ("Microsoft") and Author ("Bill Gates"), mapped to its lexicalized subtrees "with words", e.g. Author/"Bill", Author/"Gates", Title/"Microsoft", Book/Title/"Microsoft"]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes
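For reference, the context resemblance function as defined in IIR 10.3 (consistent with the CR values in the example below, e.g. (1+2)/(1+4) = 0.6):

CR(cq, cd) = (1 + |cq|) / (1 + |cd|)   if cq matches cd,   and 0 otherwise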

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

query: <c1, t>   e.g. author/"Bill"

Inverted index (dictionary → postings):
<c1, t>  (e.g. author/"Bill")                 → <d1,0.5> <d4,0.1> <d9,0.2>
<c2, t>  (e.g. title/"Bill")                  → <d2,0.25> <d3,0.1> <d12,0.09>
<c3, t>  (e.g. book/author/firstname/"Bill")  → <d3,0.7> <d6,0.8> <d9,0.5>

This example is slightly different from the book

CR(c1,c1) = 1.0
CR(c1,c2) = 0.0
CR(c1,c3) = 0.60

if wq = 1.0 then sim(q,d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(the factors are Context Resemblance × Query Term Weight × Document Term Weight)

All weights have been normalized

OK to ignore Query vs <c2,t> since CR = 0.0
Query vs <c1,t>; Query vs <c3,t>

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval: Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions:
Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice; a naïve assumption)

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios c_t for the terms in the query

The odds ratio is the ratio of two odds:
(i) the odds of the term appearing if relevant, p_t / (1 - p_t), and
(ii) the odds of the term appearing if non-relevant, u_t / (1 - u_t)

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents

c_t functions as a term weight, so that

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM
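For reference, the standard BIM term weight the slide refers to (IIR 11.3.1):

c_t = log [ p_t (1 - u_t) / ( u_t (1 - p_t) ) ] = log( p_t / (1 - p_t) ) + log( (1 - u_t) / u_t )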

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tf_{t,q}: term frequency of term t in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 between 1.2 and 2, and b = 0.75

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1 0 … 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
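A minimal power-iteration sketch (toy row-stochastic matrix; teleportation is assumed to be already folded into A):

import numpy as np

A = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.3, 0.4]])   # toy transition matrix, rows sum to 1

x = np.array([1.0, 0.0, 0.0])     # start with any distribution
for _ in range(100):              # multiply by increasing powers of A
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-12):
        break
    x = x_next
print(x)                          # steady state a, i.e. the PageRank vector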

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future:
Python: one of the easiest and more straightforward programming languages to use
NLTK: a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105




Page 40: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix – store differences only. Used for the last k-1 terms in a block of k: 8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a1◊e2◊ic3◊ion

Encodes "automat"; extra length beyond "automat"

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry. We store the list of docs containing a term in increasing order of docID: computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs: 824, 829, 215406
gaps: –, 5, 214577
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
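
A small encode/decode sketch reproducing the example above (a minimal illustration, not the assignment's reference code):

```python
def vb_encode_number(n):
    """Encode one gap as a list of bytes; the last byte has its high bit set."""
    bytes_out = []
    while True:
        bytes_out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_out[-1] += 128          # mark the final byte of this gap
    return bytes_out

def vb_decode(byte_stream):
    """Decode a stream of VB bytes back into the list of gaps."""
    numbers, n = [], 0
    for b in byte_stream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
encoded = [b for g in gaps for b in vb_encode_number(g)]
print([format(b, "08b") for b in encoded])
# ['00000110', '10111000', '10000101', '00001101', '00001100', '10110001']
print(vb_decode(encoded))   # [824, 5, 214577]
```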

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection

67

w_t,d = (1 + log10 tf_t,d) × log10(N / df_t)

Sec 622
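
The same weighting as a one-line function (a sketch; log base 10, N = collection size, df = document frequency):

```python
import math

def tfidf_weight(tf, df, N):
    """w_{t,d} = (1 + log10 tf) * log10(N / df); zero if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

print(round(tfidf_weight(tf=3, df=100, N=1_000_000), 3))   # (1 + log10 3) * 4 = 5.908...
```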

CS3245 ndash Information Retrieval

Queries as vectors. Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents". Key idea 2: Rank documents according to their proximity to the query in this space

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): cos(q, d) = q · d, for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search Systems: Making the Vector Space Model more efficient to compute; Approximating the actual correct results; Skipping unnecessary documents. In actual data: dealing with zones and fields, query term proximity

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.

Think of A as pruning non-contenders

The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics: Index Elimination (high-idf query terms only; docs containing many query terms)

Champion lists (generalized to high-low lists and tiered index)

Impact ordering of postings (early termination, idf-ordered terms)

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting

Indeed, any function of the two "signals" of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: Year = 1601 is an example of a field. Field or parametric index: postings for each field value. Sometimes build range (B-tree) trees (e.g. for dates)

Field query typically treated as conjunction (doc must be authored by shakespeare)

Zone: A zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References, …

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation: How do we know if our results are any good? Benchmarks; Precision and Recall; Composite measures; Test collection

A/B Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: precision (y-axis, 0.0 to 1.0) plotted against recall (x-axis, 0.0 to 1.0)]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: Agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement.

Kappa (K) = [P(A) - P(E)] / [1 - P(E)], where P(A) = proportion of time judges agree and P(E) = what agreement would be by chance

Gives 0 for chance agreement, 1 for total agreement
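
A quick sketch of that computation (the 2x2 judgment counts below are invented):

```python
def kappa(yes_yes, yes_no, no_yes, no_no):
    """Kappa from a 2x2 table of two judges' relevant/non-relevant counts."""
    total = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / total                          # P(A)
    p1_yes = (yes_yes + yes_no) / total                          # judge 1 marginal
    p2_yes = (yes_yes + no_yes) / total                          # judge 2 marginal
    p_chance = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)     # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

print(round(kappa(300, 20, 10, 70), 3))   # 0.776: substantial agreement beyond chance
```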

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness. Probably the evaluation methodology that large search engines trust most.

80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3 XML IR: Basic XML concepts; Challenges in XML IR; Vector space model for XML IR; Evaluation of XML IR

Chapter 9 Query Refinement: 1. Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2. Query Expansion: Term Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.

α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
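
The Rocchio formula itself is an image on the slide and is missing here; a sketch of the standard form it refers to, q_m = α q_0 + β · centroid(D_r) - γ · centroid(D_nr) (the vectors and weights below are invented):

```python
import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of relevant docs, away from irrelevant ones."""
    q = alpha * q0
    if len(rel_docs):
        q = q + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q = q - gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q, 0.0)          # negative term weights are usually clipped to 0

q0  = np.array([1.0, 0.0, 0.5])
dr  = np.array([[0.8, 0.2, 0.0], [0.6, 0.4, 0.0]])   # known relevant doc vectors
dnr = np.array([[0.0, 0.0, 0.9]])                    # known irrelevant doc vectors
print(rocchio(q0, dr, dnr))            # [1.525 0.225 0.365]
```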

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents

In query expansion, users give additional input (good/bad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus. Simplest way to compute one is based on term-term similarities in C = AA^T, where A is the term-document matrix; w_ij = (normalized) weight for (t_i, d_j)

For each ti pick terms with high values in C

[Figure: A is an M x N matrix with one row per term t_i and one column per document d_j]

Sec 923

In NLTK Did you forget

84
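
A small numpy sketch of picking neighbours from C = A·A^T (the terms and weights below are invented):

```python
import numpy as np

terms = ["car", "auto", "insurance", "flower"]
# A: one row per term, one column per document (toy, roughly normalized weights)
A = np.array([
    [0.9, 0.8, 0.0],
    [0.7, 0.9, 0.1],
    [0.1, 0.6, 0.0],
    [0.0, 0.0, 1.0],
])
C = A @ A.T                            # term-term similarity matrix

for i, t in enumerate(terms):
    ranked = [terms[j] for j in np.argsort(-C[i]) if j != i]
    print(t, "->", ranked[0])          # most similar other term, e.g. car -> auto
```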

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees. Aim to have each dimension of the vector space encode a word together with its position within the XML tree.

How? Map XML documents to lexicalized subtrees

85

[Figure: an XML document (Book containing Title "Microsoft" and Author "Bill Gates") is decomposed into lexicalized subtrees, e.g. Title with "Microsoft", Author with "Bill", Author with "Gates", Author with "Bill Gates", and Book/Title with "Microsoft" – i.e. structure "with words".]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR.

|cq| and |cd| are the number of nodes in the query path and document path, respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86
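
The CR function itself is an image on the slide and did not survive extraction; as defined in IIR Section 10.3 it is (stated here for completeness):

CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise.

For example, with query path author/"Bill" (2 nodes, counting the word) and document path book/author/firstname/"Bill" (4 nodes), CR = (1 + 2) / (1 + 4) = 0.6, the value used in the SimNoMerge example on the next slide.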

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query path: <c1, t>. Inverted index: dictionary entries <c1, t>, <c2, t>, <c3, t> with weighted postings such as <d1, 0.5>; <d4, 0.1>, <d9, 0.2>; <d2, 0.25>, <d3, 0.1>, <d12, 0.9>; <d3, 0.7>, <d6, 0.8>, <d9, 0.5>

This example is slightly different from the book.

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.60

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5

All weights have been normalized.

Example contexts: c1 = author/"Bill", c2 = title/"Bill", c3 = book/author/firstname/"Bill". The three factors in each product are Context Resemblance, Query Term Weight, and Document Term Weight.

OK to ignore Query vs <c2, t> since CR = 0.0; compute Query vs <c1, t> and Query vs <c3, t>.

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IR. Chapter 11: 1. Probabilistic Approach to Retrieval, Basic Probability Theory; 2. Probability Ranking Principle; 3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions: "Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g. document d represented by vector x = (x1, ..., xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.

Independence: no association between terms (not true, but works in practice – naïve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios for the terms in the query, c_t.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 - u_t).

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that operationally we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
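
The weight formula itself appears as an image on the slide; from IIR Section 11.3.1 the term weight and retrieval status value it refers to are

c_t = log [ p_t (1 - u_t) / ( u_t (1 - p_t) ) ]   and   RSV_d = sum of c_t over the query terms t present in d,

so a document's score is just the sum of the c_t weights of the query terms it contains.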

CS3245 ndash Information Retrieval

Okapi BM25, A Nonbinary Model: If the query is long, we might also use similar weighting for query terms

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single, fixed query). The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
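
A sketch of the full BM25 term contribution the slide describes (parameter names follow IIR Section 11.4.3; the toy numbers are invented):

```python
import math

def bm25_term(tf_td, tf_tq, df_t, N, doc_len, avg_doc_len, k1=1.5, b=0.75, k3=1.5):
    """Contribution of one term t to BM25(q, d), including the query-side factor."""
    idf = math.log10(N / df_t)
    doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
    query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)   # note: no length normalization of queries
    return idf * doc_part * query_part

# toy numbers: term occurs 3x in a slightly longer-than-average doc, once in the query
print(round(bm25_term(tf_td=3, tf_tq=1, df_t=1000, N=100000, doc_len=120, avg_doc_len=100), 3))
```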

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. Difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q). By Bayes' rule, P(d|q) = P(q|d) P(d) / P(q):

P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4)

P(q|d) is the probability of q given d. So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
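
The equation is an image on the slide; a sketch of the standard query-likelihood mixture it refers to (linear interpolation of the document model with the collection model; the counts below are invented):

```python
def mixture_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """P(q|d) = product over query terms of  lam*P(t|Md) + (1-lam)*P(t|Mc)."""
    score = 1.0
    for t in query_terms:
        p_doc  = doc_tf.get(t, 0) / doc_len     # document language model
        p_coll = coll_tf.get(t, 0) / coll_len   # collection language model (smoothing)
        score *= lam * p_doc + (1 - lam) * p_coll
    return score

doc_tf  = {"click": 2, "shears": 1}                      # made-up counts
coll_tf = {"click": 10, "shears": 2, "the": 500}
print(mixture_score(["click", "shears"], doc_tf, doc_len=10, coll_tf=coll_tf, coll_len=1000))
```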

CS3245 ndash Information Retrieval

Week 12 Web Search: Chapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling: Politeness – need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)

Duplicates – need to integrate duplicate detection

Scale – need to use a distributed approach

Spam and spider traps – need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202
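
Python's standard library can parse such files; a minimal usage sketch (the crawler name and URLs are placeholders, not from the slides):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()    # fetches and parses the file; when crawling, cache this object per site

print(rp.can_fetch("MyCrawler", "https://www.example.com/yoursite/temp/page.html"))
print(rp.can_fetch("searchengine", "https://www.example.com/anything.html"))
```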

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0))
2. After one step, we're at xA
3. After two steps at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm multiply x by increasing powers of A until the product looks stable
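
A numpy sketch of that power iteration (the 3-page transition matrix below, with teleporting already folded in, is just an example):

```python
import numpy as np

# Row-stochastic transition matrix A for 3 pages (links plus teleporting already folded in)
A = np.array([
    [0.05, 0.90, 0.05],
    [0.45, 0.10, 0.45],
    [0.05, 0.90, 0.05],
])

x = np.array([1.0, 0.0, 0.0])          # start with any distribution
for _ in range(100):                   # multiply by increasing powers of A
    x_next = x @ A
    if np.abs(x_next - x).sum() < 1e-10:   # stop when the product looks stable
        break
    x = x_next

print(np.round(x, 4))                  # steady state a, here [0.25 0.5 0.25]: the pagerank scores
```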

CS3245 ndash Information Retrieval

100

Pagerank summary. Pre-processing: Given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.

Query processing: Retrieve pages meeting the query; rank them by their pagerank. Order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future: Python – one of the easiest and more straightforward programming languages to use; NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103

Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105




eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 43: Search Advertising, Duplicate Detection and Revision (zhaojin/cs3245_2019/w13.pdf)

CS3245 ndash Information Retrieval

Week 2: Basic (Boolean) IR
 Basic inverted indexes: in-memory dictionary and on-disk postings
 Key characteristic: sorted order for postings
 Boolean query processing: intersection by linear-time "merging"
 Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps: Dictionary & Postings
 Multiple term entries in a single document are merged
 Split into Dictionary and Postings
 Doc frequency information is also stored

Why frequency?

44
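To make these indexer steps concrete, here is a minimal sketch (illustrative only, not the course's reference implementation) that merges duplicate term entries per document and splits the result into a dictionary with document frequencies plus docID-sorted postings lists:

def build_index(docs):
    """docs: dict mapping docID -> list of tokens. Returns (dictionary, postings)."""
    postings = {}                                  # term -> sorted list of docIDs
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id])):     # merge multiple entries per doc
            postings.setdefault(term, []).append(doc_id)
    # the dictionary stores the document frequency alongside each term
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, postings

# Example:
# d, p = build_index({1: ["to", "be", "or", "not", "to", "be"], 2: ["to", "do"]})
# d == {"be": 1, "do": 1, "not": 1, "or": 1, "to": 2}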

CS3245 ndash Information Retrieval

The merge: Walk through the two postings lists simultaneously, in time linear in the total number of postings entries

Brutus:  2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
Caesar:  1 -> 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34
Result:  2 -> 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID

Sec 13

45
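A minimal Python sketch of this linear-time intersection of two docID-sorted postings lists (names are illustrative):

def intersect(p1, p2):
    """Intersect two postings lists sorted by docID, in O(x+y) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]) -> [2, 8]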

CS3245 ndash Information Retrieval

Week 3: Postings lists and Choosing terms
 Faster merging of posting lists: skip pointers
 Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries
 Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

47

Postings: 2 -> 4 -> 8 -> 41 -> 48 -> 64 -> 128, with skip pointers to 41 and 128
Postings: 1 -> 2 -> 3 -> 8 -> 11 -> 17 -> 21 -> 31, with skip pointers to 11 and 31

Sec 23

CS3245 ndash Information Retrieval

Biword indexes
 Index every consecutive pair of terms in the text as a phrase: bigram model using words
 For example, the text "Friends Romans Countrymen" would generate the biwords "friends romans" and "romans countrymen"
 Each of these biwords is now a dictionary term
 Two-word phrase query-processing is now immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries, we use a merge algorithm recursively, at the document level
Now need to deal with more than just equality

49

<be: 993427; 1: <7, 18, 33, 72, 86, 231>; 2: <3, 149>; 4: <17, 191, 291, 430, 434>; 5: <363, 367, ...>>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary: Hash, Trees

Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic – Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table: Each vocabulary term is hashed to an integer
 Pros: Lookup is faster than for a tree: O(1)
 Cons:
  No easy way to find minor variants: judgment/judgement
  No prefix search
  If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries
 How can we handle *'s in the middle of a query term, e.g. co*tion?
 We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
 The solution: transform wild-card queries so that the * always occurs at the end
 This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering

53

Sec 32
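A small illustrative sketch of the permuterm idea under the usual assumptions (a '$' end marker, a single * per query): index every rotation of term$, and rotate the query so the wildcard ends up at the end.

def permuterm_rotations(term):
    """All rotations of term + '$'; each rotation points back to the original term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def wildcard_to_prefix(query):
    """Rotate a single-* wildcard query so the * occurs at the end.
    'co*tion' -> 'tion$co', then do a prefix search for 'tion$co*'."""
    before, after = query.split("*")
    return after + "$" + before

# permuterm_rotations("hello") -> ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
# wildcard_to_prefix("co*tion") == "tion$co", which matches the rotation 'tion$coop' of "cooption"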

CS3245 ndash Information Retrieval

Spelling correction
 Isolated word correction: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
 How do we define "closest"? We studied several alternatives:
 1. Edit distance (Levenshtein distance)
 2. Weighted edit distance
 3. ngram overlap
 Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits

54

Sec 332
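A compact dynamic-programming sketch of unweighted Levenshtein edit distance, as revised above (variable names are my own):

def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

# edit_distance("dof", "dog") == 1; edit_distance("cat", "catapult") == 5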

CS3245 ndash Information Retrieval

Week 5: Index construction
 Sort-based indexing: Blocked Sort-Based Indexing
  Merge sort is effective for disk-based sorting (avoid seeks)
 Single-Pass In-Memory Indexing
  No global dictionary – generate a separate dictionary for each block
  Don't sort postings – accumulate postings as they occur
 Distributed indexing using MapReduce
 Dynamic indexing: multiple indices, logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI: Single-pass in-memory indexing

Key idea 1: Generate separate dictionaries for each block – no need to maintain term-termID mapping across blocks
Key idea 2: Don't sort. Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing: MapReduce Data flow

58

[Figure: segment files are assigned by a master to parsers, which emit (term, docID) pairs partitioned into key ranges a-b, c-d, ..., y-z (map phase); inverters then collect each partition and write out the postings (reduce phase).]

CS3245 ndash Information Retrieval

Dynamic Indexing: 2nd simplest approach
 Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index
 Search across both, merge results
 Deletions: invalidation bit-vector for deleted docs; filter docs output on a search result by this invalidation bit-vector
 Periodically re-index into one main index
 Assuming T = total number of postings and n = size of the auxiliary index, we touch each posting up to floor(T/n) times
 Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge
 Idea: maintain a series of indexes, each twice as large as the previous one
 Keep the smallest (Z0) in memory; larger ones (I0, I1, ...) on disk
 If Z0 gets too big (> n), write to disk as I0, or merge with I0 (if I0 already exists) as Z1
 Either write Z1 to disk as I1 (if no I1), or merge with I1 to form Z2
 ... etc.

Sec 45

60
Information Retrieval
(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6: Index Compression
 Collection and vocabulary statistics: Heaps' and Zipf's laws
 Compression to make the index smaller and faster
  Dictionary compression for Boolean indexes: dictionary string, blocks, front coding
  Postings compression: gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding
 Sorted words commonly have a long common prefix – store differences only
 Used for the last k-1 terms in a block of k:

 8automata8automate9automatic10automation

Information Retrieval 63

 -> 8automat*a 1◊e 2◊ic 3◊ion

 ("automat*" encodes the shared prefix automat; the number before ◊ is the extra length beyond automat)

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
 We store the list of docs containing a term in increasing order of docID:
  computer: 33, 47, 154, 159, 202 ...
 Consequence: it suffices to store gaps: 33, 14, 107, 5, 43 ...
 Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example

docIDs:  824                 829       215406
gaps:    (824)               5         214577
VB code: 00000110 10111000   10000101  00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001
Key property: VB-encoded postings are uniquely prefix-decodable
For a small gap (5), VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
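A minimal sketch of VB encoding/decoding of gaps, matching the example above (the continuation bit is set on the last byte of each gap):

def vb_encode_number(n):
    """Encode one gap as a list of byte values."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                 # set the continuation bit on the last byte
    return out

def vb_decode(byte_stream):
    """Decode a concatenated VB byte stream back into gaps."""
    numbers, n = [], 0
    for b in byte_stream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

# vb_encode_number(824) -> [6, 184]   (00000110 10111000)
# vb_encode_number(5)   -> [133]      (10000101)
# vb_decode(vb_encode_number(824) + vb_encode_number(5) + vb_encode_number(214577)) == [824, 5, 214577]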

CS3245 ndash Information Retrieval

Week 7: Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf
Increases with the number of occurrences within a document
Increases with the rarity of the term in the collection

67

Sec 622
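A one-function sketch of the tf-idf weight in the formula above (log base 10; names are illustrative):

import math

def tf_idf_weight(tf_td, df_t, N):
    """w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf_td == 0 or df_t == 0:
        return 0.0
    return (1 + math.log10(tf_td)) * math.log10(N / df_t)

# e.g. tf = 3 occurrences, df = 10 docs, N = 1,000,000 docs:
# tf_idf_weight(3, 10, 1_000_000) ~= 1.477 * 5 ~= 7.39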

CS3245 ndash Information Retrieval

Queries as vectors
 Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents"
 Key idea 2: Rank documents according to their proximity to the query in this space
  proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity
 Motivation: Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_i q_i d_i,   for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
 Making the Vector Space Model more efficient to compute
  Approximating the actual correct results
  Skipping unnecessary documents
 In actual data: dealing with zones and fields, query term proximity

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
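The slide above refers to the term-at-a-time cosine scoring algorithm; below is a hedged Python sketch, under the assumption that postings store (docID, tf-idf weight) pairs and that document vector lengths are precomputed:

import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_lengths, K=10):
    """query_weights: term -> w_{t,q}; postings: term -> [(docID, w_{t,d}), ...];
    doc_lengths: docID -> vector length used for normalization."""
    scores = defaultdict(float)
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_tq * w_td           # accumulate dot-product contributions
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]       # length-normalize
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])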

CS3245 ndash Information Retrieval

Generic approach
 Find a set A of contenders, with K < |A| << N
 A does not necessarily contain the top K, but has many docs from among the top K
 Return the top K docs in A
 Think of A as pruning non-contenders
 The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711

[Figure: the collection of N docs, the contender set A, and the top K shown as nested sets]

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score
 Consider a simple total score combining cosine relevance and authority:

 net-score(q, d) = g(d) + cosine(q, d)

 Can use some other linear combination than an equal weighting
 Indeed, any function of the two "signals" of user happiness
 Now we seek the top K docs by net score

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve, with precision on the y-axis and recall on the x-axis, both ranging from 0.0 to 1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure
 Agreement measure among judges
 Designed for categorical judgments
 Corrects for chance agreement

 Kappa (κ) = [P(A) − P(E)] / [1 − P(E)]
  P(A) – proportion of the time the judges agree
  P(E) – what agreement would be by chance

 Gives 0 for chance agreement, 1 for total agreement
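A small worked example with made-up counts (not from the lecture):

def kappa(p_agree, p_chance):
    """Kappa = [P(A) - P(E)] / [1 - P(E)]."""
    return (p_agree - p_chance) / (1 - p_chance)

# Suppose two judges assess 400 query-doc pairs and agree on 370 of them,
# so P(A) = 370/400 = 0.925. Judge 1 says "relevant" 80% of the time and
# judge 2 says 90%, so P(E) = 0.8*0.9 + 0.2*0.1 = 0.74.
# kappa(0.925, 0.74) ~= 0.71 (good agreement, well above chance)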

CS3245 ndash Information Retrieval

A/B testing
 Purpose: Test a single innovation
 Prerequisite: You have a large system with many users
 Have most users use the old system
 Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation
 Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result
 Now we can directly see if the innovation does improve user happiness
 Probably the evaluation methodology that large search engines trust most

80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors
Dnr = set of known irrelevant doc vectors
 Different from Cr and Cnr, as we only get judgments from a few documents
α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
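The formula box on this slide did not survive extraction; for reference, the standard Rocchio update (as used in the SMART system) is

\vec{q}_m = \alpha\,\vec{q}_0 \;+\; \beta\,\frac{1}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j \;-\; \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j

with negative components of the resulting query vector typically clipped to 0.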

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
 Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the term-document matrix
 w_ij = (normalized) weight for (t_i, d_j)
 For each t_i, pick terms with high values in C

[Figure: the M x N term-document matrix A, with rows t_i and columns d_j]

In NLTK! Did you forget?

84
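A tiny sketch (assuming a dense NumPy term-document matrix A whose rows are already normalized) of picking high-value neighbours from C = AAᵀ:

import numpy as np

def co_occurrence_neighbours(A, terms, k=3):
    """A: M x N term-document matrix (rows = terms). Returns top-k similar terms per term."""
    C = A @ A.T                               # term-term similarity matrix
    np.fill_diagonal(C, -np.inf)              # ignore self-similarity
    neighbours = {}
    for i, term in enumerate(terms):
        top = np.argsort(C[i])[::-1][:k]      # indices of the k highest values in row i
        neighbours[term] = [terms[j] for j in top]
    return neighbours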

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
 Aim to have each dimension of the vector space encode a word together with its position within the XML tree
 How? Map XML documents to lexicalized subtrees

85

[Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is decomposed into lexicalized subtrees ("with words"), e.g. Author->Bill, Author->Gates, Title->Microsoft, Book->Title->Microsoft, Book->Author->Bill Gates, ...]

CS3245 ndash Information Retrieval

Context resemblance
 A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:
 |cq| and |cd| are the number of nodes in the query path and document path, respectively
 cq matches cd iff we can transform cq into cd by inserting additional nodes

86
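The CR formula itself was lost in extraction; as defined in IIR Chapter 10 it is

CR(c_q, c_d) = \begin{cases} \dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\[4pt] 0 & \text{otherwise} \end{cases}

For example, CR(author/"Bill", book/author/firstname/"Bill") = (1+2)/(1+4) = 0.6, which matches the value used in the SimNoMerge example on the next slide.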

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t> with query term weight wq, e.g. c1 = author/"Bill"

Inverted index (dictionary -> postings):
 <c1, t> (e.g. author/"Bill"):                 <d1, 0.5>  <d4, 0.1>  <d9, 0.2>
 <c2, t> (e.g. title/"Bill"):                  <d2, 0.25> <d3, 0.1>  <d12, 0.09>
 <c3, t> (e.g. book/author/firstname/"Bill"):  <d3, 0.7>  <d6, 0.8>  <d9, 0.5>

This example is slightly different from the book.

Context resemblance: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60
All weights have been normalized.

Each contribution is Context Resemblance x Query Term Weight x Document Term Weight:
 Query vs <c1, t>: 1.0 x 1.0 x 0.2; Query vs <c3, t>: 0.6 x 1.0 x 0.5
 If wq = 1.0, then sim(q, d9) = (1.0 x 1.0 x 0.2) + (0.6 x 1.0 x 0.5) = 0.5
 OK to ignore Query vs <c2, t> since CR = 0.0

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV
 We can rank documents using the log odds ratios for the terms in the query, c_t
 The odds ratio is the ratio of two odds:
  (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and
  (ii) the odds of the term appearing if nonrelevant, u_t / (1 − u_t)
 c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents
 c_t functions as a term weight (the formula is restated after this slide)
 Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
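The c_t definition, which appeared as an image on the slide, is the standard BIM log odds ratio; restating it here:

c_t = \log \frac{p_t\,(1 - u_t)}{u_t\,(1 - p_t)}, \qquad RSV_d = \sum_{t \in q \cap d} c_t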

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
 If the query is long, we might also use similar weighting for query terms
  tf_tq: term frequency in the query q
  k3: tuning parameter controlling term frequency scaling of the query
  No length normalization of queries (because retrieval is being done with respect to a single, fixed query)
 The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
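For reference (the slide's equation image was lost), the full Okapi BM25 scoring function with query-term weighting, as given in IIR 11.4.3, is

RSV_d = \sum_{t \in q} \log\!\left[\frac{N}{\mathrm{df}_t}\right] \cdot \frac{(k_1 + 1)\,\mathrm{tf}_{td}}{k_1\bigl((1-b) + b\,(L_d / L_{ave})\bigr) + \mathrm{tf}_{td}} \cdot \frac{(k_3 + 1)\,\mathrm{tf}_{tq}}{k_3 + \mathrm{tf}_{tq}}

where L_d is the length of document d and L_ave is the average document length in the collection.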

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
 The difference between 'vector space' and 'probabilistic' IR is not that great
 In either case you build an information retrieval scheme in the exact same way
 Difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR
 Each document is treated as (the basis for) a language model
 Given a query q, rank documents based on P(d|q):
  P(q) is the same for all documents, so ignore
  P(d) is the prior – often treated as the same for all d
   But we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4)
  P(q|d) is the probability of q given d
 So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
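A hedged sketch of query-likelihood scoring with a linear mixture of the document and collection language models; the value of lambda and the simple counting scheme are illustrative choices, not the lecture's reference code:

import math

def mixture_score(query_terms, doc_terms, collection_terms, lam=0.5):
    """log P(q|d) with P(t|Md) mixed with the collection model P(t|Mc).
    Assumes every query term occurs at least once in the collection
    (otherwise the mixture probability would be 0 and log would fail)."""
    score = 0.0
    for t in query_terms:
        p_doc = doc_terms.count(t) / len(doc_terms)
        p_col = collection_terms.count(t) / len(collection_terms)
        score += math.log(lam * p_doc + (1 - lam) * p_col)
    return score

# Higher (less negative) scores mean the document's model is more likely
# to have generated the query; rank documents by this score.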

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt
 Protocol for giving crawlers ("robots") limited access to a website, originally from 1994
 Example:

 User-agent: *
 Disallow: /yoursite/temp/

 User-agent: searchengine
 Disallow:

 Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
 1. Start with any distribution (say x = (1, 0, ..., 0))
 2. After one step, we're at xA
 3. After two steps at xA², then xA³, and so on
 4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
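A small NumPy sketch of the power iteration described above; A is assumed to be the row-stochastic transition matrix with teleportation already folded in:

import numpy as np

def pagerank_power_iteration(A, tol=1e-10, max_iter=1000):
    """Multiply x by increasing powers of A until the product looks stable."""
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                       # start with the distribution (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A               # one step of the chain
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

# Example (made-up 2-state transition matrix):
# A = np.array([[0.25, 0.75], [0.5, 0.5]])
# pagerank_power_iteration(A) -> approximately [0.4, 0.6]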

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 44: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 – Information Retrieval

An Appraisal of Probabilistic Models
 The difference between 'vector space' and 'probabilistic' IR is not that great:
  In either case you build an information retrieval scheme in the exact same way.
  Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec. 11.4.1

93

CS3245 – Information Retrieval

Using language models in IR
 Each document is treated as (the basis for) a language model.
 Given a query q, rank documents based on P(d|q). By Bayes' rule:

  P(d|q) = P(q|d) P(d) / P(q)

 P(q) is the same for all documents, so ignore it.
 P(d) is the prior – often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g., those with high static quality score g(d) (cf. Section 7.14).
 P(q|d) is the probability of q given d.
 So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec. 12.2

CS3245 – Information Retrieval

Mixture model: Summary

 P(q|d) ∝ Π_{t ∈ q} ( λ P(t|M_d) + (1 − λ) P(t|M_c) )

 What we model: the user has a document in mind and generates the query from this document.
 The equation represents the probability that the document the user had in mind was in fact this one.

95

Sec. 12.2.2
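A minimal sketch of query-likelihood ranking with this mixture (Jelinek-Mercer) smoothing; the two-document collection below is made up:

from collections import Counter

docs = {"d1": "information retrieval is fun".split(),
        "d2": "probabilistic models of retrieval".split()}
collection = [t for toks in docs.values() for t in toks]
cf = Counter(collection)

def query_likelihood(query, doc_tokens, lam=0.5):
    tf = Counter(doc_tokens)
    score = 1.0
    for t in query:
        p_md = tf[t] / len(doc_tokens)   # document language model
        p_mc = cf[t] / len(collection)   # collection model (smoothing)
        score *= lam * p_md + (1 - lam) * p_mc
    return score

for d, toks in docs.items():
    print(d, query_likelihood(["retrieval", "models"], toks))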

CS3245 – Information Retrieval

Week 12: Web Search
 Chapter 20: Crawling
 Chapter 21: Link Analysis

96

Ch. 20-21

CS3245 – Information Retrieval

Crawling
 Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days).
 Duplicates: need to integrate duplicate detection.
 Scale: need to use a distributed approach.
 Spam and spider traps: need to integrate spam detection.

Information Retrieval 97

Sec. 20.1

CS3245 – Information Retrieval

Robots.txt
 Protocol for giving crawlers ("robots") limited access to a website, originally from 1994.
 Example:

  User-agent: *
  Disallow: /yoursite/temp/

  User-agent: searchengine
  Disallow:

 Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec. 20.2
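For illustration (not from the slides), Python's standard library can parse and apply robots.txt rules; the URL below is a placeholder site:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # hypothetical site
rp.read()                                          # fetch and parse once, then cache
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))
print(rp.can_fetch("*", "http://www.example.com/index.html"))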

CS3245 – Information Retrieval

99

Pagerank algorithm

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0)).
2. After one step, we're at xA.
3. After two steps, we're at xA², then xA³, and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.

Information Retrieval

Sec. 21.2.2
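A sketch of this power iteration, assuming a toy 3×3 transition matrix A (teleporting already folded in); not the course's reference code:

import numpy as np

A = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.1, 0.5],
              [0.5, 0.4, 0.1]])    # rows sum to 1

x = np.array([1.0, 0.0, 0.0])      # start with any distribution
for _ in range(100):               # multiply by increasing powers of A
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-12):   # "looks stable"
        break
    x = x_next
print(x)                           # the steady-state vector a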

CS3245 ndash Information Retrieval

100

Pagerank summary
 Pre-processing: given a graph of links, build the matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
 Query processing: retrieve pages meeting the query and rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives
 You have learnt how to build your own search engine.

 In addition, you have picked up skills that are useful for your future:
  Python – one of the easiest and most straightforward programming languages to use
  NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103  Photo credits: http://www.flickr.com/photos/aperezdc/

Related areas: Adversarial IR, Boolean IR/VSM, Probabilistic IR, Computational Advertising, Recommendation Systems, Digital Libraries, Natural Language Processing, Question Answering, IR Theory, Distributed IR, Geographic IR, Social Network IR, Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105


Page 45: Search Advertising, Duplicate Detection and Revision (zhaojin/cs3245_2019/w13.pdf)

CS3245 – Information Retrieval

The merge
 Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

  Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
  Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34

 If the list lengths are x and y, the merge takes O(x+y) operations.
 Crucial: postings must be sorted by docID.

Sec. 1.3

45
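A sketch of the linear-time intersection merge described above; plain Python lists stand in for the postings lists:

def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):   # walk both lists simultaneously
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]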

CS3245 – Information Retrieval

Week 3: Postings lists and Choosing terms
 Faster merging of postings lists: skip pointers
 Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries
 Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming

46

Ch. 2

CS3245 – Information Retrieval

Adding skip pointers to postings
 Done at indexing time.
 Why? To skip postings that will not figure in the search results.
 How to do it? And where do we place skip pointers?

  2 → 4 → 8 → 41 → 48 → 64 → 128   (skip pointers: 2 → 41, 41 → 128)
  1 → 2 → 3 → 8 → 11 → 17 → 21 → 31   (skip pointers: 1 → 11, 11 → 31)

47

Sec. 2.3
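One possible way to use such skip pointers during intersection; the (docID, skip_index) representation below is an assumption made for this sketch, not the course's data structure:

import math

def add_skips(postings):
    # attach roughly sqrt(n) evenly spaced skip pointers
    n = len(postings)
    step = int(math.sqrt(n)) or 1
    return [(doc, i + step if i % step == 0 and i + step < n else None)
            for i, doc in enumerate(postings)]

def intersect_with_skips(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        d1, skip1 = p1[i]
        d2, skip2 = p2[j]
        if d1 == d2:
            answer.append(d1); i += 1; j += 1
        elif d1 < d2:
            # follow the skip pointer only if it still lands at or before d2
            i = skip1 if skip1 is not None and p1[skip1][0] <= d2 else i + 1
        else:
            j = skip2 if skip2 is not None and p2[skip2][0] <= d1 else j + 1
    return answer

a = add_skips([2, 4, 8, 41, 48, 64, 128])
b = add_skips([1, 2, 3, 8, 11, 17, 21, 31])
print(intersect_with_skips(a, b))   # [2, 8]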

CS3245 – Information Retrieval

Biword indexes
 Index every consecutive pair of terms in the text as a phrase (a bigram model using words).
 For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
 Each of these biwords is now a dictionary term.
 Two-word phrase query processing is now immediate.

48

Sec. 2.4.1

CS3245 – Information Retrieval

Positional index example

 <be: 993427;
  1: 7, 18, 33, 72, 86, 231;
  2: 3, 149;
  4: 17, 191, 291, 430, 434;
  5: 363, 367; …>

 For phrase queries, we use a merge algorithm recursively at the document level.
 Now we need to deal with more than just equality.

Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?

49

Sec. 2.4.2
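A rough sketch (made-up positions, not the textbook's pseudocode) of the positional check needed for a two-word phrase like "to be":

def phrase_pairs(pos1, pos2):
    # docIDs where some position of term2 equals a position of term1 plus 1
    hits = []
    for doc in pos1.keys() & pos2.keys():
        p2 = set(pos2[doc])
        if any(p + 1 in p2 for p in pos1[doc]):
            hits.append(doc)
    return sorted(hits)

to = {1: [1, 6, 17], 4: [8, 16, 190], 5: [362, 366]}
be = {1: [7, 18, 33], 2: [3, 149], 4: [17, 191], 5: [363, 367]}
print(phrase_pairs(to, be))   # [1, 4, 5]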

CS3245 – Information Retrieval

Choosing terms
 Definitely language-specific.
 In English, we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.

50

CS3245 – Information Retrieval

Week 4: The dictionary and tolerant retrieval

Data structures for the dictionary: hash tables, trees

Learning to be tolerant:
1. Wildcards: general, trees, Permuterm, N-grams redux
2. Spelling correction: edit distance, N-grams re-redux
3. Phonetic – Soundex

51

CS3245 – Information Retrieval

Hash Table
 Each vocabulary term is hashed to an integer.
 Pros: lookup is faster than for a tree: O(1)
 Cons:
  No easy way to find minor variants: judgment/judgement
  No prefix search
  If the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything

Not very tolerant!

52

Sec. 3.1

CS3245 – Information Retrieval

Wildcard queries
 How can we handle *'s in the middle of a query term? E.g., co*tion
 We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
 The solution: transform wild-card queries so that the * always occurs at the end.
 This gives rise to the Permuterm index.
 Alternative: k-gram index with postfiltering.

53

Sec. 3.2
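A tiny sketch of the Permuterm rotation idea (illustrative only):

def permuterm_rotations(term):
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(permuterm_rotations("hello"))
# Query co*tion: rotate so the * is at the end -> "tion$co*";
# any indexed rotation with prefix "tion$co" is a candidate match.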

CS3245 – Information Retrieval

Spelling correction
 Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q.
 How do we define "closest"? We studied several alternatives:
  1. Edit distance (Levenshtein distance)
  2. Weighted edit distance
  3. n-gram overlap
 Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.

54

Sec. 3.3.2
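A compact Levenshtein edit distance sketch to go with the summary above (standard dynamic programming, not the assignment's code):

def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("judgment", "judgement"))   # 1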

CS3245 – Information Retrieval

Week 5: Index construction
 Sort-based indexing: Blocked Sort-Based Indexing – merge sort is effective for disk-based sorting (avoid seeks)
 Single-Pass In-Memory Indexing: no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur)
 Distributed indexing using MapReduce
 Dynamic indexing: multiple indices, logarithmic merge

55

CS3245 – Information Retrieval

BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
 8-byte (4+4) records (termID, docID).
 As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes; these are generated as we parse docs.
 Must now sort 100M 8-byte records by termID.
 Define a block as ~10M such records: can easily fit a couple into memory; will have 10 such blocks for our collection.
 Basic idea of the algorithm:
  Accumulate postings for each block, sort, write to disk.
  Then merge the blocks into one long sorted order.

Sec. 4.2

56 Information Retrieval

CS3245 – Information Retrieval

SPIMI: Single-pass in-memory indexing
 Key idea 1: Generate separate dictionaries for each block – no need to maintain a term-termID mapping across blocks.
 Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
 With these two ideas we can generate a complete inverted index for each block.
 These separate indices can then be merged into one big index.

Sec. 4.3

57

CS3245 – Information Retrieval

Distributed Indexing: MapReduce Data flow

58

[Figure: Map phase – a Master assigns segment files (splits) to Parsers, which emit term partitions a-b, c-d, …, y-z; Reduce phase – Inverters each collect one partition and write its postings.]

CS3245 – Information Retrieval

Dynamic Indexing: 2nd simplest approach
 Maintain a "big" main index.
 New docs go into a "small" (in-memory) auxiliary index.
 Search across both, merge results.
 Deletions: keep an invalidation bit-vector for deleted docs, and filter docs output on a search result by this invalidation bit-vector.
 Periodically re-index into one main index.
 Assuming T total postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times.
 Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec. 4.5

59

CS3245 – Information Retrieval

Logarithmic merge
 Idea: maintain a series of indexes, each twice as large as the previous one.
 Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
 If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
 Either write merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2.
 … etc. (loop over the log levels)

Sec. 4.5

60 Information Retrieval

CS3245 – Information Retrieval

Week 6: Index Compression
 Collection and vocabulary statistics: Heaps' and Zipf's laws
 Compression to make the index smaller and faster:
  Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding
  Postings compression: gap encoding

61

CS3245 – Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking
 Store pointers to every kth term string (example below: k = 4).
 Need to store term lengths (1 extra byte).

  …7systile9syzygetic8syzygial6syzygy11szaibelyite…

 Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

62

Sec. 5.2

CS3245 – Information Retrieval

Front coding
 Sorted words commonly have a long common prefix – store differences only.
 Used for the last k-1 terms in a block of k:

  8automata 8automate 9automatic 10automation
  → 8automat*a 1◊e 2◊ic 3◊ion

 ("automat" is encoded once; each number counts the extra length beyond "automat".)
 Begins to resemble general string compression.

Information Retrieval 63

Sec. 5.2

CS3245 – Information Retrieval

Postings Compression: Postings file entry
 We store the list of docs containing a term in increasing order of docID:
  computer: 33, 47, 154, 159, 202, …
 Consequence: it suffices to store gaps:
  33, 14, 107, 5, 43, …
 Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec. 5.3

CS3245 – Information Retrieval

Variable Byte Encoding Example

 docIDs:  824                  829        215406
 gaps:                         5          214577
 VB code: 00000110 10111000    10000101   00001101 00001100 10110001

 Postings stored as the byte concatenation:
 00000110 10111000 10000101 00001101 00001100 10110001

 Key property: VB-encoded postings are uniquely prefix-decodable.
 For a small gap (5), VB uses a whole byte.
 (824 = 512 + 256 + 32 + 16 + 8)

65

Sec. 5.3
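A sketch of Variable Byte encoding/decoding for gap lists (this particular implementation is illustrative, not reference code):

def vb_encode_number(n):
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                 # set the continuation bit on the last byte
    return out

def vb_encode(gaps):
    return [b for g in gaps for b in vb_encode_number(g)]

def vb_decode(stream):
    numbers, n = [], 0
    for b in stream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

print(vb_decode(vb_encode([824, 5, 214577])))   # [824, 5, 214577]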

CS3245 – Information Retrieval

Week 7: Vector space ranking
1. Represent the query as a weighted tf-idf vector.
2. Represent each document as a weighted tf-idf vector.
3. Compute the cosine similarity score for the query vector and each document vector.
4. Rank documents with respect to the query by score.
5. Return the top K (e.g., K = 10) to the user.

66

CS3245 – Information Retrieval

tf-idf weighting
 The tf-idf weight of a term is the product of its tf weight and its idf weight:

  w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

 Best known weighting scheme in IR.
 Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.
 Increases with the number of occurrences within a document.
 Increases with the rarity of the term in the collection.

67

Sec. 6.2.2

CS3245 – Information Retrieval

Queries as vectors
 Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents".
 Key idea 2: Rank documents according to their proximity to the query in this space.
 proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity
 Motivation: want to rank more relevant documents higher than less relevant documents.

68

Sec. 6.3

CS3245 – Information Retrieval

Cosine for length-normalized vectors
 For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

  cos(q, d) = q · d = Σ_i q_i d_i    (for length-normalized q and d)

Information Retrieval 69
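Tying the last few slides together, a brief sketch of tf-idf weighting, length normalization and dot-product cosine scoring over a made-up toy collection:

import math
from collections import Counter

docs = {"d1": "new york times".split(),
        "d2": "new york post".split(),
        "d3": "los angeles times".split()}
N = len(docs)
df = Counter(t for toks in docs.values() for t in set(toks))

def tfidf_vector(tokens):
    tf = Counter(tokens)
    w = {t: (1 + math.log10(tf[t])) * math.log10(N / df[t]) for t in tf if t in df}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}           # length-normalize

def cosine(q, d):
    return sum(w * d.get(t, 0.0) for t, w in q.items())  # dot product

q = tfidf_vector("new new times".split())
for name, toks in docs.items():
    print(name, round(cosine(q, tfidf_vector(toks)), 3))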

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 – Information Retrieval

Week 8: Complete Search Systems
 Making the Vector Space Model more efficient to compute:
  Approximating the actual correct results
  Skipping unnecessary documents
 In actual data: dealing with zones and fields, query term proximity

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 – Information Retrieval

Generic approach
 Find a set A of contenders, with K < |A| << N.
 A does not necessarily contain the top K, but has many docs from among the top K.
 Return the top K docs in A.
 Think of A as pruning non-contenders.
 The same approach can also be used for other (non-cosine) scoring functions.

73

Sec. 7.1.1

CS3245 – Information Retrieval

Heuristics
 Index elimination: high-idf query terms only; docs containing many query terms
 Champion lists: generalized to high-low lists and a tiered index
 Impact-ordered postings: early termination; idf-ordered terms
 Cluster pruning

74

Sec. 7.1.1

CS3245 – Information Retrieval

Net score
 Consider a simple total score combining cosine relevance and authority:

  net-score(q, d) = g(d) + cosine(q, d)

 Can use some other linear combination than an equal weighting.
 Indeed, any function of the two "signals" of user happiness.
 Now we seek the top K docs by net score.

75

Sec. 7.1.4

CS3245 – Information Retrieval

Parametric Indices
 Fields: "Year = 1601" is an example of a field. A field or parametric index has postings for each field value; sometimes build range (B-tree) trees (e.g., for dates). A field query is typically treated as a conjunction (the doc must be authored by Shakespeare).
 Zones: a zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, … Build inverted indexes on zones as well to permit querying.

76

Sec. 6.1

CS3245 – Information Retrieval

Week 9: IR Evaluation
 How do we know if our results are any good?
 Benchmarks: precision and recall, composite measures, test collections
 A/B Testing

77

Ch. 8

CS3245 – Information Retrieval

A precision-recall curve

[Figure: precision (y-axis) plotted against recall (x-axis), both ranging from 0.0 to 1.0]

78

Sec. 8.4

CS3245 – Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec. 8.5

 Kappa measure:
  an agreement measure among judges
  designed for categorical judgments
  corrects for chance agreement
 Kappa (κ) = [P(A) − P(E)] / [1 − P(E)]
  P(A) – proportion of the time the judges agree
  P(E) – what agreement would be by chance
 Gives 0 for chance agreement, 1 for total agreement.
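A tiny worked example of the kappa computation (made-up judgment counts for two judges and binary relevance categories):

def kappa(n_agree_rel, n_agree_nonrel, n_disagree, total):
    p_a = (n_agree_rel + n_agree_nonrel) / total
    p_rel = (2 * n_agree_rel + n_disagree) / (2 * total)   # pooled marginal
    p_e = p_rel ** 2 + (1 - p_rel) ** 2                    # chance agreement
    return (p_a - p_e) / (1 - p_e)

# 300 relevant-relevant and 70 nonrelevant-nonrelevant agreements,
# 30 disagreements, 400 documents judged
print(round(kappa(300, 70, 30, 400), 3))   # about 0.776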

CS3245 – Information Retrieval

A/B testing
 Purpose: test a single innovation.
 Prerequisite: you have a large system with many users.
 Have most users use the old system; divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
 Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result.
 Now we can directly see if the innovation does improve user happiness.
 Probably the evaluation methodology that large search engines trust most.

80

Sec. 8.6.3

CS3245 – Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR

Chapter 9: Query Refinement
1. Relevance Feedback (document level): explicit RF – Rocchio (1971); when does it work?; variants: implicit and blind
2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation

Chapter 10.3: XML IR
 Basic XML concepts
 Challenges in XML IR
 Vector space model for XML IR
 Evaluation of XML IR

81

CS3245 – Information Retrieval  Sec. 9.2.2

82

Rocchio (1971)

In practice:

  q_m = α q_0 + β (1/|Dr|) Σ_{d_j ∈ Dr} d_j − γ (1/|Dnr|) Σ_{d_j ∈ Dnr} d_j

 Dr = set of known relevant doc vectors
 Dnr = set of known irrelevant doc vectors
 (Different from Cr and Cnr, as we only get judgments from a few documents)
 α, β, γ = weights (hand-chosen or set empirically)
 Popularized in the SMART system (Salton)
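A sketch of the Rocchio update above with NumPy (illustrative weights and made-up tf-idf vectors, not the SMART implementation):

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    q = alpha * q0
    if rel_docs:
        q = q + beta * np.mean(rel_docs, axis=0)
    if nonrel_docs:
        q = q - gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q, 0.0)      # negative term weights are usually clipped to 0

q0 = np.array([1.0, 0.0, 0.5])
rel = [np.array([0.8, 0.2, 0.6]), np.array([0.9, 0.1, 0.4])]
nonrel = [np.array([0.0, 0.9, 0.1])]
print(rocchio(q0, rel, nonrel))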

CS3245 – Information Retrieval

Query Expansion
 In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents.
 In query expansion, users give additional input (good/bad search term) on words or phrases.

Sec. 9.2.2

83

CS3245 – Information Retrieval

Co-occurrence Thesaurus
 Simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix.
 w_ij = (normalized) weight for (t_i, d_j).
 For each t_i, pick terms with high values in C.

In NLTK! Did you forget?

84

Sec. 9.2.3
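A quick sketch of the C = A·Aᵀ idea with a tiny made-up term-document matrix:

import numpy as np

terms = ["ship", "boat", "ocean", "vote"]
A = np.array([[1.0, 0.8, 0.0, 0.0],    # rows: terms, columns: documents
              [0.9, 0.7, 0.1, 0.0],
              [0.6, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.9, 1.0]])
A = A / np.linalg.norm(A, axis=1, keepdims=True)   # normalize term weights
C = A @ A.T                                        # term-term similarities

for i, t in enumerate(terms):
    j = max((k for k in range(len(terms)) if k != i), key=lambda k: C[i, k])
    print(t, "->", terms[j])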

CS3245 – Information Retrieval

Main idea: lexicalized subtrees
 Aim: have each dimension of the vector space encode a word together with its position within the XML tree.
 How? Map XML documents to lexicalized subtrees ("with words").

[Figure: an XML tree Book → (Title: "Microsoft", Author: "Bill Gates") is decomposed into lexicalized subtrees such as Author–Bill, Author–Gates, Title–Microsoft, Book–Author–"Bill Gates", Book–Title–Microsoft, …]

85

CS3245 – Information Retrieval

Context resemblance
 A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:

  CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|) if c_q matches c_d, and 0 otherwise

 |c_q| and |c_d| are the number of nodes in the query path and document path, respectively.
 c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes.

86


Page 47: Search Advertising, Duplicate Detection and Revision (zhaojin/cs3245_2019/w13.pdf)

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms: Definitely language specific.

In English we worry about: tokenization, stop word removal, normalization (e.g. case folding), lemmatization and stemming.

50

CS3245 ndash Information Retrieval

Week 4: The dictionary and tolerant retrieval

Data structures for the dictionary: Hash, Trees

Learning to be tolerant:
1. Wildcards: General, Trees, Permuterm, Ngrams redux
2. Spelling correction: Edit distance, Ngrams re-redux
3. Phonetic – Soundex

51

CS3245 ndash Information Retrieval

Hash Table: Each vocabulary term is hashed to an integer.

Pros: Lookup is faster than for a tree: O(1)

Cons: No easy way to find minor variants (judgment/judgement). No prefix search. If the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term, e.g. co*tion?

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive.

The solution: transform wildcard queries so that the * always occurs at the end.

This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering.

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction: Isolated word correction: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap

Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time and suggest the one with the most hits.

54

Sec 332
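For reference, a standard dynamic-programming implementation of Levenshtein distance (unit costs; a generic sketch, not the assignment's code):

    def edit_distance(s, t):
        # Levenshtein distance with insert/delete/substitute all costing 1.
        prev = list(range(len(t) + 1))
        for i in range(1, len(s) + 1):
            cur = [i] + [0] * len(t)
            for j in range(1, len(t) + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                cur[j] = min(prev[j] + 1,         # delete from s
                             cur[j - 1] + 1,      # insert into s
                             prev[j - 1] + cost)  # substitute (or copy)
            prev = cur
        return prev[len(t)]

    print(edit_distance("judgement", "judgment"))   # -> 1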

CS3245 ndash Information Retrieval

Week 5: Index construction

Sort-based indexing: Blocked Sort-Based Indexing. Merge sort is effective for disk-based sorting (avoid seeks).

Single-Pass In-Memory Indexing: No global dictionary – generate a separate dictionary for each block. Don't sort postings – accumulate postings as they occur.

Distributed indexing using MapReduce. Dynamic indexing: multiple indices, logarithmic merge.

55

CS3245 ndash Information Retrieval

BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)

8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes. These are generated as we parse docs.

Must now sort 100M 8-byte records by termID. Define a block as ~10M such records; can easily fit a couple into memory. Will have 10 such blocks for our collection.

Basic idea of the algorithm: accumulate postings for each block, sort, write to disk. Then merge the blocks into one long sorted order.

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1: Generate separate dictionaries for each block – no need to maintain a term-termID mapping across blocks.

Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57
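A toy sketch of one SPIMI block (assuming a token stream of (term, docID) pairs with docIDs arriving in increasing order; real blocks are bounded by memory and written to disk):

    from collections import defaultdict

    def spimi_invert(token_stream):
        # Block-local dictionary: postings lists grow as tokens stream in; no
        # global term-termID mapping and no sorting of postings is needed.
        index = defaultdict(list)
        for term, doc_id in token_stream:
            postings = index[term]                  # term is added on first sight
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)             # docIDs arrive in order: just append
        # terms are sorted only once, when the block is "written out"
        return {term: index[term] for term in sorted(index)}

    block = spimi_invert([("brutus", 1), ("caesar", 1), ("caesar", 2), ("brutus", 4)])
    print(block)   # {'brutus': [1, 4], 'caesar': [1, 2]}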

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

[Figure: MapReduce data flow — segment files are assigned by the master to parsers in the map phase; parser output is partitioned by term range (a-b, c-d, …, y-z) and handed to inverters in the reduce phase, which write the postings.]

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both, merge results. Deletions: keep an invalidation bit-vector for deleted docs and filter the docs output on a search result by this bit-vector. Periodically re-index into one main index. Assuming T total postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively we can apply logarithmic merge, in which each posting is merged up to log(T/n) times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge. Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk. If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1. Either write the merged Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2 … etc.

Sec 45

60Information Retrieval

(Loop for log levels)
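A toy sketch of the scheme (indexes are just sorted lists of postings here; the names Z0 and I_i follow the slide, everything else is illustrative):

    def log_merge_add(new_postings, z0, disk, n):
        # z0 is the in-memory index; disk[i] plays the role of I_i (size ~ n * 2**i) or None.
        z0 = sorted(z0 + new_postings)
        if len(z0) <= n:
            return z0, disk                     # still fits in memory
        z, level = z0, 0
        while level < len(disk) and disk[level] is not None:
            z = sorted(disk[level] + z)         # merge with I_level, forming Z_(level+1)
            disk[level] = None
            level += 1
        if level == len(disk):
            disk.append(None)
        disk[level] = z                         # write out as I_level
        return [], disk

    z0, disk = [], []
    for doc in range(1, 10):
        z0, disk = log_merge_add([doc], z0, disk, n=2)
    print(z0, disk)   # [] [[7, 8, 9], [1, 2, 3, 4, 5, 6]]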

CS3245 ndash Information Retrieval

Week 6: Index Compression

Collection and vocabulary statistics: Heaps' and Zipf's laws

Compression to make the index smaller and faster:
Dictionary compression for Boolean indexes: dictionary string, blocks, front coding
Postings compression: gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking. Store pointers to every kth term string. Example below: k=4. Need to store term lengths (1 extra byte).

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix – store the differences only. Used for the last k-1 terms in a block of k.

8automata8automate9automatic10automation → 8automat*a1◊e2◊ic3◊ion

("automat" is encoded once; each following entry stores only the extra length beyond "automat" and the differing suffix.)

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry. We store the list of docs containing a term in increasing order of docID: computer: 33, 47, 154, 159, 202 …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43 …

Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example

docIDs:   824                  829        215406
gaps:                          5          214577
VB code:  00000110 10111000    10000101   00001101 00001100 10110001

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable.

For a small gap (5), VB uses a whole byte.

Sec 53

512+256+32+16+8 = 824
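A small Python sketch of VB encoding/decoding (7 payload bits per byte, continuation bit set on the last byte of each gap); it reproduces the byte patterns in the example above:

    def vb_encode_number(n):
        parts = []
        while True:
            parts.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        parts[-1] += 128                        # set the high bit of the final byte
        return bytes(parts)

    def vb_encode(gaps):
        return b"".join(vb_encode_number(g) for g in gaps)

    def vb_decode(stream):
        gaps, n = [], 0
        for b in stream:
            if b < 128:
                n = 128 * n + b                 # continuation byte
            else:
                gaps.append(128 * n + (b - 128))
                n = 0
        return gaps

    code = vb_encode([824, 5, 214577])
    print(code.hex())        # '06b8850d0cb1' — the bit patterns shown above
    print(vb_decode(code))   # [824, 5, 214577]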

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.

The weight increases with the number of occurrences within a document, and increases with the rarity of the term in the collection.
67

Sec 622
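A one-function sketch of this weighting (log base 10, as in the formula above; purely illustrative):

    import math

    def tfidf_weight(tf, df, N):
        # w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t); zero if the term is absent
        if tf == 0 or df == 0:
            return 0.0
        return (1 + math.log10(tf)) * math.log10(N / df)

    print(tfidf_weight(tf=3, df=100, N=1_000_000))   # ~ 5.91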

CS3245 ndash Information Retrieval

Queries as vectors. Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents". Key idea 2: Rank documents according to their proximity to the query in this space.

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation: Want to rank more relevant documents higher than less relevant documents.

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): cos(q, d) = q · d = Σ_i q_i d_i, for length-normalized q and d.

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems. Making the Vector Space Model more efficient to compute: approximating the actual correct results, skipping unnecessary documents. In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.

Think of A as pruning non-contenders.

The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711

CS3245 ndash Information Retrieval

Heuristics:
- Index elimination: high-idf query terms only; docs containing many query terms
- Champion lists: generalized to high-low lists and tiered indexes
- Impact-ordered postings: early termination, idf-ordered terms
- Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score: Consider a simple total score combining cosine relevance and authority:

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting; indeed, any function of the two "signals" of user happiness.

Now we seek the top K docs by net score.

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: "Year = 1601" is an example of a field. A field or parametric index has postings for each field value. Sometimes build range trees (e.g. B-trees for dates). A field query is typically treated as a conjunction (doc must be authored by Shakespeare).

Zones: A zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References … Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9: IR Evaluation. How do we know if our results are any good? Benchmarks; Precision and Recall; Composite measures; Test collections; A/B Testing.

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve — precision (y-axis, 0.0–1.0) plotted against recall (x-axis, 0.0–1.0).]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: Agreement measure among judges. Designed for categorical judgments; corrects for chance agreement.

Kappa (K) = [P(A) - P(E)] / [1 - P(E)], where P(A) is the proportion of the time the judges agree and P(E) is what the agreement would be by chance.

Gives 0 for chance agreement, 1 for total agreement.
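A quick worked example (the counts are made up for illustration, using pooled marginals for P(E)):

    def kappa(both_rel, both_nonrel, only_judge1_rel, only_judge2_rel):
        n = both_rel + both_nonrel + only_judge1_rel + only_judge2_rel
        p_a = (both_rel + both_nonrel) / n
        # pooled probability of a "relevant" judgment across both judges
        p_rel = (2 * both_rel + only_judge1_rel + only_judge2_rel) / (2 * n)
        p_e = p_rel ** 2 + (1 - p_rel) ** 2
        return (p_a - p_e) / (1 - p_e)

    print(kappa(20, 60, 10, 10))   # P(A)=0.80, P(E)=0.58 -> kappa ~ 0.52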

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness.

Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR

Chapter 10 (XML IR): Basic XML concepts; Challenges in XML IR; Vector space model for XML IR; Evaluation of XML IR

Chapter 9 (Query Refinement):
1. Relevance Feedback (document level): Explicit RF – Rocchio (1971); When does it work?; Variants: Implicit and Blind
2. Query Expansion (term level): Manual thesaurus; Automatic thesaurus generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. These are different from Cr and Cnr, as we only get judgments from a few documents. α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton).
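The Rocchio update itself appeared as an image on the slide; the standard formula that these symbols refer to is (negative components are usually clipped to 0):

    q_m = α·q_0 + β·(1/|Dr|)·Σ_{dj ∈ Dr} dj − γ·(1/|Dnr|)·Σ_{dj ∈ Dnr} dj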

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents.

In query expansion, users give additional input (good/bad search term) on words or phrases.

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: The simplest way to compute one is based on the term-term similarities in C = AA^T, where A is the M x N term-document matrix with w_ij = (normalized) weight for (t_i, d_j).

For each t_i, pick the terms with high values in C.

In NLTK! Did you forget?

84

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees. Aim to have each dimension of the vector space encode a word together with its position within the XML tree.

How? Map XML documents to lexicalized subtrees.

85

[Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is decomposed "with words" into lexicalized subtrees such as Microsoft, Bill Gates, Title/Microsoft, Author/Bill, Author/Gates, Author/"Bill Gates", Book/Title/Microsoft and Book/Author/"Bill Gates".]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86
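The CR function itself was shown as an image; the standard definition that the surrounding text describes is:

    CR(cq, cd) = (1 + |cq|) / (1 + |cd|)  if cq matches cd,  and 0 otherwise

For example, CR(author/"Bill", book/author/firstname/"Bill") = (1 + 2) / (1 + 4) = 0.6, the value used in the SimNoMerge example below.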

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>, e.g. author/"Bill"

Inverted index (dictionary → postings):
<c1, t> (e.g. author/"Bill") → <d1, 0.5>, <d4, 0.1>, <d9, 0.2>
<c2, t> (e.g. title/"Bill") → <d2, 0.25>, <d3, 0.1>, <d12, 0.9>
<c3, t> (e.g. book/author/firstname/"Bill") → <d3, 0.7>, <d6, 0.8>, <d9, 0.5>

This example is slightly different from the book. All weights have been normalized.

Context resemblances: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60. It is OK to ignore the query vs <c2, t> since CR = 0.0.

If the query term weight wq = 1.0, then
sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each product is Context Resemblance × Query Term Weight × Document Term Weight).

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR

Chapter 11:
1. Probabilistic Approach to Retrieval / Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): Traditionally used with the PRP.

Assumptions:
"Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g. document d is represented by the vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: We can rank documents using the log odds ratios c_t for the terms in the query.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 - u_t).

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q ∩ d} c_t. Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in the VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model. If the query is long, we might also use similar weighting for query terms:

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case you build an information retrieval scheme in the exact same way. The difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q).

P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4). P(q|d) is the probability of q given d.

So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12: Web Search

Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
- Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
- Duplicates: need to integrate duplicate detection
- Scale: need to use a distributed approach
- Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a.
1. Start with any distribution, say x = (1, 0, …, 0)
2. After one step, we're at xA
3. After two steps, at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable.
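A minimal numpy sketch of this power iteration (the 2-state transition matrix is illustrative, not from the lecture; A is assumed to already include teleporting, so its rows sum to 1):

    import numpy as np

    def pagerank_power_iteration(A, tol=1e-10):
        # multiply x by increasing powers of A until the product stops changing
        x = np.zeros(A.shape[0])
        x[0] = 1.0                      # any starting distribution works
        while True:
            x_next = x @ A
            if np.abs(x_next - x).sum() < tol:
                return x_next           # the steady state a
            x = x_next

    A = np.array([[0.1, 0.9],
                  [0.5, 0.5]])
    print(pagerank_power_iteration(A))  # ~ [0.357 0.643]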

CS3245 ndash Information Retrieval

100

Pagerank summary. Pre-processing: Given the graph of links, build the matrix A; from it, compute the steady-state vector a. The pagerank a_i is a scaled number between 0 and 1.

Query processing: Retrieve pages meeting the query; rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103. Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

Page 49: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement.

Kappa (κ) = [P(A) - P(E)] / [1 - P(E)], where P(A) = proportion of the time the judges agree and P(E) = what agreement would be by chance

Gives 0 for chance agreement, 1 for total agreement
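A tiny sketch, assuming two judges give categorical labels; the pooled-marginals estimate of P(E) follows the formula above, and the sample judgments are made up:

def kappa(judge_a, judge_b):
    # judge_a, judge_b: parallel lists of categorical judgments (e.g. "R" / "N")
    n = len(judge_a)
    p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # chance agreement P(E) from the judges' pooled marginal label distribution
    labels = set(judge_a) | set(judge_b)
    p_chance = sum(((judge_a.count(l) + judge_b.count(l)) / (2 * n)) ** 2 for l in labels)
    return (p_agree - p_chance) / (1 - p_chance)

print(kappa(["R", "R", "N", "R", "N"], ["R", "N", "N", "R", "N"]))  # 0.6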

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness. Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3 XML IR: basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR

Chapter 9 Query Refinement:
1. Relevance Feedback (document level): explicit RF – Rocchio (1971); when does it work?; variants: implicit and blind
2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents. α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton)
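In practice the modified query is qm = α·q0 + β·(1/|Dr|)·Σ(d in Dr) d − γ·(1/|Dnr|)·Σ(d in Dnr) d, with negative components usually clipped to zero. A minimal sketch (the α, β, γ defaults below are common choices, not values from the slide; vectors are plain dicts):

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # q0 and each doc are dicts of term -> weight; returns the modified query vector
    qm = {t: alpha * w for t, w in q0.items()}
    for docs, coef in ((rel_docs, beta), (nonrel_docs, -gamma)):
        for d in docs:
            for t, w in d.items():
                qm[t] = qm.get(t, 0.0) + coef * w / len(docs)
    # negative components are usually clipped to zero
    return {t: w for t, w in qm.items() if w > 0}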

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents

In query expansion, users give additional input (good / bad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus. The simplest way to compute one is based on term-term similarities in C = AA^T, where A is the M × N term-document matrix (rows ti are terms, columns dj are documents) and wij = (normalized) weight for (ti, dj).

For each ti, pick terms with high values in C

Sec 923

In NLTK Did you forget

84
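A minimal numpy sketch (A is assumed to already hold the normalized weights wij, rows = terms, columns = documents; the function and parameter names are mine):

import numpy as np

def cooccurrence_neighbours(A, terms, k=3):
    # C = A A^T holds term-term similarities; for each term pick the k
    # other terms with the highest values in its row of C
    C = (A @ A.T).astype(float)
    np.fill_diagonal(C, -np.inf)               # ignore a term's similarity to itself
    top = np.argsort(-C, axis=1)[:, :k]
    return {terms[i]: [terms[j] for j in top[i]] for i in range(len(terms))}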

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees. Aim: have each dimension of the vector space encode a word together with its position within the XML tree.

How? Map XML documents to lexicalized subtrees.

85

[Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is mapped to its lexicalized subtrees ("with words"), e.g. Author/Bill, Author/Gates, Title/Microsoft, Author/Bill Gates, Book/Title/Microsoft, ...]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise

|cq| and |cd| are the number of nodes in the query path and document path, respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86
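A small sketch of CR with paths given as node lists; the matching test below is a simple subsequence check, which is one way to capture "insert additional nodes":

def cr(cq, cd):
    # CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq can be turned into cd by
    # inserting nodes (here: cq is an in-order subsequence of cd), else 0
    it = iter(cd)
    matches = all(node in it for node in cq)
    return (1 + len(cq)) / (1 + len(cd)) if matches else 0.0

print(cr(["author", "Bill"], ["book", "author", "firstname", "Bill"]))  # 0.6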

CS3245 ndash Information Retrieval

SimNoMerge example
87
This example is slightly different from the book; all weights have been normalized.

Query: <c1, t> with query term weight wq = 1.0, e.g. author/"Bill"

Inverted index (dictionary of contexts; postings of <doc, document term weight>):
<c1, t> (e.g. author/"Bill"): <d1, 0.5>, <d4, 0.1>, <d9, 0.2>
<c2, t> (e.g. title/"Bill"): <d2, 0.25>, <d3, 0.1>, <d12, 0.9>
<c3, t> (e.g. book/author/firstname/"Bill"): <d3, 0.7>, <d6, 0.8>, <d9, 0.5>

Context resemblance: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60
OK to ignore query vs. <c2, t> since CR = 0.0; evaluate query vs. <c1, t> and query vs. <c3, t>

If wq = 1.0 then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11: 1. Probabilistic Approach to Retrieval / Basic Probability Theory; 2. Probability Ranking Principle; 3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions:
Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g. document d represented by vector x = (x1, ..., xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice – a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: we can rank documents using the log odds ratios ct for the terms in the query.

The odds ratio is the ratio of two odds:

(i) the odds of the term appearing if relevant, pt / (1 - pt), and

(ii) the odds of the term appearing if non-relevant, ut / (1 - ut):

ct = log [ pt (1 - ut) / ( ut (1 - pt) ) ]

ct = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and ct is positive if it is more likely to appear in relevant documents.

ct functions as a term weight, so that RSVd = Σ ct, summed over the query terms present in the document.

Operationally, we sum ct quantities in accumulators for query terms appearing in documents, just like in the VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
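A hedged sketch of the ct weight with the usual add-0.5 smoothed estimates (the estimator form is the standard RSJ one; the variable names and the no-relevance-information default are mine):

import math

def c_t(df_t, N, s_t=0, S=0):
    # s_t / S: relevant docs containing t / total known relevant docs; with no
    # relevance information (s_t = S = 0) this behaves like an idf-style weight
    p_t = (s_t + 0.5) / (S + 1.0)
    u_t = (df_t - s_t + 0.5) / (N - S + 1.0)
    return math.log(p_t * (1 - u_t) / (u_t * (1 - p_t)))

def rsv(query_terms, doc_terms, df, N):
    # RSV_d = sum of c_t over query terms that appear in the document
    return sum(c_t(df[t], N) for t in query_terms if t in doc_terms and t in df)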

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tf(t,q): term frequency in the query q. k3: tuning parameter controlling term frequency scaling of the query. There is no length normalization of queries (because retrieval is being done with respect to a single fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec 1143

92
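A sketch of the full BM25 score with the query-term factor included (this follows the standard Okapi form; the parameter defaults sit in the ranges quoted above, and Lave is the average document length):

import math

def bm25(query_tf, doc_tf, doc_len, L_ave, df, N, k1=1.2, b=0.75, k3=1.5):
    score = 0.0
    for t, tf_tq in query_tf.items():
        if t not in doc_tf or t not in df:
            continue
        tf_td = doc_tf[t]
        idf = math.log10(N / df[t])
        doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * doc_len / L_ave) + tf_td)
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)   # no length normalization of queries
        score += idf * doc_part * query_part
    return score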

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d.

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
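A sketch of the mixture-model query likelihood these two slides summarize, P(q|d) ≈ Π over query terms of [λ·P(t|Md) + (1 − λ)·P(t|Mc)]; the λ default and the log-space computation are my choices, and the document and collection are assumed non-empty:

import math
from collections import Counter

def query_log_likelihood(query_tokens, doc_tokens, collection_tokens, lam=0.5):
    # log P(q|d) with mixing of the document model Md and collection model Mc
    d, c = Counter(doc_tokens), Counter(collection_tokens)
    len_d, len_c = len(doc_tokens), len(collection_tokens)
    log_p = 0.0
    for t in query_tokens:
        p = lam * (d[t] / len_d) + (1 - lam) * (c[t] / len_c)
        log_p += math.log(p) if p > 0 else float("-inf")
    return log_p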

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202
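Python's standard library can parse and apply such rules; a minimal sketch (the site URL and user-agent string below are hypothetical):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # fetch once per site and cache it
rp.read()
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/index.html"))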

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0))
2. After one step, we're at xA
3. After two steps at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
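A small power-iteration sketch (A is assumed to be the row-stochastic transition matrix with teleporting already folded in; the tolerance and iteration cap are arbitrary choices):

import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    # A: N x N row-stochastic transition matrix (teleporting already included)
    N = A.shape[0]
    x = np.zeros(N)
    x[0] = 1.0                                  # any starting distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A                          # one step of the chain: x -> xA
        if np.abs(x_next - x).sum() < tol:      # "the product looks stable"
            return x_next
        x = x_next
    return x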

CS3245 ndash Information Retrieval

100

Pagerank summary. Pre-processing: given the graph of links, build the matrix A; from it, compute a. The pagerank ai is a scaled number between 0 and 1.

Query processing: retrieve pages meeting the query; rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCH ADVERTISING
  • 1st Generation of Search Ads: Goto (1996)
  • 1st Generation of search ads: Goto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATE DETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index Compression: Dictionary-as-a-String and Blocking
  • Front coding
  • Postings Compression: Postings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 50: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 51: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 54: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors
Different from Cr and Cnr, as we only get judgments from a few documents
α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
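The Rocchio update itself is a formula image on the slide; in vector form it is q_m = α·q0 + β·centroid(Dr) − γ·centroid(Dnr). A small numpy sketch (the α, β, γ values and the vectors below are illustrative):

import numpy as np

def rocchio(q0, D_r, D_nr, alpha=1.0, beta=0.75, gamma=0.15):
    # q0: original query vector; D_r / D_nr: rows are known (non-)relevant doc vectors
    q = alpha * q0
    if len(D_r):
        q = q + beta * D_r.mean(axis=0)       # move toward the relevant centroid
    if len(D_nr):
        q = q - gamma * D_nr.mean(axis=0)     # move away from the non-relevant centroid
    return np.maximum(q, 0.0)                 # negative term weights are typically clipped to 0

q0 = np.array([1.0, 0.0, 0.5])
D_r = np.array([[0.8, 0.2, 0.0], [0.6, 0.4, 0.1]])
D_nr = np.array([[0.0, 0.9, 0.0]])
print(rocchio(q0, D_r, D_nr))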

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents.

In query expansion, users give additional input (good / bad search term) on words or phrases.

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the M × N term-document matrix and w_ij = (normalized) weight for (t_i, d_j).
For each t_i, pick the terms with high values in C.

Sec 9.2.3

In NLTK! Did you forget?

84
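A minimal numpy sketch of the construction above: build C = A·Aᵀ from a toy M×N term-document weight matrix (all values made up) and, for each term, read off the other terms with the highest entries in its row of C:

import numpy as np

terms = ["car", "automobile", "engine", "flower"]
A = np.array([                 # M x N term-document weight matrix (toy, normalized weights)
    [0.7, 0.0, 0.6, 0.2],
    [0.6, 0.1, 0.5, 0.0],
    [0.3, 0.0, 0.7, 0.1],
    [0.0, 0.9, 0.0, 0.0],
])
C = A @ A.T                    # term-term similarity matrix

def related(term, top=2):
    i = terms.index(term)
    order = np.argsort(-C[i])
    return [terms[j] for j in order if j != i][:top]

print(related("car"))          # ['automobile', 'engine']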

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
Aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.

85

[Figure: an example XML document — a Book with Title "Microsoft" and Author "Bill Gates" — decomposed into lexicalized subtrees, e.g. the bare words (Microsoft, Bill, Gates), Title/Microsoft, Author/Bill, Author/Gates, and Book/Title/Microsoft; dimensions are subtrees "with words"]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the context resemblance function CR (formula on the slide; see the sketch after this slide).
|cq| and |cd| are the number of nodes in the query path and document path, respectively.
cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
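The CR formula is an image on the slide; the context resemblance function in IIR is CR(cq, cd) = (1 + |cq|) / (1 + |cd|) when cq matches cd, and 0 otherwise. A small sketch, representing paths as lists of node labels and reading "cq can be transformed into cd by inserting nodes" as a subsequence test (an interpretation, not the slide's own code):

def matches(cq, cd):
    # True iff cq is a subsequence of cd, i.e. cd can be obtained by inserting extra nodes
    it = iter(cd)
    return all(node in it for node in cq)

def context_resemblance(cq, cd):
    return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0

print(context_resemblance(["author"], ["author"]))                       # 1.0
print(context_resemblance(["author"], ["book", "author", "firstname"]))  # 0.5
print(context_resemblance(["author"], ["title"]))                        # 0.0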

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: the context-term pair <c1, t> (e.g. author/"Bill").
Inverted index (dictionary → postings), with contexts such as author/"Bill", title/"Bill", book/author/firstname/"Bill":
<c1, t> → <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> → <d2, 0.25> <d3, 0.1> <d12, 0.9>
<c3, t> → <d3, 0.7> <d6, 0.8> <d9, 0.5>
(This example is slightly different from the book.)
Context resemblance: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.6
It is OK to ignore Query vs. <c2, t>, since CR = 0.0; only Query vs. <c1, t> and Query vs. <c3, t> contribute.
Each contribution is Context Resemblance × Query Term Weight × Document Term Weight, so if wq = 1.0 then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(All weights have been normalized.)

CS3245 ndash Information Retrieval

INEX relevance assessments: The relevance-coverage combinations are quantized (table on the slide).
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as a sum of the quantized grades over A (formula on the slide).

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval / Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model; BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): traditionally used with the PRP.
Assumptions:
Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d is represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice – a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios for the terms in the query, c_t.
The odds ratio is the ratio of two odds:
(i) the odds of the term appearing if relevant, p_t / (1 − p_t), and
(ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t).
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that a document's retrieval status value is the sum of the c_t of the query terms it contains.
Operationally, we sum c_t quantities in accumulators for the query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
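A small sketch of the term weight and accumulator-style RSV described above; the p_t and u_t estimates below are toy values and all names are illustrative:

import math

def c_t(p_t, u_t):
    # log odds ratio: log[ (p_t/(1-p_t)) / (u_t/(1-u_t)) ]
    return math.log((p_t * (1 - u_t)) / (u_t * (1 - p_t)))

def rsv(query_terms, doc_terms, p, u):
    # sum c_t over query terms that occur in the document, just like VSM accumulators
    return sum(c_t(p[t], u[t]) for t in query_terms if t in doc_terms)

p = {"retrieval": 0.6, "model": 0.5}   # P(term occurs | relevant)
u = {"retrieval": 0.1, "model": 0.5}   # P(term occurs | non-relevant)
print(c_t(0.5, 0.5))                                             # 0.0: equal odds
print(rsv(["retrieval", "model"], {"retrieval", "model"}, p, u))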

CS3245 ndash Information Retrieval

Okapi BM25, A Nonbinary Model: If the query is long, we might also use similar weighting for query terms.
tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec 1143

92
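The BM25 formulas are images on the slides; as a hedged sketch, the standard scoring function (document part with k1 and b, plus the long-query part with k3 described above) looks like this, with toy inputs and illustrative names:

import math

def bm25(query_tf, doc_tf, doc_len, avg_doc_len, df, N, k1=1.2, b=0.75, k3=1.5):
    # query_tf / doc_tf: term -> tf; df: term -> document frequency; N: collection size
    score = 0.0
    for term, tf_tq in query_tf.items():
        tf_td = doc_tf.get(term, 0)
        if tf_td == 0 or term not in df:
            continue
        idf = math.log10(N / df[term])
        doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        qry_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)   # no length normalization of the query
        score += idf * doc_part * qry_part
    return score

print(bm25({"ir": 1, "model": 2}, {"ir": 3, "model": 1}, doc_len=12, avg_doc_len=10.0,
           df={"ir": 50, "model": 200}, N=1000))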

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great.
In either case, you build an information retrieval scheme in the exact same way.
Difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model.
Given a query q, rank documents based on P(d|q).
P(q) is the same for all documents, so ignore it.
P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model: The user has a document in mind and generates the query from this document.
The equation represents the probability that the document the user had in mind was, in fact, this one (a sketch follows this slide).

95

Sec 1222
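A sketch of the query-likelihood scoring this summary refers to, using a linear mixture of the document model Md and the collection model Mc, P(q|d) = Π_t [ λ·P(t|Md) + (1−λ)·P(t|Mc) ]; λ and the toy texts below are illustrative:

def query_likelihood(query_terms, doc_tokens, collection_tokens, lam=0.5):
    p = 1.0
    for t in query_terms:
        p_doc = doc_tokens.count(t) / len(doc_tokens)                  # MLE of P(t | Md)
        p_col = collection_tokens.count(t) / len(collection_tokens)    # MLE of P(t | Mc)
        p *= lam * p_doc + (1 - lam) * p_col                           # smoothed mixture
    return p

doc = "information retrieval with language models".split()
collection = ("information retrieval with language models "
              "probabilistic models for retrieval and search").split()
print(query_likelihood(["retrieval", "models"], doc, collection))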

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
Duplicates: need to integrate duplicate detection
Scale: need to use a distributed approach
Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling (a small sketch follows this slide).

Information Retrieval 98

Sec 202
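For illustration only (the host, path, and agent name below are made up): Python's standard library can parse robots.txt and answer "may I fetch this path?", and caching then amounts to keeping one parsed object per host so each site's file is fetched once.

from urllib import robotparser

robots_cache = {}   # host -> RobotFileParser, so each site's robots.txt is fetched only once

def allowed(host, path, agent="MyCrawler"):
    if host not in robots_cache:
        rp = robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        rp.read()                                  # fetch and parse the robots.txt file
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(agent, f"http://{host}{path}")

# allowed("www.example.com", "/yoursite/temp/index.html")  -> True/False per the site's rules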

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1 0 … 0))
2. After one step, we're at xA
3. After two steps, at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable (sketch below).
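A numpy sketch of exactly this procedure on a toy 3-page transition matrix A (teleportation assumed to be already folded in, so every row sums to 1):

import numpy as np

A = np.array([                      # toy row-stochastic transition matrix
    [0.05, 0.90, 0.05],
    [0.45, 0.10, 0.45],
    [0.05, 0.90, 0.05],
])

x = np.array([1.0, 0.0, 0.0])       # start with any distribution
for _ in range(100):                # multiply by increasing powers of A ...
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-12):
        break                       # ... until the product looks stable: the steady state a
    x = x_next

print(x)                            # the PageRank vector a (entries sum to 1)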

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: Given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: Retrieve pages meeting the query; rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 55: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 56: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 57: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID: computer: 33, 47, 154, 159, 202 …
Consequence: it suffices to store gaps: 33, 14, 107, 5, 43 …
Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53
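As a tiny illustration (assuming the docIDs are already sorted):

def to_gaps(doc_ids):
    """Store the first docID as-is, then only the difference to the previous docID."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Rebuild docIDs by summing the gaps back up."""
    ids = [gaps[0]]
    for g in gaps[1:]:
        ids.append(ids[-1] + g)
    return ids

print(to_gaps([33, 47, 154, 159, 202]))   # [33, 14, 107, 5, 43]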

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:   824                  829        215406
gaps:                          5          214577
VB code:  00000110 10111000    10000101   00001101 00001100 10110001

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001
Key property: VB-encoded postings are uniquely prefix-decodable
For a small gap (5), VB uses a whole byte
(Check: 512+256+32+16+8 = 824)

65

Sec 53
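A minimal sketch of VB encoding and decoding (my own code; the printed bytes should match the example above – 7 payload bits per byte, continuation bit set only on the last byte of each gap):

def vb_encode_number(n):
    """Encode one gap: 7 bits per byte, high bit (128) set on the final byte."""
    bytes_out = []
    while True:
        bytes_out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_out[-1] += 128                     # mark the last byte of this number
    return bytes_out

def vb_decode(byte_stream):
    """Decode a concatenation of VB-encoded gaps back into the gap sequence."""
    numbers, n = [], 0
    for b in byte_stream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
code = [b for g in gaps for b in vb_encode_number(g)]
print([format(b, "08b") for b in code])      # 00000110 10111000 10000101 00001101 00001100 10110001
print(vb_decode(code))                       # [824, 5, 214577]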

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf × idf.
The tf part increases with the number of occurrences within a document; the idf part increases with the rarity of the term in the collection.

67

Sec 622
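A direct Python transcription of that weight (function and argument names are my own):

import math

def tf_idf_weight(tf_td, df_t, N):
    """w(t,d) = (1 + log10 tf) * log10(N / df); zero if the term is absent from the doc."""
    if tf_td == 0:
        return 0.0
    return (1 + math.log10(tf_td)) * math.log10(N / df_t)

# e.g. a term occurring twice in a doc and in 100 of 1,000,000 docs:
print(round(tf_idf_weight(2, 100, 1_000_000), 3))   # (1 + 0.301) * 4 = 5.204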

CS3245 ndash Information Retrieval

Queries as vectors
Key idea 1: Do the same for queries – represent them as vectors in the space; they are "mini-documents".
Key idea 2: Rank documents according to their proximity to the query in this space.

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_i q_i d_i,  for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results, skipping unnecessary documents
In actual data: dealing with zones and fields, query term proximity
Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
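The CosineScore pseudocode itself was a figure on the slide; below is a hedged Python rendering of the same term-at-a-time idea, with one accumulator per document and a heap for the top K (the data structures and toy inputs are my own assumptions):

import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_lengths, K=10):
    """query_weights: term -> w(t,q); postings: term -> list of (docID, w(t,d));
    doc_lengths: docID -> vector length used for normalization."""
    scores = defaultdict(float)                      # one accumulator per candidate doc
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_tq * w_td            # term-at-a-time accumulation
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]        # length-normalize
    return heapq.nlargest(K, scores.items(), key=lambda x: x[1])

postings = {"brutus": [(1, 1.2), (4, 0.5)], "caesar": [(1, 0.8), (2, 1.0)]}
print(cosine_scores({"brutus": 1.0, "caesar": 1.0}, postings, {1: 2.0, 2: 1.5, 4: 1.0}, K=2))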

CS3245 ndash Information Retrieval

Generic approach
Find a set A of contenders with K < |A| << N. A does not necessarily contain the top K, but it has many docs from among the top K. Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics
Index elimination: high-idf query terms only; docs containing many query terms
Champion lists: generalized to high-low lists and tiered indexes
Impact-ordered postings: early termination, idf-ordered terms
Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score
Consider a simple total score combining cosine relevance and authority:

net-score(q,d) = g(d) + cosine(q,d)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: "Year = 1601" is an example of a field. A field or parametric index has postings for each field value. Sometimes build range (B-tree) trees (e.g. for dates). A field query is typically treated as a conjunction (doc must be authored by shakespeare).
Zones: a zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References, … Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9: IR Evaluation
How do we know if our results are any good?
Benchmarks; Precision and Recall; composite measures; test collections; A/B testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: precision-recall curve; x-axis Recall (0.0–1.0), y-axis Precision (0.0–1.0)]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: an agreement measure among judges, designed for categorical judgments; corrects for chance agreement.

Kappa (K) = [P(A) − P(E)] / [1 − P(E)]

P(A) – proportion of the time the judges agree
P(E) – what agreement would be by chance
Gives 0 for chance agreement, 1 for total agreement.
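A small worked example in Python (the 2×2 judgment counts are made up):

def kappa(table):
    """table[i][j] = number of items judge 1 put in class i and judge 2 in class j."""
    total = sum(sum(row) for row in table)
    p_agree = sum(table[i][i] for i in range(len(table))) / total
    # chance agreement: both judges pick class i independently with their marginal rates
    p_chance = sum((sum(table[i]) / total) * (sum(row[i] for row in table) / total)
                   for i in range(len(table)))
    return (p_agree - p_chance) / (1 - p_chance)

#                judge 2: rel  non-rel
judgments = [[70, 10],       # judge 1: relevant
             [5, 15]]        # judge 1: non-relevant
print(round(kappa(judgments), 3))   # P(A)=0.85, P(E)=0.65 -> kappa = 0.571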

CS3245 ndash Information Retrieval

A/B testing
Purpose: test a single innovation. Prerequisite: you have a large system with many users.
Have most users use the old system; divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation.
Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on the first result.
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.

80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR

Chapter 10.3: XML IR
Basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR

Chapter 9: Query Refinement
1. Relevance Feedback (document level)
   Explicit RF – Rocchio (1971); when does it work?
   Variants: implicit and blind
2. Query Expansion (term level)
   Manual thesaurus
   Automatic thesaurus generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j

D_r = set of known relevant doc vectors; D_nr = set of known irrelevant doc vectors
(different from C_r and C_nr, as we only get judgments from a few documents)
α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
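A numpy sketch of that update (the α, β, γ defaults and the toy vectors are assumptions; clipping negative weights to zero is a common practice, not part of the formula itself):

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr), negatives clipped to 0."""
    q_m = alpha * q0
    if len(rel_docs):
        q_m = q_m + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q_m = q_m - gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q_m, 0.0)

q0 = np.array([1.0, 0.0, 0.0])
Dr = np.array([[0.5, 0.5, 0.0], [0.7, 0.3, 0.0]])
Dnr = np.array([[0.0, 0.0, 1.0]])
print(rocchio(q0, Dr, Dnr))   # boosts terms from relevant docs, damps terms from irrelevant ones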

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents.
In query expansion, users give additional input (good/bad search term) on words or phrases.

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the M × N term-document matrix and w_ij = (normalized) weight for (t_i, d_j).
For each t_i, pick terms with high values in C.

Sec 923

In NLTK Did you forget

84
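A tiny numpy illustration of C = A·Aᵀ on a made-up 3-term × 4-document matrix:

import numpy as np

A = np.array([[1.0, 0.0, 1.0, 0.0],     # term 0
              [1.0, 0.0, 1.0, 1.0],     # term 1: co-occurs heavily with term 0
              [0.0, 1.0, 0.0, 0.0]])    # term 2: appears alone
A = A / np.linalg.norm(A, axis=1, keepdims=True)    # normalize each term row
C = A @ A.T                                         # term-term similarity matrix
print(np.round(C, 2))
# for each term t_i, its most similar other term is the largest off-diagonal entry in row i
print(C[0].argsort()[::-1][1])                      # -> 1: term 1 is term 0's nearest neighbour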

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
Aim: have each dimension of the vector space encode a word together with its position within the XML tree
How: map XML documents to lexicalized subtrees

85

[Figure: an XML Book document with Title "Microsoft" and Author "Bill Gates" is mapped to its lexicalized subtrees ("with words"), e.g. Microsoft, Bill, Gates, Author–Bill, Author–Gates, Title–Microsoft, Book–Title–Microsoft, …]

CS3245 ndash Information Retrieval

Context resemblance
A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:

CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|)  if c_q matches c_d,  0 otherwise

|c_q| and |c_d| are the number of nodes in the query path and document path, respectively.
c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes.

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>   e.g. author / "Bill"

Inverted index (dictionary → postings):
<c1, t>  e.g. author / "Bill"                    → <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t>  e.g. title / "Bill"                     → <d2, 0.25> <d3, 0.1> <d12, 0.9>
<c3, t>  e.g. book / author / firstname / "Bill" → <d3, 0.7> <d6, 0.8> <d9, 0.5>

Context resemblance to the query: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60
OK to ignore query vs. <c2, t> since CR = 0.0
All weights have been normalized.

If w_q = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each term is Context Resemblance × Query Term Weight × Document Term Weight)

This example is slightly different from the book.

CS3245 ndash Information Retrieval

INEX relevance assessments
The relevance-coverage combinations are quantized by a quantization function Q(rel, cov).
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of Q(rel(c), cov(c)) over the components c in A.

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11
1. Probabilistic Approach to Retrieval / Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)
Chapter 12
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): traditionally used with the PRP

Assumptions:
"Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g. document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – naïve assumption)

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV
We can rank documents using the log odds ratios for the terms in the query, c_t:

c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 − u_t).
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q, x_t = 1} c_t.
Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:

RSV_d = Σ_{t ∈ q} log(N/df_t) · [(k1 + 1) tf_td / (k1((1 − b) + b(L_d/L_ave)) + tf_td)] · [(k3 + 1) tf_tq / (k3 + tf_tq)]

tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values of k1 and k3 between 1.2 and 2, and b = 0.75.

Sec 1143

92
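A hedged Python sketch of that scoring formula (the parameter defaults and toy collection statistics are assumptions):

import math

def bm25_term(tf_td, tf_tq, df_t, N, L_d, L_ave, k1=1.5, b=0.75, k3=1.5):
    """Contribution of one term t to RSV(q, d) under Okapi BM25 with long-query weighting."""
    idf = math.log10(N / df_t)
    doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * (L_d / L_ave)) + tf_td)
    query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
    return idf * doc_part * query_part

def rsv(query_tf, doc_tf, df, N, L_d, L_ave):
    return sum(bm25_term(doc_tf.get(t, 0), tf_tq, df[t], N, L_d, L_ave)
               for t, tf_tq in query_tf.items() if t in df)

query_tf = {"caesar": 1, "rome": 2}
doc_tf = {"caesar": 3, "rome": 1, "senate": 5}
print(round(rsv(query_tf, doc_tf, {"caesar": 100, "rome": 500}, N=10_000, L_d=120, L_ave=100), 3))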

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case, you build an information retrieval scheme in the exact same way.
Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR
Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q):

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, so ignore it.
P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model: the user has a document in mind and generates the query from this document.

P(d|q) ∝ P(d) Π_{t ∈ q} ( λ P(t|M_d) + (1 − λ) P(t|M_c) )

The equation represents the probability that the document the user had in mind was in fact this one.

95

Sec 1222
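A small sketch of query-likelihood scoring with this mixture (λ and the toy counts are assumptions):

def mixture_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """P(q|d) under the mixture model: product over query terms of
    lam * P(t|Md) + (1 - lam) * P(t|Mc)."""
    p = 1.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len          # document language model M_d
        p_coll = coll_tf.get(t, 0) / coll_len       # collection language model M_c
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

coll_tf = {"revenue": 500, "down": 800, "quarterly": 300}
d1 = {"revenue": 2, "down": 1}
print(mixture_score(["revenue", "down"], d1, doc_len=50, coll_tf=coll_tf, coll_len=100_000))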

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling

Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
Duplicates: need to integrate duplicate detection
Scale: need to use a distributed approach
Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202
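Python's standard library already includes a parser for this format; a minimal sketch of the check-and-cache pattern (the host and user-agent are placeholders, and the commented call would do a real network fetch):

from urllib.robotparser import RobotFileParser

robots_cache = {}                                   # host -> parsed robots.txt, fetched once per site

def allowed(agent, url, host):
    if host not in robots_cache:
        rp = RobotFileParser(f"https://{host}/robots.txt")
        rp.read()                                   # fetch and parse the site's robots.txt
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(agent, url)

# e.g. allowed("MyCrawler", "https://example.com/temp/page.html", "example.com")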

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1 0 … 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a
Algorithm: multiply x by increasing powers of A until the product looks stable.
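A numpy rendering of exactly that loop (the 3×3 transition matrix, with teleportation already folded in, is made up for illustration):

import numpy as np

def pagerank(A, tol=1e-10):
    """Multiply x by increasing powers of A until the distribution stops changing."""
    x = np.zeros(A.shape[0])
    x[0] = 1.0                        # start with any distribution, say (1, 0, ..., 0)
    while True:
        x_next = x @ A                # one step of the random walk
        if np.abs(x_next - x).sum() < tol:
            return x_next             # steady state a: xA^k for large k
        x = x_next

# row-stochastic transition matrix (links + teleportation), toy values
A = np.array([[0.10, 0.80, 0.10],
              [0.45, 0.10, 0.45],
              [0.10, 0.80, 0.10]])
print(np.round(pagerank(A), 3))       # highest pagerank for the page the others mostly link to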

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query; rank them by their pagerank. Order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives
You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103  Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 58: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 59: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

Page 61: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 62: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 63: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry. We store the list of docs containing a term in increasing order of docID:

computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example

docIDs:   824                  829        215406
gaps:                          5          214577
VB code:  00000110 10111000    10000101   00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable

For a small gap (5), VB uses a whole byte

Sec 53

(The first entry encodes 824 itself: 00000110 10111000 → 6 × 128 + 56 = 512+256+32+16+8 = 824)
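A minimal Python sketch of gap encoding plus variable byte encoding that reproduces the byte stream in this example; the function names are mine, not from the slides:

def vb_encode_number(n):
    # 7 payload bits per byte; the final byte has its high bit set (stop bit).
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128
    return bytes(out)

def vb_encode_postings(doc_ids):
    # Gap-encode a sorted postings list, then VB-encode each gap.
    encoded, prev = bytearray(), 0
    for doc_id in doc_ids:
        encoded += vb_encode_number(doc_id - prev)
        prev = doc_id
    return bytes(encoded)

def vb_decode_postings(data):
    # Decode the byte stream back into docIDs (prefix-decodable, so no separators needed).
    doc_ids, n, prev = [], 0, 0
    for b in data:
        if b < 128:
            n = 128 * n + b
        else:
            n = 128 * n + (b - 128)
            prev += n
            doc_ids.append(prev)
            n = 0
    return doc_ids

encoded = vb_encode_postings([824, 829, 215406])
print(" ".join(f"{b:08b}" for b in encoded))
# 00000110 10111000 10000101 00001101 00001100 10110001  (as on the slide)
print(vb_decode_postings(encoded))  # [824, 829, 215406]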

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g., K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + log₁₀ tf_{t,d}) × log₁₀(N / df_t)

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf × idf

Increases with the number of occurrences within a document (tf component)

Increases with the rarity of the term in the collection (idf component) 67

Sec 622
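A one-function sketch of the weight formula above; the example numbers are made up:

import math

def tf_idf_weight(tf_td, df_t, N):
    # w_{t,d} = (1 + log10 tf) * log10(N / df); 0 when the term is absent.
    if tf_td == 0 or df_t == 0:
        return 0.0
    return (1 + math.log10(tf_td)) * math.log10(N / df_t)

# e.g. a term occurring 3 times in a doc, appearing in 100 of N = 1,000,000 docs:
print(round(tf_idf_weight(3, 100, 1_000_000), 3))  # 5.908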

CS3245 ndash Information Retrieval

Queries as vectors. Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents". Key idea 2: Rank documents according to their proximity to the query in this space.

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation: Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_i q_i d_i,  for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems. Making the Vector Space Model more efficient to compute: approximating the actual correct results; skipping unnecessary documents. In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
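The CosineScore pseudocode on this recap slide is an image in the transcript; below is a minimal Python sketch of the same term-at-a-time scoring with accumulators. The input names (postings, doc_lengths) and the unit query weights are assumptions for illustration:

from collections import defaultdict
import heapq

def cosine_score(query_terms, postings, doc_lengths, K=10):
    # postings: term -> list of (docID, w_td); doc_lengths: docID -> vector length.
    scores = defaultdict(float)              # accumulators
    for term in query_terms:
        w_tq = 1.0                           # simplified query weight
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_tq * w_td
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]   # length-normalize
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])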

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.

Think of A as pruning non-contenders.

The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics: Index elimination (high-idf query terms only; docs containing many query terms)

Champion lists: generalized to high-low lists and tiered indexes

Impact-ordered postings: early termination, idf-ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q,d) = g(d) + cosine(q,d)

Can use some other linear combination than an equal weighting

Indeed, any function of the two "signals" of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: Year = 1601 is an example of a field. Field or parametric index: postings for each field value. Sometimes build range trees (e.g., B-trees for dates). A field query is typically treated as a conjunction (doc must be authored by Shakespeare).

Zones: A zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, … Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9: IR Evaluation. How do we know if our results are any good? Benchmarks; Precision and Recall; Composite measures; Test collections

A/B Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve, with Precision on the y-axis and Recall on the x-axis, both ranging 0.0–1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: An agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement.

Kappa (κ) = [P(A) − P(E)] / [1 − P(E)], where P(A) is the proportion of the time the judges agree and P(E) is the agreement we would expect by chance.

Gives 0 for chance agreement, 1 for total agreement.
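A small sketch of the computation, estimating P(E) from the pooled marginals as in the chapter; the judgment lists below are hypothetical:

def kappa_from_judgments(judge1, judge2):
    # Kappa for two judges over binary relevance judgments.
    n = len(judge1)
    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n
    # pooled probability of a "relevant" judgment across both judges
    p_rel = (sum(judge1) + sum(judge2)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

# Hypothetical judgments on 10 documents (1 = relevant, 0 = non-relevant):
j1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
j2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
print(round(kappa_from_judgments(j1, j2), 3))  # ≈ 0.583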

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness.

Probably the evaluation methodology that large search engines trust most. 80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3 XML IR: Basic XML concepts; Challenges in XML IR; Vector space model for XML IR; Evaluation of XML IR

Chapter 9 Query Refinement:
1. Relevance Feedback (document level): Explicit RF – Rocchio (1971), when does it work; Variants: Implicit and Blind
2. Query Expansion (term level): Manual thesaurus; Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents. α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton)
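The modified-query formula on this slide is an image in the transcript; the standard Rocchio update it describes (cf. IIR eq. 9.3) is:

\vec{q}_m \;=\; \alpha\,\vec{q}_0 \;+\; \beta\,\frac{1}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j \;-\; \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j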

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus. Simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the term-document matrix and w_{ij} = (normalized) weight for (t_i, d_j).

For each t_i, pick terms with high values in C.

[Figure: the M × N term-document matrix A, with rows t_i and columns d_j]

Sec 923

In NLTK Did you forget

84
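A sketch with NumPy on a made-up term-document matrix; the terms and weights are toy values, not from the slides:

import numpy as np

terms = ["car", "automobile", "engine", "flower"]
A = np.array([                      # 4 terms x 5 documents of (normalized) weights
    [0.7, 0.0, 0.5, 0.0, 0.6],
    [0.6, 0.1, 0.4, 0.0, 0.5],
    [0.3, 0.0, 0.6, 0.1, 0.4],
    [0.0, 0.8, 0.0, 0.7, 0.0],
])

C = A @ A.T                         # term-term similarity matrix C = A A^T
i = terms.index("car")
ranked = sorted(((C[i, j], terms[j]) for j in range(len(terms)) if j != i), reverse=True)
print(ranked[0])                    # highest-scoring neighbour of "car" is "automobile" here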

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees. Aim to have each dimension of the vector space encode a word together with its position within the XML tree.

How? Map XML documents to lexicalized subtrees.

85

[Figure: an example XML document (Book with Title "Microsoft" and Author "Bill Gates") decomposed into lexicalized subtrees "with words", e.g. Author→Bill, Author→Gates, Title→Microsoft, Book→Title→Microsoft]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86
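The CR function itself appears only as an image here; written out (cf. IIR Section 10.3) it is:

\mathrm{CR}(c_q, c_d) \;=\; \begin{cases} \dfrac{1+|c_q|}{1+|c_d|} & \text{if } c_q \text{ matches } c_d \\[4pt] 0 & \text{otherwise} \end{cases}

For example, with |cq| = 2 and |cd| = 4 this gives 3/5 = 0.6, matching CR(c1, c3) = 0.60 in the next slide's example.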

CS3245 ndash Information Retrieval

SimNoMerge example

87

query: <c1, t>   e.g., author/"Bill"

Inverted index (dictionary → postings):
<c1, t>  (e.g., author/"Bill")                 → <d1, 0.5>, <d4, 0.1>, <d9, 0.2>
<c2, t>  (e.g., title/"Bill")                  → <d2, 0.25>, <d3, 0.1>, <d12, 0.9>
<c3, t>  (e.g., book/author/firstname/"Bill")  → <d3, 0.7>, <d6, 0.8>, <d9, 0.5>

This example is slightly different from the book.

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.60

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each product is Context Resemblance × Query Term Weight × Document Term Weight; all weights have been normalized)

OK to ignore Query vs <c2, t> since CR = 0.0; only Query vs <c1, t> and Query vs <c3, t> contribute.
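A small Python sketch of this SimNoMerge-style scoring; the data structures (dicts keyed by (context, term) pairs) are assumptions for illustration, and normalization is taken as already done, as the slide states:

def sim_no_merge(query, inverted_index, cr):
    # query: list of (context, term, weight); inverted_index: (context, term) -> [(docID, weight)];
    # cr: (query_context, doc_context) -> context resemblance.
    scores = {}
    for q_ctx, term, w_q in query:
        for (d_ctx, d_term), postings in inverted_index.items():
            if d_term != term or cr.get((q_ctx, d_ctx), 0.0) == 0.0:
                continue
            for doc_id, w_d in postings:
                scores[doc_id] = scores.get(doc_id, 0.0) + cr[(q_ctx, d_ctx)] * w_q * w_d
    return scores

index = {
    ("c1", "t"): [("d1", 0.5), ("d4", 0.1), ("d9", 0.2)],
    ("c2", "t"): [("d2", 0.25), ("d3", 0.1), ("d12", 0.9)],
    ("c3", "t"): [("d3", 0.7), ("d6", 0.8), ("d9", 0.5)],
}
cr = {("c1", "c1"): 1.0, ("c1", "c2"): 0.0, ("c1", "c3"): 0.6}
print(round(sim_no_merge([("c1", "t", 1.0)], index, cr)["d9"], 3))  # 0.5, as on the slide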

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR. Chapter 11: 1. Probabilistic Approach to Retrieval, Basic Probability Theory; 2. Probability Ranking Principle; 3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions. Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d is represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.

Independence: no association between terms (not true, but works in practice – a naïve assumption).

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing the RSV. We can rank documents using the log odds ratios c_t for the terms in the query.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t).

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d is the sum of the c_t of the query terms present in d. Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in the VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
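The c_t weight and the RSV referred to above are shown only as images in the transcript; written out (cf. IIR Section 11.3.1) they are:

c_t \;=\; \log\frac{p_t\,(1-u_t)}{u_t\,(1-p_t)} \;=\; \log\frac{p_t}{1-p_t} \;-\; \log\frac{u_t}{1-u_t}, \qquad \mathrm{RSV}_d \;=\; \sum_{t:\;x_t = q_t = 1} c_t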

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tf_tq: term frequency in the query q. k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single, fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, with b = 0.75.

Sec 1143

92
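The long-query BM25 formula this slide refers to is an image in the transcript; its standard form (cf. IIR eq. 11.34) is:

\mathrm{RSV}_d \;=\; \sum_{t \in q} \log\frac{N}{\mathrm{df}_t} \cdot \frac{(k_1+1)\,\mathrm{tf}_{t,d}}{k_1\bigl((1-b) + b\,\tfrac{L_d}{L_{\mathrm{ave}}}\bigr) + \mathrm{tf}_{t,d}} \cdot \frac{(k_3+1)\,\mathrm{tf}_{t,q}}{k_3 + \mathrm{tf}_{t,q}}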

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
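The mixture-model equation summarized here is an image in the transcript; the Jelinek-Mercer mixture it refers to (cf. IIR eq. 12.12) is:

P(d \mid q) \;\propto\; P(d)\,\prod_{t \in q} \bigl(\lambda\,P(t \mid M_d) + (1-\lambda)\,P(t \mid M_c)\bigr)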

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202
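A crawler written in the course's Python toolchain can honor robots.txt with the standard library; a minimal sketch (the URL and user agent below are hypothetical):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                      # fetch and parse once, then cache it per site

if rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")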

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps, at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable.
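A minimal sketch of this power iteration in Python/NumPy; the 3-page transition matrix below is a made-up toy with teleporting already folded in (rows sum to 1):

import numpy as np

def pagerank_power_iteration(A, tol=1e-10, max_iter=1000):
    # Power iteration on a row-stochastic matrix A; returns the steady-state vector a.
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                     # start with any distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A             # one step of the random walk
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

A = np.array([
    [0.05, 0.90, 0.05],
    [0.45, 0.10, 0.45],
    [0.05, 0.90, 0.05],
])
print(pagerank_power_iteration(A).round(3))   # steady-state PageRank vector of the toy web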

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future: Python – one of the easiest and most straightforward programming languages to use; NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103 Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 64: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 65: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 66: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new

system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion

(OEC) like clickthrough on the first result. Now we can directly see if the innovation does improve user

happiness. Probably the evaluation methodology that large search

engines trust most
80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR

Chapter 10.3 XML IR: Basic XML concepts, Challenges in XML IR, Vector space model

for XML IR, Evaluation of XML IR

Chapter 9 Query Refinement: 1. Relevance Feedback

(Document Level)

Explicit RF: Rocchio (1971), When does it work?

Variants: Implicit and Blind

2. Query Expansion (Term Level)

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments

from a few documents. α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
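The Rocchio update formula itself appeared as an image on the slide; in its standard form, the modified query is alpha times the original query, plus beta times the centroid of the known relevant docs, minus gamma times the centroid of the known non-relevant docs, with negative weights clipped to 0. A Python sketch with typical, illustrative weights:

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr); drop negative weights
    terms = set(q0) | {t for d in rel_docs + nonrel_docs for t in d}
    qm = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in rel_docs) / max(len(rel_docs), 1)
        non = sum(d.get(t, 0.0) for d in nonrel_docs) / max(len(nonrel_docs), 1)
        w = alpha * q0.get(t, 0.0) + beta * rel - gamma * non
        if w > 0:
            qm[t] = w
    return qm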

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents

In query expansion, users give additional input (good/bad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: Simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the term-document matrix and w_ij = (normalized) weight for (t_i, d_j)

For each t_i, pick terms with high values in C

[Diagram: A is an M x N matrix, one row per term t_i, one column per document d_j]

Sec 923

In NLTK Did you forget

84
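A small numpy sketch of the C = A·Aᵀ computation (the toy matrix and term names are made up):

import numpy as np

# Toy term-document matrix A: M terms x N docs, rows already weighted and normalized
terms = ["car", "auto", "insurance", "best"]
A = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.0, 0.1],
              [0.7, 0.2, 0.1],
              [0.0, 0.9, 0.3]])

C = A @ A.T                      # term-term similarity matrix C = A * A^T
np.fill_diagonal(C, 0.0)         # ignore self-similarity
for i, t in enumerate(terms):
    j = int(np.argmax(C[i]))     # most similar term = thesaurus entry for t
    print(t, "->", terms[j])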

CS3245 ndash Information Retrieval

Main idea, lexicalized subtrees: Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How? Map XML documents to lexicalized subtrees

85

[Figure: an XML document (Book with Title "Microsoft" and Author "Bill Gates") is decomposed into its lexicalized subtrees, e.g. Author→Bill, Author→Gates, Book/Title→Microsoft; each subtree "with words" becomes one dimension of the vector space]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path

cd in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path, respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86
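The CR formula itself appeared as an image; the standard definition from IIR Chapter 10, which is consistent with the CR(c1, c3) = 0.60 value in the next example when the word itself is counted as a path node, is: CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise.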

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>

Inverted index (dictionary → postings):
<c1, t> → <d1, 0.5>, <d4, 0.1>, <d9, 0.2>      (e.g. c1 = author/"Bill")
<c2, t> → <d2, 0.25>, <d3, 0.1>, <d12, 0.09>   (e.g. c2 = title/"Bill")
<c3, t> → <d3, 0.7>, <d6, 0.8>, <d9, 0.5>      (e.g. c3 = book/author/firstname/"Bill")

This example is slightly different from the book.

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.60

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each product = Context Resemblance × Query Term Weight × Document Term Weight)

All weights have been normalized.

OK to ignore Query vs <c2, t> since CR = 0.0; only Query vs <c1, t> and Query vs <c3, t> contribute.
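A hedged Python sketch of the SimNoMerge scoring loop that this example illustrates (the data-structure names are made up):

def sim_no_merge(query, index, cr):
    # query: list of (context, term, query weight); index: (context, term) -> list of (doc, weight)
    scores = {}
    for c_q, t, w_q in query:
        for (c_d, t_d), postings in index.items():
            if t_d != t:
                continue
            r = cr(c_q, c_d)          # context resemblance
            if r == 0.0:
                continue              # e.g. skip <c2, t> above
            for doc, w_d in postings:
                scores[doc] = scores.get(doc, 0.0) + r * w_q * w_d
    return scores

# With the index above and CR(c1,c1)=1.0, CR(c1,c3)=0.6,
# d9 accumulates 1.0*1.0*0.2 + 0.6*1.0*0.5 = 0.5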

CS3245 ndash Information Retrieval

INEX relevance assessments: The relevance-coverage combinations are quantized as follows [quantization table shown as an image on the slide]

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of the quantized relevance values of its members

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11: 1. Probabilistic Approach to Retrieval,

Basic Probability Theory; 2. Probability Ranking Principle; 3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): Traditionally used with the PRP

Assumptions: Binary (equivalent to Boolean): documents and

queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation

Independence: no association between terms (not true, but works in practice; a naïve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: We can rank documents using the log odds ratios c_t for the terms in the query

The odds ratio is the ratio of two odds:

(i) the odds of the term appearing if relevant, p_t / (1 - p_t),

and

(ii) the odds of the term appearing if nonrelevant, u_t / (1 - u_t),

so c_t = log [ p_t (1 - u_t) ] / [ u_t (1 - p_t) ]

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents

c_t functions as a term weight, so that RSV_d is the sum of c_t over query terms present in document d

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
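A minimal sketch of BIM scoring under these definitions (the term names and probabilities are made up; in practice p_t and u_t come from relevance judgments or the usual smoothed estimates):

import math

def c_t(p_t, u_t):
    # log odds ratio: log[p_t(1-u_t)] - log[u_t(1-p_t)]
    return math.log(p_t * (1 - u_t)) - math.log(u_t * (1 - p_t))

def rsv(query_terms, doc_terms, p, u):
    # sum c_t over query terms that occur in the document (accumulator style)
    return sum(c_t(p[t], u[t]) for t in query_terms if t in doc_terms)

p = {"okapi": 0.6, "model": 0.3}   # P(term present | relevant), made up
u = {"okapi": 0.02, "model": 0.2}  # P(term present | non-relevant), made up
print(rsv({"okapi", "model"}, {"okapi", "bm25", "model"}, p, u))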

CS3245 ndash Information Retrieval

Okapi BM25, A Nonbinary Model: If the query is long, we might also use similar weighting for query

terms

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the

query. No length normalization of queries

(because retrieval is being done with respect to a single, fixed query). The above tuning parameters should be set by optimization on a

development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
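The BM25 scoring formula itself appeared as an image; as a reminder, the long-query variant multiplies, for each query term, an idf factor, a length-normalized document tf factor governed by k1 and b, and a query tf factor governed by k3. A Python sketch with illustrative parameter values:

import math

def bm25(query_tf, doc_tf, df, N, doc_len, avg_len, k1=1.2, k3=1.2, b=0.75):
    # RSV_d = sum over query terms t of
    #   idf_t * ((k1+1)*tf_td / (k1*((1-b) + b*doc_len/avg_len) + tf_td))
    #         * ((k3+1)*tf_tq / (k3 + tf_tq))
    score = 0.0
    for t, tf_tq in query_tf.items():
        tf_td = doc_tf.get(t, 0)
        if tf_td == 0 or t not in df:
            continue
        idf = math.log(N / df[t])
        doc_part = (k1 + 1) * tf_td / (k1 * ((1 - b) + b * doc_len / avg_len) + tf_td)
        query_part = (k3 + 1) * tf_tq / (k3 + tf_tq)
        score += idf * doc_part * query_part
    return score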

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and

'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in

the exact same way. Difference: for probabilistic IR, at the end you score

queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language

model. Given a query q, rank documents based on P(d|q) ∝ P(q|d) P(d)

P(q) is the same for all documents, so ignore it. P(d) is the prior, often treated as the same for all d

But we can give a prior to "high-quality" documents, e.g., those with a high static quality score g(d) (cf. Section 7.1.4)

P(q|d) is the probability of q given d. So to rank documents according to relevance to q,

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model: The user has a document in mind and generates the query from this document

The equation represents the probability that the document the user had in mind was, in fact, this one

95

Sec 1222
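The mixture-model equation referred to above is, in its standard linear-interpolation form, P(q|d) ≈ Π over query terms t of [ λ·P(t|Md) + (1−λ)·P(t|Mc) ], where Md is the document's language model and Mc the collection model. A small Python sketch with made-up counts and λ = 0.5:

import math

def query_likelihood(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    # log P(q|d) = sum over query terms of log(lambda*P(t|Md) + (1-lambda)*P(t|Mc))
    score = 0.0
    for t in query:
        p_doc = doc_tf.get(t, 0) / doc_len
        p_coll = coll_tf.get(t, 0) / coll_len
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

# Toy example: two documents, rank by P(q|d)
coll_tf, coll_len = {"frog": 5, "toad": 3, "said": 10}, 100
d1 = ({"frog": 3, "said": 2}, 10)
d2 = ({"toad": 1, "said": 1}, 10)
q = ["frog", "said"]
print(query_likelihood(q, *d1, coll_tf, coll_len),
      query_likelihood(q, *d2, coll_tf, coll_len))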

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling

Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling: Politeness: need to crawl only the allowed pages, and

space out the requests for a site over a longer period (hours, days)

Duplicates: need to integrate duplicate detection

Scale: need to use a distributed approach

Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access

to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98
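For reference, Python's standard library can fetch and evaluate robots.txt rules; a brief sketch (the site URL is illustrative):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                   # fetch and parse (cache one parser per site)
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))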

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps, at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
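A small numpy sketch of this power iteration (the 2-state transition matrix is a made-up example, not from the slides):

import numpy as np

A = np.array([[0.1, 0.9],     # toy teleport-adjusted transition matrix (rows sum to 1)
              [0.3, 0.7]])
x = np.array([1.0, 0.0])      # start with any distribution

for _ in range(50):           # multiply by increasing powers of A until stable
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-10):
        break
    x = x_next

print(x)                      # steady-state vector a, the pagerank values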

CS3245 ndash Information Retrieval

100

Pagerank summary: Pre-processing: Given the graph of links, build matrix A. From it, compute a. The pagerank a_i is a scaled number between 0 and 1

Query processing: Retrieve pages meeting the query. Rank them by their pagerank. Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future: Python, one of the easiest and more straightforward

programming languages to use; NLTK, a good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR/VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
  • Slide Number 104
  • Slide Number 105
Page 70: Search Advertising, Duplicate Detection and Revision (zhaojin/cs3245_2019/w13.pdf)

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants
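The table of weighting variants on this slide was an image; as a reminder, one standard cell of it (logarithmic tf combined with idf) is:
$w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10} \frac{N}{\mathrm{df}_t}$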

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute:
Approximating the actual correct results
Skipping unnecessary documents
In actual data: dealing with zones and fields, query term proximity
Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
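The scoring algorithm on this slide was shown as a figure; a minimal term-at-a-time sketch of the same accumulator idea (illustrative names, not the slide's exact pseudocode):

import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_length, k=10):
    # query_weights: {term: w(t,q)}, postings: {term: [(doc_id, w(t,d)), ...]},
    # doc_length: {doc_id: vector length used for normalization}
    scores = defaultdict(float)
    for term, wtq in query_weights.items():
        for doc_id, wtd in postings.get(term, []):
            scores[doc_id] += wtq * wtd          # accumulate partial dot products
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]     # length-normalize
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# tiny example: two query terms, three documents
postings = {"cat": [(1, 0.5), (2, 0.2)], "dog": [(1, 0.3), (3, 0.9)]}
print(cosine_scores({"cat": 1.0, "dog": 1.0}, postings, {1: 1.0, 2: 1.0, 3: 1.0}, k=2))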

CS3245 ndash Information Retrieval

Generic approach: find a set A of contenders with K < |A| ≪ N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics:
Index elimination: high-idf query terms only; docs containing many query terms
Champion lists: generalized to high-low lists and tiered indexes
Impact-ordered postings: early termination, idf-ordered terms
Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score: consider a simple total score combining cosine relevance and authority:
net-score(q,d) = g(d) + cosine(q,d)
Can use some other linear combination than an equal weighting; indeed, any function of the two "signals" of user happiness.
Now we seek the top K docs by net score.

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices
Fields: "Year = 1601" is an example of a field. A field or parametric index has postings for each field value; sometimes build range (B-tree) indexes (e.g., for dates). A field query is typically treated as a conjunction (doc must be authored by Shakespeare).
Zone: a zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, … Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9: IR Evaluation. How do we know if our results are any good? Benchmarks; Precision and Recall; Composite measures; Test collections; A/B Testing.

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve (figure: precision on the y-axis vs. recall on the x-axis, both from 0.0 to 1.0)

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: agreement measure among judges, designed for categorical judgments; corrects for chance agreement.
Kappa (κ) = [P(A) − P(E)] / [1 − P(E)], where P(A) is the proportion of the time the judges agree and P(E) is what agreement would be by chance.
Gives 0 for chance agreement, 1 for total agreement.
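A quick worked example with hypothetical numbers: if two judges agree on 80% of the judgments and chance agreement would be 50%, then
$\kappa = \frac{P(A) - P(E)}{1 - P(E)} = \frac{0.8 - 0.5}{1 - 0.5} = 0.6$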

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.
Have most users use the old system; divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR
Chapter 10.3: XML IR: basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR
Chapter 9: Query Refinement
1. Relevance Feedback (document level): Explicit RF – Rocchio (1971); when does it work?; variants: implicit and blind
2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971), in practice:
Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.
α, β, γ = weights (hand-chosen or set empirically).
Popularized in the SMART system (Salton).
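The update formula itself appeared as an image on the slide; for reference, the standard Rocchio formula (IIR ch. 9) that these symbols plug into is:
$\vec{q}_m = \alpha \vec{q}_0 + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j \; - \; \frac{\gamma}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$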

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents.
In query expansion, users give additional input (good / bad search term) on words or phrases.

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: the simplest way to compute one is based on term-term similarities in C = AA^T, where A is the M × N term-document matrix (M terms t_i, N documents d_j) and w_ij is the (normalized) weight for (t_i, d_j).
For each t_i, pick terms with high values in C.

Sec 923

In NLTK Did you forget

84
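A minimal sketch of this computation on a toy term-document matrix (numpy assumed; not the NLTK routine the slide alludes to):

import numpy as np

terms = ["cat", "feline", "dog"]
A = np.array([[0.9, 0.0, 0.7],   # rows = terms, columns = 3 documents (toy tf-idf weights)
              [0.8, 0.1, 0.6],
              [0.0, 0.9, 0.1]])
C = A @ A.T                      # term-term similarity matrix C = A A^T
np.fill_diagonal(C, 0.0)         # ignore a term's similarity to itself
for i, t in enumerate(terms):
    j = int(np.argmax(C[i]))     # most strongly co-occurring term for t_i
    print(t, "->", terms[j])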

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees. Aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.

85

(Figure: an XML document with structure Book, containing Title "Microsoft" and Author "Bill Gates", decomposed into its lexicalized subtrees "with words", e.g., Book/Title/"Microsoft", Book/Author/"Bill Gates", Author/"Bill", Author/"Gates", Title/"Microsoft", Book, …)

CS3245 ndash Information Retrieval

Context resemblance: a simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR.
|cq| and |cd| are the number of nodes in the query path and document path, respectively.
cq matches cd iff we can transform cq into cd by inserting additional nodes.
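The CR function referred to here (as given in IIR ch. 10) is:
$CR(c_q, c_d) = \begin{cases} \dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\ 0 & \text{otherwise} \end{cases}$
For example, with c_q = author/"Bill" (2 nodes) and c_d = book/author/firstname/"Bill" (4 nodes), CR = 3/5 = 0.6, which matches the value used in the SimNoMerge example on the next slide.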

86

CS3245 ndash Information Retrieval

SimNoMerge example (this example is slightly different from the book):
Query context: <c1, t>, e.g., author/"Bill". Inverted index dictionary contexts, each with a postings list of <doc, weight> entries: <c1, t> (e.g., author/"Bill"), <c2, t> (e.g., title/"Bill"), <c3, t> (e.g., book/author/firstname/"Bill"); the postings shown include <d1, 0.5>, <d4, 0.1>, <d9, 0.2>, <d2, 0.25>, <d3, 0.1>, <d12, 0.9>, <d3, 0.7>, <d6, 0.8>, <d9, 0.5>. All weights have been normalized.
Context resemblance: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60. OK to ignore query vs <c2, t> since CR = 0.0; only query vs <c1, t> and query vs <c3, t> contribute.
If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5 (each term is context resemblance × query term weight × document term weight).

CS3245 ndash Information Retrieval

INEX relevance assessments: the relevance-coverage combinations are quantized as follows. This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval; the quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of the quantized grades Q over the components in A.

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11: 1. Probabilistic Approach to Retrieval (basic probability theory); 2. Probability Ranking Principle; 3. Binary Independence Model, BestMatch25 (Okapi)
Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): traditionally used with the PRP.
Assumptions:
Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice – naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: we can rank documents using the log odds ratios c_t for the terms in the query.
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t).
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that the RSV of a document is the sum of the c_t of the matching query terms. Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
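For reference, the log odds ratio term weight from IIR 11.3.1 that these statements describe is:
$c_t = \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)} = \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t}, \qquad RSV_d = \sum_{t \in q \cap d} c_t$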

CS3245 ndash Information Retrieval

Okapi BM25, a nonbinary model: if the query is long, we might also use similar weighting for query terms.
tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.
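The scoring formula itself was a figure on the slide; the long-query form of BM25 from IIR 11.4.3, with the parameters described above, is:
$RSV_d = \sum_{t \in q} \log\frac{N}{\mathrm{df}_t} \cdot \frac{(k_1 + 1)\,\mathrm{tf}_{td}}{k_1\big((1-b) + b \cdot (L_d / L_{ave})\big) + \mathrm{tf}_{td}} \cdot \frac{(k_3 + 1)\,\mathrm{tf}_{tq}}{k_3 + \mathrm{tf}_{tq}}$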

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: the difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. The difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q). By Bayes' rule, P(d|q) = P(q|d) P(d) / P(q).
P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g., those with high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model: Summary
What we model: the user has a document in mind and generates the query from this document. The equation represents the probability that the document that the user had in mind was in fact this one.
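The equation referred to is the query-likelihood mixture (Jelinek-Mercer smoothing, IIR 12.2.2), where M_d is the document's language model and M_c the collection model:
$P(d \mid q) \propto P(d) \prod_{t \in q} \big( \lambda P(t \mid M_d) + (1 - \lambda) P(t \mid M_c) \big)$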

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling; Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
Duplicates: need to integrate duplicate detection
Scale: need to use a distributed approach
Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
Important: cache the robots.txt file of each site we are crawling.
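A minimal politeness sketch using Python's standard urllib.robotparser (the site URL, user-agent string, and delay are placeholders):

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                   # fetch the rules once and keep them cached for this site

if rp.can_fetch("MyCrawler", "http://www.example.com/some/page.html"):
    # ... fetch and process the page here ...
    time.sleep(5)                           # space out successive requests to the same site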

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps, at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a
Algorithm: multiply x by increasing powers of A until the product looks stable.
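A minimal sketch of this power iteration on a hypothetical 3-page transition matrix A (rows already include teleportation, so each row sums to 1):

import numpy as np

A = np.array([[0.05, 0.90, 0.05],   # hypothetical teleporting random-walk matrix
              [0.45, 0.10, 0.45],
              [0.05, 0.90, 0.05]])
x = np.array([1.0, 0.0, 0.0])       # start with any distribution

for _ in range(100):                # multiply by increasing powers of A ...
    nxt = x @ A
    if np.allclose(nxt, x, atol=1e-12):
        break                       # ... until the product looks stable: steady state a
    x = nxt

print(x)                            # the PageRank vector a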

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: given a graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query; rank them by their pagerank. Order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103. Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 71: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 72: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 73: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 – Information Retrieval

So it boils down to computing RSV: We can rank documents using the log odds ratios c_t for the terms in the query.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t):

c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d = Σ_{t : x_t = q_t = 1} c_t.

Operationally, we sum the c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec. 11.3.1

91
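A minimal sketch of this accumulator idea in Python, assuming the c_t weights have already been estimated (the terms, weights and postings below are hypothetical):

c = {"forest": 1.2, "fire": 0.8}                           # c_t: log odds ratio per query term
postings = {"forest": ["d1", "d3"], "fire": ["d2", "d3"]}  # term -> documents containing it

rsv = {}                                                   # one accumulator per document
for t in c:
    for d in postings[t]:
        rsv[d] = rsv.get(d, 0.0) + c[t]                    # sum c_t for query terms present in d

print(sorted(rsv.items(), key=lambda kv: -kv[1]))          # d3 (2.0) ranks above d1 (1.2) and d2 (0.8)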

CS3245 – Information Retrieval

Okapi BM25: A Nonbinary Model. If the query is long, we might also use similar weighting for the query terms:

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the query.

No length normalization of queries (because retrieval is being done with respect to a single, fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec. 11.4.3

92
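For reference, the full BM25 scoring function with this query-term weighting, as given in the IIR textbook (Sec. 11.4.3), is:
RSV_d = Σ_{t ∈ q} log(N / df_t) × [ (k1 + 1) tf_td / ( k1 ((1 − b) + b (L_d / L_ave)) + tf_td ) ] × [ (k3 + 1) tf_tq / ( k3 + tf_tq ) ],
where tf_td is the term frequency in document d, L_d and L_ave are the document length and the average document length, and N and df_t are the collection size and the document frequency of t.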

CS3245 – Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way.

Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec. 11.4.1

93

CS3245 – Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q).

P(q) is the same for all documents, so we can ignore it. P(d) is the prior – often treated as the same for all d, but we can give a higher prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).

P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec. 12.2
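Concretely, by Bayes' rule P(d|q) = P(q|d) P(d) / P(q); with P(q) constant across documents and a uniform prior P(d), ranking by P(d|q) is the same as ranking by P(q|d).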

CS3245 – Information Retrieval

Mixture model: Summary

What we model: The user has a document in mind and generates the query from this document.

The equation represents the probability that the document that the user had in mind was in fact this one.

95

Sec. 12.2.2
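The equation the slide refers to (IIR Sec. 12.2.2) is: P(d|q) ∝ P(d) Π_{t ∈ q} ( λ P(t|M_d) + (1 − λ) P(t|M_c) ), where M_d is the document's language model, M_c is the collection language model, and 0 < λ < 1 is the mixing weight.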

CS3245 – Information Retrieval

Week 12: Web Search

Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch. 20-21

CS3245 – Information Retrieval

Crawling:
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days).
Duplicates: need to integrate duplicate detection.
Scale: need to use a distributed approach.
Spam and spider traps: need to integrate spam detection.

Information Retrieval 97

Sec. 20.1

CS3245 – Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994.

Example:
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec. 20.2
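A small sketch of honouring such a file with Python's standard-library robots.txt parser (the rules and URLs below simply mirror the example above; a real crawler would fetch the file once and cache one parser per site):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /yoursite/temp/",
    "",
    "User-agent: searchengine",
    "Disallow:",
])  # in a real crawler: rp.set_url("http://site/robots.txt"); rp.read(); then cache rp per site

print(rp.can_fetch("*", "http://www.example.com/yoursite/temp/page.html"))             # False: disallowed
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))  # True: no restriction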

CS3245 – Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec. 21.2.2

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0)).
2. After one step, we're at xA.
3. After two steps, at xA^2, then xA^3, and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.
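A minimal sketch of this power iteration, assuming numpy and a small, hypothetical row-stochastic transition matrix A (teleporting already folded in):

import numpy as np

A = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.1, 0.5],
              [0.5, 0.4, 0.1]])      # rows sum to 1

x = np.array([1.0, 0.0, 0.0])        # start with any distribution
for _ in range(100):
    nxt = x @ A                      # one step of the chain
    if np.allclose(nxt, x, atol=1e-10):
        break                        # the product looks stable: steady state reached
    x = nxt

print(x)                             # approximates the pagerank vector a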

CS3245 – Information Retrieval

100

Pagerank summary
Pre-processing: Given the graph of links, build the matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: Retrieve pages meeting the query and rank them by their pagerank. The order is query-independent.

Sec. 21.2.2

CS3245 – Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 – Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use.
NLTK – a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 – Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 – Information Retrieval

104

Opportunities in IR

CS3245 – Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 74: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 75: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 76: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 77: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 – Information Retrieval

Okapi BM25, A Nonbinary Model: If the query is long, we might also use similar weighting for query terms.

tftq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query)

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.

Sec 11.4.3

92
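A sketch of the scoring described above, using the common BM25 form with an idf factor of log(N/dft) and the extra k3 factor for query term frequency; all counts in the example call are invented, just to show the shape of the computation:

from math import log

def bm25(query_tf, doc_tf, doc_len, avg_doc_len, N, df, k1=1.2, b=0.75, k3=1.5):
    score = 0.0
    for t, tf_tq in query_tf.items():
        tf_td = doc_tf.get(t, 0)
        if tf_td == 0 or t not in df:
            continue
        idf = log(N / df[t])
        doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)   # only matters for long queries
        score += idf * doc_part * query_part
    return score

print(bm25(query_tf={"okapi": 1, "model": 2},
           doc_tf={"okapi": 3, "model": 1, "ranking": 4},
           doc_len=8, avg_doc_len=10, N=1000,
           df={"okapi": 50, "model": 400, "ranking": 300}))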

CS3245 – Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. Difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 11.4.1

93

CS3245 – Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q):

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, so ignore it.
P(d) is the prior – often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 12.2

CS3245 – Information Retrieval

Mixture model: Summary

P(d|q) ∝ P(d) · Π over query terms t of ( λ P(t|Md) + (1 − λ) P(t|Mc) )

What we model: The user has a document in mind and generates the query from this document. The equation represents the probability that the document that the user had in mind was in fact this one.

95

Sec 12.2.2
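A sketch of the equation above with a uniform prior P(d), λ = 0.5, and toy counts; all names and numbers here are illustrative:

def mixture_score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    # P(d|q) up to a document-independent constant, assuming a uniform prior P(d)
    p = 1.0
    for t in query:
        p_doc = doc_tf.get(t, 0) / doc_len        # P(t | Md), document model
        p_coll = coll_tf.get(t, 0) / coll_len     # P(t | Mc), collection model
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

doc_tf = {"click": 2, "shears": 1}      # toy document of length 10
coll_tf = {"click": 20, "shears": 2}    # toy collection of length 1000
print(mixture_score(["click", "shears"], doc_tf, 10, coll_tf, 1000))   # about 0.0056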

CS3245 – Information Retrieval

Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 – Information Retrieval

Crawling
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
Duplicates: need to integrate duplicate detection
Scale: need to use a distributed approach
Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 20.1

CS3245 – Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 20.2
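A small sketch of honouring robots.txt with Python's standard library; the site URL and user-agent string are placeholders, and a real crawler would also cache the parsed file per host, as noted above:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # placeholder site
rp.read()                                          # fetch and parse the file once

if rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/"):
    print("allowed to crawl this URL")
else:
    print("disallowed by robots.txt")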

CS3245 – Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 21.2.2

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution x (say x = (1, 0, ..., 0))
2. After one step, we're at xA
3. After two steps, at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable.
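A sketch of this power iteration on a tiny hand-made transition matrix A (the values are illustrative; each row sums to 1):

def power_iteration(A, tol=1e-10, max_iter=1000):
    n = len(A)
    x = [1.0] + [0.0] * (n - 1)                   # start at x = (1, 0, ..., 0)
    for _ in range(max_iter):
        nxt = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]  # x = xA
        if max(abs(nxt[j] - x[j]) for j in range(n)) < tol:              # looks stable
            return nxt
        x = nxt
    return x

A = [[0.10, 0.70, 0.20],
     [0.40, 0.10, 0.50],
     [0.45, 0.45, 0.10]]
print(power_iteration(A))   # the steady-state vector a, i.e. the pagerank scores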

CS3245 – Information Retrieval

100

Pagerank summary
Pre-processing: Given the graph of links, build matrix A. From it, compute a. The pagerank ai is a scaled number between 0 and 1.
Query processing: Retrieve pages meeting the query. Rank them by their pagerank. Order is query-independent.

Sec 21.2.2

CS3245 – Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 – Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 – Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 – Information Retrieval

104

Opportunities in IR

CS3245 – Information Retrieval

Thanks for joining us for the journey

105

Page 78: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 79: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 80: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 81: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105


CS3245 – Information Retrieval Sec. 9.2.2

82

Rocchio (1971)

In practice:
 D_r = set of known relevant doc vectors; D_nr = set of known irrelevant doc vectors (different from C_r and C_nr, as we only get judgments from a few documents)
 α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
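The modified query on this slide is the standard Rocchio update, q_m = α·q0 + β·(1/|D_r|)·Σ_{d in D_r} d - γ·(1/|D_nr|)·Σ_{d in D_nr} d. A minimal numpy sketch of that update; the default weight values and the toy vectors are illustrative, not from the slides.

    import numpy as np

    def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
        """Move the query toward the centroid of relevant docs and away from
        the centroid of non-relevant docs; the weights here are hand-chosen."""
        q = alpha * np.asarray(q0, dtype=float)
        if rel_docs:
            q = q + beta * np.mean(rel_docs, axis=0)
        if nonrel_docs:
            q = q - gamma * np.mean(nonrel_docs, axis=0)
        return np.maximum(q, 0.0)   # negative term weights are usually clipped to 0

    # Toy 4-term vocabulary:
    q0  = [1.0, 0.0, 1.0, 0.0]
    Dr  = [np.array([0.9, 0.1, 0.8, 0.0])]    # judged relevant
    Dnr = [np.array([0.0, 0.7, 0.0, 0.9])]    # judged non-relevant
    print(rocchio(q0, Dr, Dnr))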

CS3245 – Information Retrieval

Query Expansion
 In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents.
 In query expansion, users give additional input (good / bad search term) on words or phrases.

Sec. 9.2.2

83

CS3245 – Information Retrieval

Co-occurrence Thesaurus
 Simplest way to compute one is based on term-term similarities in C = A·A^T, where A is the M × N term-document matrix and w_ij = (normalized) weight for (t_i, d_j).
 For each t_i, pick the terms with high values in C (a small sketch follows after this slide).

(Figure: the term-document matrix A, with M rows t_i and N columns d_j.)

Sec. 9.2.3

In NLTK: did you forget?

84
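A small sketch of this computation; the toy term-document weights and the choice of raw (unnormalized) similarities are assumptions for the example.

    import numpy as np

    terms = ["car", "auto", "insurance", "best"]
    # A: M x N term-document weight matrix (toy values)
    A = np.array([[0.9, 0.1, 0.0],
                  [0.8, 0.0, 0.1],
                  [0.1, 0.9, 0.8],
                  [0.0, 0.2, 0.7]])

    C = A @ A.T                          # term-term similarity matrix C = A A^T

    def most_similar(term, k=2):
        i = terms.index(term)
        order = np.argsort(-C[i])        # highest similarity first
        return [terms[j] for j in order if j != i][:k]

    print(most_similar("car"))           # -> ['auto', 'insurance']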

CS3245 – Information Retrieval

Main idea: lexicalized subtrees
 Aim: have each dimension of the vector space encode a word together with its position within the XML tree.
 How: map XML documents to lexicalized subtrees.

85

(Figure: the XML tree Book → {Title "Microsoft", Author "Bill Gates"} is decomposed into lexicalized subtrees "with words", e.g. Author/"Bill", Author/"Gates", Author/"Bill Gates", Title/"Microsoft", Book/Title/"Microsoft", Book/Author/"Bill Gates".)
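As a simplified illustration (a sketch, not the book's exact indexing unit), the standard library's ElementTree can pair each node's root-to-node path with the words in that node's text, which is the flavour of "structural term" the lexicalized-subtree idea builds on.

    import xml.etree.ElementTree as ET

    doc = ET.fromstring("<Book><Title>Microsoft</Title><Author>Bill Gates</Author></Book>")

    def path_word_pairs(node, prefix=""):
        """Yield (path, word) pairs, e.g. ('Book/Author', 'Bill')."""
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        for word in (node.text or "").split():
            yield (path, word)
        for child in node:
            yield from path_word_pairs(child, path)

    print(list(path_word_pairs(doc)))
    # [('Book/Title', 'Microsoft'), ('Book/Author', 'Bill'), ('Book/Author', 'Gates')]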

CS3245 – Information Retrieval

Context resemblance
 A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:

  CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|)  if c_q matches c_d,  and 0 otherwise

 |c_q| and |c_d| are the number of nodes in the query path and document path, respectively.
 c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes.

86
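A small sketch of CR under these definitions, with paths written as lists of node labels (including the word node, matching the example on the next slide); the subsequence test implements "c_q matches c_d".

    def matches(cq, cd):
        """True iff cq can be turned into cd by inserting extra nodes."""
        it = iter(cd)
        return all(node in it for node in cq)   # subsequence test

    def cr(cq, cd):
        return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0

    print(cr(["author", "Bill"], ["author", "Bill"]))                        # 1.0
    print(cr(["author", "Bill"], ["book", "author", "firstname", "Bill"]))   # 0.6
    print(cr(["author", "Bill"], ["title", "Bill"]))                         # 0.0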

CS3245 – Information Retrieval

SimNoMerge example

87

(Figure: a query context <c1, t> is matched against an inverted index whose dictionary holds the contexts <c1, t>, <c2, t>, <c3, t>, e.g. c1 = author/"Bill", c2 = title/"Bill", c3 = book/author/firstname/"Bill". Postings, with normalized weights: c1 -> <d1, 0.5>, <d4, 0.1>, <d9, 0.2>; c2 -> <d2, 0.25>, <d3, 0.1>, <d12, 0.9>; c3 -> <d3, 0.7>, <d6, 0.8>, <d9, 0.5>. This example is slightly different from the book.)

 CR(c1, c1) = 1.0
 CR(c1, c2) = 0.0
 CR(c1, c3) = 0.60

 It is OK to ignore the query vs. <c2, t> pair, since CR = 0.0.
 If w_q = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
 (each product is Context Resemblance × Query Term Weight × Document Term Weight).
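A sketch of SimNoMerge scoring over the toy index above; the CR function from the previous sketch is repeated here so the block runs on its own, and the posting weights are the (assumed) normalized weights from the figure.

    def matches(cq, cd):
        it = iter(cd)
        return all(node in it for node in cq)

    def cr(cq, cd):
        return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0

    def sim_no_merge(query_contexts, index):
        """score(d) = sum over query contexts c_q and index contexts c of
        CR(c_q, c) * (query term weight) * (document term weight)."""
        scores = {}
        for cq, w_q in query_contexts:
            for c, postings in index.items():
                r = cr(list(cq), list(c))
                if r == 0.0:
                    continue                 # OK to ignore non-matching contexts
                for d, w_d in postings.items():
                    scores[d] = scores.get(d, 0.0) + r * w_q * w_d
        return scores

    index = {
        ("author", "Bill"): {"d1": 0.5, "d4": 0.1, "d9": 0.2},
        ("title", "Bill"): {"d2": 0.25, "d3": 0.1, "d12": 0.9},
        ("book", "author", "firstname", "Bill"): {"d3": 0.7, "d6": 0.8, "d9": 0.5},
    }
    print(sim_no_merge([(("author", "Bill"), 1.0)], index))
    # d9 scores 1.0*1.0*0.2 + 0.6*1.0*0.5 = 0.5, as on the slide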

CS3245 – Information Retrieval

INEX relevance assessments
 The relevance-coverage combinations are quantized as follows (quantization table given on the slide).
 This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of Q(rel(c), cov(c)) over the components c in A.

88

CS3245 – Information Retrieval

Week 11: Probabilistic IR
 Chapter 11
  1. Probabilistic Approach to Retrieval; Basic Probability Theory
  2. Probability Ranking Principle
  3. Binary Independence Model, BestMatch25 (Okapi)
 Chapter 12
  1. Language Models for IR

Ch. 11-12

89

CS3245 – Information Retrieval

90

Binary Independence Model (BIM)
 Traditionally used with the PRP.
 Assumptions:
  "Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors, e.g. document d represented by vector x = (x_1, ..., x_M), where x_t = 1 if term t occurs in d and x_t = 0 otherwise. Different documents may have the same vector representation.
  "Independence": no association between terms (not true, but works in practice – a naïve assumption).

Sec. 11.3

90

CS3245 – Information Retrieval

So it boils down to computing RSV
 We can rank documents using the log odds ratios c_t for the terms in the query:

  c_t = log( p_t (1 - u_t) / ( u_t (1 - p_t) ) )

 The odds ratio is the ratio of two odds:
  (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and
  (ii) the odds of the term appearing if non-relevant, u_t / (1 - u_t).
 c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
 c_t functions as a term weight, so that RSV_d = Σ of c_t over the query terms present in d.
 Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec. 11.3.1

91
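A small sketch of these term weights using the usual +0.5 smoothed estimates of p_t and u_t (one common choice, not the only one); the counts below are illustrative.

    import math

    def c_t(s, S, df_t, N):
        """Log odds-ratio weight for one term:
        s = # relevant docs containing the term, S = # relevant docs,
        df_t = # docs containing the term,       N = # docs in the collection."""
        p_t = (s + 0.5) / (S + 1)               # P(term present | relevant)
        u_t = (df_t - s + 0.5) / (N - S + 1)    # P(term present | non-relevant)
        return math.log(p_t * (1 - u_t) / (u_t * (1 - p_t)))

    def rsv(doc_terms, query_terms, stats):
        """RSV_d = sum of c_t over query terms that occur in the document."""
        return sum(c_t(*stats[t]) for t in query_terms if t in doc_terms)

    stats = {"forest": (3, 5, 100, 10000), "fire": (4, 5, 300, 10000)}   # toy counts
    print(rsv({"forest", "fire", "smoke"}, ["forest", "fire"], stats))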

CS3245 – Information Retrieval

Okapi BM25: A Nonbinary Model
 If the query is long, we might also use similar weighting for query terms:
  tf_tq: term frequency in the query q
  k3: tuning parameter controlling term frequency scaling of the query
  No length normalization of queries (because retrieval is being done with respect to a single fixed query).
 The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, with b = 0.75.

Sec. 11.4.3

92
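For reference, a compact sketch of the document-side BM25 score, RSV_d = Σ_t idf_t · (k1 + 1)·tf_td / (k1·((1 - b) + b·L_d/L_ave) + tf_td); the long-query k3 factor is omitted, and the idf variant and toy numbers are assumptions.

    import math

    def bm25_score(query, doc_tf, doc_len, avg_len, df, N, k1=1.5, b=0.75):
        """doc_tf: term -> tf in this document; df: term -> document frequency;
        N: number of documents in the collection."""
        score = 0.0
        for t in query:
            if t not in doc_tf:
                continue
            idf = math.log(N / df[t])                       # a simple idf variant
            tf = doc_tf[t]
            norm = k1 * ((1 - b) + b * doc_len / avg_len)   # document length normalization
            score += idf * (k1 + 1) * tf / (norm + tf)
        return score

    print(bm25_score(["forest", "fire"], {"forest": 3, "fire": 1},
                     doc_len=120, avg_len=100,
                     df={"forest": 50, "fire": 80}, N=10000))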

CS3245 – Information Retrieval

An Appraisal of Probabilistic Models
 The difference between 'vector space' and 'probabilistic' IR is not that great: in either case, you build an information retrieval scheme in the exact same way.
 Difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec. 11.4.1

93

CS3245 – Information Retrieval

Using language models in IR
 Each document is treated as (the basis for) a language model.
 Given a query q, rank documents based on P(d|q); by Bayes' rule, P(d|q) = P(q|d)·P(d)/P(q):
  P(q) is the same for all documents, so ignore it.
  P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).
  P(q|d) is the probability of q given d.
 So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec. 12.2

CS3245 – Information Retrieval

Mixture model: Summary

 P(d|q) ∝ P(d) · Π over query terms t of ( λ·P(t|M_d) + (1 - λ)·P(t|M_c) )

 What we model: the user has a document in mind and generates the query from this document.
 The equation represents the probability that the document that the user had in mind was in fact this one.

95

Sec. 12.2.2
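A minimal sketch of scoring with this Jelinek-Mercer style mixture; λ and the toy counts are illustrative, and the example assumes every query term occurs somewhere in the collection (otherwise the log of zero would need handling).

    import math

    def mixture_score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
        """log P(q|d) with a mixture of the document model M_d and the collection model M_c."""
        logp = 0.0
        for t in query:
            p_doc = doc_tf.get(t, 0) / doc_len        # P(t | M_d), estimated from the document
            p_coll = coll_tf.get(t, 0) / coll_len     # P(t | M_c), estimated from the collection
            logp += math.log(lam * p_doc + (1 - lam) * p_coll)
        return logp

    doc_tf = {"frog": 3, "said": 1, "toad": 2}
    coll_tf = {"frog": 100, "said": 5000, "toad": 25, "likes": 300}
    print(mixture_score(["frog", "likes", "toad"], doc_tf, doc_len=6,
                        coll_tf=coll_tf, coll_len=100000))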

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 83: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 84: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 85: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 86: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

SimNoMerge example

87

query: ⟨c1, t⟩    e.g. author/"Bill"

Inverted index (dictionary → postings):

⟨c1, t⟩ → ⟨d1, 0.5⟩ ⟨d4, 0.1⟩ ⟨d9, 0.2⟩     e.g. author/"Bill"
⟨c2, t⟩ → ⟨d2, 0.25⟩ ⟨d3, 0.1⟩ ⟨d12, 0.9⟩    e.g. title/"Bill"
⟨c3, t⟩ → ⟨d3, 0.7⟩ ⟨d6, 0.8⟩ ⟨d9, 0.5⟩     e.g. book/author/firstname/"Bill"

This example is slightly different from the book. All weights have been normalized.

Context resemblance of the query context against each index context:
CR(c1, c1) = 1.0    (Query vs ⟨c1, t⟩)
CR(c1, c2) = 0.0    (OK to ignore Query vs ⟨c2, t⟩ since CR = 0.0)
CR(c1, c3) = 0.60   (Query vs ⟨c3, t⟩)

Each matching posting contributes (query term weight × context resemblance × document term weight):
if wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
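
To make the accumulation above concrete, here is a minimal Python sketch (not the textbook code; the dictionaries, values and function name are illustrative) that reproduces the sim(q, d9) = 0.5 computation:

from collections import defaultdict

# Context resemblance of the query context c1 against each index context
CR = {"c1": 1.0, "c2": 0.0, "c3": 0.60}

# Postings: context -> list of (docID, normalized document term weight)
postings = {
    "c1": [("d1", 0.5), ("d4", 0.1), ("d9", 0.2)],
    "c2": [("d2", 0.25), ("d3", 0.1), ("d12", 0.9)],
    "c3": [("d3", 0.7), ("d6", 0.8), ("d9", 0.5)],
}

def sim_no_merge(query_weight=1.0):
    """Accumulate query weight x context resemblance x document weight per document."""
    scores = defaultdict(float)
    for context, cr in CR.items():
        if cr == 0.0:              # contexts with no resemblance can be skipped
            continue
        for doc, w in postings[context]:
            scores[doc] += query_weight * cr * w
    return scores

print(sim_no_merge()["d9"])        # 1.0*1.0*0.2 + 0.6*1.0*0.5 = 0.5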

CS3245 ndash Information Retrieval

INEX relevance assessments: The relevance-coverage combinations are quantized by a quantization function Q.

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval; the quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as: #(relevant items retrieved) = Σ c∈A Q(rel(c), cov(c)).

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR

Chapter 11:
1. Probabilistic Approach to Retrieval / Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): traditionally used with the PRP.

Assumptions:
"Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g., document d is represented by the vector x = (x1, ..., xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing the RSV: we can rank documents using the log odds ratios c_t for the terms in the query.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 − u_t):

c_t = log [ p_t (1 − u_t) ] / [ u_t (1 − p_t) ]

c_t = 0 if a term has equal odds of appearing in relevant and nonrelevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d = Σ t∈q∩d c_t.

Operationally, we sum the c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
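
As a rough illustration (not the lecture's code), the accumulator pattern could look like the Python sketch below; bim_term_weight uses the standard no-relevance-information estimate (p_t ≈ 0.5, u_t ≈ df_t / N), and the postings and names are made up for the example:

import math
from collections import defaultdict

def bim_term_weight(df_t, N):
    """c_t with add-0.5 smoothing when no relevance information is available:
    with p_t ~ 0.5 and u_t ~ df_t / N, c_t reduces to an idf-like weight."""
    return math.log((N - df_t + 0.5) / (df_t + 0.5))

def rsv_scores(query_terms, postings, N):
    """postings: term -> list of docIDs; returns docID -> RSV_d = sum of c_t."""
    scores = defaultdict(float)
    for t in query_terms:
        docs = postings.get(t, [])
        if not docs:
            continue
        c_t = bim_term_weight(len(docs), N)
        for d in docs:
            scores[d] += c_t
    return scores

postings = {"rain": [1, 3], "umbrella": [3]}
print(rsv_scores(["rain", "umbrella"], postings, N=5))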

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model. If the query is long, we might also use similar weighting for query terms, multiplying in a query factor ((k3 + 1) tf_tq) / (k3 + tf_tq):

tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query

There is no length normalization of queries (because retrieval is being done with respect to a single, fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec 1143

92
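
A hedged Python sketch of the document-side BM25 score (the usual k1/b weighting; the term statistics below are invented for illustration):

import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N, k1=1.5, b=0.75):
    """doc_tf: term -> tf in this document; df: term -> document frequency."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        idf = math.log(N / df[t])                          # a simple idf variant
        length_norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
        score += idf * (k1 + 1) * tf / (length_norm + tf)
    return score

doc_tf = {"rain": 3, "umbrella": 1}
df = {"rain": 50, "umbrella": 10}
print(bm25_score(["rain", "umbrella"], doc_tf, doc_len=100,
                 avg_doc_len=120, df=df, N=1000))

For a long query, each term's contribution could additionally be multiplied by (k3 + 1) tf_tq / (k3 + tf_tq), as described on the slide.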

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. The difference for probabilistic IR is that, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q):

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4). P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model: the user has a document in mind and generates the query from this document. The equation

P(d|q) ∝ P(d) · Π t∈q [ λ P(t|M_d) + (1 − λ) P(t|M_c) ]

represents the probability that the document the user had in mind was in fact this one.

95

Sec 1222
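
A small Python sketch of query-likelihood scoring with this document/collection mixture (Jelinek-Mercer style smoothing); the counts and λ = 0.5 below are illustrative only:

import math

def mixture_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """Return log P(q|d) under the mixture model; lam weights the document LM."""
    log_prob = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len           # P(t | M_d), MLE
        p_coll = coll_tf.get(t, 0) / coll_len        # P(t | M_c), MLE
        p_mix = lam * p_doc + (1 - lam) * p_coll
        if p_mix == 0.0:                             # unseen everywhere: skip
            continue
        log_prob += math.log(p_mix)
    return log_prob

doc_tf = {"rain": 2, "umbrella": 1}
coll_tf = {"rain": 500, "umbrella": 80, "sun": 900}
print(mixture_score(["rain", "umbrella"], doc_tf, 10, coll_tf, 10000))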

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
Politeness: need to crawl only the allowed pages, and to space out the requests for a site over a longer period (hours, days).
Duplicates: need to integrate duplicate detection.
Scale: need to use a distributed approach.
Spam and spider traps: need to integrate spam detection.

Information Retrieval 97

Sec 201
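
One way to picture the politeness requirement is a per-host delay in the fetch loop; this is only a toy sketch, and fetch() stands in for a real HTTP client:

import time
from urllib.parse import urlparse

MIN_DELAY = 2.0                 # seconds between requests to the same host
last_hit = {}                   # host -> time of the last request

def polite_fetch(url, fetch):
    """Wait until the host's minimum delay has elapsed, then fetch the URL."""
    host = urlparse(url).netloc
    wait = MIN_DELAY - (time.time() - last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)        # back off so requests to this host are spaced out
    last_hit[host] = time.time()
    return fetch(url)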

CS3245 ndash Information Retrieval

Robots.txt: A protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202
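
Python's standard library can apply these rules directly; a minimal example (the URL below is just a placeholder):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                   # fetch and parse the robots.txt file

print(rp.can_fetch("searchengine", "http://www.example.com/temp/page.html"))
print(rp.can_fetch("*", "http://www.example.com/index.html"))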

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0)).
2. After one step, we're at xA.
3. After two steps, we're at xA^2; then xA^3, and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.
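
The power iteration described above fits in a few lines of Python; the 3×3 matrix A below is a made-up link/teleport matrix whose rows sum to 1:

A = [
    [0.10, 0.70, 0.20],
    [0.30, 0.10, 0.60],
    [0.45, 0.45, 0.10],
]

def pagerank(A, tol=1e-10, max_iter=1000):
    n = len(A)
    x = [1.0] + [0.0] * (n - 1)            # start with any distribution
    for _ in range(max_iter):
        # one step of the random walk: x_next = x A
        x_next = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(x, x_next)) < tol:   # looks stable
            return x_next
        x = x_next
    return x

print(pagerank(A))   # the steady-state vector a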

CS3245 ndash Information Retrieval

100

Pagerank summary: Pre-processing: given the graph of links, build the matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.

Query processing: retrieve pages meeting the query and rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102
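
For example, a tokenize-then-stem pipeline of the kind used in the course takes only a few lines with NLTK (assuming nltk and its 'punkt' tokenizer data are installed):

import nltk
from nltk.stem.porter import PorterStemmer

# nltk.download('punkt') may be needed once before word_tokenize can run
stemmer = PorterStemmer()
text = "Information retrieval systems index documents for searching."
tokens = nltk.word_tokenize(text.lower())
print([stemmer.stem(tok) for tok in tokens])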

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 88: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 89: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 90: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 91: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 92: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 93: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 – Information Retrieval

100

Pagerank summary
Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank aᵢ is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query, rank them by their pagerank. Order is query-independent.

Sec 21.2.2

CS3245 – Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 – Information Retrieval

Learning Objectives
You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use.
NLTK – a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 – Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR/VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 – Information Retrieval

104

Opportunities in IR

CS3245 – Information Retrieval

Thanks for joining us for the journey

105
