Source: zhaojin/cs3245_2019/w13.pdf

Introduction to Information Retrieval

CS3245 – Information Retrieval

Search Advertising, Duplicate Detection and Revision

31

CS3245 – Information Retrieval

Last Time
Chapter 20: Crawling
Chapter 21: Link Analysis

Ch 11-12

2

CS3245 – Information Retrieval

Today
Exam Matters
Revision
Where to go from here

3

Chapter 19: Search Advertising, Duplicate Detection

CS3245 – Information Retrieval

Web search overview

4

Ch 19

CS3245 – Information Retrieval

IR on the web vs. IR in general
On the web, search is not just a nice feature. Search is a key enabler of the web: content creation, financing, interest aggregation, etc. → look at search ads (Search Advertising)
The web is a chaotic and uncoordinated document collection → lots of duplicates – need to detect duplicates (Duplicate Detection)

5

Ch 19

CS3245 – Information Retrieval

Without search engines, the web wouldn't work
Without search, content is hard to find.
→ Without search, there is no incentive to create content. Why publish something if nobody will read it? Why publish something if I don't get ad revenue from it?
Somebody needs to pay for the web: servers, web infrastructure, content creation. A large part today is paid by search ads. Search pays for the web.

6

Ch 19

CS3245 – Information Retrieval

Interest aggregation
Unique feature of the web: a small number of geographically dispersed people with similar interests can find each other.
Elementary school kids with hemophilia
People interested in translating R5RS Scheme into relatively portable C (open source project)
Search engines are a key enabler for interest aggregation.

7

Ch 19

CS3245 – Information Retrieval

SEARCH ADVERTISING

8

Sec 19.3

CS3245 – Information Retrieval

1st Generation of Search Ads: Goto (1996)

9

Sec 19.3

CS3245 – Information Retrieval

1st Generation of search ads: Goto (1996)
Buddy Blake bid the maximum ($0.38) for this search. He paid $0.38 to Goto every time somebody clicked on the link.
Pages were simply ranked according to bid – revenue maximization for Goto.
No separation of ads/docs. Only one result list.
Upfront and honest: no relevance ranking, but Goto did not pretend there was any.

10

Sec 193

CS3245 – Information Retrieval

2nd generation of search ads: Google (2000)

11

SogoTrade appears in search results. SogoTrade appears in ads.
Do search engines rank advertisers higher than non-advertisers? All major search engines claim "no".

Sec 19.3

CS3245 – Information Retrieval

Do ads influence editorial content?
Similar problem at newspapers / TV channels: a newspaper is reluctant to publish harsh criticism of its major advertisers. The line often gets blurred at newspapers / on TV.
No known case of this happening with search engines yet.
Leads to the job of white and black hat search engine optimization (organic) and search engine marketing (paid).

12

Sec 193

CS3245 – Information Retrieval

How are ads ranked?
Advertisers bid for keywords – sale by auction.
Open system: anybody can participate and bid on keywords.
Advertisers are only charged when somebody clicks on their ad (i.e., CPC).
How does the auction determine an ad's rank and the price paid for the ad?
The basis is a second price auction, but with twists.
For the bottom line, this is perhaps the most important research area for search engines – computational advertising. Squeezing an additional fraction of a cent from each ad means billions of additional revenue for the search engine.

13

Sec 19.3

CS3245 – Information Retrieval

How are ads ranked?
First cut: according to bid price – à la Goto. Bad idea: open to abuse.
Example: query [husband cheating] → ad for divorce lawyer. We don't want to show nonrelevant ads.
Instead: rank based on bid price and relevance.
Key measure of ad relevance: click-through rate = CTR = clicks per impression.
Result: a nonrelevant ad will be ranked low, even if this decreases search engine revenue short-term.
Hope: overall acceptance of the system and overall revenue are maximized if users get useful information.

14

Sec 193

CS3245 – Information Retrieval

Google's second price auction
bid: maximum bid for a click by the advertiser.
CTR: click-through rate: when an ad is displayed, what percentage of the time do users click on it? CTR is a measure of relevance.
ad rank: bid × CTR. This trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is.
rank: rank in the auction.
paid: second price auction price paid by the advertiser.

15

Sec 19.3

CS3245 – Information Retrieval

Google's second price auction
Second price auction: the advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) – related to the Vickrey Auction.
price1 × CTR1 = bid2 × CTR2 (this will result in rank1 = rank2)
price1 = bid2 × CTR2 / CTR1

p1 = bid2 × CTR2 / CTR1 = 3.00 × 0.03 / 0.06 = 1.50
p2 = bid3 × CTR3 / CTR2 = 1.00 × 0.08 / 0.03 = 2.67
p3 = bid4 × CTR4 / CTR3 = 4.00 × 0.01 / 0.08 = 0.50
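A small Python sketch (illustrative only, not Google's actual mechanism) that reproduces these prices. The bid/CTR table itself did not survive extraction, so the values below are reconstructed to be consistent with the numbers above.

```python
# Hypothetical advertisers, with bids and CTRs chosen to reproduce the prices above.
advertisers = [
    ("A", 4.00, 0.010),
    ("B", 3.00, 0.030),
    ("C", 2.00, 0.060),
    ("D", 1.00, 0.080),
]

# Rank by ad rank = bid x CTR, highest first.
ranked = sorted(advertisers, key=lambda a: a[1] * a[2], reverse=True)

for pos, (name, bid, ctr) in enumerate(ranked):
    if pos + 1 < len(ranked):
        _, next_bid, next_ctr = ranked[pos + 1]
        # Second price: pay the minimum needed to keep this position (ignoring the +1 cent).
        price = next_bid * next_ctr / ctr
    else:
        price = 0.0  # lowest-ranked ad has nobody below it
    print(f"rank {pos + 1}: {name} bid=${bid:.2f} CTR={ctr:.2f} pays ${price:.2f}/click")
```

Running this prints ranks C, B, D, A with prices 1.50, 2.67, 0.50 and 0.00 per click, matching the worked example.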

16

Sec 193

CS3245 – Information Retrieval

Keywords with high bids
Top 10 Most Expensive Google Keywords: Insurance, Loans, Mortgage, Attorney, Credit, Lawyer, Donate, Degree, Hosting, Claim

17

Sec 19.3

http://www.wordstream.com/articles/most-expensive-keywords

CS3245 – Information Retrieval

Search ads: A win-win-win
The search engine company gets revenue every time somebody clicks on an ad.
The user only clicks on an ad if they are interested in the ad. Search engines punish misleading and nonrelevant ads. As a result, users are often satisfied with what they find after clicking on an ad.
The advertiser finds new customers in a cost-effective way.

18

Sec 193

CS3245 – Information Retrieval

Not a win-win-win: Keyword arbitrage
Buy a keyword on Google, then redirect traffic to a third party that is paying much more than you are paying Google. E.g., redirect to a page full of ads.
This rarely makes sense for the user.
(Ad) spammers keep inventing new tricks; the search engines need time to catch up with them: Adversarial Information Retrieval.

19

Sec 19.3

CS3245 – Information Retrieval

Not a win-win-win: Violation of trademarks
Example: geico. During part of 2005, the search term "geico" on Google was bought by competitors. Geico lost this case in the United States. Louis Vuitton lost a similar case in Europe.
See http://support.google.com/adwordspolicy/answer/6118?rd=1: It's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site.

20

Sec 193

CS3245 – Information Retrieval

DUPLICATE DETECTION

21

Sec 19.6

CS3245 – Information Retrieval

Duplicate detection
The web is full of duplicated content, more so than many other collections.
Exact duplicates: easy to detect; use a hash/fingerprint (e.g., MD5).
Near-duplicates: more common on the web, difficult to eliminate.
For the user, it's annoying to get a search result with near-identical documents.
Marginal relevance is zero: even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate.

22

Sec 196

CS3245 – Information Retrieval

Near-duplicates: Example

23

Sec 19.6

CS3245 – Information Retrieval

Detecting near-duplicates
Compute similarity with an edit-distance measure.
We want "syntactic" (as opposed to semantic) similarity. True semantic similarity (similarity in content) is too difficult to compute.
We do not consider documents near-duplicates if they have the same content but express it with different words.
Use a similarity threshold θ to make the call "is / isn't a near-duplicate".
E.g., two documents are near-duplicates if similarity > θ = 80%.

24

Sec 196

CS3245 – Information Retrieval

Recall: Jaccard coefficient
A commonly used measure of the overlap of two sets. Let A and B be two sets. Jaccard coefficient:
JACCARD(A, B) = |A ∩ B| / |A ∪ B|
JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅.
A and B don't have to be the same size.
Always assigns a number between 0 and 1.

25

Sec 19.6

CS3245 – Information Retrieval

Jaccard coefficient: Example
Three documents:
d1: "Jack London traveled to Oakland"
d2: "Jack London traveled to the city of Oakland"
d3: "Jack traveled from Oakland to London"
Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?
J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0.
Note: very sensitive to dissimilarity.
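A short Python sketch (not from the slides) that builds the bigram shingle sets and verifies both coefficients:

```python
def shingles(text, n=2):
    """Return the set of word n-grams (shingles) of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    return len(a & b) / len(a | b)

d1 = "Jack London traveled to Oakland"
d2 = "Jack London traveled to the city of Oakland"
d3 = "Jack traveled from Oakland to London"

s1, s2, s3 = shingles(d1), shingles(d2), shingles(d3)
print(jaccard(s1, s2))  # 3/8 = 0.375
print(jaccard(s1, s3))  # 0.0
```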

26

Sec 196

CS3245 – Information Retrieval

A document as a set of shingles
A shingle is simply a word n-gram. Shingles are used as features to measure the syntactic similarity of documents.
For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose }.
We define the similarity of two documents as the Jaccard coefficient of their shingle sets.

27

Sec 19.6

CS3245 – Information Retrieval

Fingerprinting
We can map shingles to a large integer space [1..2^m] (e.g., m = 64) by fingerprinting.
We use s_k to refer to the shingle's fingerprint in 1..2^m.
This doesn't directly help us – we are just converting strings to large integers.
But it will help us compute an approximation to the actual Jaccard coefficient quickly.

28

Sec 196

CS3245 – Information Retrieval

Documents as sketches
The number of shingles per document is large; it is difficult to compare them exhaustively.
To make it fast, we use a sketch: a subset of the shingles of a document.
The size of a sketch is, say, n = 200, and it is defined by a set of permutations π1 … π200. Each πi is a random permutation on 1..2^m.
The sketch of d is defined as:
< min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) >
(a vector of 200 numbers)

29

Sec 19.6

CS3245 – Information Retrieval

30

Deriving a sketch element: a permutation of the original hashes
[Figure: the hashed shingles s_k of document 1 and document 2, before and after applying a permutation π.]
We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?" In this case, permutation π says d1 ≈ d2.

Sec 19.6

30

CS3245 – Information Retrieval

Proof that J(S(d_i), S(d_j)) ≅ P(x_i = x_j)
Let C00 = # of rows in A where both entries are 0; define C11, C10, and C01 likewise.
J(S_i, S_j) is then equivalent to C11 / (C10 + C01 + C11).
P(x_i = x_j) is then equivalent to C11 / (C10 + C01 + C11).

31

[Figure: a 0/1 matrix A with columns S_i and S_j, plus example random permutations π(1), π(2), π(3), …, π(n=200) of its rows; in this example J(S_i, S_j) = 2/5, and each permutation yields either x_i ≠ x_j or x_i = x_j.]

We view a matrix A with 1 column per set of hashes. Element A_ij = 1 if element i in set S_j is present.
Permutation π(n) is a random reordering of the rows in A.
x_k is the first non-zero entry in π(d_k), i.e., the first shingle present in document k.

Sec 19.6

CS3245 – Information Retrieval

Estimating Jaccard
Thus, the proportion of successful permutations is the Jaccard coefficient.
Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s).
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial.
Estimator of the probability of success: the proportion of successes in n Bernoulli trials (n = 200).
Our sketch is based on a random selection of permutations.
Thus, to compute Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200.
k/n = k/200 estimates J(d1, d2).

32
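A minimal MinHash sketch in Python, with simple random hash functions standing in for the permutations π_i (an illustration of the estimator, not production code; the shingle sets are the bigrams of d1 and d2 from the earlier Jaccard example):

```python
import random

def minhash_sketch(shingle_set, n_hashes=200, m=2**31 - 1, master_seed=42):
    """One sketch entry per 'permutation': the minimum hashed shingle value."""
    rnd = random.Random(master_seed)  # same seed -> same hash functions for every document
    coeffs = [(rnd.randrange(1, m), rnd.randrange(0, m)) for _ in range(n_hashes)]
    return [min((a * hash(s) + b) % m for s in shingle_set) for a, b in coeffs]

def estimate_jaccard(sk1, sk2):
    """Proportion of 'successful' permutations, i.e. positions where the minima agree."""
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

s1 = {("jack", "london"), ("london", "traveled"), ("traveled", "to"), ("to", "oakland")}
s2 = {("jack", "london"), ("london", "traveled"), ("traveled", "to"), ("to", "the"),
      ("the", "city"), ("city", "of"), ("of", "oakland")}

sk1, sk2 = minhash_sketch(s1), minhash_sketch(s2)
print(estimate_jaccard(sk1, sk2))   # close to the true J(d1, d2) = 0.375
```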

Sec 196

CS3245 – Information Retrieval

Shingling: Summary
Input: N documents.
Choose the n-gram size for shingling, e.g., n = 5.
Pick 200 random permutations, represented as hash functions.
Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document.
Compute pairwise similarities.
Take the transitive closure of documents with similarity > θ.
Index only one document from each equivalence class.

33

Sec 196

CS3245 – Information Retrieval

Summary
Search Advertising: a type of crowdsourcing. Ask advertisers how much they want to spend; ask searchers how relevant an ad is. Auction mechanics.
Duplicate Detection: represent documents as shingles. Calculate an approximation to the Jaccard coefficient by using random trials.

34

CS3245 – Information Retrieval

EXAM MATTERS

35

CS3245 – Information Retrieval

36

Subscribe via the NUS EduRec System by 20 May 2019.
Get Exam Results earlier via SMS on 3 Jun 2019.
For more information:
https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html

CS3245 – Information Retrieval

Exam Format

37

Date: 27 Apr (1pm)
Venue: SR3 (COM1-02-12)
Open book

CS3245 – Information Retrieval

Exam Topics
Anything covered in lecture (through slides) and the corresponding sections in the textbook.
For sample questions, refer to tutorials, the midterm, and past year exam papers.
If in doubt, ask on the forum.

38

CS3245 – Information Retrieval

Exam Topic Distribution
Focus on topics covered from Weeks 5 to 12.
Q1 is true/false questions on various topics (20 marks).
Q2-Q9 are calculation / essay questions, topic specific (8~12 marks each). Don't forget your calculator.

39

CS3245 – Information Retrieval

40

REVISION

CS3245 – Information Retrieval

Week 1: Ngram Models
A language model is created based on a collection of text and used to assign a probability to a sequence of words.
Unigram LM: bag of words.
Ngram LM: use n-1 tokens as the context to predict the nth token. Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can be used to do information retrieval, as introduced in Week 11.

41

CS3245 – Information Retrieval

Unigram LM with Add-1 smoothing
Not used in practice, but the most basic to understand.
Idea: add 1 count to all entries in the LM, including those that are not seen.
E.g., assuming |V| = 11 (the total # of word types in both lines):

Q2 (By Probability): "I don't want"
P(Aerosmith): 0.11 × 0.11 × 0.11 ≈ 1.3E-3
P(LadyGaga): 0.15 × 0.05 × 0.15 ≈ 1.1E-3
Winner: Aerosmith

42

Aerosmith counts: I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1; Total Count: 18
LadyGaga counts: I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2; Total Count: 20
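A small Python sketch of add-1 scoring over these counts (the function and variable names are illustrative, not from the slides):

```python
aerosmith = {"I": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
             "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}   # total 18
ladygaga  = {"I": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
             "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}   # total 20

V = len(set(aerosmith) | set(ladygaga))   # 11 word types across both lines

def add_one_prob(query, counts):
    """P(query) under a unigram LM with add-1 smoothing."""
    total = sum(counts.values())
    p = 1.0
    for w in query.split():
        p *= (counts.get(w, 0) + 1) / (total + V)
    return p

q = "I don't want"
print(add_one_prob(q, aerosmith), add_one_prob(q, ladygaga))  # Aerosmith scores higher
```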

CS3245 – Information Retrieval

Week 2: Basic (Boolean) IR
Basic inverted indexes: in-memory dictionary and on-disk postings.
Key characteristic: sorted order for postings.
Boolean query processing: intersection by linear-time "merging".
Simple optimizations by expected size.

43

Ch 1

CS3245 – Information Retrieval

Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Doc frequency information is also stored. Why frequency?

Sec 1.2

44

CS3245 – Information Retrieval

The merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

Brutus: 2 4 8 16 32 64 128
Caesar: 1 2 3 5 8 13 21 34
Brutus AND Caesar → 2 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.
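A sketch of the linear-time intersection ("merge") of two sorted postings lists, reproducing the Brutus AND Caesar example:

```python
def intersect(p1, p2):
    """Intersect two docID lists sorted in increasing order, in O(x+y) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```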

Sec 13

45

CS3245 – Information Retrieval

Week 3: Postings lists and Choosing terms
Faster merging of postings lists: skip pointers.
Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.

46

Ch 2

CS3245 – Information Retrieval

Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results.
How to do it? And where do we place skip pointers?

47

Postings: 2 4 8 41 48 64 128 (with skip pointers to 41 and 128)
Postings: 1 2 3 8 11 17 21 31 (with skip pointers to 11 and 31)
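A sketch of intersection with skip pointers. The representation of skips as a position-to-position dict, and their placement, are assumptions made here for illustration:

```python
def intersect_with_skips(p1, p2, skips1, skips2):
    """Intersect sorted docID lists; skipsX maps a position to its skip-target position."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            # Follow skips on p1 only while the skip target does not overshoot p2[j].
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                while i in skips1 and p1[skips1[i]] <= p2[j]:
                    i = skips1[i]
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                while j in skips2 and p2[skips2[j]] <= p1[i]:
                    j = skips2[j]
            else:
                j += 1
    return answer

# Hypothetical skip placement on the postings from the figure above.
p1 = [2, 4, 8, 41, 48, 64, 128]
p2 = [1, 2, 3, 8, 11, 17, 21, 31]
skips1 = {0: 3, 3: 6}   # 2 -> 41, 41 -> 128
skips2 = {0: 4, 4: 7}   # 1 -> 11, 11 -> 31
print(intersect_with_skips(p1, p2, skips1, skips2))  # [2, 8]
```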

Sec 23

CS3245 – Information Retrieval

Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words.
For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term.
Two-word phrase query processing is now immediate.

48

Sec 2.4.1

CS3245 – Information Retrieval

Positional index example
For phrase queries, we use a merge algorithm recursively at the document level.
Now we need to deal with more than just equality.

49

<be: 993427; 1: <7, 18, 33, 72, 86, 231>; 2: <3, 149>; 4: <17, 191, 291, 430, 434>; 5: <363, 367>; …>

Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 2.4.2

CS3245 – Information Retrieval

Choosing terms
Definitely language specific.
In English we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.

50

CS3245 – Information Retrieval

Week 4: The dictionary and tolerant retrieval
Data structures for the dictionary: hash tables, trees.
Learning to be tolerant:
1. Wildcards: general trees, Permuterm, Ngrams redux
2. Spelling Correction: edit distance, Ngrams re-redux
3. Phonetic: Soundex

51

CS3245 – Information Retrieval

Hash Table
Each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree: O(1).
Cons (not very tolerant):
No easy way to find minor variants: judgment / judgement.
No prefix search.
If the vocabulary keeps growing, we need to occasionally do the expensive operation of rehashing everything.

52

Sec 3.1

CS3245 – Information Retrieval

Wildcard queries
How can we handle *'s in the middle of a query term? E.g., co*tion.
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wildcard queries so that the * always occurs at the end.
This gives rise to the Permuterm index.
Alternative: K-gram index with postfiltering.

53

Sec 3.2

CS3245 – Information Retrieval

Spelling correction
Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q.
How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.

54
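A standard dynamic-programming sketch of Levenshtein edit distance (illustrative, unit costs):

```python
def edit_distance(s, t):
    """Levenshtein distance with insert/delete/replace each costing 1."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(edit_distance("judgment", "judgement"))  # 1
```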

Sec 332

CS3245 – Information Retrieval

Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing. Merge sort is effective for disk-based sorting (avoid seeks).
Single-Pass In-Memory Indexing: no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur).
Distributed indexing using MapReduce.
Dynamic indexing: multiple indices, logarithmic merge.

55

CS3245 – Information Retrieval

BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes. These are generated as we parse docs.
Must now sort 100M 8-byte records by termID.
Define a Block as ~10M such records: we can easily fit a couple into memory, and will have 10 such blocks for our collection.
Basic idea of the algorithm: accumulate postings for each block, sort, and write to disk; then merge the blocks into one long sorted order.

Sec 4.2

56

CS3245 – Information Retrieval

SPIMI: Single-pass in-memory indexing
Key idea 1: Generate separate dictionaries for each block – no need to maintain a term-termID mapping across blocks.
Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block.
These separate indices can then be merged into one big index.

Sec 4.3

57

CS3245 – Information Retrieval

Distributed Indexing: MapReduce Data flow

58

[Figure: MapReduce data flow. A Master assigns splits of the input to Parsers (map phase), which write segment files partitioned into term ranges (a-b, c-d, …, y-z); Inverters (reduce phase) each collect one term range from all segment files and write the final postings.]

CS3245 – Information Retrieval

Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both and merge the results.
Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs output on a search result by this invalidation bit-vector.
Periodically re-index into one main index. Assuming T = total # of postings and n = size of the auxiliary index, we touch each posting up to floor(T/n) times.
Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 4.5

59

CS3245 – Information Retrieval

Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
Either write merge Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2.
… and so on (loop for log levels).

Sec 4.5

60

CS3245 – Information Retrieval

Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws.
Compression to make the index smaller and faster:
Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding.
Postings compression: gap encoding.

61

CS3245 – Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k = 4.
Need to store term lengths (1 extra byte).

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

Sec 5.2

CS3245 – Information Retrieval

Front coding
Sorted words commonly have a long common prefix – store differences only. Used for the last k-1 terms in a block of k.
8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion
(encodes "automat"; the digits give the extra length beyond "automat")
Begins to resemble general string compression.

Information Retrieval 63

Sec 5.2

CS3245 – Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID:
computer: 33, 47, 154, 159, 202, …
Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …
Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec 5.3

CS3245 – Information Retrieval

Variable Byte Encoding: Example
docIDs: 824, 829, 215406
gaps: 5, 214577
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings are stored as the byte concatenation 00000110 10111000 10000101 00001101 00001100 10110001.
Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.

Sec 5.3

512 + 256 + 32 + 16 + 8 = 824
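A small Python sketch of VB encoding that reproduces the byte stream above (the first "gap" is the first docID, 824, itself):

```python
def vb_encode_number(n):
    """VB-encode one gap: 7 payload bits per byte; the high bit marks the final byte."""
    bytes_out = []
    while True:
        bytes_out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_out[-1] += 128  # set the termination bit on the last byte
    return bytes_out

def vb_encode(gaps):
    return [b for g in gaps for b in vb_encode_number(g)]

gaps = [824, 5, 214577]
print([format(b, "08b") for b in vb_encode(gaps)])
# ['00000110', '10111000', '10000101', '00001101', '00001100', '10110001']
```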

CS3245 – Information Retrieval

Week 7: Vector space ranking
1. Represent the query as a weighted tf-idf vector.
2. Represent each document as a weighted tf-idf vector.
3. Compute the cosine similarity score for the query vector and each document vector.
4. Rank documents with respect to the query by score.
5. Return the top K (e.g., K = 10) to the user.

66

CS3245 – Information Retrieval

tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight.
Best known weighting scheme in IR. (Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.)
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.

67

w_(t,d) = (1 + log10 tf_(t,d)) × log10(N / df_t)

Sec 6.2.2

CS3245 – Information Retrieval

Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents".
Key idea 2: Rank documents according to their proximity to the query in this space.
proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity
Motivation: we want to rank more relevant documents higher than less relevant documents.

68

Sec 6.3

CS3245 – Information Retrieval

Cosine for length-normalized vectors
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
cos(q, d) = q · d = Σ_i q_i d_i, for length-normalized q and d.

Information Retrieval 69
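A toy Python sketch of tf-idf weighting with cosine scoring on length-normalized sparse vectors. The collection statistics below are made up for illustration; this applies the same tf-idf weighting on both query and document sides rather than any particular SMART variant:

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, N):
    """w = (1 + log10 tf) * log10(N/df), then length-normalize."""
    tf = Counter(tokens)
    w = {t: (1 + math.log10(c)) * math.log10(N / df[t]) for t, c in tf.items() if df.get(t)}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

def cosine(q, d):
    """Dot product of two length-normalized sparse vectors."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

N = 1000                                        # hypothetical collection size
df = {"best": 50, "car": 100, "insurance": 20}  # hypothetical document frequencies
query = tfidf_vector(["best", "car", "insurance"], df, N)
doc = tfidf_vector(["car", "insurance", "car", "insurance", "best", "car"], df, N)
print(round(cosine(query, doc), 3))
```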

CS3245 – Information Retrieval

Information Retrieval 70

Sec 6.4

tf-idf weighting has many variants

CS3245 – Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute:
Approximating the actual correct results.
Skipping unnecessary documents.
In actual data: dealing with zones and fields, and query term proximity.
Resources for today: IIR 7, 6.1

71

CS3245 – Information Retrieval

Recap: Computing cosine scores

72

Sec 6.3.3

CS3245 – Information Retrieval

Generic approach
Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but it has many docs from among the top K. Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 7.1.1

CS3245 – Information Retrieval

Heuristics
Index elimination: high-idf query terms only; docs containing many query terms.
Champion lists: generalized to high-low lists and tiered indexes.
Impact-ordered postings: early termination; idf-ordered terms.
Cluster pruning.

74

Sec 7.1.1

CS3245 – Information Retrieval

Net score
Consider a simple total score combining cosine relevance and authority:
net-score(q, d) = g(d) + cosine(q, d)
Can use some other linear combination than an equal weighting; indeed, any function of the two "signals" of user happiness.
Now we seek the top K docs by net score.

75

Sec 7.1.4

CS3245 – Information Retrieval

Parametric Indices
Fields:
Year = 1601 is an example of a field.
Field or parametric index: postings for each field value. Sometimes build range trees (B-trees), e.g., for dates.
A field query is typically treated as a conjunction (the doc must be authored by shakespeare).
Zones:
A zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, …
Build inverted indexes on zones as well to permit querying.

76

Sec 6.1

CS3245 – Information Retrieval

Week 9: IR Evaluation
How do we know if our results are any good?
Benchmarks; precision and recall; composite measures; test collections.
A/B Testing.

77

Ch 8

CS3245 – Information Retrieval

A precision-recall curve
[Figure: a precision-recall curve, with precision on the y-axis and recall on the x-axis, both from 0.0 to 1.0.]

78

Sec 8.4

CS3245 – Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 8.5

Kappa measure: an agreement measure among judges, designed for categorical judgments; it corrects for chance agreement.
Kappa (K) = [P(A) - P(E)] / [1 - P(E)]
P(A): proportion of the time the judges agree.
P(E): what agreement would be by chance.
Gives 0 for chance agreement, 1 for total agreement.
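A small Python sketch of the kappa computation for binary relevance judgments, using pooled marginals for P(E) as in IIR 8.5 (the judgment lists are made up):

```python
def kappa(judge1, judge2):
    """Kappa agreement of two equal-length lists of 0/1 relevance judgments."""
    assert len(judge1) == len(judge2)
    n = len(judge1)
    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n
    # Chance agreement P(E) from the pooled marginals of both judges.
    p_rel = (sum(judge1) + sum(judge2)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

j1 = [1, 1, 0, 1, 0, 0, 1, 1]
j2 = [1, 0, 0, 1, 0, 1, 1, 1]
print(round(kappa(j1, j2), 3))   # about 0.467
```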

CS3245 – Information Retrieval

A/B testing
Purpose: test a single innovation.
Prerequisite: you have a large system with many users.
Have most users use the old system; divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result.
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.

80

Sec 8.6.3

CS3245 – Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR
Chapter 10.3: XML IR
Basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR.
Chapter 9: Query Refinement
1. Relevance Feedback (document level): Explicit RF – Rocchio (1971); when does it work?; variants: implicit and blind.
2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation.

81

CS3245 – Information Retrieval

Sec 9.2.2

82

Rocchio (1971)
In practice:
Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. (Different from Cr and Cnr, as we only get judgments from a few documents.)
α, β, γ = weights (hand-chosen or set empirically).
Popularized in the SMART system (Salton).
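The Rocchio formula itself did not survive extraction; the standard form it refers to (as popularized in SMART) is:

```latex
\vec{q}_m = \alpha\,\vec{q}_0
          + \beta\,\frac{1}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j
          - \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j
```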

CS3245 – Information Retrieval

Query Expansion
In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents.
In query expansion, users give additional input (good / bad search term) on words or phrases.

Sec 9.2.2

83

CS3245 – Information Retrieval

Co-occurrence Thesaurus
The simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix.
w_ij = (normalized) weight for (t_i, d_j).
For each t_i, pick the terms with high values in C.
(In NLTK. Did you forget?)

Sec 9.2.3

84

CS3245 – Information Retrieval

Main idea: lexicalized subtrees
Aim: have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.

85

[Figure: an XML document Book with Title "Microsoft" and Author "Bill Gates", decomposed into its lexicalized subtrees, e.g. Author/"Bill Gates", Title/"Microsoft", Author/"Bill", Author/"Gates", Book/Title/"Microsoft" ("with words").]

CS3245 – Information Retrieval

Context resemblance
A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:
|c_q| and |c_d| are the number of nodes in the query path and the document path, respectively.
c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes.
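The CR function itself did not survive extraction; the standard definition (IIR Ch. 10), consistent with the worked example on the next slide, is:

```latex
\mathrm{CR}(c_q, c_d) =
\begin{cases}
\frac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\
0 & \text{otherwise}
\end{cases}
```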

86

CS3245 – Information Retrieval

SimNoMerge example

87

Query: <c1, t>, e.g., author/"Bill"

Inverted index (dictionary and postings):
<c1, t> (e.g., author/"Bill"): <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> (e.g., title/"Bill"): <d2, 0.25> <d3, 0.1> <d12, 0.9>
<c3, t> (e.g., book/author/firstname/"Bill"): <d3, 0.7> <d6, 0.8> <d9, 0.5>

Context resemblance: CR(c1, c1) = 1.0; CR(c1, c2) = 0.0; CR(c1, c3) = 0.60.
OK to ignore Query vs. <c2, t> since CR = 0.0; remaining: Query vs. <c1, t> and Query vs. <c3, t>.
If w_q = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each term is Context Resemblance × Query Term Weight × Document Term Weight; all weights have been normalized).
This example is slightly different from the book.

CS3245 – Information Retrieval

INEX relevance assessments
The relevance-coverage combinations are quantized by a quantization function Q.
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval; Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of the quantized grades of the components in A.

88

CS3245 – Information Retrieval

Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval; Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model; BestMatch25 (Okapi)
Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 – Information Retrieval

90

Binary Independence Model (BIM)
Traditionally used with the PRP.
Assumptions:
"Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g., document d is represented by the vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but it works in practice – a naïve assumption).

Sec 11.3

90

CS3245 – Information Retrieval

Deriving a Ranking Function for Query Terms
So it boils down to computing the RSV. We can rank documents using the log odds ratios c_t for the terms in the query.
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 - u_t):
c_t = log [ p_t (1 - u_t) / ( u_t (1 - p_t) ) ]
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so operationally we sum c_t quantities in accumulators for query terms appearing in documents, just as in the VSM.

91

Sec 11.3.1

CS3245 – Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:
tf_tq: term frequency in the query q.
k3: tuning parameter controlling term frequency scaling of the query.
No length normalization of queries (because retrieval is being done with respect to a single fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, with b = 0.75.
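A Python sketch of the document-side BM25 score with the usual k1 and b parameters (notation follows IIR 11.4.3; the numbers in the example call are made up):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N, k1=1.2, b=0.75):
    """Sum over query terms of idf x saturated, length-normalized tf."""
    score = 0.0
    for t in query_terms:
        if t not in doc_tf or t not in df:
            continue
        idf = math.log10(N / df[t])
        tf = doc_tf[t]
        norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
        score += idf * (k1 + 1) * tf / (norm + tf)
    return score

print(bm25_score(["car", "insurance"],
                 doc_tf={"car": 3, "insurance": 1}, doc_len=120,
                 avg_doc_len=100, df={"car": 100, "insurance": 20}, N=1000))
```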

Sec 1143

92

CS3245 – Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case you build an information retrieval scheme in the exact same way.
The difference for probabilistic IR is that, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 11.4.1

93

CS3245 – Information Retrieval

Using language models in IR
Each document is treated as (the basis for) a language model.
Given a query q, rank documents based on P(d|q) = P(q|d) P(d) / P(q):
P(q) is the same for all documents, so we can ignore it.
P(d) is the prior – often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g., those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 12.2

CS3245 – Information Retrieval

Mixture model: Summary
What we model: the user has a document in mind and generates the query from this document.
The equation represents the probability that the document that the user had in mind was in fact this one.
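The equation did not survive extraction; the standard linear interpolation of the document model M_d with the collection model M_c (λ is the mixing weight), as in IIR 12.2.2, is:

```latex
P(q \mid d) = \prod_{t \in q} \bigl(\lambda\,P(t \mid M_d) + (1-\lambda)\,P(t \mid M_c)\bigr)
```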

95

Sec 1222

CS3245 – Information Retrieval

Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 – Information Retrieval

Crawling
Politeness: need to crawl only the allowed pages, and space out the requests to a site over a longer period (hours, days).
Duplicates: need to integrate duplicate detection.
Scale: need to use a distributed approach.
Spam and spider traps: need to integrate spam detection.

Information Retrieval 97

Sec 20.1

CS3245 – Information Retrieval

Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994.
Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 20.2

CS3245 – Information Retrieval

99

Pagerank algorithm

Sec 21.2.2

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0)).
2. After one step, we're at xA.
3. After two steps, we're at xA^2, then xA^3, and so on.
4. "Eventually" means: for large k, xA^k = a.
Algorithm: multiply x by increasing powers of A until the product looks stable.
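A small power-iteration sketch in Python (using numpy; the 3-page transition matrix is hypothetical, with teleporting already folded in so that every row sums to 1):

```python
import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    """Power iteration: repeatedly multiply a distribution x by the transition matrix A."""
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                       # start with any distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A               # one step of the random walk
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

A = np.array([[0.10, 0.60, 0.30],
              [0.40, 0.10, 0.50],
              [0.45, 0.45, 0.10]])
print(pagerank(A))   # the steady-state vector a; its entries sum to 1
```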

CS3245 – Information Retrieval

100

Pagerank summary
Preprocessing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query; rank them by their pagerank. Order is query-independent.

Sec 21.2.2

CS3245 – Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 – Information Retrieval

Learning Objectives
You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use.
NLTK – a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 – Information Retrieval

103

Photo credits: http://www.flickr.com/photos/aperezdc

[Word cloud of related areas:] Adversarial IR, Boolean IR / VSM, Prob IR, Computational Advertising, Recommendation Systems, Digital Libraries, Natural Language Processing, Question Answering, IR Theory, Distributed IR, Geographic IR, Social Network IR, Query Analysis

CS3245 – Information Retrieval

104

Opportunities in IR

CS3245 – Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCH ADVERTISING
  • 1st Generation of Search Ads: Goto (1996)
  • 1st Generation of search ads: Goto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATE DETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index Compression: Dictionary-as-a-String and Blocking
  • Front coding
  • Postings Compression: Postings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • A/B testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robots.txt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 2: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Last TimeChapter 20 Crawling

Chapter 21 Link Analysis

Ch 11-12

2

CS3245 ndash Information Retrieval

Today Exam Matters

Revision

Where to go from here

3

Chapter 19 Search Advertising Duplicate Detection

CS3245 ndash Information Retrieval

Web search overview

4

Ch 19

CS3245 ndash Information Retrieval

IR on the web vs IR in general On the web search is not just a nice feature Search is a key enabler of the web content creation

financing interest aggregation etc rarr look at search ads (Search Advertising)

The web is a chaotic and uncoordinated document collection rarr lots of duplicates ndash need to detect duplicates (Duplicate Detection)

5

Ch 19

CS3245 ndash Information Retrieval

Without search engines the web wouldnrsquot work Without search content is hard to find rarr Without search there is no incentive to create

content Why publish something if nobody will read it Why publish something if I donrsquot get ad revenue from

it Somebody needs to pay for the web Servers web infrastructure content creation A large part today is paid by search ads Search pays

for the web6

Ch 19

CS3245 ndash Information Retrieval

Interest aggregation Unique feature of the web A small number of

geographically dispersed people with similar interests can find each other Elementary school kids with hemophilia People interested in translating R5R5 Scheme into

relatively portable C (open source project) Search engines are a key enabler for interest

aggregation

7

Ch 19

CS3245 ndash Information Retrieval

SEARCHADVERTISING

8

Sec 193

CS3245 ndash Information Retrieval

1st Generation of Search AdsGoto (1996)

9

Sec 193

CS3245 ndash Information Retrieval

1st Generation of search adsGoto (1996)

Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the

link Pages were simply ranked according to bid ndash revenue

maximization for Goto No separation of adsdocs Only one result list Upfront and honest No relevance ranking

but Goto did not pretend there was any10

Sec 193

CS3245 ndash Information Retrieval

2nd generation of search ads Google (2000)

11

SogoTrade appearsin searchresults

SogoTrade appearsin ads

Do search enginesrank advertisershigher thannon-advertisers

All major searchengines claim ldquonordquo

Sec 193

CS3245 ndash Information Retrieval

Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major

advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet

Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)

12

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your

ad (ie CPC)

How does the auction determine an adrsquos rank and the price paidfor the ad

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term? co*tion

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive

The solution: transform wild-card queries so that the * always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32
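A tiny sketch of how permuterm rotations could be generated (illustrative; the function name is mine):

def permuterm_rotations(term):
    """All rotations of term + '$'; each rotation is indexed and points back to term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']

A query like co*tion is rotated so the * ends up at the end (tion$co*) and is then answered by a prefix search over the indexed rotations.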

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:

1. Edit distance (Levenshtein distance)  2. Weighted edit distance  3. ngram overlap

Context-sensitive spelling correction: Try all possible resulting phrases with one word

"corrected" at a time and suggest the one with the most hits
54

Sec 332
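A standard dynamic-programming sketch of Levenshtein edit distance (unit costs; a weighted version would replace the 1s with per-operation costs):

def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

print(edit_distance("judgment", "judgement"))   # 1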

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing: No global dictionary - Generate separate dictionary for each block; Don't sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2: Don't sort. Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

[Figure: MapReduce data flow. A master assigns input splits to parsers (map phase); each parser writes segment files partitioned by term range (a-b, c-d, …, y-z); each inverter (reduce phase) collects one term-range partition and writes its postings.]

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain "big" main index; New docs go into "small" (in-memory) auxiliary index; Search across both, merge results. Deletions: Invalidation bit-vector for deleted docs; Filter docs output on a search result by this invalidation

bit-vector. Periodically re-index into one main index. Assuming T = total # of postings and n = size of the auxiliary

index, we touch each posting up to floor(T/n) times. Alternatively, we can apply Logarithmic merge, in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6 Index Compression: Collection and vocabulary statistics: Heaps'

and Zipf's laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking. Store pointers to every kth term string. Example below: k=4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers

Lose 4 bytes on term lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix;

store differences only. Used for the last k-1 terms in a block of k:
8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a 1◊e 2◊ic 3◊ion

Encodes "automat"; extra length beyond "automat"

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry. We store the list of docs containing a term in

increasing order of docID: computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:  824                 829        215406
gaps:                        5          214577
VB code: 00000110 10111000   10000101   00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
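A small Python sketch of VB encoding over gaps; it reproduces the byte string above (function names are mine):

def vb_encode_number(n):
    """Variable byte code for one gap: 7 payload bits per byte, high bit marks the last byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128
    return out

def vb_encode(docids):
    """Encode a sorted docID list as VB-coded gaps."""
    stream, last = [], 0
    for d in docids:
        stream.extend(vb_encode_number(d - last))
        last = d
    return stream

print([format(b, "08b") for b in vb_encode([824, 829, 215406])])
# ['00000110', '10111000', '10000101', '00001101', '00001100', '10110001']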

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g., K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection
67

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space; they are "mini-documents". Key idea 2: Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

cos(q, d) = q · d for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search Systems: Making the Vector Space Model more efficient to

compute: approximating the actual correct results; skipping unnecessary documents. In actual data: dealing with zones and fields, query

term proximity

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
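The algorithm on the original slide is an image; below is a hedged term-at-a-time sketch of cosine scoring with accumulators (the data layout, e.g. postings as (docID, weight) pairs, is my assumption):

import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_lengths, K=10):
    """query_weights: term -> w(t,q); postings: term -> [(docID, w(t,d)), ...]."""
    scores = defaultdict(float)
    for term, wq in query_weights.items():
        for doc, wd in postings.get(term, []):
            scores[doc] += wq * wd              # accumulate partial dot products
    for doc in scores:
        scores[doc] /= doc_lengths[doc]         # length-normalize by |d|
    return heapq.nlargest(K, scores.items(), key=lambda item: item[1])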

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K but has

many docs from among the top K. Return the top K docs in A

Think of A as pruning non-contenders

The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q,d) = g(d) + cosine(q,d)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text, e.g., Title, Abstract, References, …

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: precision-recall curve, with precision on the y-axis (0.0 to 1.0) and recall on the x-axis (0.0 to 1.0)]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A) - P(E)] / [1 - P(E)]; P(A): proportion of the time the judges agree; P(E): what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement
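A quick worked example with hypothetical numbers (not from the slides): if the judges agree on 90% of the judgments, P(A) = 0.9, and chance agreement works out to P(E) = 0.6, then Kappa = (0.9 - 0.6) / (1 - 0.6) = 0.75.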

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new

system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3 XML IR: Basic XML concepts, Challenges in XML IR, Vector space model

for XML IR, Evaluation of XML IR

Chapter 9 Query Refinement: 1. Relevance Feedback

Document Level

Explicit RF: Rocchio (1971). When does it work?

Variants: Implicit and Blind

2. Query Expansion (Term Level)

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments

from a few documents. α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
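The update formula itself is an image on the original slide; the standard Rocchio form it refers to is:

q_m = α·q_0 + β·(1/|D_r|)·Σ_{d_j ∈ D_r} d_j - γ·(1/|D_nr|)·Σ_{d_j ∈ D_nr} d_j

i.e., move the query vector toward the centroid of the known relevant documents and away from the centroid of the known irrelevant ones.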

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: Simplest way to compute one is based on term-term similarities in C = A·A^T, where A is the term-document matrix; w_ij = (normalized) weight for (t_i, d_j)

For each t_i, pick terms with high values in C

[Figure: the M × N term-document matrix A, with rows t_i and columns d_j]

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How? Map XML documents to lexicalized subtrees

85

[Figure: an example XML document, a Book with Title "Microsoft" and Author "Bill Gates", is decomposed into its lexicalized subtrees ("with words"), e.g. Author/Bill, Author/Gates, Title/Microsoft, Book/Title/Microsoft; each such subtree becomes one dimension of the vector space]

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cd in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86
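The CR definition itself is an image on the slide; the function from IIR Chapter 10, which also matches the CR(c1, c3) = 0.60 value in the next example, is:

CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|)   if c_q matches c_d,   and 0 otherwise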

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>   (e.g., author/"Bill")

Inverted index (dictionary → postings):
<c1, t>  (e.g., author/"Bill"):                  <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t>  (e.g., title/"Bill"):                   <d2, 0.25> <d3, 0.1> <d12, 0.09>
<c3, t>  (e.g., book/author/firstname/"Bill"):   <d3, 0.7> <d6, 0.8> <d9, 0.5>

This example is slightly different from the book. All weights have been normalized.

CR(c1, c1) = 1.0

CR(c1, c2) = 0.0

CR(c1, c3) = 0.60

OK to ignore Query vs <c2, t> since CR = 0.0; compare Query vs <c1, t> and Query vs <c3, t>

Each contribution is Context Resemblance × Query Term Weight × Document Term Weight:
if wq = 1.0 then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IR
Chapter 11: 1. Probabilistic Approach to Retrieval (Basic Probability Theory) 2. Probability Ranking Principle 3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation

Independence: no association between terms (not true, but works in practice: a naïve assumption)

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios for the terms in the query, c_t

The odds ratio is the ratio of two odds:

(i) the odds of the term appearing if relevant ( p_t / (1 - p_t) )

and

(ii) the odds of the term appearing if nonrelevant ( u_t / (1 - u_t) )

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents

c_t functions as a term weight, so that

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
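The two formulas on the original slide are images; in the book's notation (a reconstruction from IIR Chapter 11, not copied from the slide) they are:

c_t = log [ p_t (1 - u_t) ] / [ u_t (1 - p_t) ]

RSV_d = Σ_{t ∈ q ∩ d} c_t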

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the

query. No length normalization of queries

(because retrieval is being done with respect to a single fixed query). The above tuning parameters should be set by optimization on a

development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
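The BM25 scoring formula is shown only as an image on the slide; the usual long-query form (a reconstruction from IIR Chapter 11, with k1, k3, b and lengths L_d, L_ave as defined there) is:

RSV_d = Σ_{t ∈ q} log(N / df_t) · [ (k1 + 1)·tf_td / (k1·((1 - b) + b·(L_d / L_ave)) + tf_td) ] · [ (k3 + 1)·tf_tq / (k3 + tf_tq) ]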

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and

'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in

the exact same way. Difference for probabilistic IR: at the end, you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents, so ignore. P(d) is the prior: often treated as the same for all d

But we can give a prior to "high-quality" documents, e.g., those with high static quality score g(d) (cf. Section 7.1.4)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
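The equation referred to is an image on the slide; the usual query-likelihood mixture (a reconstruction from IIR Chapter 12, with λ the mixture weight, M_d the document LM and M_c the collection LM) is:

P(d | q) ∝ P(d) · Π_{t ∈ q} ( λ·P(t | M_d) + (1 - λ)·P(t | M_c) )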

CS3245 ndash Information Retrieval

Week 12 Web Search: Chapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access

to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a. 1. Start with any distribution (say x = (1, 0, …, 0)) 2. After one step, we're at xA 3. After two steps at xA^2, then xA^3, and so on 4. "Eventually" means: for large k, xA^k = a

Algorithm multiply x by increasing powers of A until the product looks stable
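A small power-iteration sketch in Python (the transition matrix is a made-up 3-page example, assumed to already include teleporting so that rows sum to 1):

import numpy as np

def power_iteration(A, tol=1e-10):
    """Multiply x by increasing powers of A until the product looks stable."""
    x = np.zeros(A.shape[0])
    x[0] = 1.0                        # any starting distribution, e.g. (1, 0, ..., 0)
    while True:
        x_next = x @ A
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next

A = np.array([[0.05, 0.90, 0.05],
              [0.45, 0.10, 0.45],
              [0.05, 0.90, 0.05]])    # illustrative transition matrix, not from the slides
print(power_iteration(A))             # the steady-state distribution a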

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future: Python, one of the easiest and more straightforward

programming languages to use; NLTK, a good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldn't work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Google's second price auction
  • Google's second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary & Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105



those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: Given the graph of links, build matrix A. From it, compute a. The pagerank a_i is a scaled number between 0 and 1.

Query processing: Retrieve pages meeting the query. Rank them by their pagerank. Order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives
You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 4: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Web search overview

4

Ch 19

CS3245 ndash Information Retrieval

IR on the web vs IR in general On the web search is not just a nice feature Search is a key enabler of the web content creation

financing interest aggregation etc rarr look at search ads (Search Advertising)

The web is a chaotic and uncoordinated document collection rarr lots of duplicates ndash need to detect duplicates (Duplicate Detection)

5

Ch 19

CS3245 ndash Information Retrieval

Without search engines the web wouldnrsquot work Without search content is hard to find rarr Without search there is no incentive to create

content Why publish something if nobody will read it Why publish something if I donrsquot get ad revenue from

it Somebody needs to pay for the web Servers web infrastructure content creation A large part today is paid by search ads Search pays

for the web6

Ch 19

CS3245 ndash Information Retrieval

Interest aggregation Unique feature of the web A small number of

geographically dispersed people with similar interests can find each other Elementary school kids with hemophilia People interested in translating R5R5 Scheme into

relatively portable C (open source project) Search engines are a key enabler for interest

aggregation

7

Ch 19

CS3245 ndash Information Retrieval

SEARCHADVERTISING

8

Sec 193

CS3245 ndash Information Retrieval

1st Generation of Search AdsGoto (1996)

9

Sec 193

CS3245 ndash Information Retrieval

1st Generation of search adsGoto (1996)

Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the

link Pages were simply ranked according to bid ndash revenue

maximization for Goto No separation of adsdocs Only one result list Upfront and honest No relevance ranking

but Goto did not pretend there was any10

Sec 193

CS3245 ndash Information Retrieval

2nd generation of search ads Google (2000)

11

SogoTrade appearsin searchresults

SogoTrade appearsin ads

Do search enginesrank advertisershigher thannon-advertisers

All major searchengines claim ldquonordquo

Sec 193

CS3245 ndash Information Retrieval

Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major

advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet

Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)

12

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your

ad (ie CPC)

How does the auction determine an adrsquos rank and the price paidfor the ad

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 5: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

IR on the web vs IR in general On the web search is not just a nice feature Search is a key enabler of the web content creation

financing interest aggregation etc rarr look at search ads (Search Advertising)

The web is a chaotic and uncoordinated document collection rarr lots of duplicates ndash need to detect duplicates (Duplicate Detection)

5

Ch 19

CS3245 ndash Information Retrieval

Without search engines the web wouldnrsquot work Without search content is hard to find rarr Without search there is no incentive to create

content Why publish something if nobody will read it Why publish something if I donrsquot get ad revenue from

it Somebody needs to pay for the web Servers web infrastructure content creation A large part today is paid by search ads Search pays

for the web6

Ch 19

CS3245 ndash Information Retrieval

Interest aggregation Unique feature of the web A small number of

geographically dispersed people with similar interests can find each other Elementary school kids with hemophilia People interested in translating R5R5 Scheme into

relatively portable C (open source project) Search engines are a key enabler for interest

aggregation

7

Ch 19

CS3245 ndash Information Retrieval

SEARCHADVERTISING

8

Sec 193

CS3245 ndash Information Retrieval

1st Generation of Search AdsGoto (1996)

9

Sec 193

CS3245 ndash Information Retrieval

1st Generation of search adsGoto (1996)

Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the

link Pages were simply ranked according to bid ndash revenue

maximization for Goto No separation of adsdocs Only one result list Upfront and honest No relevance ranking

but Goto did not pretend there was any10

Sec 193

CS3245 ndash Information Retrieval

2nd generation of search ads Google (2000)

11

SogoTrade appearsin searchresults

SogoTrade appearsin ads

Do search enginesrank advertisershigher thannon-advertisers

All major searchengines claim ldquonordquo

Sec 193

CS3245 ndash Information Retrieval

Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major

advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet

Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)

12

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your

ad (ie CPC)

How does the auction determine an adrsquos rank and the price paidfor the ad

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability): "I don't want"
P(Aerosmith) = 0.11 × 0.11 × 0.11 ≈ 1.3E-3
P(LadyGaga) = 0.15 × 0.05 × 0.15 ≈ 1.1E-3
Winner: Aerosmith

42

Add-1 counts, Aerosmith line: I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1 (Total Count: 18)
Add-1 counts, LadyGaga line: I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2 (Total Count: 20)
|V| = 11 = total # of word types in both lines
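A small Python check of this example. The two training lines are my reconstruction from the count tables above, so treat the exact strings as assumptions; the counts and totals match the slide.

```python
from collections import Counter

def add_one_unigram_lm(text, vocab):
    """P(w) = (count(w) + 1) / (N + |V|), i.e. add-1 smoothing."""
    counts = Counter(text.split())
    denom = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / denom for w in vocab}

aerosmith = "i don't want to close my eyes"                # assumed training line
ladygaga = "i want your love and i want your revenge"      # assumed training line
vocab = set(aerosmith.split()) | set(ladygaga.split())     # |V| = 11

lm_a = add_one_unigram_lm(aerosmith, vocab)
lm_g = add_one_unigram_lm(ladygaga, vocab)

p_a = p_g = 1.0
for w in "i don't want".split():
    p_a *= lm_a[w]
    p_g *= lm_g[w]
print(p_a, p_g)  # ~1.4e-3 vs ~1.1e-3 -> Aerosmith wins
                 # (the slide rounds 2/18 to 0.11 per term, giving 1.3e-3)
```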

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Doc frequency information is also stored.
Why frequency?

Sec 12

44

CS3245 ndash Information Retrieval

The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries.

[Figure: merging two postings lists
Brutus: 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
Caesar: 1 -> 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34
Intersection: 2 -> 8]

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.
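A minimal Python version of the merge (a sketch, not the exact pseudocode shown in lecture):

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1          # advance the list with the smaller docID
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]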

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

47

[Figure: postings lists augmented with skip pointers, e.g.
2 -> 4 -> 8 -> 41 -> 48 -> 64 -> 128, with skip pointers to 41 and 128
1 -> 2 -> 3 -> 8 -> 11 -> 17 -> 21 -> 31, with skip pointers to 11 and 31]
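One way to code the skip-aware merge in Python, assuming evenly spaced skips of length sqrt(L) (a common heuristic; the exact placement policy is a design choice, not fixed by the slide):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect sorted postings lists, taking a skip when it does not overshoot."""
    s1 = max(1, int(math.sqrt(len(p1))))   # assumed skip spacing for p1
    s2 = max(1, int(math.sqrt(len(p2))))   # assumed skip spacing for p2
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1                    # take the skip pointer
            else:
                i += 1
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]
```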

Sec 23

CS3245 ndash Information Retrieval

Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words.
For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term. Two-word phrase query-processing is now immediate.

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example
For phrase queries, we use a merge algorithm recursively at the document level.
Now we need to deal with more than just equality.

49

<be: 993427;
  1: 7, 18, 33, 72, 86, 231;
  2: 3, 149;
  4: 17, 191, 291, 430, 434;
  5: 363, 367; …>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary: Hash, Trees

Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic: Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table
Each vocabulary term is hashed to an integer.
Pros: Lookup is faster than for a tree: O(1).
Cons: No easy way to find minor variants (judgment / judgement). No prefix search. If the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries
How can we handle *'s in the middle of a query term? E.g. co*tion.
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wild-card queries so that the * always occurs at the end.
This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering.

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction
Isolated word correction: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. ngram overlap
Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits.
54
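For reference, a standard dynamic-programming Levenshtein distance in Python (unit costs; a weighted edit distance would only change the cost terms):

```python
def edit_distance(s1, s2):
    """Levenshtein distance with unit insert/delete/substitute costs."""
    m, n = len(s1), len(s2)
    # dp[i][j] = distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[m][n]

print(edit_distance("cat", "dog"))             # 3
print(edit_distance("judgment", "judgement"))  # 1
```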

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing: No global dictionary - generate a separate dictionary for each block. Don't sort postings - accumulate postings as they occur.

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1: Generate separate dictionaries for each block; no need to maintain a term-termID mapping across blocks.

Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index
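A toy Python sketch of the per-block step (illustrative only; real SPIMI streams term-docID pairs from the parser and writes each block's index to disk before the final merge):

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """Build one block's inverted index: term -> postings list.
    Postings are accumulated as they occur; terms are sorted once, at block end."""
    index = defaultdict(list)
    for term, doc_id in token_stream:
        postings = index[term]              # dictionary entry created on first sight
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)         # append in order of occurrence (no sort)
    return dict(sorted(index.items()))      # sort terms once before writing the block

block = [("brutus", 1), ("caesar", 1), ("caesar", 2), ("brutus", 4), ("calpurnia", 2)]
print(spimi_invert(block))
# {'brutus': [1, 4], 'caesar': [1, 2], 'calpurnia': [2]}
```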

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing: MapReduce Data flow

58

[Figure: MapReduce data flow. The master assigns splits of the collection to parsers in the map phase; each parser writes its output into term-partitioned segment files (a-b, c-d, …, y-z); in the reduce phase, one inverter per partition reads the corresponding segment files and writes the final postings.]

CS3245 ndash Information Retrieval

Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both, merge results.
Deletions: invalidation bit-vector for deleted docs. Filter docs output on a search result by this invalidation bit-vector.
Periodically re-index into one main index. Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times.
Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
Either write the merged Z1 to disk as I1 (if no I1), or merge it with I1 to form Z2.
… etc. (Loop for log levels.)

Sec 45

60 Information Retrieval

CS3245 ndash Information Retrieval

Week 6 Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k = 4.
Need to store term lengths (1 extra byte):

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers.
Lose 4 bytes on term lengths.

Sec 52

CS3245 ndash Information Retrieval

Front coding
Sorted words commonly have a long common prefix; store differences only. Used for the last k-1 terms in a block of k:
8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a1◊e2◊ic3◊ion

(Encodes "automat" once; each entry then stores only the extra length beyond "automat".)
Begins to resemble general string compression.

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID:
  computer: 33, 47, 154, 159, 202, …
Consequence: it suffices to store gaps:
  33, 14, 107, 5, 43, …
Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:  824                 829        215406
gaps:                        5          214577
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
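A compact Python version of gap encoding plus variable byte encoding, reproducing the bytes above (a sketch; as in the lecture, the continuation bit is set on the last byte of each gap):

```python
def vb_encode_number(n):
    """Variable byte encode one gap; high bit marks the final byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                          # set continuation bit on the last byte
    return out

def vb_encode(docids):
    """Encode a docID list as gaps, then VB encode each gap."""
    encoded, prev = [], 0
    for d in docids:
        encoded.extend(vb_encode_number(d - prev))
        prev = d
    return encoded

print(" ".join(format(b, "08b") for b in vb_encode([824, 829, 215406])))
# 00000110 10111000 10000101 00001101 00001100 10110001
```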

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector.
2. Represent each document as a weighted tf-idf vector.
3. Compute the cosine similarity score for the query vector and each document vector.
4. Rank documents with respect to the query by score.
5. Return the top K (e.g. K = 10) to the user.

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w(t,d) = (1 + log10 tf(t,d)) × log10(N / df(t))

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.
67
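In code, a minimal sketch of that weight (base-10 logs, zero weight for absent terms):

```python
import math

def tf_idf_weight(tf, df, N):
    """w(t, d) = (1 + log10 tf(t, d)) * log10(N / df(t)); 0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

print(tf_idf_weight(tf=3, df=100, N=1_000_000))  # (1 + log10 3) * 4 ≈ 5.91
```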

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents".
Key idea 2: Rank documents according to their proximity to the query in this space.
proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity
Motivation: Want to rank more relevant documents higher than less relevant documents.

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σi qi di,  for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results; skipping unnecessary documents.
In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72
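The algorithm figure for this slide did not survive extraction; below is a Python sketch of the term-at-a-time accumulator scheme it depicts (my reconstruction; the postings, weights and lengths here are hypothetical illustration data):

```python
import heapq

def cosine_scores(query_weights, postings, doc_length, K=10):
    """Term-at-a-time scoring with accumulators, then top-K selection.
    query_weights: term -> w(t, q); postings: term -> [(docID, w(t, d)), ...];
    doc_length: docID -> Euclidean length of the document vector."""
    scores = {}                                        # one accumulator per doc
    for term, wq in query_weights.items():
        for doc, wd in postings.get(term, []):
            scores[doc] = scores.get(doc, 0.0) + wq * wd
    for doc in scores:
        scores[doc] /= doc_length[doc]                 # length-normalize
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])

postings = {"brutus": [(1, 1.2), (4, 0.5)], "caesar": [(1, 0.8), (2, 1.0)]}
doc_length = {1: 2.0, 2: 1.5, 4: 1.0}
print(cosine_scores({"brutus": 1.0, "caesar": 1.0}, postings, doc_length, K=2))
# [(1, 1.0), (2, 0.666...)]
```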

Sec 633

CS3245 ndash Information Retrieval

Generic approach
Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score
Consider a simple total score combining cosine relevance and authority:

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting; indeed, any function of the two "signals" of user happiness.
Now we seek the top K docs by net score.

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve, with precision on the y-axis and recall on the x-axis, both ranging from 0.0 to 1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure
Agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement.
Kappa (K) = [P(A) - P(E)] / [1 - P(E)]
P(A): proportion of the time the judges agree; P(E): what the agreement would be by chance.
Gives 0 for chance agreement, 1 for total agreement.
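A quick Python check of the formula on some hypothetical binary judgments (not from the lecture). Note this variant computes P(E) from per-judge marginals, i.e. Cohen's kappa; the textbook's worked example pools the marginals across both judges, which gives a slightly different P(E).

```python
def kappa(judge1, judge2):
    """Kappa for two judges' binary relevance judgments (1 = relevant)."""
    n = len(judge1)
    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n       # P(A)
    p1_yes, p2_yes = sum(judge1) / n, sum(judge2) / n
    p_chance = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)        # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

j1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical judgments
j2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
print(round(kappa(j1, j2), 3))        # 0.583: fair-to-good agreement
```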

CS3245 ndash Information Retrieval

AB testing
Purpose: Test a single innovation.
Prerequisite: You have a large system with many users.
Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation.
Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback, Query Expansion and XML IR

Chapter 10.3 XML IR: basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR.

Chapter 9 Query Refinement:
1. Relevance Feedback (document level): Explicit RF - Rocchio (1971); when does it work? Variants: Implicit and Blind.
2. Query Expansion (term level): manual thesaurus; Automatic Thesaurus Generation.

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:
Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.
α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton)
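The formula itself was an image on the slide; for reference, the standard Rocchio update (IIR 9.1.1) is:

```latex
\vec{q}_m \;=\; \alpha\,\vec{q}_0
\;+\; \beta\,\frac{1}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j
\;-\; \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j
```

Negative term weights in the modified query are usually clipped to 0.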

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix and wij = (normalized) weight for (ti, dj).
For each ti, pick the terms with high values in C.

[Figure: the M × N term-document matrix A, with row ti and column dj highlighted]

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
Aim: to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.

85

[Figure: an XML document with a Book root, a Title ("Microsoft") and an Author ("Bill Gates"), decomposed into its lexicalized subtrees "with words", e.g. Microsoft, Bill, Gates, Title-Microsoft, Author-Bill, Author-Gates, Book-Title-Microsoft, …]

CS3245 ndash Information Retrieval

Context resemblance
A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:
|cq| and |cd| are the number of nodes in the query path and document path, respectively.
cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
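The definition shown on the slide was lost in extraction; it is the context resemblance function from IIR 10.3:

```latex
\mathrm{CR}(c_q, c_d) =
\begin{cases}
\dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\[4pt]
0 & \text{otherwise}
\end{cases}
```

For example, with cq = author/"Bill" (2 nodes) and cd = book/author/firstname/"Bill" (4 nodes), CR = 3/5 = 0.6, which matches the value used on the next slide.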

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>   (e.g. c1 = author/"Bill")

Inverted index (dictionary → postings; all weights have been normalized):
<c1, t> → <d1, 0.5>, <d4, 0.1>, <d9, 0.2>    (e.g. c1 = author/"Bill")
<c2, t> → <d2, 0.25>, <d3, 0.1>, <d12, 0.9>  (e.g. c2 = title/"Bill")
<c3, t> → <d3, 0.7>, <d6, 0.8>, <d9, 0.5>    (e.g. c3 = book/author/firstname/"Bill")

This example is slightly different from the book.

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.60

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each term is Context Resemblance × Query Term Weight × Document Term Weight)

OK to ignore Query vs <c2, t> since CR = 0.0.
Remaining comparisons: Query vs <c1, t>; Query vs <c3, t>.

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval; Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model; BestMatch25 (Okapi)
Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM)
Traditionally used with the PRP.
Assumptions:
"Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d is represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice; a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV
We can rank documents using the log odds ratios for the terms in the query, ct:

ct = log [ pt (1 - ut) ] / [ ut (1 - pt) ]

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, pt / (1 - pt), and (ii) the odds of the term appearing if non-relevant, ut / (1 - ut).
ct = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and ct is positive if it is more likely to appear in relevant documents.
ct functions as a term weight, so that RSVd = Σ ct, summed over the query terms present in d.
Operationally, we sum ct quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:
tf(t,q): term frequency in the query q. k3: tuning parameter controlling term frequency scaling of the query.
No length normalization of queries (because retrieval is being done with respect to a single, fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.
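For completeness, the full Okapi BM25 scoring function these parameters belong to (shown as an image on the slide; this is the form in IIR 11.4.3, where Ld is the document length and Lave the average document length):

```latex
RSV_d = \sum_{t \in q} \log\!\frac{N}{\mathrm{df}_t}
\cdot \frac{(k_1 + 1)\,\mathrm{tf}_{td}}
       {k_1\big((1-b) + b\,(L_d / L_{ave})\big) + \mathrm{tf}_{td}}
\cdot \frac{(k_3 + 1)\,\mathrm{tf}_{tq}}{k_3 + \mathrm{tf}_{tq}}
```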

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case, you build an information retrieval scheme in the exact same way.
Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR
Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q):
P(q) is the same for all documents, so ignore it.
P(d) is the prior; often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one
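The equation referred to here is the linearly interpolated (mixture) query-likelihood model of IIR 12.2.2, where Md is the document language model, Mc the collection language model, and 0 < λ < 1:

```latex
P(q \mid d) \;\propto\; P(d)\,\prod_{t \in q}
\big(\lambda\, P(t \mid M_d) + (1-\lambda)\, P(t \mid M_c)\big)
```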

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994.
Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0)).
2. After one step, we're at xA.
3. After two steps, at xA², then xA³, and so on.
4. "Eventually" means: for large k, xAᵏ = a.
Algorithm: multiply x by increasing powers of A until the product looks stable.
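A small Python sketch of that power iteration. A is assumed to be the teleport-adjusted transition matrix (building it from the link graph is taken as already done); the 3-page matrix below is a made-up example.

```python
def pagerank_power_iteration(A, tol=1e-10, max_iter=1000):
    """Multiply x by A until the distribution stops changing; A[i][j] = P(i -> j)."""
    n = len(A)
    x = [1.0] + [0.0] * (n - 1)            # any starting distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        nxt = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, x)) < tol:
            break
        x = nxt
    return x

# Toy 3-page web, already smoothed with teleporting (each row sums to 1).
A = [[0.02, 0.49, 0.49],
     [0.49, 0.02, 0.49],
     [0.49, 0.49, 0.02]]
print(pagerank_power_iteration(A))   # ~[1/3, 1/3, 1/3] by symmetry
```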

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 6: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Without search engines the web wouldnrsquot work Without search content is hard to find rarr Without search there is no incentive to create

content Why publish something if nobody will read it Why publish something if I donrsquot get ad revenue from

it Somebody needs to pay for the web Servers web infrastructure content creation A large part today is paid by search ads Search pays

for the web6

Ch 19

CS3245 ndash Information Retrieval

Interest aggregation Unique feature of the web A small number of

geographically dispersed people with similar interests can find each other Elementary school kids with hemophilia People interested in translating R5R5 Scheme into

relatively portable C (open source project) Search engines are a key enabler for interest

aggregation

7

Ch 19

CS3245 ndash Information Retrieval

SEARCHADVERTISING

8

Sec 193

CS3245 ndash Information Retrieval

1st Generation of Search AdsGoto (1996)

9

Sec 193

CS3245 ndash Information Retrieval

1st Generation of search adsGoto (1996)

Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the

link Pages were simply ranked according to bid ndash revenue

maximization for Goto No separation of adsdocs Only one result list Upfront and honest No relevance ranking

but Goto did not pretend there was any10

Sec 193

CS3245 ndash Information Retrieval

2nd generation of search ads Google (2000)

11

SogoTrade appearsin searchresults

SogoTrade appearsin ads

Do search enginesrank advertisershigher thannon-advertisers

All major searchengines claim ldquonordquo

Sec 193

CS3245 ndash Information Retrieval

Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major

advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet

Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)

12

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your

ad (ie CPC)

How does the auction determine an adrsquos rank and the price paidfor the ad

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25, A Nonbinary Model: If the query is long, we might also use similar weighting for query terms.

tftq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.
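For reference, the full BM25 scoring function with the query-term factor discussed above, as given in IIR Sec 11.4.3 (L_d is the document length, L_ave the average document length):

RSV_d = \sum_{t \in q} \log\frac{N}{df_t} \cdot
        \frac{(k_1+1)\,tf_{t,d}}{k_1\bigl((1-b) + b \cdot L_d / L_{ave}\bigr) + tf_{t,d}} \cdot
        \frac{(k_3+1)\,tf_{t,q}}{k_3 + tf_{t,q}}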

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way.

Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q) = P(q|d) P(d) / P(q).

P(q) is the same for all documents, so ignore it.
P(d) is the prior – often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.

So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.
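For reference, the basic query likelihood model from IIR Sec 12.2, which the mixture model on the next slide smooths (L_d is the document length):

P(q|M_d) = \prod_{t \in q} \hat{P}(t|M_d), \qquad
\hat{P}_{mle}(t|M_d) = \frac{tf_{t,d}}{L_d}

A query term missing from d makes the whole product zero, which is what motivates mixing the document model with a collection model.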

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model: Summary

What we model: The user has a document in mind and generates the query from this document.

The equation represents the probability that the document that the user had in mind was in fact this one.
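A minimal Python sketch of query-likelihood ranking with the usual linear mixture of a document model and the collection model; the λ = 0.5 setting, the function name, and the toy documents are illustrative assumptions:

import math
from collections import Counter

def lm_score(query, doc_tokens, collection_tokens, lam=0.5):
    """log P(q|d) under a mixture of document and collection unigram models:
    P(t|d) = lam * tf_td / L_d + (1 - lam) * cf_t / T"""
    doc_tf, col_tf = Counter(doc_tokens), Counter(collection_tokens)
    L_d, T = len(doc_tokens), len(collection_tokens)
    score = 0.0
    for t in query:
        p = lam * doc_tf[t] / L_d + (1 - lam) * col_tf[t] / T
        if p == 0.0:              # term absent from the entire collection
            return float("-inf")
        score += math.log(p)
    return score

docs = {"d1": "click go the shears boys click click click".split(),
        "d2": "metal here metal shears click here".split()}
collection = [t for toks in docs.values() for t in toks]
for d, toks in docs.items():
    print(d, lm_score("shears click".split(), toks, collection))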

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12: Web Search

Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)

Duplicates: need to integrate duplicate detection

Scale: need to use a distributed approach

Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.
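A crawler can check such rules with Python's standard library robot parser; a brief sketch (the URLs are placeholders):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # placeholder site
rp.read()   # fetch and parse once, then keep the parser cached per host

# Ask before fetching each page, and honor the answer.
print(rp.can_fetch("MyCrawler", "http://www.example.com/yoursite/temp/page.html"))
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))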

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0))
2. After one step, we're at xA
3. After two steps at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable.
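A minimal Python sketch of this power iteration; the 3-page transition matrix and the tolerance are illustrative assumptions:

# Power iteration: compute xA, xA^2, ... until the distribution stops changing.
def pagerank(A, tol=1e-10, max_iter=1000):
    n = len(A)
    x = [1.0] + [0.0] * (n - 1)          # start with any distribution
    for _ in range(max_iter):
        x_next = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(x, x_next)) < tol:
            return x_next                # steady state a reached
        x = x_next
    return x

# Toy transition matrix A (rows sum to 1; teleporting already folded in).
A = [[0.10, 0.80, 0.10],
     [0.45, 0.10, 0.45],
     [0.10, 0.80, 0.10]]
print(pagerank(A))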

CS3245 ndash Information Retrieval

100

Pagerank summary:
Pre-processing: Given the graph of links, build matrix A. From it, compute a. The pagerank ai is a scaled number between 0 and 1.

Query processing: Retrieve pages meeting the query. Rank them by their pagerank. Order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103 Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldn't work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search Ads: Goto (1996)
  • 1st Generation of search ads: Goto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Google's second price auction
  • Google's second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

[Table fragment from the dictionary compression slides – columns: Freq., Postings ptr., Term ptr.; sample Freq. values 33, 29, 44, 126, 7]

Page 8: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

SEARCHADVERTISING

8

Sec 193

CS3245 ndash Information Retrieval

1st Generation of Search AdsGoto (1996)

9

Sec 193

CS3245 ndash Information Retrieval

1st Generation of search adsGoto (1996)

Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the

link Pages were simply ranked according to bid ndash revenue

maximization for Goto No separation of adsdocs Only one result list Upfront and honest No relevance ranking

but Goto did not pretend there was any10

Sec 193

CS3245 ndash Information Retrieval

2nd generation of search ads Google (2000)

11

SogoTrade appearsin searchresults

SogoTrade appearsin ads

Do search enginesrank advertisershigher thannon-advertisers

All major searchengines claim ldquonordquo

Sec 193

CS3245 ndash Information Retrieval

Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major

advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet

Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)

12

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your

ad (ie CPC)

How does the auction determine an adrsquos rank and the price paidfor the ad

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability): "I don't want"
P(Aerosmith) = 0.11 × 0.11 × 0.11 = 1.3E-3
P(LadyGaga) = 0.15 × 0.05 × 0.15 = 1.1E-3
Winner: Aerosmith

42

Line 1 counts (Aerosmith): I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1 – Total Count: 18

Line 2 counts (Lady Gaga): I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2 – Total Count: 20

|V| = total # of word types in both lines (= 11)
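A small Python sketch of add-1 smoothed unigram LM scoring over counts like those above (the dictionaries mirror the two tables and |V| = 11 as stated; the function names are illustrative, and the exact values may differ from the Q2 figures above, which appear to be unsmoothed):

aerosmith = {"i": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
             "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}
ladygaga  = {"i": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
             "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}
V = len(set(aerosmith) | set(ladygaga))          # |V| = 11

def add1_prob(word, counts):
    # Add-1 (Laplace) smoothing: every word type gets one extra count.
    return (counts.get(word, 0) + 1) / (sum(counts.values()) + V)

def score(query, counts):
    p = 1.0
    for w in query.split():
        p *= add1_prob(w, counts)
    return p

q = "i don't want"
print(score(q, aerosmith), score(q, ladygaga))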

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps: Dictionary & Postings. Multiple term entries in a single document are merged. Split into Dictionary and Postings. Doc frequency information is also stored. Why frequency?

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Brutus AND Caesar: 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID.
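A Python sketch of this linear-time merge (docIDs assumed already sorted; the lists reuse the Brutus/Caesar example above):

def intersect(p1, p2):
    # Walk both sorted postings lists in lockstep: O(x + y).
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]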

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

[Figure: postings list 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers to 41 and 128, and postings list 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers to 11 and 31.]

Sec 23

CS3245 ndash Information Retrieval

Biword indexes: Index every consecutive pair of terms in the text as a phrase (a bigram model using words). For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen". Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary: Hash, Trees

Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic – Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table: Each vocabulary term is hashed to an integer. Pros: lookup is faster than for a tree: O(1). Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term? E.g., co*tion.

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive.

The solution: transform wildcard queries so that the * always occurs at the end.

This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering.

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction: Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap

Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits.
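A Python sketch of alternative 1, the plain dynamic-programming Levenshtein distance:

def edit_distance(s, t):
    # dp[i][j] = edit distance between s[:i] and t[:j].
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

print(edit_distance("judgment", "judgement"))   # 1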

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing: No global dictionary – generate a separate dictionary for each block. Don't sort postings – accumulate postings as they occur.

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

[Figure: MapReduce data flow – a master assigns segment files (splits) to parsers in the map phase; each parser writes its output into partitions a-b, c-d, …, y-z; in the reduce phase, one inverter per partition collects that partition from all parsers and writes the final postings.]

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both, merge results. Deletions: invalidation bit-vector for deleted docs; filter docs output on a search result by this invalidation bit-vector. Periodically re-index into one main index. Assuming T = total # of postings and n = size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge: Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk. If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1. Either write merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2, … etc.

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6: Index Compression. Collection and vocabulary statistics: Heaps' and Zipf's laws.

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix – store differences only. Used for the last k-1 terms in a block of k:

8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion

(The * marks the end of the shared prefix "automat"; each number before a ◊ gives the extra length beyond "automat", followed by the differing suffix.)

Begins to resemble general string compression.
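A Python sketch of front coding one sorted block; the '*' and '◊' output markers follow the slide's notation, while the function itself is illustrative:

def front_code(block):
    # Encode a sorted block: first term in full (with * after the common
    # prefix), remaining terms as <extra length> ◊ <suffix beyond the prefix>.
    prefix = block[0]
    for term in block[1:]:
        n = 0
        while n < min(len(prefix), len(term)) and prefix[n] == term[n]:
            n += 1
        prefix = prefix[:n]
    out = [f"{len(block[0])}{prefix}*{block[0][len(prefix):]}"]
    for term in block[1:]:
        suffix = term[len(prefix):]
        out.append(f"{len(suffix)}\u25ca{suffix}")
    return "".join(out)

print(front_code(["automata", "automate", "automatic", "automation"]))
# 8automat*a1◊e2◊ic3◊ion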

Sec 52

CS3245 ndash Information Retrieval

Postings Compression. Postings file entry: We store the list of docs containing a term in increasing order of docID: computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope: most gaps can be encoded/stored with far fewer than 20 bits.

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example:

docIDs:  824                 829        215406
gaps:                        5          214577
VB code: 00000110 10111000   10000101   00001101 00001100 10110001

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable.

For a small gap (5), VB uses a whole byte.

(Check: 00000110 10111000 decodes to 6×128 + 56 = 512+256+32+16+8 = 824.)
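A Python sketch of VB encoding and decoding (gaps computed from the docIDs above); the helper names are illustrative:

def vb_encode_number(n):
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128          # set the continuation bit on the last byte
    return bytes_

def vb_encode(gaps):
    out = []
    for g in gaps:
        out.extend(vb_encode_number(g))
    return out

def vb_decode(stream):
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

docids = [824, 829, 215406]
gaps = [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]  # [824, 5, 214577]
encoded = vb_encode(gaps)
print([f"{b:08b}" for b in encoded])
print(vb_decode(encoded))      # [824, 5, 214577]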

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g., K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

w(t,d) = (1 + log10 tf(t,d)) × log10(N / df(t))

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors: Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents". Key idea 2: Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69
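A small Python sketch tying the Week 7 steps together: tf-idf weights as in the formula above, length normalization, and ranking by dot product. The toy collection and the choice to idf-weight both query and documents are illustrative assumptions.

import math
from collections import Counter

docs = {1: "new york times", 2: "new york post", 3: "los angeles times"}
N = len(docs)
df = Counter(t for text in docs.values() for t in set(text.split()))

def tfidf_vector(text):
    # w(t,d) = (1 + log10 tf) * log10(N / df_t), then length-normalize.
    tf = Counter(text.split())
    w = {t: (1 + math.log10(c)) * math.log10(N / df[t])
         for t, c in tf.items() if t in df}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

def cosine(q, d):
    # Dot product of two length-normalized sparse vectors.
    return sum(w * d.get(t, 0.0) for t, w in q.items())

query = tfidf_vector("new new times")
scores = sorted(((cosine(query, tfidf_vector(text)), doc)
                 for doc, text in docs.items()), reverse=True)
print(scores)   # top K = the first K entries of this list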

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems. Making the Vector Space Model more efficient to compute: approximating the actual correct results; skipping unnecessary documents. In actual data: dealing with zones and fields, and query term proximity.

Resources: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.

Think of A as pruning non-contenders.

The same approach can also be used for other (non-cosine) scoring functions.

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve; precision (0.0–1.0) on the y-axis, recall (0.0–1.0) on the x-axis.]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (κ) = [P(A) − P(E)] / [1 − P(E)]. P(A) – proportion of the time the judges agree; P(E) – what the agreement would be by chance.

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness. Probably the evaluation methodology that large search engines trust most.

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9: Query Refinement
1. Relevance Feedback (Document Level)
   Explicit RF – Rocchio (1971); when does it work?
   Variants: Implicit and Blind
2. Query Expansion (Term Level)

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents. α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton)
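For reference, the Rocchio update that "in practice" refers to (the standard form with Dr and Dnr, as in IIR 9.1.1) is:

\vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j

i.e., move the query vector toward the centroid of the known relevant documents and away from the centroid of the known non-relevant ones.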

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: The simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the term-document matrix and w(i,j) = (normalized) weight for (ti, dj).

For each ti, pick terms with high values in C.

[Figure: term-document matrix A with M rows (terms ti) and N columns (documents dj).]

Sec 923

In NLTK Did you forget

84
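A numpy sketch of this construction; the toy term-document matrix and term list are illustrative assumptions:

import numpy as np

# Toy term-document matrix A: M terms x N documents (weights assumed normalized).
terms = ["car", "auto", "insurance", "best"]
A = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.0, 0.1],
              [0.1, 0.9, 0.8],
              [0.0, 0.6, 0.7]])

C = A @ A.T                     # term-term similarity matrix C = A A^T

for i, t in enumerate(terms):
    # For each term t_i, pick the other term with the highest value in row i.
    j = max((j for j in range(len(terms)) if j != i), key=lambda j: C[i, j])
    print(t, "->", terms[j])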

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

[Figure: an XML document with structure Book → (Title: "Microsoft", Author: "Bill Gates") is mapped to lexicalized subtrees "with words", e.g. Author→Bill, Author→Gates, Title→Microsoft, Book→Title→Microsoft, and so on; each such subtree becomes one dimension of the vector space.]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR.

|cq| and |cd| are the number of nodes in the query path and document path, respectively.

cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
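The context resemblance function referred to above is (as in IIR 10.3):

CR(c_q, c_d) = \begin{cases} \dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\ 0 & \text{otherwise} \end{cases}

For example, author/"Bill" against book/author/firstname/"Bill" gives (1 + 2)/(1 + 4) = 0.6, the value used in the next example.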

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t> (e.g., author/"Bill"), with query term weight wq = 1.0.

Inverted index (dictionary → postings):
<c1, t> (e.g., author/"Bill") → <d1, 0.5>, <d4, 0.1>, <d9, 0.2>
<c2, t> (e.g., title/"Bill") → <d2, 0.25>, <d3, 0.1>, <d12, 0.9>
<c3, t> (e.g., book/author/firstname/"Bill") → <d3, 0.7>, <d6, 0.8>, <d9, 0.5>

(This example is slightly different from the book; all weights have been normalized.)

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.60

OK to ignore Query vs <c2, t> since CR = 0.0; evaluate Query vs <c1, t> and Query vs <c3, t>.

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each term is Context Resemblance × Query Term Weight × Document Term Weight)

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval / Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): Traditionally used with the PRP.

Assumptions:
"Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g., document d is represented by a vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: We can rank documents using the log odds ratios for the terms in the query, ct.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, pt / (1 − pt), and (ii) the odds of the term appearing if nonrelevant, ut / (1 − ut).

ct = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and ct is positive if it is more likely to appear in relevant documents.

ct functions as a term weight, so that operationally we sum ct quantities in accumulators for query terms appearing in documents, just like in VSM.
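Concretely, the log odds-ratio term weight described above is (as in IIR 11.3.1):

c_t = \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t} = \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)}

and the retrieval status value of a document is RSV_d = \sum_{t \in q \cap d} c_t.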

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model. If the query is long, we might also use similar weighting for query terms.

tf(t,q): term frequency in the query q. k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single, fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.
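A Python sketch of the BM25 score with the query-side factor described above; the toy collection and the choice k3 = 1.5 (within the stated range) are illustrative assumptions:

import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, b=0.75, k3=1.5):
    # RSV(q,d) = sum over query terms of idf * document tf factor * query tf factor.
    N = len(docs)
    L_ave = sum(len(d.split()) for d in docs) / N
    tf_d = Counter(doc.split())
    tf_q = Counter(query.split())
    L_d = sum(tf_d.values())
    score = 0.0
    for t, qtf in tf_q.items():
        df = sum(1 for d in docs if t in d.split())
        if df == 0 or t not in tf_d:
            continue
        idf = math.log10(N / df)
        doc_factor = ((k1 + 1) * tf_d[t]) / (k1 * ((1 - b) + b * L_d / L_ave) + tf_d[t])
        query_factor = ((k3 + 1) * qtf) / (k3 + qtf)
        score += idf * doc_factor * query_factor
    return score

docs = ["jack london traveled to oakland",
        "jack traveled from oakland to london",
        "the city of york"]
print(bm25_score("jack oakland", docs[0], docs))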

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g., those with a high static quality score g(d) (cf. Section 7.1.4).

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95
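A Python sketch of ranking by query likelihood with the mixture (Jelinek-Mercer) smoothing this model uses, P(t|d) ≈ λ·P(t|Md) + (1−λ)·P(t|Mc); the toy documents and λ = 0.5 are illustrative assumptions:

from collections import Counter

docs = {1: "click go the shears boys click click click",
        2: "click click",
        3: "metal here",
        4: "metal shears click here"}
lam = 0.5

coll = Counter(" ".join(docs.values()).split())
coll_len = sum(coll.values())

def p_mixture(term, doc_counts, doc_len):
    p_doc  = doc_counts.get(term, 0) / doc_len      # P(t | Md), document model
    p_coll = coll[term] / coll_len                  # P(t | Mc), collection model
    return lam * p_doc + (1 - lam) * p_coll

def query_likelihood(query, doc):
    counts = Counter(doc.split())
    dl = sum(counts.values())
    p = 1.0
    for t in query.split():
        p *= p_mixture(t, counts, dl)
    return p

ranked = sorted(docs, key=lambda d: query_likelihood("shears click", docs[d]), reverse=True)
print(ranked)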

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98
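Python's standard library already parses this protocol; a short sketch using urllib.robotparser (the example.com URLs and the "SomeBot" user agent are placeholders):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Parse the example rules directly; equivalently, set_url(...) + read()
# would fetch a live robots.txt (which a crawler should cache per site).
rp.parse("""
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
""".splitlines())

print(rp.can_fetch("SomeBot", "http://www.example.com/yoursite/temp/page.html"))        # False
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))   # True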

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a.
1. Start with any distribution (say x = (1, 0, …, 0)).
2. After one step, we're at xA.
3. After two steps, at xA², then xA³, and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.
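A numpy sketch of this power iteration; the row-stochastic transition matrix A below (teleportation already folded in) is an illustrative assumption:

import numpy as np

# Row-stochastic transition matrix (teleportation already folded in).
A = np.array([[0.02, 0.02, 0.88, 0.02, 0.02, 0.04],
              [0.02, 0.45, 0.45, 0.02, 0.02, 0.04],
              [0.31, 0.02, 0.31, 0.31, 0.02, 0.03],
              [0.02, 0.02, 0.02, 0.45, 0.45, 0.04],
              [0.02, 0.02, 0.02, 0.02, 0.02, 0.90],
              [0.02, 0.02, 0.02, 0.31, 0.31, 0.32]])

x = np.array([1.0, 0, 0, 0, 0, 0])      # start with any distribution
for _ in range(100):                     # multiply by increasing powers of A
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-10):
        break
    x = x_next

print(x)   # steady-state vector a: the PageRank scores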

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future: Python – one of the easiest and more straightforward programming languages to use; NLTK – a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 ndash Information Retrieval

103 Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105



Page 9: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

1st Generation of Search AdsGoto (1996)

9

Sec 193

CS3245 ndash Information Retrieval

1st Generation of search adsGoto (1996)

Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the

link Pages were simply ranked according to bid ndash revenue

maximization for Goto No separation of adsdocs Only one result list Upfront and honest No relevance ranking

but Goto did not pretend there was any10

Sec 193

CS3245 ndash Information Retrieval

2nd generation of search ads Google (2000)

11

SogoTrade appearsin searchresults

SogoTrade appearsin ads

Do search enginesrank advertisershigher thannon-advertisers

All major searchengines claim ldquonordquo

Sec 193

CS3245 ndash Information Retrieval

Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major

advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet

Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)

12

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your

ad (ie CPC)

How does the auction determine an adrsquos rank and the price paidfor the ad

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. The difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q). By Bayes' rule, P(d|q) = P(q|d) P(d) / P(q):
- P(q) is the same for all documents, so ignore it
- P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4)
- P(q|d) is the probability of q given d

So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
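A small sketch of query-likelihood scoring with the Jelinek-Mercer mixture described above, i.e. P(q|d) ≈ ∏_t ( λ P(t|Md) + (1 − λ) P(t|Mc) ); the toy collection and λ = 0.5 are illustrative assumptions.

import math
from collections import Counter

docs = {
    "d1": "click go the shears boys click click click".split(),
    "d2": "metal here shining all click here".split(),
}
collection = [t for words in docs.values() for t in words]
coll_counts, coll_len = Counter(collection), len(collection)

def query_log_likelihood(query, doc, lam=0.5):
    """log P(q|d) under a mixture of the document LM and the collection LM."""
    words = docs[doc]
    doc_counts, doc_len = Counter(words), len(words)
    score = 0.0
    for t in query:
        p_doc = doc_counts[t] / doc_len        # P(t | Md), maximum likelihood
        p_coll = coll_counts[t] / coll_len     # P(t | Mc), smooths unseen terms
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

for d in docs:
    print(d, query_log_likelihood(["click", "shears"], d))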

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
- Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
- Duplicates: need to integrate duplicate detection
- Scale: need to use a distributed approach
- Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201
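One possible sketch of the politeness requirement above (spacing out requests per host); the helper name and the 30-second delay are assumptions for illustration.

import time

last_fetch = {}                 # host -> time of the most recent request
MIN_DELAY = 30.0                # seconds to wait between requests to the same host

def polite_wait(host):
    """Block until enough time has passed since the last request to this host."""
    now = time.time()
    wait = last_fetch.get(host, 0.0) + MIN_DELAY - now
    if wait > 0:
        time.sleep(wait)
    last_fetch[host] = time.time()

# usage: call polite_wait("www.example.com") before every fetch from that host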

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202
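A sketch of how a crawler might honour such rules using Python's standard library robots.txt parser; the URLs are placeholders.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")   # placeholder site
rp.read()                                          # fetch and parse once, then cache

# Check individual URLs against the cached rules before fetching them.
print(rp.can_fetch("searchengine", "https://www.example.com/yoursite/temp/page.html"))
print(rp.can_fetch("*", "https://www.example.com/index.html"))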

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable.
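A minimal power-iteration sketch of this procedure; the 3-page transition matrix is a made-up example with teleporting already folded in.

import numpy as np

# Toy transition matrix A (rows sum to 1, teleportation already included).
A = np.array([[0.05, 0.90, 0.05],
              [0.45, 0.10, 0.45],
              [0.90, 0.05, 0.05]])

x = np.array([1.0, 0.0, 0.0])        # start with any distribution
for _ in range(100):                 # multiply by increasing powers of A
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-10):   # stop when the product looks stable
        break
    x = x_next

print(x)                             # steady-state vector a: the PageRank scores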

CS3245 ndash Information Retrieval

100

Pagerank summary:
Pre-processing: Given the graph of links, build matrix A. From it, compute a. The pagerank ai is a scaled number between 0 and 1.
Query processing: Retrieve pages meeting the query. Rank them by their pagerank. Order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
- Python – one of the easiest and more straightforward programming languages to use
- NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Areas of IR shown on the slide: Adversarial IR, Boolean IR / VSM, Prob. IR, Computational Advertising, Recommendation Systems, Digital Libraries, Natural Language Processing, Question Answering, IR Theory, Distributed IR, Geographic IR, Social Network IR, Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105



Page 11: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

2nd generation of search ads Google (2000)

11

SogoTrade appearsin searchresults

SogoTrade appearsin ads

Do search enginesrank advertisershigher thannon-advertisers

All major searchengines claim ldquonordquo

Sec 193

CS3245 ndash Information Retrieval

Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major

advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet

Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)

12

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your

ad (ie CPC)

How does the auction determine an adrsquos rank and the price paidfor the ad

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

47

Postings: 2 → 4 → 8 → 41 → 48 → 64 → 128, with skip pointers to 41 and 128
Postings: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31, with skip pointers to 11 and 31

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase: a bigram model using words. For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen"

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic – Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table
Each vocabulary term is hashed to an integer. Pros: Lookup is faster than for a tree, O(1)

Cons No easy way to find minor variants

judgment / judgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term, e.g. co*tion?

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive.

The solution: transform wild-card queries so that the * always occurs at the end.

This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering

53

Sec 32
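A minimal sketch (not from the slides) of the permuterm idea: index every rotation of term$, then rotate the wildcard query so the * lands at the end and do a prefix lookup. The tiny vocabulary is an illustrative assumption.

```python
def permuterm_keys(term):
    """All rotations of term$; each rotation points back to the term."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def rotate_query(wildcard):
    """Rotate q1*q2 so the * is at the end: q2$q1* -> prefix lookup on q2$q1."""
    q1, q2 = wildcard.split("*", 1)
    return q2 + "$" + q1

vocab = ["coordination", "cooperation", "cotton", "collection"]
index = {key: t for t in vocab for key in permuterm_keys(t)}

prefix = rotate_query("co*tion")        # -> "tion$co"
matches = {t for key, t in index.items() if key.startswith(prefix)}
print(matches)                          # {'coordination', 'cooperation', 'collection'}
```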

CS3245 ndash Information Retrieval

Spelling correction
Isolated word correction: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. ngram overlap

Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time and suggest the one with most hits
54

Sec 332
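For reference, a minimal dynamic-programming implementation of Levenshtein edit distance (a sketch, not taken from the slides):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]   # dp[i][j]: distance a[:i] vs b[:j]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(edit_distance("judgment", "judgement"))  # 1
```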

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing: No global dictionary - generate a separate dictionary for each block. Don't sort postings - accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI: Blocked sort-based Indexing (Sorting with fewer disk seeks). 8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1: Generate separate dictionaries for each block – no need to maintain term-termID mapping across blocks

Key idea 2: Don't sort. Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57
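A minimal Python sketch of the SPIMI-invert idea for a single block (illustrative only; real SPIMI writes each block's index to disk and merges the block indexes afterwards):

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """One block's inverted index: no termID mapping, no sorting of postings
    while indexing - postings are simply accumulated as they occur."""
    index = defaultdict(list)                 # term -> postings list of docIDs
    for doc_id, term in token_stream:
        postings = index[term]
        if not postings or postings[-1] != doc_id:   # collapse repeats within a doc
            postings.append(doc_id)
    # terms are sorted only once, when the block is written out
    return {term: index[term] for term in sorted(index)}

tokens = [(1, "new"), (1, "home"), (2, "home"), (2, "sales"), (2, "top"), (3, "home")]
print(spimi_invert(tokens))   # {'home': [1, 2, 3], 'new': [1], 'sales': [2], 'top': [2]}
```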

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

(Diagram: splits of the collection are assigned by the master to parsers in the map phase; the parsers write segment files partitioned by term range (a-b, c-d, …, y-z); inverters in the reduce phase each collect one term-range partition and write out its postings.)

CS3245 ndash Information Retrieval

Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both, merge results.
Deletions: Invalidation bit-vector for deleted docs. Filter docs output on a search result by this invalidation bit-vector.
Periodically re-index into one main index. Assuming T total # of postings and n as size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write to disk as I0, or merge with I0 (if I0 already exists) as Z1.
Either write merged Z1 to disk as I1 (if no I1), or merge with I1 to form Z2.
… etc.

Sec 45

60Information Retrieval

Loop for log levels

CS3245 ndash Information Retrieval

Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking. Store pointers to every kth term string. Example below: k=4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers

Lose 4 bytes on term lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding
Sorted words commonly have a long common prefix – store differences only. Used for last k-1 terms in a block of k:
8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion
(* marks the end of the encoded prefix "automat"; each number gives the extra length beyond "automat")

Information Retrieval 63

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID:
computer: 33, 47, 154, 159, 202, …
Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …
Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:   824                  829        215406
gaps:     –                    5          214577
VB code:  00000110 10111000    10000101   00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
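A minimal Python sketch of VB encoding/decoding of gaps (not from the slides), using the convention above that the continuation bit is set on the last byte of each number:

```python
def vb_encode_number(n):
    """Variable-byte encode one gap: 7 payload bits per byte,
    high bit set only on the final byte of the number."""
    bytes_out = []
    while True:
        bytes_out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_out[-1] += 128              # continuation bit on the last byte
    return bytes_out

def vb_decode(byte_stream):
    numbers, n = [], 0
    for b in byte_stream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
code = [b for g in gaps for b in vb_encode_number(g)]
print([format(b, "08b") for b in code])   # matches the bit pattern above
print(vb_decode(code))                    # [824, 5, 214577]
```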

CS3245 ndash Information Retrieval

Week 7: Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_t,d = (1 + log10 tf_t,d) × log10(N / df_t)

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf
Increases with the number of occurrences within a document
Increases with the rarity of the term in the collection
67

Sec 622
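A minimal sketch of the whole pipeline in Python: (1 + log10 tf) × idf weights, length normalization, and cosine ranking. The three toy documents and the query are illustrative assumptions, not from the slides.

```python
import math
from collections import Counter

docs = ["new home sales top forecasts",
        "home sales rise in july",
        "increase in home sales in july"]
N = len(docs)
df = Counter(t for d in docs for t in set(d.split()))   # document frequencies

def tfidf_vector(text):
    tf = Counter(text.split())
    vec = {t: (1 + math.log10(tf[t])) * math.log10(N / df[t])
           for t in tf if df.get(t)}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}         # length-normalize

def cosine(q, d):
    return sum(w * d.get(t, 0.0) for t, w in q.items())  # dot product

query = tfidf_vector("home sales july")
for score, doc in sorted(((cosine(query, tfidf_vector(d)), d) for d in docs),
                         reverse=True):
    print(round(score, 3), doc)
```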

CS3245 ndash Information Retrieval

Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents".
Key idea 2: Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
cos(q, d) = q · d, for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results, skipping unnecessary documents.
In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach
Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting

Indeed, any function of the two "signals" of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value. Sometimes build range (B-tree) trees (e.g. for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

(Figure: a precision-recall curve; x-axis Recall from 0.0 to 1.0, y-axis Precision from 0.0 to 1.0.)

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: Agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement.
Kappa (K) = [P(A) - P(E)] / [1 - P(E)]
P(A) – proportion of time judges agree
P(E) – what agreement would be by chance
Gives 0 for chance agreement, 1 for total agreement
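A minimal worked computation of kappa in Python, with P(E) taken from the pooled marginals of the two judges (a sketch; the 2×2 judgment counts are illustrative assumptions, not from the slides):

```python
def kappa(table):
    """table[i][j] = number of docs judge 1 rated i and judge 2 rated j
    (0 = nonrelevant, 1 = relevant)."""
    n = sum(sum(row) for row in table)
    p_agree = (table[0][0] + table[1][1]) / n
    p1_rel = (table[1][0] + table[1][1]) / n      # judge 1 says relevant
    p2_rel = (table[0][1] + table[1][1]) / n      # judge 2 says relevant
    p_rel = (p1_rel + p2_rel) / 2                 # pooled marginal
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

# 300 rel/rel, 70 nonrel/nonrel, 20 + 10 disagreements (assumed counts)
print(round(kappa([[70, 20], [10, 300]]), 3))     # 0.776
```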

CS3245 ndash Information Retrieval

A/B testing
Purpose: Test a single innovation
Prerequisite: You have a large system with many users
Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on first result. Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most
80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR

Chapter 10.3 XML IR: Basic XML concepts, Challenges in XML IR, Vector space model for XML IR, Evaluation of XML IR

Chapter 9: Query Refinement
1. Relevance Feedback (Document Level)
   Explicit RF – Rocchio (1971): When does it work?
   Variants: Implicit and Blind
2. Query Expansion (Term Level)
   Manual thesaurus
   Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.
α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
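The Rocchio formula itself appears only as an image on the slide; as a hedged sketch, the standard form q_m = α·q + β·(centroid of Dr) − γ·(centroid of Dnr), with negative components clipped to 0, looks like this in Python (the vectors and weights below are illustrative assumptions):

```python
import numpy as np

def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of the known relevant docs and
    away from the centroid of the known irrelevant docs."""
    q = alpha * np.asarray(query, dtype=float)
    if rel_docs:
        q += beta * np.mean(rel_docs, axis=0)
    if nonrel_docs:
        q -= gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q, 0.0)          # negative term weights are set to 0

query  = [1.0, 0.0, 0.0]                         # weights over a 3-term vocabulary
rel    = [[0.9, 0.4, 0.0], [0.8, 0.6, 0.0]]      # judged relevant
nonrel = [[0.0, 0.1, 0.9]]                       # judged nonrelevant
print(rocchio(query, rel, nonrel))               # [1.6375 0.36   0.    ]
```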

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix (rows t_i, columns d_j) and w_ij = (normalized) weight for (t_i, d_j).
For each t_i, pick terms with high values in C.

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How? Map XML documents to lexicalized subtrees

85

(Figure: an XML document, Book with Title "Microsoft" and Author "Bill Gates", is decomposed "with words" into lexicalized subtrees such as Microsoft, Bill Gates, Title–Microsoft, Author–Bill, Author–Gates, Book–Title–Microsoft, …)

CS3245 ndash Information Retrieval

Context resemblance
A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes
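For reference (the slide shows it only as an image), the context resemblance function as given in IIR Section 10.3 is CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise; this is consistent with CR(c1, c3) = 0.60 in the example on the next slide.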

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

query: <c1, t>   (e.g. author/"Bill")

Inverted index (dictionary → postings):
  <c1, t> (e.g. author/"Bill"): <d1, 0.5>, <d4, 0.1>, <d9, 0.2>
  <c2, t> (e.g. title/"Bill"): <d2, 0.25>, <d3, 0.1>, <d12, 0.09>
  <c3, t> (e.g. book/author/firstname/"Bill"): <d3, 0.7>, <d6, 0.8>, <d9, 0.5>

This example is slightly different from the book. All weights have been normalized.

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.60

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each term is Context Resemblance × Query Term Weight × Document Term Weight)

OK to ignore Query vs <c2, t> since CR = 0.0; only Query vs <c1, t> and Query vs <c3, t> contribute.

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11
1. Probabilistic Approach to Retrieval: Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)
Chapter 12
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors. E.g. document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.

Independence: no association between terms (not true, but works in practice – naïve assumption)

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios for the terms in the query, c_t.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 − u_t).

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that the retrieval status value RSV_d is the sum of the c_t of the query terms present in the document. Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms.
tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
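The BM25 formula itself is shown as an image on the slide; as a hedged sketch, the usual document-side scoring function with parameters k1 and b (the k3 query-side weighting is omitted, as for a short query) can be written as:

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N, k1=1.2, b=0.75):
    """Sum over query terms of idf * ((k1+1)*tf) / (k1*(1 - b + b*L_d/L_avg) + tf)."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        idf = math.log10(N / df[t])
        norm = k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf
        score += idf * (k1 + 1) * tf / norm
    return score

# toy collection statistics (assumed): 1000 docs, average length 100 tokens
df = {"home": 400, "sales": 300, "july": 50}
doc_tf = {"home": 3, "sales": 2, "july": 1}
print(round(bm25_score(["home", "sales", "july"], doc_tf, doc_len=120,
                       avg_doc_len=100, df=df, N=1000), 3))
```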

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way.
Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents, so ignore. P(d) is the prior – often treated as the same for all d.
But we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
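As a hedged sketch of the standard Jelinek-Mercer mixture used here, P(q|d) = Π_t [λ·P(t|M_d) + (1−λ)·P(t|M_c)]; the two-document collection and λ = 0.5 below are illustrative assumptions.

```python
from collections import Counter

docs = {"d1": "xyzzy begins as information on".split(),
        "d2": "information retrieval information on the web".split()}
collection = [t for d in docs.values() for t in d]
coll_tf = Counter(collection)
lam = 0.5                                        # mixture weight (assumed)

def p_query_given_doc(query, doc_tokens):
    tf = Counter(doc_tokens)
    p = 1.0
    for t in query:
        p_doc  = tf[t] / len(doc_tokens)         # P(t | M_d)
        p_coll = coll_tf[t] / len(collection)    # P(t | M_c)
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

query = "information retrieval".split()
for d, tokens in docs.items():
    print(d, p_query_given_doc(query, tokens))   # d2 scores higher
```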

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202
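A minimal sketch of honoring such a file with Python's standard library (urllib.robotparser), parsing the example above from a literal string instead of fetching a real URL:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("mycrawler", "http://example.com/yoursite/temp/page.html"))     # False
print(rp.can_fetch("searchengine", "http://example.com/yoursite/temp/page.html"))  # True
print(rp.can_fetch("mycrawler", "http://example.com/public/index.html"))           # True
```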

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1 0 … 0))
2. After one step, we're at xA
3. After two steps at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
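A minimal power-iteration sketch in Python (the 3-page link graph and teleport rate are illustrative assumptions, not from the slides):

```python
import numpy as np

def pagerank(links, alpha=0.85, tol=1e-10):
    """Power iteration on the teleport-adjusted transition matrix A."""
    n = links.shape[0]
    out_deg = links.sum(axis=1, keepdims=True)
    P = np.where(out_deg > 0, links / np.where(out_deg == 0, 1, out_deg), 1.0 / n)
    A = alpha * P + (1 - alpha) / n           # teleport with probability 1 - alpha
    x = np.full(n, 1.0 / n)
    while True:
        x_next = x @ A
        if np.abs(x_next - x).sum() < tol:    # steady state: x A^k stops changing
            return x_next
        x = x_next

# links[i][j] = 1 if page i links to page j (assumed toy web graph)
links = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [0, 1, 0]], dtype=float)
print(pagerank(links).round(3))               # pagerank vector a, entries sum to 1
```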

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldn't work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search Ads: Goto (1996)
  • 1st Generation of search ads: Goto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Google's second price auction
  • Google's second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di), S(dj)) ≅ P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary & Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index Compression: Dictionary-as-a-String and Blocking
  • Front coding
  • Postings Compression: Postings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105



functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103 Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 13: Search Advertising, Duplicate Detection and Revision

CS3245 ndash Information Retrieval

How are ads ranked? Advertisers bid for keywords – sale by auction. Open system: anybody can participate and bid on keywords. Advertisers are only charged when somebody clicks on their ad (i.e. CPC).

How does the auction determine an ad's rank and the price paid for the ad?

Basis is a second price auction but with twists For the bottom line this is perhaps the most important

research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means

billions of additional revenue for the search engine

13

Sec 193

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction: The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) – related to the Vickrey Auction.

price1 × CTR1 = bid2 × CTR2 (this will result in rank1 = rank2)

price1 = bid2 × CTR2 / CTR1

p1 = bid2 × CTR2/CTR1 = 3.00 × 0.03/0.06 = 1.50; p2 = bid3 × CTR3/CTR2 = 1.00 × 0.08/0.03 = 2.67; p3 = bid4 × CTR4/CTR3 = 4.00 × 0.01/0.08 = 0.50

16

Sec 193
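The same arithmetic as the worked example above, as a small sketch. Note that the top advertiser's own bid never enters the price formula, so bid1 is left as a placeholder.

    def gsp_prices(bids, ctrs):
        """Generalized second-price: advertiser i pays the minimum needed to keep
        its rank (ignoring the extra cent): price_i = bid_{i+1} * CTR_{i+1} / CTR_i.
        Inputs are already sorted by ad rank = bid * CTR."""
        return [round(bids[i + 1] * ctrs[i + 1] / ctrs[i], 2)
                for i in range(len(bids) - 1)]

    bids = [None, 3.00, 1.00, 4.00]   # bid1 is never used in the prices below
    ctrs = [0.06, 0.03, 0.08, 0.01]   # CTR1..CTR4
    print(gsp_prices(bids, ctrs))     # [1.5, 2.67, 0.5]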

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win: Violation of trademarks. Example: geico. During part of 2005, the search term "geico" on Google was bought by competitors. Geico lost this case in the United States; Louis Vuitton lost a similar case in Europe.

http://support.google.com/adwordspolicy/answer/6118?rd=1: It's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site.

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATE DETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detection: The web is full of duplicated content – more so than many other collections.

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

E.g., two documents are near-duplicates if similarity > θ = 80%.

24

Sec 196

CS3245 ndash Information Retrieval

Recall the Jaccard coefficient: a commonly used measure of the overlap of two sets. Let A and B be two sets; the Jaccard coefficient is JACCARD(A, B) = |A ∩ B| / |A ∪ B|.

JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅. A and B don't have to be the same size. Always assigns a number between 0 and 1.

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0. Note: very sensitive to dissimilarity.

26

Sec 196
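A short sketch that reproduces the example above from 2-gram shingles; the whitespace tokenization and lowercasing are simplifying assumptions.

    def shingles(text, n=2):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0

    d1 = "Jack London traveled to Oakland"
    d2 = "Jack London traveled to the city of Oakland"
    d3 = "Jack traveled from Oakland to London"
    print(jaccard(shingles(d1), shingles(d2)))   # 0.375 = 3/8
    print(jaccard(shingles(d1), shingles(d3)))   # 0.0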

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting: We can map shingles to a large integer space [1..2^m] (e.g. m = 64) by fingerprinting. We use s_k to refer to the shingle's fingerprint in 1..2^m.

This doesn't directly help us – we are just converting strings to large integers.

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches: The number of shingles per document is large – difficult to exhaustively compare. To make it fast, we use a sketch, a subset of the shingles of a document. The size of a sketch is, say, n = 200, and is defined by a set of permutations π1, …, π200. Each πi is a random permutation on 1..2^m.

The sketch of d is defined as < min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) > (a vector of 200 numbers).

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?" In this case, permutation π says d1 ≈ d2.

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(d_i), S(d_j)) ≅ P(x_i = x_j)

Let C00 = number of rows in A where both entries are 0; define C11, C10, C01 likewise.

J(S_i, S_j) is then equivalent to C11 / (C10 + C01 + C11); P(x_i = x_j) is likewise equivalent to C11 / (C10 + C01 + C11).

31

(Figure: an example 0/1 matrix A, one row per shingle hash and one column per shingle set S_i, S_j; for the rows shown, J(S_i, S_j) = 2/5. Under random permutations π(1), π(2), π(3), …, π(n=200), some permutations give x_i ≠ x_j and some give x_i = x_j.)

We view a matrix A with 1 column per set of hashes. Element A_ij = 1 if element i in set S_j is present. Permutation π(n) is a random reordering of the rows in A. x_k is the first non-zero entry in π(d_k), i.e. the first shingle present in document k.

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient. Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s).

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial.

Estimator of the probability of success: the proportion of successes in n Bernoulli trials (n = 200).

Our sketch is based on a random selection of permutations.

Thus, to compute Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200:

k/n = k/200 estimates J(d1, d2). 32

Sec 196
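A minimal sketch of the estimation procedure just described; true random permutations are simulated with random hash functions, which is the usual practical stand-in, and the two shingle sets are the Jack London example from earlier.

    import random

    def sketch(shingle_set, hashers):
        """MinHash sketch: one minimum per simulated permutation."""
        return [min(h(s) for s in shingle_set) for h in hashers]

    def estimate_jaccard(sk1, sk2):
        agree = sum(1 for a, b in zip(sk1, sk2) if a == b)
        return agree / len(sk1)          # k successes out of n trials

    random.seed(42)
    M = 2 ** 64
    def make_hasher():
        a, b = random.randrange(1, M), random.randrange(M)
        return lambda s: (a * hash(s) + b) % M
    hashers = [make_hasher() for _ in range(200)]   # stands in for pi_1..pi_200

    s1 = {"jack london", "london traveled", "traveled to", "to oakland"}
    s2 = {"jack london", "london traveled", "traveled to", "to the",
          "the city", "city of", "of oakland"}
    print(estimate_jaccard(sketch(s1, hashers), sketch(s2, hashers)))  # close to 3/8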

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via the NUS EduRec System by 20 May 2019 to get exam results earlier via SMS on 3 Jun 2019.

For more information: https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability): "I don't want"
P(Aerosmith): 0.11 × 0.11 × 0.11 = 1.3E-3
P(LadyGaga): 0.15 × 0.05 × 0.15 = 1.1E-3
Winner: Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines
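A small sketch of the add-1 computation using the count tables above, assuming the first table is the Aerosmith model and the second the Lady Gaga model (as the declared winner suggests); the exact probabilities differ slightly from the slide's rounded per-term values.

    def add_one_prob(term, counts, total, vocab_size):
        return (counts.get(term, 0) + 1) / (total + vocab_size)

    def score(query, counts, total, vocab_size=11):
        p = 1.0
        for t in query.split():
            p *= add_one_prob(t, counts, total, vocab_size)
        return p

    aerosmith = {"i": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
                 "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}   # total 18
    ladygaga  = {"i": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
                 "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}   # total 20
    q = "i don't want"
    print(score(q, aerosmith, 18), score(q, ladygaga, 20))   # Aerosmith scores higher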

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Intersection: 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID.

Sec 13

45
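A sketch of the linear merge described above, reproducing the Brutus/Caesar intersection.

    def intersect(p1, p2):
        """Linear-time merge of two docID-sorted postings lists: O(x + y)."""
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    brutus = [2, 4, 8, 16, 32, 64, 128]
    caesar = [1, 2, 3, 5, 8, 13, 21, 34]
    print(intersect(brutus, caesar))   # [2, 8]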

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

47

(Example: postings 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers to 41 and 128; postings 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers to 11 and 31.)

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; …>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary: Hash, Trees

Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic – Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term, e.g. co*tion?

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!

The solution: transform wild-card queries so that the * always occurs at the end. This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering.

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332
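A standard dynamic-programming sketch of edit distance, the first of the "closest word" alternatives listed above.

    def edit_distance(s, t):
        """Levenshtein distance with unit costs."""
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[m][n]

    print(edit_distance("kitten", "sitting"))   # 3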

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix – store differences only. Used for the last k-1 terms in a block of k:

8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a1◊e2◊ic3◊ion

("8automat*" encodes the shared prefix "automat"; each number gives the extra length beyond "automat".)

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID: computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs: 824, 829, 215406
gaps: —, 5, 214577
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
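A sketch of VB encoding and decoding that reproduces the byte strings above (continuation bit set on the last byte of each gap).

    def vb_encode_number(n):
        bytes_ = []
        while True:
            bytes_.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        bytes_[-1] += 128            # set the continuation bit on the final byte
        return bytes_

    def vb_encode(gaps):
        out = []
        for g in gaps:
            out.extend(vb_encode_number(g))
        return out

    def vb_decode(bytestream):
        numbers, n = [], 0
        for b in bytestream:
            if b < 128:
                n = 128 * n + b
            else:
                numbers.append(128 * n + (b - 128))
                n = 0
        return numbers

    gaps = [824, 5, 214577]
    print([format(b, "08b") for b in vb_encode(gaps)])
    # ['00000110', '10111000', '10000101', '00001101', '00001100', '10110001']
    print(vb_decode(vb_encode(gaps)))   # [824, 5, 214577]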

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.

Increases with the number of occurrences within a document.

Increases with the rarity of the term in the collection. 67

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Sec 622
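The weight formula above as a one-function sketch; the example term statistics are hypothetical numbers chosen only to show the call.

    import math

    def tf_idf_weight(tf_td, df_t, N):
        """w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
        if tf_td == 0 or df_t == 0:
            return 0.0
        return (1 + math.log10(tf_td)) * math.log10(N / df_t)

    # e.g. a term occurring 3 times in a doc and in 100 of N = 1,000,000 docs:
    print(tf_idf_weight(3, 100, 1_000_000))   # (1 + log10 3) * 4 ≈ 5.91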

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): cos(q, d) = q · d = Σ_i q_i d_i,

for length-normalized q and d.

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
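The slide above refers to the CosineScore procedure; here is a sketch of the usual accumulator version, with data-structure names that are assumptions rather than the course's exact code.

    def cosine_scores(query_weights, postings, doc_lengths, k=10):
        """Term-at-a-time scoring with accumulators, then length-normalize and take top K.
        postings[t] is a list of (docID, tf-idf weight) pairs; doc_lengths[d] is |d|."""
        scores = {}
        for t, w_tq in query_weights.items():
            for d, w_td in postings.get(t, []):
                scores[d] = scores.get(d, 0.0) + w_tq * w_td
        for d in scores:
            scores[d] /= doc_lengths[d]
        return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]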

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruning non-contenders.

The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711

(Figure: the contender set A sits between the top K documents and the full collection of N documents.)

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

(Figure: a precision-recall curve, with precision on the y-axis and recall on the x-axis, both ranging from 0.0 to 1.0.)

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: an agreement measure among judges. Designed for categorical judgments; corrects for chance agreement.

Kappa (K) = [P(A) − P(E)] / [1 − P(E)]. P(A) – proportion of the time judges agree; P(E) – what agreement would be by chance.

Gives 0 for chance agreement, 1 for total agreement.
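A small sketch of the computation; P(E) here is estimated from each judge's own marginals (Cohen's variant), whereas the textbook's worked example pools the marginals of both judges. The judgment lists are made-up binary relevance calls.

    def kappa(judgments_a, judgments_b):
        """Kappa for two judges making binary relevant(1)/nonrelevant(0) calls."""
        n = len(judgments_a)
        p_agree = sum(a == b for a, b in zip(judgments_a, judgments_b)) / n
        pa = sum(judgments_a) / n            # judge A's proportion of 'relevant'
        pb = sum(judgments_b) / n            # judge B's proportion of 'relevant'
        p_chance = pa * pb + (1 - pa) * (1 - pb)
        return (p_agree - p_chance) / (1 - p_chance)

    a = [1, 1, 0, 1, 0, 0, 1, 1]
    b = [1, 0, 0, 1, 0, 1, 1, 1]
    print(kappa(a, b))   # 0 would mean chance-level agreement, 1 total agreement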

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

q_m = α q_0 + β (1/|Dr|) Σ_{d_j ∈ Dr} d_j − γ (1/|Dnr|) Σ_{d_j ∈ Dnr} d_j

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.

α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton).
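A sketch of the Rocchio update on sparse dictionary vectors; the α, β, γ defaults and the toy query/document vectors are hypothetical choices, not values from the slides.

    def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
        """Rocchio relevance feedback on sparse term -> weight vectors (dicts)."""
        qm = {t: alpha * w for t, w in q0.items()}
        for docs, coeff in ((rel_docs, beta), (nonrel_docs, -gamma)):
            if not docs:
                continue
            for d in docs:                         # add/subtract the centroid of docs
                for t, w in d.items():
                    qm[t] = qm.get(t, 0.0) + coeff * w / len(docs)
        return {t: w for t, w in qm.items() if w > 0}   # negative weights usually dropped

    q0 = {"cheap": 1.0, "flights": 1.0}
    rel = [{"cheap": 0.5, "flights": 0.8, "airfare": 0.6}]
    nonrel = [{"cheap": 0.9, "hotels": 0.7}]
    print(rocchio(q0, rel, nonrel))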

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: The simplest way to compute one is based on term-term similarities in C = AA^T, where A is the M × N term-document matrix (M terms t_i, N documents d_j) and w_ij = the (normalized) weight for (t_i, d_j).

For each t_i, pick the terms with high values in C.

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

(Figure: an XML document with root Book and children Title ("Microsoft") and Author ("Bill Gates") is mapped to its lexicalized subtrees, e.g. Microsoft, Bill Gates, Title–Microsoft, Author–Bill, Author–Gates, Author–"Bill Gates", Book–Title–Microsoft, and so on – the tree structure "with words".)

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR: CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise.

|cq| and |cd| are the number of nodes in the query path and document path, respectively. cq matches cd iff we can transform cq into cd by inserting additional nodes.

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t> (e.g. author/"Bill"). Inverted index dictionary: <c1, t> (e.g. author/"Bill"), <c2, t> (e.g. title/"Bill"), <c3, t> (e.g. book/author/firstname/"Bill"), with postings:
<c1, t>: <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t>: (its postings can be ignored below, since CR = 0.0)
<c3, t>: <d3, 0.7> <d6, 0.8> <d9, 0.5>
(This example is slightly different from the book. All weights have been normalized.)

Context resemblance: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60. It is OK to ignore Query vs <c2, t> since CR = 0.0.

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5 (context resemblance × query term weight × document term weight).

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing the RSV. We can rank documents using the log odds ratios for the terms in the query, c_t.

The odds ratio is the ratio of two odds:

(i) the odds of the term appearing if relevant (p_t / (1 − p_t)), and

(ii) the odds of the term appearing if nonrelevant (u_t / (1 − u_t)).

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q∩d} c_t. Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91


Page 14: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads

Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =

clicks per impression Result A nonrelevant ad will be ranked low

Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is

maximized if users get useful information

14

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what

percentage of time do users click on it CTR is a measure of relevance

ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is

rank rank in auction paid second price auction price paid by advertiser

15

Sec 193

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing: 2nd simplest approach
Maintain "big" main index; new docs go into "small" (in memory) auxiliary index
Search across both, merge results
Deletions: Invalidation bit-vector for deleted docs; filter docs output on a search result by this invalidation bit-vector
Periodically re-index into one main index
Assuming T = total # of postings and n = size of auxiliary index, we touch each posting up to floor(T/n) times
Alternatively we can apply logarithmic merge, in which each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge: Idea: maintain a series of indexes, each twice as large as the previous one
Keep smallest (Z0) in memory; larger ones (I0, I1, …) on disk
If Z0 gets too big (> n), write to disk as I0, or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2
… etc.

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws

Compression to make index smaller, faster
Dictionary compression for Boolean indexes: Dictionary string, blocks, front coding

Postings compression: Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k=4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers

Lose 4 bytes on term lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have long common prefix – store differences only
Used for last k-1 terms in a block of k:
8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a1◊e2◊ic3◊ion

(Encodes "automat"; extra length beyond "automat")

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID:
computer: 33, 47, 154, 159, 202 …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43 …

Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:  824                 829       215406
gaps:                        5         214577
VB code: 00000110 10111000   10000101  00001101 00001100 10110001

65

Postings stored as the byte concatenation:
00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable

For a small gap (5), VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
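A sketch of VB encoding/decoding that reproduces the byte strings above: gaps relative to the previous docID, 7 payload bits per byte, and the high bit set on the last byte of each gap.

```python
# Sketch: Variable Byte (VB) encoding of docID gaps, matching the example above.
def vb_encode_number(n):
    bytes_out = []
    while True:
        bytes_out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_out[-1] += 128          # set the high bit on the last byte
    return bytes_out

def vb_encode(doc_ids):
    stream, prev = [], 0
    for d in doc_ids:
        stream.extend(vb_encode_number(d - prev))   # encode the gap
        prev = d
    return stream

def vb_decode(stream):
    doc_ids, n, prev = [], 0, 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte
        else:
            n = 128 * n + (byte - 128)
            prev += n             # gap -> absolute docID
            doc_ids.append(prev)
            n = 0
    return doc_ids

encoded = vb_encode([824, 829, 215406])
print(" ".join(f"{b:08b}" for b in encoded))
# 00000110 10111000 10000101 00001101 00001100 10110001
print(vb_decode(encoded))   # [824, 829, 215406]
```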

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g., K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection

67

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Sec 622
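A direct transcription of the weighting formula above (log base 10); the function name and example numbers are illustrative only.

```python
# Sketch: the tf-idf weight from the formula above.
import math

def tf_idf_weight(tf_td, df_t, N):
    """w(t,d) = (1 + log10 tf) * log10(N / df); zero if the term is absent."""
    if tf_td == 0 or df_t == 0:
        return 0.0
    return (1 + math.log10(tf_td)) * math.log10(N / df_t)

# Term occurring 3 times in a doc, appearing in 100 of 1,000,000 docs:
print(tf_idf_weight(3, 100, 1_000_000))   # ~5.91
```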

CS3245 ndash Information Retrieval

Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents"
Key idea 2: Rank documents according to their proximity to the query in this space

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_i q_i d_i,  for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute
Approximating the actual correct results
Skipping unnecessary documents
In actual data: dealing with zones and fields, query term proximity

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
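The algorithm figure referenced on this slide is not reproduced in the transcript; below is a minimal term-at-a-time sketch in its spirit, assuming postings of (docID, normalized weight) and pre-normalized query weights (illustrative structures, not the course skeleton).

```python
# Sketch: term-at-a-time cosine scoring with accumulators and a top-K selection.
import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, K=10):
    """query_weights: term -> query weight (already length-normalized).
       postings: term -> list of (docID, length-normalized doc weight)."""
    scores = defaultdict(float)                 # accumulators
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_tq * w_td       # add this term's contribution
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])

postings = {"cat": [(1, 0.8), (2, 0.3)], "dog": [(2, 0.9)]}
print(cosine_scores({"cat": 0.7, "dog": 0.7}, postings, K=2))
# doc 2 scores ~0.84, doc 1 scores ~0.56
```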

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N
A does not necessarily contain the top K, but has many docs from among the top K
Return the top K docs in A

Think of A as pruning non-contenders

The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics: Index Elimination: High-idf query terms only; Docs containing many query terms

Champion lists: Generalized to high-low lists and tiered index

Impact ordering postings: Early termination, idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q,d) = g(d) + cosine(q,d)

Can use some other linear combination than an equal weighting

Indeed, any function of the two "signals" of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: Year = 1601 is an example of a field
Field or parametric index: postings for each field value
Sometimes build range (B-tree) trees (e.g., for dates)
Field query typically treated as conjunction (doc must be authored by Shakespeare)

Zone: A zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, …
Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9: IR Evaluation
How do we know if our results are any good? Benchmarks, Precision and Recall, Composite measures, Test collection

A/B Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: precision-recall curve – Precision (y-axis, 0.0 to 1.0) plotted against Recall (x-axis, 0.0 to 1.0).]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: Agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement.

Kappa (K) = [P(A) - P(E)] / [1 - P(E)]
P(A) – proportion of time judges agree
P(E) – what agreement would be by chance

Gives 0 for chance agreement, 1 for total agreement
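A small worked sketch of the kappa computation for two judges making binary relevance judgments; the judgments are made up for illustration, and P(E) is taken from the judges' pooled marginals (one common convention).

```python
# Sketch: kappa for two judges over binary (relevant / nonrelevant) judgments.
def kappa(judge_a, judge_b):
    n = len(judge_a)
    p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n   # P(A)
    # Chance agreement estimated from the judges' pooled marginals.
    p_rel = (judge_a.count(1) + judge_b.count(1)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2                      # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

# Hypothetical judgments on 10 documents (1 = relevant, 0 = nonrelevant):
a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
print(round(kappa(a, b), 3))   # 0.583
```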

CS3245 ndash Information Retrieval

A/B testing
Purpose: Test a single innovation
Prerequisite: You have a large system with many users
Have most users use old system
Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation
Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on first result
Now we can directly see if the innovation does improve user happiness
Probably the evaluation methodology that large search engines trust most

80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3: XML IR: Basic XML concepts, Challenges in XML IR, Vector space model for XML IR, Evaluation of XML IR

Chapter 9: Query Refinement
1. Relevance Feedback

Document Level

Explicit RF – Rocchio (1971). When does it work?

Variants: Implicit and Blind

2. Query Expansion: Term Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j

D_r = set of known relevant doc vectors
D_nr = set of known irrelevant doc vectors
Different from C_r and C_nr, as we only get judgments from a few documents
α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
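A numpy sketch of the practical Rocchio update above; the α, β, γ values and the vectors are illustrative only.

```python
# Sketch: Rocchio relevance feedback update on tf-idf vectors.
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    q = alpha * q0
    if relevant:
        q += beta * np.mean(relevant, axis=0)       # centroid of D_r
    if nonrelevant:
        q -= gamma * np.mean(nonrelevant, axis=0)   # centroid of D_nr
    return np.maximum(q, 0)   # negative weights are usually clipped to 0

q0 = np.array([1.0, 0.0, 0.5])
dr = [np.array([0.8, 0.4, 0.0]), np.array([0.6, 0.6, 0.2])]
dnr = [np.array([0.0, 0.9, 0.9])]
print(rocchio(q0, dr, dnr))   # reweighted query vector
```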

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents

In query expansion, users give additional input (good/bad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the term-document matrix
w_ij = (normalized) weight for (t_i, d_j)

For each t_i, pick terms with high values in C

[Figure: the M × N term-document matrix A, with row t_i and column d_j.]

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
Aim to have each dimension of the vector space encode a word together with its position within the XML tree

How? Map XML documents to lexicalized subtrees

85

[Figure: an XML document Book(Title: "Microsoft", Author: "Bill Gates") is decomposed into lexicalized subtrees "with words": Microsoft, Bill, Gates, Title–Microsoft, Author–Bill, Author–Gates, Book–Title–Microsoft, …]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:

CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|) if c_q matches c_d, and 0 otherwise

|c_q| and |c_d| are the number of nodes in the query path and document path, respectively

c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>   (e.g., author/"Bill")

Inverted index (dictionary → postings):
<c1, t>  →  <d1, 0.5>  <d4, 0.1>  <d9, 0.2>      (e.g., author/"Bill")
<c2, t>  →  <d2, 0.25> <d3, 0.1>  <d12, 0.09>    (e.g., title/"Bill")
<c3, t>  →  <d3, 0.7>  <d6, 0.8>  <d9, 0.5>      (e.g., book/author/firstname/"Bill")

This example is slightly different from the book

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.6

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(Context Resemblance × Query Term Weight × Document Term Weight; all weights have been normalized)

OK to ignore Query vs <c2, t> since CR = 0.0
Query vs <c1, t>; Query vs <c3, t>

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of the quantized relevance values Q over the components in A.

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11
1. Probabilistic Approach to Retrieval: Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions:
"Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – naïve assumption)

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: We can rank documents using the log odds ratios for the terms in the query, c_t:

c_t = log [ (p_t / (1 − p_t)) / (u_t / (1 − u_t)) ]

The odds ratio is the ratio of two odds:
(i) the odds of the term appearing if relevant ( p_t / (1 − p_t) ), and
(ii) the odds of the term appearing if nonrelevant ( u_t / (1 − u_t) )

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents

c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q ∩ d} c_t

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms

tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
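The BM25 formula itself is not reproduced on this slide; the sketch below gives the standard document-side scoring function with the usual k1 and b parameters, for reference (names and numbers are illustrative).

```python
# Sketch: Okapi BM25 score of a document for a query (document-side weighting).
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N,
               k1=1.2, b=0.75):
    """doc_tf: term -> tf in this document; df: term -> document frequency."""
    score = 0.0
    for t in query_terms:
        if t not in doc_tf or t not in df:
            continue
        idf = math.log10(N / df[t])
        tf = doc_tf[t]
        norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)  # length normalization
        score += idf * (k1 + 1) * tf / (norm + tf)
        # For a long query, each term would additionally be multiplied by
        # (k3 + 1) * tf_tq / (k3 + tf_tq), as described above.
    return score

doc_tf = {"information": 3, "retrieval": 2}
print(bm25_score(["information", "retrieval"], doc_tf, doc_len=120,
                 avg_doc_len=100, df={"information": 500, "retrieval": 300},
                 N=10_000))
```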

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: In either case you build an information retrieval scheme in the exact same way
Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR
Each document is treated as (the basis for) a language model
Given a query q, rank documents based on P(d|q)

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, so ignore
P(d) is the prior – often treated as the same for all d
But we can give a prior to "high-quality" documents, e.g., those with high static quality score g(d) (cf. Section 7.1.4)
P(q|d) is the probability of q given d
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model: Summary

P(q|d) ∝ Π_{t ∈ q} ( λ P(t|M_d) + (1 − λ) P(t|M_c) )

What we model: The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
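A sketch of query-likelihood scoring with the linear mixture above; λ and the counts are illustrative, and the component probabilities are simple maximum-likelihood estimates.

```python
# Sketch: query-likelihood ranking with Jelinek-Mercer (mixture) smoothing.
import math

def lm_score(query_terms, doc_counts, doc_len, coll_counts, coll_len, lam=0.5):
    """log P(q|d) with P(t|Md) mixed with the collection model P(t|Mc)."""
    log_p = 0.0
    for t in query_terms:
        p_doc = doc_counts.get(t, 0) / doc_len        # P(t | Md), MLE
        p_coll = coll_counts.get(t, 0) / coll_len     # P(t | Mc), MLE
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0:
            return float("-inf")   # term unseen everywhere
        log_p += math.log(p)
    return log_p

doc = {"click": 2, "through": 1}
collection = {"click": 50, "through": 40, "rate": 30}
print(lm_score(["click", "rate"], doc, doc_len=10,
               coll_counts=collection, coll_len=10_000))
```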

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling
Politeness: need to crawl only the allowed pages and space out the requests for a site over a longer period (hours, days)

Duplicates: need to integrate duplicate detection

Scale: need to use a distributed approach

Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994
Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
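A minimal power-iteration sketch matching the steps above, for a small hypothetical transition matrix A (teleporting assumed to be folded in already).

```python
# Sketch: power iteration x, xA, xA^2, ... until the distribution stabilizes.
import numpy as np

def pagerank_power_iteration(A, tol=1e-10, max_iter=1000):
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                 # start with any distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A         # one step of the Markov chain
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

# Toy 3-page transition matrix (rows sum to 1, teleporting already included):
A = np.array([[0.10, 0.80, 0.10],
              [0.45, 0.10, 0.45],
              [0.10, 0.80, 0.10]])
print(pagerank_power_iteration(A).round(3))   # steady-state distribution a
```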

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: Given graph of links, build matrix A. From it compute a. The pagerank a_i is a scaled number between 0 and 1

Query processing: Retrieve pages meeting query. Rank them by their pagerank. Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives
You have learnt how to build your own search engine

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 16: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Googlersquos second price auction

Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction

price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)

price1 = bid2 times CTR2 CTR1

p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050

16

Sec 193

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detection: The web is full of duplicated content, more so than many other collections

Exact duplicates: Easy to detect: use hash/fingerprint (e.g., MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates: Compute similarity with an edit-distance measure. We want "syntactic" (as opposed to semantic) similarity. True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call "is/isn't a near-duplicate"

E.g., two documents are near-duplicates if similarity > θ = 80%

24

Sec 196

CS3245 ndash Information Retrieval

Recall: Jaccard coefficient. A commonly used measure of overlap of two sets. Let A and B be two sets; the Jaccard coefficient is JACCARD(A, B) = |A ∩ B| / |A ∪ B|

JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅. A and B don't have to be the same size. Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example: Three documents: d1: "Jack London traveled to Oakland"

d2: "Jack London traveled to the city of Oakland"

d3: "Jack traveled from Oakland to London"

Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?

J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0. Note: very sensitive to dissimilarity

26

Sec 196
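A small Python sketch of the computation above, building bigram shingles and taking the set overlap:

def shingles(text, n=2):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

d1 = shingles("Jack London traveled to Oakland")
d2 = shingles("Jack London traveled to the city of Oakland")
d3 = shingles("Jack traveled from Oakland to London")
print(jaccard(d1, d2), jaccard(d1, d3))   # 0.375 and 0.0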

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents. For example, for n = 3, "a rose is a rose is a rose"

would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose }

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting: We can map shingles to a large integer space

[1 .. 2^m] (e.g., m = 64) by fingerprinting. We use sk to refer to the shingle's fingerprint in [1 .. 2^m]

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches: The number of shingles per document is large,

difficult to exhaustively compare. To make it fast, we use a sketch, a subset of the

shingles of a document. The size of a sketch is, say, n = 200 and is defined by a

set of permutations π1, …, π200. Each πi is a random permutation on [1 .. 2^m]

The sketch of d is defined as < min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) >

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element: a permutation of the original hashes

[Figure: the hash sets of document 1 and document 2, each reduced to one sketch element under permutation π]

We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?" In this case, permutation π says d1 ≈ d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di), S(dj)) ≅ P(xi = xj)

Let C00 = # of rows in A where both entries are 0; define C11, C10, C01 likewise

J(Si, Sj) is then equivalent to C11 / (C10 + C01 + C11); P(xi = xj) is then equivalent to C11 / (C10 + C01 + C11)

31

[Figure: the incidence matrix A with its two columns Si and Sj, and four random row permutations π(1), π(2), π(3), π(n=200); under two of the permutations xi ≠ xj and under two xi = xj. In this example J(Si, Sj) = 2/5.]

We view a matrix A: 1 column per set of hashes. Element Aij = 1 if element i in

set Sj is present. Permutation π(n) is a random

reordering of the rows in A. xk is the first non-zero entry

in π(dk), i.e., the first shingle present in document k

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard: Thus, the proportion of successful permutations is the

Jaccard coefficient. Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success: proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus, to compute Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200

k/n = k/200 estimates J(d1, d2)

32

Sec 196
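An illustrative Python sketch of this estimator (not the production form: true random permutations are replaced by random linear hash functions, a common shortcut; the shingle fingerprints are toy values):

import random

def sketch(shingle_ids, seeds, m=2**61):
    # one "permutation" per (a, b) seed pair: pi(s) = (a*s + b) mod m
    return [min((a * s + b) % m for s in shingle_ids) for a, b in seeds]

def estimated_jaccard(sk1, sk2):
    return sum(1 for x, y in zip(sk1, sk2) if x == y) / len(sk1)

random.seed(0)
seeds = [(random.randrange(1, 2**61), random.randrange(2**61)) for _ in range(200)]
s1 = {1, 2, 3, 4, 5}          # fingerprints of d1's shingles (toy values)
s2 = {1, 2, 3, 7, 8}          # fingerprints of d2's shingles (toy values)
print(estimated_jaccard(sketch(s1, seeds), sketch(s2, seeds)))
# close to the true Jaccard 3/7 ≈ 0.43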

CS3245 ndash Information Retrieval

Shingling Summary: Input: N documents. Choose n-gram size for shingling, e.g., n = 5. Pick 200 random permutations, represented as hash

functions. Compute N sketches: the 200 × N matrix shown on

the previous slide, one row per permutation, one column per document

Compute pairwise similarities. Transitive closure of documents with similarity > θ. Index only one document from each equivalence

class

33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is true/false questions on various topics (20 marks)

Q2–Q9 are calculation / essay questions, topic specific (8~12 marks each). Don't forget your calculator!

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

e.g., assuming |V| = 11

Q2 (By Probability) "I don't want":
P(Aerosmith) = 0.11 × 0.11 × 0.11 = 1.3E-3
P(LadyGaga) = 0.15 × 0.05 × 0.15 = 1.1E-3
Winner: Aerosmith

42

I 2, eyes 2; don't 2, your 1; want 2, love 1; to 2, and 1; close 2, revenge 1; my 2 — Total Count 18

I 3, eyes 1; don't 1, your 3; want 3, love 2; to 1, and 2; close 1, revenge 2; my 1 — Total Count 20

|V| = Total # of word types in both lines
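A short Python sketch of the calculation above. It assumes (as the totals 18 and 20 suggest) that the table entries already include the +1 of add-1 smoothing, so each probability is simply count / total; the first table is taken to be the Aerosmith line and the second the Lady Gaga line.

def query_probability(counts, total, query_terms):
    p = 1.0
    for w in query_terms:
        p *= counts[w] / total     # add-1 already folded into the counts
    return p

aerosmith = {"i": 2, "don't": 2, "want": 2}    # relevant entries, total = 18
ladygaga  = {"i": 3, "don't": 1, "want": 3}    # relevant entries, total = 20
q = ["i", "don't", "want"]
print(query_probability(aerosmith, 18, q))     # ≈ 1.37e-3 (the 1.3E-3 above)
print(query_probability(ladygaga, 20, q))      # ≈ 1.1e-3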

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Intersection: 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID

Sec 13

45
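A minimal Python sketch of this linear-time intersection over docID-sorted postings lists:

def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]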

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

2 → 4 → 8 → 41 → 48 → 64 → 128, with skip pointers to 41 and 128
1 → 2 → 3 → 8 → 11 → 17 → 21 → 31, with skip pointers to 11 and 31

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase: a bigram model using words. For example, the text "Friends, Romans,

Countrymen" would generate the biwords: "friends romans", "romans countrymen"

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table: Each vocabulary term is hashed to an integer. Pros: Lookup is faster than for a tree: O(1)

Cons: No easy way to find minor variants:

judgment / judgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term? co*tion

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!

The solution: transform wild-card queries so that the * always occurs at the end

This gives rise to the Permuterm index. Alternative: k-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction: Isolated word correction: Given a lexicon and a character sequence Q, return the

words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:

1. Edit distance (Levenshtein distance)  2. Weighted edit distance  3. n-gram overlap

Context-sensitive spelling correction: Try all possible resulting phrases with one word

"corrected" at a time and suggest the one with the most hits

54

Sec 332
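A plain dynamic-programming sketch in Python of the first of those notions, Levenshtein edit distance:

def edit_distance(s, t):
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("judgement", "judgment"))   # 1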

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain a "big" main index. New docs go into a "small" (in memory) auxiliary index. Search across both, merge results. Deletions: Invalidation bit-vector for deleted docs. Filter docs output on a search result by this invalidation

bit-vector. Periodically re-index into one main index. Assuming T = total # of postings and n = size of the auxiliary

index, we touch each posting up to floor(T/n) times. Alternatively, we can apply logarithmic merge, in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge: Idea: maintain a series of indexes, each twice as large

as the previous one. Keep the smallest (Z0) in memory. Larger ones (I0, I1, …) on disk. If Z0 gets too big (> n), write to disk as I0,

or merge with I0 (if I0 already exists) as Z1

Either write/merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2

… etc. (loop for log levels)

Sec 45

60Information Retrieval


CS3245 ndash Information Retrieval

Week 6 Index Compression: Collection and vocabulary statistics: Heaps'

and Zipf's laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking. Store pointers to every kth term string. Example below: k = 4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers;
lose 4 bytes on term lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix –

store differences only. Used for the last k-1 terms in a block of k:
8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a1◊e2◊ic3◊ion

(Encodes "automat"; the numbers give the extra length beyond "automat")

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry. We store the list of docs containing a term in

increasing order of docID: computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:  824                 829        215406
gaps:                        5          214577
VB code: 00000110 10111000   10000101   00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
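A small Python sketch of VB encoding/decoding as used above: 7 payload bits per byte, with the high (continuation) bit set on the last byte of each gap.

def vb_encode_number(n):
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128            # set the high bit of the final byte
    return bytes_

def vb_decode(bytestream):
    numbers, n = [], 0
    for b in bytestream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
stream = [b for g in gaps for b in vb_encode_number(g)]
print(" ".join(format(b, "08b") for b in stream))
# 00000110 10111000 10000101 00001101 00001100 10110001
print(vb_decode(stream))         # [824, 5, 214577]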

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g., K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection

67

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): cos(q, d) = q · d

for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69
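A hedged Python sketch tying the last two slides together: (1 + log10 tf) document weights, idf-weighted query, length normalization, then cosine as a plain dot product. The exact weighting variant and the collection numbers below are illustrative only.

import math

def tf_weight(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf_weight(N, df):
    return math.log10(N / df)

def length_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {t: x / norm for t, x in vec.items()} if norm else vec

def cosine(q, d):
    # q and d are already length-normalized: cosine = dot product
    return sum(w * d.get(t, 0.0) for t, w in q.items())

N = 1000                                             # illustrative collection size
doc   = length_normalize({"car": tf_weight(3), "insurance": tf_weight(1)})
query = length_normalize({"car": idf_weight(N, 100), "insurance": idf_weight(N, 50)})
print(cosine(query, doc))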

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search Systems: Making the Vector Space Model more efficient to

compute: Approximating the actual correct results; Skipping unnecessary documents. In actual data: dealing with zones and fields, query

term proximity

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruning non-contenders

The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: precision-recall curve; precision on the y-axis vs. recall on the x-axis, both from 0.0 to 1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: Agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement

Kappa (K) = [P(A) - P(E)] / [1 - P(E)]. P(A) – proportion of time judges agree; P(E) – what agreement would be by chance

Gives 0 for chance agreement, 1 for total agreement
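A one-line Python transcription of that formula, with illustrative agreement numbers:

def kappa(p_agree, p_chance):
    return (p_agree - p_chance) / (1 - p_chance)

# e.g. judges agree on 80% of judgments; chance agreement would be 60%
print(kappa(0.80, 0.60))   # 0.5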

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new

system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion

(OEC) like clickthrough on first result. Now we can directly see if the innovation does improve user

happiness. Probably the evaluation methodology that large search

engines trust most

80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments

from a few documents. α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
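For reference, the Rocchio update the slide refers to (in the textbook's form), followed by a small numpy sketch; the α, β, γ values and vectors below are illustrative, not from the slide.

q_m = α·q_0 + β·(1/|Dr|)·Σ_{d_j ∈ Dr} d_j − γ·(1/|Dnr|)·Σ_{d_j ∈ Dnr} d_j

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    qm = alpha * q0
    if len(rel_docs):
        qm = qm + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        qm = qm - gamma * np.mean(nonrel_docs, axis=0)
    return np.clip(qm, 0, None)   # negative term weights are usually dropped

q0  = np.array([1.0, 0.0, 1.0])
dr  = np.array([[0.8, 0.4, 0.0], [0.6, 0.6, 0.2]])
dnr = np.array([[0.0, 0.9, 0.1]])
print(rocchio(q0, dr, dnr))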

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: The simplest way to compute one is based on term-term similarities in C = AA^T, where A is the term-document matrix: w_ij = (normalized) weight for (t_i, d_j)

For each t_i, pick terms with high values in C

[Figure: the M × N term-document matrix A, rows t_i, columns d_j]

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

[Figure: an XML document (a Book with Title "Microsoft" and Author "Bill Gates") mapped to its lexicalized subtrees, each subtree keeping the words at its leaves ("with words")]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path

cd in a document is the following context resemblance function CR:

|cq| and |cd| are the number of nodes in the query path and document path, respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes
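The function itself (as defined in the textbook, IIR Ch. 10):

CR(cq, cd) = (1 + |cq|) / (1 + |cd|)   if cq matches cd,   and 0 otherwise

For example, for cq = author/"Bill" (2 nodes) and cd = book/author/firstname/"Bill" (4 nodes), CR = 3/5 = 0.6, which is the 0.60 used in the next slide's example.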

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

query: <c1, t>

Inverted index dictionary: <c1, t>, <c2, t>, <c3, t>

postings:
<c1, t> → <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> → <d2, 0.25> <d3, 0.1> <d12, 0.9>
<c3, t> → <d3, 0.7> <d6, 0.8> <d9, 0.5>

(This example is slightly different from the book; all weights have been normalized.)

CR(c1, c1) = 1.0    CR(c1, c2) = 0.0    CR(c1, c3) = 0.60

(e.g., the query context and <c1, t> are author/"Bill", <c2, t> is title/"Bill", <c3, t> is book/author/firstname/"Bill")

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(Context Resemblance × Query Term Weight × Document Term Weight)

OK to ignore Query vs. <c2, t> since CR = 0.0; compute Query vs. <c1, t> and Query vs. <c3, t>

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): Traditionally used with the PRP

Assumptions: "Binary" (equivalent to Boolean): documents and

queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation

"Independence": no association between terms (not true, but works in practice – naïve assumption)

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios c_t for the terms in the query:

c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]

The odds ratio is the ratio of two odds:

(i) the odds of the term appearing if relevant, p_t / (1 − p_t), and

(ii) the odds of the term appearing if nonrelevant, u_t / (1 − u_t)

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents

c_t functions as a term weight, so that RSV_d = Σ_{t : x_t = q_t = 1} c_t

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model. If the query is long, we might also use similar weighting for query

terms:

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the

query. No length normalization of queries

(because retrieval is being done with respect to a single fixed query). The above tuning parameters should be set by optimization on a

development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
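For reference, the full BM25 formula with query-term weighting (in the textbook's notation; L_d is the document length and L_ave the average document length):

RSV_d = Σ_{t ∈ q} log(N / df_t) · [ (k1 + 1)·tf_td ] / [ k1·((1 − b) + b·(L_d / L_ave)) + tf_td ] · [ (k3 + 1)·tf_tq ] / [ k3 + tf_tq ]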

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and

'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in

the exact same way. Difference: for probabilistic IR, at the end you score

queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access

to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1 0 … 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
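A small power-iteration sketch in Python; the 3-page transition matrix A below is illustrative (rows sum to 1), not from the lecture.

import numpy as np

A = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.1, 0.5],
              [0.5, 0.4, 0.1]])

x = np.array([1.0, 0.0, 0.0])       # start with any distribution
for _ in range(100):                # multiply by increasing powers of A
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-12):
        break
    x = x_next
print(x)                            # the steady-state vector a (the pagerank)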

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 17: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim

17

Sec 193

httpwwwwordstreamcomarticlesmost-expensive-keywords

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

query: <c1, t>

Inverted index (dictionary -> postings):
<c1, t> -> <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> -> <d2, 0.25> <d3, 0.1> <d12, 0.09>
<c3, t> -> <d3, 0.7> <d6, 0.8> <d9, 0.5>

This example is slightly different from book

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.60

If wq = 1.0, then sim(q, d9) = (1.0 x 1.0 x 0.2) + (0.6 x 1.0 x 0.5) = 0.5
(each product is Context Resemblance x Query Term Weight x Document Term Weight; all weights have been normalized)

Example contexts: the query and c1 are author/"Bill", c2 is title/"Bill", c3 is book/author/firstname/"Bill"

OK to ignore Query vs <c2, t> since CR = 0.0; compare Query vs <c1, t> and Query vs <c3, t>
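A compact sketch (illustrative, not the book's pseudocode) of how this SimNoMerge-style score can be accumulated from such an index:

# context-term postings: context path -> {docID: normalized term weight}
index = {
    "c1": {"d1": 0.5, "d4": 0.1, "d9": 0.2},
    "c2": {"d2": 0.25, "d3": 0.1, "d12": 0.09},
    "c3": {"d3": 0.7, "d6": 0.8, "d9": 0.5},
}
# context resemblance of the query context against each index context
cr = {"c1": 1.0, "c2": 0.0, "c3": 0.6}
w_q = 1.0   # query term weight

scores = {}
for context, postings in index.items():
    if cr[context] == 0.0:          # contexts with CR = 0 can be skipped
        continue
    for doc, w_d in postings.items():
        scores[doc] = scores.get(doc, 0.0) + cr[context] * w_q * w_d

print(scores["d9"])   # 1.0*1.0*0.2 + 0.6*1.0*0.5 = 0.5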

CS3245 ndash Information Retrieval

INEX relevance assessments: The relevance-coverage combinations are quantized by a quantization function Q

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of Q(rel(c), cov(c)) over the components c in A

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR

Chapter 11
1. Probabilistic Approach to Retrieval / Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): Traditionally used with the PRP

Assumptions:
Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g. document d represented by vector x = (x1, ..., xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice – a naïve assumption)

Sec 11.3

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios for the terms in the query, c_t:

c_t = log [ (p_t / (1 - p_t)) / (u_t / (1 - u_t)) ]

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 - u_t)

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents

c_t functions as a term weight, so that RSV_d = the sum of c_t over the query terms t appearing in document d

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 11.3.1

91
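A toy sketch (not from the slides) of the "sum c_t in accumulators" step, assuming the per-term weights c_t have already been estimated:

# assumed precomputed log odds ratio weights c_t for the query terms
c = {"jaguar": 2.1, "car": 0.7, "the": 0.0}

docs = {
    "d1": {"the", "jaguar", "car"},
    "d2": {"the", "car"},
    "d3": {"the", "zoo", "jaguar"},
}

query = ["jaguar", "car"]
scores = {d: 0.0 for d in docs}
for t in query:
    for d, terms in docs.items():
        if t in terms:
            scores[d] += c[t]        # accumulate RSV_d = sum of c_t

print(sorted(scores.items(), key=lambda kv: -kv[1]))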

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model. If the query is long, we might also use similar weighting for query terms

tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 11.4.3

92
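The scoring formula this slide extends is not reproduced in the transcript; the sketch below follows the standard Okapi BM25 form including the k3 query-term component described above (treat it as illustrative rather than the exact lecture formulation):

import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, df, N,
               k1=1.5, b=0.75, k3=1.5):
    """Okapi BM25 with query-term weighting (useful for long queries).

    query_tf: {term: tf in query}; doc_tf: {term: tf in doc}
    df: {term: document frequency}; N: number of documents
    """
    score = 0.0
    for t, tf_tq in query_tf.items():
        tf_td = doc_tf.get(t, 0)
        if tf_td == 0:
            continue
        idf = math.log(N / df[t])
        doc_part = ((k1 + 1) * tf_td) / (
            k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)   # no query length norm
        score += idf * doc_part * query_part
    return score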

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case you build an information retrieval scheme in the exact same way. The difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

Sec 11.4.1

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q); by Bayes' rule, P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d

But we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4)

P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 12.2

CS3245 ndash Information Retrieval

Mixture model Summary

What we model: The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 12.2.2
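The mixture equation itself is not in the transcript; the usual linearly smoothed query-likelihood form is P(q|d) = Π_{t in q} (λ P(t|M_d) + (1 − λ) P(t|M_c)). A small sketch under that assumption, with made-up counts:

import math

def query_log_likelihood(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """log P(q|d) with Jelinek-Mercer smoothing against the collection model."""
    score = 0.0
    for t in query:
        p_doc = doc_tf.get(t, 0) / doc_len          # P(t | M_d)
        p_coll = coll_tf.get(t, 0) / coll_len       # P(t | M_c)
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

doc_tf = {"click": 2, "shears": 1}
coll_tf = {"click": 100, "shears": 5, "the": 5000}
print(query_log_likelihood(["click", "shears"], doc_tf, 10, coll_tf, 100000))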

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling

Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)

Duplicates: need to integrate duplicate detection

Scale: need to use a distributed approach

Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 20.1

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling
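A minimal sketch of honouring (and caching) robots.txt with Python's standard library; the host and user-agent values are placeholders:

from urllib import robotparser

_robots_cache = {}   # host -> parsed robots.txt, fetched once per site

def allowed(url, agent="searchengine", host="http://www.example.com"):
    if host not in _robots_cache:
        rp = robotparser.RobotFileParser(host + "/robots.txt")
        rp.read()                      # fetch and parse once, then reuse
        _robots_cache[host] = rp
    return _robots_cache[host].can_fetch(agent, url)

if allowed("http://www.example.com/yoursite/temp/page.html"):
    pass   # fetch the page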

Information Retrieval 98

Sec 20.2

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 21.2.2

Regardless of where we start, we eventually reach the steady state a.
1. Start with any distribution (say x = (1, 0, ..., 0))
2. After one step, we're at xA
3. After two steps, at xA^2, then xA^3 and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
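A NumPy sketch of exactly this power iteration; the 3-page transition matrix A below is a made-up example (teleportation assumed already folded into A):

import numpy as np

# row-stochastic transition matrix A of the random surfer (toy example)
A = np.array([[0.02, 0.49, 0.49],
              [0.49, 0.02, 0.49],
              [0.88, 0.06, 0.06]])

x = np.array([1.0, 0.0, 0.0])      # any starting distribution
for _ in range(100):               # multiply by A until x stops changing
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-10):
        break
    x = x_next

print(x)   # steady-state vector a: the PageRank of each page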

CS3245 ndash Information Retrieval

100

Pagerank summary. Pre-processing: Given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1

Query processing: Retrieve pages meeting the query, rank them by their pagerank. Order is query-independent

Sec 21.2.2

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

[Figure: a map of areas related to this module's topics: Adversarial IR, Boolean IR/VSM, Probabilistic IR, Computational Advertising, Recommendation Systems, Digital Libraries, Natural Language Processing, Question Answering, IR Theory, Distributed IR, Geographic IR, Social Network IR, Query Analysis]

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105



Page 18: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Search ads A win-win-win The search engine company gets revenue every time

somebody clicks on an ad The user only clicks on an ad if they are interested in

the ad Search engines punish misleading and nonrelevant

ads As a result users are often satisfied with what they

find after clicking on an ad The advertiser finds new customers in a cost-

effective way

18

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 19: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying

much more than you are paying Google Eg redirect to a page full of ads

This rarely makes sense for the user

(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval

19

Sec 193

CS3245 ndash Information Retrieval

Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe

httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site

20

Sec 193

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57
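A sketch of the core of SPIMI-Invert for a single block (assumes a token_stream of (term, docID) pairs, with documents arriving in docID order so postings end up sorted):

```python
from collections import defaultdict

def spimi_invert_block(token_stream):
    """token_stream yields (term, docID) pairs for one block."""
    index = defaultdict(list)               # term -> postings list, built on demand
    for term, doc_id in token_stream:
        postings = index[term]              # new dictionary entry created as needed
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)         # accumulate postings as they occur, no sorting
    # terms are sorted only when the block is written to disk
    return sorted(index.items())
```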

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both and merge results. Deletions: keep an invalidation bit-vector for deleted docs and filter the docs output on a search result by this invalidation bit-vector. Periodically re-index into one main index. Assuming T = total number of postings and n = size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge. Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk. If Z0 gets too big (> n), write it to disk as I0, or merge with I0 (if I0 already exists) as Z1.

Either write Z1 to disk as I1 (if there is no I1), or merge with I1 to form Z2, … etc.

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers.

Lose 4 bytes on term lengths.

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix – store differences only. Used for the last k-1 terms in a block of k:

8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion

(* marks the end of the encoded prefix "automat"; the digit before each ◊ gives the length of the extra suffix beyond "automat".)

Information Retrieval 63

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression – Postings file entry: We store the list of docs containing a term in increasing order of docID, e.g. computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:  824                 829        215406
gaps:                        5          214577
VB code: 00000110 10111000   10000101   00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable.

For a small gap (5), VB uses a whole byte.

Sec 53

512+256+32+16+8 = 824
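A sketch of VB encoding for a gap list, reproducing the bytes in the example above (continuation bit set on the last byte of each gap):

```python
def vb_encode_number(n):
    """VB-encode a single positive integer into a list of byte values."""
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128            # set the continuation bit on the final byte
    return bytes_

def vb_encode(gaps):
    return [b for g in gaps for b in vb_encode_number(g)]

print(vb_encode([824, 5, 214577]))
# [6, 184, 133, 13, 12, 177] == 00000110 10111000 10000101 00001101 00001100 10110001
```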

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight.

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.

Increases with the number of occurrences within a document. Increases with the rarity of the term in the collection.

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

67

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_i q_i d_i,  for length-normalized q and d.

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results, skipping unnecessary documents. In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
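The scoring pseudocode referenced on this slide did not survive extraction; here is a Python sketch of term-at-a-time cosine scoring with accumulators, using one common weighting choice (log-tf for documents, idf for the query). postings, df, doc_len and N are assumed in-memory structures, not the course's actual index format:

```python
import math
import heapq
from collections import defaultdict

def cosine_scores(query_terms, postings, df, doc_len, N, K=10):
    """postings[t] is a list of (docID, tf) pairs; df[t] the document frequency;
    doc_len[d] the document's vector length; N the collection size."""
    scores = defaultdict(float)                      # per-document accumulators
    for t in query_terms:
        if t not in postings:
            continue
        w_tq = math.log10(N / df[t])                 # idf weight for the query term
        for doc_id, tf in postings[t]:
            w_td = 1 + math.log10(tf)                # log tf weight for the document
            scores[doc_id] += w_tq * w_td
    for doc_id in scores:
        scores[doc_id] /= doc_len[doc_id]            # length normalization
    return heapq.nlargest(K, scores.items(), key=lambda x: x[1])
```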

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.

Think of A as pruning non-contenders.

The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: Year = 1601 is an example of a field. A field or parametric index has postings for each field value. Sometimes we build range trees (B-trees), e.g. for dates. A field query is typically treated as a conjunction (doc must be authored by shakespeare).

Zones: A zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References, … Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

(Figure: a precision-recall curve, with precision on the y-axis and recall on the x-axis, both ranging from 0.0 to 1.0.)

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A) − P(E)] / [1 − P(E)], where P(A) = proportion of the time the judges agree, and P(E) = what the agreement would be by chance.

Gives 0 for chance agreement 1 for total agreement
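A tiny sketch of the computation (the 0.8 / 0.5 figures are made-up illustration values):

```python
def kappa(p_agree, p_chance):
    """Kappa = [P(A) - P(E)] / [1 - P(E)]."""
    return (p_agree - p_chance) / (1 - p_chance)

# e.g. judges agree 80% of the time, chance agreement would be 50%:
print(kappa(0.8, 0.5))   # 0.6
```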

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3: XML IR – Basic XML concepts, Challenges in XML IR, Vector space model for XML IR, Evaluation of XML IR

Chapter 9: Query Refinement
1. Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.

α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton)
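The Rocchio update itself, in the "in practice" form with the symbols defined above (restated here since the equation did not survive extraction):

q_m = α · q_0 + β · (1/|Dr|) · Σ_{d_j ∈ Dr} d_j − γ · (1/|Dnr|) · Σ_{d_j ∈ Dnr} d_j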

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: The simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix and w_ij = the (normalized) weight for (t_i, d_j).

For each t_i, pick the terms with high values in C.

Sec 923

In NLTK Did you forget

84
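A numpy sketch of this computation; A, vocab and term_index are assumed inputs (the M × N weight matrix, a row-index-to-term list, and the row of the term of interest):

```python
import numpy as np

def co_occurrence_neighbors(A, vocab, term_index, k=5):
    """Return the k terms most similar to vocab[term_index] under C = A A^T."""
    C = A @ A.T                           # term-term similarity matrix (M x M)
    row = C[term_index].copy()
    row[term_index] = -np.inf             # ignore the term itself
    top = np.argsort(-row)[:k]
    return [vocab[i] for i in top]
```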

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

(Figure: an XML document with a Book node whose Title is "Microsoft" and whose Author is "Bill Gates" is mapped to its lexicalized subtrees "with words", e.g. Microsoft, Bill, Gates, Bill Gates, Title–Microsoft, Author–Bill Gates, Book–Title–Microsoft, …)

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

|cq| and |cd| are the number of nodes in the query path and document path, respectively.

cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
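The CR function referred to above (restated here since the formula did not survive extraction):

CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise.

E.g. a 2-node query path matching a 4-node document path gives (1 + 2) / (1 + 4) = 0.6, as in the example on the next slide.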

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>, e.g. author/"Bill"

Inverted index dictionary: <c1, t> (e.g. author/"Bill"), <c2, t> (e.g. title/"Bill"), <c3, t> (e.g. book/author/firstname/"Bill"), with postings such as <d1, 0.5>; <d4, 0.1> <d9, 0.2>; <d2, 0.25> <d3, 0.1> <d12, 0.09>; <d3, 0.7> <d6, 0.8> <d9, 0.5>. Each posting is <docID, normalized document term weight>. (This example is slightly different from the book.)

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.6

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each product is Context Resemblance × Query Term Weight × Document Term Weight; all weights have been normalized)

OK to ignore Query vs. <c2, t> since CR = 0.0; the contributions come from Query vs. <c1, t> and Query vs. <c3, t>.

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval; Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model; BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence: no association between terms (not true, but works in practice – naïve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing the RSV. We can rank documents using the log odds ratios for the terms in the query, c_t.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t).

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that summing it over the query terms present in a document gives that document's RSV. Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
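The term weight referred to above (restated here since the formula did not survive extraction):

c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ] = log [ p_t / (1 − p_t) ] + log [ (1 − u_t) / u_t ]

and the retrieval status value of a document is RSV_d = Σ_{t ∈ q, x_t = 1} c_t.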

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single, fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec 1143

92
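For reference, the full BM25 scoring function that this long-query weighting extends (restated here since the equation did not survive extraction):

RSV_d = Σ_{t ∈ q} log(N / df_t) · [ (k1 + 1) tf_td ] / [ k1 ((1 − b) + b (L_d / L_ave)) + tf_td ] · [ (k3 + 1) tf_tq ] / [ k3 + tf_tq ]

where L_d is the document length and L_ave the average document length in the collection.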

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
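The equation summarized above (query likelihood with a Jelinek-Mercer mixture of document and collection models):

P(d | q) ∝ P(d) · Π_{t ∈ q} [ λ P(t | M_d) + (1 − λ) P(t | M_c) ]

where M_d is the document's language model, M_c the collection model, and 0 < λ < 1.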

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1 0 … 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³ and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm multiply x by increasing powers of A until the product looks stable
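A numpy sketch of this power iteration (A is assumed to be the row-stochastic transition matrix with teleportation already folded in):

```python
import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    """Multiply a start distribution by increasing powers of A until it looks stable."""
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                       # start with x = (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A               # one step of the walk
        if np.abs(x_next - x).sum() < tol:
            return x_next            # stable: this is the steady state a
        x = x_next
    return x
```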

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103. Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105



BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 21: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

DUPLICATEDETECTION

21

Sec 196

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches
The number of shingles per document is large; difficult to exhaustively compare.
To make it fast, we use a sketch: a subset of the shingles of a document.
The size of a sketch is, say, n = 200 and is defined by a set of permutations π1, …, π200.
Each πi is a random permutation on 1..2^m.
The sketch of d is defined as

  ⟨ min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) ⟩

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element: a permutation of the original hashes.

We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?"
In this case, permutation π says d1 ≈ d2.

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di), S(dj)) ≅ P(xi = xj)

We view a matrix A with one column per set of hashes: element A_ij = 1 if element i is present in set S_j.
Permutation π(n) is a random reordering of the rows in A.
x_k is the first non-zero entry in π(d_k), i.e., the first shingle present in document k.

Let C00 = # of rows in A where both entries are 0; define C11, C10, C01 likewise.
J(S_i, S_j) is then equivalent to C11 / (C10 + C01 + C11).
P(x_i = x_j) is then also equivalent to C11 / (C10 + C01 + C11).

[Figure: an example matrix A with columns S_i, S_j (here J(S_i, S_j) = 2/5), and four random permutations π(1), π(2), π(3), π(n=200); permutations whose first non-zero rows agree give x_i = x_j, the others give x_i ≠ x_j.]

31

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard
Thus, the proportion of successful permutations is the Jaccard coefficient.
Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s).
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial.
Estimator of the probability of success: the proportion of successes in n Bernoulli trials (n = 200).
Our sketch is based on a random selection of permutations.
Thus, to compute Jaccard, count the number k of successful permutations for ⟨d1, d2⟩ and divide by n = 200:
  k/n = k/200 estimates J(d1, d2).
32
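A small Python sketch of the estimator. Random permutations of the fingerprint space are simulated here with random affine hash functions, an illustrative assumption rather than the exact construction on the slides; shingles(), d1 and d2 are from the earlier sketch.

import random

random.seed(0)
m = 64
MASK = (1 << m) - 1
# n = 200 "permutations", simulated as random affine hashes (illustrative assumption)
hash_funcs = [(random.randrange(1, 1 << m), random.randrange(1 << m)) for _ in range(200)]

def sketch(shingle_set):
    fingerprints = [hash(s) & MASK for s in shingle_set]
    return [min(((a * f + b) & MASK) for f in fingerprints) for (a, b) in hash_funcs]

def estimate_jaccard(sk1, sk2):
    # proportion of "successful" permutations: positions where the minima agree
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

# e.g., estimate_jaccard(sketch(shingles(d1)), sketch(shingles(d2)))  -> noisy estimate of J(d1, d2)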

Sec 196

CS3245 ndash Information Retrieval

Shingling: Summary
Input: N documents.
Choose the n-gram size for shingling, e.g., n = 5.
Pick 200 random permutations, represented as hash functions.
Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document.
Compute pairwise similarities.
Take the transitive closure of documents with similarity > θ.
Index only one document from each equivalence class (a union-find sketch follows).
33
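For the last two steps, a hypothetical union-find sketch: group documents whose estimated similarity exceeds θ (transitive closure) and keep one representative per equivalence class. estimate_jaccard() is the helper from the previous sketch.

def near_duplicate_classes(sketches, theta=0.8):
    n = len(sketches)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if estimate_jaccard(sketches[i], sketches[j]) > theta:
                parent[find(i)] = find(j)   # union: builds the transitive closure

    classes = {}
    for i in range(n):
        classes.setdefault(find(i), []).append(i)
    # index only one document (e.g. the first) from each equivalence class
    return [members[0] for members in classes.values()]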

Sec 196

CS3245 ndash Information Retrieval

Summary
Search Advertising: a type of crowdsourcing – ask advertisers how much they want to spend, ask searchers how relevant an ad is. Auction mechanics.
Duplicate Detection: represent documents as shingles; calculate an approximation to the Jaccard by using random trials.

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Get Exam Results earlier via SMS on 3 Jun 2019.
Subscribe via NUS EduRec System by 20 May 2019.
For more information:
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleasehtml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution
Focus on topics covered from Weeks 5 to 12.
Q1 is true/false questions on various topics (20 marks).
Q2–Q9 are calculation / essay questions, topic specific (8–12 marks each). Don't forget your calculator!

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1: Ngram Models
A language model is created based on a collection of text and used to assign a probability to a sequence of words.
Unigram LM: bag of words.
Ngram LM: use n−1 tokens as the context to predict the nth token.
Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can be used to do information retrieval, as introduced in Week 11.

CS3245 ndash Information Retrieval

Unigram LM with Add-1 smoothing
Not used in practice, but most basic to understand.
Idea: add 1 count to all entries in the LM, including those that are not seen.

E.g., assuming |V| = 11 (the total # of word types in both lines):

Q2 (By Probability): "I don't want"
P(Aerosmith): .11 · .11 · .11 = 1.3E-3
P(LadyGaga): .15 · .05 · .15 = 1.1E-3
Winner: Aerosmith

Counts, line 1 (Total Count: 18): I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1
Counts, line 2 (Total Count: 20): I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2

42
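A minimal sketch of the add-1 estimate using the counts above; the dictionaries below just restate the two tables, and which song is which is left unlabeled here.

def add_one_prob(word, counts, vocab_size):
    total = sum(counts.values())
    return (counts.get(word, 0) + 1) / (total + vocab_size)

lm1 = {"I": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
       "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}   # total 18
lm2 = {"I": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
       "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}   # total 20

V = 11
query = ["I", "don't", "want"]
p1 = p2 = 1.0
for w in query:
    p1 *= add_one_prob(w, lm1, V)
    p2 *= add_one_prob(w, lm2, V)
print(p1, p2)   # the LM with the larger product wins the query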

CS3245 ndash Information Retrieval

Week 2: Basic (Boolean) IR
Basic inverted indexes: in-memory dictionary and on-disk postings.
Key characteristic: sorted order for postings.
Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size.

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Doc frequency information is also stored. (Why frequency?)

Sec 12

44

CS3245 ndash Information Retrieval

The merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

  Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
  Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
  Intersection: 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.
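A sketch of the linear-time merge on the two sorted postings lists above:

def intersect(p1, p2):
    # both lists sorted by docID; O(x + y)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]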

Sec 13

45

CS3245 ndash Information Retrieval

Week 3: Postings lists and Choosing terms
Faster merging of postings lists: skip pointers.
Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results.
How to do it? And where do we place skip pointers?

[Figure: postings list 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers to 41 and 128; postings list 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers to 11 and 31.]

47

Sec 23

CS3245 ndash Information Retrieval

Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words.
For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term.
Two-word phrase query processing is now immediate.

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

  <be: 993427;
    1: 7, 18, 33, 72, 86, 231;
    2: 3, 149;
    4: 17, 191, 291, 430, 434;
    5: 363, 367, …>

For phrase queries, we use a merge algorithm recursively at the document level.
Now we need to deal with more than just equality.

Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?

49

Sec 242

CS3245 ndash Information Retrieval

Choosing terms
Definitely language specific!
In English we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.

50

CS3245 ndash Information Retrieval

Week 4: The dictionary and tolerant retrieval
Data structures for the dictionary: hash tables, trees.
Learning to be tolerant:
1. Wildcards: general trees, permuterm, n-grams redux
2. Spelling correction: edit distance, n-grams re-redux
3. Phonetic – Soundex
51

CS3245 ndash Information Retrieval

Hash Table
Each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree: O(1).
Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, we need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries
How can we handle *'s in the middle of a query term? E.g., co*tion
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wild-card queries so that the * always occurs at the end.
This gives rise to the Permuterm index.
Alternative: k-gram index with postfiltering.

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction
Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q.
How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits.
54
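A standard dynamic-programming sketch of Levenshtein distance with unit costs (weighted edit distance would just change the cost terms):

def edit_distance(a, b):
    # dp[i][j] = cost of transforming a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / copy
    return dp[len(a)][len(b)]

print(edit_distance("judgment", "judgement"))  # 1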

Sec 332

CS3245 ndash Information Retrieval

Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing – merge sort is effective for disk-based sorting (avoid seeks).
Single-Pass In-Memory Indexing: no global dictionary – generate a separate dictionary for each block; don't sort postings – accumulate postings as they occur.
Distributed indexing using MapReduce.
Dynamic indexing: multiple indices, logarithmic merge.

CS3245 ndash Information Retrieval

BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID).
As terms are of variable length, create a dictionary to map terms to 4-byte termIDs; these are generated as we parse docs.
Must now sort 100M 8-byte records by termID.
Define a block as ~10M such records: can easily fit a couple into memory; will have 10 such blocks for our collection.
Basic idea of the algorithm: accumulate postings for each block, sort, write to disk; then merge the blocks into one long sorted order.

CS3245 ndash Information Retrieval

SPIMI: Single-pass in-memory indexing
Key idea 1: generate separate dictionaries for each block – no need to maintain a term–termID mapping across blocks.
Key idea 2: don't sort; accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block.
These separate indices can then be merged into one big index.

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing: MapReduce Data flow

[Figure: input split into segment files; parsers (map phase), assigned by a master, emit (term, docID) pairs partitioned into key ranges a-b, c-d, …, y-z; inverters (reduce phase) collect each key-range partition and write its postings.]

58

CS3245 ndash Information Retrieval

Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index; search across both and merge results.
Deletions: keep an invalidation bit-vector for deleted docs; filter the docs output on a search result by this invalidation bit-vector.
Periodically re-index into one main index.
Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to ⌊T/n⌋ times.
Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.

CS3245 ndash Information Retrieval

Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one.
Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge with I0 (if I0 already exists) as Z1.
Either write merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2; … and so on, looping over the log levels.

Sec 45

60

CS3245 ndash Information Retrieval

Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws.
Compression to make the index smaller and faster.
Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding.
Postings compression: gap encoding.

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k = 4.
Need to store term lengths (1 extra byte):

  …7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

62

Sec 52

CS3245 ndash Information Retrieval

Front coding
Sorted words commonly have a long common prefix – store differences only.
Used for the last k−1 terms in a block of k:

  8automata8automate9automatic10automation
  → 8automat*a1◊e2◊ic3◊ion

("automat" is encoded once; each later entry stores only the extra length beyond "automat".)
Begins to resemble general string compression.

Information Retrieval 63

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID:
  computer: 33, 47, 154, 159, 202, …
Consequence: it suffices to store gaps:
  33, 14, 107, 5, 43, …
Hope: most gaps can be encoded/stored with far fewer than 20 bits.
64

CS3245 ndash Information Retrieval

Variable Byte Encoding: Example
  docIDs:  824                 829        215406
  gaps:                        5          214577
  VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

Postings stored as the byte concatenation:
  00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.
(Check: 00000110 10111000 decodes to 6 × 128 + 56 = 512 + 256 + 32 + 16 + 8 = 824.)
65
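A sketch of VB encoding and decoding of a gap list: 7 payload bits per byte, with the continuation bit set on the last byte of each gap, matching the slide.

def vb_encode_number(n):
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                    # set continuation bit on the last byte
    return out

def vb_encode(gaps):
    stream = []
    for g in gaps:
        stream.extend(vb_encode_number(g))
    return stream

def vb_decode(stream):
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]               # first "gap" is the first docID itself
encoded = vb_encode(gaps)
print([format(b, '08b') for b in encoded])   # matches the bytes above
print(vb_decode(encoded) == gaps)            # True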

CS3245 ndash Information Retrieval

Week 7: Vector space ranking
1. Represent the query as a weighted tf-idf vector.
2. Represent each document as a weighted tf-idf vector.
3. Compute the cosine similarity score for the query vector and each document vector.
4. Rank documents with respect to the query by score.
5. Return the top K (e.g., K = 10) to the user.

66

CS3245 ndash Information Retrieval

tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight:

  w(t,d) = (1 + log10 tf(t,d)) × log10(N / df(t))

Best known weighting scheme in IR. (Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.)
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.
67
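As a one-function sketch of this weighting:

from math import log10

def tf_idf_weight(tf_td, df_t, N):
    # w(t, d) = (1 + log10 tf) * log10(N / df); zero if the term is absent
    return (1 + log10(tf_td)) * log10(N / df_t) if tf_td > 0 else 0.0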

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors
Key idea 1: do the same for queries – represent them as vectors in the space; they are "mini-documents".
Key idea 2: rank documents according to their proximity to the query in this space.
proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity
Motivation: we want to rank more relevant documents higher than less relevant documents.

68

Sec 63

CS3245 ndash Information Retrieval

Cosine for length-normalized vectors
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

  cos(q, d) = q · d = Σ_i q_i d_i,   for length-normalized q and d

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results; skipping unnecessary documents.
In actual data: dealing with zones and fields, query term proximity.
Resources for today: IIR 7, 6.1
71

CS3245 ndash Information Retrieval

Recap Computing cosine scores
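The slide's pseudocode was an image in the original deck and is not in this transcript; a Python sketch of the usual term-at-a-time scoring with accumulators follows, with an assumed index layout (postings maps a term to (docID, weight) pairs and query term weights are taken as 1).

import heapq

def cosine_scores(query_terms, postings, doc_lengths, K=10):
    # postings: term -> list of (docID, tf-idf weight); doc_lengths: docID -> vector length
    scores = {}
    for t in query_terms:
        for doc_id, w_td in postings.get(t, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_td   # accumulate w(t,q) * w(t,d), query weight assumed 1
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]                  # length normalization
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])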

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach
Find a set A of contenders, with K < |A| ≪ N.
A does not necessarily contain the top K, but has many docs from among the top K.
Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.
73

CS3245 ndash Information Retrieval

Heuristics
Index elimination: high-idf query terms only; docs containing many query terms.
Champion lists: generalized to high-low lists and tiered indexes.
Impact-ordered postings: early termination, idf-ordered terms.
Cluster pruning.

74

Sec 711

CS3245 ndash Information Retrieval

Net score
Consider a simple total score combining cosine relevance and authority:

  net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting; indeed, any function of the two "signals" of user happiness.
Now we seek the top K docs by net score.

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices
Fields: "Year = 1601" is an example of a field. A field or parametric index has postings for each field value; sometimes we build range (B-tree) trees (e.g., for dates). A field query is typically treated as a conjunction (the doc must be authored by Shakespeare).
Zones: a zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References. Build inverted indexes on zones as well, to permit querying.
76

Sec 61

CS3245 ndash Information Retrieval

Week 9: IR Evaluation
How do we know if our results are any good?
Benchmarks; precision and recall; composite measures; test collections; A/B testing.

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve; precision on the y-axis vs. recall on the x-axis, both from 0.0 to 1.0.]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: an agreement measure among judges, designed for categorical judgments; corrects for chance agreement.

  Kappa (K) = [P(A) − P(E)] / [1 − P(E)]

P(A) – proportion of the time judges agree.
P(E) – what agreement would be by chance.
Gives 0 for chance agreement, 1 for total agreement.
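A quick sketch with made-up numbers:

def kappa(p_agree, p_chance):
    # K = [P(A) - P(E)] / [1 - P(E)]
    return (p_agree - p_chance) / (1 - p_chance)

print(kappa(0.8, 0.5))   # illustrative values only -> 0.6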

CS3245 ndash Information Retrieval

A/B testing
Purpose: test a single innovation.
Prerequisite: you have a large system with many users.
Have most users use the old system; divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result.
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR
Chapter 10.3: XML IR – basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR.
Chapter 9: Query Refinement
1. Relevance Feedback (document level): explicit RF – Rocchio (1971); when does it work?; variants: implicit and blind.
2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation.
81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

  q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j

D_r = set of known relevant doc vectors; D_nr = set of known irrelevant doc vectors.
Different from C_r and C_nr, as we only get judgments from a few documents.
α, β, γ = weights (hand-chosen or set empirically).
Popularized in the SMART system (Salton).
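A numpy sketch of the update; the default weights below are illustrative hand-chosen values, not values from the lecture.

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # q0: query vector; rel_docs / nonrel_docs: 2D arrays of judged doc vectors
    q = alpha * np.asarray(q0, dtype=float)
    if len(rel_docs):
        q += beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q -= gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q, 0.0)   # negative term weights are usually clipped to zero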

CS3245 ndash Information Retrieval

Query Expansion
In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents.
In query expansion, users give additional input (good/bad search term) on words or phrases.

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
The simplest way to compute one is based on the term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix and w_ij = (normalized) weight for (t_i, d_j).
For each t_i, pick the terms with high values in C.

Sec 923

In NLTK Did you forget

84
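A numpy sketch with a toy 3×3 term-document matrix; the weights are made up for illustration.

import numpy as np

# A: M x N term-document matrix of (normalized) weights -- toy data
A = np.array([[0.5, 0.0, 0.7],
              [0.4, 0.9, 0.0],
              [0.0, 0.8, 0.6]])
C = A @ A.T                       # term-term similarity matrix
for i, row in enumerate(C):
    ranked = [j for j in np.argsort(-row) if j != i]
    print(f"term {i}: most related terms {ranked[:2]}")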

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
Aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.

[Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is decomposed into its lexicalized subtrees ("with words"): Microsoft, Bill Gates, Title/Microsoft, Author/Bill, Author/Gates, Book/Title/Microsoft, Book/Author/Bill-Gates, …]

85

CS3245 ndash Information Retrieval

Context resemblance
A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:

  CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|)  if c_q matches c_d,  0 otherwise

|c_q| and |c_d| are the number of nodes in the query path and document path, respectively.
c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes.
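A sketch of CR as defined above, treating a path as a list of node labels (the worked example on the next slide uses slightly different numbers):

def matches(cq, cd):
    # cq matches cd iff cq can be turned into cd by inserting extra nodes (subsequence test)
    it = iter(cd)
    return all(node in it for node in cq)

def context_resemblance(cq, cd):
    if matches(cq, cd):
        return (1 + len(cq)) / (1 + len(cd))
    return 0.0

print(context_resemblance(["author"], ["book", "author", "firstname"]))  # (1+1)/(1+3) = 0.5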

86

CS3245 ndash Information Retrieval

SimNoMerge example
(This example is slightly different from the book.)

Query: <c1, t>   e.g., author/"Bill"

Inverted index (all weights have been normalized):
  <c1, t>  →  <d1, 0.5>  <d4, 0.1>  <d9, 0.2>     e.g., author/"Bill"
  <c2, t>  →  <d2, 0.25> <d3, 0.1>  <d12, 0.9>    e.g., title/"Bill"
  <c3, t>  →  <d3, 0.7>  <d6, 0.8>  <d9, 0.5>     e.g., book/author/firstname/"Bill"

Context resemblance: CR(c1, c1) = 1.0; CR(c1, c2) = 0.0; CR(c1, c3) = 0.60.
OK to ignore query vs. <c2, t> since CR = 0.0.

If w_q = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(context resemblance × query term weight × document term weight, summed over query vs. <c1, t> and query vs. <c3, t>).

87

CS3245 ndash Information Retrieval

INEX relevance assessments
The relevance-coverage combinations are quantized with a quantization function Q.
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval; the quantization function Q instead allows us to grade each component as partially relevant.
The number of relevant components in a retrieved set A of components can then be computed as

  #(relevant items retrieved) = Σ_{c ∈ A} Q(rel(c), cov(c))

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval; Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model; BestMatch25 (Okapi)
Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM)
Traditionally used with the PRP.
Assumptions:
"Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g., document d is represented by the vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

Deriving a Ranking Function for Query Terms
So it boils down to computing the RSV. We can rank documents using the log odds ratios c_t for the terms in the query:

  c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 − u_t).
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents; c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that

  RSV_d = Σ_{t ∈ q ∩ d} c_t

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in the VSM.

Information Retrieval

Sec 1131

91
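A minimal sketch of the resulting scoring, with p_t and u_t supplied as dictionaries of per-term probabilities (an illustrative interface, not the lecture's notation):

from math import log

def c_t(p_t, u_t):
    # log odds ratio: log [ p(1-u) / (u(1-p)) ]
    return log(p_t * (1 - u_t) / (u_t * (1 - p_t)))

def rsv(doc_terms, query_terms, p, u):
    # sum c_t over query terms that appear in the document
    return sum(c_t(p[t], u[t]) for t in query_terms if t in doc_terms)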

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:

  RSV_d = Σ_{t ∈ q} log(N/df_t) · [ (k1 + 1) tf_td / ( k1((1−b) + b·L_d/L_ave) + tf_td ) ] · [ (k3 + 1) tf_tq / ( k3 + tf_tq ) ]

tf_tq: term frequency in the query q.
k3: tuning parameter controlling term frequency scaling of the query.
No length normalization of queries (because retrieval is being done with respect to a single, fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 are between 1.2 and 2, and b = 0.75.
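A sketch of the full scoring function under these assumptions; the parameter defaults are picked from the ranges quoted above.

from math import log10

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, N, df,
               k1=1.5, k3=1.5, b=0.75):
    # RSV_d = sum over query terms of idf * document tf factor * query tf factor
    score = 0.0
    for t, tf_tq in query_tf.items():
        if t not in doc_tf or t not in df:
            continue
        idf = log10(N / df[t])
        tf_td = doc_tf[t]
        doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
        score += idf * doc_part * query_part
    return score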

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case you build an information retrieval scheme in the exact same way.
Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR
Each document is treated as (the basis for) a language model.
Given a query q, rank documents based on P(d|q) ∝ P(q|d) P(d):
P(q) is the same for all documents, so ignore it.
P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g., those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model: Summary

  P(d|q) ∝ P(d) · Π_{t ∈ q} ( (1 − λ) P(t|M_c) + λ P(t|M_d) )

What we model: the user has a document in mind and generates the query from this document.
The equation represents the probability that the document that the user had in mind was, in fact, this one.

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days).
Duplicates: need to integrate duplicate detection.
Scale: need to use a distributed approach.
Spam and spider traps: need to integrate spam detection.

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994.
Example:

  User-agent: *
  Disallow: /yoursite/temp/

  User-agent: searchengine
  Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0)).
2. After one step we're at xA.
3. After two steps at xA², then xA³, and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.
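A numpy sketch of this power iteration on a toy transition matrix with teleportation already folded in (the numbers are made up):

import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    # A: row-stochastic transition matrix (teleportation already included)
    N = A.shape[0]
    x = np.zeros(N)
    x[0] = 1.0                    # start with any distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A            # one step of the random walk
        if np.abs(x_next - x).sum() < tol:
            return x_next         # product looks stable: steady state a
        x = x_next
    return x

# toy two-page example (illustrative numbers only)
A = np.array([[0.05, 0.95],
              [0.95, 0.05]])
print(pagerank(A))                # approximately [0.5, 0.5]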

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query and rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives
You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use.
NLTK – a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCH ADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATE DETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 22: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Duplicate detectionThe web is full of duplicated content More so than many other collections

Exact duplicates Easy to detect use hashfingerprint (eg MD5)

Near-duplicates More common on the web difficult to eliminate

For the user itrsquos annoying to get a search result with near-identical documents

Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate

22

Sec 196

CS3245 ndash Information Retrieval

Near-duplicates Example

23

Sec 196

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q). By Bayes' rule, P(d|q) = P(q|d) P(d) / P(q).

P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d.

But we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4).

P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model: Summary

What we model: The user has a document in mind and generates the query from this document. The query is scored by the mixture P(q|d) ∝ Π_{t∈q} ( λ P(t|M_d) + (1 − λ) P(t|M_c) ), combining the document model M_d with the collection model M_c.

The equation represents the probability that the document that the user had in mind was, in fact, this one.

95

Sec 1222
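A minimal sketch of scoring with this mixture (Jelinek-Mercer style smoothing); the count dictionaries and names below are assumptions made for illustration:

import math

def log_query_likelihood(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    # log P(q|d) with P(t|M_d) smoothed by the collection model P(t|M_c)
    log_p = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len      # P(t | M_d)
        p_col = coll_tf.get(t, 0) / coll_len    # P(t | M_c)
        p = lam * p_doc + (1 - lam) * p_col
        if p == 0.0:
            return float("-inf")                # term never seen, even in the collection
        log_p += math.log(p)
    return log_p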

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days).
Duplicates: need to integrate duplicate detection.
Scale: need to use a distributed approach.
Spam and spider traps: need to integrate spam detection.

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202
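In Python, the standard-library urllib.robotparser module can perform this check; the URLs and crawler name below are placeholders:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # placeholder site
rp.read()   # fetch once and cache per site, as the slide suggests

allowed = rp.can_fetch("MyCrawler", "http://www.example.com/yoursite/temp/page.html")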

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0)).
2. After one step, we're at xA.
3. After two steps, we're at xA^2, then xA^3, and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.
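A sketch of this power iteration in Python/NumPy, assuming A is already the teleport-adjusted, row-stochastic transition matrix (function and parameter names are illustrative):

import numpy as np

def pagerank_power_iteration(A, tol=1e-10, max_iter=1000):
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                     # any starting distribution works
    for _ in range(max_iter):
        x_next = x @ A             # one step: multiply by A
        if np.abs(x_next - x).sum() < tol:
            break                  # the product "looks stable"
        x = x_next
    return x_next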

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: Given the graph of links, build matrix A. From it, compute a. The pagerank a_i is a scaled number between 0 and 1.

Query processing: Retrieve pages meeting the query. Rank them by their pagerank. Order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103 Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldn't work
  • Interest aggregation
  • SEARCH ADVERTISING
  • 1st Generation of Search Ads: Goto (1996)
  • 1st Generation of search ads: Goto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Google's second price auction
  • Google's second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATE DETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di), S(dj)) ≅ P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index Compression: Dictionary-as-a-String and Blocking
  • Front coding
  • Postings Compression: Postings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robots.txt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

[Embedded table fragment from the dictionary compression slides: columns Freq, Postings ptr, Term ptr]

Page 24: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult

to compute

We do not consider documents near-duplicates if they have the same content but express it with different words

Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo

Eg two documents are near-duplicates if similarity gt θ = 80

24

Sec 196

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: "Year = 1601" is an example of a field. A field or parametric index has postings for each field value. Sometimes build range trees (B-trees), e.g., for dates. A field query is typically treated as a conjunction (doc must be authored by shakespeare).

Zones: A zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, … Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation
How do we know if our results are any good?
Benchmarks
Precision and Recall
Composite measures
Test collection

A/B Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve, with Precision on the y-axis and Recall on the x-axis, both ranging from 0.0 to 1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: an agreement measure among judges, designed for categorical judgments; corrects for chance agreement.

Kappa (K) = [P(A) - P(E)] / [1 - P(E)], where P(A) is the proportion of the time the judges agree and P(E) is what the agreement would be by chance.

Gives 0 for chance agreement, 1 for total agreement.
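
A small Python sketch of this for two judges making binary relevance calls; it estimates P(E) from each judge's own marginals (Cohen's variant), whereas the book's worked example pools the marginals, but the formula above is applied the same way:

def kappa(judgments_a, judgments_b):
    n = len(judgments_a)
    p_agree = sum(a == b for a, b in zip(judgments_a, judgments_b)) / n   # P(A)
    pa = sum(judgments_a) / n          # P(judge A says relevant)
    pb = sum(judgments_b) / n          # P(judge B says relevant)
    p_chance = pa * pb + (1 - pa) * (1 - pb)                              # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]     # 1 = relevant, 0 = non-relevant (toy judgments)
b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
print(round(kappa(a, b), 3))           # about 0.583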

CS3245 ndash Information Retrieval

A/B testing
Purpose: Test a single innovation.
Prerequisite: You have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness.

Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3 XML IR: Basic XML concepts; Challenges in XML IR; Vector space model for XML IR; Evaluation of XML IR

Chapter 9 Query Refinement:
1. Relevance Feedback (Document Level)
   Explicit RF – Rocchio (1971): When does it work?
   Variants: Implicit and Blind
2. Query Expansion (Term Level)
   Manual thesaurus
   Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.

α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
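
The Rocchio update itself appears as an equation in the book, q_m = α·q_0 + β·centroid(Dr) − γ·centroid(Dnr); below is a small numpy sketch of it (the weights 1.0/0.75/0.15 and the toy vectors are illustrative defaults, not values from the slide):

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    q_m = alpha * q0
    if rel_docs:
        q_m = q_m + beta * np.mean(rel_docs, axis=0)      # centroid of known relevant docs
    if nonrel_docs:
        q_m = q_m - gamma * np.mean(nonrel_docs, axis=0)  # centroid of known irrelevant docs
    return np.maximum(q_m, 0.0)                           # negative term weights are usually clipped

q0 = np.array([1.0, 0.0, 0.0, 0.0])                       # toy 4-term vocabulary
rel = [np.array([0.8, 0.6, 0.0, 0.0])]
nonrel = [np.array([0.0, 0.0, 0.9, 0.4])]
print(rocchio(q0, rel, nonrel))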

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents.

In query expansion, users give additional input (good/bad search term) on words or phrases.

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = A·A^T, where A is the M × N term-document matrix with rows t_i and columns d_j, and w_ij = (normalized) weight for (t_i, d_j).

For each t_i, pick terms with high values in C.

Sec 923

In NLTK Did you forget

84
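
A toy numpy sketch of this computation (the terms and weights are made up; we build C = A·A^T and read off each term's highest-scoring neighbour):

import numpy as np

terms = ["car", "auto", "insurance", "best"]
A = np.array([                    # M x N term-document matrix of (normalized) weights
    [0.9, 0.0, 0.7],
    [0.8, 0.1, 0.6],
    [0.0, 0.9, 0.8],
    [0.1, 0.8, 0.0],
])
C = A @ A.T                       # term-term similarity matrix
np.fill_diagonal(C, 0.0)          # ignore self-similarity
for i, t in enumerate(terms):
    print(t, "->", terms[int(np.argmax(C[i]))])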

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees. Aim to have each dimension of the vector space encode a word together with its position within the XML tree.

How? Map XML documents to lexicalized subtrees.

85

[Figure: an XML document Book → (Title → Microsoft, Author → Bill Gates) decomposed "with words" into lexicalized subtrees such as Author–Bill, Author–Gates, Title–Microsoft, Book–Title–Microsoft, and the full Book subtree]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise

|cq| and |cd| are the number of nodes in the query path and document path, respectively.

cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
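
A tiny Python sketch of CR as defined above; treating the word itself as a node of the path reproduces the CR values used in the example on the next slide:

def context_resemblance(cq, cd):
    # cq matches cd iff cq is a subsequence of cd (cd = cq with extra nodes inserted)
    it = iter(cd)
    matches = all(node in it for node in cq)
    return (1 + len(cq)) / (1 + len(cd)) if matches else 0.0

c1 = ["author", "Bill"]
c2 = ["title", "Bill"]
c3 = ["book", "author", "firstname", "Bill"]
print(context_resemblance(c1, c1))   # 1.0
print(context_resemblance(c1, c2))   # 0.0
print(context_resemblance(c1, c3))   # 0.6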

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>, e.g., author/"Bill"

Inverted index (dictionary of context-term pairs, postings of <docID, normalized document term weight>):
<c1, t> (e.g., author/"Bill") → <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> (e.g., title/"Bill") → <d2, 0.25> <d3, 0.1> <d12, 0.9>
<c3, t> (e.g., book/author/firstname/"Bill") → <d3, 0.7> <d6, 0.8> <d9, 0.5>

This example is slightly different from the book. All weights have been normalized.

CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60

Each matching posting contributes Context Resemblance × Query Term Weight × Document Term Weight. It is OK to ignore the query vs <c2, t> since CR = 0.0; only query vs <c1, t> and query vs <c3, t> contribute.

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of the Q values of the components in A.

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval: Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions:
Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice – naïve assumption).


Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios c_t for the terms in the query.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t):

c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d = Σ c_t, summed over the query terms present in document d.

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
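
A small Python sketch of this scoring (the p_t and u_t estimates are made-up numbers; in practice they come from relevance judgments or the approximations discussed in the book):

import math

def term_weights(p, u):
    # c_t = log( p_t (1 - u_t) / ( u_t (1 - p_t) ) )
    return {t: math.log(p[t] * (1 - u[t]) / (u[t] * (1 - p[t]))) for t in p}

def rsv(doc_terms, c):
    # sum the c_t accumulators for query terms present in the document
    return sum(w for t, w in c.items() if t in doc_terms)

p = {"jaguar": 0.6, "car": 0.5}    # P(term present | relevant)
u = {"jaguar": 0.1, "car": 0.4}    # P(term present | non-relevant)
c = term_weights(p, u)
print(round(rsv({"jaguar", "car", "the"}, c), 3))   # about 3.008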

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:
tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values of k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec 1143

92
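
The full BM25 formula appears as an equation in the book; the sketch below follows that standard form, including the query-side k3 factor described above (all counts and the example call are illustrative):

import math

def bm25(query, doc_tf, doc_len, avg_doc_len, df, N, k1=1.5, b=0.75, k3=1.5):
    q_tf = {}
    for t in query:
        q_tf[t] = q_tf.get(t, 0) + 1
    score = 0.0
    for t, tf_tq in q_tf.items():
        if t not in df or t not in doc_tf:
            continue
        idf = math.log(N / df[t])
        tf_td = doc_tf[t]
        doc_part = (k1 + 1) * tf_td / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        query_part = (k3 + 1) * tf_tq / (k3 + tf_tq)    # no length normalization of queries
        score += idf * doc_part * query_part
    return score

print(round(bm25(["car", "insurance"], {"car": 3, "insurance": 1},
                 doc_len=120, avg_doc_len=100,
                 df={"car": 5000, "insurance": 2000}, N=100000), 3))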

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. The difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR
Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q):

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, so ignore it.
P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g., those with high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.

So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
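
The equation referred to (an image on the slide) is the book's mixture model, P(d|q) ∝ P(d) · Π_{t in q} [ λ·P(t|M_d) + (1 − λ)·P(t|M_c) ]; a minimal Python sketch of the query-likelihood part, with toy counts:

def query_likelihood(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    p = 1.0
    for t in query:
        p_doc = doc_tf.get(t, 0) / doc_len       # P(t | M_d), MLE from the document
        p_coll = coll_tf.get(t, 0) / coll_len    # P(t | M_c), MLE from the collection
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

doc = {"frog": 2, "said": 1, "toad": 1, "likes": 1, "that": 1, "dog": 1}   # toy document counts
coll = {"frog": 10, "said": 30, "toad": 5, "likes": 20, "that": 100, "dog": 15}
print(query_likelihood(["frog", "toad"], doc, doc_len=7, coll_tf=coll, coll_len=1000))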

CS3245 ndash Information Retrieval

Week 12 Web Search
Chapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling
Politeness: need to crawl only the allowed pages and space out the requests for a site over a longer period (hours, days)
Duplicates: need to integrate duplicate detection
Scale: need to use a distributed approach
Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps, at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable.
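
A short numpy sketch of this power iteration (the 3-page transition matrix, with teleporting already folded in, is made up for illustration):

import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                          # start with any distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A                  # one step of the random walk
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next
    return x

A = np.array([[0.02, 0.49, 0.49],       # row-stochastic transition matrix
              [0.49, 0.02, 0.49],
              [0.49, 0.49, 0.02]])
print(pagerank(A).round(3))             # steady-state vector a; entries sum to 1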

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: Given the graph of links, build matrix A. From it, compute a. The pagerank a_i is a scaled number between 0 and 1.

Query processing: Retrieve pages meeting the query. Rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 25: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient

JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1

25

Sec 196

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 26: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo

d2 ldquoJack London traveled to the city of Oaklandrdquo

d3 ldquoJack traveled from Oakland to Londonrdquo

Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)

J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity

26

Sec 196

CS3245 ndash Information Retrieval

A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic

similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo

would be represented as this set of shingles a-rose-is rose-is-a is-a-rose

We define the similarity of two documents as the Jaccard coefficient of their shingle sets

27

Sec 196

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

[Figure: MapReduce data flow. Segment files (splits of the collection) go to parsers in the map phase; the master assigns term partitions (a-b, c-d, …, y-z) to inverters in the reduce phase, which write out the postings.]

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector. Periodically re-index into one main index. Assuming T = total # of postings and n = size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one. Keep smallest (Z0) in memory; larger ones (I0, I1, …) on disk. If Z0 gets too big (> n), write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers.

Lose 4 bytes on term lengths.

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix; store differences only. Used for last k-1 terms in a block of k:
8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a 1◊e 2◊ic 3◊ion

(automat* encodes the common prefix automat; each number is the extra length beyond automat)

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID: computer: 33, 47, 154, 159, 202 …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43 …

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example

docIDs:   824                  829        215406
gaps:                          5          214577
VB code:  00000110 10111000    10000101   00001101 00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
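A sketch of VB encoding and decoding in Python; run on the gaps above, it reproduces exactly these byte patterns:

def vb_encode_number(n):
    """Variable byte encode one gap: 7 payload bits per byte, high bit set only on the final byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                                  # mark the last byte
    return bytes(out)

def vb_decode(stream):
    """Decode a concatenation of VB-encoded numbers."""
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte                      # continuation byte
        else:
            numbers.append(128 * n + (byte - 128))  # final byte of this number
            n = 0
    return numbers

encoded = b"".join(vb_encode_number(g) for g in [824, 5, 214577])
print(encoded.hex(' '))       # 06 b8 85 0d 0c b1  (the bit patterns shown above)
print(vb_decode(encoded))     # [824, 5, 214577]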

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

w(t,d) = (1 + log10 tf(t,d)) × log10(N / df(t))

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_i q_i d_i    for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results, skipping unnecessary documents.
In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
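A rough Python sketch of the term-at-a-time cosine scoring recapped here (a sketch, assuming postings already carry tf-idf weights and doc_length gives each document's vector length):

import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_length, k=10):
    """query_weights: term -> w(t,q); postings: term -> list of (docID, w(t,d)); doc_length: docID -> |d|."""
    scores = defaultdict(float)
    for term, wq in query_weights.items():
        for doc, wd in postings.get(term, []):
            scores[doc] += wq * wd                          # accumulate dot-product contributions
    for doc in scores:
        scores[doc] /= doc_length[doc]                      # normalize by document length (query length is constant)
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])   # top K (docID, score) pairs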

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| ≪ N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.

Think of A as pruning non-contenders.

The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q,d) = g(d) + cosine(q,d)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve, with recall on the x-axis and precision on the y-axis, both running from 0.0 to 1.0.]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A) − P(E)] / [1 − P(E)], where P(A) = proportion of time judges agree, P(E) = what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement
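A worked computation as a sketch (the counts are illustrative, in the style of the textbook example):

# 400 doc/query pairs: both judges say relevant 300 times, both non-relevant 70 times,
# only judge 1 says relevant 20 times, only judge 2 says relevant 10 times.
both_rel, both_non, only_1, only_2 = 300, 70, 20, 10
total = both_rel + both_non + only_1 + only_2                  # 400
p_a = (both_rel + both_non) / total                            # observed agreement = 0.925
p_rel = (2 * both_rel + only_1 + only_2) / (2 * total)         # pooled marginal P(relevant) = 0.7875
p_e = p_rel ** 2 + (1 - p_rel) ** 2                            # chance agreement ≈ 0.665
kappa = (p_a - p_e) / (1 - p_e)
print(round(kappa, 3))                                         # ≈ 0.776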

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

q_m = α q_0 + β · (1/|Dr|) Σ_{dj ∈ Dr} dj − γ · (1/|Dnr|) Σ_{dj ∈ Dnr} dj

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.
α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
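A possible numpy sketch of the Rocchio update; the weight values and vectors are illustrative only:

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of relevant docs and away from the non-relevant centroid."""
    q = alpha * q0
    if len(rel_docs):
        q = q + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q = q - gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q, 0)                        # negative term weights are usually clipped to zero

q0  = np.array([1.0, 0.0, 0.5])                    # original query vector
dr  = np.array([[0.8, 0.4, 0.0], [0.6, 0.6, 0.2]]) # known relevant doc vectors
dnr = np.array([[0.0, 0.9, 0.9]])                  # known non-relevant doc vectors
print(rocchio(q0, dr, dnr))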

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the M × N term-document matrix and w_ij = (normalized) weight for (t_i, d_j).

For each t_i, pick terms with high values in C

Sec 923

In NLTK Did you forget

84
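A toy numpy sketch of the idea: with rows of A length-normalized, C = AAᵀ holds cosine-like term-term similarities (the matrix below is made up):

import numpy as np

terms = ["ship", "boat", "voyage", "sweet"]
A = np.array([[1.0, 1.0, 1.0, 0.0],     # hypothetical term-document matrix, rows = terms, columns = docs
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0]])
A = A / np.linalg.norm(A, axis=1, keepdims=True)   # normalize rows
C = A @ A.T                                        # term-term similarity matrix
for i, t in enumerate(terms):
    best = np.argsort(-C[i])[1]                    # most similar other term (index 0 is the term itself)
    print(t, "->", terms[best])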

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

[Figure: an XML document Book(Title: Microsoft, Author: Bill Gates) and its lexicalized subtrees "with words", e.g. Author, Bill, Gates, Microsoft, Author-Bill, Author-Gates, Title-Microsoft, Book-Title-Microsoft, …; each such subtree becomes a dimension of the vector space.]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise

|cq| and |cd| are the number of nodes in the query path and document path, respectively.

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>   (e.g. author/"Bill")

Inverted index (dictionary and postings):
<c1, t> (e.g. author/"Bill"):                  <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> (e.g. title/"Bill"):                   <d2, 0.25> <d3, 0.1> <d12, 0.9>
<c3, t> (e.g. book/author/firstname/"Bill"):   <d3, 0.7> <d6, 0.8> <d9, 0.5>

This example is slightly different from the book.

Context resemblance: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60
(all weights have been normalized)

OK to ignore query vs <c2, t> since CR = 0.0. For query vs <c1, t> and query vs <c3, t>:
if wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(Context Resemblance × Query Term Weight × Document Term Weight)

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR

Chapter 11:
1. Probabilistic Approach to Retrieval (Basic Probability Theory)
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: We can rank documents using the log odds ratios c_t for the terms in the query.

The odds ratio is the ratio of two odds:
(i) the odds of the term appearing if relevant, p_t / (1 − p_t), and
(ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t):

c_t = log [ p_t / (1 − p_t) ] − log [ u_t / (1 − u_t) ]

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q∩d} c_t.

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single, fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
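A rough Python sketch of document-side BM25 scoring (the k3 query-frequency factor discussed above is omitted, as for short queries); the statistics, the simplified idf and the parameter values are illustrative:

import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N, k1=1.5, b=0.75):
    """doc_tf: term -> tf in this doc; df: term -> document frequency; N: number of documents."""
    score = 0.0
    for t in query_terms:
        if t not in doc_tf or t not in df:
            continue
        idf = math.log(N / df[t])                      # simplified idf variant
        tf = doc_tf[t]
        tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * tf_norm
    return score

print(bm25_score(["caesar", "brutus"], doc_tf={"caesar": 3, "brutus": 1}, doc_len=120,
                 avg_doc_len=100, df={"caesar": 40, "brutus": 25}, N=1000))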

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
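The mixture being summarized is P(q|d) ≈ Π_t [ λ P(t|Md) + (1 − λ) P(t|Mc) ]; a sketch of scoring it in log space with maximum-likelihood estimates (the counts below are made up):

import math

def query_log_likelihood(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """log P(q|d) under a Jelinek-Mercer mixture of the document and collection language models."""
    log_p = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len            # P(t | Md), maximum likelihood
        p_coll = coll_tf.get(t, 0) / coll_len         # P(t | Mc), maximum likelihood
        log_p += math.log(lam * p_doc + (1 - lam) * p_coll)
    return log_p

doc = {"revenue": 3, "down": 1}
collection = {"revenue": 500, "down": 800, "up": 900}
print(query_log_likelihood(["revenue", "down"], doc, doc_len=50, coll_tf=collection, coll_len=100000))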

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202
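Python's standard library can interpret rules like the example above; a small sketch with urllib.robotparser, feeding it the text directly instead of fetching it:

from urllib.robotparser import RobotFileParser

robots_lines = """User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_lines)
print(rp.can_fetch("mycrawler", "http://example.com/yoursite/temp/page.html"))      # False: blocked by the * record
print(rp.can_fetch("searchengine", "http://example.com/yoursite/temp/page.html"))   # True: empty Disallow allows everything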

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
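A small numpy sketch of this power iteration on a made-up 3-page transition matrix (teleporting already folded into A, so rows sum to 1):

import numpy as np

A = np.array([[0.05, 0.90, 0.05],     # hypothetical transition probability matrix
              [0.45, 0.10, 0.45],
              [0.05, 0.90, 0.05]])

x = np.array([1.0, 0.0, 0.0])         # start with any distribution
for _ in range(100):                  # multiply by increasing powers of A
    x_next = x @ A
    if np.allclose(x_next, x):        # the product looks stable: steady state reached
        break
    x = x_next
print(np.round(x, 3))                 # PageRank vector a, here [0.25 0.5 0.25]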

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 28: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Fingerprinting We can map shingles to a large integer space

[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m

This doesnrsquot directly help us ndash we are just converting strings to large integers

But it will help us compute an approximation to the actual Jaccard coefficient quickly

28

Sec 196

CS3245 ndash Information Retrieval

Documents as sketches The number of shingles per document is large

difficult to exhaustively compare To make it fast we use a sketch a subset of the

shingles of a document The size of a sketch is say n = 200 and is defined by a

set of permutations π1 π200 Each πi is a random permutation on 12m

The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt

(a vector of 200 numbers)

29

Sec 196

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45
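A small sketch of that linear-time intersection (function and variable names are my own, not from the slides):

```python
def intersect(p1, p2):
    # Walk both sorted postings lists in lockstep; O(len(p1) + len(p2)).
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```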

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results.
How to do it? And where do we place skip pointers?
[Figure: postings list 2 4 8 41 48 64 128 with skip pointers to 41 and 128; postings list 1 2 3 8 11 17 21 31 with skip pointers to 11 and 31]

Sec 23

CS3245 ndash Information Retrieval

Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words.
For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term.
Two-word phrase query-processing is now immediate.

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example
For phrase queries, we use a merge algorithm recursively, at the document level.
Now need to deal with more than just equality.
<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>
Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Week 4: The dictionary and tolerant retrieval
Data Structures for the Dictionary: Hash, Trees
Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic – Soundex
51

CS3245 ndash Information Retrieval

Hash Table
Each vocabulary term is hashed to an integer.
Pros: Lookup is faster than for a tree: O(1)
Cons: No easy way to find minor variants (judgment/judgement); no prefix search; if vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries
How can we handle *'s in the middle of a query term, e.g. co*tion?
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wild-card queries so that the * always occurs at the end.
This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering.

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction
Isolated word correction: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q.
How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. ngram overlap
Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits.
54

Sec 332
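A compact sketch of the (unweighted) Levenshtein edit distance by dynamic programming, as a reminder of the first alternative above:

```python
def edit_distance(a, b):
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(a)][len(b)]

print(edit_distance("judgment", "judgement"))  # 1
```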

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing: No global dictionary - Generate separate dictionary for each block; Don't sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2: Don't sort. Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing: MapReduce Data flow
58
[Figure: splits of the collection are assigned by the master to parsers in the map phase; parsers write segment files partitioned by term range (a-b, c-d, …, y-z); inverters in the reduce phase each collect one partition and write its postings]

CS3245 ndash Information Retrieval

Dynamic Indexing: 2nd simplest approach
Maintain "big" main index; new docs go into "small" (in memory) auxiliary index. Search across both, merge results.
Deletions: Invalidation bit-vector for deleted docs. Filter docs output on a search result by this invalidation bit-vector.
Periodically re-index into one main index. Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times.
Alternatively, we can apply Logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one.
Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write to disk as I0, or merge with I0 (if I0 already exists) as Z1.
Either write merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2.
… etc. (Loop for log levels)
Sec 45
60

CS3245 ndash Information Retrieval

Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k=4.
Need to store term lengths (1 extra byte)
62
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
Save 9 bytes on 3 pointers
Lose 4 bytes on term lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding
Sorted words commonly have a long common prefix – store differences only.
Used for the last k-1 terms in a block of k:
8automata8automate9automatic10automation
63
→ 8automat*a1◊e2◊ic3◊ion
("8automat*" encodes automat; the numbers give the extra length beyond automat)
Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID:
computer: 33, 47, 154, 159, 202 …
Consequence: it suffices to store gaps: 33, 14, 107, 5, 43 …
Hope: most gaps can be encoded/stored with far fewer than 20 bits

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs: 824, 829, 215406
gaps: -, 5, 214577
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001
65
Postings stored as the byte concatenation 00000110 10111000 10000101 00001101 00001100 10110001
Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.
Sec 53
512+256+32+16+8 = 824
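A small sketch of VB encoding/decoding as described above (continuation bit set on the last byte of each gap; the names are my own):

```python
def vb_encode_number(n):
    # Split n into 7-bit groups; set the high bit on the final byte.
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128
    return bytes_

def vb_encode(gaps):
    return [b for g in gaps for b in vb_encode_number(g)]

def vb_decode(bytestream):
    numbers, n = [], 0
    for b in bytestream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

encoded = vb_encode([824, 5, 214577])
print([f"{b:08b}" for b in encoded])  # the six bytes shown on the slide
print(vb_decode(encoded))             # [824, 5, 214577]
```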

CS3245 ndash Information Retrieval

Week 7: Vector space ranking
1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

CS3245 ndash Information Retrieval

tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight:
w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)
Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.
67

Sec 622
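That weight as a one-liner in code, sketched with log base 10 and a weight of 0 when the term is absent (names are my own):

```python
import math

def tf_idf_weight(tf, df, N):
    # w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term does not occur
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

print(tf_idf_weight(tf=3, df=100, N=1_000_000))  # rare term, a few occurrences
```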

CS3245 ndash Information Retrieval

Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents".
Key idea 2: Rank documents according to their proximity to the query in this space.
proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

Cosine for length-normalized vectors
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
cos(q, d) = q · d, for length-normalized q and d

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results; skipping unnecessary documents.
In actual data: dealing with zones and fields, query term proximity.
Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach
Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K.
Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.
73
Sec 711
[Figure: the collection of N docs, the contender set A, and the top K docs]

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score
Consider a simple total score combining cosine relevance and authority:
net-score(q,d) = g(d) + cosine(q,d)
Can use some other linear combination than an equal weighting. Indeed, any function of the two "signals" of user happiness.
Now we seek the top K docs by net score.

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text, e.g. Title, Abstract, References, …

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve
[Figure: a precision-recall curve, with precision on the y-axis and recall on the x-axis, both ranging from 0.0 to 1.0]
78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement
79
Sec 85
Kappa measure: Agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement.
Kappa (K) = [P(A) - P(E)] / [1 - P(E)]
P(A) – proportion of time judges agree
P(E) – what agreement would be by chance
Gives 0 for chance agreement, 1 for total agreement.
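A tiny helper for that formula, sketched with made-up binary relevance judgments and with P(E) computed from the pooled marginals as in the textbook's treatment:

```python
def kappa(judge1, judge2):
    n = len(judge1)
    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n
    # P(E): chance agreement from the pooled proportion of "relevant" judgments
    p_rel = (sum(judge1) + sum(judge2)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

j1 = [1, 1, 0, 1, 0, 0, 1, 1]   # hypothetical judgments, judge 1
j2 = [1, 0, 0, 1, 0, 1, 1, 1]   # hypothetical judgments, judge 2
print(round(kappa(j1, j2), 3))  # about 0.467 for this toy data
```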

CS3245 ndash Information Retrieval

A/B testing
Purpose: Test a single innovation. Prerequisite: You have a large system with many users.
Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation.
Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on first result.
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR
Chapter 10.3 XML IR: Basic XML concepts; Challenges in XML IR; Vector space model for XML IR; Evaluation of XML IR
Chapter 9 Query Refinement:
1. Relevance Feedback (Document Level): Explicit RF – Rocchio (1971); When does it work?; Variants: Implicit and Blind
2. Query Expansion (Term Level): Manual thesaurus; Automatic Thesaurus Generation
81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)
In practice:
Dr = set of known relevant doc vectors
Dnr = set of known irrelevant doc vectors
(Different from Cr and Cnr, as we only get judgments from a few documents)
α, β, γ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
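The equation on the slide does not survive the extraction; the standard Rocchio formula it refers to (from IIR ch. 9), written with the symbols defined above, is:

q_m = α q_0 + β (1/|Dr|) Σ_{d_j ∈ Dr} d_j − γ (1/|Dnr|) Σ_{d_j ∈ Dnr} d_j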

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix (rows t_i, columns d_j) and w_ij = (normalized) weight for (t_i, d_j).
For each t_i, pick terms with high values in C.

Sec 923

In NLTK Did you forget

84
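A sketch of that computation with numpy (the matrix here is a made-up toy; in practice A would be built from the index):

```python
import numpy as np

# Toy term-document matrix A (M terms x N docs), already weighted/normalized.
A = np.array([
    [0.9, 0.0, 0.7, 0.1],   # t0
    [0.8, 0.1, 0.6, 0.0],   # t1
    [0.0, 0.9, 0.0, 0.8],   # t2
])
C = A @ A.T                  # term-term similarity matrix C = A A^T

ti = 0
neighbors = np.argsort(-C[ti])                 # terms most similar to t0 first
print([int(t) for t in neighbors if t != ti])  # candidate thesaurus entries
```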

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
Aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.
85
[Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is mapped to its lexicalized subtrees "with words", e.g. Bill, Gates, Microsoft, Author–Bill, Author–Gates, Title–Microsoft, Book–Title–Microsoft, …]

CS3245 ndash Information Retrieval

Context resemblance
A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:
|cq| and |cd| are the number of nodes in the query path and document path, respectively.
cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
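The CR formula itself is lost in the extraction; the standard definition from IIR ch. 10, consistent with the values in the next example (e.g. CR(c1,c3) = 3/5 = 0.6), is:

CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise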

CS3245 ndash Information Retrieval

SimNoMerge example
87
[Figure: a query context <c1, t> (e.g. author/"Bill") is matched against an inverted index whose dictionary holds contexts <c1, t>, <c2, t> (e.g. title/"Bill"), <c3, t> (e.g. book/author/firstname/"Bill"), each with a postings list of <doc, weight> pairs; all weights have been normalized. This example is slightly different from the book.]
CR(c1,c1) = 1.0
CR(c1,c2) = 0.0
CR(c1,c3) = 0.60
OK to ignore Query vs <c2, t> since CR = 0.0.
If wq = 1.0, then sim(q,d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(Context Resemblance × Query Term Weight × Document Term Weight, summed over Query vs <c1,t> and Query vs <c3,t>)

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11: 1. Probabilistic Approach to Retrieval, Basic Probability Theory; 2. Probability Ranking Principle; 3. Binary Independence Model, BestMatch25 (Okapi)
Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM)
Traditionally used with the PRP.
Assumptions:
"Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g. document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV
We can rank documents using the log odds ratios for the terms in the query, c_t.
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 - u_t):
c_t = log [ p_t (1 - u_t) ] / [ u_t (1 - p_t) ]
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q ∩ d} c_t.
Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:
tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.

Sec 1143

92
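The weighting formula referred to above is lost in the extraction; the standard Okapi BM25 form with query-term weighting (IIR eq. 11.34), using the parameters named on the slide, is:

RSV_d = Σ_{t ∈ q} log(N / df_t) · [ (k1 + 1) tf_td / ( k1 ((1 − b) + b (L_d / L_ave)) + tf_td ) ] · [ (k3 + 1) tf_tq / (k3 + tf_tq) ]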

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case, you build an information retrieval scheme in the exact same way.
Difference: for probabilistic IR, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR
Each document is treated as (the basis for) a language model.
Given a query q, rank documents based on P(d|q):
P(q) is the same for all documents, so ignore. P(d) is the prior – often treated as the same for all d.
But we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
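The mixture-model equation this slide summarizes does not survive the extraction; the standard query-likelihood mixture from IIR ch. 12 (λ the smoothing weight, M_d the document LM, M_c the collection LM) is:

P(d | q) ∝ P(d) · Π_{t ∈ q} [ λ P(t | M_d) + (1 − λ) P(t | M_c) ]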

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994.
Example:
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps, at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a
Algorithm: multiply x by increasing powers of A until the product looks stable.
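A sketch of that power iteration with numpy (A here is a made-up 2-state transition matrix, assumed already teleport-adjusted; the loop is exactly the "multiply by increasing powers" recipe):

```python
import numpy as np

A = np.array([[0.1, 0.9],     # hypothetical Markov transition matrix
              [0.7, 0.3]])    # rows sum to 1

x = np.array([1.0, 0.0])      # start with any distribution
for _ in range(100):
    nxt = x @ A               # one more step of the chain
    if np.allclose(nxt, x):   # product looks stable -> steady state reached
        break
    x = nxt
print(x)                      # the steady-state (PageRank) vector a for this toy chain
```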

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives
You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – A good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldn't work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Google's second price auction
  • Google's second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di),S(dj)) ≅ P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Page 30: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

30

Deriving a sketch element a permutation of the original hashes

document 1 sk document 2 sk

We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2

Sec 196

30

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary
Search Advertising: a type of crowdsourcing. Ask advertisers how much they want to spend; ask searchers how relevant an ad is. Auction mechanics.
Duplicate Detection: represent documents as shingles. Calculate an approximation to the Jaccard coefficient by using random trials.

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via the NUS EduRec System by 20 May 2019.
Get Exam Results earlier via SMS on 3 Jun 2019.
For more information: https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution
Focus on topics covered from Weeks 5 to 12.

Q1 is true/false questions on various topics (20 marks).

Q2–Q9 are calculation / essay questions, topic specific (8~12 marks each). Don't forget your calculator!

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1: Ngram Models
A language model is created based on a collection of text and used to assign a probability to a sequence of words.
Unigram LM: bag of words. Ngram LM: use n−1 tokens as the context to predict the nth token.
Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can be used to do information retrieval, as introduced in Week 11.

41

CS3245 ndash Information Retrieval

Unigram LM with Add-1 smoothing
Not used in practice, but most basic to understand. Idea: add 1 count to all entries in the LM, including those that are not seen.

e.g., assuming |V| = 11 (the total number of word types in both lines):

Aerosmith LM (counts after add-1 smoothing; Total Count 18):
I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1

LadyGaga LM (counts after add-1 smoothing; Total Count 20):
I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2

Q2 (By Probability): "I don't want"
P(Aerosmith) = .11 × .11 × .11 ≈ 1.3E-3
P(LadyGaga) = .15 × .05 × .15 ≈ 1.1E-3
Winner: Aerosmith

42
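A small Python sketch that reproduces the slide's arithmetic, treating the table entries as the already-smoothed counts; the dictionaries are transcribed from the tables, and an unseen query word would simply contribute its add-1 count of 1.

# Counts as given in the two tables above (already include the +1 smoothing).
aerosmith = {'i': 2, "don't": 2, 'want': 2, 'to': 2, 'close': 2, 'my': 2,
             'eyes': 2, 'your': 1, 'love': 1, 'and': 1, 'revenge': 1}   # total 18
ladygaga  = {'i': 3, "don't": 1, 'want': 3, 'to': 1, 'close': 1, 'my': 1,
             'eyes': 1, 'your': 3, 'love': 2, 'and': 2, 'revenge': 2}   # total 20

def prob(counts, query):
    total = sum(counts.values())
    p = 1.0
    for w in query.lower().split():
        p *= counts.get(w, 1) / total   # an unseen word keeps its smoothed count of 1
    return p

query = "i don't want"
print('Aerosmith', prob(aerosmith, query))   # ~1.3E-3
print('LadyGaga ', prob(ladygaga, query))    # ~1.1E-3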

CS3245 ndash Information Retrieval

Week 2: Basic (Boolean) IR
Basic inverted indexes: in-memory dictionary and on-disk postings.
Key characteristic: sorted order for postings.
Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size.

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Doc frequency information is also stored. Why frequency?

Sec 12

44

CS3245 ndash Information Retrieval

The merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Intersection: 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.

34

Sec 13

45
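A direct Python rendering of the merge on the example lists; representing postings as plain sorted lists of docIDs is a simplification.

def intersect(p1, p2):
    # Linear merge of two postings lists sorted by docID: O(x + y).
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]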

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

[Figure: postings 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers to 41 and 128; postings 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers to 11 and 31]

47

Sec 23

CS3245 ndash Information Retrieval

Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words.
For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term. Two-word phrase query-processing is now immediate.

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries, we use a merge algorithm recursively, at the document level. Now we need to deal with more than just equality.

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; …>

Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?

49

Sec 242
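A sketch of the positional extension for a two-word phrase query; the 'to' postings below are made up for illustration, and only the 'be' postings come from the slide.

def phrase_docs(pp1, pp2):
    # pp1, pp2: dicts mapping docID -> sorted positions of two adjacent query terms.
    hits = []
    for doc in sorted(set(pp1) & set(pp2)):
        positions2 = set(pp2[doc])
        if any(pos + 1 in positions2 for pos in pp1[doc]):
            hits.append(doc)
    return hits

to_postings = {1: [3, 17], 2: [1], 4: [190, 429], 5: [362]}    # illustrative
be_postings = {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
               4: [17, 191, 291, 430, 434], 5: [363, 367]}
print(phrase_docs(to_postings, be_postings))   # docs where "to be" occurs: [1, 4, 5]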

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary: Hash, Trees

Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic – Soundex

51

Week 4: The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table
Each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree: O(1).
Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries
How can we handle *'s in the middle of a query term? E.g., co*tion
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wild-card queries so that the * always occurs at the end.
This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering.

53

Sec 32
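A small Python sketch of the permuterm idea; the helper names are illustrative.

def permuterm_keys(term):
    # All rotations of term$; each rotation points back to the original term.
    augmented = term + '$'
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def wildcard_to_key(query):
    # Rotate a query with one '*' so the '*' ends up at the end, e.g. co*tion -> tion$co*
    prefix, suffix = query.split('*')
    return suffix + '$' + prefix + '*'

print(permuterm_keys('hello'))      # hello$, ello$h, llo$he, lo$hel, o$hell, $hello
print(wildcard_to_key('co*tion'))   # 'tion$co*': look up terms whose rotation starts with tion$co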

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits.

54

Sec 332
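A standard dynamic-programming sketch of Levenshtein (edit) distance in Python, with unit costs for insert, delete and substitute.

def edit_distance(s, t):
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # substitute (or copy)
    return dp[m][n]

print(edit_distance('cats', 'fast'))   # 3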

CS3245 ndash Information Retrieval

Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing. Merge sort is effective for disk-based sorting (avoid seeks).
Single-Pass In-Memory Indexing: no global dictionary - generate a separate dictionary for each block; don't sort postings - accumulate postings as they occur.
Distributed indexing using MapReduce. Dynamic indexing: multiple indices, logarithmic merge.

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57
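A compressed sketch of SPIMI-Invert for one block; a real implementation also watches available memory, and the written blocks are merged afterwards (names are illustrative).

from collections import defaultdict

def spimi_invert(token_stream, out_path):
    # One SPIMI block: build a fresh dictionary of postings lists, then sort terms and write.
    index = defaultdict(list)
    for term, doc_id in token_stream:
        postings = index[term]               # new terms are added on the fly, no termID mapping
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)          # postings accumulate as they occur
    with open(out_path, 'w') as f:
        for term in sorted(index):
            f.write(term + ' ' + ' '.join(map(str, index[term])) + '\n')
    return out_path

block = [('caesar', 1), ('brutus', 1), ('caesar', 2), ('noble', 2), ('brutus', 4)]
spimi_invert(iter(block), 'block0.txt')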

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

[Figure: MapReduce data flow. Segment files feed Parsers in the Map phase; the Master assigns tasks; the parsers write partitions a-b, c-d, …, y-z; Inverters in the Reduce phase read one partition each and produce the postings.]

CS3245 ndash Information Retrieval

Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both and merge the results.
Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs output on a search result by this bit-vector.
Periodically re-index into one main index. Assuming T is the total number of postings and n the size of the auxiliary index, we touch each posting up to floor(T/n) times.
Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
Either write merge Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2.
… and so on, looping over the log levels.

Sec. 4.5

60 Information Retrieval
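A sketch of the cascade in Python, using dictionaries as stand-in indexes and the number of terms as a stand-in for the posting budget n; this is illustrative only.

def merge(a, b):
    # Merge two toy {term: sorted list of docIDs} indexes.
    out = dict(a)
    for term, plist in b.items():
        out[term] = sorted(set(out.get(term, [])) | set(plist))
    return out

def maybe_flush(z0, disk, n):
    # disk[i] holds I_i (an index of size ~ n * 2**i) or None.
    # Once Z0 exceeds the budget n, push it down the levels, merging as we go.
    if len(z0) <= n:
        return z0, disk
    carry, level = z0, 0
    while True:
        if level == len(disk):
            disk.append(None)
        if disk[level] is None:
            disk[level] = carry                 # write Z_level to disk as I_level
            break
        carry = merge(disk[level], carry)       # I_level + Z_level -> Z_(level+1)
        disk[level] = None
        level += 1
    return {}, disk                             # fresh, empty in-memory index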

CS3245 ndash Information Retrieval

Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws.
Compression to make the index smaller and faster:
Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding.
Postings compression: gap encoding.

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k = 4.
Need to store term lengths (1 extra byte):

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

62

Sec 52

CS3245 ndash Information Retrieval

Front coding
Sorted words commonly have a long common prefix – store differences only. Used for the last k−1 terms in a block of k:

8automata8automate9automatic10automation → 8automat*a1◊e2◊ic3◊ion

"automat" is encoded once; each following term stores only its extra length beyond "automat".
Begins to resemble general string compression.

Information Retrieval 63

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID:
computer: 33, 47, 154, 159, 202, …
Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …
Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding: Example

docIDs:  824                829        215406
gaps:                       5          214577
VB code: 00000110 10111000  10000101   00001101 00001100 10110001

Postings stored as the byte concatenation 00000110 10111000 10000101 00001101 00001100 10110001.
Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.
(Check: 512 + 256 + 32 + 16 + 8 = 824.)

65

Sec. 5.3
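A Python sketch of VB encoding/decoding that reproduces the bytes above.

def vb_encode_number(n):
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128          # set the continuation bit on the last byte
    return bytes_

def vb_encode(numbers):
    out = []
    for n in numbers:
        out.extend(vb_encode_number(n))
    return out

def vb_decode(bytestream):
    numbers, n = [], 0
    for byte in bytestream:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

print([format(b, '08b') for b in vb_encode([824, 5, 214577])])
print(vb_decode(vb_encode([824, 5, 214577])))   # [824, 5, 214577]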

CS3245 ndash Information Retrieval

Week 7: Vector space ranking
1. Represent the query as a weighted tf-idf vector.
2. Represent each document as a weighted tf-idf vector.
3. Compute the cosine similarity score for the query vector and each document vector.
4. Rank documents with respect to the query by score.
5. Return the top K (e.g., K = 10) to the user.

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf × idf.
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.

67

Sec. 6.2.2
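The same weight as a small Python function (log base 10; weight 0 for an absent term).

import math

def tf_idf_weight(tf, df, N):
    # w(t,d) = (1 + log10 tf) * log10(N / df), and 0 if the term is absent.
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# e.g. a term occurring 3 times in a doc, appearing in 100 of 1,000,000 docs:
print(tf_idf_weight(tf=3, df=100, N=1_000_000))   # (1 + 0.477) * 4 ≈ 5.91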

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

Cosine for length-normalized vectors
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
cos(q, d) = q · d, for length-normalized q and d.

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results; skipping unnecessary documents.
In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
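The algorithm figure did not survive extraction; here is a hedged term-at-a-time sketch with accumulators, assuming a postings dict and precomputed document lengths (the names are illustrative, not from the textbook code).

def cosine_scores(query_terms, postings, doc_lengths, K=10):
    # query_terms: term -> query weight; postings: term -> list of (docID, document weight);
    # doc_lengths: docID -> vector length. Accumulate, normalize, then pick the top K.
    scores = {}
    for term, w_tq in query_terms.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * w_td
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:K]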

CS3245 ndash Information Retrieval

Generic approach
Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but it has many docs from among the top K. Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.

73

Sec. 7.1.1

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score
Consider a simple total score combining cosine relevance and authority:

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve; precision on the y-axis and recall on the x-axis, both running from 0.0 to 1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Kappa measure: an agreement measure among judges, designed for categorical judgments; it corrects for chance agreement.

Kappa (κ) = [P(A) − P(E)] / [1 − P(E)]

P(A) – proportion of the time the judges agree
P(E) – what agreement would be by chance
Gives 0 for chance agreement, 1 for total agreement.

Information Retrieval 79

Sec. 8.5
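A small sketch of the computation for two judges with binary relevance labels, using pooled marginals for P(E) (one common convention; others exist).

def kappa(judgments_a, judgments_b):
    pairs = list(zip(judgments_a, judgments_b))
    n = len(pairs)
    p_agree = sum(a == b for a, b in pairs) / n
    # Pooled marginals: chance that both say "relevant" or both say "nonrelevant".
    p_rel = (sum(judgments_a) + sum(judgments_b)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
b = [1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
print(kappa(a, b))   # raw agreement 0.8, corrected for chance to ~0.58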

CS3245 ndash Information Retrieval

A/B testing
Purpose: test a single innovation. Prerequisite: you have a large system with many users.
Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result.
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.

80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3: XML IR. Basic XML concepts, challenges in XML IR, a vector space model for XML IR, evaluation of XML IR.

Chapter 9: Query Refinement
1. Relevance Feedback (document level): explicit RF – Rocchio (1971); when does it work?; variants: implicit and blind.
2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation.

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

q_m = α q_0 + β (1/|Dr|) Σ_{dj ∈ Dr} d_j − γ (1/|Dnr|) Σ_{dj ∈ Dnr} d_j

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. These differ from Cr and Cnr, as we only get judgments from a few documents.
α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton)
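A sketch of the update in Python; the values α = 1.0, β = 0.75, γ = 0.15 are illustrative defaults, not prescribed by the slide.

import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr); negative weights clipped to 0.
    qm = alpha * q0
    if len(relevant):
        qm = qm + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        qm = qm - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(qm, 0.0)

q0  = np.array([1.0, 0.0, 0.0, 1.0])
dr  = np.array([[0.9, 0.1, 0.0, 0.8], [0.7, 0.0, 0.2, 0.9]])
dnr = np.array([[0.0, 0.9, 0.8, 0.1]])
print(rocchio(q0, dr, dnr))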

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
The simplest way to compute one is based on term-term similarities in C = AA^T, where A is the M × N term-document matrix (rows t_i, columns d_j) and w_ij is the (normalized) weight for (t_i, d_j).
For each t_i, pick the terms with high values in C.

In NLTK: did you forget?

Sec. 9.2.3

84

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
Aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees ("with words").

85

[Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is mapped to its lexicalized subtrees, e.g. Microsoft, Bill Gates, Title/Microsoft, Author/Bill, Author/Gates, Book/Title/Microsoft, …]

CS3245 ndash Information Retrieval

Context resemblance
A simple measure of the similarity of a path cq in a query and a path cd in a document is the context resemblance function CR (see the formula after this slide).
|cq| and |cd| are the number of nodes in the query path and document path, respectively.
cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
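The function itself did not survive extraction; its standard form (IIR 10.3) is:

\mathrm{CR}(c_q, c_d) = \begin{cases} \dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\ 0 & \text{otherwise} \end{cases}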

CS3245 ndash Information Retrieval

SimNoMerge example

Query: <c1, t>, e.g. author/"Bill" with query term weight wq.

Inverted index (dictionary → postings):
<c1, t> (e.g. author/"Bill")                → <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> (e.g. title/"Bill")                 → <d2, 0.25> <d3, 0.1> <d12, 0.09>
<c3, t> (e.g. book/author/firstname/"Bill") → <d3, 0.7> <d6, 0.8> <d9, 0.5>

This example is slightly different from the book. All weights have been normalized.

Query vs <c1, t>: CR(c1, c1) = 1.0
Query vs <c2, t>: CR(c1, c2) = 0.0 (OK to ignore, since CR = 0.0)
Query vs <c3, t>: CR(c1, c3) = 0.60

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each product is Context Resemblance × Query Term Weight × Document Term Weight)

87

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval; Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model; BestMatch25 (Okapi)
Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM)
Traditionally used with the PRP.
Assumptions:
"Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g., document d is represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV
We can rank documents using the log odds ratios for the terms in the query, c_t.
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 − u_t):

c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q ∩ d} c_t.
Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.

Sec 1143

92
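A sketch of the full scoring function in Python; this is one common formulation of BM25 with query-term weighting, and the exact idf variant differs across presentations.

import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, N, df, k1=1.5, b=0.75, k3=1.5):
    # query_terms: term -> tf in the query; doc_tf: term -> tf in the document;
    # N: collection size; df: term -> document frequency.
    score = 0.0
    for term, tf_tq in query_terms.items():
        tf_td = doc_tf.get(term, 0)
        if tf_td == 0 or term not in df:
            continue
        idf = math.log10(N / df[term])
        doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
        score += idf * doc_part * query_part
    return score

q = {'information': 1, 'retrieval': 2}
print(bm25_score(q, {'information': 3, 'retrieval': 1}, doc_len=120,
                 avg_doc_len=100, N=10000, df={'information': 800, 'retrieval': 200}))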

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
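The equation did not survive extraction; the mixture referred to here is the Jelinek-Mercer form used in IIR 12.2.2, with λ as the smoothing weight:

P(q \mid d) \propto P(d) \prod_{t \in q} \bigl( \lambda\, P(t \mid M_d) + (1 - \lambda)\, P(t \mid M_c) \bigr)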

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994.
Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution, say x = (1, 0, …, 0).
2. After one step, we're at xA.
3. After two steps, at xA^2, then xA^3 and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.
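A sketch of this power iteration in Python, assuming A is already the teleport-adjusted, row-stochastic transition matrix; the 2×2 matrix is a toy example.

import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    # Multiply x by increasing powers of A until the product stops changing.
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                       # start with any distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

A = np.array([[0.1, 0.9],            # toy 2-page web, teleportation already folded in
              [0.5, 0.5]])
print(pagerank(A))                    # steady-state vector a ≈ [0.357, 0.643]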

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use.
NLTK – a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 ndash Information Retrieval

103  Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 31: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Proof that J(S(di)s(dj)) cong P(xi = xj)

Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise

J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)

31

0 1

1 0

1 1

0 0

1 1

0 1

0 1

0 0

1 1

1 0

1 1

0 1

1 1

0 0

0 1

1 1

1 0

0 1

1 0

0 1

0 1

1 1

1 1

0 0

0 0

1 1

0 1

1 1

0 1

1 0

hellip

A

J(sisj) = 25

Si Sj

π(1)Si Sj Si Sj Si Sj Si Sj

π(2) π(3) π(n=200)

We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in

set Sj is present Permutation π(n) is a random

reordering of the rows in A xk is the first non-zero entry

in π(dk) ie first shingle present in document k

π π

π

xinexj xinexj xi=xjxi=xj

Sec 196

CS3245 ndash Information Retrieval

Estimating Jaccard Thus the proportion of successful permutations is the

Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200

kn = k200 estimates J(d1 d2)32

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 32: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Estimating Jaccard: Thus the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s)

Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial

Estimator of probability of success: proportion of successes in n Bernoulli trials (n = 200)

Our sketch is based on a random selection of permutations

Thus, to compute Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200

k/n = k/200 estimates J(d1, d2)

32
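To make the estimator concrete, here is a rough Python sketch; the shingle size, the use of Python's built-in hash with a random seed as a stand-in for a random permutation, and the example strings are illustrative assumptions, not the lecture's exact setup.

import random

def shingles(text, n=3):
    # word n-grams as the shingle set (n = 3 is an illustrative choice)
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def sketch(shingle_set, seeds):
    # one min-hash value per "permutation"; each permutation is simulated
    # by hashing the shingle together with a random seed
    return [min(hash((seed, s)) for s in shingle_set) for seed in seeds]

def estimate_jaccard(sk1, sk2):
    # proportion of successful permutations (matching minima) = k / n
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

random.seed(0)
seeds = [random.getrandbits(32) for _ in range(200)]   # n = 200 Bernoulli trials
d1 = shingles("the quick brown fox jumps over the lazy dog again and again")
d2 = shingles("the quick brown fox jumped over a lazy dog again and again")
print(estimate_jaccard(sketch(d1, seeds), sketch(d2, seeds)))   # approximates J(d1, d2)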

Sec 196

CS3245 ndash Information Retrieval

Shingling Summary: Input: N documents. Choose n-gram size for shingling, e.g., n = 5. Pick 200 random permutations, represented as hash functions

Compute N sketches: 200 × N matrix shown on the previous slide, one row per permutation, one column per document

Compute pairwise similarities. Transitive closure of documents with similarity > θ. Index only one document from each equivalence class

33

Sec 196

CS3245 ndash Information Retrieval

Summary: Search Advertising: a type of crowdsourcing. Ask advertisers how much they want to spend; ask searchers how relevant an ad is. Auction mechanics

Duplicate Detection: Represent documents as shingles. Calculate an approximation to the Jaccard by using random trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation / essay questions, topic specific (8~12 marks each). Don't forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability): "I don't want"
P(Aerosmith) = 0.11 × 0.11 × 0.11 = 1.3E-3
P(LadyGaga) = 0.15 × 0.05 × 0.15 = 1.1E-3
Winner: Aerosmith

42

Add-1 smoothed counts, Aerosmith line (Total Count 18):
I 2 | don't 2 | want 2 | to 2 | close 2 | my 2 | eyes 2 | your 1 | love 1 | and 1 | revenge 1

Add-1 smoothed counts, LadyGaga line (Total Count 20):
I 3 | don't 1 | want 3 | to 1 | close 1 | my 1 | eyes 1 | your 3 | love 2 | and 2 | revenge 2

|V| = 11 = total # of word types in both lines
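A minimal Python sketch of the same calculation; the two token lists below are hypothetical stand-ins for the lyric lines behind the tables, chosen only so that the totals (18 and 20, with |V| = 11) come out as on the slide.

from collections import Counter

def add_one_unigram_lm(tokens, vocab):
    # P(w) = (count(w) + 1) / (len(tokens) + |V|)
    counts = Counter(tokens)
    denom = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / denom for w in vocab}

def query_probability(lm, query_tokens):
    p = 1.0
    for w in query_tokens:
        p *= lm[w]
    return p

line_a = "i dont want to close my eyes".split()                # hypothetical tokens
line_b = "i want your love and i want your revenge".split()    # hypothetical tokens
vocab = set(line_a) | set(line_b)                              # |V| = 11
lm_a = add_one_unigram_lm(line_a, vocab)
lm_b = add_one_unigram_lm(line_b, vocab)
query = "i dont want".split()
print(query_probability(lm_a, query), query_probability(lm_b, query))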

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing: Intersection by linear time "merging"

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequency information is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

Brutus:  2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar:  1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Merged (Brutus AND Caesar):  2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID

Sec 13

45
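The linear-time merge, as a short Python sketch (the Brutus/Caesar lists are the ones in the figure above):

def intersect(p1, p2):
    # walk the two sorted postings lists simultaneously: O(x + y)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]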

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

47

[Figure: two postings lists with skip pointers, 2 → 4 → 8 → 41 → 48 → 64 → 128 with skips to 41 and 128, and 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skips to 11 and 31]

Sec 23
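A hedged sketch of how the merge can follow skip pointers; the sqrt-spaced placement and the dict representation of skips are illustrative choices, not the only option.

import math

def add_skips(postings):
    # place evenly spaced skip pointers, roughly sqrt(len) apart: index -> target index
    step = int(math.sqrt(len(postings))) or 1
    return {i: i + step for i in range(0, len(postings) - step, step)}

def intersect_with_skips(p1, p2):
    s1, s2 = add_skips(p1), add_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            # take a skip only if it does not overshoot the other pointer
            i = s1[i] if i in s1 and p1[s1[i]] <= p2[j] else i + 1
        else:
            j = s2[j] if j in s2 and p2[s2[j]] <= p1[i] else j + 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128], [1, 2, 3, 8, 11, 17, 21, 31]))   # [2, 8]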

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase: bigram model using words. For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans", "romans countrymen"

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; …>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50
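As a rough sketch of that pipeline in Python with NLTK (the toolkit used in this module); it assumes the usual NLTK data packages (punkt, stopwords) have been downloaded, and the exact choices here (Porter stemming, dropping non-alphabetic tokens) are illustrative.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def choose_terms(text):
    tokens = nltk.word_tokenize(text)                     # tokenization
    tokens = [t.lower() for t in tokens if t.isalpha()]   # normalization (case folding)
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop]         # stop word removal
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]              # stemming

print(choose_terms("Friends, Romans, countrymen, lend me your ears"))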

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table: Each vocabulary term is hashed to an integer. Pros: Lookup is faster than for a tree: O(1)

Cons: No easy way to find minor variants:

judgment / judgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term? co*tion

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive

The solution: transform wild-card queries so that the * always occurs at the end

This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering

53

Sec 32
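A small Python sketch of the permuterm idea; the dictionary terms below are made up for illustration.

def permuterm_keys(term):
    # every rotation of term + '$' points back to the original term
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm_index(terms):
    index = {}
    for term in terms:
        for key in permuterm_keys(term):
            index[key] = term
    return index

def wildcard_lookup(index, query):
    # co*tion  ->  rotate so the * is at the end: tion$co*  ->  prefix search on 'tion$co'
    pre, post = query.split("*", 1)
    prefix = post + "$" + pre
    return sorted({t for key, t in index.items() if key.startswith(prefix)})

idx = build_permuterm_index(["cognition", "coalition", "caution", "cotton"])
print(wildcard_lookup(idx, "co*tion"))   # ['coalition', 'cognition']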

CS3245 ndash Information Retrieval

Spelling correction: Isolated word correction: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:

1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. ngram overlap

Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time and suggest the one with most hits

54

Sec 332
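The classic dynamic-programming recurrence for Levenshtein distance, as a short Python sketch with unit costs:

def edit_distance(a, b):
    # d[i][j] = edit distance between a[:i] and b[:j]
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or copy
    return d[m][n]

print(edit_distance("judgment", "judgement"))   # 1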

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing: No global dictionary - Generate separate dictionary for each block. Don't sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2: Don't sort. Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain "big" main index. New docs go into "small" (in memory) auxiliary index. Search across both, merge results. Deletions: Invalidation bit-vector for deleted docs. Filter docs output on a search result by this invalidation bit-vector. Periodically re-index into one main index

Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively we can apply logarithmic merge, in which each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6 Index Compression: Collection and vocabulary statistics: Heaps' and Zipf's laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking: Store pointers to every kth term string. Example below: k=4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers

Lose 4 bytes on term lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have long common prefix – store differences only. Used for last k-1 terms in a block of k

8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a1◊e2◊ic3◊ion

("8automat*" encodes automat; each digit gives the extra length beyond automat)

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression, Postings file entry: We store the list of docs containing a term in increasing order of docID: computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:  824                  829        215406
gaps:                         5          214577
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable

For a small gap (5), VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
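The same example run through a small Python sketch of gap computation plus VB encoding and decoding:

def vb_encode_number(n):
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128          # set the continuation bit on the final byte
    return bytes_

def vb_encode(gaps):
    out = []
    for g in gaps:
        out.extend(vb_encode_number(g))
    return out

def vb_decode(bytestream):
    numbers, n = [], 0
    for b in bytestream:
        if b < 128:
            n = 128 * n + b            # continuation byte
        else:
            n = 128 * n + (b - 128)    # final byte of this gap
            numbers.append(n)
            n = 0
    return numbers

def to_gaps(docids):
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

gaps = to_gaps([824, 829, 215406])                 # [824, 5, 214577]
encoded = vb_encode(gaps)
print([format(b, "08b") for b in encoded])         # matches the bit patterns above
print(vb_decode(encoded))                          # [824, 5, 214577]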

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g., K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection

67

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space: they are "mini-documents". Key idea 2: Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

cos(q, d) = q · d = Σ_i q_i d_i   for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69
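A compact Python sketch putting the two slides together: tf-idf weights, length normalization, and cosine as a dot product. The three toy documents are illustrative, and idf is applied to both query and document here purely for brevity (the weighting variants on the next slide differ).

import math
from collections import Counter

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}
N = len(docs)
df = Counter(t for d in docs.values() for t in set(d.split()))   # document frequencies

def tfidf_vector(text):
    tf = Counter(text.split())
    w = {t: (1 + math.log10(c)) * math.log10(N / df.get(t, 1)) for t, c in tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0       # length normalization
    return {t: x / norm for t, x in w.items()}

def cosine(v1, v2):
    return sum(v1[t] * v2.get(t, 0.0) for t in v1)                # dot product of unit vectors

query = tfidf_vector("home sales july")
scores = {d: cosine(query, tfidf_vector(text)) for d, text in docs.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))              # ranked by score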

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search Systems: Making the Vector Space Model more efficient to compute: Approximating the actual correct results; Skipping unnecessary documents. In actual data: dealing with zones and fields, query term proximity

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A

Think of A as pruning non-contenders

The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711

[Figure: the contender set A sits between the top K results and the full collection of N docs]

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting

Indeed, any function of the two "signals" of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a typical sawtooth precision-recall curve; x-axis: Recall from 0.0 to 1.0, y-axis: Precision from 0.0 to 1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (κ) = [P(A) - P(E)] / [1 - P(E)], where P(A) is the proportion of the time the judges agree and P(E) is what the agreement would be by chance

Gives 0 for chance agreement 1 for total agreement
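A quick Python sketch of the calculation for two judges making binary relevance judgments; the pooled-marginals estimate of P(E) is one common convention, and the judgment lists are made up.

def kappa(judgments_a, judgments_b):
    n = len(judgments_a)
    p_agree = sum(a == b for a, b in zip(judgments_a, judgments_b)) / n
    # chance agreement estimated from the judges' pooled marginal statistics
    p_rel = (sum(judgments_a) + sum(judgments_b)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]   # judge A: 1 = relevant, 0 = non-relevant
b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]   # judge B
print(round(kappa(a, b), 3))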

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on first result. Now we can directly see if the innovation does improve user happiness. Probably the evaluation methodology that large search engines trust most

80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents

α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
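A short numpy sketch of the Rocchio update described above; the α, β, γ values and the toy vectors are illustrative, not prescribed ones.

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # move the query toward the centroid of known relevant docs and away
    # from the centroid of known non-relevant docs (all tf-idf vectors)
    q0 = np.asarray(q0, dtype=float)
    centroid_r = np.mean(rel_docs, axis=0) if len(rel_docs) else 0.0
    centroid_nr = np.mean(nonrel_docs, axis=0) if len(nonrel_docs) else 0.0
    qm = alpha * q0 + beta * centroid_r - gamma * centroid_nr
    return np.maximum(qm, 0.0)   # negative term weights are usually clipped to 0

q0 = [1.0, 0.0, 0.0]
dr = [[0.9, 0.4, 0.0], [0.8, 0.5, 0.1]]
dnr = [[0.0, 0.1, 0.9]]
print(rocchio(q0, dr, dnr))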

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: Simplest way to compute one is based on term-term similarities in C = AA^T, where A is the term-document matrix. w_ij = (normalized) weight for (t_i, d_j)

For each t_i, pick terms with high values in C

[Figure: M × N term-document matrix A, rows t_i, columns d_j]

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

[Figure: an XML document with a Book element containing Title "Microsoft Bill Gates" and Author "Bill Gates", mapped to its lexicalized subtrees, e.g., Author/"Bill", Author/"Gates", Title/"Microsoft", Book/Title/"Microsoft", Book/Author/"Bill Gates"; labeled "With words"]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR (as in IIR ch. 10):

CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1, c1) = 1.0

CR(c1, c2) = 0.0

CR(c1, c3) = 0.60

if wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5

All weights have been normalized

e.g., author/"Bill"

e.g., author/"Bill"

e.g., title/"Bill"

e.g., book/author/firstname/"Bill". Context Resemblance, Query Term Weight, Document Term Weight

OK to ignore Query vs <c2, t> since CR = 0.0

Query vs <c1, t>; Query vs <c3, t>

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation

Independence: no association between terms (not true, but works in practice – a naïve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: We can rank documents using the log odds ratios c_t for the terms in the query

The odds ratio is the ratio of two odds:

(i) the odds of the term appearing if relevant ( p_t / (1 − p_t) ), and

(ii) the odds of the term appearing if nonrelevant ( u_t / (1 − u_t) )

c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents

c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q ∩ d} c_t

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tf_{tq}: term frequency in the query q. k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
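For concreteness, a hedged Python sketch of one common BM25 formulation including the long-query factor described above; the idf variant, parameter defaults and data structures are assumptions, not the slide's exact definitions.

import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N,
               k1=1.2, b=0.75, k3=1.5):
    # doc_tf: term -> tf in the document; df: term -> document frequency
    qtf = {}
    for t in query_terms:
        qtf[t] = qtf.get(t, 0) + 1
    score = 0.0
    for t, tfq in qtf.items():
        if t not in doc_tf or t not in df:
            continue
        idf = math.log10(N / df[t])
        tfd = doc_tf[t]
        doc_part = ((k1 + 1) * tfd) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tfd)
        query_part = ((k3 + 1) * tfq) / (k3 + tfq)   # only matters for long queries
        score += idf * doc_part * query_part
    return score

print(bm25_score(["revenue", "down"], {"revenue": 2, "down": 1}, 8, 10, {"revenue": 5, "down": 3}, 20))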

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to "high-quality" documents, e.g., those with high static quality score g(d) (cf. Section 7.1.4)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
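A minimal Python sketch of query-likelihood ranking with the Jelinek-Mercer mixture; λ, the toy documents, and the use of their concatenation as the collection model are illustrative assumptions.

from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    # P(q|d) = product over query terms of ( λ P(t|Md) + (1-λ) P(t|Mc) )
    doc_tokens, coll_tokens = doc.split(), collection.split()
    doc_tf, coll_tf = Counter(doc_tokens), Counter(coll_tokens)
    p = 1.0
    for t in query.split():
        p_doc = doc_tf[t] / len(doc_tokens)
        p_coll = coll_tf[t] / len(coll_tokens)
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

d1 = "xyzzy reports a profit but revenue is down"
d2 = "quorus narrows quarter loss but revenue decreases further"
collection = d1 + " " + d2          # stand-in for the whole collection model
q = "revenue down"
print(query_likelihood(q, d1, collection), query_likelihood(q, d2, collection))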

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a
1. Start with any distribution (say x = (1 0 … 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm multiply x by increasing powers of A until the product looks stable
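The power iteration in a few lines of Python/numpy; the 3-page transition matrix below (teleportation already folded in, rows summing to 1) is a made-up example.

import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    # multiply x by increasing powers of A until the product looks stable
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                       # start with any distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

A = np.array([[0.02, 0.49, 0.49],
              [0.49, 0.02, 0.49],
              [0.02, 0.02, 0.96]])
print(pagerank(A))                   # the steady-state vector a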

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103

Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 33: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash

functions Compute N sketches 200 times N matrix shown on

previous slide one row per permutation one column per document

Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence

class33

Sec 196

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 34: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Summary Search Advertising A type of crowdsourcing Ask advertisers how much they

want to spend ask searchers how relevant an ad is Auction Mechanics

Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random

trials

34

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Get exam results earlier via SMS on 3 Jun 2019: subscribe via the NUS EduRec System by 20 May 2019.
For more information: https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics: anything covered in lecture (through the slides) and the corresponding sections in the textbook.
For sample questions, refer to the tutorials, the midterm, and past year exam papers.
If in doubt, ask on the forum.

38

CS3245 ndash Information Retrieval

Exam Topic Distribution: focus on topics covered from Weeks 5 to 12.
Q1 is true/false questions on various topics (20 marks).
Q2-Q9 are calculation / essay questions, each topic specific (8~12 marks each). Don't forget your calculator.

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models: a language model is created based on a collection of text and used to assign a probability to a sequence of words.
Unigram LM: bag of words. Ngram LM: use the previous n-1 tokens as the context to predict the nth token.
Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can be used to do information retrieval, as introduced in Week 11.

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing: not used in practice, but the most basic to understand. Idea: add 1 to the count of every entry in the LM, including those that are not seen.
e.g., assuming |V| = 11:
Q2 (By Probability), "I don't want":
P(Aerosmith) ≈ 0.11 × 0.11 × 0.11 ≈ 1.3E-3
P(LadyGaga) ≈ 0.15 × 0.05 × 0.15 ≈ 1.1E-3
Winner: Aerosmith

42

Aerosmith (counts after add-1): I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1; Total Count 18
Lady Gaga (counts after add-1): I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2; Total Count 20
(|V| = 11 = total # of word types in both lines)
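A small Python sketch of this computation. It reads the two tables above as counts that already include the add-1 smoothing (they sum to 18 and 20), which is an interpretation of the slide, not something it states explicitly.

# Smoothed (add-1) counts, taken straight from the revision tables above.
counts = {
    "Aerosmith": {"I": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
                  "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1},
    "LadyGaga":  {"I": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
                  "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2},
}

def prob(query, table):
    total = sum(table.values())        # 18 and 20 respectively
    p = 1.0
    for w in query.split():
        p *= table[w] / total          # add-1 already applied to the counts
    return p

query = "I don't want"
for artist, table in counts.items():
    # ~1.37e-03 vs ~1.13e-03 -> Aerosmith wins
    # (the slide rounds each probability to two decimals first, giving 1.3E-3 and 1.1E-3)
    print(artist, prob(query, table))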

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR: basic inverted indexes, with an in-memory dictionary and on-disk postings.
Key characteristic: sorted order for postings.
Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size.

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps, Dictionary & Postings: multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Doc frequency information is also stored. Why frequency?

Sec 12

44

CS3245 ndash Information Retrieval

The merge: walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
Brutus: 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
Caesar: 1 -> 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34
Intersection: 2 -> 8
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID.

Sec 13

45
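A small Python sketch of that linear-time AND merge (a plausible rendering, not the exact pseudocode from the slides):

def intersect(p1, p2):
    # Both postings lists must be sorted by docID.
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]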

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms: faster merging of postings lists with skip pointers.
Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

47

Example: 2 -> 4 -> 8 -> 41 -> 48 -> 64 -> 128, with skip pointers from 2 to 41 and from 41 to 128;
1 -> 2 -> 3 -> 8 -> 11 -> 17 -> 21 -> 31, with skip pointers from 1 to 11 and from 11 to 31.

Sec 23

CS3245 ndash Information Retrieval

Biword indexes: index every consecutive pair of terms in the text as a phrase (a bigram model using words). For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, ...>
Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms: definitely language specific.
In English we worry about tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.

50

CS3245 ndash Information Retrieval

Week 4 The dictionary and tolerant retrieval
Data structures for the dictionary: hash, trees.
Learning to be tolerant:
1. Wildcards: general trees, Permuterm, ngrams redux
2. Spelling correction: edit distance, ngrams re-redux
3. Phonetic: Soundex
51

CS3245 ndash Information Retrieval

Hash Table: each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree, O(1).
Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: how can we handle *'s in the middle of a query term, e.g. co*tion?
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wild-card queries so that the * always occurs at the end.
This gives rise to the Permuterm index. Alternative: k-gram index with postfiltering.

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction. Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. ngram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits.
54

Sec 332
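A compact Python sketch of the unweighted Levenshtein edit distance by dynamic programming, as a refresher; unit costs for insert, delete and substitute are an assumption matching the basic formulation.

def edit_distance(s, t):
    # dp[i][j] = edits needed to turn s[:i] into t[:j]
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(edit_distance("judgement", "judgment"))   # 1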

CS3245 ndash Information Retrieval

Week 5 Index construction. Sort-based indexing: Blocked Sort-Based Indexing; merge sort is effective for disk-based sorting (avoid seeks).
Single-Pass In-Memory Indexing: no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur).
Distributed indexing using MapReduce. Dynamic indexing: multiple indices, logarithmic merge.

55

CS3245 ndash Information Retrieval

BSBI, Blocked sort-based Indexing (sorting with fewer disk seeks): 8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to 4-byte termIDs.
These records are generated as we parse docs. Must now sort 100M 8-byte records by termID. Define a block as ~10M such records; we can easily fit a couple into memory, and will have 10 such blocks for our collection.
Basic idea of the algorithm: accumulate postings for each block, sort, write to disk; then merge the blocks into one long sorted order.

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1: generate separate dictionaries for each block; no need to maintain a term-termID mapping across blocks.
Key idea 2: don't sort; accumulate postings in postings lists as they occur.

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57
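A toy Python sketch of the SPIMI idea under the two key ideas above (per-block dictionary built on the fly, postings accumulated unsorted, terms sorted only when the block is written out); the in-memory layout and block-size test are simplifying assumptions.

from collections import defaultdict

def spimi_invert(token_stream, max_postings=1_000_000):
    # token_stream yields (term, docID) pairs; returns one block's index.
    index = defaultdict(list)          # per-block dictionary, no global termIDs
    n = 0
    for term, doc_id in token_stream:
        index[term].append(doc_id)     # postings are not sorted while indexing
        n += 1
        if n >= max_postings:          # block full: stop and let caller write it out
            break
    return sorted(index.items())       # sort terms once, when writing the block

block = spimi_invert(iter([("caesar", 1), ("brutus", 1), ("caesar", 2)]))
print(block)   # [('brutus', [1]), ('caesar', [1, 2])]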

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

(Figure: MapReduce data flow. Segment files feed parsers in the map phase; each parser writes partitioned output for key ranges a-b, c-d, ..., y-z; inverters in the reduce phase each collect one key range and write the postings; a master coordinates the assignments.)

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index; search across both and merge results. Deletions: keep an invalidation bit-vector for deleted docs and filter the docs output on a search result by it. Periodically re-index into one main index.
Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge. Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, ...) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
Either write the merged Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2.
... and so on.

Sec 45

60Information Retrieval

(Loop for log levels)
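A rough Python sketch of that cascading merge loop (the in-memory threshold, the dict-of-sorted-lists representation, and the merge routine are all simplifying assumptions):

def merge(a, b):
    # Indexes represented as dicts {term: sorted list of docIDs}.
    out = dict(a)
    for term, postings in b.items():
        out[term] = sorted(set(out.get(term, [])) | set(postings))
    return out

class LogarithmicIndexer:
    def __init__(self, n=2):
        self.n = n            # capacity of the in-memory index Z0 (in docs, for simplicity)
        self.z0 = {}          # in-memory index Z0
        self.disk = []        # disk[i] holds I_i, or None if that level is empty
        self.z0_docs = 0

    def add(self, doc_id, terms):
        for t in terms:
            self.z0.setdefault(t, []).append(doc_id)
        self.z0_docs += 1
        if self.z0_docs >= self.n:
            self._flush()

    def _flush(self):
        z, level = self.z0, 0
        while level < len(self.disk) and self.disk[level] is not None:
            z = merge(z, self.disk[level])   # merge with I_level and cascade upward
            self.disk[level] = None
            level += 1
        if level == len(self.disk):
            self.disk.append(None)
        self.disk[level] = z                  # write as I_level
        self.z0, self.z0_docs = {}, 0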

CS3245 ndash Information Retrieval

Week 6 Index Compression: collection and vocabulary statistics (Heaps' and Zipf's laws).
Compression to make the index smaller and faster. Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding.
Postings compression: gap encoding.

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking. Store pointers to every kth term string (example below, k=4). Need to store term lengths (1 extra byte).

62

...7systile9syzygetic8syzygial6syzygy11szaibelyite...
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

Sec 52

CS3245 ndash Information Retrieval

Front coding: sorted words commonly have a long common prefix, so store differences only (used for the last k-1 terms in a block of k).
8automata8automate9automatic10automation -> 8automat*a1◊e2◊ic3◊ion
("8automat*" encodes automat; each number gives the extra length beyond automat, and ◊ marks the omitted prefix.)
Begins to resemble general string compression.
Information Retrieval 63

Sec 52
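A small Python sketch of front coding a sorted block of terms; the *, ◊ markers follow the example above, while the exact on-disk byte layout is an assumption.

import os

def front_code(terms):
    # Front-code a sorted block: full first term, then prefix-stripped rest.
    first = terms[0]
    prefix = first
    for t in terms[1:]:
        prefix = os.path.commonprefix([prefix, t])
    out = [f"{len(first)}{prefix}*{first[len(prefix):]}"]
    for t in terms[1:]:
        suffix = t[len(prefix):]
        out.append(f"{len(suffix)}\u25ca{suffix}")   # ◊ marks the omitted shared prefix
    return "".join(out)

print(front_code(["automata", "automate", "automatic", "automation"]))
# 8automat*a1◊e2◊ic3◊ion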

CS3245 ndash Information Retrieval

Postings Compression, postings file entry: we store the list of docs containing a term in increasing order of docID, e.g. computer: 33, 47, 154, 159, 202, ...
Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, ...
Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs: 824, 829, 215406
gaps: 5, 214577 (the first docID, 824, is encoded directly)
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001
Key property: VB-encoded postings are uniquely prefix-decodable.
For a small gap (5), VB uses a whole byte.

Sec 53

512+256+32+16+8 = 824
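A Python sketch of VB encoding and decoding consistent with the example above (7 payload bits per byte, continuation bit set on the last byte of each number):

def vb_encode_number(n):
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128          # set the high bit on the last (terminating) byte
    return bytes_

def vb_encode(gaps):
    out = []
    for g in gaps:
        out.extend(vb_encode_number(g))
    return out

def vb_decode(stream):
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

code = vb_encode([824, 5, 214577])
print([f"{b:08b}" for b in code])   # 00000110 10111000 10000101 00001101 00001100 10110001
print(vb_decode(code))              # [824, 5, 214577]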

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:
w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)
Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.
Increases with the number of occurrences within a document; increases with the rarity of the term in the collection.
67

Sec 622
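In Python, the weight above can be computed directly (log base 10; treating the weight as 0 when the term is absent is the usual convention, made explicit here as an assumption):

import math

def tf_idf_weight(tf, df, N):
    # w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

print(tf_idf_weight(tf=3, df=50, N=1_000_000))   # ≈ 6.35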

CS3245 ndash Information Retrieval

Queries as vectors. Key idea 1: do the same for queries, representing them as vectors in the space; they are "mini-documents". Key idea 2: rank documents according to their proximity to the query in this space.
proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity
Motivation: want to rank more relevant documents higher than less relevant documents.

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): cos(q, d) = q · d = Σ_i q_i d_i, for length-normalized q and d.

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search Systems: making the Vector Space Model more efficient to compute; approximating the actual correct results; skipping unnecessary documents. In actual data: dealing with zones and fields, and query term proximity.
Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
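The CosineScore pseudocode itself did not survive the export; below is a term-at-a-time Python sketch of the usual accumulator-based computation (precomputed document lengths and a fixed K are assumed inputs):

import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_length, K=10):
    # query_weights: {term: w_{t,q}}; postings: {term: [(docID, w_{t,d}), ...]}
    scores = defaultdict(float)                 # one accumulator per document
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_tq * w_td       # term-at-a-time accumulation
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]    # length normalization
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])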

CS3245 ndash Information Retrieval

Generic approach: find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics. Index elimination: high-idf query terms only; docs containing many query terms.
Champion lists: generalized to high-low lists and tiered indexes.
Impact-ordered postings: early termination, idf-ordered terms.
Cluster pruning.

74

Sec 711

CS3245 ndash Information Retrieval

Net score: consider a simple total score combining cosine relevance and authority:
net-score(q,d) = g(d) + cosine(q,d)
Can use some other linear combination than an equal weighting; indeed, any function of the two "signals" of user happiness.
Now we seek the top K docs by net score.

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: Year = 1601 is an example of a field. A field or parametric index has postings for each field value. Sometimes we build range trees (B-trees), e.g. for dates. A field query is typically treated as a conjunction (doc must be authored by shakespeare).
Zones: a zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References, ... Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation: how do we know if our results are any good? Benchmarks; precision and recall; composite measures; test collections; A/B testing.

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

(Figure: a precision-recall curve, with precision on the y-axis and recall on the x-axis, both from 0.0 to 1.0.)

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: an agreement measure among judges, designed for categorical judgments; corrects for chance agreement.
Kappa (K) = [P(A) - P(E)] / [1 - P(E)], where P(A) is the proportion of the time the judges agree and P(E) is what the agreement would be by chance.
Gives 0 for chance agreement, 1 for total agreement.
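A quick Python rendering of the formula; the two-judge judgment counts in the example are made up for illustration.

def kappa(p_agree, p_chance):
    # K = [P(A) - P(E)] / [1 - P(E)]
    return (p_agree - p_chance) / (1 - p_chance)

# Hypothetical pooled judgments from two judges over 400 doc-query pairs:
# both say relevant 300 times, both say non-relevant 70 times.
p_a = (300 + 70) / 400
# Chance agreement from the marginals (say each judge marks ~80% relevant):
p_e = 0.8 * 0.8 + 0.2 * 0.2
print(round(kappa(p_a, p_e), 3))   # ≈ 0.766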

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.
Have most users use the old system; divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3 XML IR: basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR.
Chapter 9 Query Refinement:
1. Relevance Feedback (document level): explicit RF, Rocchio (1971), when does it work; variants: implicit and blind.
2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation.

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.
α, β, γ = weights (hand-chosen or set empirically).
Popularized in the SMART system (Salton).
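The Rocchio update formula itself was lost in the export; for reference it is q_m = α·q_0 + β·(1/|Dr|)·Σ_{d in Dr} d - γ·(1/|Dnr|)·Σ_{d in Dnr} d. A small NumPy sketch under that formulation (the example vectors and the default weights are illustrative):

import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    q = alpha * np.asarray(q0, dtype=float)
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q -= gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q, 0.0)    # negative term weights are usually clipped to 0

q0  = [1.0, 0.0, 0.5]
Dr  = [[0.8, 0.2, 0.0], [0.6, 0.0, 0.4]]
Dnr = [[0.0, 0.9, 0.1]]
print(rocchio(q0, Dr, Dnr))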

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: the simplest way to compute one is based on term-term similarities in C = AA^T, where A is the M x N term-document matrix and w_ij = (normalized) weight for (t_i, d_j).
For each t_i, pick the terms with the highest values in C.

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea, lexicalized subtrees: aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.

85

(Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is decomposed into lexicalized subtrees "with words", e.g. Title-Microsoft, Author-Bill, Author-Gates, Author-Bill Gates, Book/Title-Microsoft, Book/Author-Bill Gates, ...)

CS3245 ndash Information Retrieval

Context resemblance: a simple measure of the similarity of a path cq in a query and a path cd in a document is the context resemblance function CR.
|cq| and |cd| are the number of nodes in the query path and document path, respectively.
cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
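The CR formula itself was dropped in the export; in the textbook's formulation it is CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise. A short Python sketch under that assumption, with the matching test simplified to node-subsequence containment:

def context_resemblance(cq, cd):
    # cq, cd: lists of node labels along the path (word node included)
    def matches(q, d):
        it = iter(d)
        return all(node in it for node in q)   # q is a subsequence of d
    return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0

print(context_resemblance(["author", "Bill"], ["author", "Bill"]))                      # 1.0
print(context_resemblance(["author", "Bill"], ["book", "author", "firstname", "Bill"])) # 0.6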

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>, e.g. author/"Bill".
Inverted index (dictionary -> postings):
<c1, t> (e.g. author/"Bill") -> <d1, 0.5>, <d4, 0.1>, <d9, 0.2>
<c2, t> (e.g. title/"Bill") -> <d2, 0.25>, <d3, 0.1>, <d12, 0.9>
<c3, t> (e.g. book/author/firstname/"Bill") -> <d3, 0.7>, <d6, 0.8>, <d9, 0.5>
(This example is slightly different from the book. All weights have been normalized.)
CR(c1, c1) = 1.0; CR(c1, c2) = 0.0; CR(c1, c3) = 0.60
OK to ignore Query vs <c2, t> since CR = 0.0.
If wq = 1.0, then sim(q, d9) = (1.0 x 1.0 x 0.2) + (0.6 x 1.0 x 0.5) = 0.5
(each term is Context Resemblance x Query Term Weight x Document Term Weight; the first term is Query vs <c1, t>, the second is Query vs <c3, t>)

CS3245 ndash Information Retrieval

INEX relevance assessments: the relevance-coverage combinations are quantized into graded values.
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of the quantized grades over A.

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IR. Chapter 11:
1. Probabilistic Approach to Retrieval, Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)
Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): traditionally used with the PRP.
Assumptions. Binary (equivalent to Boolean): documents and queries are represented as binary term incidence vectors, e.g. document d represented by vector x = (x1, ..., xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice; a naïve assumption).

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: we can rank documents using the log odds ratios for the terms in the query, c_t.
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 - u_t).
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that RSV_d = Σ c_t over the query terms present in d.
Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25, a nonbinary model: if the query is long, we might also use similar weighting for query terms.
tf_tq: term frequency in the query q. k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.

Sec 1143

92
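The BM25 formula itself did not survive the export; below is a Python sketch of the standard document-side scoring with parameters k1 and b (idf is simplified to log10(N/df), and the length statistics are assumed to be precomputed):

import math

def bm25_score(query_terms, tf, doc_len, avg_doc_len, df, N, k1=1.5, b=0.75):
    # tf: {term: frequency in this document}; df: {term: document frequency}
    score = 0.0
    for t in query_terms:
        if t not in tf or t not in df:
            continue
        idf = math.log10(N / df[t])
        norm_tf = (tf[t] * (k1 + 1)) / (tf[t] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm_tf
    return score

print(bm25_score(["caesar", "brutus"],
                 tf={"caesar": 3, "brutus": 1}, doc_len=120, avg_doc_len=100,
                 df={"caesar": 40, "brutus": 25}, N=10_000))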

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: the difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. The difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q).
P(q) is the same for all documents, so ignore it. P(d) is the prior, often treated as the same for all d; but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d. So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0)).
2. After one step, we're at xA.
3. After two steps we're at xA^2, then xA^3, and so on.
4. "Eventually" means: for large k, xA^k = a.
Algorithm: multiply x by increasing powers of A until the product looks stable.
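A small Python sketch of this power iteration on a toy transition matrix A (the 2-state matrix and the tolerance are illustrative; A is assumed to already include teleportation, so it is row-stochastic):

def pagerank(A, tol=1e-10, max_iters=1000):
    n = len(A)
    x = [1.0] + [0.0] * (n - 1)          # start with any distribution
    for _ in range(max_iters):
        x_next = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]  # x := xA
        if max(abs(a - b) for a, b in zip(x, x_next)) < tol:  # product looks stable
            return x_next
        x = x_next
    return x

A = [[0.1, 0.9],     # row-stochastic transition probabilities
     [0.3, 0.7]]
print(pagerank(A))   # steady state a = (0.25, 0.75)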

CS3245 ndash Information Retrieval

100

Pagerank summary. Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query and rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: you have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future: Python, one of the easiest and most straightforward programming languages to use; and NLTK, a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 35: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

EXAM MATTERS

35

Sec 196

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 36: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

36

Subscribe via NUS EduRec System by 20 May 2019

For more information

Get Exam Results earlier via SMS on 3 Jun 2019

httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht

ml

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps: Dictionary & Postings. Multiple term entries in a single document are merged. Split into Dictionary and Postings. Doc frequency information is also stored. Why frequency?

Sec 12

44

CS3245 ndash Information Retrieval

The merge: Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

Brutus: 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
Caesar: 1 -> 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34
Intersection: 2 -> 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID.

Sec 13

45
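A minimal Python sketch of this linear-time merge (postings are assumed to be plain lists of docIDs in increasing order; names are illustrative):

def intersect(p1, p2):
    # walk both sorted postings lists in lockstep: O(x + y)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]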

CS3245 ndash Information Retrieval

Week 3: Postings lists and Choosing terms. Faster merging of postings lists: skip pointers. Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.

Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

47

[Figure: two postings lists, 2 4 8 41 48 64 128 and 1 2 3 8 11 17 21 31, augmented with skip pointers (e.g. skips landing on 41 and 128 in the first list and on 11 and 31 in the second)]

Sec 23

CS3245 ndash Information Retrieval

Biword indexes: Index every consecutive pair of terms in the text as a phrase (a bigram model using words). For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".

Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries, we use a merge algorithm recursively, at the document level. Now we need to deal with more than just equality.

49

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, ...>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242
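A rough sketch of the positional merge for a two-word phrase. The index layout {term: {docID: [positions]}} and the function name are assumptions made for illustration, not the structure used in the assignments:

def phrase_docs(index, term1, term2):
    # keep docs where term2 occurs at position p+1 for some occurrence p of term1
    hits = []
    common = set(index.get(term1, {})) & set(index.get(term2, {}))
    for doc in sorted(common):
        pos2 = set(index[term2][doc])
        if any(p + 1 in pos2 for p in index[term1][doc]):
            hits.append(doc)
    return hits

idx = {"to": {1: [4, 9], 4: [5]},
       "be": {1: [5, 17], 2: [3], 4: [17]}}
print(phrase_docs(idx, "to", "be"))   # [1]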

CS3245 ndash Information Retrieval

Choosing terms: Definitely language-specific. In English we worry about: tokenization, stop word removal, normalization (e.g. case folding), lemmatization and stemming.

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary: Hash, Trees

Learning to be tolerant: 1. Wildcards: general trees, Permuterm, Ngrams redux. 2. Spelling correction: edit distance, Ngrams re-redux. 3. Phonetic: Soundex.

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table: Each vocabulary term is hashed to an integer. Pros: lookup is faster than for a tree, O(1). Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, we need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term, e.g. co*tion?

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive.

The solution: transform wildcard queries so that the * always occurs at the end. This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering.

53

Sec 32
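A tiny sketch of the Permuterm idea: index every rotation of term + '$', and rotate the query so that the wildcard ends up at the end (illustrative only):

def permuterm_keys(term):
    # all rotations of term + '$' (end-of-word marker)
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(permuterm_keys("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']

# A query X*Y is rotated to Y$X* so the * is at the end:
# co*tion  ->  tion$co*  (a prefix lookup for "tion$co" in the permuterm index)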

CS3245 ndash Information Retrieval

Spelling correction. Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives: 1. edit distance (Levenshtein distance), 2. weighted edit distance, 3. ngram overlap.

Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.

54

Sec 332
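A standard dynamic-programming sketch of Levenshtein distance (unit costs; a weighted edit distance would simply replace the 1's with per-operation costs):

def edit_distance(s, t):
    # dp[i][j] = cost of transforming s[:i] into t[:j]
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                              # delete
                           dp[i][j - 1] + 1,                              # insert
                           dp[i - 1][j - 1] + (s[i - 1] != t[j - 1]))     # substitute
    return dp[m][n]

print(edit_distance("sunday", "saturday"))   # 3
print(edit_distance("cat", "cart"))          # 1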

CS3245 ndash Information Retrieval

Week 5: Index construction. Sort-based indexing: Blocked Sort-Based Indexing; merge sort is effective for disk-based sorting (avoid seeks). Single-Pass In-Memory Indexing: no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur). Distributed indexing using MapReduce. Dynamic indexing: multiple indices, logarithmic merge.

55

CS3245 ndash Information Retrieval

BSBI: Blocked sort-based indexing (sorting with fewer disk seeks). 8-byte (4+4) records (termID, docID); as terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes. These are generated as we parse docs. Must now sort 100M 8-byte records by termID. Define a block as ~10M such records; we can easily fit a couple into memory and will have 10 such blocks for our collection.

Basic idea of the algorithm: accumulate postings for each block, sort, write to disk; then merge the blocks into one long sorted order.

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI: Single-pass in-memory indexing.

Key idea 1: generate separate dictionaries for each block (no need to maintain the term-termID mapping across blocks). Key idea 2: don't sort; accumulate postings in postings lists as they occur.

With these two ideas we can generate a complete inverted index for each block. These separate indices can then be merged into one big index.

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

[Figure: MapReduce data flow for distributed indexing. Splits of the collection feed parsers in the map phase; parsers emit segment files partitioned by term range (a-b, c-d, ..., y-z); in the reduce phase, each inverter collects one term-range partition and writes the final postings; a master node assigns the tasks.]

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index; search across both and merge results. Deletions: keep an invalidation bit-vector for deleted docs, and filter docs output on a search result by this invalidation bit-vector. Periodically re-index into one main index. Assuming T total postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge. Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, ...) on disk. If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1. Either write the merged Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2, ... etc.

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6: Index Compression. Collection and vocabulary statistics: Heaps' and Zipf's laws. Compression to make the index smaller and faster. Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding. Postings compression: gap encoding.

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking. Store pointers to every kth term string (example below: k = 4). Need to store term lengths (1 extra byte).

62

...7systile9syzygetic8syzygial6syzygy11szaibelyite...

Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix; store differences only. Used for the last k-1 terms in a block of k.

8automata8automate9automatic10automation

Information Retrieval 63

-> 8automat*a1⋄e2⋄ic3⋄ion (encodes automat*; each number gives the extra length beyond automat)

Begins to resemble general string compression.

Sec 52
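A rough sketch of front coding one block of k sorted terms, along the lines of the automat example above (the ⋄ marker is written as '|' in the code; the exact on-disk format is an assumption for illustration):

import os

def front_code_block(terms):
    # first term is written in full, with '*' closing the shared prefix;
    # the remaining k-1 terms store only the length and suffix after that prefix
    prefix = os.path.commonprefix(terms)
    first = terms[0]
    out = [f"{len(first)}{prefix}*{first[len(prefix):]}"]
    for t in terms[1:]:
        suffix = t[len(prefix):]
        out.append(f"{len(suffix)}|{suffix}")
    return "".join(out)

print(front_code_block(["automata", "automate", "automatic", "automation"]))
# 8automat*a1|e2|ic3|ion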

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry. We store the list of docs containing a term in increasing order of docID: computer: 33, 47, 154, 159, 202, ... Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, ... Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example

docIDs: 824, 829, 215406
gaps: 824 (the first entry is the docID itself), 5, 214577
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable. For a small gap (5), VB uses a whole byte.

Sec 53

512+256+32+16+8 = 824
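A sketch of VB encoding and decoding following the convention above (7 payload bits per byte; the continuation bit is set to 1 on the last byte of each number):

def vb_encode_number(n):
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128          # mark the final byte of this number
    return bytes_

def vb_encode(numbers):
    return [b for n in numbers for b in vb_encode_number(n)]

def vb_decode(bytestream):
    numbers, n = [], 0
    for b in bytestream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

code = vb_encode([824, 5, 214577])
print([format(b, "08b") for b in code])
# ['00000110', '10111000', '10000101', '00001101', '00001100', '10110001']
print(vb_decode(code))   # [824, 5, 214577]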

CS3245 ndash Information Retrieval

Week 7: Vector space ranking

1. Represent the query as a weighted tf-idf vector.
2. Represent each document as a weighted tf-idf vector.
3. Compute the cosine similarity score for the query vector and each document vector.
4. Rank documents with respect to the query by score.
5. Return the top K (e.g. K = 10) to the user.

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf. The weight increases with the number of occurrences within a document, and increases with the rarity of the term in the collection.

67

Sec 622
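For instance, with the weighting above (numbers chosen for illustration, not from the slides): if a term occurs tf_{t,d} = 10 times in a document and appears in df_t = 1,000 out of N = 1,000,000 documents, then w_{t,d} = (1 + log10 10) × log10(1,000,000 / 1,000) = 2 × 3 = 6.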

CS3245 ndash Information Retrieval

Queries as vectors. Key idea 1: Do the same for queries; represent them as vectors in the space (they are "mini-documents"). Key idea 2: Rank documents according to their proximity to the query in this space.

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation: we want to rank more relevant documents higher than less relevant documents.

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_i q_i d_i, for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems. Making the Vector Space Model more efficient to compute: approximating the actual correct results; skipping unnecessary documents. In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
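As a revision aid, a hedged Python sketch of the usual term-at-a-time accumulator scheme behind this slide; the index layout {term: [(docID, tf), ...]}, the doc_lengths mapping, and the simple tf-idf weighting used on both query and document sides are assumptions for illustration:

import heapq
import math

def cosine_scores(query_terms, index, doc_lengths, N, K=10):
    scores = {}                                    # one accumulator per candidate doc
    for t in set(query_terms):
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log10(N / len(postings))
        w_tq = query_terms.count(t) * idf          # query term weight
        for doc, tf in postings:
            w_td = (1 + math.log10(tf)) * idf      # document term weight
            scores[doc] = scores.get(doc, 0.0) + w_tq * w_td
    for doc in scores:
        scores[doc] /= doc_lengths[doc]            # length normalization
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])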

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A. Think of A as pruning non-contenders. The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics: index elimination (high-idf query terms only; docs containing many query terms); champion lists (generalized to high-low lists and a tiered index); impact-ordered postings (early termination, idf-ordered terms); cluster pruning.

74

Sec 711

CS3245 ndash Information Retrieval

Net score: Consider a simple total score combining cosine relevance and authority:

net-score(q, d) = g(d) + cosine(q, d)

We can use some other linear combination than an equal weighting; indeed, any function of the two "signals" of user happiness. Now we seek the top K docs by net score.

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: Year = 1601 is an example of a field. A field or parametric index has postings for each field value. Sometimes we build range trees (B-trees), e.g. for dates. A field query is typically treated as a conjunction (doc must be authored by shakespeare).

Zones: A zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References, ... Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9: IR Evaluation. How do we know if our results are any good? Benchmarks; precision and recall; composite measures; test collections; A/B testing.

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve, with precision (y-axis, 0.0 to 1.0) plotted against recall (x-axis, 0.0 to 1.0)]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: an agreement measure among judges, designed for categorical judgments; it corrects for chance agreement.

Kappa (K) = [P(A) - P(E)] / [1 - P(E)], where P(A) is the proportion of the time the judges agree and P(E) is what the agreement would be by chance.

Gives 0 for chance agreement, 1 for total agreement.
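A quick check of the kappa formula on a hypothetical 2x2 table of judgments (the counts are made up for illustration; P(E) uses pooled marginals):

# rows: judge 1 (relevant / non-relevant); columns: judge 2
table = [[300, 20],
         [10, 70]]
total = sum(sum(row) for row in table)                     # 400 judged pairs
p_agree = (table[0][0] + table[1][1]) / total              # P(A) = 370/400
# pooled marginal probability of "relevant", used for chance agreement
p_rel = (sum(table[0]) + table[0][0] + table[1][0]) / (2 * total)
p_chance = p_rel ** 2 + (1 - p_rel) ** 2                   # P(E)
kappa = (p_agree - p_chance) / (1 - p_chance)
print(round(kappa, 3))                                     # ~0.776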

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result. Now we can directly see whether the innovation does improve user happiness. Probably the evaluation methodology that large search engines trust most.

80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR

Chapter 10.3 XML IR: basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR.

Chapter 9 Query Refinement: 1. Relevance Feedback (document level): explicit RF, Rocchio (1971), when does it work?, variants: implicit and blind. 2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation.

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors (different from Cr and Cnr, as we only get judgments from a few documents); α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton)
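The standard Rocchio update is q_m = α·q_0 + β·(1/|Dr|)·Σ_{d in Dr} d - γ·(1/|Dnr|)·Σ_{d in Dnr} d. A small numpy sketch under that assumption (the vectors and the weights α = 1.0, β = 0.75, γ = 0.15 are illustrative):

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    q = alpha * q0
    if len(rel_docs):
        q = q + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q = q - gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q, 0.0)       # negative term weights are usually clipped to 0

q0 = np.array([1.0, 0.0, 0.5])                       # original query vector
Dr = np.array([[0.8, 0.2, 0.0], [0.6, 0.4, 0.1]])    # known relevant doc vectors
Dnr = np.array([[0.0, 0.9, 0.0]])                    # known irrelevant doc vectors
print(rocchio(q0, Dr, Dnr))                          # [1.525 0.09 0.5375]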

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents.

In query expansion, users give additional input (good / bad search term) on words or phrases.

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: The simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix (rows t_i, columns d_j) and w_ij is the (normalized) weight for (t_i, d_j). For each t_i, pick the terms with high values in C.

Sec 923

In NLTK Did you forget

84
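A minimal numpy sketch of this construction: build the term-document matrix A, compute C = A·Aᵀ, and read off each term's nearest neighbour (the toy matrix and terms are illustrative):

import numpy as np

terms = ["car", "automobile", "banana"]
# rows = terms, columns = documents (weights would normally be normalized tf-idf)
A = np.array([[0.7, 0.9, 0.0, 0.1],
              [0.6, 0.8, 0.0, 0.0],
              [0.0, 0.0, 0.9, 0.8]])
C = A @ A.T                        # term-term similarity matrix
np.fill_diagonal(C, 0.0)           # ignore self-similarity
for i, t in enumerate(terms):
    print(t, "->", terms[int(np.argmax(C[i]))])
# car -> automobile, automobile -> car (banana has no strong neighbour in this toy data)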

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees. Aim to have each dimension of the vector space encode a word together with its position within the XML tree. How? Map XML documents to lexicalized subtrees.

85

[Figure: an XML document, Book with Title "Microsoft" and Author "Bill Gates", decomposed into lexicalized subtrees "with words", e.g. Microsoft, Bill, Gates, Title/Microsoft, Author/Bill, Author/Gates, Author/Bill Gates, Book/Title/Microsoft, ...]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR. |cq| and |cd| are the number of nodes in the query path and document path, respectively. cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
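The standard definition of the CR function (IIR Ch. 10) is CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise. For example, counting the word node as part of the path, CR(author/"Bill", book/author/firstname/"Bill") = (1 + 2) / (1 + 4) = 0.6, which is the value used in the SimNoMerge example below.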

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>, e.g. author/"Bill".

Inverted index (dictionary -> postings):
<c1, t> (e.g. author/"Bill") -> <d1, 0.5>, <d4, 0.1>, <d9, 0.2>
<c2, t> (e.g. title/"Bill") -> <d2, 0.25>, <d3, 0.1>, <d12, 0.9>
<c3, t> (e.g. book/author/firstname/"Bill") -> <d3, 0.7>, <d6, 0.8>, <d9, 0.5>

(This example is slightly different from the book. All weights have been normalized.)

CR(c1, c1) = 1.0; CR(c1, c2) = 0.0; CR(c1, c3) = 0.60. OK to ignore Query vs <c2, t> since CR = 0.0.

If wq = 1.0, then sim(q, d9) = (Query vs <c1, t>) + (Query vs <c3, t>) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5, where each product is Context Resemblance × Query Term Weight × Document Term Weight.

CS3245 ndash Information Retrieval

INEX relevance assessments: The relevance-coverage combinations are quantized as follows. This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval; the quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum Σ_{c in A} Q(rel(c)).

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR

Chapter 11: 1. Probabilistic Approach to Retrieval / Basic Probability Theory; 2. Probability Ranking Principle; 3. Binary Independence Model, BestMatch25 (Okapi).

Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): Traditionally used with the PRP.

Assumptions: "Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors; e.g. document d is represented by the vector x = (x1, ..., xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation. "Independence": no association between terms (not true, but works in practice; a naive assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing the RSV. We can rank documents using the log odds ratios for the terms in the query, c_t:

c_t = log [ (p_t / (1 - p_t)) / (u_t / (1 - u_t)) ] = log [ p_t (1 - u_t) / (u_t (1 - p_t)) ]

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 - u_t).

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d = Σ_{t in q∩d} c_t. Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in the VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model. If the query is long, we might also use similar weighting for query terms. tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single, fixed query). The above tuning parameters should be set by optimization on a development test collection; experiments have shown reasonable values of k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec 1143

92
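A hedged sketch of BM25 scoring with the long-query weighting described above, i.e. RSV_d = Σ_t log(N/df_t) · ((k1+1)·tf_td) / (k1·((1-b) + b·L_d/L_ave) + tf_td) · ((k3+1)·tf_tq) / (k3 + tf_tq); all inputs below are toy values for illustration:

import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, N, df,
               k1=1.2, b=0.75, k3=1.5):
    score = 0.0
    for t, tf_tq in query_tf.items():
        if t not in doc_tf or t not in df:
            continue
        idf = math.log10(N / df[t])
        tf_td = doc_tf[t]
        doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
        score += idf * doc_part * query_part
    return score

print(bm25_score(query_tf={"brutus": 2, "caesar": 1},
                 doc_tf={"brutus": 1, "caesar": 3, "antony": 5},
                 doc_len=120, avg_doc_len=100, N=100000,
                 df={"brutus": 300, "caesar": 500, "antony": 200}))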

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. The difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q) = P(q|d) P(d) / P(q). P(q) is the same for all documents, so we can ignore it. P(d) is the prior, often treated as the same for all d; but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4). P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model: Summary. What we model: the user has a document in mind and generates the query from this document. The equation represents the probability that the document that the user had in mind was, in fact, this one.

95

Sec 1222
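A sketch of query-likelihood scoring with the mixture model, P(q|d) ≈ Π_t (λ·P(t|Md) + (1-λ)·P(t|Mc)), mixing the document model with the collection model; the toy documents and λ = 0.5 are purely illustrative:

import math

def lm_mixture_score(query, doc_tokens, collection_tokens, lam=0.5):
    doc_len, coll_len = len(doc_tokens), len(collection_tokens)
    score = 0.0
    for t in query:
        p_doc = doc_tokens.count(t) / doc_len            # P(t | Md)
        p_coll = collection_tokens.count(t) / coll_len   # P(t | Mc)
        score += math.log(lam * p_doc + (1 - lam) * p_coll)   # log to avoid underflow
    return score

d1 = "click go the shears boys click click click".split()
d2 = "metal here in click metal shears click here".split()
collection = d1 + d2
q = ["shears", "click"]
print(lm_mixture_score(q, d1, collection))   # higher score: d1 has more clicks
print(lm_mixture_score(q, d2, collection))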

CS3245 ndash Information Retrieval

Week 12: Web Search. Chapter 20: Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling. Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days). Duplicates: need to integrate duplicate detection. Scale: need to use a distributed approach. Spam and spider traps: need to integrate spam detection.

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a.
1. Start with any distribution (say x = (1, 0, ..., 0)).
2. After one step, we're at xA.
3. After two steps we're at xA², then xA³, and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.
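A compact power-iteration sketch of this algorithm; A is assumed to be the teleport-adjusted transition matrix (rows sum to 1), and the 3-page matrix below is a toy example:

import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    x = np.zeros(A.shape[0])
    x[0] = 1.0                               # start with any distribution, say (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A                       # one step of the chain
        if np.abs(x_next - x).sum() < tol:   # the product "looks stable"
            return x_next
        x = x_next
    return x

A = np.array([[0.05, 0.90, 0.05],
              [0.45, 0.10, 0.45],
              [0.05, 0.90, 0.05]])
print(pagerank(A))   # the steady-state vector a; entries sum to 1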

CS3245 ndash Information Retrieval

100

Pagerank summary. Pre-processing: given the graph of links, build the matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1. Query processing: retrieve pages meeting the query and rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine. In addition, you have picked up skills that are useful for your future: Python, one of the easiest and most straightforward programming languages to use; and NLTK, a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 ndash Information Retrieval

103 Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 37: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Exam Format

37

Date 27 Apr (1pm)

Venue SR3 (COM1-02-12)

Open book

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 38: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Exam Topics Anything covered in lecture (through slides) and

corresponding sections in textbook

For sample questions refer to tutorials midterm and past year exam papers

If in doubt ask on the forum

38

CS3245 ndash Information Retrieval

Exam Topic Distribution Focus on topics covered from Weeks 5 to 12

Q1 is truefalse questions on various topics (20 marks)

Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator

39

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction: Isolated word correction. Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:

1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. ngram overlap

Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time and suggest the one with most hits
54
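A compact dynamic-programming sketch of Levenshtein edit distance (unit costs; weighted edit distance only changes the cost terms):

def edit_distance(a, b):
    # dp[i][j] = minimum number of edits to turn a[:i] into b[:j]
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[m][n]

print(edit_distance("judgment", "judgement"))   # 1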

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

[Diagram: MapReduce data flow. The Master assigns splits of the collection to Parsers (map phase); parsers write segment files partitioned by term range (a-b, c-d, …, y-z); each Inverter (reduce phase) collects one term-range partition and writes its postings.]

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain "big" main index. New docs go into "small" (in-memory) auxiliary index. Search across both, merge results. Deletions: Invalidation bit-vector for deleted docs. Filter docs output on a search result by this invalidation bit-vector. Periodically re-index into one main index.

Assuming T = total # of postings and n = size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively we can apply logarithmic merge, in which each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge. Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk. If Z0 gets too big (> n), write it to disk as I0, or merge with I0 (if I0 already exists) as Z1.

Either write Z1 to disk as I1 (if there is no I1), or merge with I1 to form Z2

… etc.

Sec 45

60Information Retrieval

(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6 Index Compression: Collection and vocabulary statistics: Heaps' and Zipf's laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking. Store pointers to every kth term string. Example below: k = 4

Need to store term lengths (1 extra byte)

62

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers

Lose 4 bytes on term lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix, so store differences only (used for the last k-1 terms in a block of k):
8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a1◊e2◊ic3◊ion

Encodes automat*; extra length beyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry. We store the list of docs containing a term in increasing order of docID:
computer: 33, 47, 154, 159, 202, …
Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …
Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:  824, 829, 215406
gaps:    824, 5, 214577   (the first entry is just the first docID)
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable

For a small gap (5), VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
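A sketch of VB encoding and decoding gaps, using the byte layout shown above (7 payload bits per byte, continuation bit set on the last byte of each number):

def vb_encode_number(n):
    # Split n into 7-bit chunks; mark the final byte by adding 128
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128
    return chunks

def vb_decode(stream):
    numbers, n = [], 0
    for b in stream:
        if b < 128:
            n = 128 * n + b            # continuation byte
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
code = [b for g in gaps for b in vb_encode_number(g)]
print([format(b, "08b") for b in code])   # the six bytes shown on the slide
print(vb_decode(code))                    # [824, 5, 214577]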

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g., K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection
67

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)
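A direct transcription of the weighting formula into code (base-10 logs, as above; a small sketch, not a library call):

import math

def tf_idf_weight(tf, df, N):
    # w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

print(tf_idf_weight(tf=3, df=100, N=1_000_000))   # roughly (1 + 0.477) * 4 = 5.9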

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d, for length-normalized q and d
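A small sketch showing that, after length-normalization, cosine similarity reduces to a plain dot product:

import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine(q, d):
    # For length-normalized vectors, cos(q, d) = q . d
    return sum(qi * di for qi, di in zip(normalize(q), normalize(d)))

print(cosine([1.0, 2.0, 0.0], [2.0, 1.0, 1.0]))   # roughly 0.73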

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search Systems: Making the Vector Space Model more efficient to compute
Approximating the actual correct results
Skipping unnecessary documents
In actual data: dealing with zones and fields, query term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A

Think of A as pruning non-contenders

The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q,d) = g(d) + cosine(q,d)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: precision-recall curve, with Precision on the y-axis and Recall on the x-axis, both from 0.0 to 1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: Agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement

Kappa (K) = [P(A) - P(E)] / [1 - P(E)]
P(A): proportion of time judges agree
P(E): what agreement would be by chance

Gives 0 for chance agreement, 1 for total agreement
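A small sketch of the kappa computation from a 2x2 table of two judges' relevant / non-relevant calls (the counts below are illustrative):

def kappa(both_rel, both_nonrel, only_a_rel, only_b_rel):
    total = both_rel + both_nonrel + only_a_rel + only_b_rel
    p_agree = (both_rel + both_nonrel) / total
    # Chance agreement P(E) from each judge's marginal probability of saying "relevant"
    p_rel_a = (both_rel + only_a_rel) / total
    p_rel_b = (both_rel + only_b_rel) / total
    p_chance = p_rel_a * p_rel_b + (1 - p_rel_a) * (1 - p_rel_b)
    return (p_agree - p_chance) / (1 - p_chance)

print(kappa(both_rel=300, both_nonrel=70, only_a_rel=20, only_b_rel=10))   # roughly 0.78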

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness

Probably the evaluation methodology that large search engines trust most
80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3 XML IR: Basic XML concepts, Challenges in XML IR, Vector space model for XML IR, Evaluation of XML IR

Chapter 9 Query Refinement:
1. Relevance Feedback (Document Level)
   Explicit RF: Rocchio (1971); When does it work?
   Variants: Implicit and Blind
2. Query Expansion (Term Level)
   Manual thesaurus
   Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

Dr = set of known relevant doc vectors
Dnr = set of known irrelevant doc vectors
(Different from Cr and Cnr, as we only get judgments from a few documents)
α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
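For reference, the Rocchio update the slide refers to (standard form, as in IIR 9.1.1):

q_m = α · q_0 + β · (1/|Dr|) · Σ_{dj ∈ Dr} dj  -  γ · (1/|Dnr|) · Σ_{dj ∈ Dnr} dj

i.e., move the query vector toward the centroid of the known relevant documents and away from the centroid of the known non-relevant ones.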

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: Simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix (rows t_i, columns d_j) and w_ij = (normalized) weight for (t_i, d_j)

For each t_i, pick terms with high values in C

Sec 923

In NLTK! Did you forget?
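A tiny numpy sketch of the term-term similarity matrix C = A·Aᵀ (toy matrix; normalization omitted):

import numpy as np

# Rows = terms t_i, columns = documents d_j (an M x N term-document matrix A)
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

C = A @ A.T   # C[i, j] = co-occurrence similarity of term i and term j
for i in range(C.shape[0]):
    # Highest off-diagonal entries in row i suggest thesaurus candidates for term i
    print(i, np.argsort(-C[i]))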

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

[Figure: an XML document with root Book and children Title ("Microsoft") and Author ("Bill Gates"), mapped to its lexicalized subtrees "with words", e.g. Author/"Bill", Author/"Gates", Title/"Microsoft", Book/Title/"Microsoft"]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes
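For reference, the context resemblance function as defined in IIR 10.3 (consistent with the CR values in the example below, e.g. (1+2)/(1+4) = 0.6):

CR(cq, cd) = (1 + |cq|) / (1 + |cd|)   if cq matches cd,   and 0 otherwise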

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

query: <c1, t>   e.g. author/"Bill"

Inverted index (dictionary → postings):
<c1, t>  (e.g. author/"Bill")                 → <d1,0.5> <d4,0.1> <d9,0.2>
<c2, t>  (e.g. title/"Bill")                  → <d2,0.25> <d3,0.1> <d12,0.09>
<c3, t>  (e.g. book/author/firstname/"Bill")  → <d3,0.7> <d6,0.8> <d9,0.5>

This example is slightly different from the book

CR(c1,c1) = 1.0
CR(c1,c2) = 0.0
CR(c1,c3) = 0.60

if wq = 1.0 then sim(q,d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(the factors are Context Resemblance × Query Term Weight × Document Term Weight)

All weights have been normalized

OK to ignore Query vs <c2,t> since CR = 0.0
Query vs <c1,t>; Query vs <c3,t>

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval: Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions:
Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice; a naïve assumption)

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios c_t for the terms in the query

The odds ratio is the ratio of two odds:
(i) the odds of the term appearing if relevant, p_t / (1 - p_t), and
(ii) the odds of the term appearing if non-relevant, u_t / (1 - u_t)

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents

c_t functions as a term weight, so that

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM
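For reference, the standard BIM term weight the slide refers to (IIR 11.3.1):

c_t = log [ p_t (1 - u_t) / ( u_t (1 - p_t) ) ] = log( p_t / (1 - p_t) ) + log( (1 - u_t) / u_t )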

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tf_{t,q}: term frequency of term t in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 between 1.2 and 2, and b = 0.75

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1 0 … 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
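A minimal power-iteration sketch (toy row-stochastic matrix; teleportation is assumed to be already folded into A):

import numpy as np

A = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.3, 0.4]])   # toy transition matrix, rows sum to 1

x = np.array([1.0, 0.0, 0.0])     # start with any distribution
for _ in range(100):              # multiply by increasing powers of A
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-12):
        break
    x = x_next
print(x)                          # steady state a, i.e. the PageRank vector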

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future:
Python: one of the easiest and more straightforward programming languages to use
NLTK: a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105




Page 40: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

40

REVISION

CS3245 ndash Information Retrieval

Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words

Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict

the nth token Larger n-gram models more accurate but each increase in

order requires exponentially more space

LMs can be used to do information retrieval as introduced in Week 11

41

CS3245 ndash Information Retrieval

Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including

those that are not seen

eg assuming |V| = 11

Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith

42

I 2 eyes 2

donrsquot 2 your 1

want 2 love 1

to 2 and 1

close 2 revenge 1

my 2 Total Count 18

I 3 eyes 1

donrsquot 1 your 3

want 3 love 2

to 1 and 2

close 1 revenge 2

my 1 Total Count 20

Total of word types in both

lines

CS3245 ndash Information Retrieval

Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings

Key characteristic Sorted order for postings

Boolean query processing Intersection by linear time ldquomergingrdquo

Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix – store differences only. Used for the last k-1 terms in a block of k: 8automata8automate9automatic10automation

Information Retrieval 63

→ 8automat*a1◊e2◊ic3◊ion

Encodes "automat"; extra length beyond "automat"

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry. We store the list of docs containing a term in increasing order of docID: computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs: 824, 829, 215406
gaps: –, 5, 214577
VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
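
A small encode/decode sketch reproducing the example above (a minimal illustration, not the assignment's reference code):

```python
def vb_encode_number(n):
    """Encode one gap as a list of bytes; the last byte has its high bit set."""
    bytes_out = []
    while True:
        bytes_out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_out[-1] += 128          # mark the final byte of this gap
    return bytes_out

def vb_decode(byte_stream):
    """Decode a stream of VB bytes back into the list of gaps."""
    numbers, n = [], 0
    for b in byte_stream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
encoded = [b for g in gaps for b in vb_encode_number(g)]
print([format(b, "08b") for b in encoded])
# ['00000110', '10111000', '10000101', '00001101', '00001100', '10110001']
print(vb_decode(encoded))   # [824, 5, 214577]
```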

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection

67

w_t,d = (1 + log10 tf_t,d) × log10(N / df_t)

Sec 622
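
The same weighting as a one-line function (a sketch; log base 10, N = collection size, df = document frequency):

```python
import math

def tfidf_weight(tf, df, N):
    """w_{t,d} = (1 + log10 tf) * log10(N / df); zero if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

print(round(tfidf_weight(tf=3, df=100, N=1_000_000), 3))   # (1 + log10 3) * 4 = 5.908...
```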

CS3245 ndash Information Retrieval

Queries as vectors. Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents". Key idea 2: Rank documents according to their proximity to the query in this space

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): cos(q, d) = q · d, for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search Systems: Making the Vector Space Model more efficient to compute; Approximating the actual correct results; Skipping unnecessary documents. In actual data: dealing with zones and fields, query term proximity

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.

Think of A as pruning non-contenders

The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics: Index Elimination (high-idf query terms only; docs containing many query terms)

Champion lists (generalized to high-low lists and tiered index)

Impact ordering of postings (early termination, idf-ordered terms)

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting

Indeed, any function of the two "signals" of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: Year = 1601 is an example of a field. Field or parametric index: postings for each field value. Sometimes build range (B-tree) trees (e.g. for dates)

Field query typically treated as conjunction (doc must be authored by shakespeare)

Zone: A zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References, …

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation: How do we know if our results are any good? Benchmarks; Precision and Recall; Composite measures; Test collection

A/B Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: precision (y-axis, 0.0 to 1.0) plotted against recall (x-axis, 0.0 to 1.0)]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: Agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement.

Kappa (K) = [P(A) - P(E)] / [1 - P(E)], where P(A) = proportion of time judges agree and P(E) = what agreement would be by chance

Gives 0 for chance agreement, 1 for total agreement
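
A quick sketch of that computation (the 2x2 judgment counts below are invented):

```python
def kappa(yes_yes, yes_no, no_yes, no_no):
    """Kappa from a 2x2 table of two judges' relevant/non-relevant counts."""
    total = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / total                          # P(A)
    p1_yes = (yes_yes + yes_no) / total                          # judge 1 marginal
    p2_yes = (yes_yes + no_yes) / total                          # judge 2 marginal
    p_chance = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)     # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

print(round(kappa(300, 20, 10, 70), 3))   # 0.776: substantial agreement beyond chance
```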

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness. Probably the evaluation methodology that large search engines trust most.

80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3 XML IR: Basic XML concepts; Challenges in XML IR; Vector space model for XML IR; Evaluation of XML IR

Chapter 9 Query Refinement: 1. Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2. Query Expansion: Term Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.

α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
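
The Rocchio formula itself is an image on the slide and is missing here; a sketch of the standard form it refers to, q_m = α q_0 + β · centroid(D_r) - γ · centroid(D_nr) (the vectors and weights below are invented):

```python
import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of relevant docs, away from irrelevant ones."""
    q = alpha * q0
    if len(rel_docs):
        q = q + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q = q - gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q, 0.0)          # negative term weights are usually clipped to 0

q0  = np.array([1.0, 0.0, 0.5])
dr  = np.array([[0.8, 0.2, 0.0], [0.6, 0.4, 0.0]])   # known relevant doc vectors
dnr = np.array([[0.0, 0.0, 0.9]])                    # known irrelevant doc vectors
print(rocchio(q0, dr, dnr))            # [1.525 0.225 0.365]
```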

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents

In query expansion, users give additional input (good/bad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus. Simplest way to compute one is based on term-term similarities in C = AA^T, where A is the term-document matrix; w_ij = (normalized) weight for (t_i, d_j)

For each ti pick terms with high values in C

[Figure: A is an M x N matrix with one row per term t_i and one column per document d_j]

Sec 923

In NLTK Did you forget

84
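
A small numpy sketch of picking neighbours from C = A·A^T (the terms and weights below are invented):

```python
import numpy as np

terms = ["car", "auto", "insurance", "flower"]
# A: one row per term, one column per document (toy, roughly normalized weights)
A = np.array([
    [0.9, 0.8, 0.0],
    [0.7, 0.9, 0.1],
    [0.1, 0.6, 0.0],
    [0.0, 0.0, 1.0],
])
C = A @ A.T                            # term-term similarity matrix

for i, t in enumerate(terms):
    ranked = [terms[j] for j in np.argsort(-C[i]) if j != i]
    print(t, "->", ranked[0])          # most similar other term, e.g. car -> auto
```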

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees. Aim to have each dimension of the vector space encode a word together with its position within the XML tree.

How? Map XML documents to lexicalized subtrees

85

[Figure: an XML document (Book containing Title "Microsoft" and Author "Bill Gates") is decomposed into lexicalized subtrees, e.g. Title with "Microsoft", Author with "Bill", Author with "Gates", Author with "Bill Gates", and Book/Title with "Microsoft" – i.e. structure "with words".]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR.

|cq| and |cd| are the number of nodes in the query path and document path, respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86
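
The CR function itself is an image on the slide and did not survive extraction; as defined in IIR Section 10.3 it is (stated here for completeness):

CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise.

For example, with query path author/"Bill" (2 nodes, counting the word) and document path book/author/firstname/"Bill" (4 nodes), CR = (1 + 2) / (1 + 4) = 0.6, the value used in the SimNoMerge example on the next slide.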

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query path: <c1, t>. Inverted index: dictionary entries <c1, t>, <c2, t>, <c3, t> with weighted postings such as <d1, 0.5>; <d4, 0.1>, <d9, 0.2>; <d2, 0.25>, <d3, 0.1>, <d12, 0.9>; <d3, 0.7>, <d6, 0.8>, <d9, 0.5>

This example is slightly different from the book.

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.60

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5

All weights have been normalized.

Example contexts: c1 = author/"Bill", c2 = title/"Bill", c3 = book/author/firstname/"Bill". The three factors in each product are Context Resemblance, Query Term Weight, and Document Term Weight.

OK to ignore Query vs <c2, t> since CR = 0.0; compute Query vs <c1, t> and Query vs <c3, t>.

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IR. Chapter 11: 1. Probabilistic Approach to Retrieval, Basic Probability Theory; 2. Probability Ranking Principle; 3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions: "Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g. document d represented by vector x = (x1, ..., xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.

Independence: no association between terms (not true, but works in practice – naïve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios for the terms in the query, c_t.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 - u_t).

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that operationally we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
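
The weight formula itself appears as an image on the slide; from IIR Section 11.3.1 the term weight and retrieval status value it refers to are

c_t = log [ p_t (1 - u_t) / ( u_t (1 - p_t) ) ]   and   RSV_d = sum of c_t over the query terms t present in d,

so a document's score is just the sum of the c_t weights of the query terms it contains.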

CS3245 ndash Information Retrieval

Okapi BM25, A Nonbinary Model: If the query is long, we might also use similar weighting for query terms

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single, fixed query). The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
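
A sketch of the full BM25 term contribution the slide describes (parameter names follow IIR Section 11.4.3; the toy numbers are invented):

```python
import math

def bm25_term(tf_td, tf_tq, df_t, N, doc_len, avg_doc_len, k1=1.5, b=0.75, k3=1.5):
    """Contribution of one term t to BM25(q, d), including the query-side factor."""
    idf = math.log10(N / df_t)
    doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
    query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)   # note: no length normalization of queries
    return idf * doc_part * query_part

# toy numbers: term occurs 3x in a slightly longer-than-average doc, once in the query
print(round(bm25_term(tf_td=3, tf_tq=1, df_t=1000, N=100000, doc_len=120, avg_doc_len=100), 3))
```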

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. Difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q). By Bayes' rule, P(d|q) = P(q|d) P(d) / P(q):

P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4)

P(q|d) is the probability of q given d. So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
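
The equation is an image on the slide; a sketch of the standard query-likelihood mixture it refers to (linear interpolation of the document model with the collection model; the counts below are invented):

```python
def mixture_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """P(q|d) = product over query terms of  lam*P(t|Md) + (1-lam)*P(t|Mc)."""
    score = 1.0
    for t in query_terms:
        p_doc  = doc_tf.get(t, 0) / doc_len     # document language model
        p_coll = coll_tf.get(t, 0) / coll_len   # collection language model (smoothing)
        score *= lam * p_doc + (1 - lam) * p_coll
    return score

doc_tf  = {"click": 2, "shears": 1}                      # made-up counts
coll_tf = {"click": 10, "shears": 2, "the": 500}
print(mixture_score(["click", "shears"], doc_tf, doc_len=10, coll_tf=coll_tf, coll_len=1000))
```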

CS3245 ndash Information Retrieval

Week 12 Web Search: Chapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling: Politeness – need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)

Duplicates – need to integrate duplicate detection

Scale – need to use a distributed approach

Spam and spider traps – need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202
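
Python's standard library can parse such files; a minimal usage sketch (the crawler name and URLs are placeholders, not from the slides):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()    # fetches and parses the file; when crawling, cache this object per site

print(rp.can_fetch("MyCrawler", "https://www.example.com/yoursite/temp/page.html"))
print(rp.can_fetch("searchengine", "https://www.example.com/anything.html"))
```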

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0))
2. After one step, we're at xA
3. After two steps at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm multiply x by increasing powers of A until the product looks stable
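
A numpy sketch of that power iteration (the 3-page transition matrix below, with teleporting already folded in, is just an example):

```python
import numpy as np

# Row-stochastic transition matrix A for 3 pages (links plus teleporting already folded in)
A = np.array([
    [0.05, 0.90, 0.05],
    [0.45, 0.10, 0.45],
    [0.05, 0.90, 0.05],
])

x = np.array([1.0, 0.0, 0.0])          # start with any distribution
for _ in range(100):                   # multiply by increasing powers of A
    x_next = x @ A
    if np.abs(x_next - x).sum() < 1e-10:   # stop when the product looks stable
        break
    x = x_next

print(np.round(x, 4))                  # steady state a, here [0.25 0.5 0.25]: the pagerank scores
```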

CS3245 ndash Information Retrieval

100

Pagerank summary. Pre-processing: Given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.

Query processing: Retrieve pages meeting the query; rank them by their pagerank. Order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future: Python – one of the easiest and more straightforward programming languages to use; NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103

Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105




eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 43: Search Advertising, Duplicate Detection and Revision (zhaojin/cs3245_2019/w13.pdf)

CS3245 ndash Information Retrieval

Week 2: Basic (Boolean) IR
 Basic inverted indexes: in-memory dictionary and on-disk postings
 Key characteristic: sorted order for postings
 Boolean query processing: intersection by linear-time "merging"
 Simple optimizations by expected size

43

Ch 1

CS3245 ndash Information Retrieval

Indexer steps: Dictionary & Postings
 Multiple term entries in a single document are merged
 Split into Dictionary and Postings
 Doc frequency information is also stored

Why frequency?

44
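To make these indexer steps concrete, here is a minimal sketch (illustrative only, not the course's reference implementation) that merges duplicate term entries per document and splits the result into a dictionary with document frequencies plus docID-sorted postings lists:

def build_index(docs):
    """docs: dict mapping docID -> list of tokens. Returns (dictionary, postings)."""
    postings = {}                                  # term -> sorted list of docIDs
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id])):     # merge multiple entries per doc
            postings.setdefault(term, []).append(doc_id)
    # the dictionary stores the document frequency alongside each term
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, postings

# Example:
# d, p = build_index({1: ["to", "be", "or", "not", "to", "be"], 2: ["to", "do"]})
# d == {"be": 1, "do": 1, "not": 1, "or": 1, "to": 2}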

CS3245 ndash Information Retrieval

The merge: Walk through the two postings lists simultaneously, in time linear in the total number of postings entries

Brutus:  2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
Caesar:  1 -> 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34
Result:  2 -> 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID

Sec 13

45
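A minimal Python sketch of this linear-time intersection of two docID-sorted postings lists (names are illustrative):

def intersect(p1, p2):
    """Intersect two postings lists sorted by docID, in O(x+y) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]) -> [2, 8]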

CS3245 ndash Information Retrieval

Week 3: Postings lists and Choosing terms
 Faster merging of posting lists: skip pointers
 Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries
 Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?

47

Postings: 2 -> 4 -> 8 -> 41 -> 48 -> 64 -> 128, with skip pointers to 41 and 128
Postings: 1 -> 2 -> 3 -> 8 -> 11 -> 17 -> 21 -> 31, with skip pointers to 11 and 31

Sec 23

CS3245 ndash Information Retrieval

Biword indexes
 Index every consecutive pair of terms in the text as a phrase: bigram model using words
 For example, the text "Friends Romans Countrymen" would generate the biwords "friends romans" and "romans countrymen"
 Each of these biwords is now a dictionary term
 Two-word phrase query-processing is now immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries, we use a merge algorithm recursively, at the document level
Now need to deal with more than just equality

49

<be: 993427; 1: <7, 18, 33, 72, 86, 231>; 2: <3, 149>; 4: <17, 191, 291, 430, 434>; 5: <363, 367, ...>>

Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary: Hash, Trees

Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic – Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash Table: Each vocabulary term is hashed to an integer
 Pros: Lookup is faster than for a tree: O(1)
 Cons:
  No easy way to find minor variants: judgment/judgement
  No prefix search
  If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries
 How can we handle *'s in the middle of a query term, e.g. co*tion?
 We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
 The solution: transform wild-card queries so that the * always occurs at the end
 This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering

53

Sec 32
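A small illustrative sketch of the permuterm idea under the usual assumptions (a '$' end marker, a single * per query): index every rotation of term$, and rotate the query so the wildcard ends up at the end.

def permuterm_rotations(term):
    """All rotations of term + '$'; each rotation points back to the original term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def wildcard_to_prefix(query):
    """Rotate a single-* wildcard query so the * occurs at the end.
    'co*tion' -> 'tion$co', then do a prefix search for 'tion$co*'."""
    before, after = query.split("*")
    return after + "$" + before

# permuterm_rotations("hello") -> ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
# wildcard_to_prefix("co*tion") == "tion$co", which matches the rotation 'tion$coop' of "cooption"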

CS3245 ndash Information Retrieval

Spelling correction
 Isolated word correction: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
 How do we define "closest"? We studied several alternatives:
 1. Edit distance (Levenshtein distance)
 2. Weighted edit distance
 3. ngram overlap
 Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits

54

Sec 332
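A compact dynamic-programming sketch of unweighted Levenshtein edit distance, as revised above (variable names are my own):

def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

# edit_distance("dof", "dog") == 1; edit_distance("cat", "catapult") == 5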

CS3245 ndash Information Retrieval

Week 5: Index construction
 Sort-based indexing: Blocked Sort-Based Indexing
  Merge sort is effective for disk-based sorting (avoid seeks)
 Single-Pass In-Memory Indexing
  No global dictionary – generate a separate dictionary for each block
  Don't sort postings – accumulate postings as they occur
 Distributed indexing using MapReduce
 Dynamic indexing: multiple indices, logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI: Single-pass in-memory indexing

Key idea 1: Generate separate dictionaries for each block – no need to maintain term-termID mapping across blocks
Key idea 2: Don't sort. Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing: MapReduce Data flow

58

[Figure: segment files are assigned by a master to parsers, which emit (term, docID) pairs partitioned into key ranges a-b, c-d, ..., y-z (map phase); inverters then collect each partition and write out the postings (reduce phase).]

CS3245 ndash Information Retrieval

Dynamic Indexing: 2nd simplest approach
 Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index
 Search across both, merge results
 Deletions: invalidation bit-vector for deleted docs; filter docs output on a search result by this invalidation bit-vector
 Periodically re-index into one main index
 Assuming T = total number of postings and n = size of the auxiliary index, we touch each posting up to floor(T/n) times
 Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge
 Idea: maintain a series of indexes, each twice as large as the previous one
 Keep the smallest (Z0) in memory; larger ones (I0, I1, ...) on disk
 If Z0 gets too big (> n), write to disk as I0, or merge with I0 (if I0 already exists) as Z1
 Either write Z1 to disk as I1 (if no I1), or merge with I1 to form Z2
 ... etc.

Sec 45

60
Information Retrieval
(Loop for log levels)

CS3245 ndash Information Retrieval

Week 6: Index Compression
 Collection and vocabulary statistics: Heaps' and Zipf's laws
 Compression to make the index smaller and faster
  Dictionary compression for Boolean indexes: dictionary string, blocks, front coding
  Postings compression: gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding
 Sorted words commonly have a long common prefix – store differences only
 Used for the last k-1 terms in a block of k:

 8automata8automate9automatic10automation

Information Retrieval 63

 -> 8automat*a 1◊e 2◊ic 3◊ion

 ("automat*" encodes the shared prefix automat; the number before ◊ is the extra length beyond automat)

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
 We store the list of docs containing a term in increasing order of docID:
  computer: 33, 47, 154, 159, 202 ...
 Consequence: it suffices to store gaps: 33, 14, 107, 5, 43 ...
 Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example

docIDs:  824                 829       215406
gaps:    (824)               5         214577
VB code: 00000110 10111000   10000101  00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001
Key property: VB-encoded postings are uniquely prefix-decodable
For a small gap (5), VB uses a whole byte

Sec 53

512+256+32+16+8 = 824
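A minimal sketch of VB encoding/decoding of gaps, matching the example above (the continuation bit is set on the last byte of each gap):

def vb_encode_number(n):
    """Encode one gap as a list of byte values."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                 # set the continuation bit on the last byte
    return out

def vb_decode(byte_stream):
    """Decode a concatenated VB byte stream back into gaps."""
    numbers, n = [], 0
    for b in byte_stream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

# vb_encode_number(824) -> [6, 184]   (00000110 10111000)
# vb_encode_number(5)   -> [133]      (10000101)
# vb_decode(vb_encode_number(824) + vb_encode_number(5) + vb_encode_number(214577)) == [824, 5, 214577]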

CS3245 ndash Information Retrieval

Week 7: Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}(N / \mathrm{df}_t)

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf
Increases with the number of occurrences within a document
Increases with the rarity of the term in the collection

67

Sec 622
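A one-function sketch of the tf-idf weight in the formula above (log base 10; names are illustrative):

import math

def tf_idf_weight(tf_td, df_t, N):
    """w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf_td == 0 or df_t == 0:
        return 0.0
    return (1 + math.log10(tf_td)) * math.log10(N / df_t)

# e.g. tf = 3 occurrences, df = 10 docs, N = 1,000,000 docs:
# tf_idf_weight(3, 10, 1_000_000) ~= 1.477 * 5 ~= 7.39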

CS3245 ndash Information Retrieval

Queries as vectors
 Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents"
 Key idea 2: Rank documents according to their proximity to the query in this space
  proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity
 Motivation: Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_i q_i d_i,   for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
 Making the Vector Space Model more efficient to compute
  Approximating the actual correct results
  Skipping unnecessary documents
 In actual data: dealing with zones and fields, query term proximity

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
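The slide above refers to the term-at-a-time cosine scoring algorithm; below is a hedged Python sketch, under the assumption that postings store (docID, tf-idf weight) pairs and that document vector lengths are precomputed:

import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_lengths, K=10):
    """query_weights: term -> w_{t,q}; postings: term -> [(docID, w_{t,d}), ...];
    doc_lengths: docID -> vector length used for normalization."""
    scores = defaultdict(float)
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_tq * w_td           # accumulate dot-product contributions
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]       # length-normalize
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])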

CS3245 ndash Information Retrieval

Generic approach
 Find a set A of contenders, with K < |A| << N
 A does not necessarily contain the top K, but has many docs from among the top K
 Return the top K docs in A
 Think of A as pruning non-contenders
 The same approach can also be used for other (non-cosine) scoring functions

73

Sec 711

[Figure: the collection of N docs, the contender set A, and the top K shown as nested sets]

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score
 Consider a simple total score combining cosine relevance and authority:

 net-score(q, d) = g(d) + cosine(q, d)

 Can use some other linear combination than an equal weighting
 Indeed, any function of the two "signals" of user happiness
 Now we seek the top K docs by net score

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve, with precision on the y-axis and recall on the x-axis, both ranging from 0.0 to 1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure
 Agreement measure among judges
 Designed for categorical judgments
 Corrects for chance agreement

 Kappa (κ) = [P(A) − P(E)] / [1 − P(E)]
  P(A) – proportion of the time the judges agree
  P(E) – what agreement would be by chance

 Gives 0 for chance agreement, 1 for total agreement
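A small worked example with made-up counts (not from the lecture):

def kappa(p_agree, p_chance):
    """Kappa = [P(A) - P(E)] / [1 - P(E)]."""
    return (p_agree - p_chance) / (1 - p_chance)

# Suppose two judges assess 400 query-doc pairs and agree on 370 of them,
# so P(A) = 370/400 = 0.925. Judge 1 says "relevant" 80% of the time and
# judge 2 says 90%, so P(E) = 0.8*0.9 + 0.2*0.1 = 0.74.
# kappa(0.925, 0.74) ~= 0.71 (good agreement, well above chance)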

CS3245 ndash Information Retrieval

A/B testing
 Purpose: Test a single innovation
 Prerequisite: You have a large system with many users
 Have most users use the old system
 Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation
 Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result
 Now we can directly see if the innovation does improve user happiness
 Probably the evaluation methodology that large search engines trust most

80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors
Dnr = set of known irrelevant doc vectors
 Different from Cr and Cnr, as we only get judgments from a few documents
α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
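The formula box on this slide did not survive extraction; for reference, the standard Rocchio update (as used in the SMART system) is

\vec{q}_m = \alpha\,\vec{q}_0 \;+\; \beta\,\frac{1}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j \;-\; \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j

with negative components of the resulting query vector typically clipped to 0.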

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
 Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the term-document matrix
 w_ij = (normalized) weight for (t_i, d_j)
 For each t_i, pick terms with high values in C

[Figure: the M x N term-document matrix A, with rows t_i and columns d_j]

In NLTK! Did you forget?

84
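A tiny sketch (assuming a dense NumPy term-document matrix A whose rows are already normalized) of picking high-value neighbours from C = AAᵀ:

import numpy as np

def co_occurrence_neighbours(A, terms, k=3):
    """A: M x N term-document matrix (rows = terms). Returns top-k similar terms per term."""
    C = A @ A.T                               # term-term similarity matrix
    np.fill_diagonal(C, -np.inf)              # ignore self-similarity
    neighbours = {}
    for i, term in enumerate(terms):
        top = np.argsort(C[i])[::-1][:k]      # indices of the k highest values in row i
        neighbours[term] = [terms[j] for j in top]
    return neighbours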

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
 Aim to have each dimension of the vector space encode a word together with its position within the XML tree
 How? Map XML documents to lexicalized subtrees

85

[Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is decomposed into lexicalized subtrees ("with words"), e.g. Author->Bill, Author->Gates, Title->Microsoft, Book->Title->Microsoft, Book->Author->Bill Gates, ...]

CS3245 ndash Information Retrieval

Context resemblance
 A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:
 |cq| and |cd| are the number of nodes in the query path and document path, respectively
 cq matches cd iff we can transform cq into cd by inserting additional nodes

86
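The CR formula itself was lost in extraction; as defined in IIR Chapter 10 it is

CR(c_q, c_d) = \begin{cases} \dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\[4pt] 0 & \text{otherwise} \end{cases}

For example, CR(author/"Bill", book/author/firstname/"Bill") = (1+2)/(1+4) = 0.6, which matches the value used in the SimNoMerge example on the next slide.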

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t> with query term weight wq, e.g. c1 = author/"Bill"

Inverted index (dictionary -> postings):
 <c1, t> (e.g. author/"Bill"):                 <d1, 0.5>  <d4, 0.1>  <d9, 0.2>
 <c2, t> (e.g. title/"Bill"):                  <d2, 0.25> <d3, 0.1>  <d12, 0.09>
 <c3, t> (e.g. book/author/firstname/"Bill"):  <d3, 0.7>  <d6, 0.8>  <d9, 0.5>

This example is slightly different from the book.

Context resemblance: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60
All weights have been normalized.

Each contribution is Context Resemblance x Query Term Weight x Document Term Weight:
 Query vs <c1, t>: 1.0 x 1.0 x 0.2; Query vs <c3, t>: 0.6 x 1.0 x 0.5
 If wq = 1.0, then sim(q, d9) = (1.0 x 1.0 x 0.2) + (0.6 x 1.0 x 0.5) = 0.5
 OK to ignore Query vs <c2, t> since CR = 0.0

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV
 We can rank documents using the log odds ratios for the terms in the query, c_t
 The odds ratio is the ratio of two odds:
  (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and
  (ii) the odds of the term appearing if nonrelevant, u_t / (1 − u_t)
 c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents
 c_t functions as a term weight (the formula is restated after this slide)
 Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
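The c_t definition, which appeared as an image on the slide, is the standard BIM log odds ratio; restating it here:

c_t = \log \frac{p_t\,(1 - u_t)}{u_t\,(1 - p_t)}, \qquad RSV_d = \sum_{t \in q \cap d} c_t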

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
 If the query is long, we might also use similar weighting for query terms
  tf_tq: term frequency in the query q
  k3: tuning parameter controlling term frequency scaling of the query
  No length normalization of queries (because retrieval is being done with respect to a single, fixed query)
 The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
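For reference (the slide's equation image was lost), the full Okapi BM25 scoring function with query-term weighting, as given in IIR 11.4.3, is

RSV_d = \sum_{t \in q} \log\!\left[\frac{N}{\mathrm{df}_t}\right] \cdot \frac{(k_1 + 1)\,\mathrm{tf}_{td}}{k_1\bigl((1-b) + b\,(L_d / L_{ave})\bigr) + \mathrm{tf}_{td}} \cdot \frac{(k_3 + 1)\,\mathrm{tf}_{tq}}{k_3 + \mathrm{tf}_{tq}}

where L_d is the length of document d and L_ave is the average document length in the collection.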

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
 The difference between 'vector space' and 'probabilistic' IR is not that great
 In either case you build an information retrieval scheme in the exact same way
 Difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR
 Each document is treated as (the basis for) a language model
 Given a query q, rank documents based on P(d|q):
  P(q) is the same for all documents, so ignore
  P(d) is the prior – often treated as the same for all d
   But we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4)
  P(q|d) is the probability of q given d
 So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
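A hedged sketch of query-likelihood scoring with a linear mixture of the document and collection language models; the value of lambda and the simple counting scheme are illustrative choices, not the lecture's reference code:

import math

def mixture_score(query_terms, doc_terms, collection_terms, lam=0.5):
    """log P(q|d) with P(t|Md) mixed with the collection model P(t|Mc).
    Assumes every query term occurs at least once in the collection
    (otherwise the mixture probability would be 0 and log would fail)."""
    score = 0.0
    for t in query_terms:
        p_doc = doc_terms.count(t) / len(doc_terms)
        p_col = collection_terms.count(t) / len(collection_terms)
        score += math.log(lam * p_doc + (1 - lam) * p_col)
    return score

# Higher (less negative) scores mean the document's model is more likely
# to have generated the query; rank documents by this score.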

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt
 Protocol for giving crawlers ("robots") limited access to a website, originally from 1994
 Example:

 User-agent: *
 Disallow: /yoursite/temp/

 User-agent: searchengine
 Disallow:

 Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
 1. Start with any distribution (say x = (1, 0, ..., 0))
 2. After one step, we're at xA
 3. After two steps at xA², then xA³, and so on
 4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
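A small NumPy sketch of the power iteration described above; A is assumed to be the row-stochastic transition matrix with teleportation already folded in:

import numpy as np

def pagerank_power_iteration(A, tol=1e-10, max_iter=1000):
    """Multiply x by increasing powers of A until the product looks stable."""
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                       # start with the distribution (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A               # one step of the chain
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

# Example (made-up 2-state transition matrix):
# A = np.array([[0.25, 0.75], [0.5, 0.5]])
# pagerank_power_iteration(A) -> approximately [0.4, 0.6]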

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 44: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Indexer steps Dictionary amp Postings Multiple term

entries in a single document are merged

Split into Dictionary and Postings

Doc frequencyinformation is also stored

Why frequency

Sec 12

44

CS3245 ndash Information Retrieval

The merge Walk through the two postings simultaneously in

time linear in the total number of postings entries

34

1282 4 8 16 32 64

1 2 3 5 8 13 21

128

34

2 4 8 16 32 64

1 2 3 5 8 13 21

BrutusCaesar2 8

If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID

Sec 13

45

CS3245 ndash Information Retrieval

Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers

Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries

Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming

46

Ch 2

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 – Information Retrieval

An Appraisal of Probabilistic Models
 The difference between 'vector space' and 'probabilistic' IR is not that great:
  In either case you build an information retrieval scheme in the exact same way.
  Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec. 11.4.1

93

CS3245 – Information Retrieval

Using language models in IR
 Each document is treated as (the basis for) a language model.
 Given a query q, rank documents based on P(d|q). By Bayes' rule:

  P(d|q) = P(q|d) P(d) / P(q)

 P(q) is the same for all documents, so ignore it.
 P(d) is the prior – often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g., those with high static quality score g(d) (cf. Section 7.14).
 P(q|d) is the probability of q given d.
 So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec. 12.2

CS3245 – Information Retrieval

Mixture model: Summary

 P(q|d) ∝ Π_{t ∈ q} ( λ P(t|M_d) + (1 − λ) P(t|M_c) )

 What we model: the user has a document in mind and generates the query from this document.
 The equation represents the probability that the document the user had in mind was in fact this one.

95

Sec. 12.2.2
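A minimal sketch of query-likelihood ranking with this mixture (Jelinek-Mercer) smoothing; the two-document collection below is made up:

from collections import Counter

docs = {"d1": "information retrieval is fun".split(),
        "d2": "probabilistic models of retrieval".split()}
collection = [t for toks in docs.values() for t in toks]
cf = Counter(collection)

def query_likelihood(query, doc_tokens, lam=0.5):
    tf = Counter(doc_tokens)
    score = 1.0
    for t in query:
        p_md = tf[t] / len(doc_tokens)   # document language model
        p_mc = cf[t] / len(collection)   # collection model (smoothing)
        score *= lam * p_md + (1 - lam) * p_mc
    return score

for d, toks in docs.items():
    print(d, query_likelihood(["retrieval", "models"], toks))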

CS3245 – Information Retrieval

Week 12: Web Search
 Chapter 20: Crawling
 Chapter 21: Link Analysis

96

Ch. 20-21

CS3245 – Information Retrieval

Crawling
 Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days).
 Duplicates: need to integrate duplicate detection.
 Scale: need to use a distributed approach.
 Spam and spider traps: need to integrate spam detection.

Information Retrieval 97

Sec. 20.1

CS3245 – Information Retrieval

Robots.txt
 Protocol for giving crawlers ("robots") limited access to a website, originally from 1994.
 Example:

  User-agent: *
  Disallow: /yoursite/temp/

  User-agent: searchengine
  Disallow:

 Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec. 20.2
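For illustration (not from the slides), Python's standard library can parse and apply robots.txt rules; the URL below is a placeholder site:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # hypothetical site
rp.read()                                          # fetch and parse once, then cache
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))
print(rp.can_fetch("*", "http://www.example.com/index.html"))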

CS3245 – Information Retrieval

99

Pagerank algorithm

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0)).
2. After one step, we're at xA.
3. After two steps, we're at xA², then xA³, and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.

Information Retrieval

Sec. 21.2.2
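A sketch of this power iteration, assuming a toy 3×3 transition matrix A (teleporting already folded in); not the course's reference code:

import numpy as np

A = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.1, 0.5],
              [0.5, 0.4, 0.1]])    # rows sum to 1

x = np.array([1.0, 0.0, 0.0])      # start with any distribution
for _ in range(100):               # multiply by increasing powers of A
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-12):   # "looks stable"
        break
    x = x_next
print(x)                           # the steady-state vector a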

CS3245 ndash Information Retrieval

100

Pagerank summary
 Pre-processing: given a graph of links, build the matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
 Query processing: retrieve pages meeting the query and rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives
 You have learnt how to build your own search engine.

 In addition, you have picked up skills that are useful for your future:
  Python – one of the easiest and most straightforward programming languages to use
  NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103  Photo credits: http://www.flickr.com/photos/aperezdc/

Related areas: Adversarial IR, Boolean IR/VSM, Probabilistic IR, Computational Advertising, Recommendation Systems, Digital Libraries, Natural Language Processing, Question Answering, IR Theory, Distributed IR, Geographic IR, Social Network IR, Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105


Page 45: Search Advertising, Duplicate Detection and Revision (zhaojin/cs3245_2019/w13.pdf)

CS3245 – Information Retrieval

The merge
 Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

  Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
  Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34

 If the list lengths are x and y, the merge takes O(x+y) operations.
 Crucial: postings must be sorted by docID.

Sec. 1.3

45
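A sketch of the linear-time intersection merge described above; plain Python lists stand in for the postings lists:

def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):   # walk both lists simultaneously
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]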

CS3245 – Information Retrieval

Week 3: Postings lists and Choosing terms
 Faster merging of postings lists: skip pointers
 Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries
 Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming

46

Ch. 2

CS3245 – Information Retrieval

Adding skip pointers to postings
 Done at indexing time.
 Why? To skip postings that will not figure in the search results.
 How to do it? And where do we place skip pointers?

  2 → 4 → 8 → 41 → 48 → 64 → 128   (skip pointers: 2 → 41, 41 → 128)
  1 → 2 → 3 → 8 → 11 → 17 → 21 → 31   (skip pointers: 1 → 11, 11 → 31)

47

Sec. 2.3
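One possible way to use such skip pointers during intersection; the (docID, skip_index) representation below is an assumption made for this sketch, not the course's data structure:

import math

def add_skips(postings):
    # attach roughly sqrt(n) evenly spaced skip pointers
    n = len(postings)
    step = int(math.sqrt(n)) or 1
    return [(doc, i + step if i % step == 0 and i + step < n else None)
            for i, doc in enumerate(postings)]

def intersect_with_skips(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        d1, skip1 = p1[i]
        d2, skip2 = p2[j]
        if d1 == d2:
            answer.append(d1); i += 1; j += 1
        elif d1 < d2:
            # follow the skip pointer only if it still lands at or before d2
            i = skip1 if skip1 is not None and p1[skip1][0] <= d2 else i + 1
        else:
            j = skip2 if skip2 is not None and p2[skip2][0] <= d1 else j + 1
    return answer

a = add_skips([2, 4, 8, 41, 48, 64, 128])
b = add_skips([1, 2, 3, 8, 11, 17, 21, 31])
print(intersect_with_skips(a, b))   # [2, 8]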

CS3245 – Information Retrieval

Biword indexes
 Index every consecutive pair of terms in the text as a phrase (a bigram model using words).
 For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
 Each of these biwords is now a dictionary term.
 Two-word phrase query processing is now immediate.

48

Sec. 2.4.1

CS3245 – Information Retrieval

Positional index example

 <be: 993427;
  1: 7, 18, 33, 72, 86, 231;
  2: 3, 149;
  4: 17, 191, 291, 430, 434;
  5: 363, 367; …>

 For phrase queries, we use a merge algorithm recursively at the document level.
 Now we need to deal with more than just equality.

Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?

49

Sec. 2.4.2
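A rough sketch (made-up positions, not the textbook's pseudocode) of the positional check needed for a two-word phrase like "to be":

def phrase_pairs(pos1, pos2):
    # docIDs where some position of term2 equals a position of term1 plus 1
    hits = []
    for doc in pos1.keys() & pos2.keys():
        p2 = set(pos2[doc])
        if any(p + 1 in p2 for p in pos1[doc]):
            hits.append(doc)
    return sorted(hits)

to = {1: [1, 6, 17], 4: [8, 16, 190], 5: [362, 366]}
be = {1: [7, 18, 33], 2: [3, 149], 4: [17, 191], 5: [363, 367]}
print(phrase_pairs(to, be))   # [1, 4, 5]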

CS3245 – Information Retrieval

Choosing terms
 Definitely language-specific.
 In English, we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.

50

CS3245 – Information Retrieval

Week 4: The dictionary and tolerant retrieval

Data structures for the dictionary: hash tables, trees

Learning to be tolerant:
1. Wildcards: general, trees, Permuterm, N-grams redux
2. Spelling correction: edit distance, N-grams re-redux
3. Phonetic – Soundex

51

CS3245 – Information Retrieval

Hash Table
 Each vocabulary term is hashed to an integer.
 Pros: lookup is faster than for a tree: O(1)
 Cons:
  No easy way to find minor variants: judgment/judgement
  No prefix search
  If the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything

Not very tolerant!

52

Sec. 3.1

CS3245 – Information Retrieval

Wildcard queries
 How can we handle *'s in the middle of a query term? E.g., co*tion
 We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
 The solution: transform wild-card queries so that the * always occurs at the end.
 This gives rise to the Permuterm index.
 Alternative: k-gram index with postfiltering.

53

Sec. 3.2
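A tiny sketch of the Permuterm rotation idea (illustrative only):

def permuterm_rotations(term):
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

print(permuterm_rotations("hello"))
# Query co*tion: rotate so the * is at the end -> "tion$co*";
# any indexed rotation with prefix "tion$co" is a candidate match.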

CS3245 – Information Retrieval

Spelling correction
 Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q.
 How do we define "closest"? We studied several alternatives:
  1. Edit distance (Levenshtein distance)
  2. Weighted edit distance
  3. n-gram overlap
 Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.

54

Sec. 3.3.2
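A compact Levenshtein edit distance sketch to go with the summary above (standard dynamic programming, not the assignment's code):

def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("judgment", "judgement"))   # 1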

CS3245 – Information Retrieval

Week 5: Index construction
 Sort-based indexing: Blocked Sort-Based Indexing – merge sort is effective for disk-based sorting (avoid seeks)
 Single-Pass In-Memory Indexing: no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur)
 Distributed indexing using MapReduce
 Dynamic indexing: multiple indices, logarithmic merge

55

CS3245 – Information Retrieval

BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
 8-byte (4+4) records (termID, docID).
 As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes; these are generated as we parse docs.
 Must now sort 100M 8-byte records by termID.
 Define a block as ~10M such records: can easily fit a couple into memory; will have 10 such blocks for our collection.
 Basic idea of the algorithm:
  Accumulate postings for each block, sort, write to disk.
  Then merge the blocks into one long sorted order.

Sec. 4.2

56 Information Retrieval

CS3245 – Information Retrieval

SPIMI: Single-pass in-memory indexing
 Key idea 1: Generate separate dictionaries for each block – no need to maintain a term-termID mapping across blocks.
 Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
 With these two ideas we can generate a complete inverted index for each block.
 These separate indices can then be merged into one big index.

Sec. 4.3

57

CS3245 – Information Retrieval

Distributed Indexing: MapReduce Data flow

58

[Figure: Map phase – a Master assigns segment files (splits) to Parsers, which emit term partitions a-b, c-d, …, y-z; Reduce phase – Inverters each collect one partition and write its postings.]

CS3245 – Information Retrieval

Dynamic Indexing: 2nd simplest approach
 Maintain a "big" main index.
 New docs go into a "small" (in-memory) auxiliary index.
 Search across both, merge results.
 Deletions: keep an invalidation bit-vector for deleted docs, and filter docs output on a search result by this invalidation bit-vector.
 Periodically re-index into one main index.
 Assuming T total postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times.
 Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.

Sec. 4.5

59

CS3245 – Information Retrieval

Logarithmic merge
 Idea: maintain a series of indexes, each twice as large as the previous one.
 Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
 If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
 Either write merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2.
 … etc. (loop over the log levels)

Sec. 4.5

60 Information Retrieval

CS3245 – Information Retrieval

Week 6: Index Compression
 Collection and vocabulary statistics: Heaps' and Zipf's laws
 Compression to make the index smaller and faster:
  Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding
  Postings compression: gap encoding

61

CS3245 – Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking
 Store pointers to every kth term string (example below: k = 4).
 Need to store term lengths (1 extra byte).

  …7systile9syzygetic8syzygial6syzygy11szaibelyite…

 Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

62

Sec. 5.2

CS3245 – Information Retrieval

Front coding
 Sorted words commonly have a long common prefix – store differences only.
 Used for the last k-1 terms in a block of k:

  8automata 8automate 9automatic 10automation
  → 8automat*a 1◊e 2◊ic 3◊ion

 ("automat" is encoded once; each number counts the extra length beyond "automat".)
 Begins to resemble general string compression.

Information Retrieval 63

Sec. 5.2

CS3245 – Information Retrieval

Postings Compression: Postings file entry
 We store the list of docs containing a term in increasing order of docID:
  computer: 33, 47, 154, 159, 202, …
 Consequence: it suffices to store gaps:
  33, 14, 107, 5, 43, …
 Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec. 5.3

CS3245 – Information Retrieval

Variable Byte Encoding Example

 docIDs:  824                  829        215406
 gaps:                         5          214577
 VB code: 00000110 10111000    10000101   00001101 00001100 10110001

 Postings stored as the byte concatenation:
 00000110 10111000 10000101 00001101 00001100 10110001

 Key property: VB-encoded postings are uniquely prefix-decodable.
 For a small gap (5), VB uses a whole byte.
 (824 = 512 + 256 + 32 + 16 + 8)

65

Sec. 5.3
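A sketch of Variable Byte encoding/decoding for gap lists (this particular implementation is illustrative, not reference code):

def vb_encode_number(n):
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                 # set the continuation bit on the last byte
    return out

def vb_encode(gaps):
    return [b for g in gaps for b in vb_encode_number(g)]

def vb_decode(stream):
    numbers, n = [], 0
    for b in stream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

print(vb_decode(vb_encode([824, 5, 214577])))   # [824, 5, 214577]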

CS3245 – Information Retrieval

Week 7: Vector space ranking
1. Represent the query as a weighted tf-idf vector.
2. Represent each document as a weighted tf-idf vector.
3. Compute the cosine similarity score for the query vector and each document vector.
4. Rank documents with respect to the query by score.
5. Return the top K (e.g., K = 10) to the user.

66

CS3245 – Information Retrieval

tf-idf weighting
 The tf-idf weight of a term is the product of its tf weight and its idf weight:

  w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

 Best known weighting scheme in IR.
 Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.
 Increases with the number of occurrences within a document.
 Increases with the rarity of the term in the collection.

67

Sec. 6.2.2

CS3245 – Information Retrieval

Queries as vectors
 Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents".
 Key idea 2: Rank documents according to their proximity to the query in this space.
 proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity
 Motivation: want to rank more relevant documents higher than less relevant documents.

68

Sec. 6.3

CS3245 – Information Retrieval

Cosine for length-normalized vectors
 For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

  cos(q, d) = q · d = Σ_i q_i d_i    (for length-normalized q and d)

Information Retrieval 69
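Tying the last few slides together, a brief sketch of tf-idf weighting, length normalization and dot-product cosine scoring over a made-up toy collection:

import math
from collections import Counter

docs = {"d1": "new york times".split(),
        "d2": "new york post".split(),
        "d3": "los angeles times".split()}
N = len(docs)
df = Counter(t for toks in docs.values() for t in set(toks))

def tfidf_vector(tokens):
    tf = Counter(tokens)
    w = {t: (1 + math.log10(tf[t])) * math.log10(N / df[t]) for t in tf if t in df}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}           # length-normalize

def cosine(q, d):
    return sum(w * d.get(t, 0.0) for t, w in q.items())  # dot product

q = tfidf_vector("new new times".split())
for name, toks in docs.items():
    print(name, round(cosine(q, tfidf_vector(toks)), 3))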

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 – Information Retrieval

Week 8: Complete Search Systems
 Making the Vector Space Model more efficient to compute:
  Approximating the actual correct results
  Skipping unnecessary documents
 In actual data: dealing with zones and fields, query term proximity

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 – Information Retrieval

Generic approach
 Find a set A of contenders, with K < |A| << N.
 A does not necessarily contain the top K, but has many docs from among the top K.
 Return the top K docs in A.
 Think of A as pruning non-contenders.
 The same approach can also be used for other (non-cosine) scoring functions.

73

Sec. 7.1.1

CS3245 – Information Retrieval

Heuristics
 Index elimination: high-idf query terms only; docs containing many query terms
 Champion lists: generalized to high-low lists and a tiered index
 Impact-ordered postings: early termination; idf-ordered terms
 Cluster pruning

74

Sec. 7.1.1

CS3245 – Information Retrieval

Net score
 Consider a simple total score combining cosine relevance and authority:

  net-score(q, d) = g(d) + cosine(q, d)

 Can use some other linear combination than an equal weighting.
 Indeed, any function of the two "signals" of user happiness.
 Now we seek the top K docs by net score.

75

Sec. 7.1.4

CS3245 – Information Retrieval

Parametric Indices
 Fields: "Year = 1601" is an example of a field. A field or parametric index has postings for each field value; sometimes build range (B-tree) trees (e.g., for dates). A field query is typically treated as a conjunction (the doc must be authored by Shakespeare).
 Zones: a zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, … Build inverted indexes on zones as well to permit querying.

76

Sec. 6.1

CS3245 – Information Retrieval

Week 9: IR Evaluation
 How do we know if our results are any good?
 Benchmarks: precision and recall, composite measures, test collections
 A/B Testing

77

Ch. 8

CS3245 – Information Retrieval

A precision-recall curve

[Figure: precision (y-axis) plotted against recall (x-axis), both ranging from 0.0 to 1.0]

78

Sec. 8.4

CS3245 – Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec. 8.5

 Kappa measure:
  an agreement measure among judges
  designed for categorical judgments
  corrects for chance agreement
 Kappa (κ) = [P(A) − P(E)] / [1 − P(E)]
  P(A) – proportion of the time the judges agree
  P(E) – what agreement would be by chance
 Gives 0 for chance agreement, 1 for total agreement.
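A tiny worked example of the kappa computation (made-up judgment counts for two judges and binary relevance categories):

def kappa(n_agree_rel, n_agree_nonrel, n_disagree, total):
    p_a = (n_agree_rel + n_agree_nonrel) / total
    p_rel = (2 * n_agree_rel + n_disagree) / (2 * total)   # pooled marginal
    p_e = p_rel ** 2 + (1 - p_rel) ** 2                    # chance agreement
    return (p_a - p_e) / (1 - p_e)

# 300 relevant-relevant and 70 nonrelevant-nonrelevant agreements,
# 30 disagreements, 400 documents judged
print(round(kappa(300, 70, 30, 400), 3))   # about 0.776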

CS3245 – Information Retrieval

A/B testing
 Purpose: test a single innovation.
 Prerequisite: you have a large system with many users.
 Have most users use the old system; divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
 Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result.
 Now we can directly see if the innovation does improve user happiness.
 Probably the evaluation methodology that large search engines trust most.

80

Sec. 8.6.3

CS3245 – Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR

Chapter 9: Query Refinement
1. Relevance Feedback (document level): explicit RF – Rocchio (1971); when does it work?; variants: implicit and blind
2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation

Chapter 10.3: XML IR
 Basic XML concepts
 Challenges in XML IR
 Vector space model for XML IR
 Evaluation of XML IR

81

CS3245 – Information Retrieval  Sec. 9.2.2

82

Rocchio (1971)

In practice:

  q_m = α q_0 + β (1/|Dr|) Σ_{d_j ∈ Dr} d_j − γ (1/|Dnr|) Σ_{d_j ∈ Dnr} d_j

 Dr = set of known relevant doc vectors
 Dnr = set of known irrelevant doc vectors
 (Different from Cr and Cnr, as we only get judgments from a few documents)
 α, β, γ = weights (hand-chosen or set empirically)
 Popularized in the SMART system (Salton)
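A sketch of the Rocchio update above with NumPy (illustrative weights and made-up tf-idf vectors, not the SMART implementation):

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    q = alpha * q0
    if rel_docs:
        q = q + beta * np.mean(rel_docs, axis=0)
    if nonrel_docs:
        q = q - gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q, 0.0)      # negative term weights are usually clipped to 0

q0 = np.array([1.0, 0.0, 0.5])
rel = [np.array([0.8, 0.2, 0.6]), np.array([0.9, 0.1, 0.4])]
nonrel = [np.array([0.0, 0.9, 0.1])]
print(rocchio(q0, rel, nonrel))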

CS3245 – Information Retrieval

Query Expansion
 In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents.
 In query expansion, users give additional input (good/bad search term) on words or phrases.

Sec. 9.2.2

83

CS3245 – Information Retrieval

Co-occurrence Thesaurus
 Simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix.
 w_ij = (normalized) weight for (t_i, d_j).
 For each t_i, pick terms with high values in C.

In NLTK! Did you forget?

84

Sec. 9.2.3
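A quick sketch of the C = A·Aᵀ idea with a tiny made-up term-document matrix:

import numpy as np

terms = ["ship", "boat", "ocean", "vote"]
A = np.array([[1.0, 0.8, 0.0, 0.0],    # rows: terms, columns: documents
              [0.9, 0.7, 0.1, 0.0],
              [0.6, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.9, 1.0]])
A = A / np.linalg.norm(A, axis=1, keepdims=True)   # normalize term weights
C = A @ A.T                                        # term-term similarities

for i, t in enumerate(terms):
    j = max((k for k in range(len(terms)) if k != i), key=lambda k: C[i, k])
    print(t, "->", terms[j])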

CS3245 – Information Retrieval

Main idea: lexicalized subtrees
 Aim: have each dimension of the vector space encode a word together with its position within the XML tree.
 How? Map XML documents to lexicalized subtrees ("with words").

[Figure: an XML tree Book → (Title: "Microsoft", Author: "Bill Gates") is decomposed into lexicalized subtrees such as Author–Bill, Author–Gates, Title–Microsoft, Book–Author–"Bill Gates", Book–Title–Microsoft, …]

85

CS3245 – Information Retrieval

Context resemblance
 A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:

  CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|) if c_q matches c_d, and 0 otherwise

 |c_q| and |c_d| are the number of nodes in the query path and document path, respectively.
 c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes.

86


Page 47: Search Advertising, Duplicate Detection and Revision (zhaojin/cs3245_2019/w13.pdf)

CS3245 ndash Information Retrieval

Adding skip pointers to postings

Done at indexing time Why To skip postings that will not figure in the

search results How to do it And where do we place skip pointers

47

1282 4 8 41 48 64

311 2 3 8 11 17 213111

41 128

Sec 23

CS3245 ndash Information Retrieval

Biword indexes Index every consecutive pair of terms in the text as a

phrase bigram model using words For example the text ldquoFriends Romans

Countrymenrdquo would generate the biwords friends romans romans countrymen

Each of these biwords is now a dictionary term Two-word phrase query-processing is now

immediate

48

Sec 241

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms: Definitely language specific.

In English we worry about: tokenization, stop word removal, normalization (e.g. case folding), lemmatization and stemming.

50

CS3245 ndash Information Retrieval

Week 4: The dictionary and tolerant retrieval

Data structures for the dictionary: Hash, Trees

Learning to be tolerant:
1. Wildcards: General, Trees, Permuterm, Ngrams redux
2. Spelling correction: Edit distance, Ngrams re-redux
3. Phonetic – Soundex

51

CS3245 ndash Information Retrieval

Hash Table: Each vocabulary term is hashed to an integer.

Pros: Lookup is faster than for a tree: O(1)

Cons: No easy way to find minor variants (judgment/judgement). No prefix search. If the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything.

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries: How can we handle *'s in the middle of a query term, e.g. co*tion?

We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive.

The solution: transform wildcard queries so that the * always occurs at the end.

This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering.

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction: Isolated word correction: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap

Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time and suggest the one with the most hits.

54

Sec 332
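For reference, a standard dynamic-programming implementation of Levenshtein distance (unit costs; a generic sketch, not the assignment's code):

    def edit_distance(s, t):
        # Levenshtein distance with insert/delete/substitute all costing 1.
        prev = list(range(len(t) + 1))
        for i in range(1, len(s) + 1):
            cur = [i] + [0] * len(t)
            for j in range(1, len(t) + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                cur[j] = min(prev[j] + 1,         # delete from s
                             cur[j - 1] + 1,      # insert into s
                             prev[j - 1] + cost)  # substitute (or copy)
            prev = cur
        return prev[len(t)]

    print(edit_distance("judgement", "judgment"))   # -> 1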

CS3245 ndash Information Retrieval

Week 5: Index construction

Sort-based indexing: Blocked Sort-Based Indexing. Merge sort is effective for disk-based sorting (avoid seeks).

Single-Pass In-Memory Indexing: No global dictionary – generate a separate dictionary for each block. Don't sort postings – accumulate postings as they occur.

Distributed indexing using MapReduce. Dynamic indexing: multiple indices, logarithmic merge.

55

CS3245 ndash Information Retrieval

BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)

8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes. These are generated as we parse docs.

Must now sort 100M 8-byte records by termID. Define a block as ~10M such records; can easily fit a couple into memory. Will have 10 such blocks for our collection.

Basic idea of the algorithm: accumulate postings for each block, sort, write to disk. Then merge the blocks into one long sorted order.

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1: Generate separate dictionaries for each block – no need to maintain a term-termID mapping across blocks.

Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57
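A toy sketch of one SPIMI block (assuming a token stream of (term, docID) pairs with docIDs arriving in increasing order; real blocks are bounded by memory and written to disk):

    from collections import defaultdict

    def spimi_invert(token_stream):
        # Block-local dictionary: postings lists grow as tokens stream in; no
        # global term-termID mapping and no sorting of postings is needed.
        index = defaultdict(list)
        for term, doc_id in token_stream:
            postings = index[term]                  # term is added on first sight
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)             # docIDs arrive in order: just append
        # terms are sorted only once, when the block is "written out"
        return {term: index[term] for term in sorted(index)}

    block = spimi_invert([("brutus", 1), ("caesar", 1), ("caesar", 2), ("brutus", 4)])
    print(block)   # {'brutus': [1, 4], 'caesar': [1, 2]}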

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

[Figure: MapReduce data flow — segment files are assigned by the master to parsers in the map phase; parser output is partitioned by term range (a-b, c-d, …, y-z) and handed to inverters in the reduce phase, which write the postings.]

CS3245 ndash Information Retrieval

Dynamic Indexing, 2nd simplest approach: Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both, merge results. Deletions: keep an invalidation bit-vector for deleted docs and filter the docs output on a search result by this bit-vector. Periodically re-index into one main index. Assuming T total postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively we can apply logarithmic merge, in which each posting is merged up to log(T/n) times.

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge. Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk. If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1. Either write the merged Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2 … etc.

Sec 45

60Information Retrieval

(Loop for log levels)
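A toy sketch of the scheme (indexes are just sorted lists of postings here; the names Z0 and I_i follow the slide, everything else is illustrative):

    def log_merge_add(new_postings, z0, disk, n):
        # z0 is the in-memory index; disk[i] plays the role of I_i (size ~ n * 2**i) or None.
        z0 = sorted(z0 + new_postings)
        if len(z0) <= n:
            return z0, disk                     # still fits in memory
        z, level = z0, 0
        while level < len(disk) and disk[level] is not None:
            z = sorted(disk[level] + z)         # merge with I_level, forming Z_(level+1)
            disk[level] = None
            level += 1
        if level == len(disk):
            disk.append(None)
        disk[level] = z                         # write out as I_level
        return [], disk

    z0, disk = [], []
    for doc in range(1, 10):
        z0, disk = log_merge_add([doc], z0, disk, n=2)
    print(z0, disk)   # [] [[7, 8, 9], [1, 2, 3, 4, 5, 6]]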

CS3245 ndash Information Retrieval

Week 6: Index Compression

Collection and vocabulary statistics: Heaps' and Zipf's laws

Compression to make the index smaller and faster:
Dictionary compression for Boolean indexes: dictionary string, blocks, front coding
Postings compression: gap encoding

61

CS3245 ndash Information Retrieval

Index Compression: Dictionary-as-a-String and Blocking. Store pointers to every kth term string. Example below: k=4. Need to store term lengths (1 extra byte).

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

Sec 52

CS3245 ndash Information Retrieval

Front coding: Sorted words commonly have a long common prefix – store the differences only. Used for the last k-1 terms in a block of k.

8automata8automate9automatic10automation → 8automat*a1◊e2◊ic3◊ion

("automat" is encoded once; each following entry stores only the extra length beyond "automat" and the differing suffix.)

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry. We store the list of docs containing a term in increasing order of docID: computer: 33, 47, 154, 159, 202 …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43 …

Hope: most gaps can be encoded/stored with far fewer than 20 bits.

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example

docIDs:   824                  829        215406
gaps:                          5          214577
VB code:  00000110 10111000    10000101   00001101 00001100 10110001

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable.

For a small gap (5), VB uses a whole byte.

Sec 53

512+256+32+16+8 = 824
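A small Python sketch of VB encoding/decoding (7 payload bits per byte, continuation bit set on the last byte of each gap); it reproduces the byte patterns in the example above:

    def vb_encode_number(n):
        parts = []
        while True:
            parts.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        parts[-1] += 128                        # set the high bit of the final byte
        return bytes(parts)

    def vb_encode(gaps):
        return b"".join(vb_encode_number(g) for g in gaps)

    def vb_decode(stream):
        gaps, n = [], 0
        for b in stream:
            if b < 128:
                n = 128 * n + b                 # continuation byte
            else:
                gaps.append(128 * n + (b - 128))
                n = 0
        return gaps

    code = vb_encode([824, 5, 214577])
    print(code.hex())        # '06b8850d0cb1' — the bit patterns shown above
    print(vb_decode(code))   # [824, 5, 214577]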

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.

The weight increases with the number of occurrences within a document, and increases with the rarity of the term in the collection.
67

Sec 622
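A one-function sketch of this weighting (log base 10, as in the formula above; purely illustrative):

    import math

    def tfidf_weight(tf, df, N):
        # w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t); zero if the term is absent
        if tf == 0 or df == 0:
            return 0.0
        return (1 + math.log10(tf)) * math.log10(N / df)

    print(tfidf_weight(tf=3, df=100, N=1_000_000))   # ~ 5.91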

CS3245 ndash Information Retrieval

Queries as vectors. Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents". Key idea 2: Rank documents according to their proximity to the query in this space.

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation: Want to rank more relevant documents higher than less relevant documents.

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): cos(q, d) = q · d = Σ_i q_i d_i, for length-normalized q and d.

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems. Making the Vector Space Model more efficient to compute: approximating the actual correct results, skipping unnecessary documents. In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.

Think of A as pruning non-contenders.

The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711

CS3245 ndash Information Retrieval

Heuristics:
- Index elimination: high-idf query terms only; docs containing many query terms
- Champion lists: generalized to high-low lists and tiered indexes
- Impact-ordered postings: early termination, idf-ordered terms
- Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score: Consider a simple total score combining cosine relevance and authority:

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting; indeed, any function of the two "signals" of user happiness.

Now we seek the top K docs by net score.

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: "Year = 1601" is an example of a field. A field or parametric index has postings for each field value. Sometimes build range trees (e.g. B-trees for dates). A field query is typically treated as a conjunction (doc must be authored by Shakespeare).

Zones: A zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References … Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9: IR Evaluation. How do we know if our results are any good? Benchmarks; Precision and Recall; Composite measures; Test collections; A/B Testing.

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve — precision (y-axis, 0.0–1.0) plotted against recall (x-axis, 0.0–1.0).]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: Agreement measure among judges. Designed for categorical judgments; corrects for chance agreement.

Kappa (K) = [P(A) - P(E)] / [1 - P(E)], where P(A) is the proportion of the time the judges agree and P(E) is what the agreement would be by chance.

Gives 0 for chance agreement, 1 for total agreement.
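A quick worked example (the counts are made up for illustration, using pooled marginals for P(E)):

    def kappa(both_rel, both_nonrel, only_judge1_rel, only_judge2_rel):
        n = both_rel + both_nonrel + only_judge1_rel + only_judge2_rel
        p_a = (both_rel + both_nonrel) / n
        # pooled probability of a "relevant" judgment across both judges
        p_rel = (2 * both_rel + only_judge1_rel + only_judge2_rel) / (2 * n)
        p_e = p_rel ** 2 + (1 - p_rel) ** 2
        return (p_a - p_e) / (1 - p_e)

    print(kappa(20, 60, 10, 10))   # P(A)=0.80, P(E)=0.58 -> kappa ~ 0.52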

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness.

Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR

Chapter 10 (XML IR): Basic XML concepts; Challenges in XML IR; Vector space model for XML IR; Evaluation of XML IR

Chapter 9 (Query Refinement):
1. Relevance Feedback (document level): Explicit RF – Rocchio (1971); When does it work?; Variants: Implicit and Blind
2. Query Expansion (term level): Manual thesaurus; Automatic thesaurus generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. These are different from Cr and Cnr, as we only get judgments from a few documents. α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton).
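The Rocchio update itself appeared as an image on the slide; the standard formula that these symbols refer to is (negative components are usually clipped to 0):

    q_m = α·q_0 + β·(1/|Dr|)·Σ_{dj ∈ Dr} dj − γ·(1/|Dnr|)·Σ_{dj ∈ Dnr} dj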

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents.

In query expansion, users give additional input (good/bad search term) on words or phrases.

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: The simplest way to compute one is based on the term-term similarities in C = AA^T, where A is the M x N term-document matrix with w_ij = (normalized) weight for (t_i, d_j).

For each t_i, pick the terms with high values in C.

In NLTK! Did you forget?

84

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees. Aim to have each dimension of the vector space encode a word together with its position within the XML tree.

How? Map XML documents to lexicalized subtrees.

85

[Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is decomposed "with words" into lexicalized subtrees such as Microsoft, Bill Gates, Title/Microsoft, Author/Bill, Author/Gates, Author/"Bill Gates", Book/Title/Microsoft and Book/Author/"Bill Gates".]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86
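The CR function itself was shown as an image; the standard definition that the surrounding text describes is:

    CR(cq, cd) = (1 + |cq|) / (1 + |cd|)  if cq matches cd,  and 0 otherwise

For example, CR(author/"Bill", book/author/firstname/"Bill") = (1 + 2) / (1 + 4) = 0.6, the value used in the SimNoMerge example below.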

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>, e.g. author/"Bill"

Inverted index (dictionary → postings):
<c1, t> (e.g. author/"Bill") → <d1, 0.5>, <d4, 0.1>, <d9, 0.2>
<c2, t> (e.g. title/"Bill") → <d2, 0.25>, <d3, 0.1>, <d12, 0.9>
<c3, t> (e.g. book/author/firstname/"Bill") → <d3, 0.7>, <d6, 0.8>, <d9, 0.5>

This example is slightly different from the book. All weights have been normalized.

Context resemblances: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60. It is OK to ignore the query vs <c2, t> since CR = 0.0.

If the query term weight wq = 1.0, then
sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each product is Context Resemblance × Query Term Weight × Document Term Weight).

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR

Chapter 11:
1. Probabilistic Approach to Retrieval / Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): Traditionally used with the PRP.

Assumptions:
"Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g. document d is represented by the vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: We can rank documents using the log odds ratios c_t for the terms in the query.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 - u_t).

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q ∩ d} c_t. Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in the VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model. If the query is long, we might also use similar weighting for query terms:

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case you build an information retrieval scheme in the exact same way. The difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q).

P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4). P(q|d) is the probability of q given d.

So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12: Web Search

Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
- Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
- Duplicates: need to integrate duplicate detection
- Scale: need to use a distributed approach
- Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a.
1. Start with any distribution, say x = (1, 0, …, 0)
2. After one step, we're at xA
3. After two steps, at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable.
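A minimal numpy sketch of this power iteration (the 2-state transition matrix is illustrative, not from the lecture; A is assumed to already include teleporting, so its rows sum to 1):

    import numpy as np

    def pagerank_power_iteration(A, tol=1e-10):
        # multiply x by increasing powers of A until the product stops changing
        x = np.zeros(A.shape[0])
        x[0] = 1.0                      # any starting distribution works
        while True:
            x_next = x @ A
            if np.abs(x_next - x).sum() < tol:
                return x_next           # the steady state a
            x = x_next

    A = np.array([[0.1, 0.9],
                  [0.5, 0.5]])
    print(pagerank_power_iteration(A))  # ~ [0.357 0.643]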

CS3245 ndash Information Retrieval

100

Pagerank summary. Pre-processing: Given the graph of links, build the matrix A; from it, compute the steady-state vector a. The pagerank a_i is a scaled number between 0 and 1.

Query processing: Retrieve pages meeting the query; rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103. Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

Page 49: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Positional index example

For phrase queries we use a merge algorithm recursively at the document level

Now need to deal with more than just equality

49

ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt

Quick checkWhich of docs 1245could contain ldquoto be

or not to berdquo

Sec 242

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement.

Kappa (κ) = [P(A) - P(E)] / [1 - P(E)], where P(A) = proportion of the time the judges agree and P(E) = what agreement would be by chance

Gives 0 for chance agreement, 1 for total agreement
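A tiny sketch, assuming two judges give categorical labels; the pooled-marginals estimate of P(E) follows the formula above, and the sample judgments are made up:

def kappa(judge_a, judge_b):
    # judge_a, judge_b: parallel lists of categorical judgments (e.g. "R" / "N")
    n = len(judge_a)
    p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # chance agreement P(E) from the judges' pooled marginal label distribution
    labels = set(judge_a) | set(judge_b)
    p_chance = sum(((judge_a.count(l) + judge_b.count(l)) / (2 * n)) ** 2 for l in labels)
    return (p_agree - p_chance) / (1 - p_chance)

print(kappa(["R", "R", "N", "R", "N"], ["R", "N", "N", "R", "N"]))  # 0.6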

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness. Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3 XML IR: basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR

Chapter 9 Query Refinement:
1. Relevance Feedback (document level): explicit RF – Rocchio (1971); when does it work?; variants: implicit and blind
2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents. α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton)
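In practice the modified query is qm = α·q0 + β·(1/|Dr|)·Σ(d in Dr) d − γ·(1/|Dnr|)·Σ(d in Dnr) d, with negative components usually clipped to zero. A minimal sketch (the α, β, γ defaults below are common choices, not values from the slide; vectors are plain dicts):

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # q0 and each doc are dicts of term -> weight; returns the modified query vector
    qm = {t: alpha * w for t, w in q0.items()}
    for docs, coef in ((rel_docs, beta), (nonrel_docs, -gamma)):
        for d in docs:
            for t, w in d.items():
                qm[t] = qm.get(t, 0.0) + coef * w / len(docs)
    # negative components are usually clipped to zero
    return {t: w for t, w in qm.items() if w > 0}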

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents

In query expansion, users give additional input (good / bad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus. The simplest way to compute one is based on term-term similarities in C = AA^T, where A is the M × N term-document matrix (rows ti are terms, columns dj are documents) and wij = (normalized) weight for (ti, dj).

For each ti, pick terms with high values in C

Sec 923

In NLTK Did you forget

84
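A minimal numpy sketch (A is assumed to already hold the normalized weights wij, rows = terms, columns = documents; the function and parameter names are mine):

import numpy as np

def cooccurrence_neighbours(A, terms, k=3):
    # C = A A^T holds term-term similarities; for each term pick the k
    # other terms with the highest values in its row of C
    C = (A @ A.T).astype(float)
    np.fill_diagonal(C, -np.inf)               # ignore a term's similarity to itself
    top = np.argsort(-C, axis=1)[:, :k]
    return {terms[i]: [terms[j] for j in top[i]] for i in range(len(terms))}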

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees. Aim: have each dimension of the vector space encode a word together with its position within the XML tree.

How? Map XML documents to lexicalized subtrees.

85

[Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is mapped to its lexicalized subtrees ("with words"), e.g. Author/Bill, Author/Gates, Title/Microsoft, Author/Bill Gates, Book/Title/Microsoft, ...]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise

|cq| and |cd| are the number of nodes in the query path and document path, respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86
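A small sketch of CR with paths given as node lists; the matching test below is a simple subsequence check, which is one way to capture "insert additional nodes":

def cr(cq, cd):
    # CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq can be turned into cd by
    # inserting nodes (here: cq is an in-order subsequence of cd), else 0
    it = iter(cd)
    matches = all(node in it for node in cq)
    return (1 + len(cq)) / (1 + len(cd)) if matches else 0.0

print(cr(["author", "Bill"], ["book", "author", "firstname", "Bill"]))  # 0.6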

CS3245 ndash Information Retrieval

SimNoMerge example
87
This example is slightly different from the book; all weights have been normalized.

Query: <c1, t> with query term weight wq = 1.0, e.g. author/"Bill"

Inverted index (dictionary of contexts; postings of <doc, document term weight>):
<c1, t> (e.g. author/"Bill"): <d1, 0.5>, <d4, 0.1>, <d9, 0.2>
<c2, t> (e.g. title/"Bill"): <d2, 0.25>, <d3, 0.1>, <d12, 0.9>
<c3, t> (e.g. book/author/firstname/"Bill"): <d3, 0.7>, <d6, 0.8>, <d9, 0.5>

Context resemblance: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60
OK to ignore query vs. <c2, t> since CR = 0.0; evaluate query vs. <c1, t> and query vs. <c3, t>

If wq = 1.0 then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11: 1. Probabilistic Approach to Retrieval / Basic Probability Theory; 2. Probability Ranking Principle; 3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions:
Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g. document d represented by vector x = (x1, ..., xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice – a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: we can rank documents using the log odds ratios ct for the terms in the query.

The odds ratio is the ratio of two odds:

(i) the odds of the term appearing if relevant, pt / (1 - pt), and

(ii) the odds of the term appearing if non-relevant, ut / (1 - ut):

ct = log [ pt (1 - ut) / ( ut (1 - pt) ) ]

ct = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and ct is positive if it is more likely to appear in relevant documents.

ct functions as a term weight, so that RSVd = Σ ct, summed over the query terms present in the document.

Operationally, we sum ct quantities in accumulators for query terms appearing in documents, just like in the VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
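A hedged sketch of the ct weight with the usual add-0.5 smoothed estimates (the estimator form is the standard RSJ one; the variable names and the no-relevance-information default are mine):

import math

def c_t(df_t, N, s_t=0, S=0):
    # s_t / S: relevant docs containing t / total known relevant docs; with no
    # relevance information (s_t = S = 0) this behaves like an idf-style weight
    p_t = (s_t + 0.5) / (S + 1.0)
    u_t = (df_t - s_t + 0.5) / (N - S + 1.0)
    return math.log(p_t * (1 - u_t) / (u_t * (1 - p_t)))

def rsv(query_terms, doc_terms, df, N):
    # RSV_d = sum of c_t over query terms that appear in the document
    return sum(c_t(df[t], N) for t in query_terms if t in doc_terms and t in df)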

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tf(t,q): term frequency in the query q. k3: tuning parameter controlling term frequency scaling of the query. There is no length normalization of queries (because retrieval is being done with respect to a single fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec 1143

92
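A sketch of the full BM25 score with the query-term factor included (this follows the standard Okapi form; the parameter defaults sit in the ranges quoted above, and Lave is the average document length):

import math

def bm25(query_tf, doc_tf, doc_len, L_ave, df, N, k1=1.2, b=0.75, k3=1.5):
    score = 0.0
    for t, tf_tq in query_tf.items():
        if t not in doc_tf or t not in df:
            continue
        tf_td = doc_tf[t]
        idf = math.log10(N / df[t])
        doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * doc_len / L_ave) + tf_td)
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)   # no length normalization of queries
        score += idf * doc_part * query_part
    return score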

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d.

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
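A sketch of the mixture-model query likelihood these two slides summarize, P(q|d) ≈ Π over query terms of [λ·P(t|Md) + (1 − λ)·P(t|Mc)]; the λ default and the log-space computation are my choices, and the document and collection are assumed non-empty:

import math
from collections import Counter

def query_log_likelihood(query_tokens, doc_tokens, collection_tokens, lam=0.5):
    # log P(q|d) with mixing of the document model Md and collection model Mc
    d, c = Counter(doc_tokens), Counter(collection_tokens)
    len_d, len_c = len(doc_tokens), len(collection_tokens)
    log_p = 0.0
    for t in query_tokens:
        p = lam * (d[t] / len_d) + (1 - lam) * (c[t] / len_c)
        log_p += math.log(p) if p > 0 else float("-inf")
    return log_p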

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202
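Python's standard library can parse and apply such rules; a minimal sketch (the site URL and user-agent string below are hypothetical):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # fetch once per site and cache it
rp.read()
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/index.html"))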

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0))
2. After one step, we're at xA
3. After two steps at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
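A small power-iteration sketch (A is assumed to be the row-stochastic transition matrix with teleporting already folded in; the tolerance and iteration cap are arbitrary choices):

import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    # A: N x N row-stochastic transition matrix (teleporting already included)
    N = A.shape[0]
    x = np.zeros(N)
    x[0] = 1.0                                  # any starting distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A                          # one step of the chain: x -> xA
        if np.abs(x_next - x).sum() < tol:      # "the product looks stable"
            return x_next
        x = x_next
    return x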

CS3245 ndash Information Retrieval

100

Pagerank summary. Pre-processing: given the graph of links, build the matrix A; from it, compute a. The pagerank ai is a scaled number between 0 and 1.

Query processing: retrieve pages meeting the query; rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCH ADVERTISING
  • 1st Generation of Search Ads: Goto (1996)
  • 1st Generation of search ads: Goto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATE DETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index Compression: Dictionary-as-a-String and Blocking
  • Front coding
  • Postings Compression: Postings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 50: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Choosing terms Definitely language specific

In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming

50

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 51: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Data Structures for the Dictionary Hash Trees

Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux

2 Spelling Correction Edit Distance Ngrams re-redux

3 Phonetic ndash Soundex

51

Week 4 The dictionary and tolerant retrieval

CS3245 ndash Information Retrieval

Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)

Cons No easy way to find minor variants

judgmentjudgement

No prefix search If vocabulary keeps growing need to occasionally do the

expensive operation of rehashing everything

52

Sec 31

Not very tolerant

CS3245 ndash Information Retrieval

Wildcard queries How can we handle rsquos in the middle of query term cotion

We could look up co AND tion in a B-tree and intersect the two term sets Expensive

The solution transform wild-card queries so that the always occurs at the end

This gives rise to the Permuterm index Alternative K-gram index with postfiltering

53

Sec 32

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105


Page 54: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the

words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives

1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap

Context-sensitive spelling correction Try all possible resulting phrases with one word

ldquocorrectedrdquo at a time and suggest the one with most hits54

Sec 332

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors
Different from Cr and Cnr, as we only get judgments from a few documents
α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
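The Rocchio update itself is a formula image on the slide; in vector form it is q_m = α·q0 + β·centroid(Dr) − γ·centroid(Dnr). A small numpy sketch (the α, β, γ values and the vectors below are illustrative):

import numpy as np

def rocchio(q0, D_r, D_nr, alpha=1.0, beta=0.75, gamma=0.15):
    # q0: original query vector; D_r / D_nr: rows are known (non-)relevant doc vectors
    q = alpha * q0
    if len(D_r):
        q = q + beta * D_r.mean(axis=0)       # move toward the relevant centroid
    if len(D_nr):
        q = q - gamma * D_nr.mean(axis=0)     # move away from the non-relevant centroid
    return np.maximum(q, 0.0)                 # negative term weights are typically clipped to 0

q0 = np.array([1.0, 0.0, 0.5])
D_r = np.array([[0.8, 0.2, 0.0], [0.6, 0.4, 0.1]])
D_nr = np.array([[0.0, 0.9, 0.0]])
print(rocchio(q0, D_r, D_nr))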

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents.

In query expansion, users give additional input (good / bad search term) on words or phrases.

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the M × N term-document matrix and w_ij = (normalized) weight for (t_i, d_j).
For each t_i, pick the terms with high values in C.

Sec 9.2.3

In NLTK! Did you forget?

84
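A minimal numpy sketch of the construction above: build C = A·Aᵀ from a toy M×N term-document weight matrix (all values made up) and, for each term, read off the other terms with the highest entries in its row of C:

import numpy as np

terms = ["car", "automobile", "engine", "flower"]
A = np.array([                 # M x N term-document weight matrix (toy, normalized weights)
    [0.7, 0.0, 0.6, 0.2],
    [0.6, 0.1, 0.5, 0.0],
    [0.3, 0.0, 0.7, 0.1],
    [0.0, 0.9, 0.0, 0.0],
])
C = A @ A.T                    # term-term similarity matrix

def related(term, top=2):
    i = terms.index(term)
    order = np.argsort(-C[i])
    return [terms[j] for j in order if j != i][:top]

print(related("car"))          # ['automobile', 'engine']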

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
Aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.

85

[Figure: an example XML document — a Book with Title "Microsoft" and Author "Bill Gates" — decomposed into lexicalized subtrees, e.g. the bare words (Microsoft, Bill, Gates), Title/Microsoft, Author/Bill, Author/Gates, and Book/Title/Microsoft; dimensions are subtrees "with words"]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the context resemblance function CR (formula on the slide; see the sketch after this slide).
|cq| and |cd| are the number of nodes in the query path and document path, respectively.
cq matches cd iff we can transform cq into cd by inserting additional nodes.

86
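The CR formula is an image on the slide; the context resemblance function in IIR is CR(cq, cd) = (1 + |cq|) / (1 + |cd|) when cq matches cd, and 0 otherwise. A small sketch, representing paths as lists of node labels and reading "cq can be transformed into cd by inserting nodes" as a subsequence test (an interpretation, not the slide's own code):

def matches(cq, cd):
    # True iff cq is a subsequence of cd, i.e. cd can be obtained by inserting extra nodes
    it = iter(cd)
    return all(node in it for node in cq)

def context_resemblance(cq, cd):
    return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0

print(context_resemblance(["author"], ["author"]))                       # 1.0
print(context_resemblance(["author"], ["book", "author", "firstname"]))  # 0.5
print(context_resemblance(["author"], ["title"]))                        # 0.0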

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: the context-term pair <c1, t> (e.g. author/"Bill").
Inverted index (dictionary → postings), with contexts such as author/"Bill", title/"Bill", book/author/firstname/"Bill":
<c1, t> → <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> → <d2, 0.25> <d3, 0.1> <d12, 0.9>
<c3, t> → <d3, 0.7> <d6, 0.8> <d9, 0.5>
(This example is slightly different from the book.)
Context resemblance: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.6
It is OK to ignore Query vs. <c2, t>, since CR = 0.0; only Query vs. <c1, t> and Query vs. <c3, t> contribute.
Each contribution is Context Resemblance × Query Term Weight × Document Term Weight, so if wq = 1.0 then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(All weights have been normalized.)

CS3245 ndash Information Retrieval

INEX relevance assessments: The relevance-coverage combinations are quantized (table on the slide).
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as a sum of the quantized grades over A (formula on the slide).

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval / Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model; BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): traditionally used with the PRP.
Assumptions:
Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d is represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice – a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV. We can rank documents using the log odds ratios for the terms in the query, c_t.
The odds ratio is the ratio of two odds:
(i) the odds of the term appearing if relevant, p_t / (1 − p_t), and
(ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t).
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that a document's retrieval status value is the sum of the c_t of the query terms it contains.
Operationally, we sum c_t quantities in accumulators for the query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
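A small sketch of the term weight and accumulator-style RSV described above; the p_t and u_t estimates below are toy values and all names are illustrative:

import math

def c_t(p_t, u_t):
    # log odds ratio: log[ (p_t/(1-p_t)) / (u_t/(1-u_t)) ]
    return math.log((p_t * (1 - u_t)) / (u_t * (1 - p_t)))

def rsv(query_terms, doc_terms, p, u):
    # sum c_t over query terms that occur in the document, just like VSM accumulators
    return sum(c_t(p[t], u[t]) for t in query_terms if t in doc_terms)

p = {"retrieval": 0.6, "model": 0.5}   # P(term occurs | relevant)
u = {"retrieval": 0.1, "model": 0.5}   # P(term occurs | non-relevant)
print(c_t(0.5, 0.5))                                             # 0.0: equal odds
print(rsv(["retrieval", "model"], {"retrieval", "model"}, p, u))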

CS3245 ndash Information Retrieval

Okapi BM25, A Nonbinary Model: If the query is long, we might also use similar weighting for query terms.
tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec 1143

92
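The BM25 formulas are images on the slides; as a hedged sketch, the standard scoring function (document part with k1 and b, plus the long-query part with k3 described above) looks like this, with toy inputs and illustrative names:

import math

def bm25(query_tf, doc_tf, doc_len, avg_doc_len, df, N, k1=1.2, b=0.75, k3=1.5):
    # query_tf / doc_tf: term -> tf; df: term -> document frequency; N: collection size
    score = 0.0
    for term, tf_tq in query_tf.items():
        tf_td = doc_tf.get(term, 0)
        if tf_td == 0 or term not in df:
            continue
        idf = math.log10(N / df[term])
        doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        qry_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)   # no length normalization of the query
        score += idf * doc_part * qry_part
    return score

print(bm25({"ir": 1, "model": 2}, {"ir": 3, "model": 1}, doc_len=12, avg_doc_len=10.0,
           df={"ir": 50, "model": 200}, N=1000))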

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great.
In either case, you build an information retrieval scheme in the exact same way.
Difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model.
Given a query q, rank documents based on P(d|q).
P(q) is the same for all documents, so ignore it.
P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model: The user has a document in mind and generates the query from this document.
The equation represents the probability that the document the user had in mind was, in fact, this one (a sketch follows this slide).

95

Sec 1222
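A sketch of the query-likelihood scoring this summary refers to, using a linear mixture of the document model Md and the collection model Mc, P(q|d) = Π_t [ λ·P(t|Md) + (1−λ)·P(t|Mc) ]; λ and the toy texts below are illustrative:

def query_likelihood(query_terms, doc_tokens, collection_tokens, lam=0.5):
    p = 1.0
    for t in query_terms:
        p_doc = doc_tokens.count(t) / len(doc_tokens)                  # MLE of P(t | Md)
        p_col = collection_tokens.count(t) / len(collection_tokens)    # MLE of P(t | Mc)
        p *= lam * p_doc + (1 - lam) * p_col                           # smoothed mixture
    return p

doc = "information retrieval with language models".split()
collection = ("information retrieval with language models "
              "probabilistic models for retrieval and search").split()
print(query_likelihood(["retrieval", "models"], doc, collection))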

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
Duplicates: need to integrate duplicate detection
Scale: need to use a distributed approach
Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling (a small sketch follows this slide).

Information Retrieval 98

Sec 202
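For illustration only (the host, path, and agent name below are made up): Python's standard library can parse robots.txt and answer "may I fetch this path?", and caching then amounts to keeping one parsed object per host so each site's file is fetched once.

from urllib import robotparser

robots_cache = {}   # host -> RobotFileParser, so each site's robots.txt is fetched only once

def allowed(host, path, agent="MyCrawler"):
    if host not in robots_cache:
        rp = robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        rp.read()                                  # fetch and parse the robots.txt file
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(agent, f"http://{host}{path}")

# allowed("www.example.com", "/yoursite/temp/index.html")  -> True/False per the site's rules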

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1 0 … 0))
2. After one step, we're at xA
3. After two steps, at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable (sketch below).
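A numpy sketch of exactly this procedure on a toy 3-page transition matrix A (teleportation assumed to be already folded in, so every row sums to 1):

import numpy as np

A = np.array([                      # toy row-stochastic transition matrix
    [0.05, 0.90, 0.05],
    [0.45, 0.10, 0.45],
    [0.05, 0.90, 0.05],
])

x = np.array([1.0, 0.0, 0.0])       # start with any distribution
for _ in range(100):                # multiply by increasing powers of A ...
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-12):
        break                       # ... until the product looks stable: the steady state a
    x = x_next

print(x)                            # the PageRank vector a (entries sum to 1)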

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: Given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: Retrieve pages meeting the query; rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 55: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing

Merge sort is effective for disk-based sorting (avoid seeks)

Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur

Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge

55

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 56: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map

terms to termIDs of 4 bytes

These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection

Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order

Sec 42

56Information Retrieval

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 57: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

SPIMI Single-pass in-memory indexing

Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks

Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur

With these two ideas we can generate a complete inverted index for each block

These separate indices can then be merged into one big index

Sec 43

57

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry
We store the list of docs containing a term in increasing order of docID: computer: 33, 47, 154, 159, 202 …
Consequence: it suffices to store gaps: 33, 14, 107, 5, 43 …
Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53
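As a tiny illustration (assuming the docIDs are already sorted):

def to_gaps(doc_ids):
    """Store the first docID as-is, then only the difference to the previous docID."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Rebuild docIDs by summing the gaps back up."""
    ids = [gaps[0]]
    for g in gaps[1:]:
        ids.append(ids[-1] + g)
    return ids

print(to_gaps([33, 47, 154, 159, 202]))   # [33, 14, 107, 5, 43]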

CS3245 ndash Information Retrieval

Variable Byte Encoding Example
docIDs:   824                  829        215406
gaps:                          5          214577
VB code:  00000110 10111000    10000101   00001101 00001100 10110001

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001
Key property: VB-encoded postings are uniquely prefix-decodable
For a small gap (5), VB uses a whole byte
(Check: 512+256+32+16+8 = 824)

65

Sec 53
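A minimal sketch of VB encoding and decoding (my own code; the printed bytes should match the example above – 7 payload bits per byte, continuation bit set only on the last byte of each gap):

def vb_encode_number(n):
    """Encode one gap: 7 bits per byte, high bit (128) set on the final byte."""
    bytes_out = []
    while True:
        bytes_out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_out[-1] += 128                     # mark the last byte of this number
    return bytes_out

def vb_decode(byte_stream):
    """Decode a concatenation of VB-encoded gaps back into the gap sequence."""
    numbers, n = [], 0
    for b in byte_stream:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
code = [b for g in gaps for b in vb_encode_number(g)]
print([format(b, "08b") for b in code])      # 00000110 10111000 10000101 00001101 00001100 10110001
print(vb_decode(code))                       # [824, 5, 214577]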

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf × idf.
The tf part increases with the number of occurrences within a document; the idf part increases with the rarity of the term in the collection.

67

Sec 622
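A direct Python transcription of that weight (function and argument names are my own):

import math

def tf_idf_weight(tf_td, df_t, N):
    """w(t,d) = (1 + log10 tf) * log10(N / df); zero if the term is absent from the doc."""
    if tf_td == 0:
        return 0.0
    return (1 + math.log10(tf_td)) * math.log10(N / df_t)

# e.g. a term occurring twice in a doc and in 100 of 1,000,000 docs:
print(round(tf_idf_weight(2, 100, 1_000_000), 3))   # (1 + 0.301) * 4 = 5.204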

CS3245 ndash Information Retrieval

Queries as vectors
Key idea 1: Do the same for queries – represent them as vectors in the space; they are "mini-documents".
Key idea 2: Rank documents according to their proximity to the query in this space.

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_i q_i d_i,  for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute: approximating the actual correct results, skipping unnecessary documents
In actual data: dealing with zones and fields, query term proximity
Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
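The CosineScore pseudocode itself was a figure on the slide; below is a hedged Python rendering of the same term-at-a-time idea, with one accumulator per document and a heap for the top K (the data structures and toy inputs are my own assumptions):

import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_lengths, K=10):
    """query_weights: term -> w(t,q); postings: term -> list of (docID, w(t,d));
    doc_lengths: docID -> vector length used for normalization."""
    scores = defaultdict(float)                      # one accumulator per candidate doc
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_tq * w_td            # term-at-a-time accumulation
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]        # length-normalize
    return heapq.nlargest(K, scores.items(), key=lambda x: x[1])

postings = {"brutus": [(1, 1.2), (4, 0.5)], "caesar": [(1, 0.8), (2, 1.0)]}
print(cosine_scores({"brutus": 1.0, "caesar": 1.0}, postings, {1: 2.0, 2: 1.5, 4: 1.0}, K=2))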

CS3245 ndash Information Retrieval

Generic approach
Find a set A of contenders with K < |A| << N. A does not necessarily contain the top K, but it has many docs from among the top K. Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics
Index elimination: high-idf query terms only; docs containing many query terms
Champion lists: generalized to high-low lists and tiered indexes
Impact-ordered postings: early termination, idf-ordered terms
Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score
Consider a simple total score combining cosine relevance and authority:

net-score(q,d) = g(d) + cosine(q,d)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: "Year = 1601" is an example of a field. A field or parametric index has postings for each field value. Sometimes build range (B-tree) trees (e.g. for dates). A field query is typically treated as a conjunction (doc must be authored by shakespeare).
Zones: a zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References, … Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9: IR Evaluation
How do we know if our results are any good?
Benchmarks; Precision and Recall; composite measures; test collections; A/B testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: precision-recall curve; x-axis Recall (0.0–1.0), y-axis Precision (0.0–1.0)]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: an agreement measure among judges, designed for categorical judgments; corrects for chance agreement.

Kappa (K) = [P(A) − P(E)] / [1 − P(E)]

P(A) – proportion of the time the judges agree
P(E) – what agreement would be by chance
Gives 0 for chance agreement, 1 for total agreement.
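A small worked example in Python (the 2×2 judgment counts are made up):

def kappa(table):
    """table[i][j] = number of items judge 1 put in class i and judge 2 in class j."""
    total = sum(sum(row) for row in table)
    p_agree = sum(table[i][i] for i in range(len(table))) / total
    # chance agreement: both judges pick class i independently with their marginal rates
    p_chance = sum((sum(table[i]) / total) * (sum(row[i] for row in table) / total)
                   for i in range(len(table)))
    return (p_agree - p_chance) / (1 - p_chance)

#                judge 2: rel  non-rel
judgments = [[70, 10],       # judge 1: relevant
             [5, 15]]        # judge 1: non-relevant
print(round(kappa(judgments), 3))   # P(A)=0.85, P(E)=0.65 -> kappa = 0.571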

CS3245 ndash Information Retrieval

A/B testing
Purpose: test a single innovation. Prerequisite: you have a large system with many users.
Have most users use the old system; divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation.
Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on the first result.
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.

80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR

Chapter 10.3: XML IR
Basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR

Chapter 9: Query Refinement
1. Relevance Feedback (document level)
   Explicit RF – Rocchio (1971); when does it work?
   Variants: implicit and blind
2. Query Expansion (term level)
   Manual thesaurus
   Automatic thesaurus generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice:

q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j

D_r = set of known relevant doc vectors; D_nr = set of known irrelevant doc vectors
(different from C_r and C_nr, as we only get judgments from a few documents)
α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
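A numpy sketch of that update (the α, β, γ defaults and the toy vectors are assumptions; clipping negative weights to zero is a common practice, not part of the formula itself):

import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr), negatives clipped to 0."""
    q_m = alpha * q0
    if len(rel_docs):
        q_m = q_m + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q_m = q_m - gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q_m, 0.0)

q0 = np.array([1.0, 0.0, 0.0])
Dr = np.array([[0.5, 0.5, 0.0], [0.7, 0.3, 0.0]])
Dnr = np.array([[0.0, 0.0, 1.0]])
print(rocchio(q0, Dr, Dnr))   # boosts terms from relevant docs, damps terms from irrelevant ones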

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents.
In query expansion, users give additional input (good/bad search term) on words or phrases.

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the M × N term-document matrix and w_ij = (normalized) weight for (t_i, d_j).
For each t_i, pick terms with high values in C.

Sec 923

In NLTK Did you forget

84
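A tiny numpy illustration of C = A·Aᵀ on a made-up 3-term × 4-document matrix:

import numpy as np

A = np.array([[1.0, 0.0, 1.0, 0.0],     # term 0
              [1.0, 0.0, 1.0, 1.0],     # term 1: co-occurs heavily with term 0
              [0.0, 1.0, 0.0, 0.0]])    # term 2: appears alone
A = A / np.linalg.norm(A, axis=1, keepdims=True)    # normalize each term row
C = A @ A.T                                         # term-term similarity matrix
print(np.round(C, 2))
# for each term t_i, its most similar other term is the largest off-diagonal entry in row i
print(C[0].argsort()[::-1][1])                      # -> 1: term 1 is term 0's nearest neighbour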

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees
Aim: have each dimension of the vector space encode a word together with its position within the XML tree
How: map XML documents to lexicalized subtrees

85

[Figure: an XML Book document with Title "Microsoft" and Author "Bill Gates" is mapped to its lexicalized subtrees ("with words"), e.g. Microsoft, Bill, Gates, Author–Bill, Author–Gates, Title–Microsoft, Book–Title–Microsoft, …]

CS3245 ndash Information Retrieval

Context resemblance
A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:

CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|)  if c_q matches c_d,  0 otherwise

|c_q| and |c_d| are the number of nodes in the query path and document path, respectively.
c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes.

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>   e.g. author / "Bill"

Inverted index (dictionary → postings):
<c1, t>  e.g. author / "Bill"                    → <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t>  e.g. title / "Bill"                     → <d2, 0.25> <d3, 0.1> <d12, 0.9>
<c3, t>  e.g. book / author / firstname / "Bill" → <d3, 0.7> <d6, 0.8> <d9, 0.5>

Context resemblance to the query: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60
OK to ignore query vs. <c2, t> since CR = 0.0
All weights have been normalized.

If w_q = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each term is Context Resemblance × Query Term Weight × Document Term Weight)

This example is slightly different from the book.

CS3245 ndash Information Retrieval

INEX relevance assessments
The relevance-coverage combinations are quantized by a quantization function Q(rel, cov).
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of Q(rel(c), cov(c)) over the components c in A.

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11
1. Probabilistic Approach to Retrieval / Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)
Chapter 12
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): traditionally used with the PRP

Assumptions:
"Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g. document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – naïve assumption)

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV
We can rank documents using the log odds ratios for the terms in the query, c_t:

c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 − u_t).
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q, x_t = 1} c_t.
Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:

RSV_d = Σ_{t ∈ q} log(N/df_t) · [(k1 + 1) tf_td / (k1((1 − b) + b(L_d/L_ave)) + tf_td)] · [(k3 + 1) tf_tq / (k3 + tf_tq)]

tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values of k1 and k3 between 1.2 and 2, and b = 0.75.

Sec 1143

92
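A hedged Python sketch of that scoring formula (the parameter defaults and toy collection statistics are assumptions):

import math

def bm25_term(tf_td, tf_tq, df_t, N, L_d, L_ave, k1=1.5, b=0.75, k3=1.5):
    """Contribution of one term t to RSV(q, d) under Okapi BM25 with long-query weighting."""
    idf = math.log10(N / df_t)
    doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * (L_d / L_ave)) + tf_td)
    query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
    return idf * doc_part * query_part

def rsv(query_tf, doc_tf, df, N, L_d, L_ave):
    return sum(bm25_term(doc_tf.get(t, 0), tf_tq, df[t], N, L_d, L_ave)
               for t, tf_tq in query_tf.items() if t in df)

query_tf = {"caesar": 1, "rome": 2}
doc_tf = {"caesar": 3, "rome": 1, "senate": 5}
print(round(rsv(query_tf, doc_tf, {"caesar": 100, "rome": 500}, N=10_000, L_d=120, L_ave=100), 3))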

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case, you build an information retrieval scheme in the exact same way.
Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR
Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q):

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, so ignore it.
P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model: the user has a document in mind and generates the query from this document.

P(d|q) ∝ P(d) Π_{t ∈ q} ( λ P(t|M_d) + (1 − λ) P(t|M_c) )

The equation represents the probability that the document the user had in mind was in fact this one.

95

Sec 1222
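A small sketch of query-likelihood scoring with this mixture (λ and the toy counts are assumptions):

def mixture_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """P(q|d) under the mixture model: product over query terms of
    lam * P(t|Md) + (1 - lam) * P(t|Mc)."""
    p = 1.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len          # document language model M_d
        p_coll = coll_tf.get(t, 0) / coll_len       # collection language model M_c
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

coll_tf = {"revenue": 500, "down": 800, "quarterly": 300}
d1 = {"revenue": 2, "down": 1}
print(mixture_score(["revenue", "down"], d1, doc_len=50, coll_tf=coll_tf, coll_len=100_000))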

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling

Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
Duplicates: need to integrate duplicate detection
Scale: need to use a distributed approach
Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98

Sec 202
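Python's standard library already includes a parser for this format; a minimal sketch of the check-and-cache pattern (the host and user-agent are placeholders, and the commented call would do a real network fetch):

from urllib.robotparser import RobotFileParser

robots_cache = {}                                   # host -> parsed robots.txt, fetched once per site

def allowed(agent, url, host):
    if host not in robots_cache:
        rp = RobotFileParser(f"https://{host}/robots.txt")
        rp.read()                                   # fetch and parse the site's robots.txt
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(agent, url)

# e.g. allowed("MyCrawler", "https://example.com/temp/page.html", "example.com")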

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1 0 … 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a
Algorithm: multiply x by increasing powers of A until the product looks stable.
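A numpy rendering of exactly that loop (the 3×3 transition matrix, with teleportation already folded in, is made up for illustration):

import numpy as np

def pagerank(A, tol=1e-10):
    """Multiply x by increasing powers of A until the distribution stops changing."""
    x = np.zeros(A.shape[0])
    x[0] = 1.0                        # start with any distribution, say (1, 0, ..., 0)
    while True:
        x_next = x @ A                # one step of the random walk
        if np.abs(x_next - x).sum() < tol:
            return x_next             # steady state a: xA^k for large k
        x = x_next

# row-stochastic transition matrix (links + teleportation), toy values
A = np.array([[0.10, 0.80, 0.10],
              [0.45, 0.10, 0.45],
              [0.10, 0.80, 0.10]])
print(np.round(pagerank(A), 3))       # highest pagerank for the page the others mostly link to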

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query; rank them by their pagerank. Order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives
You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103  Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 58: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Distributed Indexing MapReduce Data flow

58

hellip

MasterParsers

a-b c-d y-zhellip

a-b c-d y-zhellip

a-b c-d y-zhellip

Inverters

Map phase Reduce phaseSegment files

a-b

c-d

y-z

Postings

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 59: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation

bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary

index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which

each posting is merged up to log T times

Sec 45

59

CS3245 ndash Information Retrieval

Logarithmic merge Idea maintain a series of indexes each twice as large

as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0

or merge with I0 (if I0 already exists) as Z1

Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2

hellip etc

Sec 45

60Information Retrieval

Loop

for

log lev

els

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

Page 61: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo

and Zipfrsquos laws

Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding

Postings compression Gap encoding

61

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 62: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4

Need to store term lengths (1 extra byte)

62

hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip

Save 9 bytes on 3 pointers

Lose 4 bytes onterm lengths

Sec 52

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105

Freq

Postings ptr

Term ptr

33

29

44

126

7

Page 63: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Front coding Sorted words commonly have long common prefix ndash

store differences only Used for last k-1 terms in a block of k8automata8automate9automatic10automation

Information Retrieval 63

rarr8automata1loze2lozic3lozion

Encodes automat Extra lengthbeyond automat

Begins to resemble general string compression

Sec 52

CS3245 ndash Information Retrieval

Postings Compression: Postings file entry. We store the list of docs containing a term in increasing order of docID:

computer: 33, 47, 154, 159, 202, …

Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …

Hope: most gaps can be encoded/stored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding Example

docIDs:   824                  829        215406
gaps:                          5          214577
VB code:  00000110 10111000    10000101   00001101 00001100 10110001

65

Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001

Key property: VB-encoded postings are uniquely prefix-decodable

For a small gap (5), VB uses a whole byte

Sec 53

(The first entry encodes 824 itself: 00000110 10111000 → 6 × 128 + 56 = 512+256+32+16+8 = 824)
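A minimal Python sketch of gap encoding plus variable byte encoding that reproduces the byte stream in this example; the function names are mine, not from the slides:

def vb_encode_number(n):
    # 7 payload bits per byte; the final byte has its high bit set (stop bit).
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128
    return bytes(out)

def vb_encode_postings(doc_ids):
    # Gap-encode a sorted postings list, then VB-encode each gap.
    encoded, prev = bytearray(), 0
    for doc_id in doc_ids:
        encoded += vb_encode_number(doc_id - prev)
        prev = doc_id
    return bytes(encoded)

def vb_decode_postings(data):
    # Decode the byte stream back into docIDs (prefix-decodable, so no separators needed).
    doc_ids, n, prev = [], 0, 0
    for b in data:
        if b < 128:
            n = 128 * n + b
        else:
            n = 128 * n + (b - 128)
            prev += n
            doc_ids.append(prev)
            n = 0
    return doc_ids

encoded = vb_encode_postings([824, 829, 215406])
print(" ".join(f"{b:08b}" for b in encoded))
# 00000110 10111000 10000101 00001101 00001100 10110001  (as on the slide)
print(vb_decode_postings(encoded))  # [824, 829, 215406]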

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g., K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:

w_{t,d} = (1 + log₁₀ tf_{t,d}) × log₁₀(N / df_t)

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf × idf

Increases with the number of occurrences within a document (tf component)

Increases with the rarity of the term in the collection (idf component) 67

Sec 622
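A one-function sketch of the weight formula above; the example numbers are made up:

import math

def tf_idf_weight(tf_td, df_t, N):
    # w_{t,d} = (1 + log10 tf) * log10(N / df); 0 when the term is absent.
    if tf_td == 0 or df_t == 0:
        return 0.0
    return (1 + math.log10(tf_td)) * math.log10(N / df_t)

# e.g. a term occurring 3 times in a doc, appearing in 100 of N = 1,000,000 docs:
print(round(tf_idf_weight(3, 100, 1_000_000), 3))  # 5.908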

CS3245 ndash Information Retrieval

Queries as vectors. Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents". Key idea 2: Rank documents according to their proximity to the query in this space.

proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity

Motivation: Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d = Σ_i q_i d_i,  for length-normalized q and d

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems. Making the Vector Space Model more efficient to compute: approximating the actual correct results; skipping unnecessary documents. In actual data: dealing with zones and fields, query term proximity.

Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
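The CosineScore pseudocode on this recap slide is an image in the transcript; below is a minimal Python sketch of the same term-at-a-time scoring with accumulators. The input names (postings, doc_lengths) and the unit query weights are assumptions for illustration:

from collections import defaultdict
import heapq

def cosine_score(query_terms, postings, doc_lengths, K=10):
    # postings: term -> list of (docID, w_td); doc_lengths: docID -> vector length.
    scores = defaultdict(float)              # accumulators
    for term in query_terms:
        w_tq = 1.0                           # simplified query weight
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_tq * w_td
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]   # length-normalize
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])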

CS3245 ndash Information Retrieval

Generic approach: Find a set A of contenders, with K < |A| << N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.

Think of A as pruning non-contenders.

The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics: Index elimination (high-idf query terms only; docs containing many query terms)

Champion lists: generalized to high-low lists and tiered indexes

Impact-ordered postings: early termination, idf-ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(q,d) = g(d) + cosine(q,d)

Can use some other linear combination than an equal weighting

Indeed, any function of the two "signals" of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields: Year = 1601 is an example of a field. Field or parametric index: postings for each field value. Sometimes build range trees (e.g., B-trees for dates). A field query is typically treated as a conjunction (doc must be authored by Shakespeare).

Zones: A zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, … Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9: IR Evaluation. How do we know if our results are any good? Benchmarks; Precision and Recall; Composite measures; Test collections

A/B Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

[Figure: a precision-recall curve, with Precision on the y-axis and Recall on the x-axis, both ranging 0.0–1.0]

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: An agreement measure among judges. Designed for categorical judgments. Corrects for chance agreement.

Kappa (κ) = [P(A) − P(E)] / [1 − P(E)], where P(A) is the proportion of the time the judges agree and P(E) is the agreement we would expect by chance.

Gives 0 for chance agreement, 1 for total agreement.
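A small sketch of the computation, estimating P(E) from the pooled marginals as in the chapter; the judgment lists below are hypothetical:

def kappa_from_judgments(judge1, judge2):
    # Kappa for two judges over binary relevance judgments.
    n = len(judge1)
    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n
    # pooled probability of a "relevant" judgment across both judges
    p_rel = (sum(judge1) + sum(judge2)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

# Hypothetical judgments on 10 documents (1 = relevant, 0 = non-relevant):
j1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
j2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
print(round(kappa_from_judgments(j1, j2), 3))  # ≈ 0.583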

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users.

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness.

Probably the evaluation methodology that large search engines trust most. 80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 10.3 XML IR: Basic XML concepts; Challenges in XML IR; Vector space model for XML IR; Evaluation of XML IR

Chapter 9 Query Refinement:
1. Relevance Feedback (document level): Explicit RF – Rocchio (1971), when does it work; Variants: Implicit and Blind
2. Query Expansion (term level): Manual thesaurus; Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents. α, β, γ = weights (hand-chosen or set empirically).

Popularized in the SMART system (Salton)
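The modified-query formula on this slide is an image in the transcript; the standard Rocchio update it describes (cf. IIR eq. 9.3) is:

\vec{q}_m \;=\; \alpha\,\vec{q}_0 \;+\; \beta\,\frac{1}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j \;-\; \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j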

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus. Simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the term-document matrix and w_{ij} = (normalized) weight for (t_i, d_j).

For each t_i, pick terms with high values in C.

[Figure: the M × N term-document matrix A, with rows t_i and columns d_j]

Sec 923

In NLTK Did you forget

84
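A sketch with NumPy on a made-up term-document matrix; the terms and weights are toy values, not from the slides:

import numpy as np

terms = ["car", "automobile", "engine", "flower"]
A = np.array([                      # 4 terms x 5 documents of (normalized) weights
    [0.7, 0.0, 0.5, 0.0, 0.6],
    [0.6, 0.1, 0.4, 0.0, 0.5],
    [0.3, 0.0, 0.6, 0.1, 0.4],
    [0.0, 0.8, 0.0, 0.7, 0.0],
])

C = A @ A.T                         # term-term similarity matrix C = A A^T
i = terms.index("car")
ranked = sorted(((C[i, j], terms[j]) for j in range(len(terms)) if j != i), reverse=True)
print(ranked[0])                    # highest-scoring neighbour of "car" is "automobile" here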

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees. Aim to have each dimension of the vector space encode a word together with its position within the XML tree.

How? Map XML documents to lexicalized subtrees.

85

[Figure: an example XML document (Book with Title "Microsoft" and Author "Bill Gates") decomposed into lexicalized subtrees "with words", e.g. Author→Bill, Author→Gates, Title→Microsoft, Book→Title→Microsoft]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86
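The CR function itself appears only as an image here; written out (cf. IIR Section 10.3) it is:

\mathrm{CR}(c_q, c_d) \;=\; \begin{cases} \dfrac{1+|c_q|}{1+|c_d|} & \text{if } c_q \text{ matches } c_d \\[4pt] 0 & \text{otherwise} \end{cases}

For example, with |cq| = 2 and |cd| = 4 this gives 3/5 = 0.6, matching CR(c1, c3) = 0.60 in the next slide's example.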

CS3245 ndash Information Retrieval

SimNoMerge example

87

query: <c1, t>   e.g., author/"Bill"

Inverted index (dictionary → postings):
<c1, t>  (e.g., author/"Bill")                 → <d1, 0.5>, <d4, 0.1>, <d9, 0.2>
<c2, t>  (e.g., title/"Bill")                  → <d2, 0.25>, <d3, 0.1>, <d12, 0.9>
<c3, t>  (e.g., book/author/firstname/"Bill")  → <d3, 0.7>, <d6, 0.8>, <d9, 0.5>

This example is slightly different from the book.

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.60

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each product is Context Resemblance × Query Term Weight × Document Term Weight; all weights have been normalized)

OK to ignore Query vs <c2, t> since CR = 0.0; only Query vs <c1, t> and Query vs <c3, t> contribute.
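A small Python sketch of this SimNoMerge-style scoring; the data structures (dicts keyed by (context, term) pairs) are assumptions for illustration, and normalization is taken as already done, as the slide states:

def sim_no_merge(query, inverted_index, cr):
    # query: list of (context, term, weight); inverted_index: (context, term) -> [(docID, weight)];
    # cr: (query_context, doc_context) -> context resemblance.
    scores = {}
    for q_ctx, term, w_q in query:
        for (d_ctx, d_term), postings in inverted_index.items():
            if d_term != term or cr.get((q_ctx, d_ctx), 0.0) == 0.0:
                continue
            for doc_id, w_d in postings:
                scores[doc_id] = scores.get(doc_id, 0.0) + cr[(q_ctx, d_ctx)] * w_q * w_d
    return scores

index = {
    ("c1", "t"): [("d1", 0.5), ("d4", 0.1), ("d9", 0.2)],
    ("c2", "t"): [("d2", 0.25), ("d3", 0.1), ("d12", 0.9)],
    ("c3", "t"): [("d3", 0.7), ("d6", 0.8), ("d9", 0.5)],
}
cr = {("c1", "c1"): 1.0, ("c1", "c2"): 0.0, ("c1", "c3"): 0.6}
print(round(sim_no_merge([("c1", "t", 1.0)], index, cr)["d9"], 3))  # 0.5, as on the slide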

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR. Chapter 11: 1. Probabilistic Approach to Retrieval, Basic Probability Theory; 2. Probability Ranking Principle; 3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions. Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d is represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.

Independence: no association between terms (not true, but works in practice – a naïve assumption).

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing the RSV. We can rank documents using the log odds ratios c_t for the terms in the query.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t).

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d is the sum of the c_t of the query terms present in d. Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in the VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
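The c_t weight and the RSV referred to above are shown only as images in the transcript; written out (cf. IIR Section 11.3.1) they are:

c_t \;=\; \log\frac{p_t\,(1-u_t)}{u_t\,(1-p_t)} \;=\; \log\frac{p_t}{1-p_t} \;-\; \log\frac{u_t}{1-u_t}, \qquad \mathrm{RSV}_d \;=\; \sum_{t:\;x_t = q_t = 1} c_t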

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tf_tq: term frequency in the query q. k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single, fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, with b = 0.75.

Sec 1143

92
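The long-query BM25 formula this slide refers to is an image in the transcript; its standard form (cf. IIR eq. 11.34) is:

\mathrm{RSV}_d \;=\; \sum_{t \in q} \log\frac{N}{\mathrm{df}_t} \cdot \frac{(k_1+1)\,\mathrm{tf}_{t,d}}{k_1\bigl((1-b) + b\,\tfrac{L_d}{L_{\mathrm{ave}}}\bigr) + \mathrm{tf}_{t,d}} \cdot \frac{(k_3+1)\,\mathrm{tf}_{t,q}}{k_3 + \mathrm{tf}_{t,q}}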

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222
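The mixture-model equation summarized here is an image in the transcript; the Jelinek-Mercer mixture it refers to (cf. IIR eq. 12.12) is:

P(d \mid q) \;\propto\; P(d)\,\prod_{t \in q} \bigl(\lambda\,P(t \mid M_d) + (1-\lambda)\,P(t \mid M_c)\bigr)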

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202
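A crawler written in the course's Python toolchain can honor robots.txt with the standard library; a minimal sketch (the URL and user agent below are hypothetical):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                      # fetch and parse once, then cache it per site

if rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")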

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps, at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable.
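A minimal sketch of this power iteration in Python/NumPy; the 3-page transition matrix below is a made-up toy with teleporting already folded in (rows sum to 1):

import numpy as np

def pagerank_power_iteration(A, tol=1e-10, max_iter=1000):
    # Power iteration on a row-stochastic matrix A; returns the steady-state vector a.
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                     # start with any distribution, e.g. (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A             # one step of the random walk
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

A = np.array([
    [0.05, 0.90, 0.05],
    [0.45, 0.10, 0.45],
    [0.05, 0.90, 0.05],
])
print(pagerank_power_iteration(A).round(3))   # steady-state PageRank vector of the toy web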

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future: Python – one of the easiest and most straightforward programming languages to use; NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103 Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 64: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Postings CompressionPostings file entry We store the list of docs containing a term in

increasing order of docID computer 3347154159202 hellip

Consequence it suffices to store gaps 3314107543 hellip

Hope most gaps can be encodedstored with far fewer than 20 bits

64

Sec 53

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 65: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110

10111000 10000101 00001101

00001100 10110001

65

Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001

Key property VB-encoded postings areuniquely prefix-decodable

For a small gap (5) VB uses a whole byte

Sec 53

512+256+32+16+8 = 824

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 66: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Week 7 Vector space ranking

1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf

vector3 Compute the cosine similarity score for the query

vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user

66

CS3245 ndash Information Retrieval

tf-idf weighting

The tf-idf weight of a term is the product of its tfweight and its idf weight

Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf

Increases with the number of occurrences within a document

Increases with the rarity of the term in the collection67

)df(log)tflog1(w 10 tdt Ndt

times+=

Sec 622

CS3245 ndash Information Retrieval

Queries as vectors Key idea 1 Do the same for queries represent them

as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their

proximity to the query in this space

proximity = similarity of vectors asymp inverse of distance asymp cosine similarity

Motivation Want to rank more relevant documents higher than less relevant documents

68

Sec 63

CS3245 ndash Information Retrieval

For length-normalized vectors cosine similarity is simply the dot product (or scalar product)

for length normalized 119902 and 119889119889

Cosine for length-normalized vectors

Information Retrieval 69

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

A/B testing. Purpose: Test a single innovation. Prerequisite: You have a large system with many users

Have most users use the old system. Divert a small proportion of traffic (e.g., 1%) to the new

system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion

(OEC) like clickthrough on the first result. Now we can directly see if the innovation does improve user

happiness. Probably the evaluation methodology that large search

engines trust most
80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR

Chapter 10.3 XML IR: Basic XML concepts, Challenges in XML IR, Vector space model

for XML IR, Evaluation of XML IR

Chapter 9 Query Refinement: 1. Relevance Feedback

(Document Level)

Explicit RF: Rocchio (1971), When does it work?

Variants: Implicit and Blind

2. Query Expansion (Term Level)

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments

from a few documents. α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
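The Rocchio update formula itself appeared as an image on the slide; in its standard form, the modified query is alpha times the original query, plus beta times the centroid of the known relevant docs, minus gamma times the centroid of the known non-relevant docs, with negative weights clipped to 0. A Python sketch with typical, illustrative weights:

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr); drop negative weights
    terms = set(q0) | {t for d in rel_docs + nonrel_docs for t in d}
    qm = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in rel_docs) / max(len(rel_docs), 1)
        non = sum(d.get(t, 0.0) for d in nonrel_docs) / max(len(nonrel_docs), 1)
        w = alpha * q0.get(t, 0.0) + beta * rel - gamma * non
        if w > 0:
            qm[t] = w
    return qm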

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents

In query expansion, users give additional input (good/bad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: Simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the term-document matrix and w_ij = (normalized) weight for (t_i, d_j)

For each t_i, pick terms with high values in C

[Diagram: A is an M x N matrix, one row per term t_i, one column per document d_j]

Sec 923

In NLTK Did you forget

84
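A small numpy sketch of the C = A·Aᵀ computation (the toy matrix and term names are made up):

import numpy as np

# Toy term-document matrix A: M terms x N docs, rows already weighted and normalized
terms = ["car", "auto", "insurance", "best"]
A = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.0, 0.1],
              [0.7, 0.2, 0.1],
              [0.0, 0.9, 0.3]])

C = A @ A.T                      # term-term similarity matrix C = A * A^T
np.fill_diagonal(C, 0.0)         # ignore self-similarity
for i, t in enumerate(terms):
    j = int(np.argmax(C[i]))     # most similar term = thesaurus entry for t
    print(t, "->", terms[j])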

CS3245 ndash Information Retrieval

Main idea, lexicalized subtrees: Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How? Map XML documents to lexicalized subtrees

85

[Figure: an XML document (Book with Title "Microsoft" and Author "Bill Gates") is decomposed into its lexicalized subtrees, e.g. Author→Bill, Author→Gates, Book/Title→Microsoft; each subtree "with words" becomes one dimension of the vector space]

CS3245 ndash Information Retrieval

Context resemblance: A simple measure of the similarity of a path cq in a query and a path

cd in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path, respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86
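The CR formula itself appeared as an image; the standard definition from IIR Chapter 10, which is consistent with the CR(c1, c3) = 0.60 value in the next example when the word itself is counted as a path node, is: CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise.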

CS3245 ndash Information Retrieval

SimNoMerge example

87

Query: <c1, t>

Inverted index (dictionary → postings):
<c1, t> → <d1, 0.5>, <d4, 0.1>, <d9, 0.2>      (e.g. c1 = author/"Bill")
<c2, t> → <d2, 0.25>, <d3, 0.1>, <d12, 0.09>   (e.g. c2 = title/"Bill")
<c3, t> → <d3, 0.7>, <d6, 0.8>, <d9, 0.5>      (e.g. c3 = book/author/firstname/"Bill")

This example is slightly different from the book.

CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.60

If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each product = Context Resemblance × Query Term Weight × Document Term Weight)

All weights have been normalized.

OK to ignore Query vs <c2, t> since CR = 0.0; only Query vs <c1, t> and Query vs <c3, t> contribute.
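A hedged Python sketch of the SimNoMerge scoring loop that this example illustrates (the data-structure names are made up):

def sim_no_merge(query, index, cr):
    # query: list of (context, term, query weight); index: (context, term) -> list of (doc, weight)
    scores = {}
    for c_q, t, w_q in query:
        for (c_d, t_d), postings in index.items():
            if t_d != t:
                continue
            r = cr(c_q, c_d)          # context resemblance
            if r == 0.0:
                continue              # e.g. skip <c2, t> above
            for doc, w_d in postings:
                scores[doc] = scores.get(doc, 0.0) + r * w_q * w_d
    return scores

# With the index above and CR(c1,c1)=1.0, CR(c1,c3)=0.6,
# d9 accumulates 1.0*1.0*0.2 + 0.6*1.0*0.5 = 0.5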

CS3245 ndash Information Retrieval

INEX relevance assessments: The relevance-coverage combinations are quantized as follows [quantization table shown as an image on the slide]

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of the quantized relevance values of its members

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11: 1. Probabilistic Approach to Retrieval,

Basic Probability Theory; 2. Probability Ranking Principle; 3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): Traditionally used with the PRP

Assumptions: Binary (equivalent to Boolean): documents and

queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation

Independence: no association between terms (not true, but works in practice; a naïve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: We can rank documents using the log odds ratios c_t for the terms in the query

The odds ratio is the ratio of two odds:

(i) the odds of the term appearing if relevant, p_t / (1 - p_t),

and

(ii) the odds of the term appearing if nonrelevant, u_t / (1 - u_t),

so c_t = log [ p_t (1 - u_t) ] / [ u_t (1 - p_t) ]

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents

c_t functions as a term weight, so that RSV_d is the sum of c_t over query terms present in document d

Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
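A minimal sketch of BIM scoring under these definitions (the term names and probabilities are made up; in practice p_t and u_t come from relevance judgments or the usual smoothed estimates):

import math

def c_t(p_t, u_t):
    # log odds ratio: log[p_t(1-u_t)] - log[u_t(1-p_t)]
    return math.log(p_t * (1 - u_t)) - math.log(u_t * (1 - p_t))

def rsv(query_terms, doc_terms, p, u):
    # sum c_t over query terms that occur in the document (accumulator style)
    return sum(c_t(p[t], u[t]) for t in query_terms if t in doc_terms)

p = {"okapi": 0.6, "model": 0.3}   # P(term present | relevant), made up
u = {"okapi": 0.02, "model": 0.2}  # P(term present | non-relevant), made up
print(rsv({"okapi", "model"}, {"okapi", "bm25", "model"}, p, u))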

CS3245 ndash Information Retrieval

Okapi BM25, A Nonbinary Model: If the query is long, we might also use similar weighting for query

terms

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the

query. No length normalization of queries

(because retrieval is being done with respect to a single, fixed query). The above tuning parameters should be set by optimization on a

development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75

Sec 1143

92
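The BM25 scoring formula itself appeared as an image; as a reminder, the long-query variant multiplies, for each query term, an idf factor, a length-normalized document tf factor governed by k1 and b, and a query tf factor governed by k3. A Python sketch with illustrative parameter values:

import math

def bm25(query_tf, doc_tf, df, N, doc_len, avg_len, k1=1.2, k3=1.2, b=0.75):
    # RSV_d = sum over query terms t of
    #   idf_t * ((k1+1)*tf_td / (k1*((1-b) + b*doc_len/avg_len) + tf_td))
    #         * ((k3+1)*tf_tq / (k3 + tf_tq))
    score = 0.0
    for t, tf_tq in query_tf.items():
        tf_td = doc_tf.get(t, 0)
        if tf_td == 0 or t not in df:
            continue
        idf = math.log(N / df[t])
        doc_part = (k1 + 1) * tf_td / (k1 * ((1 - b) + b * doc_len / avg_len) + tf_td)
        query_part = (k3 + 1) * tf_tq / (k3 + tf_tq)
        score += idf * doc_part * query_part
    return score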

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and

'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in

the exact same way. Difference: for probabilistic IR, at the end you score

queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language

model. Given a query q, rank documents based on P(d|q) ∝ P(q|d) P(d)

P(q) is the same for all documents, so ignore it. P(d) is the prior, often treated as the same for all d

But we can give a prior to "high-quality" documents, e.g., those with a high static quality score g(d) (cf. Section 7.1.4)

P(q|d) is the probability of q given d. So to rank documents according to relevance to q,

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model: The user has a document in mind and generates the query from this document

The equation represents the probability that the document the user had in mind was, in fact, this one

95

Sec 1222
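The mixture-model equation referred to above is, in its standard linear-interpolation form, P(q|d) ≈ Π over query terms t of [ λ·P(t|Md) + (1−λ)·P(t|Mc) ], where Md is the document's language model and Mc the collection model. A small Python sketch with made-up counts and λ = 0.5:

import math

def query_likelihood(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    # log P(q|d) = sum over query terms of log(lambda*P(t|Md) + (1-lambda)*P(t|Mc))
    score = 0.0
    for t in query:
        p_doc = doc_tf.get(t, 0) / doc_len
        p_coll = coll_tf.get(t, 0) / coll_len
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

# Toy example: two documents, rank by P(q|d)
coll_tf, coll_len = {"frog": 5, "toad": 3, "said": 10}, 100
d1 = ({"frog": 3, "said": 2}, 10)
d2 = ({"toad": 1, "said": 1}, 10)
q = ["frog", "said"]
print(query_likelihood(q, *d1, coll_tf, coll_len),
      query_likelihood(q, *d2, coll_tf, coll_len))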

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling

Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling: Politeness: need to crawl only the allowed pages, and

space out the requests for a site over a longer period (hours, days)

Duplicates: need to integrate duplicate detection

Scale: need to use a distributed approach

Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access

to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling

Information Retrieval 98
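For reference, Python's standard library can fetch and evaluate robots.txt rules; a brief sketch (the site URL is illustrative):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                   # fetch and parse (cache one parser per site)
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))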

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps, at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable
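A small numpy sketch of this power iteration (the 2-state transition matrix is a made-up example, not from the slides):

import numpy as np

A = np.array([[0.1, 0.9],     # toy teleport-adjusted transition matrix (rows sum to 1)
              [0.3, 0.7]])
x = np.array([1.0, 0.0])      # start with any distribution

for _ in range(50):           # multiply by increasing powers of A until stable
    x_next = x @ A
    if np.allclose(x_next, x, atol=1e-10):
        break
    x = x_next

print(x)                      # steady-state vector a, the pagerank values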

CS3245 ndash Information Retrieval

100

Pagerank summary: Pre-processing: Given the graph of links, build matrix A. From it, compute a. The pagerank a_i is a scaled number between 0 and 1

Query processing: Retrieve pages meeting the query. Rank them by their pagerank. Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search

engine

In addition, you have picked up skills that are useful for your future: Python, one of the easiest and more straightforward

programming languages to use; NLTK, a good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR/VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
  • Slide Number 104
  • Slide Number 105
Page 70: Search Advertising, Duplicate Detection and Revision (zhaojin/cs3245_2019/w13.pdf)

CS3245 ndash Information Retrieval

Information Retrieval 70

Sec 64

tf-idf weighting has many variants
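The table of weighting variants on this slide was an image; as a reminder, one standard cell of it (logarithmic tf combined with idf) is:
$w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10} \frac{N}{\mathrm{df}_t}$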

CS3245 ndash Information Retrieval

Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute:
Approximating the actual correct results
Skipping unnecessary documents
In actual data: dealing with zones and fields, query term proximity
Resources for today: IIR 7, 6.1

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633
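The scoring algorithm on this slide was shown as a figure; a minimal term-at-a-time sketch of the same accumulator idea (illustrative names, not the slide's exact pseudocode):

import heapq
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_length, k=10):
    # query_weights: {term: w(t,q)}, postings: {term: [(doc_id, w(t,d)), ...]},
    # doc_length: {doc_id: vector length used for normalization}
    scores = defaultdict(float)
    for term, wtq in query_weights.items():
        for doc_id, wtd in postings.get(term, []):
            scores[doc_id] += wtq * wtd          # accumulate partial dot products
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]     # length-normalize
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# tiny example: two query terms, three documents
postings = {"cat": [(1, 0.5), (2, 0.2)], "dog": [(1, 0.3), (3, 0.9)]}
print(cosine_scores({"cat": 1.0, "dog": 1.0}, postings, {1: 1.0, 2: 1.0, 3: 1.0}, k=2))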

CS3245 ndash Information Retrieval

Generic approach: find a set A of contenders with K < |A| ≪ N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.

73

Sec 711


CS3245 ndash Information Retrieval

Heuristics:
Index elimination: high-idf query terms only; docs containing many query terms
Champion lists: generalized to high-low lists and tiered indexes
Impact-ordered postings: early termination, idf-ordered terms
Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score: consider a simple total score combining cosine relevance and authority:
net-score(q,d) = g(d) + cosine(q,d)
Can use some other linear combination than an equal weighting; indeed, any function of the two "signals" of user happiness.
Now we seek the top K docs by net score.

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices
Fields: "Year = 1601" is an example of a field. A field or parametric index has postings for each field value; sometimes build range (B-tree) indexes (e.g., for dates). A field query is typically treated as a conjunction (doc must be authored by Shakespeare).
Zone: a zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, … Build inverted indexes on zones as well to permit querying.

76

Sec 61

CS3245 ndash Information Retrieval

Week 9: IR Evaluation. How do we know if our results are any good? Benchmarks; Precision and Recall; Composite measures; Test collections; A/B Testing.

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve (figure: precision on the y-axis vs. recall on the x-axis, both from 0.0 to 1.0)

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure: agreement measure among judges, designed for categorical judgments; corrects for chance agreement.
Kappa (κ) = [P(A) − P(E)] / [1 − P(E)], where P(A) is the proportion of the time the judges agree and P(E) is what agreement would be by chance.
Gives 0 for chance agreement, 1 for total agreement.
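A quick worked example with hypothetical numbers: if two judges agree on 80% of the judgments and chance agreement would be 50%, then
$\kappa = \frac{P(A) - P(E)}{1 - P(E)} = \frac{0.8 - 0.5}{1 - 0.5} = 0.6$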

CS3245 ndash Information Retrieval

A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.
Have most users use the old system; divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result. Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.
80

Sec 863

CS3245 ndash Information Retrieval

Week 10: Relevance Feedback, Query Expansion and XML IR
Chapter 10.3: XML IR: basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR
Chapter 9: Query Refinement
1. Relevance Feedback (document level): Explicit RF – Rocchio (1971); when does it work?; variants: implicit and blind
2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971), in practice:
Dr = set of known relevant doc vectors; Dnr = set of known irrelevant doc vectors. Different from Cr and Cnr, as we only get judgments from a few documents.
α, β, γ = weights (hand-chosen or set empirically).
Popularized in the SMART system (Salton).
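The update formula itself appeared as an image on the slide; for reference, the standard Rocchio formula (IIR ch. 9) that these symbols plug into is:
$\vec{q}_m = \alpha \vec{q}_0 + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j \; - \; \frac{\gamma}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$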

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents.
In query expansion, users give additional input (good / bad search term) on words or phrases.

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence Thesaurus: the simplest way to compute one is based on term-term similarities in C = AA^T, where A is the M × N term-document matrix (M terms t_i, N documents d_j) and w_ij is the (normalized) weight for (t_i, d_j).
For each t_i, pick terms with high values in C.

Sec 923

In NLTK Did you forget

84
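A minimal sketch of this computation on a toy term-document matrix (numpy assumed; not the NLTK routine the slide alludes to):

import numpy as np

terms = ["cat", "feline", "dog"]
A = np.array([[0.9, 0.0, 0.7],   # rows = terms, columns = 3 documents (toy tf-idf weights)
              [0.8, 0.1, 0.6],
              [0.0, 0.9, 0.1]])
C = A @ A.T                      # term-term similarity matrix C = A A^T
np.fill_diagonal(C, 0.0)         # ignore a term's similarity to itself
for i, t in enumerate(terms):
    j = int(np.argmax(C[i]))     # most strongly co-occurring term for t_i
    print(t, "->", terms[j])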

CS3245 ndash Information Retrieval

Main idea: lexicalized subtrees. Aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.

85

(Figure: an XML document with structure Book, containing Title "Microsoft" and Author "Bill Gates", decomposed into its lexicalized subtrees "with words", e.g., Book/Title/"Microsoft", Book/Author/"Bill Gates", Author/"Bill", Author/"Gates", Title/"Microsoft", Book, …)

CS3245 ndash Information Retrieval

Context resemblance: a simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR.
|cq| and |cd| are the number of nodes in the query path and document path, respectively.
cq matches cd iff we can transform cq into cd by inserting additional nodes.
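The CR function referred to here (as given in IIR ch. 10) is:
$CR(c_q, c_d) = \begin{cases} \dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\ 0 & \text{otherwise} \end{cases}$
For example, with c_q = author/"Bill" (2 nodes) and c_d = book/author/firstname/"Bill" (4 nodes), CR = 3/5 = 0.6, which matches the value used in the SimNoMerge example on the next slide.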

86

CS3245 ndash Information Retrieval

SimNoMerge example (this example is slightly different from the book):
Query context: <c1, t>, e.g., author/"Bill". Inverted index dictionary contexts, each with a postings list of <doc, weight> entries: <c1, t> (e.g., author/"Bill"), <c2, t> (e.g., title/"Bill"), <c3, t> (e.g., book/author/firstname/"Bill"); the postings shown include <d1, 0.5>, <d4, 0.1>, <d9, 0.2>, <d2, 0.25>, <d3, 0.1>, <d12, 0.9>, <d3, 0.7>, <d6, 0.8>, <d9, 0.5>. All weights have been normalized.
Context resemblance: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.60. OK to ignore query vs <c2, t> since CR = 0.0; only query vs <c1, t> and query vs <c3, t> contribute.
If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5 (each term is context resemblance × query term weight × document term weight).

CS3245 ndash Information Retrieval

INEX relevance assessments: the relevance-coverage combinations are quantized as follows. This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval; the quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of the quantized grades Q over the components in A.

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR
Chapter 11: 1. Probabilistic Approach to Retrieval (basic probability theory); 2. Probability Ranking Principle; 3. Binary Independence Model, BestMatch25 (Okapi)
Chapter 12: 1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): traditionally used with the PRP.
Assumptions:
Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice – naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV: we can rank documents using the log odds ratios c_t for the terms in the query.
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t).
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that the RSV of a document is the sum of the c_t of the matching query terms. Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
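For reference, the log odds ratio term weight from IIR 11.3.1 that these statements describe is:
$c_t = \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)} = \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t}, \qquad RSV_d = \sum_{t \in q \cap d} c_t$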

CS3245 ndash Information Retrieval

Okapi BM25, a nonbinary model: if the query is long, we might also use similar weighting for query terms.
tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the query. No length normalization of queries (because retrieval is being done with respect to a single fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.
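The scoring formula itself was a figure on the slide; the long-query form of BM25 from IIR 11.4.3, with the parameters described above, is:
$RSV_d = \sum_{t \in q} \log\frac{N}{\mathrm{df}_t} \cdot \frac{(k_1 + 1)\,\mathrm{tf}_{td}}{k_1\big((1-b) + b \cdot (L_d / L_{ave})\big) + \mathrm{tf}_{td}} \cdot \frac{(k_3 + 1)\,\mathrm{tf}_{tq}}{k_3 + \mathrm{tf}_{tq}}$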

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: the difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. The difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q). By Bayes' rule, P(d|q) = P(q|d) P(d) / P(q).
P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g., those with high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model: Summary
What we model: the user has a document in mind and generates the query from this document. The equation represents the probability that the document that the user had in mind was in fact this one.
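The equation referred to is the query-likelihood mixture (Jelinek-Mercer smoothing, IIR 12.2.2), where M_d is the document's language model and M_c the collection model:
$P(d \mid q) \propto P(d) \prod_{t \in q} \big( \lambda P(t \mid M_d) + (1 - \lambda) P(t \mid M_c) \big)$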

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling; Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
Duplicates: need to integrate duplicate detection
Scale: need to use a distributed approach
Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robots.txt: protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
Important: cache the robots.txt file of each site we are crawling.
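A minimal politeness sketch using Python's standard urllib.robotparser (the site URL, user-agent string, and delay are placeholders):

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                   # fetch the rules once and keep them cached for this site

if rp.can_fetch("MyCrawler", "http://www.example.com/some/page.html"):
    # ... fetch and process the page here ...
    time.sleep(5)                           # space out successive requests to the same site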

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps, at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a
Algorithm: multiply x by increasing powers of A until the product looks stable.
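A minimal sketch of this power iteration on a hypothetical 3-page transition matrix A (rows already include teleportation, so each row sums to 1):

import numpy as np

A = np.array([[0.05, 0.90, 0.05],   # hypothetical teleporting random-walk matrix
              [0.45, 0.10, 0.45],
              [0.05, 0.90, 0.05]])
x = np.array([1.0, 0.0, 0.0])       # start with any distribution

for _ in range(100):                # multiply by increasing powers of A ...
    nxt = x @ A
    if np.allclose(nxt, x, atol=1e-12):
        break                       # ... until the product looks stable: steady state a
    x = nxt

print(x)                            # the PageRank vector a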

CS3245 ndash Information Retrieval

100

Pagerank summary
Pre-processing: given a graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query; rank them by their pagerank. Order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103. Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 71: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to

compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query

term proximity

Resources for today IIR 7 61

71

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 72: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Recap Computing cosine scores

72

Sec 633

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 73: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has

many docs from among the top K Return the top K docs in A

Think of A as pruningnon-contenders

The same approach can also used for other (non-cosine) scoringfunctions

73

Sec 711

NJ

A

K

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 – Information Retrieval

So it boils down to computing RSV: We can rank documents using the log odds ratios c_t for the terms in the query.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t):

c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d = Σ_{t : x_t = q_t = 1} c_t.

Operationally, we sum the c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec. 11.3.1

91
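A minimal sketch of this accumulator idea in Python, assuming the c_t weights have already been estimated (the terms, weights and postings below are hypothetical):

c = {"forest": 1.2, "fire": 0.8}                           # c_t: log odds ratio per query term
postings = {"forest": ["d1", "d3"], "fire": ["d2", "d3"]}  # term -> documents containing it

rsv = {}                                                   # one accumulator per document
for t in c:
    for d in postings[t]:
        rsv[d] = rsv.get(d, 0.0) + c[t]                    # sum c_t for query terms present in d

print(sorted(rsv.items(), key=lambda kv: -kv[1]))          # d3 (2.0) ranks above d1 (1.2) and d2 (0.8)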

CS3245 – Information Retrieval

Okapi BM25: A Nonbinary Model. If the query is long, we might also use similar weighting for the query terms:

tf_tq: term frequency in the query q; k3: tuning parameter controlling term frequency scaling of the query.

No length normalization of queries (because retrieval is being done with respect to a single, fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec. 11.4.3

92
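For reference, the full BM25 scoring function with this query-term weighting, as given in the IIR textbook (Sec. 11.4.3), is:
RSV_d = Σ_{t ∈ q} log(N / df_t) × [ (k1 + 1) tf_td / ( k1 ((1 − b) + b (L_d / L_ave)) + tf_td ) ] × [ (k3 + 1) tf_tq / ( k3 + tf_tq ) ],
where tf_td is the term frequency in document d, L_d and L_ave are the document length and the average document length, and N and df_t are the collection size and the document frequency of t.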

CS3245 – Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way.

Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec. 11.4.1

93

CS3245 – Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q).

P(q) is the same for all documents, so we can ignore it. P(d) is the prior – often treated as the same for all d, but we can give a higher prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).

P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec. 12.2
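Concretely, by Bayes' rule P(d|q) = P(q|d) P(d) / P(q); with P(q) constant across documents and a uniform prior P(d), ranking by P(d|q) is the same as ranking by P(q|d).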

CS3245 – Information Retrieval

Mixture model: Summary

What we model: The user has a document in mind and generates the query from this document.

The equation represents the probability that the document that the user had in mind was in fact this one.

95

Sec. 12.2.2
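The equation the slide refers to (IIR Sec. 12.2.2) is: P(d|q) ∝ P(d) Π_{t ∈ q} ( λ P(t|M_d) + (1 − λ) P(t|M_c) ), where M_d is the document's language model, M_c is the collection language model, and 0 < λ < 1 is the mixing weight.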

CS3245 – Information Retrieval

Week 12: Web Search

Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch. 20-21

CS3245 – Information Retrieval

Crawling:
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days).
Duplicates: need to integrate duplicate detection.
Scale: need to use a distributed approach.
Spam and spider traps: need to integrate spam detection.

Information Retrieval 97

Sec. 20.1

CS3245 – Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994.

Example:
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec. 20.2
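A small sketch of honouring such a file with Python's standard-library robots.txt parser (the rules and URLs below simply mirror the example above; a real crawler would fetch the file once and cache one parser per site):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /yoursite/temp/",
    "",
    "User-agent: searchengine",
    "Disallow:",
])  # in a real crawler: rp.set_url("http://site/robots.txt"); rp.read(); then cache rp per site

print(rp.can_fetch("*", "http://www.example.com/yoursite/temp/page.html"))             # False: disallowed
print(rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/page.html"))  # True: no restriction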

CS3245 – Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec. 21.2.2

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0)).
2. After one step, we're at xA.
3. After two steps, at xA^2, then xA^3, and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.
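A minimal sketch of this power iteration, assuming numpy and a small, hypothetical row-stochastic transition matrix A (teleporting already folded in):

import numpy as np

A = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.1, 0.5],
              [0.5, 0.4, 0.1]])      # rows sum to 1

x = np.array([1.0, 0.0, 0.0])        # start with any distribution
for _ in range(100):
    nxt = x @ A                      # one step of the chain
    if np.allclose(nxt, x, atol=1e-10):
        break                        # the product looks stable: steady state reached
    x = nxt

print(x)                             # approximates the pagerank vector a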

CS3245 – Information Retrieval

100

Pagerank summary
Pre-processing: Given the graph of links, build the matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: Retrieve pages meeting the query and rank them by their pagerank. The order is query-independent.

Sec. 21.2.2

CS3245 – Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 – Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use.
NLTK – a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 – Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR / VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 – Information Retrieval

104

Opportunities in IR

CS3245 – Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 74: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Heuristics Index Elimination High-idf query terms only Docs containing many query terms

Champion lists Generalized to high-low lists and tiered index

Impact ordering postings Early termination idf ordered terms

Cluster pruning

74

Sec 711

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 75: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Net score Consider a simple total score combining cosine

relevance and authority

net-score(qd) = g(d) + cosine(qd)

Can use some other linear combination than an equal weighting

Indeed any function of the two ldquosignalsrdquo of user happiness

Now we seek the top K docs by net score

75

Sec 714

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 76: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Parametric Indices

Fields Year = 1601 is an example of

a field Field or parametric index

postings for each field value Sometimes build range (B-

tree) trees (eg for dates)

Field query typically treated as conjunction (doc must be authored by

shakespeare)

Zone A zone is a region of the doc

that can contain an arbitrary amount of text eg Title Abstract References hellip

Build inverted indexes on zones as well to permit querying

76

Sec 61

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 77: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection

AB Testing

77

Ch 8

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 – Information Retrieval

Okapi BM25, A Nonbinary Model: If the query is long, we might also use similar weighting for query terms.

tftq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query)

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.

Sec 11.4.3

92
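A sketch of the scoring described above, using the common BM25 form with an idf factor of log(N/dft) and the extra k3 factor for query term frequency; all counts in the example call are invented, just to show the shape of the computation:

from math import log

def bm25(query_tf, doc_tf, doc_len, avg_doc_len, N, df, k1=1.2, b=0.75, k3=1.5):
    score = 0.0
    for t, tf_tq in query_tf.items():
        tf_td = doc_tf.get(t, 0)
        if tf_td == 0 or t not in df:
            continue
        idf = log(N / df[t])
        doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)   # only matters for long queries
        score += idf * doc_part * query_part
    return score

print(bm25(query_tf={"okapi": 1, "model": 2},
           doc_tf={"okapi": 3, "model": 1, "ranking": 4},
           doc_len=8, avg_doc_len=10, N=1000,
           df={"okapi": 50, "model": 400, "ranking": 300}))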

CS3245 – Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. Difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 11.4.1

93

CS3245 – Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q):

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, so ignore it.
P(d) is the prior – often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 12.2

CS3245 – Information Retrieval

Mixture model: Summary

P(d|q) ∝ P(d) · Π over query terms t of ( λ P(t|Md) + (1 − λ) P(t|Mc) )

What we model: The user has a document in mind and generates the query from this document. The equation represents the probability that the document that the user had in mind was in fact this one.

95

Sec 12.2.2
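A sketch of the equation above with a uniform prior P(d), λ = 0.5, and toy counts; all names and numbers here are illustrative:

def mixture_score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    # P(d|q) up to a document-independent constant, assuming a uniform prior P(d)
    p = 1.0
    for t in query:
        p_doc = doc_tf.get(t, 0) / doc_len        # P(t | Md), document model
        p_coll = coll_tf.get(t, 0) / coll_len     # P(t | Mc), collection model
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

doc_tf = {"click": 2, "shears": 1}      # toy document of length 10
coll_tf = {"click": 20, "shears": 2}    # toy collection of length 1000
print(mixture_score(["click", "shears"], doc_tf, 10, coll_tf, 1000))   # about 0.0056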

CS3245 – Information Retrieval

Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 – Information Retrieval

Crawling
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
Duplicates: need to integrate duplicate detection
Scale: need to use a distributed approach
Spam and spider traps: need to integrate spam detection

Information Retrieval 97

Sec 20.1

CS3245 – Information Retrieval

Robots.txt: Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 20.2
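A small sketch of honouring robots.txt with Python's standard library; the site URL and user-agent string are placeholders, and a real crawler would also cache the parsed file per host, as noted above:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # placeholder site
rp.read()                                          # fetch and parse the file once

if rp.can_fetch("searchengine", "http://www.example.com/yoursite/temp/"):
    print("allowed to crawl this URL")
else:
    print("disallowed by robots.txt")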

CS3245 – Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 21.2.2

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution x (say x = (1, 0, ..., 0))
2. After one step, we're at xA
3. After two steps, at xA^2, then xA^3, and so on
4. "Eventually" means: for large k, xA^k = a

Algorithm: multiply x by increasing powers of A until the product looks stable.
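A sketch of this power iteration on a tiny hand-made transition matrix A (the values are illustrative; each row sums to 1):

def power_iteration(A, tol=1e-10, max_iter=1000):
    n = len(A)
    x = [1.0] + [0.0] * (n - 1)                   # start at x = (1, 0, ..., 0)
    for _ in range(max_iter):
        nxt = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]  # x = xA
        if max(abs(nxt[j] - x[j]) for j in range(n)) < tol:              # looks stable
            return nxt
        x = nxt
    return x

A = [[0.10, 0.70, 0.20],
     [0.40, 0.10, 0.50],
     [0.45, 0.45, 0.10]]
print(power_iteration(A))   # the steady-state vector a, i.e. the pagerank scores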

CS3245 – Information Retrieval

100

Pagerank summary
Pre-processing: Given the graph of links, build matrix A. From it, compute a. The pagerank ai is a scaled number between 0 and 1.
Query processing: Retrieve pages meeting the query. Rank them by their pagerank. Order is query-independent.

Sec 21.2.2

CS3245 – Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 – Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and more straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102

CS3245 – Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 – Information Retrieval

104

Opportunities in IR

CS3245 – Information Retrieval

Thanks for joining us for the journey

105

Page 78: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

A precision-recall curve

00

02

04

06

08

10

00 02 04 06 08 10Recall

Pre

cisi

on

78

Sec 84

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 79: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Kappa measure for inter-judge (dis)agreement

Information Retrieval 79

Sec 85

Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement

Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance

Gives 0 for chance agreement 1 for total agreement

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 80: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

AB testingPurpose Test a single innovationPrerequisite You have a large system with many users

Have most users use old system Divert a small proportion of traffic (eg 1) to the new

system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion

(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user

happiness Probably the evaluation methodology that large search

engines trust most80

Sec 863

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 81: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Week 10 Relevance Feedback Query Expansion and XML IR

Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model

for XML IR Evaluation of XML IR

Chapter 9 Query Refinement1 Relevance Feedback

Document Level

Explicit RF ndash Rocchio (1971) When does it work

Variants Implicit and Blind

2 Query ExpansionTerm Level

Manual thesaurus

Automatic Thesaurus Generation

81

CS3245 ndash Information Retrieval Sec 922

82

Rocchio (1971)

In practice

Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments

from a few documentsαβγ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105


CS3245 – Information Retrieval Sec. 9.2.2

82

Rocchio (1971)

In practice:
 D_r = set of known relevant doc vectors; D_nr = set of known irrelevant doc vectors (different from C_r and C_nr, as we only get judgments from a few documents)
 α, β, γ = weights (hand-chosen or set empirically)

Popularized in the SMART system (Salton)
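The modified query on this slide is the standard Rocchio update, q_m = α·q0 + β·(1/|D_r|)·Σ_{d in D_r} d - γ·(1/|D_nr|)·Σ_{d in D_nr} d. A minimal numpy sketch of that update; the default weight values and the toy vectors are illustrative, not from the slides.

    import numpy as np

    def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
        """Move the query toward the centroid of relevant docs and away from
        the centroid of non-relevant docs; the weights here are hand-chosen."""
        q = alpha * np.asarray(q0, dtype=float)
        if rel_docs:
            q = q + beta * np.mean(rel_docs, axis=0)
        if nonrel_docs:
            q = q - gamma * np.mean(nonrel_docs, axis=0)
        return np.maximum(q, 0.0)   # negative term weights are usually clipped to 0

    # Toy 4-term vocabulary:
    q0  = [1.0, 0.0, 1.0, 0.0]
    Dr  = [np.array([0.9, 0.1, 0.8, 0.0])]    # judged relevant
    Dnr = [np.array([0.0, 0.7, 0.0, 0.9])]    # judged non-relevant
    print(rocchio(q0, Dr, Dnr))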

CS3245 – Information Retrieval

Query Expansion
 In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents.
 In query expansion, users give additional input (good / bad search term) on words or phrases.

Sec. 9.2.2

83

CS3245 – Information Retrieval

Co-occurrence Thesaurus
 Simplest way to compute one is based on term-term similarities in C = A·A^T, where A is the M × N term-document matrix and w_ij = (normalized) weight for (t_i, d_j).
 For each t_i, pick the terms with high values in C (a small sketch follows after this slide).

(Figure: the term-document matrix A, with M rows t_i and N columns d_j.)

Sec. 9.2.3

In NLTK: did you forget?

84
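A small sketch of this computation; the toy term-document weights and the choice of raw (unnormalized) similarities are assumptions for the example.

    import numpy as np

    terms = ["car", "auto", "insurance", "best"]
    # A: M x N term-document weight matrix (toy values)
    A = np.array([[0.9, 0.1, 0.0],
                  [0.8, 0.0, 0.1],
                  [0.1, 0.9, 0.8],
                  [0.0, 0.2, 0.7]])

    C = A @ A.T                          # term-term similarity matrix C = A A^T

    def most_similar(term, k=2):
        i = terms.index(term)
        order = np.argsort(-C[i])        # highest similarity first
        return [terms[j] for j in order if j != i][:k]

    print(most_similar("car"))           # -> ['auto', 'insurance']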

CS3245 – Information Retrieval

Main idea: lexicalized subtrees
 Aim: have each dimension of the vector space encode a word together with its position within the XML tree.
 How: map XML documents to lexicalized subtrees.

85

(Figure: the XML tree Book → {Title "Microsoft", Author "Bill Gates"} is decomposed into lexicalized subtrees "with words", e.g. Author/"Bill", Author/"Gates", Author/"Bill Gates", Title/"Microsoft", Book/Title/"Microsoft", Book/Author/"Bill Gates".)
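As a simplified illustration (a sketch, not the book's exact indexing unit), the standard library's ElementTree can pair each node's root-to-node path with the words in that node's text, which is the flavour of "structural term" the lexicalized-subtree idea builds on.

    import xml.etree.ElementTree as ET

    doc = ET.fromstring("<Book><Title>Microsoft</Title><Author>Bill Gates</Author></Book>")

    def path_word_pairs(node, prefix=""):
        """Yield (path, word) pairs, e.g. ('Book/Author', 'Bill')."""
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        for word in (node.text or "").split():
            yield (path, word)
        for child in node:
            yield from path_word_pairs(child, path)

    print(list(path_word_pairs(doc)))
    # [('Book/Title', 'Microsoft'), ('Book/Author', 'Bill'), ('Book/Author', 'Gates')]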

CS3245 – Information Retrieval

Context resemblance
 A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:

  CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|)  if c_q matches c_d,  and 0 otherwise

 |c_q| and |c_d| are the number of nodes in the query path and document path, respectively.
 c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes.

86
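A small sketch of CR under these definitions, with paths written as lists of node labels (including the word node, matching the example on the next slide); the subsequence test implements "c_q matches c_d".

    def matches(cq, cd):
        """True iff cq can be turned into cd by inserting extra nodes."""
        it = iter(cd)
        return all(node in it for node in cq)   # subsequence test

    def cr(cq, cd):
        return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0

    print(cr(["author", "Bill"], ["author", "Bill"]))                        # 1.0
    print(cr(["author", "Bill"], ["book", "author", "firstname", "Bill"]))   # 0.6
    print(cr(["author", "Bill"], ["title", "Bill"]))                         # 0.0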

CS3245 – Information Retrieval

SimNoMerge example

87

(Figure: a query context <c1, t> is matched against an inverted index whose dictionary holds the contexts <c1, t>, <c2, t>, <c3, t>, e.g. c1 = author/"Bill", c2 = title/"Bill", c3 = book/author/firstname/"Bill". Postings, with normalized weights: c1 -> <d1, 0.5>, <d4, 0.1>, <d9, 0.2>; c2 -> <d2, 0.25>, <d3, 0.1>, <d12, 0.9>; c3 -> <d3, 0.7>, <d6, 0.8>, <d9, 0.5>. This example is slightly different from the book.)

 CR(c1, c1) = 1.0
 CR(c1, c2) = 0.0
 CR(c1, c3) = 0.60

 It is OK to ignore the query vs. <c2, t> pair, since CR = 0.0.
 If w_q = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
 (each product is Context Resemblance × Query Term Weight × Document Term Weight).
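A sketch of SimNoMerge scoring over the toy index above; the CR function from the previous sketch is repeated here so the block runs on its own, and the posting weights are the (assumed) normalized weights from the figure.

    def matches(cq, cd):
        it = iter(cd)
        return all(node in it for node in cq)

    def cr(cq, cd):
        return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0

    def sim_no_merge(query_contexts, index):
        """score(d) = sum over query contexts c_q and index contexts c of
        CR(c_q, c) * (query term weight) * (document term weight)."""
        scores = {}
        for cq, w_q in query_contexts:
            for c, postings in index.items():
                r = cr(list(cq), list(c))
                if r == 0.0:
                    continue                 # OK to ignore non-matching contexts
                for d, w_d in postings.items():
                    scores[d] = scores.get(d, 0.0) + r * w_q * w_d
        return scores

    index = {
        ("author", "Bill"): {"d1": 0.5, "d4": 0.1, "d9": 0.2},
        ("title", "Bill"): {"d2": 0.25, "d3": 0.1, "d12": 0.9},
        ("book", "author", "firstname", "Bill"): {"d3": 0.7, "d6": 0.8, "d9": 0.5},
    }
    print(sim_no_merge([(("author", "Bill"), 1.0)], index))
    # d9 scores 1.0*1.0*0.2 + 0.6*1.0*0.5 = 0.5, as on the slide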

CS3245 – Information Retrieval

INEX relevance assessments
 The relevance-coverage combinations are quantized as follows (quantization table given on the slide).
 This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of Q(rel(c), cov(c)) over the components c in A.

88

CS3245 – Information Retrieval

Week 11: Probabilistic IR
 Chapter 11
  1. Probabilistic Approach to Retrieval; Basic Probability Theory
  2. Probability Ranking Principle
  3. Binary Independence Model, BestMatch25 (Okapi)
 Chapter 12
  1. Language Models for IR

Ch. 11-12

89

CS3245 – Information Retrieval

90

Binary Independence Model (BIM)
 Traditionally used with the PRP.
 Assumptions:
  "Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors, e.g. document d represented by vector x = (x_1, ..., x_M), where x_t = 1 if term t occurs in d and x_t = 0 otherwise. Different documents may have the same vector representation.
  "Independence": no association between terms (not true, but works in practice – a naïve assumption).

Sec. 11.3

90

CS3245 – Information Retrieval

So it boils down to computing RSV
 We can rank documents using the log odds ratios c_t for the terms in the query:

  c_t = log( p_t (1 - u_t) / ( u_t (1 - p_t) ) )

 The odds ratio is the ratio of two odds:
  (i) the odds of the term appearing if relevant, p_t / (1 - p_t), and
  (ii) the odds of the term appearing if non-relevant, u_t / (1 - u_t).
 c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
 c_t functions as a term weight, so that RSV_d = Σ of c_t over the query terms present in d.
 Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec. 11.3.1

91
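A small sketch of these term weights using the usual +0.5 smoothed estimates of p_t and u_t (one common choice, not the only one); the counts below are illustrative.

    import math

    def c_t(s, S, df_t, N):
        """Log odds-ratio weight for one term:
        s = # relevant docs containing the term, S = # relevant docs,
        df_t = # docs containing the term,       N = # docs in the collection."""
        p_t = (s + 0.5) / (S + 1)               # P(term present | relevant)
        u_t = (df_t - s + 0.5) / (N - S + 1)    # P(term present | non-relevant)
        return math.log(p_t * (1 - u_t) / (u_t * (1 - p_t)))

    def rsv(doc_terms, query_terms, stats):
        """RSV_d = sum of c_t over query terms that occur in the document."""
        return sum(c_t(*stats[t]) for t in query_terms if t in doc_terms)

    stats = {"forest": (3, 5, 100, 10000), "fire": (4, 5, 300, 10000)}   # toy counts
    print(rsv({"forest", "fire", "smoke"}, ["forest", "fire"], stats))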

CS3245 – Information Retrieval

Okapi BM25: A Nonbinary Model
 If the query is long, we might also use similar weighting for query terms:
  tf_tq: term frequency in the query q
  k3: tuning parameter controlling term frequency scaling of the query
  No length normalization of queries (because retrieval is being done with respect to a single fixed query).
 The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, with b = 0.75.

Sec. 11.4.3

92
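For reference, a compact sketch of the document-side BM25 score, RSV_d = Σ_t idf_t · (k1 + 1)·tf_td / (k1·((1 - b) + b·L_d/L_ave) + tf_td); the long-query k3 factor is omitted, and the idf variant and toy numbers are assumptions.

    import math

    def bm25_score(query, doc_tf, doc_len, avg_len, df, N, k1=1.5, b=0.75):
        """doc_tf: term -> tf in this document; df: term -> document frequency;
        N: number of documents in the collection."""
        score = 0.0
        for t in query:
            if t not in doc_tf:
                continue
            idf = math.log(N / df[t])                       # a simple idf variant
            tf = doc_tf[t]
            norm = k1 * ((1 - b) + b * doc_len / avg_len)   # document length normalization
            score += idf * (k1 + 1) * tf / (norm + tf)
        return score

    print(bm25_score(["forest", "fire"], {"forest": 3, "fire": 1},
                     doc_len=120, avg_len=100,
                     df={"forest": 50, "fire": 80}, N=10000))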

CS3245 – Information Retrieval

An Appraisal of Probabilistic Models
 The difference between 'vector space' and 'probabilistic' IR is not that great: in either case, you build an information retrieval scheme in the exact same way.
 Difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec. 11.4.1

93

CS3245 – Information Retrieval

Using language models in IR
 Each document is treated as (the basis for) a language model.
 Given a query q, rank documents based on P(d|q); by Bayes' rule, P(d|q) = P(q|d)·P(d)/P(q):
  P(q) is the same for all documents, so ignore it.
  P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).
  P(q|d) is the probability of q given d.
 So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec. 12.2

CS3245 – Information Retrieval

Mixture model: Summary

 P(d|q) ∝ P(d) · Π over query terms t of ( λ·P(t|M_d) + (1 - λ)·P(t|M_c) )

 What we model: the user has a document in mind and generates the query from this document.
 The equation represents the probability that the document that the user had in mind was in fact this one.

95

Sec. 12.2.2
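A minimal sketch of scoring with this Jelinek-Mercer style mixture; λ and the toy counts are illustrative, and the example assumes every query term occurs somewhere in the collection (otherwise the log of zero would need handling).

    import math

    def mixture_score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
        """log P(q|d) with a mixture of the document model M_d and the collection model M_c."""
        logp = 0.0
        for t in query:
            p_doc = doc_tf.get(t, 0) / doc_len        # P(t | M_d), estimated from the document
            p_coll = coll_tf.get(t, 0) / coll_len     # P(t | M_c), estimated from the collection
            logp += math.log(lam * p_doc + (1 - lam) * p_coll)
        return logp

    doc_tf = {"frog": 3, "said": 1, "toad": 2}
    coll_tf = {"frog": 100, "said": 5000, "toad": 25, "likes": 300}
    print(mixture_score(["frog", "likes", "toad"], doc_tf, doc_len=6,
                        coll_tf=coll_tf, coll_len=100000))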

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 83: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Query Expansion

In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents

In query expansion users give additional input (goodbad search term) on words or phrases

Sec 922

83

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 84: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)

For each ti pick terms with high values in C

t i

dj N

M

Sec 923

In NLTK Did you forget

84

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 85: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Main idea lexicalized subtrees Aim to have each dimension of the vector space

encode a word together with its position within the XML tree

How Map XML documents to lexicalized subtrees

85

Book

Title Author

Bill GatesMicrosoft

Author

Bill Gates

Microsoft Bill Gates

Title

Microsoft

Author

Gates

Author

Bill

Book

Title

Microsoft

Book

ldquoWith wordsrdquo

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 86: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Context resemblance A simple measure of the similarity of a path cq in a query and a path

cq in a document is the following context resemblance function CR

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd iff we can transform cq into cd by inserting additional nodes

86

CS3245 ndash Information Retrieval

SimNoMerge example

87

ltc1tgt

query Inverted index

ltc1tgt

ltc2tgt

ltc3tgt

dictionary

ltd105gt

postings

ltd401gt ltd902gt

ltd2025gt ltd301gt ltd1209gt

ltd307gt ltd608gt ltd905gt

This example is slightly different from book

CR(c1c1) = 10

CR(c1c2) = 00

CR(c1c3) = 060

if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5

All weights have been normalized

eg authorldquoBillrdquo

eg authorldquoBillrdquo

eg titleldquoBillrdquo

eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight

OK to ignore Query vs ltc2 tgt since CR = 00

Query vs ltc1tgt Query vs ltc3tgt

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

SimNoMerge example

87

query: ⟨c1, t⟩    e.g. author/"Bill"

Inverted index (dictionary → postings):

⟨c1, t⟩ → ⟨d1, 0.5⟩ ⟨d4, 0.1⟩ ⟨d9, 0.2⟩     e.g. author/"Bill"
⟨c2, t⟩ → ⟨d2, 0.25⟩ ⟨d3, 0.1⟩ ⟨d12, 0.9⟩    e.g. title/"Bill"
⟨c3, t⟩ → ⟨d3, 0.7⟩ ⟨d6, 0.8⟩ ⟨d9, 0.5⟩     e.g. book/author/firstname/"Bill"

This example is slightly different from the book. All weights have been normalized.

Context resemblance of the query context against each index context:
CR(c1, c1) = 1.0    (Query vs ⟨c1, t⟩)
CR(c1, c2) = 0.0    (OK to ignore Query vs ⟨c2, t⟩ since CR = 0.0)
CR(c1, c3) = 0.60   (Query vs ⟨c3, t⟩)

Each matching posting contributes (query term weight × context resemblance × document term weight):
if wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
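
To make the accumulation above concrete, here is a minimal Python sketch (not the textbook code; the dictionaries, values and function name are illustrative) that reproduces the sim(q, d9) = 0.5 computation:

from collections import defaultdict

# Context resemblance of the query context c1 against each index context
CR = {"c1": 1.0, "c2": 0.0, "c3": 0.60}

# Postings: context -> list of (docID, normalized document term weight)
postings = {
    "c1": [("d1", 0.5), ("d4", 0.1), ("d9", 0.2)],
    "c2": [("d2", 0.25), ("d3", 0.1), ("d12", 0.9)],
    "c3": [("d3", 0.7), ("d6", 0.8), ("d9", 0.5)],
}

def sim_no_merge(query_weight=1.0):
    """Accumulate query weight x context resemblance x document weight per document."""
    scores = defaultdict(float)
    for context, cr in CR.items():
        if cr == 0.0:              # contexts with no resemblance can be skipped
            continue
        for doc, w in postings[context]:
            scores[doc] += query_weight * cr * w
    return scores

print(sim_no_merge()["d9"])        # 1.0*1.0*0.2 + 0.6*1.0*0.5 = 0.5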

CS3245 ndash Information Retrieval

INEX relevance assessments: The relevance-coverage combinations are quantized by a quantization function Q.

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval; the quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as: #(relevant items retrieved) = Σ c∈A Q(rel(c), cov(c)).

88

CS3245 ndash Information Retrieval

Week 11: Probabilistic IR

Chapter 11:
1. Probabilistic Approach to Retrieval / Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)

Chapter 12:
1. Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM): traditionally used with the PRP.

Assumptions:
"Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g., document d is represented by the vector x = (x1, ..., xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice – a naïve assumption).

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing the RSV: we can rank documents using the log odds ratios c_t for the terms in the query.

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if nonrelevant, u_t / (1 − u_t):

c_t = log [ p_t (1 − u_t) ] / [ u_t (1 − p_t) ]

c_t = 0 if a term has equal odds of appearing in relevant and nonrelevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d = Σ t∈q∩d c_t.

Operationally, we sum the c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91
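
As a rough illustration (not the lecture's code), the accumulator pattern could look like the Python sketch below; bim_term_weight uses the standard no-relevance-information estimate (p_t ≈ 0.5, u_t ≈ df_t / N), and the postings and names are made up for the example:

import math
from collections import defaultdict

def bim_term_weight(df_t, N):
    """c_t with add-0.5 smoothing when no relevance information is available:
    with p_t ~ 0.5 and u_t ~ df_t / N, c_t reduces to an idf-like weight."""
    return math.log((N - df_t + 0.5) / (df_t + 0.5))

def rsv_scores(query_terms, postings, N):
    """postings: term -> list of docIDs; returns docID -> RSV_d = sum of c_t."""
    scores = defaultdict(float)
    for t in query_terms:
        docs = postings.get(t, [])
        if not docs:
            continue
        c_t = bim_term_weight(len(docs), N)
        for d in docs:
            scores[d] += c_t
    return scores

postings = {"rain": [1, 3], "umbrella": [3]}
print(rsv_scores(["rain", "umbrella"], postings, N=5))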

CS3245 ndash Information Retrieval

Okapi BM25: A Nonbinary Model. If the query is long, we might also use similar weighting for query terms, multiplying in a query factor ((k3 + 1) tf_tq) / (k3 + tf_tq):

tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query

There is no length normalization of queries (because retrieval is being done with respect to a single, fixed query).

The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec 1143

92
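
A hedged Python sketch of the document-side BM25 score (the usual k1/b weighting; the term statistics below are invented for illustration):

import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N, k1=1.5, b=0.75):
    """doc_tf: term -> tf in this document; df: term -> document frequency."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        idf = math.log(N / df[t])                          # a simple idf variant
        length_norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
        score += idf * (k1 + 1) * tf / (length_norm + tf)
    return score

doc_tf = {"rain": 3, "umbrella": 1}
df = {"rain": 50, "umbrella": 10}
print(bm25_score(["rain", "umbrella"], doc_tf, doc_len=100,
                 avg_doc_len=120, df=df, N=1000))

For a long query, each term's contribution could additionally be multiplied by (k3 + 1) tf_tq / (k3 + tf_tq), as described on the slide.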

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models: The difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. The difference for probabilistic IR is that, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR: Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q):

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, so ignore it. P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4). P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model: the user has a document in mind and generates the query from this document. The equation

P(d|q) ∝ P(d) · Π t∈q [ λ P(t|M_d) + (1 − λ) P(t|M_c) ]

represents the probability that the document the user had in mind was in fact this one.

95

Sec 1222
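
A small Python sketch of query-likelihood scoring with this document/collection mixture (Jelinek-Mercer style smoothing); the counts and λ = 0.5 below are illustrative only:

import math

def mixture_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """Return log P(q|d) under the mixture model; lam weights the document LM."""
    log_prob = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len           # P(t | M_d), MLE
        p_coll = coll_tf.get(t, 0) / coll_len        # P(t | M_c), MLE
        p_mix = lam * p_doc + (1 - lam) * p_coll
        if p_mix == 0.0:                             # unseen everywhere: skip
            continue
        log_prob += math.log(p_mix)
    return log_prob

doc_tf = {"rain": 2, "umbrella": 1}
coll_tf = {"rain": 500, "umbrella": 80, "sun": 900}
print(mixture_score(["rain", "umbrella"], doc_tf, 10, coll_tf, 10000))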

CS3245 ndash Information Retrieval

Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling:
Politeness: need to crawl only the allowed pages, and to space out the requests for a site over a longer period (hours, days).
Duplicates: need to integrate duplicate detection.
Scale: need to use a distributed approach.
Spam and spider traps: need to integrate spam detection.

Information Retrieval 97

Sec 201
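
One way to picture the politeness requirement is a per-host delay in the fetch loop; this is only a toy sketch, and fetch() stands in for a real HTTP client:

import time
from urllib.parse import urlparse

MIN_DELAY = 2.0                 # seconds between requests to the same host
last_hit = {}                   # host -> time of the last request

def polite_fetch(url, fetch):
    """Wait until the host's minimum delay has elapsed, then fetch the URL."""
    host = urlparse(url).netloc
    wait = MIN_DELAY - (time.time() - last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)        # back off so requests to this host are spaced out
    last_hit[host] = time.time()
    return fetch(url)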

CS3245 ndash Information Retrieval

Robots.txt: A protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.

Information Retrieval 98

Sec 202
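
Python's standard library can apply these rules directly; a minimal example (the URL below is just a placeholder):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                   # fetch and parse the robots.txt file

print(rp.can_fetch("searchengine", "http://www.example.com/temp/page.html"))
print(rp.can_fetch("*", "http://www.example.com/index.html"))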

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0)).
2. After one step, we're at xA.
3. After two steps, we're at xA^2; then xA^3, and so on.
4. "Eventually" means: for large k, xA^k = a.

Algorithm: multiply x by increasing powers of A until the product looks stable.
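
The power iteration described above fits in a few lines of Python; the 3×3 matrix A below is a made-up link/teleport matrix whose rows sum to 1:

A = [
    [0.10, 0.70, 0.20],
    [0.30, 0.10, 0.60],
    [0.45, 0.45, 0.10],
]

def pagerank(A, tol=1e-10, max_iter=1000):
    n = len(A)
    x = [1.0] + [0.0] * (n - 1)            # start with any distribution
    for _ in range(max_iter):
        # one step of the random walk: x_next = x A
        x_next = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(x, x_next)) < tol:   # looks stable
            return x_next
        x = x_next
    return x

print(pagerank(A))   # the steady-state vector a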

CS3245 ndash Information Retrieval

100

Pagerank summary: Pre-processing: given the graph of links, build the matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.

Query processing: retrieve pages meeting the query and rank them by their pagerank. The order is query-independent.

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives: You have learnt how to build your own search engine.

In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR

102
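
For example, a tokenize-then-stem pipeline of the kind used in the course takes only a few lines with NLTK (assuming nltk and its 'punkt' tokenizer data are installed):

import nltk
from nltk.stem.porter import PorterStemmer

# nltk.download('punkt') may be needed once before word_tokenize can run
stemmer = PorterStemmer()
text = "Information retrieval systems index documents for searching."
tokens = nltk.word_tokenize(text.lower())
print([stemmer.stem(tok) for tok in tokens])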

CS3245 ndash Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 88: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

INEX relevance assessments The relevance-coverage combinations are quantized as follows

This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as

88

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 89: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval

Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)

Chapter 121 Language Models for IR

Ch 11-12

89

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 90: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

90

Binary Independence Model (BIM) Traditionally used with the PRP

Assumptions Binary (equivalent to Boolean) documents and

queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation

Independence no association between terms (not true but works in practice ndash naiumlve assumption)

rarr

Sec 113

90

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 91: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905

The odds ratio is the ratio of two odds

(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905

) and

(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905

)

119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents

119888119888119905119905 functions as a term weight so that

Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM

91

Deriving a Ranking Function for Query Terms

Information Retrieval

Sec 1131

91

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 92: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query

terms

tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the

query No length normalization of queries

(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a

development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075

Sec 1143

92

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 ndash Information Retrieval

100

Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1

Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent

Sec 2122

CS3245 ndash Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 ndash Information Retrieval

Learning Objectives You have learnt how to build your own search

engine

In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward

programming languages to use NLTK ndash A good set of routines and data that are useful in

dealing with NLP and IR

102

CS3245 ndash Information Retrieval

103Photo credits httpwwwflickrcomphotosaperezdc

Adversarial IR

Boolean IRVSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 ndash Information Retrieval

104

Opportunities in IR

CS3245 ndash Information Retrieval

Thanks for joining us for the journey

105

  • Slide Number 1
  • Last Time
  • Today
  • Web search overview
  • IR on the web vs IR in general
  • Without search engines the web wouldnrsquot work
  • Interest aggregation
  • SEARCHADVERTISING
  • 1st Generation of Search AdsGoto (1996)
  • 1st Generation of search adsGoto (1996)
  • 2nd generation of search ads Google (2000)
  • Do ads influence editorial content
  • How are ads ranked
  • How are ads ranked
  • Googlersquos second price auction
  • Googlersquos second price auction
  • Keywords with high bids
  • Search ads A win-win-win
  • Not a win-win-win Keyword arbitrage
  • Not a win-win-win Violation of trademarks
  • DUPLICATEDETECTION
  • Duplicate detection
  • Near-duplicates Example
  • Detecting near-duplicates
  • Recall Jaccard coefficient
  • Jaccard coefficient Example
  • A document as set of shingles
  • Fingerprinting
  • Documents as sketches
  • Slide Number 30
  • Proof that J(S(di)s(dj)) cong P(xi = xj)
  • Estimating Jaccard
  • Shingling Summary
  • Summary
  • EXAM MATTERS
  • Slide Number 36
  • Exam Format
  • Exam Topics
  • Exam Topic Distribution
  • REVISION
  • Week 1 Ngram Models
  • Unigram LM with Add 1 smoothing
  • Week 2 Basic (Boolean) IR
  • Indexer steps Dictionary amp Postings
  • The merge
  • Week 3 Postings lists and Choosing terms
  • Adding skip pointers to postings
  • Biword indexes
  • Positional index example
  • Choosing terms
  • Slide Number 51
  • Hash Table
  • Wildcard queries
  • Spelling correction
  • Week 5 Index construction
  • BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
  • SPIMI Single-pass in-memory indexing
  • Distributed Indexing MapReduce Data flow
  • Dynamic Indexing 2nd simplest approach
  • Logarithmic merge
  • Week 6 Index Compression
  • Index CompressionDictionary-as-a-String and Blocking
  • Front coding
  • Postings CompressionPostings file entry
  • Variable Byte Encoding Example
  • Week 7 Vector space ranking
  • tf-idf weighting
  • Queries as vectors
  • Cosine for length-normalized vectors
  • Slide Number 70
  • Week 8 Complete Search Systems
  • Recap Computing cosine scores
  • Generic approach
  • Heuristics
  • Net score
  • Parametric Indices
  • Week 9 IR Evaluation
  • A precision-recall curve
  • Kappa measure for inter-judge (dis)agreement
  • AB testing
  • Week 10 Relevance Feedback Query Expansion and XML IR
  • Rocchio (1971)
  • Query Expansion
  • Co-occurrence Thesaurus
  • Main idea lexicalized subtrees
  • Context resemblance
  • SimNoMerge example
  • INEX relevance assessments
  • Week 11 Probabilistic IR
  • Binary Independence Model (BIM)
  • Deriving a Ranking Function for Query Terms
  • Okapi BM25 A Nonbinary Model
  • An Appraisal of Probabilistic Models
  • Using language models in IR
  • Mixture model Summary
  • Week 12 Web Search
  • Crawling
  • Robotstxt
  • Pagerank algorithm
  • Pagerank summary
  • WHERE TO GO FROM HERE
  • Learning Objectives
  • Slide Number 103
  • Slide Number 104
  • Slide Number 105
Page 93: Search Advertising, Duplicate Detection and Revisionzhaojin/cs3245_2019/w13.pdfx k is the first non-zero entry in π(d k), i.e., first shingle present in document k π π π x i≠x

CS3245 ndash Information Retrieval

An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and

lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in

the exact same way Difference for probabilistic IR at the end you score

queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory

Sec 1141

93

CS3245 ndash Information Retrieval

Using language models in IR Each document is treated as (the basis for) a language

model Given a query q rank documents based on P(d|q)

P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d

But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)

P(q|d) is the probability of q given d So to rank documents according to relevance to q

ranking according to P(q|d) and P(d|q) is equivalent

94

Sec 122

CS3245 ndash Information Retrieval

Mixture model Summary

What we model The user has a document in mind and generates the query from this document

The equation represents the probability that the document that the user had in mind was in fact this one

95

Sec 1222

CS3245 ndash Information Retrieval

Week 12 Web SearchChapter 20 Crawling

Chapter 21 Link Analysis

96

Ch 20-21

CS3245 ndash Information Retrieval

Crawling Politeness need to crawl only the allowed pages and

space out the requests for a site over a longer period (hours days)

Duplicates need to integrate duplicate detection

Scale need to use a distributed approach

Spam and spider traps need to integrate spam detection

Information Retrieval 97

Sec 201

CS3245 ndash Information Retrieval

Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access

to a website originally from 1994 Example

User-agent Disallow yoursitetemp

User-agent searchengine Disallow

Important cache the robotstxt file of each site we are crawling

Information Retrieval 98

Sec 202

CS3245 ndash Information Retrieval

99

Pagerank algorithm

Information Retrieval

Sec 2122

Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a

Algorithm multiply x by increasing powers of A until the product looks stable

CS3245 – Information Retrieval

100

Pagerank summary
Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank aᵢ is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query, rank them by their pagerank. Order is query-independent.

Sec 21.2.2

CS3245 – Information Retrieval

101

WHERE TO GO FROM HERE

CS3245 – Information Retrieval

Learning Objectives
You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use.
NLTK – a good set of routines and data that are useful in dealing with NLP and IR.

102

CS3245 – Information Retrieval

103
Photo credits: http://www.flickr.com/photos/aperezdc

Adversarial IR

Boolean IR/VSM

Prob IR

Computational Advertising

Recommendation Systems

Digital Libraries

Natural Language Processing

Question Answering

IR Theory

Distributed IR

Geographic IR

Social Network IR

Query Analysis

CS3245 – Information Retrieval

104

Opportunities in IR

CS3245 – Information Retrieval

Thanks for joining us for the journey

105
