Post on 15-Jan-2016
Information Retrieval
Web Search
Retrieve docs that are “relevant” for the user query
Doc: Word or PDF file, web page, email, blog, e-book, ...
Query: the “bag of words” paradigm
Relevant ?!?
Goal of a Search Engine
Two main difficulties
The Web: Size: more than tens of billions of pages
Language and encodings: hundreds…
Distributed authorship: SPAM, format-less,…
Dynamic: in one year, only 35% of pages survive and just 20% are untouched
The User: Query composition: short (2.5 terms avg) and imprecise
Query results: 85% users look at just one result-page
Several needs: Informational, Navigational, Transactional
Extracting “significant data” is difficult !!
Matching “user needs” is difficult !!
Evolution of Search Engines
First generation -- use only on-page, web-text data
Word frequency and language
Second generation -- use off-page, web-graph data
Link (or connectivity) analysis
Anchor text (how people refer to a page)
Third generation -- answer “the need behind the query”
Focus on the “user need”, rather than on the query
Integrate multiple data sources
Click-through data
1995-1997 AltaVista, Excite, Lycos, etc
1998: Google
Fourth generation -- Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]
Google, Yahoo!, MSN, ASK, …
This is a search engine!!!
Two new approaches
Sponsored search: ads driven by the search keywords (and the user profile issuing them): AdWords
Context match: ads driven by the content of a web page (and the user profile reaching that page): AdSense
Information Retrieval
The structure of a Search Engine
[Architecture diagram: the Web is fetched by a Crawler, governed by a Control module, into a Page archive; a Page Analyzer feeds an Indexer that builds the text, structure, and auxiliary indexes; at query time a Query Resolver and a Ranker answer the user query]
Information Retrieval
The Web Graph
The Web’s Characteristics
Size: about 1 trillion pages available (Google, 7/2008); at 5-40KB per page, that is hundreds of terabytes. Size grows every day!!
Change: 8% new pages and 25% new links every week; average page lifetime is about 10 days.
The Bow Tie
[Bow-tie diagram: a large strongly connected core (SCC), with IN and OUT regions and tendrils, all within the weakly connected component (WCC)]
Some definitions
Weakly connected component (WCC): set of nodes such that from any node you can reach any other node via an undirected path.
Strongly connected component (SCC): set of nodes such that from any node you can reach any other node via a directed path.
On observing the Web graph
We do not know what percentage of it we know
The only way to discover the graph structure of the web as hypertext is via large scale crawls
Warning: the picture might be distorted by: size limitations of the crawl; crawling rules; perturbations of the “natural” process of birth and death of nodes and links
Why is it interesting?
The largest artifact ever conceived by humans
Exploit the structure of the Web for: crawl strategies, search, spam detection, discovering communities on the web, classification/organization
Predict the evolution of the Web; sociological understanding
Many other large graphs…
Internet graph: V = routers, E = communication links
“Cosine” graph (undirected, weighted): V = static web pages, E = tf-idf distance between pages
Query-log graph (bipartite, weighted): V = queries and URLs, E = (q,u) where u is a result for q that has been clicked by some user who issued q
Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (Facebook, address book, email, …)
Definition
Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN and no OUT links)
Three key properties. Skewed distribution: the probability that a node has x links is ≈ 1/x^2.1 (a power law)
The in-degree distribution
[Log-log plots of the AltaVista crawl (1999) and the WebBase crawl (2001): in-degree follows a power-law distribution, Pr[in-degree(u) = k] ≈ 1/k^2.1]
Definition
Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT links)
Three key properties:
Skewed distribution: the probability that a node has x links is ≈ 1/x^2.1
Locality: usually most of the hyperlinks (about 80%) point to other URLs on the same host
Similarity: pages close to each other in lexicographic order tend to share many outgoing links
A Picture of the Web Graph
[Plot of the link matrix of a crawl of 21 million pages and 150 million links, with URLs sorted lexicographically: links cluster into blocks around hosts such as Stanford and Berkeley]
Information Retrieval
Crawling
Spidering
24h a day, 7 days a week, “walking” over a graph
Recall that the Web graph is a directed graph G = (N, E), with the bow-tie structure seen earlier, where:
N changes (inserts, deletes): more than 50 × 10^9 nodes
E changes (inserts, deletes): more than 10 links per node
Crawling Issues
How to crawl? Quality: “best” pages first. Efficiency: avoid duplication (or near-duplication). Etiquette: robots.txt, server-load concerns (minimize load).
How much to crawl? How much to index? Coverage: how big is the Web, and how much do we cover? Relative coverage: how much do competitors have?
How often to crawl? Freshness: how much has changed?
How to parallelize the process?
Crawling picture
[Diagram: seed pages are expanded by the crawler into URLs crawled and parsed; newly extracted links populate the URLs frontier; everything beyond is the unseen Web]
Sec. 20.2
Updated crawling picture
[Same diagram, with multiple crawling threads consuming the URL frontier in parallel]
Sec. 20.1.1
Robots.txt
Protocol for giving spiders (“robots”) limited access to a website, originally from 1994: www.robotstxt.org/wc/norobots.html
The website announces its requests on what can(not) be crawled by placing a file of restrictions at URL/robots.txt
Sec. 20.2.1
Robots.txt example
No robot should visit any URL starting with "/yoursite/temp/", except the robot called “searchengine":
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
Sec. 20.2.1
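The example above can be checked with Python's standard-library robots.txt parser; the file contents below are the hypothetical example from the slide.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the slide above.
robots_txt = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Every robot except "searchengine" must avoid /yoursite/temp/:
generic_allowed = rp.can_fetch("*", "/yoursite/temp/page.html")           # False
special_allowed = rp.can_fetch("searchengine", "/yoursite/temp/page.html")  # True
```

An empty `Disallow:` line means "nothing is disallowed", which is why the named robot is exempt.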
Processing steps in crawling
Pick a URL from the frontier
Fetch the document at that URL
Parse the document and extract links from it to other docs (URLs)
For each extracted URL:
ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
check whether it is already in the frontier (duplicate URL elimination)
check whether the URL’s content has already been seen (duplicate content elimination)
Which one?
Sec. 20.2.1
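The processing steps above can be sketched as a frontier loop. `fetch`, `extract_links`, and `url_filter` are assumed helper callables (networking and HTML parsing are out of scope here), and Python's `hash` stands in for a proper content fingerprint.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, fetch, extract_links, url_filter, max_pages=1000):
    """Frontier loop for the crawling steps above (a sketch).
    fetch(url) -> doc, extract_links(doc) -> links, url_filter(url) -> bool."""
    frontier = deque(seeds)
    seen_urls = set(seeds)      # duplicate-URL elimination
    seen_content = set()        # duplicate-content elimination
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()             # pick a URL from the frontier
        doc = fetch(url)                     # fetch the document
        fp = hash(doc)                       # fingerprint its content
        if fp in seen_content:               # content already seen elsewhere?
            continue
        seen_content.add(fp)
        pages[url] = doc
        for link in extract_links(doc):      # parse and extract links
            link = urljoin(url, link)
            if url_filter(link) and link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return pages
```

A FIFO frontier gives the BFS order discussed below; swapping the deque for a priority queue gives the popularity-driven variants.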
Basic crawl architecture
[Diagram: the URL frontier feeds a fetch module (using DNS) that downloads pages from the WWW; pages are parsed; a content-seen test (against doc fingerprints) discards duplicate content; surviving links pass through a URL filter (with robots filters) and duplicate-URL elimination before entering the URL set and the frontier]
Page selection
Given a page P, define how “good” P is.
Several metrics: BFS, DFS, random; popularity-driven (PageRank, full vs. partial); topic-driven or focused crawling; combined.
BFS
“…BFS-order discovers the highest quality pages during the early stages of the crawl”
(328 million URLs in the testbed) [Najork 01]
Is this page new?
Check whether the file has been parsed or downloaded before. After 20 million pages crawled, we have “seen” over 200 million URLs; each URL is at least 100 bytes on average, so overall about 20GB of URLs.
Options: compress URLs in main memory, or use disk: Bloom filter (Archive); disk access with caching (Mercator, AltaVista).
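A minimal sketch of the Bloom-filter option for the URL-seen test; the sizing constants below are illustrative, not the ones used by the Archive crawler.

```python
import hashlib

class BloomFilter:
    """Compact set membership for seen URLs: no false negatives,
    false positives with small probability (a sketch)."""
    def __init__(self, m_bits=1 << 20, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, url):
        # k bit positions derived from k salted hashes of the URL
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(url))
```

At roughly 10 bits per URL, the 200 million URLs above would fit in about 250MB of memory, versus 20GB for the raw strings.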
Parallel Crawlers
The Web is too big to be crawled by a single crawler; work should be divided, avoiding duplication.
Dynamic assignment: a central coordinator dynamically assigns URLs to crawlers; extracted links are sent back to the central coordinator.
Static assignment: the Web is statically partitioned and assigned to crawlers; each crawler crawls only its part of the Web.
Two problems
Static assignment with hashing: let D be the number of downloaders; hash(URL) maps a URL to {0, ..., D-1}; downloader x fetches the URLs U such that hash(U) = x.
Load balancing the #URLs assigned to each downloader: static schemes based on hosts may fail (www.geocities.com/… vs. www.di.unipi.it/…), and dynamic “relocation” schemes may be complicated.
Managing fault tolerance: what about the death of a downloader? D-1 downloaders, new hash!!! What about new downloaders? D+1 downloaders, new hash!!!
A nice technique: Consistent Hashing
A tool for: spidering, web caches, P2P, routers, load balancing, distributed file systems.
Items and servers are mapped to the unit circle; item K is assigned to the first server N such that ID(N) ≥ ID(K).
What if a downloader goes down? What if a new downloader appears? Each server gets replicated log S times, and:
[monotone] adding a new server moves items only from some old servers to the new one
[balance] the probability that an item goes to a given server is ≤ const/S
[load] any server gets ≤ (I/S) log S items w.h.p.
[scale] you can replicate a server more times if needed
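A sketch of the ring just described, using 32-bit hash values to stand in for the unit circle; `ConsistentHash` and its replica count are illustrative names, not a real library API.

```python
import bisect
import hashlib

def _point(name):
    """Map a name or URL to a point on the circle (an int in 0..2^32)."""
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")

class ConsistentHash:
    """Each downloader is replicated a few times on the circle; a URL is
    assigned to the first replica at or after its own point (wrapping)."""
    def __init__(self, servers, replicas=16):
        self.ring = sorted((_point(f"{s}#{r}"), s)
                           for s in servers for r in range(replicas))
        self._points = [p for p, _ in self.ring]

    def assign(self, url):
        i = bisect.bisect_left(self._points, _point(url)) % len(self.ring)
        return self.ring[i][1]
```

The monotone property is visible directly: adding a fourth downloader only inserts new replica points, so a URL either keeps its old downloader or moves to the new one.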
Examples: Open Source
Nutch, also used by WikiSearch: http://www.nutch.org
Heritrix, used by Archive.org: http://archive-crawler.sourceforge.net/index.html
Consistent hashing: Amazon’s Dynamo
Connectivity Server
Support for fast queries on the web graph: which URLs point to a given URL? Which URLs does a given URL point to?
Stores mappings in memory from URL to outlinks and from URL to inlinks
Sec. 20.4
Currently the best
WebGraph: a set of algorithms and a Java implementation
Fundamental goal: maintain node adjacency lists in memory; for this, compressing the adjacency lists is the critical component
Sec. 20.4
Adjacency lists
The set of neighbors of a node
Assume each URL is represented by an integer: with 4 billion pages, that is 32 bits per node and 64 bits per hyperlink
Sec. 20.4
Adjacency list compression
Properties exploited in compression: similarity (between lists); locality (many links from a page go to “lexicographically nearby” pages). Use gap encodings in sorted lists and exploit the distribution of gap values.
Sec. 20.4
The result: about 3 bits/link
Main ideas
Consider lexicographically ordered list of all URLs, e.g., www.stanford.edu/alchemy www.stanford.edu/biology www.stanford.edu/biology/plant www.stanford.edu/biology/plant/copyright www.stanford.edu/biology/plant/people www.stanford.edu/chemistry
Sec. 20.4
Copy lists: each of these URLs has an adjacency list. Main idea: due to templates, the adjacency list of a node is often similar to that of one of the 7 preceding URLs in the lexicographic ordering, so we express an adjacency list in terms of one of these.
E.g., consider these adjacency lists:
1, 2, 4, 8, 16, 32, 64
1, 4, 9, 16, 25, 36, 49, 64
1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
1, 4, 8, 16, 25, 36, 49, 64
The last list is encoded as: reference (-2) (i.e., the second list), copy bits 11011111 (copy every element of the reference except 9), and add the extra node 8.
Why 7?
Sec. 20.4
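The copy-list encoding of the example above can be sketched as follows; the helper names are illustrative and the real WebGraph format is considerably richer (gaps, intervals, ζ codes).

```python
def copy_encode(target, preceding):
    """Encode `target` against the best of the preceding lists.
    preceding[0] is the closest list, preceding[1] the one before it, etc.
    Returns (back-offset, copy bits, extra nodes). A sketch."""
    tset = set(target)
    best = None
    for back, ref in enumerate(preceding, start=1):
        refset = set(ref)
        bits = [1 if x in tset else 0 for x in ref]    # which refs to copy
        extras = [x for x in target if x not in refset]  # nodes not in ref
        cost = len(bits) + len(extras)                 # crude cost model
        if best is None or cost < best[3]:
            best = (back, bits, extras, cost)
    return best[:3]

def copy_decode(back, bits, extras, preceding):
    ref = preceding[back - 1]
    copied = [x for x, b in zip(ref, bits) if b]
    return sorted(copied + extras)
```

On the slide's example the encoder picks back-offset 2, copy bits 11011111, and the single extra node 8, matching the encoding stated above.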
Extra nodes and binary arrays
Several tricks: use RLE over the binary arrays; use succinct encodings for the intervals created by the extra nodes; use special integer codes (ζ codes, good for integers drawn from a power law) for the remaining integers
Sec. 20.4
Main advantages
Adjacency queries can be answered very efficiently: to fetch the out-neighbors, trace back the chain of reference lists. This chain is typically short in practice (since similarity is mostly intra-host), and its length can also be explicitly limited during encoding.
Easy to implement as a one-pass algorithm
Sec. 20.4
Duplicate documents
The web is full of duplicated content
Strict duplicate detection = exact match: not that common
Many cases of near duplicates, e.g., the last-modified date is the only difference between two copies of a page
Sec. 19.6
Duplicate/Near-Duplicate Detection
Duplication: Exact match can be detected with fingerprints
Near-duplication: approximate match. Overview: compute syntactic similarity with an edit-distance-like measure, and use a similarity threshold to detect near-duplicates, e.g., similarity > 80% => documents are “near duplicates”
Sec. 19.6
Computing Similarity Approach:
Shingles (Word N-Grams) a rose is a rose is a rose → a_rose_is_a rose_is_a_rose is_a_rose_is
a_rose_is_a
Similarity Measure between two docs: Set of shingles + Set intersection
Sec. 19.6
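A minimal sketch of word-shingling and the set-based similarity measure; the helper names are illustrative.

```python
def shingles(text, q=4):
    """The word q-grams ("shingles") of a document, as a set."""
    words = text.split()
    return {" ".join(words[i:i + q]) for i in range(len(words) - q + 1)}

def jaccard(a, b):
    """Set-based similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)
```

On “a rose is a rose is a rose” this yields exactly the three distinct 4-shingles listed above (the repeated “a rose is a” collapses in the set).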
[Pipeline: Doc → shingling → multiset of shingles → fingerprinting → multiset of 64-bit fingerprints]
Efficient shingle management via fingerprints: use Karp-Rabin fingerprints, 64 bits each, so that Prob[collision] << 1
Similarity of Documents
Let S_A and S_B be the sets of (fingerprints of) shingles of DocA and DocB.
Jaccard measure: sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
Claim: A and B are near-duplicates if sim(S_A, S_B) is close to 1
Remarks
Multiplicities of q-grams: can be retained or ignored, trading efficiency for precision
Shingle size q ∈ [4 … 10]. Short shingles increase the similarity of unrelated documents: with q=1, sim(S_A,S_B) = 1 iff A is a permutation of B, so a larger q is needed to be sensitive to permutation changes. Long shingles: small random changes have a larger impact.
Similarity measure: similarity is non-transitive and non-metric, but the dissimilarity 1 - sim(S_A,S_B) is a metric
[Ukkonen 92] relates q-grams and edit distance
Example
A = “a rose is a rose is a rose”, B = “a rose is a flower which is a rose”
Preserving multiplicity:
q=1: sim(S_A,S_B) = 0.7, with S_A = {a, a, a, is, is, rose, rose, rose} and S_B = {a, a, a, is, is, rose, rose, flower, which}
q=2: sim(S_A,S_B) = 0.5; q=3: sim(S_A,S_B) = 0.3
Disregarding multiplicity: q=1: sim(S_A,S_B) = 0.6; q=2: sim(S_A,S_B) = 0.5; q=3: sim(S_A,S_B) = 0.4285
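The multiplicity-preserving numbers above can be verified with multisets (Python Counters); `multiset_shingles` and `multiset_jaccard` are illustrative helper names.

```python
from collections import Counter

def multiset_shingles(text, q):
    """Word q-grams kept with their multiplicities."""
    words = text.split()
    return Counter(tuple(words[i:i + q]) for i in range(len(words) - q + 1))

def multiset_jaccard(a, b):
    """Jaccard over multisets: sum of min counts over sum of max counts."""
    return sum((a & b).values()) / sum((a | b).values())

A = "a rose is a rose is a rose"
B = "a rose is a flower which is a rose"
```

For q=1 the intersection multiset {a:3, is:2, rose:2} has size 7 and the union {a:3, is:2, rose:3, flower:1, which:1} has size 10, giving the 0.7 on the slide.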
Efficiency: Sketches
Create a “sketch vector” (of size ~200) for each document; docs that share ≥ t (say 80%) of the elements in their sketches are deemed near-duplicates.
For doc D, sketch_D[i] is computed as follows: let f map all shingles in the universe to 0..2^m (e.g., f = fingerprinting), and let π_i be a random permutation on 0..2^m; then sketch_D[i] = min{π_i(f(s))} over all shingles s in D.
Sec. 19.6
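A sketch of the sketch-vector computation, using random linear maps modulo 2^64 as a cheap stand-in for the random permutations π_i (true min-wise independent permutations are more delicate to construct).

```python
import random

def minhash_sketch(shingle_fps, n=200, bits=64, seed=0):
    """Sketch vector for a set of shingle fingerprints: for each of n
    random bijections (standing in for the permutations pi_i), keep the
    minimum over the document's fingerprints."""
    rnd = random.Random(seed)
    mask = (1 << bits) - 1
    # x -> (a*x + b) mod 2^bits with odd a is a bijection on 0..2^bits-1
    maps = [(rnd.randrange(1, mask, 2), rnd.randrange(mask)) for _ in range(n)]
    return [min((a * fp + b) & mask for fp in shingle_fps) for a, b in maps]

def estimated_jaccard(sketch1, sketch2):
    # Fraction of coordinates where the two sketches agree.
    return sum(x == y for x, y in zip(sketch1, sketch2)) / len(sketch1)
```

Each coordinate matches with probability equal to the Jaccard similarity (the claim proved on the next slides), so the agreement fraction over 200 coordinates estimates sim(S_A, S_B).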
Computing Sketch[i] for Doc1
[Diagram: start with the 64-bit values f(shingle) of Document 1 on the number line 0..2^64, permute them with π_i, and pick the minimum value]
Sec. 19.6
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Diagram: the fingerprints of Document 1 and Document 2, permuted with the same π, on the number line 0..2^64; are min(π(f(A))) and min(π(f(B))) equal?]
Test for 200 random permutations: π_1, π_2, …, π_200
Sec. 19.6
Notice that…
[Diagram: the permuted fingerprints of Document 1 and Document 2 on the number line 0..2^64]
They are equal iff the shingle with the MIN value in the union of Doc1 and Doc2 lies in their intersection
Claim: This happens with probability Size_of_intersection / Size_of_union
Sec. 19.6
All signature pairs
This is an efficient method for estimating the similarity (Jaccard coefficient) for one pair of documents.
But we have to estimate N² similarities, where N is the number of web pages. Still slow!
One solution: locality-sensitive hashing (LSH). Another solution: sorting.
Sec. 19.6
Information Retrieval
Link-based Ranking (2nd generation)
Query-independent ordering
First generation: using link counts as simple measures of popularity.
Undirected popularity: each page gets a score equal to the number of its in-links plus the number of its out-links (e.g., 3+2=5).
Directed popularity: score of a page = number of its in-links (e.g., 3).
Both are easy to SPAM
Second generation: PageRank
Each link has its own importance!!
PageRank is independent of the query, and has many interpretations…
Basic Intuition…
What about nodes with no in/out links?
Google’s Pagerank
r = [ α T + (1-α) e e^T ] × r
where T_{i,j} = 1/#out(j) if j ∈ B(i), and 0 otherwise.
B(i): set of pages linking to i.
#out(j): number of outgoing links from j.
e: vector with all components equal to 1/sqrt{N}; the (1-α) e e^T term is the “random jump”.
r: the principal eigenvector, i.e., the PageRank vector.
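The eigenvector r can be computed by power iteration, sketched below; spreading the rank of dangling nodes uniformly is an assumed convention (one common fix for nodes with no out-links), not something the slide specifies.

```python
def pagerank(out_links, alpha=0.85, iters=50):
    """Power iteration for r = [alpha*T + (1-alpha)*e*e^T] r (a sketch).
    out_links: dict node -> list of nodes it points to (nodes are 0..n-1)."""
    n = len(out_links)
    r = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - alpha) / n] * n        # the random-jump term
        for j, outs in out_links.items():
            if outs:
                share = alpha * r[j] / len(outs)
                for i in outs:               # T[i][j] = 1/#out(j) if j -> i
                    new[i] += share
            else:                            # dangling node (no out-links)
                for i in range(n):
                    new[i] += alpha * r[j] / n
        r = new
    return r
```

Since each iteration preserves the total mass, r stays a probability distribution, matching the random-surfer reading on the next slide.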
Three different interpretations
Graph (intuitive interpretation): co-citation.
Matrix (easy for computation): eigenvector computation, or the solution of a linear system.
Markov chain (useful to prove convergence): a sort of usage simulation: from any node, a random surfer moves to one of its neighbors; “in the steady state” each page has a long-term visit rate, which is used as the page’s score.
Pagerank: use in Search Engines
Preprocessing: given the graph, build the matrix α T + (1-α) e e^T and compute its principal eigenvector r; r[i] is the PageRank of page i. We are interested in the relative order.
Query processing: retrieve the pages containing the query terms and rank them by their PageRank.
The final order is query-independent.
HITS: Hypertext Induced Topic Search
Calculating HITS
It is query-dependent and produces two scores per page:
Authority score: a good authority page for a topic is pointed to by many good hubs for that topic.
Hub score: a good hub page for a topic points to many authoritative pages for that topic.
Authority and Hub scores
[Diagram: hubs 2, 3 and 4 point to page 1, and page 1 points to authorities 5, 6 and 7]
a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)
HITS: Link Analysis Computation
h = A a
a = A^T h
where a: vector of authority scores; h: vector of hub scores; A: adjacency matrix, with a_{i,j} = 1 if i → j.
Substituting: h = A A^T h and a = A^T A a.
Thus h is an eigenvector of A A^T and a is an eigenvector of A^T A (both symmetric matrices).
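The coupled updates h = A a, a = A^T h can be sketched as a normalized power iteration (a minimal sketch, not a tuned implementation):

```python
import math

def hits(adj, iters=50):
    """Normalized power iteration of a = A^T h, h = A a (a sketch).
    adj: dict node -> list of nodes it points to (nodes are 0..n-1)."""
    n = len(adj)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iters):
        a_new = [0.0] * n
        for i, outs in adj.items():
            for j in outs:
                a_new[j] += h[i]             # a(j) += h(i) for each i -> j
        h_new = [sum(a_new[j] for j in adj[i]) for i in range(n)]
        na = math.sqrt(sum(x * x for x in a_new)) or 1.0
        nh = math.sqrt(sum(x * x for x in h_new)) or 1.0
        a = [x / na for x in a_new]          # normalize to avoid overflow
        h = [x / nh for x in h_new]
    return a, h
```

On the small graph from the earlier slide (hubs 2, 3, 4 pointing to page 1, which points to 5, 6, 7), page 1 ends up with the top authority score, as expected.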
Weighting links
Weight a link more if the query occurs in the neighborhood of the link (e.g., in the anchor text).
Unweighted: h(x) = Σ_{x→y} a(y), a(x) = Σ_{y→x} h(y)
Weighted: h(x) = Σ_{x→y} w(x,y) a(y), a(x) = Σ_{y→x} w(y,x) h(y)