Post on 15-Jan-2016
Information Retrieval
Web Search
Retrieve docs that are “relevant” for the user query
Doc: Word or PDF file, web page, email, blog, e-book, ...
Query: the “bag of words” paradigm
Relevant ?!?
Goal of a Search Engine
Two main difficulties
The Web: Size: more than tens of billions of pages
Language and encodings: hundreds…
Distributed authorship: SPAM, format-less,…
Dynamic: in one year, only 35% of pages survive and just 20% are untouched
The User: Query composition: short (2.5 terms avg) and imprecise
Query results: 85% users look at just one result-page
Several needs: Informational, Navigational, Transactional
Extracting “significant data” is difficult !!
Matching “user needs” is difficult !!
Evolution of Search Engines
First generation -- use only on-page, web-text data
Word frequency and language
Second generation -- use off-page, web-graph data
Link (or connectivity) analysis
Anchor text (how people refer to a page)
Third generation -- answer “the need behind the query”
Focus on the “user need”, rather than on the query
Integrate multiple data sources
Click-through data
1995-1997 AltaVista, Excite, Lycos, etc
1998: Google
Fourth generation -- Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]
Google, Yahoo!, MSN, ASK, …
This is a search engine!!!
Two new approaches
Sponsored search: ads driven by the search keywords (and the user profile issuing them): AdWords
Context match: ads driven by the content of a web page (and the user profile reaching that page): AdSense
Information Retrieval
The structure of a Search Engine
[Architecture diagram: the Web is fetched by a Crawler, governed by a Control module, into a Page archive; a Page Analyzer feeds an Indexer that builds the text, structure, and auxiliary indexes; at query time a Query Resolver and a Ranker answer the user query]
Information Retrieval
The Web Graph
The Web’s Characteristics
Size: about 1 trillion pages available (Google, 7/2008); at 5-40KB per page, that is hundreds of terabytes. Size grows every day!!
Change: 8% new pages and 25% new links every week; average page lifetime is about 10 days.
The Bow Tie
[Bow-tie diagram: a large strongly connected core (SCC), with IN and OUT regions and tendrils, all within the weakly connected component (WCC)]
Some definitions
Weakly connected component (WCC): set of nodes such that from any node you can reach any other node via an undirected path.
Strongly connected component (SCC): set of nodes such that from any node you can reach any other node via a directed path.
On observing the Web graph
We do not know what percentage of it we know
The only way to discover the graph structure of the web as hypertext is via large scale crawls
Warning: the picture might be distorted by: size limitations of the crawl; crawling rules; perturbations of the “natural” process of birth and death of nodes and links
Why is it interesting?
The largest artifact ever conceived by humans
Exploit the structure of the Web for: crawl strategies, search, spam detection, discovering communities on the web, classification/organization
Predict the evolution of the Web; sociological understanding
Many other large graphs…
Internet graph: V = routers, E = communication links
“Cosine” graph (undirected, weighted): V = static web pages, E = tf-idf distance between pages
Query-log graph (bipartite, weighted): V = queries and URLs, E = (q,u) where u is a result for q that has been clicked by some user who issued q
Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (Facebook, address book, email, …)
Definition
Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN and no OUT links)
Three key properties. Skewed distribution: the probability that a node has x links is ≈ 1/x^2.1 (a power law)
The in-degree distribution
[Log-log plots of the AltaVista crawl (1999) and the WebBase crawl (2001): in-degree follows a power-law distribution, Pr[in-degree(u) = k] ≈ 1/k^2.1]
Definition
Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT links)
Three key properties:
Skewed distribution: the probability that a node has x links is ≈ 1/x^2.1
Locality: usually most of the hyperlinks (about 80%) point to other URLs on the same host
Similarity: pages close to each other in lexicographic order tend to share many outgoing links
A Picture of the Web Graph
[Plot of the link matrix of a crawl of 21 million pages and 150 million links, with URLs sorted lexicographically: links cluster into blocks around hosts such as Stanford and Berkeley]
Information Retrieval
Crawling
Spidering
24h a day, 7 days a week, “walking” over a graph
Recall that the Web graph is a directed graph G = (N, E), with the bow-tie structure seen earlier, where:
N changes (inserts, deletes): more than 50 × 10^9 nodes
E changes (inserts, deletes): more than 10 links per node
Crawling Issues
How to crawl? Quality: “best” pages first. Efficiency: avoid duplication (or near-duplication). Etiquette: robots.txt, server-load concerns (minimize load).
How much to crawl? How much to index? Coverage: how big is the Web, and how much do we cover? Relative coverage: how much do competitors have?
How often to crawl? Freshness: how much has changed?
How to parallelize the process?
Crawling picture
[Diagram: seed pages are expanded by the crawler into URLs crawled and parsed; newly extracted links populate the URLs frontier; everything beyond is the unseen Web]
Sec. 20.2
Updated crawling picture
[Same diagram, with multiple crawling threads consuming the URL frontier in parallel]
Sec. 20.1.1
Robots.txt
Protocol for giving spiders (“robots”) limited access to a website, originally from 1994: www.robotstxt.org/wc/norobots.html
The website announces its requests on what can(not) be crawled by placing a file of restrictions at URL/robots.txt
Sec. 20.2.1
Robots.txt example
No robot should visit any URL starting with "/yoursite/temp/", except the robot called “searchengine":
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
Sec. 20.2.1
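The example above can be checked with Python's standard-library robots.txt parser; the file contents below are the hypothetical example from the slide.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the slide above.
robots_txt = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Every robot except "searchengine" must avoid /yoursite/temp/:
generic_allowed = rp.can_fetch("*", "/yoursite/temp/page.html")           # False
special_allowed = rp.can_fetch("searchengine", "/yoursite/temp/page.html")  # True
```

An empty `Disallow:` line means "nothing is disallowed", which is why the named robot is exempt.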
Processing steps in crawling
Pick a URL from the frontier
Fetch the document at that URL
Parse the document and extract links from it to other docs (URLs)
For each extracted URL:
ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
check whether it is already in the frontier (duplicate URL elimination)
check whether the URL’s content has already been seen (duplicate content elimination)
Which one?
Sec. 20.2.1
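The processing steps above can be sketched as a frontier loop. `fetch`, `extract_links`, and `url_filter` are assumed helper callables (networking and HTML parsing are out of scope here), and Python's `hash` stands in for a proper content fingerprint.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, fetch, extract_links, url_filter, max_pages=1000):
    """Frontier loop for the crawling steps above (a sketch).
    fetch(url) -> doc, extract_links(doc) -> links, url_filter(url) -> bool."""
    frontier = deque(seeds)
    seen_urls = set(seeds)      # duplicate-URL elimination
    seen_content = set()        # duplicate-content elimination
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()             # pick a URL from the frontier
        doc = fetch(url)                     # fetch the document
        fp = hash(doc)                       # fingerprint its content
        if fp in seen_content:               # content already seen elsewhere?
            continue
        seen_content.add(fp)
        pages[url] = doc
        for link in extract_links(doc):      # parse and extract links
            link = urljoin(url, link)
            if url_filter(link) and link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return pages
```

A FIFO frontier gives the BFS order discussed below; swapping the deque for a priority queue gives the popularity-driven variants.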
Basic crawl architecture
[Diagram: the URL frontier feeds a fetch module (using DNS) that downloads pages from the WWW; pages are parsed; a content-seen test (against doc fingerprints) discards duplicate content; surviving links pass through a URL filter (with robots filters) and duplicate-URL elimination before entering the URL set and the frontier]
Page selection
Given a page P, define how “good” P is.
Several metrics: BFS, DFS, random; popularity-driven (PageRank, full vs. partial); topic-driven or focused crawling; combined.
BFS
“…BFS-order discovers the highest quality pages during the early stages of the crawl”
(328 million URLs in the testbed) [Najork 01]
Is this page new?
Check whether the file has been parsed or downloaded before. After 20 million pages crawled, we have “seen” over 200 million URLs; each URL is at least 100 bytes on average, so overall about 20GB of URLs.
Options: compress URLs in main memory, or use disk: Bloom filter (Archive); disk access with caching (Mercator, AltaVista).
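A minimal sketch of the Bloom-filter option for the URL-seen test; the sizing constants below are illustrative, not the ones used by the Archive crawler.

```python
import hashlib

class BloomFilter:
    """Compact set membership for seen URLs: no false negatives,
    false positives with small probability (a sketch)."""
    def __init__(self, m_bits=1 << 20, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, url):
        # k bit positions derived from k salted hashes of the URL
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(url))
```

At roughly 10 bits per URL, the 200 million URLs above would fit in about 250MB of memory, versus 20GB for the raw strings.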
Parallel Crawlers
The Web is too big to be crawled by a single crawler; work should be divided, avoiding duplication.
Dynamic assignment: a central coordinator dynamically assigns URLs to crawlers; extracted links are sent back to the central coordinator.
Static assignment: the Web is statically partitioned and assigned to crawlers; each crawler crawls only its part of the Web.
Two problems
Static assignment with hashing: let D be the number of downloaders; hash(URL) maps a URL to {0, ..., D-1}; downloader x fetches the URLs U such that hash(U) = x.
Load balancing the #URLs assigned to each downloader: static schemes based on hosts may fail (www.geocities.com/… vs. www.di.unipi.it/…), and dynamic “relocation” schemes may be complicated.
Managing fault tolerance: what about the death of a downloader? D-1 downloaders, new hash!!! What about new downloaders? D+1 downloaders, new hash!!!
A nice technique: Consistent Hashing
A tool for: spidering, web caches, P2P, routers, load balancing, distributed file systems.
Items and servers are mapped to the unit circle; item K is assigned to the first server N such that ID(N) ≥ ID(K).
What if a downloader goes down? What if a new downloader appears? Each server gets replicated log S times, and:
[monotone] adding a new server moves items only from some old servers to the new one
[balance] the probability that an item goes to a given server is ≤ const/S
[load] any server gets ≤ (I/S) log S items w.h.p.
[scale] you can replicate a server more times if needed
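A sketch of the ring just described, using 32-bit hash values to stand in for the unit circle; `ConsistentHash` and its replica count are illustrative names, not a real library API.

```python
import bisect
import hashlib

def _point(name):
    """Map a name or URL to a point on the circle (an int in 0..2^32)."""
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")

class ConsistentHash:
    """Each downloader is replicated a few times on the circle; a URL is
    assigned to the first replica at or after its own point (wrapping)."""
    def __init__(self, servers, replicas=16):
        self.ring = sorted((_point(f"{s}#{r}"), s)
                           for s in servers for r in range(replicas))
        self._points = [p for p, _ in self.ring]

    def assign(self, url):
        i = bisect.bisect_left(self._points, _point(url)) % len(self.ring)
        return self.ring[i][1]
```

The monotone property is visible directly: adding a fourth downloader only inserts new replica points, so a URL either keeps its old downloader or moves to the new one.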
Examples: Open Source
Nutch, also used by WikiSearch: http://www.nutch.org
Heritrix, used by Archive.org: http://archive-crawler.sourceforge.net/index.html
Consistent hashing: Amazon’s Dynamo
Connectivity Server
Support for fast queries on the web graph: which URLs point to a given URL? Which URLs does a given URL point to?
Stores mappings in memory from URL to outlinks and from URL to inlinks
Sec. 20.4
Currently the best
WebGraph: a set of algorithms and a Java implementation
Fundamental goal: maintain node adjacency lists in memory; for this, compressing the adjacency lists is the critical component
Sec. 20.4
Adjacency lists
The set of neighbors of a node
Assume each URL is represented by an integer: with 4 billion pages, that is 32 bits per node and 64 bits per hyperlink
Sec. 20.4
Adjacency list compression
Properties exploited in compression: similarity (between lists); locality (many links from a page go to “lexicographically nearby” pages). Use gap encodings in sorted lists and exploit the distribution of gap values.
Sec. 20.4
The result: about 3 bits/link
Main ideas
Consider lexicographically ordered list of all URLs, e.g., www.stanford.edu/alchemy www.stanford.edu/biology www.stanford.edu/biology/plant www.stanford.edu/biology/plant/copyright www.stanford.edu/biology/plant/people www.stanford.edu/chemistry
Sec. 20.4
Copy lists: each of these URLs has an adjacency list. Main idea: due to templates, the adjacency list of a node is often similar to that of one of the 7 preceding URLs in the lexicographic ordering, so we express an adjacency list in terms of one of these.
E.g., consider these adjacency lists:
1, 2, 4, 8, 16, 32, 64
1, 4, 9, 16, 25, 36, 49, 64
1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
1, 4, 8, 16, 25, 36, 49, 64
The last list is encoded as: reference (-2) (i.e., the second list), copy bits 11011111 (copy every element of the reference except 9), and add the extra node 8.
Why 7?
Sec. 20.4
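The copy-list encoding of the example above can be sketched as follows; the helper names are illustrative and the real WebGraph format is considerably richer (gaps, intervals, ζ codes).

```python
def copy_encode(target, preceding):
    """Encode `target` against the best of the preceding lists.
    preceding[0] is the closest list, preceding[1] the one before it, etc.
    Returns (back-offset, copy bits, extra nodes). A sketch."""
    tset = set(target)
    best = None
    for back, ref in enumerate(preceding, start=1):
        refset = set(ref)
        bits = [1 if x in tset else 0 for x in ref]    # which refs to copy
        extras = [x for x in target if x not in refset]  # nodes not in ref
        cost = len(bits) + len(extras)                 # crude cost model
        if best is None or cost < best[3]:
            best = (back, bits, extras, cost)
    return best[:3]

def copy_decode(back, bits, extras, preceding):
    ref = preceding[back - 1]
    copied = [x for x, b in zip(ref, bits) if b]
    return sorted(copied + extras)
```

On the slide's example the encoder picks back-offset 2, copy bits 11011111, and the single extra node 8, matching the encoding stated above.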
Extra nodes and binary arrays
Several tricks: use RLE over the binary arrays; use succinct encodings for the intervals created by the extra nodes; use special integer codes (ζ codes, good for integers drawn from a power law) for the remaining integers
Sec. 20.4
Main advantages
Adjacency queries can be answered very efficiently: to fetch the out-neighbors, trace back the chain of reference lists. This chain is typically short in practice (since similarity is mostly intra-host), and its length can also be explicitly limited during encoding.
Easy to implement as a one-pass algorithm
Sec. 20.4
Duplicate documents
The web is full of duplicated content
Strict duplicate detection = exact match: not that common
Many cases of near duplicates, e.g., the last-modified date is the only difference between two copies of a page
Sec. 19.6
Duplicate/Near-Duplicate Detection
Duplication: Exact match can be detected with fingerprints
Near-duplication: approximate match. Overview: compute syntactic similarity with an edit-distance-like measure, and use a similarity threshold to detect near-duplicates, e.g., similarity > 80% => documents are “near duplicates”
Sec. 19.6
Computing Similarity Approach:
Shingles (Word N-Grams) a rose is a rose is a rose → a_rose_is_a rose_is_a_rose is_a_rose_is
a_rose_is_a
Similarity Measure between two docs: Set of shingles + Set intersection
Sec. 19.6
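A minimal sketch of word-shingling and the set-based similarity measure; the helper names are illustrative.

```python
def shingles(text, q=4):
    """The word q-grams ("shingles") of a document, as a set."""
    words = text.split()
    return {" ".join(words[i:i + q]) for i in range(len(words) - q + 1)}

def jaccard(a, b):
    """Set-based similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)
```

On “a rose is a rose is a rose” this yields exactly the three distinct 4-shingles listed above (the repeated “a rose is a” collapses in the set).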
[Pipeline: Doc → shingling → multiset of shingles → fingerprinting → multiset of 64-bit fingerprints]
Efficient shingle management via fingerprints: use Karp-Rabin fingerprints, 64 bits each, so that Prob[collision] << 1
Similarity of Documents
Let S_A and S_B be the sets of (fingerprints of) shingles of DocA and DocB.
Jaccard measure: sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
Claim: A and B are near-duplicates if sim(S_A, S_B) is close to 1
Remarks
Multiplicities of q-grams: can be retained or ignored, trading efficiency for precision
Shingle size q ∈ [4 … 10]. Short shingles increase the similarity of unrelated documents: with q=1, sim(S_A,S_B) = 1 iff A is a permutation of B, so a larger q is needed to be sensitive to permutation changes. Long shingles: small random changes have a larger impact.
Similarity measure: similarity is non-transitive and non-metric, but the dissimilarity 1 - sim(S_A,S_B) is a metric
[Ukkonen 92] relates q-grams and edit distance
Example
A = “a rose is a rose is a rose”, B = “a rose is a flower which is a rose”
Preserving multiplicity:
q=1: sim(S_A,S_B) = 0.7, with S_A = {a, a, a, is, is, rose, rose, rose} and S_B = {a, a, a, is, is, rose, rose, flower, which}
q=2: sim(S_A,S_B) = 0.5; q=3: sim(S_A,S_B) = 0.3
Disregarding multiplicity: q=1: sim(S_A,S_B) = 0.6; q=2: sim(S_A,S_B) = 0.5; q=3: sim(S_A,S_B) = 0.4285
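The multiplicity-preserving numbers above can be verified with multisets (Python Counters); `multiset_shingles` and `multiset_jaccard` are illustrative helper names.

```python
from collections import Counter

def multiset_shingles(text, q):
    """Word q-grams kept with their multiplicities."""
    words = text.split()
    return Counter(tuple(words[i:i + q]) for i in range(len(words) - q + 1))

def multiset_jaccard(a, b):
    """Jaccard over multisets: sum of min counts over sum of max counts."""
    return sum((a & b).values()) / sum((a | b).values())

A = "a rose is a rose is a rose"
B = "a rose is a flower which is a rose"
```

For q=1 the intersection multiset {a:3, is:2, rose:2} has size 7 and the union {a:3, is:2, rose:3, flower:1, which:1} has size 10, giving the 0.7 on the slide.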
Efficiency: Sketches
Create a “sketch vector” (of size ~200) for each document; docs that share ≥ t (say 80%) of the elements in their sketches are deemed near-duplicates.
For doc D, sketch_D[i] is computed as follows: let f map all shingles in the universe to 0..2^m (e.g., f = fingerprinting), and let π_i be a random permutation on 0..2^m; then sketch_D[i] = min{π_i(f(s))} over all shingles s in D.
Sec. 19.6
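A sketch of the sketch-vector computation, using random linear maps modulo 2^64 as a cheap stand-in for the random permutations π_i (true min-wise independent permutations are more delicate to construct).

```python
import random

def minhash_sketch(shingle_fps, n=200, bits=64, seed=0):
    """Sketch vector for a set of shingle fingerprints: for each of n
    random bijections (standing in for the permutations pi_i), keep the
    minimum over the document's fingerprints."""
    rnd = random.Random(seed)
    mask = (1 << bits) - 1
    # x -> (a*x + b) mod 2^bits with odd a is a bijection on 0..2^bits-1
    maps = [(rnd.randrange(1, mask, 2), rnd.randrange(mask)) for _ in range(n)]
    return [min((a * fp + b) & mask for fp in shingle_fps) for a, b in maps]

def estimated_jaccard(sketch1, sketch2):
    # Fraction of coordinates where the two sketches agree.
    return sum(x == y for x, y in zip(sketch1, sketch2)) / len(sketch1)
```

Each coordinate matches with probability equal to the Jaccard similarity (the claim proved on the next slides), so the agreement fraction over 200 coordinates estimates sim(S_A, S_B).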
Computing Sketch[i] for Doc1
[Diagram: start with the 64-bit values f(shingle) of Document 1 on the number line 0..2^64, permute them with π_i, and pick the minimum value]
Sec. 19.6
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Diagram: the fingerprints of Document 1 and Document 2, permuted with the same π, on the number line 0..2^64; are min(π(f(A))) and min(π(f(B))) equal?]
Test for 200 random permutations: π_1, π_2, …, π_200
Sec. 19.6
Notice that…
[Diagram: the permuted fingerprints of Document 1 and Document 2 on the number line 0..2^64]
They are equal iff the shingle with the MIN value in the union of Doc1 and Doc2 lies in their intersection
Claim: This happens with probability Size_of_intersection / Size_of_union
Sec. 19.6
All signature pairs
This is an efficient method for estimating the similarity (Jaccard coefficient) for one pair of documents.
But we have to estimate N² similarities, where N is the number of web pages. Still slow!
One solution: locality-sensitive hashing (LSH). Another solution: sorting.
Sec. 19.6
Information Retrieval
Link-based Ranking (2nd generation)
Query-independent ordering
First generation: using link counts as simple measures of popularity.
Undirected popularity: each page gets a score equal to the number of its in-links plus the number of its out-links (e.g., 3+2=5).
Directed popularity: score of a page = number of its in-links (e.g., 3).
Both are easy to SPAM
Second generation: PageRank
Each link has its own importance!!
PageRank is independent of the query, and has many interpretations…
Basic Intuition…
What about nodes with no in/out links?
Google’s Pagerank
r = [ α T + (1-α) e e^T ] × r
where T_{i,j} = 1/#out(j) if j ∈ B(i), and 0 otherwise.
B(i): set of pages linking to i.
#out(j): number of outgoing links from j.
e: vector with all components equal to 1/sqrt{N}; the (1-α) e e^T term is the “random jump”.
r: the principal eigenvector, i.e., the PageRank vector.
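The eigenvector r can be computed by power iteration, sketched below; spreading the rank of dangling nodes uniformly is an assumed convention (one common fix for nodes with no out-links), not something the slide specifies.

```python
def pagerank(out_links, alpha=0.85, iters=50):
    """Power iteration for r = [alpha*T + (1-alpha)*e*e^T] r (a sketch).
    out_links: dict node -> list of nodes it points to (nodes are 0..n-1)."""
    n = len(out_links)
    r = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - alpha) / n] * n        # the random-jump term
        for j, outs in out_links.items():
            if outs:
                share = alpha * r[j] / len(outs)
                for i in outs:               # T[i][j] = 1/#out(j) if j -> i
                    new[i] += share
            else:                            # dangling node (no out-links)
                for i in range(n):
                    new[i] += alpha * r[j] / n
        r = new
    return r
```

Since each iteration preserves the total mass, r stays a probability distribution, matching the random-surfer reading on the next slide.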
Three different interpretations
Graph (intuitive interpretation): co-citation.
Matrix (easy for computation): eigenvector computation, or the solution of a linear system.
Markov chain (useful to prove convergence): a sort of usage simulation: from any node, a random surfer moves to one of its neighbors; “in the steady state” each page has a long-term visit rate, which is used as the page’s score.
Pagerank: use in Search Engines
Preprocessing: given the graph, build the matrix α T + (1-α) e e^T and compute its principal eigenvector r; r[i] is the PageRank of page i. We are interested in the relative order.
Query processing: retrieve the pages containing the query terms and rank them by their PageRank.
The final order is query-independent.
HITS: Hypertext Induced Topic Search
Calculating HITS
It is query-dependent and produces two scores per page:
Authority score: a good authority page for a topic is pointed to by many good hubs for that topic.
Hub score: a good hub page for a topic points to many authoritative pages for that topic.
Authority and Hub scores
[Diagram: hubs 2, 3 and 4 point to page 1, and page 1 points to authorities 5, 6 and 7]
a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)
HITS: Link Analysis Computation
h = A a
a = A^T h
where a: vector of authority scores; h: vector of hub scores; A: adjacency matrix, with a_{i,j} = 1 if i → j.
Substituting: h = A A^T h and a = A^T A a.
Thus h is an eigenvector of A A^T and a is an eigenvector of A^T A (both symmetric matrices).
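The coupled updates h = A a, a = A^T h can be sketched as a normalized power iteration (a minimal sketch, not a tuned implementation):

```python
import math

def hits(adj, iters=50):
    """Normalized power iteration of a = A^T h, h = A a (a sketch).
    adj: dict node -> list of nodes it points to (nodes are 0..n-1)."""
    n = len(adj)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iters):
        a_new = [0.0] * n
        for i, outs in adj.items():
            for j in outs:
                a_new[j] += h[i]             # a(j) += h(i) for each i -> j
        h_new = [sum(a_new[j] for j in adj[i]) for i in range(n)]
        na = math.sqrt(sum(x * x for x in a_new)) or 1.0
        nh = math.sqrt(sum(x * x for x in h_new)) or 1.0
        a = [x / na for x in a_new]          # normalize to avoid overflow
        h = [x / nh for x in h_new]
    return a, h
```

On the small graph from the earlier slide (hubs 2, 3, 4 pointing to page 1, which points to 5, 6, 7), page 1 ends up with the top authority score, as expected.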
Weighting links
Weight a link more if the query occurs in the neighborhood of the link (e.g., in the anchor text).
Unweighted: h(x) = Σ_{x→y} a(y), a(x) = Σ_{y→x} h(y)
Weighted: h(x) = Σ_{x→y} w(x,y) a(y), a(x) = Σ_{y→x} w(y,x) h(y)