1
Applications of Data Structures and Algorithms
Danfeng Yao
CS 16
3/13/2006
2
Overview
Data structures in web interface
Google
– Indexing
– PageRanking
– Crawling
3
Forward/backward buttons of Browser
4
Forward/backward buttons and Stack
www.brown.edu
Back Forward
5
Forward/backward buttons and Stack
www.dam.brown.edu
www.cs.brown.edu
www.brown.edu
www.engin.brown.edu
Back Forward
Old pages
6
Then click the Back button once
www.cs.brown.edu
www.engin.brown.edu
www.brown.edu
www.dam.brown.edu
Back Forward
7
Then click the Forward button once
www.dam.brown.edu
www.cs.brown.edu
www.engin.brown.edu
www.brown.edu
Forward Back
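The Back and Forward buttons on the preceding slides can be sketched with two stacks. This is a minimal illustration, not any browser's actual implementation. The class name BrowserHistory, its method names, the order in which the pages are visited, and the choice to clear the Forward stack on a new visit are all assumptions made for the example.

```python
class BrowserHistory:
    """Back/Forward navigation modeled with two stacks (Python lists)."""

    def __init__(self, start_url):
        self.current = start_url
        self.back_stack = []     # pages reachable with the Back button
        self.forward_stack = []  # pages reachable with the Forward button

    def visit(self, url):
        # Following a link pushes the current page onto the Back stack.
        # Assumption: a new visit discards the Forward stack.
        self.back_stack.append(self.current)
        self.current = url
        self.forward_stack.clear()

    def back(self):
        # Move the current page to the Forward stack, pop from Back.
        if self.back_stack:
            self.forward_stack.append(self.current)
            self.current = self.back_stack.pop()
        return self.current

    def forward(self):
        # Move the current page to the Back stack, pop from Forward.
        if self.forward_stack:
            self.back_stack.append(self.current)
            self.current = self.forward_stack.pop()
        return self.current


# Hypothetical visit order using the URLs shown on the slides.
history = BrowserHistory("www.brown.edu")
for url in ["www.engin.brown.edu", "www.cs.brown.edu", "www.dam.brown.edu"]:
    history.visit(url)
```

With this visit order, one click on Back moves www.dam.brown.edu onto the Forward stack, and one click on Forward moves it back. Both operations are O(1) pushes and pops.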
8
Planar point location: where is the mouse click?
9
Simple point location: Binary decision tree
Build a binary tree
– internal nodes corresponding to line segments
– external nodes corresponding to regions
[Figure: a subdivision with line segments a, b, c, d and the matching binary decision tree; branches are labeled left/right and above/below]
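A binary decision tree for point location can be sketched as follows. This is an illustration under assumptions, not the exact tree from the figure: the names Region, SegmentNode, side, and locate are hypothetical, and the sample subdivision (one vertical and one horizontal segment) is invented for the example. Each internal node stores a directed segment, and a cross product decides which child to follow.

```python
from dataclasses import dataclass
from typing import Tuple, Union

Point = Tuple[float, float]


@dataclass
class Region:
    """External node: a region of the subdivision."""
    name: str


@dataclass
class SegmentNode:
    """Internal node: a directed segment p -> q and two subtrees."""
    p: Point
    q: Point
    positive: "Node"  # points to the left of p -> q (above, if p -> q runs rightward)
    negative: "Node"  # points to the right of p -> q, or exactly on the line


Node = Union[Region, SegmentNode]


def side(p: Point, q: Point, pt: Point) -> float:
    # Cross product of (q - p) and (pt - p); positive means pt is left of p -> q.
    return (q[0] - p[0]) * (pt[1] - p[1]) - (q[1] - p[1]) * (pt[0] - p[0])


def locate(node: Node, pt: Point) -> str:
    # Walk from the root to an external node, one segment test per level.
    while isinstance(node, SegmentNode):
        node = node.positive if side(node.p, node.q, pt) > 0 else node.negative
    return node.name


# Hypothetical subdivision: a vertical segment at x = 1, then a horizontal
# segment at y = 1 splitting the left half.
tree = SegmentNode(
    (1, 0), (1, 2),
    positive=SegmentNode(
        (0, 1), (1, 1),
        positive=Region("upper left"),
        negative=Region("lower left"),
    ),
    negative=Region("right"),
)
```

A mouse click at (0.5, 1.5) walks left of the vertical segment and above the horizontal one, so locate returns "upper left". The query time is proportional to the depth of the tree.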
10
The need for search engines to scale up
What a search engine faces
– Storage for index files (and maybe documents themselves too)
– Index system processes hundreds of gigabytes
– Queries at a rate of thousands per second
Advances in hardware technology
– Faster CPU
– Cheaper memory and disk space
But still, slow disk seek (~10 ms) and operating system instability
What will make today’s search engine scale?
11
Google facts
Will be on the stock market soon
– Estimated annual profit $150 million to $350 million
– Estimated annual revenue $500 million to $1 billion
– Estimated market value $12 billion to $20 billion
The heart of Google software is PageRank™
Google has integrity
– No one can buy a higher PageRank
Sergey Brin
Larry Page
12
Google’s approach
Data structures in Google
– Compact data structures
– Avoid disk seeks whenever possible
Data structures for indexing
– Link structures
– Inverted index
Page ranking
Crawling
13
Web as a directed graph
(Brown, Brown CS)
(Brown CS, CS16)
(Brown CS, rt)
A hyperlink is an edge
A web page is a vertex
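The three edges listed on this slide can be stored as adjacency lists. This is a minimal sketch. The variable names out_links and in_links are assumptions, and keeping a reverse (incoming) list is a design choice for the example, not something the slide specifies.

```python
from collections import defaultdict

# Each hyperlink is a directed edge (source page, target page).
edges = [("Brown", "Brown CS"), ("Brown CS", "CS16"), ("Brown CS", "rt")]

out_links = defaultdict(list)  # page -> pages it links to
in_links = defaultdict(list)   # page -> pages that link to it

for source, target in edges:
    out_links[source].append(target)
    in_links[target].append(source)
```

Outgoing lists suit crawling, which follows links forward. Incoming lists suit ranking schemes that look at which pages point to a given page.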
14
Storing the link structures
Want to maintain the link relationship of two pages
– Used for crawling, ranking, …
Main problem: how to store the set of pairs efficiently
URL is too long and has variable length
– Storing (URLi, URLj) has too much overhead and is slow
Use a more compact docID and support fast docID/URL conversion
docID URL
http://www.cs.brown.edu/people/rt/
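One way to sketch the docID idea is a table that converts in both directions, with link pairs stored as fixed-width integers. This is an assumption-laden illustration, not Google's actual layout. The class name DocTable, the sequential numbering from 0, and the second URL are hypothetical.

```python
from array import array


class DocTable:
    """Two-way mapping between variable-length URLs and compact integer docIDs."""

    def __init__(self):
        self._urls = []  # docID -> URL (list index is the docID)
        self._ids = {}   # URL -> docID

    def doc_id(self, url):
        # Assign the next unused integer the first time a URL is seen.
        if url not in self._ids:
            self._ids[url] = len(self._urls)
            self._urls.append(url)
        return self._ids[url]

    def url(self, doc_id):
        return self._urls[doc_id]


table = DocTable()
links = array("I")  # flat unsigned ints: source, target, source, target, ...


def add_link(source_url, target_url):
    # Store the pair (docID_i, docID_j) instead of (URL_i, URL_j).
    links.extend((table.doc_id(source_url), table.doc_id(target_url)))


# The first URL is a hypothetical linking page; the second is from the slide.
add_link("http://www.cs.brown.edu/", "http://www.cs.brown.edu/people/rt/")
```

Each link now costs two fixed-size integers regardless of URL length, and both conversion directions are constant-time lookups.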
15
Forward index and inverted index
Forward
– vitae, security, information, design, algorithm
– vitae, graph, drawing, design, algorithm
Inverted
– algorithm (Hit 1: algorithm, Hit 2: algorithm, Hit 3: algorithm)
– graph
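Turning a forward index into an inverted index can be sketched as below. The word lists are loosely based on the slide's example. The docIDs 0 and 1, the function name invert, and the choice to record each hit as a (docID, position) pair are assumptions for illustration.

```python
from collections import defaultdict

# Forward index: docID -> words appearing in that document, in order.
forward_index = {
    0: ["vitae", "security", "information", "design", "algorithm"],
    1: ["vitae", "graph", "drawing", "design", "algorithm"],
}


def invert(forward):
    # Inverted index: word -> list of hits, each hit a (docID, position) pair.
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for position, word in enumerate(forward[doc_id]):
            inverted[word].append((doc_id, position))
    return dict(inverted)


inverted_index = invert(forward_index)
```

A query for "algorithm" then reads one list of hits instead of scanning every document, which is why search is served from the inverted index.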
16
PageRanking: bringing order to the web
17
GoogleBot: where to crawl?
Through addURL forms or links
Deep crawling
– BFS or DFS?
– Rumored to crawl in PageRank order
Fresh crawling
– Recrawling to keep index updated
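The BFS or DFS question can be made concrete with a small traversal over an in-memory link graph. This is a sketch only: it fetches nothing from the web, the function name crawl_order is hypothetical, and the "Brown Engineering" page is an invented addition to the earlier example graph. It does not attempt the rumored PageRank ordering.

```python
from collections import deque


def crawl_order(links, seed, strategy="bfs"):
    """Return the order in which pages are visited starting from seed."""
    use_bfs = strategy == "bfs"
    frontier = deque([seed])
    visited = set()
    order = []
    while frontier:
        # BFS takes the oldest discovered page; DFS takes the newest.
        page = frontier.popleft() if use_bfs else frontier.pop()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        targets = links.get(page, [])
        if not use_bfs:
            # Reverse so the first link on a page is explored first.
            targets = list(reversed(targets))
        frontier.extend(t for t in targets if t not in visited)
    return order


# Hypothetical link graph: page -> pages it links to.
links = {
    "Brown": ["Brown CS", "Brown Engineering"],
    "Brown CS": ["CS16", "rt"],
    "Brown Engineering": [],
}
```

On this graph, BFS visits both pages linked from Brown before going deeper, while DFS follows Brown CS down to CS16 and rt before returning to Brown Engineering. The only difference in the code is which end of the frontier is popped.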
18
More Google facts
Free lunch every day!
Bring pets to work
On-site massage
19
Bibliography
The Anatomy of a Large-Scale Hypertextual Web Search Engine. Sergey Brin and Lawrence Page
The PageRank Citation Ranking: Bringing Order to the Web. Lawrence Page