challenges in building large-scale information retrieval …cs6714/20t3/lect/wsdm09-keynote...•...
TRANSCRIPT
![Page 1: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/1.jpg)
Challenges in Building Large-Scale Information Retrieval Systems
Jeff DeanGoogle Fellow
![Page 2: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/2.jpg)
• Challenging blend of science and engineering
– Many interesting, unsolved problems
– Spans many areas of CS:• architecture, distributed systems, algorithms, compression,
information retrieval, machine learning, UI, etc.
– Scale far larger than most other systems
• Small teams can create systems used by hundreds of millions
Why Work on Retrieval Systems?
![Page 3: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/3.jpg)
• Must balance engineering tradeoffs between:
– number of documents indexed
– queries / sec
– index freshness/update rate
– query latency
– information kept about each document
– complexity/cost of scoring/retrieval algorithms
• Engineering difficulty roughly equal to the product of these parameters
• All of these affect overall performance, and performance per $
Retrieval System Dimensions
![Page 4: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/4.jpg)
• # docs: ~70M to many billion
• queries processed/day:
• per doc info in index:
• update latency: months to minutes
• avg. query latency: <1s to <0.2s
• More machines * faster machines:
1999 vs. 2009
![Page 5: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/5.jpg)
• # docs: ~70M to many billion
• queries processed/day:
• per doc info in index:
• update latency: months to minutes
• avg. query latency: <1s to <0.2s
• More machines * faster machines:
1999 vs. 2009
~100X
![Page 6: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/6.jpg)
• # docs: ~70M to many billion
• queries processed/day:
• per doc info in index:
• update latency: months to minutes
• avg. query latency: <1s to <0.2s
• More machines * faster machines:
1999 vs. 2009
~100X
~1000X
![Page 7: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/7.jpg)
• # docs: ~70M to many billion
• queries processed/day:
• per doc info in index:
• update latency: months to minutes
• avg. query latency: <1s to <0.2s
• More machines * faster machines:
1999 vs. 2009
~100X
~3X
~1000X
![Page 8: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/8.jpg)
• # docs: ~70M to many billion
• queries processed/day:
• per doc info in index:
• update latency: months to minutes
• avg. query latency: <1s to <0.2s
• More machines * faster machines:
1999 vs. 2009
~100X
~3X
~1000X
~10000X
![Page 9: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/9.jpg)
• # docs: ~70M to many billion
• queries processed/day:
• per doc info in index:
• update latency: months to minutes
• avg. query latency: <1s to <0.2s
• More machines * faster machines:
1999 vs. 2009
~100X
~3X
~1000X
~10000X
~5X
![Page 10: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/10.jpg)
• # docs: ~70M to many billion
• queries processed/day:
• per doc info in index:
• update latency: months to minutes
• avg. query latency: <1s to <0.2s
• More machines * faster machines:
1999 vs. 2009
~100X
~3X
~1000X
~10000X
~5X
~1000X
![Page 11: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/11.jpg)
• Parameters change over time
– often by many orders of magnitude
• Right design at X may be very wrong at 10X or 100X– ... design for ~10X growth, but plan to rewrite before ~100X
• Continuous evolution:
– 7 significant revisions in last 10 years
– often rolled out without users realizing we’ve made major changes
Constant Change
![Page 12: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/12.jpg)
• Evolution of Google’s search systems
– several gens of crawling/indexing/serving systems
– brief description of supporting infrastructure
– Joint work with many, many people
• Interesting directions and challenges
Rest of Talk
![Page 13: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/13.jpg)
“Google” Circa 1997 (google.stanford.edu)
![Page 14: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/14.jpg)
Research Project, circa 1997
!"#$%&$'!(&)!*&"+&"
,- ,. ,/ ,0
,$'&1!234"'2
5- 5. 56
78&"9
,$'&1!2&"+&"2 5#:!2&"+&"2
; ;5#:!234"'2
![Page 15: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/15.jpg)
• By doc: each shard has index for subset of docs– pro: each shard can process queries independently
– pro: easy to keep additional per-doc information
– pro: network traffic (requests/responses) small
– con: query has to be processed by each shard
– con: O(K*N) disk seeks for K word query on N shards
Ways of Index Partitioning
![Page 16: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/16.jpg)
• By doc: each shard has index for subset of docs– pro: each shard can process queries independently
– pro: easy to keep additional per-doc information
– pro: network traffic (requests/responses) small
– con: query has to be processed by each shard
– con: O(K*N) disk seeks for K word query on N shards
Ways of Index Partitioning
• By word: shard has subset of words for all docs– pro: K word query => handled by at most K shards
– pro: O(K) disk seeks for K word query
– con: much higher network bandwidth needed• data about each word for each matching doc must be collected in
one place
– con: harder to have per-doc information
![Page 17: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/17.jpg)
Ways of Index Partitioning
In our computing environment, by doc makes more sense
![Page 18: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/18.jpg)
• Documents assigned small integer ids (docids)
– good if smaller for higher quality/more important docs
• Index Servers:
– given (query) return sorted list of (score, docid, ...)
– partitioned (“sharded”) by docid
– index shards are replicated for capacity
– cost is O(# queries * # docs in index)
• Doc Servers
– given (docid, query) generate (title, snippet)
– map from docid to full text of docs on disk
– also partitioned by docid
– cost is O(# queries)
Basic Principles
![Page 19: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/19.jpg)
“Corkboards” (1999)
![Page 20: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/20.jpg)
Serving System, circa 1999
!"#$%&$'!(&)!*&"+&"
,- ,. ,/ ,0
,- ,. ,/ ,0
,- ,. ,/ ,0
<&=>?:42 ;
;
,$'&1!234"'2
5- 5. 56
5- 5. 56
5- 5. 56<&=>?:42 ;
;5#:!234"'2
78&"9
,$'&1!2&"+&"2
@4:3&!2&"+&"2
@- @. @A;
5#:!2&"+&"2
B'!*92%&C
![Page 21: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/21.jpg)
• Cache servers:
– cache both index results and doc snippets
– hit rates typically 30-60%• depends on frequency of index updates, mix of query traffic,
level of personalization, etc
• Main benefits:
– performance! 10s of machines do work of 100s or 1000s
– reduce query latency on hits
• queries that hit in cache tend to be both popular and expensive (common words, lots of documents to score, etc.)
• Beware: big latency spike/capacity drop when index updated or cache flushed
Caching
![Page 22: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/22.jpg)
• Simple batch crawling system
– start with a few URLs
– crawl pages
– extract links, add to queue
– stop when you have enough pages
• Concerns:
– don’t hit any site too hard
– prioritizing among uncrawled pages• one way: continuously compute PageRank on changing graph
– maintaining uncrawled URL queue efficiently• one way: keep in a partitioned set of servers
– dealing with machine failures
Crawling (circa 1998-1999)
![Page 23: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/23.jpg)
• Simple batch indexing system
– Based on simple unix tools
– No real checkpointing, so machine failures painful
– No checksumming of raw data, so hardware bit errors caused problems
• Exacerbated by early machines having no ECC, no parity
• Sort 1 TB of data without parity: ends up "mostly sorted"
• Sort it again: "mostly sorted" another way
• “Programming with adversarial memory”
– Led us to develop a file abstraction that stored checksums of small records and could skip and resynchronize after corrupted records
Indexing (circa 1998-1999)
![Page 24: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/24.jpg)
• 1998-1999: Index updates (~once per month):
– Wait until traffic is low
– Take some replicas offline
– Copy new index to these replicas
– Start new frontends pointing at updated index and serve some traffic from there
Index Updates (circa 1998-1999)
![Page 25: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/25.jpg)
• Index server disk:
– outer part of disk gives higher disk bandwidth
Index Updates (circa 1998-1999)
D:%EE!?$'&1
![Page 26: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/26.jpg)
• Index server disk:
– outer part of disk gives higher disk bandwidth
Index Updates (circa 1998-1999)
D:%EE!?$'&1
0#+EE!?$'&1
.F!@#=9!$&G!?$'&1!%#!?$$&"!34>H!#H!'?2IJG3?>&!2%?>>!2&"+?$K!#>'!?$'&1L
/F!<&2%4"%!%#!82&!$&G!?$'&1
![Page 27: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/27.jpg)
• Index server disk:
– outer part of disk gives higher disk bandwidth
Index Updates (circa 1998-1999)
0#+EE!?$'&1
.F!@#=9!$&G!?$'&1!%#!?$$&"!34>H!#H!'?2IJG3?>&!2%?>>!2&"+?$K!#>'!?$'&1L
/F!<&2%4"%!%#!82&!$&G!?$'&1
MF!(?=&!#>'!?$'&1
![Page 28: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/28.jpg)
• Index server disk:
– outer part of disk gives higher disk bandwidth
Index Updates (circa 1998-1999)
0#+EE!?$'&1
0#+EE!?$'&1
.F!@#=9!$&G!?$'&1!%#!?$$&"!34>H!#H!'?2IJG3?>&!2%?>>!2&"+?$K!#>'!?$'&1L
/F!<&2%4"%!%#!82&!$&G!?$'&1
MF!(?=&!#>'!?$'&1
NF!<&O:#=9!$&G!?$'&1!%#!H42%&"!34>H!#H!'?2I
![Page 29: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/29.jpg)
• Index server disk:
– outer part of disk gives higher disk bandwidth
Index Updates (circa 1998-1999)
0#+EE!?$'&1.F!@#=9!$&G!?$'&1!%#!?$$&"!34>H!#H!'?2IJG3?>&!2%?>>!2&"+?$K!#>'!?$'&1L
/F!<&2%4"%!%#!82&!$&G!?$'&1
MF!(?=&!#>'!?$'&1
PF!(?=&!H?"2%!:#=9!#H!$&G!?$'&1
NF!<&O:#=9!$&G!?$'&1!%#!H42%&"!34>H!#H!'?2I
![Page 30: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/30.jpg)
• Index server disk:
– outer part of disk gives higher disk bandwidth
Index Updates (circa 1998-1999)
0#+EE!?$'&1.F!@#=9!$&G!?$'&1!%#!?$$&"!34>H!#H!'?2IJG3?>&!2%?>>!2&"+?$K!#>'!?$'&1L
/F!<&2%4"%!%#!82&!$&G!?$'&1
MF!(?=&!#>'!?$'&1
PF!(?=&!H?"2%!:#=9!#H!$&G!?$'&1
QF!,$$&"!34>H!$#G!H"&&!H#"!)8?>'?$K!+4"?#82=&"H#"C4$:&!?C="#+?$K!'4%4!2%"8:%8"&2
NF!<&O:#=9!$&G!?$'&1!%#!H42%&"!34>H!#H!'?2I
R4?"!:4:3&S!="&O?$%&"2&:%&'!=4?"2!#H!=#2%?$K!>?2%2!H#"!:#CC#$>9!:#O#::8""?$K!78&"9!%&"C2!J&FKF!T$&GU!4$'!T9#"IUV!#"!T)4":&>#$4U!4$'!T"&2%48"4$%2UL
0#+EE=4?"!:4:3&
![Page 31: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/31.jpg)
Google Data Center (2000)
![Page 32: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/32.jpg)
Google Data Center (2000)
![Page 33: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/33.jpg)
Google Data Center (2000)
![Page 34: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/34.jpg)
Google (new data center 2001)
![Page 35: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/35.jpg)
Google Data Center (3 days later)
![Page 36: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/36.jpg)
• Huge increases in index size in ’99, ’00, ’01, ...– From ~50M pages to more than 1000M pages
• At same time as huge traffic increases– ~20% growth per month in 1999, 2000, ...
– ... plus major new partners (e.g. Yahoo in July 2000 doubled traffic overnight)
• Performance of index servers was paramount– Deploying more machines continuously, but...
– Needed ~10-30% software-based improvement every month
Increasing Index Size and Query Capacity
![Page 37: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/37.jpg)
Dealing with Growth
!"#$%&$'!(&)!*&"+&"
78&"9
,$'&1!2&"+&"2
@4:3&!2&"+&"2
B'!*92%&C
5#:!*&"+&"2
@4:3&!*&"+&"2<&=>?:42
,$'&1!234"'2
,- ,. ,/
,- ,. ,/
![Page 38: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/38.jpg)
Dealing with Growth
!"#$%&$'!(&)!*&"+&"
78&"9
,$'&1!2&"+&"2
@4:3&!2&"+&"2
B'!*92%&C
5#:!*&"+&"2
@4:3&!*&"+&"2<&=>?:42
,$'&1!234"'2
,- ,. ,/
,- ,. ,/
,M
,M
![Page 39: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/39.jpg)
Dealing with Growth
!"#$%&$'!(&)!*&"+&"
78&"9
,$'&1!2&"+&"2
@4:3&!2&"+&"2
B'!*92%&C
5#:!*&"+&"2
@4:3&!*&"+&"2<&=>?:42
,$'&1!234"'2
,- ,. ,/
,- ,. ,/
,M
,M
,- ,. ,/ ,M
![Page 40: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/40.jpg)
Dealing with Growth
!"#$%&$'!(&)!*&"+&"
78&"9
,$'&1!2&"+&"2
@4:3&!2&"+&"2
B'!*92%&C
5#:!*&"+&"2
@4:3&!*&"+&"2<&=>?:42
,$'&1!234"'2
,- ,. ,/
,- ,. ,/
,M
,M
,- ,. ,/ ,M
,N
,N
,N
![Page 41: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/41.jpg)
Dealing with Growth
!"#$%&$'!(&)!*&"+&"
78&"9
,$'&1!2&"+&"2
@4:3&!2&"+&"2
B'!*92%&C
5#:!*&"+&"2
@4:3&!*&"+&"2<&=>?:42
,$'&1!234"'2
,- ,. ,/
,- ,. ,/
,M
,M
,- ,. ,/ ,M
,N
,N
,N
,.-;,.-;,.-;
![Page 42: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/42.jpg)
Dealing with Growth
!"#$%&$'!(&)!*&"+&"
78&"9
,$'&1!2&"+&"2
@4:3&!2&"+&"2
B'!*92%&C
5#:!*&"+&"2
@4:3&!*&"+&"2<&=>?:42
,$'&1!234"'2
,- ,. ,/
,- ,. ,/
,M
,M
,- ,. ,/ ,M
,N
,N
,N
,.-;,.-;,.-;
,- ,. ,/ ,M ,N ,.-;
; ;
![Page 43: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/43.jpg)
Dealing with Growth
!"#$%&$'!(&)!*&"+&"
78&"9
,$'&1!2&"+&"2
@4:3&!2&"+&"2
B'!*92%&C
5#:!*&"+&"2
@4:3&!*&"+&"2<&=>?:42
,$'&1!234"'2
,- ,. ,/
,- ,. ,/
,M
,M
,- ,. ,/ ,M
,N
,N
,N
,.-;,.-;,.-;
,- ,. ,/ ,M ,N ,.-;
; ;
,Q-;,Q-;,Q-;
,Q-;
;
![Page 44: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/44.jpg)
• Shard response time influenced by:
– # of disk seeks that must be done
– amount of data to be read from disk
• Big performance improvements possible with:
– better disk scheduling
– improved index encoding
Implications
![Page 45: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/45.jpg)
• Original encoding (’97) was very simple:
– hit: position plus attributes (font size, title, etc.)
– Eventually added skip tables for large posting lists
• Simple, byte aligned format
– cheap to decode, but not very compact
– ... required lots of disk bandwidth
Index Encoding circa 1997-1999
'#:?'W$3?%2SM/) 3?%S!.Q) 3?%S!.Q) ; 3?%S!.Q)'#:?'W$3?%2SM/)WORD
![Page 46: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/46.jpg)
• Bit-level encodings:
– Unary: N ‘1’s followed by a ‘0’
– Gamma: log2(N) in unary, then floor(log2(N)) bits
– RiceK: floor(N / 2K) in unary, then N mod 2K in K bits
• special case of Golomb codes where base is power of 2
– Huffman-Int: like Gamma, except log2(N) is Huffman coded instead of encoded w/ Unary
• Byte-aligned encodings:
– varint: 7 bits per byte with a continuation bit
• 0-127: 1 byte, 128-4095: 2 bytes, ...
– ...
Encoding Techniques
![Page 47: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/47.jpg)
Block-Based Index Format
• Block-based, variable-len format reduced both space and CPU
'&>%4!%#!>42%!'#:?'!?$!)>#:IS!+4"?$%
&$:#'?$K!%9=&S!X4CC4 Y!'#:2!?$!)>#:IS!X4CC4
0!O.!'#:?'!'&>%42S!<?:&I!:#'&'
0!+4>8&2!#H!Y!3?%2!=&"!'#:S!X4CC4
Z!3?%!4%%"?)8%&2S!"8$!>&$K%3!Z8HHC4$!&$:#'&'
Z!3?%!=#2?%?#$2S!Z8HHC4$O,$%!&$:#'&'
)>#:I!>&$K%3S!+4"?$%
• Reduced index size by ~30%, plus much faster to decode
[>#:I!- [>#:I!. ;[>#:I!/ [>#:I!0*I?=!%4)>&J?H!>4"K&LWORD
[>#:I!H#"C4%!JG?%3!0!'#:8C&$%2!4$'!Z!3?%2LS
[9%&!4>?K$&'!3&4'&"
![Page 48: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/48.jpg)
• Must add shards to keep response time low as index size increases
• ... but query cost increases with # of shards
– typically >= 1 disk seek / shard / query term
– even for very rare terms
• As # of replicas increases, total amount of memory available increases
– Eventually, have enough memory to hold an entire copy of the index in memory
• radically changes many design parameters
Implications of Ever-Wider Sharding
![Page 49: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/49.jpg)
Early 2001: In-Memory Index
!"#$%&$'!(&)!*&"+&"
78&"9
,$'&1!2&"+&"2
@4:3&!2&"+&"2
B'!*92%&C
5#:!*&"+&"2
@4:3&!*&"+&"2
,$'&1!234"'2
*34"'!-
,- ,. ,/
,.N
,M
,./
)4>
,N ,P
;,.M
*34"'!.
,- ,. ,/
,.N
,M
,./
)4>
,N ,P
;,.M
*34"'!/
,- ,. ,/
,.N
,M
,./
)4>
,N ,P
;,.M
*34"'!0
,- ,. ,/
,.N
,M
,./
)4>
,N ,P
;,.M
[4>4$:&"2
;
![Page 50: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/50.jpg)
• Many positives:
– big increase in throughput
– big decrease in latency• especially at the tail: expensive queries that previously
needed GBs of disk I/O became much faster
e.g. [ “circle of life” ]
• Some issues:
– Variance: touch 1000s of machines, not dozens
• e.g. randomized cron jobs caused us trouble for a while
– Availability: 1 or few replicas of each doc’s index data
• Queries of death that kill all the backends at once: very bad
• Availability of index data when machine failed (esp for important docs): replicate important docs
In-Memory Indexing Systems
![Page 51: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/51.jpg)
Larger-Scale Computing
![Page 52: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/52.jpg)
• In-house rack design
• PC-class motherboards
• Low-end storage and networking hardware
• Linux
• + in-house software
Current Machines
![Page 53: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/53.jpg)
Serving Design, 2004 edition
Root
…
…
ParentServers
……
LeafServers
Repository Shards
…
RepositoryManager
FileLoaders
Cache servers
Requests
GFS
![Page 54: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/54.jpg)
New Index Format
• Block index format used two-level scheme:
– Each hit was encoded as (docid, word position in doc) pair
– Docid deltas encoded with Rice encoding
– Very good compression (originally designed for disk-based indices), but slow/CPU-intensive to decode
• New format: single flat position space
– Data structures on side keep track of doc boundaries
– Posting lists are just lists of delta-encoded positions
– Need to be compact (can’t afford 32 bit value per occurrence)
– … but need to be very fast to decode
![Page 55: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/55.jpg)
Byte-Aligned Variable-length Encodings
• Varint encoding:– 7 bits per byte with continuation bit
– Con: Decoding requires lots of branches/shifts/masks
0000111100000001 0000001111111111 1111111111111111 00000111
1 15 511 131071
![Page 56: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/56.jpg)
Byte-Aligned Variable-length Encodings
• Varint encoding:– 7 bits per byte with continuation bit
– Con: Decoding requires lots of branches/shifts/masks
0000111100000001 0000001111111111 1111111111111111 00000111
1 15 511 131071
• Idea: Encode byte length as low 2 bits– Better: fewer branches, shifts, and masks
– Con: Limited to 30-bit values, still some shifting to decode
0000111100000001 0000001101111111 1111111110111111 00000111
1 15 511 131071
![Page 57: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/57.jpg)
Group Varint Encoding
• Idea: encode groups of 4 values in 5-17 bytes– Pull out 4 2-bit binary lengths into single byte prefix
![Page 58: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/58.jpg)
Group Varint Encoding
• Idea: encode groups of 4 values in 5-17 bytes– Pull out 4 2-bit binary lengths into single byte prefix
0000111100000001 0000001101111111 1111111110111111 00000111
![Page 59: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/59.jpg)
Group Varint Encoding
• Idea: encode groups of 4 values in 5-17 bytes– Pull out 4 2-bit binary lengths into single byte prefix
0000111100000001 0000001101111111 1111111110111111 00000111
00000110
Tags
![Page 60: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/60.jpg)
Group Varint Encoding
• Idea: encode groups of 4 values in 5-17 bytes– Pull out 4 2-bit binary lengths into single byte prefix
0000111100000001 0000001101111111 1111111110111111 00000111
0000111100000001 11111111 11111111
1 15 511 131071
00000001 11111111 0000000100000110
Tags
![Page 61: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/61.jpg)
Group Varint Encoding
• Idea: encode groups of 4 values in 5-17 bytes– Pull out 4 2-bit binary lengths into single byte prefix
0000111100000001 0000001101111111 1111111110111111 00000111
0000111100000001 11111111 11111111
1 15 511 131071
00000001 11111111 0000000100000110
Tags
• Decode: Load prefix byte and use value to lookup in 256-entry table:
![Page 62: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/62.jpg)
Group Varint Encoding
• Idea: encode groups of 4 values in 5-17 bytes– Pull out 4 2-bit binary lengths into single byte prefix
0000111100000001 0000001101111111 1111111110111111 00000111
0000111100000001 11111111 11111111
1 15 511 131071
00000001 11111111 0000000100000110
Tags
• Decode: Load prefix byte and use value to lookup in 256-entry table:
00000110 Offsets: +1,+2,+3,+5; Masks: ff, ff, ffff, ffffff;;
![Page 63: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/63.jpg)
Group Varint Encoding
• Idea: encode groups of 4 values in 5-17 bytes– Pull out 4 2-bit binary lengths into single byte prefix
0000111100000001 0000001101111111 1111111110111111 00000111
0000111100000001 11111111 11111111
1 15 511 131071
00000001 11111111 0000000100000110
Tags
• Much faster than alternatives:– 7-bit-per-byte varint: decode ~180M numbers/second
– 30-bit Varint w/ 2-bit length: decode ~240M numbers/second
– Group varint: decode ~400M numbers/second
• Decode: Load prefix byte and use value to lookup in 256-entry table:
00000110 Offsets: +1,+2,+3,+5; Masks: ff, ff, ffff, ffffff;;
![Page 64: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/64.jpg)
2007: Universal Search
!"#$%&$'!(&)!*&"+&"
78&"9
@4:3&!2&"+&"2
B'!*92%&C
0&G2
*8=&"!"##%
,C4K&2
(&)
[>#K2\?'&#
[##I2
]#:4>
Indexing Service
![Page 65: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/65.jpg)
• Low-latency crawling and indexing is tough
– crawl heuristics: what pages should be crawled?
– crawling system: need to crawl pages quickly
– indexing system: depends on global data
• PageRank, anchor text of pages that point to the page, etc.
• must have online approximations for these global properties
– serving system: must be prepared to accept updates while serving requests
• very different data structures than batch update serving system
Index that? Just a minute!
![Page 66: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/66.jpg)
• Ease of experimentation hugely important
– faster turnaround => more exps => more improvement
• Some experiments are easy
– e.g. just weight existing data differently
• Others are more difficult to perform: need data not present in production index
– Must be easy to generate and incorporate new data and use it in experiments
Flexibility & Experimentation in IR Systems
![Page 67: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/67.jpg)
• Several key pieces of infrastructure:
– GFS: large-scale distributed file system
– MapReduce: makes it easy to write/run large scale jobs
• generate production index data more quickly
• perform ad-hoc experiments more rapidly
• ...
– BigTable: semi-structured storage system
• online, efficient access to per-document information at any time
• multiple processes can update per-doc info asynchronously
• critical for updating documents in minutes instead of hours
http://labs.google.com/papers/gfs.html
http://labs.google.com/papers/mapreduce.html
http://labs.google.com/papers/bigtable.html
Infrastructure for Search Systems
![Page 68: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/68.jpg)
• Start with new ranking idea
• Must be easy to run experiments, and to do so quickly:– Use tools like MapReduce, BigTable, to generate data...
– Initially, run off-line experiment & examine effects
• ...on human-rated query sets of various kinds
• ...on random queries, to look at changes to existing ranking
– Latency and throughput of this prototype don't matter
• ...iterate, based on results of experiments ...
Experimental Cycle, Part 1
![Page 69: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/69.jpg)
• Once off-line experiments look promising, want to run live experiment
– Experiment on tiny sliver of user traffic
– Random sample, usually• but sometimes a sample of specific class of queries
– e.g. English queries, or queries with place names, etc.
• For this, throughput not important, but latency matters
– Experimental framework must operate at close to production latencies!
Experimental Cycle, Part 2
![Page 70: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/70.jpg)
• Launch!
• Performance tuning/reimplementation to make feasible at full load– e.g. precompute data rather than computing at runtime
– e.g. approximate if "good enough" but much cheaper
• Rollout process important:
– Continuously make quality vs. cost tradeoffs
– Rapid rollouts at odds with low latency and site stability
• Need good working relationships between search quality and groups chartered to make things fast and stable
Experiment Looks Good: Now What?
![Page 71: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/71.jpg)
• A few closing thoughts on interesting directions...
Future Directions & Challenges
![Page 72: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/72.jpg)
• Translate all the world’s documents to all the world’s languages
– increases index size substantially
– computationally expensive
– ... but huge benefits if done well
• Challenges:
– continuously improving translation quality
– large-scale systems work to deal with larger and more complex language models
• to translate one sentence ⇒ ~1M lookups in multi-TB model
Cross-Language Information Retrieval
![Page 73: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/73.jpg)
ACLs in Information Retrieval Systems
• Retrieval systems with mix of private, semi-private, widely shared and public documents
– e.g. e-mail vs. shared doc among 10 people vs. messages in group with 100,000 members vs. public web pages
• Challenge: building retrieval systems that efficiently deal with ACLs that vary widely in size
– best solution for doc shared with 10 people is different than for doc shared with the world
– sharing patterns of a document might change over time
![Page 74: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/74.jpg)
Automatic Construction of Efficient IR Systems
• Currently use several retrieval systems
– e.g. one system for sub-second update latencies, one for very large # of documents but daily updates, ...
– common interfaces, but very different implementations primarily for efficiency
– works well, but lots of effort to build, maintain and extend different systems
• Challenge: can we have a single parameterizable system that automatically constructs efficient retrieval system based on these parameters?
![Page 75: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/75.jpg)
Information Extraction from Semi-structured Data
• Data with clearly labelled semantic meaning is a tiny fraction of all the data in the world
• But there’s lots semi-structured data
– books & web pages with tables, data behind forms, ...
• Challenge: algorithms/techniques for improved extraction of structured information from unstructured/semi-structured sources
– noisy data, but lots of redundancy
– want to be able to correlate/combine/aggregate info from different sources
![Page 76: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/76.jpg)
In Conclusion...
• Designing and building large-scale retrieval systems is a challenging, fun endeavor
– new problems require continuous evolution
– work benefits many users
– new retrieval techniques often require new systems
• Thanks for your attention!
![Page 77: Challenges in Building Large-Scale Information Retrieval …cs6714/20t3/lect/WSDM09-keynote...• Challenging blend of science and engineering – Many interesting, unsolved problems](https://reader035.vdocuments.site/reader035/viewer/2022071421/611ad231d6c77f53c63c90b4/html5/thumbnails/77.jpg)
Thanks! Questions...?
• Further reading:Ghemawat, Gobioff, & Leung. Google File System, SOSP 2003.
Barroso, Dean, & Hölzle. Web Search for a Planet: The Google Cluster Architecture, IEEE
Micro, 2003.
Dean & Ghemawat. MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004.
Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, & Gruber. Bigtable: A Distributed Storage System for Structured Data, OSDI 2006.
Brants, Popat, Xu, Och, & Dean. Large Language Models in Machine Translation, EMNLP 2007.
• These and many more available at:
http://labs.google.com/papers.html