Peer-to-peer search that works, Djoerd Hiemstra
PEER-TO-PEER SEARCH THAT WORKS
Djoerd Hiemstra
http://www.cs.utwente.nl/~hiemstra
Yandex, Moscow, 27 April 2011
2/50
WHAT DOES A SEARCH ENGINE
LOOK LIKE?
?
3/50
A DATA CENTER...?
Goose Creek, California
4/50
A DATA CENTER...?
5/50
A DATA CENTER...?
In Eemshaven... ?
Biggest data center in Europe
100,000 servers, 19,000 m²
Uses electricity equal to 80,000 households
6/50
A DATA CENTER...?
… where the * is Eemshaven?
Close to a power plant
Close to the sea (cooling!)
7/50
WHAT ELSE DOES A SEARCH ENGINE
LOOK LIKE?
?
8/50
A “BIG BROTHER” ?
9/50
A “BIG BROTHER” ?
10/50
NO REALLY, WHAT DOES A SEARCH
ENGINE LOOK LIKE?
?
11/50
… FINDS WHAT YOU NEED ?
12/50
… FINDS WHAT YOU NEED ?
13/50
… FINDS WHAT YOU NEED ?
14/50
SO, NOT NECESSARILY...
Green, environmentally friendly, respecting privacy, objective... nor democratic.
15/50
WHAT SHOULD A SEARCH ENGINE
LOOK LIKE?
?
16/50
YOUR PERSONAL SYSTEM:
17/50
PEER-TO-PEER SEARCH
18/50
YOUR PERSONAL SYSTEM:
Each user brings processing power: as search consumer and search supplier
Green!
Democratic
No “big brother”
19/50
PEER-TO-PEER SEARCH
Moscow
Results for “Moscow”
20/50
PEER-TO-PEER SEARCH
RuSSIR
Go to peer 74
21/50
PEER-TO-PEER SEARCH
RuSSIR
Go to peer 74
RuSSIR
Results for “RuSSIR”
22/50
PEER-TO-PEER SEARCH
RuSSIR
Go to peer 2
23/50
PEER-TO-PEER SEARCH
RuSSIR
Results for “RuSSIR”
RuSSIR
Go to peer 2
24/50
OVERVIEW
1. Caching in P2P networks
2. Query-based sampling using snippets
3. Deep web querying
25/50
P2P LOAD BALANCING BY CACHING
If you do not index documents, cache them!
Handles query bursts (e.g., “michael jackson's death”)
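The caching idea on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system presented in the talk: the `Peer` class, its `capacity` parameter, and the `origin_search` callback are hypothetical names, and least-recently-used eviction is an assumed policy.

```python
from collections import OrderedDict


class Peer:
    """A peer that caches search results instead of indexing documents."""

    def __init__(self, origin_search, capacity=1000):
        self.origin_search = origin_search  # hypothetical: asks the peer that holds the index
        self.capacity = capacity            # assumed bound on the number of cached queries
        self.cache = OrderedDict()          # query -> results, oldest entry first
        self.hits = 0
        self.misses = 0

    def search(self, query):
        if query in self.cache:
            self.cache.move_to_end(query)   # mark as most recently used
            self.hits += 1
            return self.cache[query]
        self.misses += 1
        results = self.origin_search(query)
        self.cache[query] = results
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return results


def origin_search(query):
    # Stand-in for the peer that actually indexes the documents.
    return [f"result for {query!r}"]


# A burst of one popular query reaches the origin peer only once.
peer = Peer(origin_search, capacity=2)
for q in ["michael jackson's death"] * 5 + ["moscow"]:
    peer.search(q)
print(peer.hits, peer.misses)
```

In this toy run only the first occurrence of the burst query and the single "moscow" query are forwarded; the repeats are answered from the cache.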
26/50
QUERY LOG & CACHING POTENTIAL
27/50
SHARE RATIOS
28/50
CACHE SIZES
29/50
EFFECT OF TEXT PROCESSING
30/50
CHURN
31/50
DISCUSSION
About 55% from cache in ideal case
About 78% from cache with subsumption, stemming, etc.
About 33% from cache if bounded cache and churn (but no subsumption)
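Why text processing raises the share of queries served from cache can be illustrated with a toy query log. The sketch below is an assumption-laden stand-in: `crude_stem` is a deliberately naive suffix stripper rather than a real stemmer, `cache_key` and `hit_ratio` are hypothetical helpers, subsumption is not modelled, and the numbers it prints say nothing about the percentages reported on the slide.

```python
import re


def crude_stem(term):
    # Naive suffix stripping for illustration; a real system would use a proper stemmer.
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def cache_key(query):
    """Map query variants to one key: lowercase, drop punctuation, stem, sort terms."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    return " ".join(sorted(crude_stem(t) for t in terms))


def hit_ratio(queries, key=lambda q: q):
    """Fraction of queries answered from an unbounded cache under a given key function."""
    seen, hits = set(), 0
    for q in queries:
        k = key(q)
        if k in seen:
            hits += 1
        seen.add(k)
    return hits / len(queries)


# Made-up log: five distinct strings, but only two distinct information needs.
log = ["cheap flights", "Cheap flight", "flights cheap", "moscow hotels", "Moscow hotel"]
print(hit_ratio(log))             # exact-match keys
print(hit_ratio(log, cache_key))  # normalized keys
```

With exact string matching none of the five queries hits the cache; after normalization three of them do.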
32/50
OVERVIEW
1. Caching in P2P networks
2. Query-based sampling using snippets
3. Deep web querying
33/50
QUERY-BASED SAMPLING
Never download any documents
Instead, use the search result snippets to learn about documents
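A rough sketch of the sampling loop may help here. Everything in it is assumed for illustration: `DOCS` and `remote_search` stand in for a remote search engine that returns only titles and snippets, and `sample_by_snippets` is a hypothetical helper that picks random one-term queries from the vocabulary seen so far.

```python
import random
from collections import Counter

# Toy stand-in for a remote engine: title -> snippet (made-up data).
DOCS = {
    "Peer-to-peer search": "peers share processing power for search",
    "Data centers": "a data center needs power and cooling",
    "Deep web": "web forms hide data from crawlers",
}


def remote_search(query, k=2):
    # Return at most k (title, snippet) pairs whose snippet contains the query term.
    hits = [(t, s) for t, s in DOCS.items() if query in s.split()]
    return hits[:k]


def sample_by_snippets(search, seed_terms, rounds=20, rng=None):
    """Build a term-frequency model of a collection from result snippets alone."""
    rng = rng or random.Random(0)
    model = Counter()
    seen_titles = set()
    vocabulary = list(seed_terms)
    for _ in range(rounds):
        query = rng.choice(vocabulary)      # random one-term query
        for title, snippet in search(query):
            if title in seen_titles:
                continue                    # count each result only once
            seen_titles.add(title)
            terms = snippet.lower().split()
            model.update(terms)
            for t in terms:                 # learned terms become future queries
                if t not in vocabulary:
                    vocabulary.append(t)
    return model


model = sample_by_snippets(remote_search, ["power"])
print(model.most_common(3))
```

No document is ever downloaded: the model is built only from the snippets that the engine returns anyway.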
34/50
DO SAMPLES RESEMBLE THE FULL INDEX?
35/50
DO SAMPLES RESEMBLE THE FULL INDEX?
36/50
DO SAMPLES RESEMBLE THE FULL INDEX?
37/50
CAN WE DO BETTER THAN RANDOM?
38/50
CAN WE DO BETTER THAN RANDOM?
39/50
DISCUSSION
1. Sampling snippets is as effective as sampling full documents
2. Can be done at no extra cost(!)
3. Random sampling is an effective strategy
40/50
OVERVIEW
1. Caching in P2P networks
2. Query-based sampling using snippets
3. Deep web querying
41/50
DEEP WEB QUERYING
Opportunity: while we are sending queries to search engines directly...
… we might as well search the deep web!
42/50
YOUR TYPICAL DEEP WEB SITE
http://www.ns.nl
43/50
NATURAL LANGUAGE QUERYING
44/50
EASY TO SPECIFY
45/50
USER STUDY
46/50
USER STUDY
47/50
USER STUDY
A = from, B = to, V = via, D = date, T = time
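To make the five fields concrete, here is a small sketch of how a free-text travel query could be mapped onto them. It is not the system evaluated in the user study: the keyword patterns in `PATTERNS`, the `MONTHS` list, and the `parse_trip` helper are all assumptions chosen for illustration, and they cover only a few English phrasings.

```python
import re

MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"

# Hypothetical keyword patterns; the slide only names the five fields.
PATTERNS = {
    "from": r"\bfrom\s+(\w+)",
    "to": r"\bto\s+(\w+)",
    "via": r"\bvia\s+(\w+)",
    "date": rf"\b(today|tomorrow|\d{{1,2}}\s+(?:{MONTHS})[a-z]*)\b",
    "time": r"\b(\d{1,2}:\d{2})\b",
}


def parse_trip(query):
    """Extract whichever of the five fields appear, in any order."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, query, flags=re.IGNORECASE)
        if match:
            fields[name] = match.group(1)
    return fields


full = parse_trip("from Enschede to Amsterdam via Utrecht on 27 April at 9:30")
partial = parse_trip("to Amsterdam at 9:30 from Enschede")
print(full)
print(partial)
```

Because each field is matched independently, the same parser accepts the fields in any order and tolerates missing ones, which is one simple way to cope with variation in how people phrase such queries.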
48/50
DISCUSSION
1. Users like the interface
2. Users perform the tasks faster
3. Considerable query variation between subjects: No “one size fits all”!
49/50
CONCLUSIONS
Peer-to-peer is a viable approach to large-scale search
Peer-to-peer search will make Google, Yahoo, Bing and Yandex irrelevant ;)
50/50
PUBLICATIONS
Almer Tigelaar, Djoerd Hiemstra, and Dolf Trieschnigg, Search Result Caching in P2P Information Retrieval Networks, Proceedings of the 2nd Information Retrieval Facility Conference (IRFC), 2011.
Almer Tigelaar and Djoerd Hiemstra, Query-Based Sampling using Snippets, Proceedings of the SIGIR 2010 Workshop on Large-Scale Distributed Systems for Information Retrieval, 2010.
Kien Tjin-Kam-Jet, Dolf Trieschnigg, and Djoerd Hiemstra, Free-Text Search versus Complex Web Forms, Proceedings of the European Conference on Information Retrieval (ECIR), 2011.
51/50
ACKNOWLEDGEMENTS
Netherlands Organization for Scientific Research
Almer Tigelaar, Kien Tjin-Kam-Jet, Dolf Trieschnigg
52/50
53/50
“MAIL” RESULTS FROM YANDEX ?