peer-to-peer search that works, djoerd hiemstra

PEER-TO-PEER SEARCH THAT WORKS

Djoerd Hiemstra

http://www.cs.utwente.nl/~hiemstra

Yandex, Moscow, 27 April 2011

2/50

WHAT DOES A SEARCH ENGINE LOOK LIKE?

?

3/50

A DATA CENTER...?

Goose Creek, California

4/50

A DATA CENTER...?

5/50

A DATA CENTER...?

In Eemshaven...?

Biggest data center in Europe

100,000 servers, 19,000 m², uses as much electricity as 80,000 households

6/50

A DATA CENTER...?

… where the * is Eemshaven?

Close to a power plant

Close to the sea (cooling!)

7/50

WHAT ELSE DOES A SEARCH ENGINE LOOK LIKE?

?

8/50

A “BIG BROTHER” ?

9/50

A “BIG BROTHER” ?

10/50

NO, REALLY, WHAT DOES A SEARCH ENGINE LOOK LIKE?

?

11/50

… FINDS WHAT YOU NEED ?

12/50


13/50


14/50

SO, NOT NECESSARILY...

Green (environmentally friendly), respecting privacy, objective... nor democratic.

15/50

WHAT SHOULD A SEARCH ENGINE LOOK LIKE?

?

16/50

YOUR PERSONAL SYSTEM:

17/50

PEER-TO-PEER SEARCH

18/50

YOUR PERSONAL SYSTEM:

Each user brings processing power, as search consumer and as search supplier

Green!

Democratic

No “big brother”

19/50

PEER-TO-PEER SEARCH

Moscow

Results for “Moscow”

20/50

PEER-TO-PEER SEARCH

RuSSIR

Go to peer 74

21/50

PEER-TO-PEER SEARCH

RuSSIR

Go to peer 74

RuSSIR

Results for “RuSSIR”

22/50

PEER-TO-PEER SEARCH

RuSSIR

Go to peer 2

23/50

PEER-TO-PEER SEARCH

RuSSIR

Results for “RuSSIR”

RuSSIR

Go to peer 2

24/50

OVERVIEW

1. Caching in P2P networks

2. Query-based sampling using snippets

3. Deep web querying

25/50

P2P LOAD BALANCING BY CACHING

If you do not index documents, cache them!

Handles query bursts (e.g., “Michael Jackson's death”)
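To make the caching idea concrete, here is a minimal sketch of a peer-side result cache. The slides do not say which eviction policy is used, so least-recently-used (LRU) eviction is an assumption, and `ResultCache` and `slow_origin_peer` are hypothetical names introduced only for illustration.

```python
from collections import OrderedDict


class ResultCache:
    """A bounded cache mapping a query string to its search results.

    Eviction is least-recently-used; this policy is an assumption,
    not something the slides specify.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, query, fetch):
        """Return cached results, or call fetch(query) and cache them."""
        if query in self._entries:
            self.hits += 1
            self._entries.move_to_end(query)  # mark as most recently used
            return self._entries[query]
        self.misses += 1
        results = fetch(query)
        self._entries[query] = results
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return results


def slow_origin_peer(query):
    # Hypothetical stand-in for asking the peer that holds the index.
    return ["result for " + query]


cache = ResultCache(capacity=2)
burst = ["michael jackson", "michael jackson", "moscow",
         "michael jackson", "russir", "moscow"]
for q in burst:
    cache.lookup(q, slow_origin_peer)
```

During a burst, the repeated query is served from the cache and never reaches the origin peer again, which is the load-balancing effect the slide describes.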

26/50

QUERY LOG & CACHING POTENTIAL

27/50

SHARE RATIOS

28/50

CACHE SIZES

29/50

EFFECT OF TEXT PROCESSING

30/50

CHURN

31/50

DISCUSSION

About 55% from cache in ideal case

About 78% from cache with subsumption, stemming, etc.

About 33% from cache if bounded cache and churn (but no subsumption)
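The gain from text processing can be illustrated with a toy query log. The sketch below is only an assumption about what such processing might involve: it lowercases, drops punctuation, strips a trailing plural "s" (a crude stand-in for real stemming), and sorts the terms. Subsumption is not modeled, and `normalize`, `hit_ratio`, and the log itself are made up for illustration.

```python
import re


def normalize(query):
    """Lowercase, drop punctuation and possessives, strip plural 's', sort terms.

    A crude stand-in for stemming; sorting assumes term order does not matter.
    """
    terms = re.findall(r"[a-z0-9]+", query.lower().replace("'s", ""))
    stemmed = [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in terms]
    return " ".join(sorted(stemmed))


def hit_ratio(log, key=lambda q: q):
    """Fraction of queries in the log whose key was seen before (unbounded cache)."""
    seen = set()
    hits = 0
    for query in log:
        k = key(query)
        if k in seen:
            hits += 1
        else:
            seen.add(k)
    return hits / len(log)


# Made-up log, not data from the talk.
toy_log = [
    "Michael Jackson's death",
    "michael jackson death",
    "death michael jackson",
    "data centers",
    "Data center",
    "peer-to-peer search",
]
raw = hit_ratio(toy_log)                       # exact string matching
processed = hit_ratio(toy_log, key=normalize)  # after text processing
```

On this toy log, exact matching finds no repeats, while the normalized keys let half of the queries be answered from the cache. The real percentages depend on the query log and cannot be derived from this sketch.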

32/50

OVERVIEW

1. Caching in P2P networks

2. Query-based sampling using snippets

3. Deep web querying

33/50

QUERY-BASED SAMPLING

Never download any documents

Instead, use the search result snippets to learn about documents
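As a rough sketch of how sampling from snippets could work, the loop below sends one-term queries to an engine, adds the returned snippets to a term-frequency model, and picks the next query term at random from the terms learned so far. `ToyEngine` is a hypothetical stand-in for a remote search engine, and treating the first eight words of a document as its snippet is an assumption made for brevity.

```python
import random
from collections import Counter


class ToyEngine:
    """Hypothetical stand-in for a remote engine that returns snippets only."""

    def __init__(self, documents):
        self.documents = documents

    def search(self, term, limit=10):
        hits = [d for d in self.documents if term in d.lower().split()]
        # Assumption: the snippet is simply the first 8 words of the document.
        return [" ".join(d.split()[:8]) for d in hits[:limit]]


def sample_by_snippets(engine, seed_term, rounds, rng):
    """Estimate an engine's vocabulary from result snippets alone."""
    model = Counter()
    seen = set()
    term = seed_term
    for _ in range(rounds):
        for snippet in engine.search(term):
            if snippet not in seen:  # count each distinct snippet once
                seen.add(snippet)
                model.update(snippet.lower().split())
        # Next query: a term drawn at random from what has been learned so far.
        term = rng.choice(sorted(model)) if model else seed_term
    return model


docs = [
    "peer to peer search distributes the index over many peers",
    "search engines run in large data centers",
    "data centers need cooling and power",
]
model = sample_by_snippets(ToyEngine(docs), "search", rounds=20,
                           rng=random.Random(0))
```

No document is ever downloaded: everything in `model` comes from text the engine would show on its result page anyway.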

34/50

DO SAMPLES RESEMBLE THE FULL INDEX?

35/50


36/50


37/50

CAN WE DO BETTER THAN RANDOM?

38/50

CAN WE DO BETTER THAN RANDOM?

39/50

DISCUSSION

1. Sampling snippets is as effective as sampling full documents

2. Can be done at no extra cost (!)

3. Random sampling is an effective strategy

40/50

OVERVIEW

1. Caching in P2P networks

2. Query-based sampling using snippets

3. Deep web querying

41/50

DEEP WEB QUERYING

Opportunity: while we are sending queries to search engines directly... we might as well search the deep web!

42/50

YOUR TYPICAL DEEP WEB SITE

http://www.ns.nl

43/50

NATURAL LANGUAGE QUERYING

44/50

EASY TO SPECIFY

45/50

USER STUDY

46/50

USER STUDY

47/50

USER STUDY

A = from

B = to

V = via

D = date

T = time
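The field labels in the legend above (from, to, via, date, time) suggest how a free-text query might be mapped onto a web form. The sketch below is not the system from the talk; it is a minimal illustration that assumes English keywords ("from", "to", "via", "on", "at") and one-word place names. `PATTERNS` and `parse_query` are hypothetical names.

```python
import re

# Field labels follow the legend: A = from, B = to, V = via, D = date, T = time.
# The keyword patterns are assumptions made for this illustration.
PATTERNS = {
    "A": r"\bfrom\s+(\w+)",
    "B": r"\bto\s+(\w+)",
    "V": r"\bvia\s+(\w+)",
    "D": r"\bon\s+(\w+)",
    "T": r"\bat\s+(\d{1,2}[:.]\d{2})",
}


def parse_query(text):
    """Map a free-text travel query onto form fields; missing fields are omitted."""
    fields = {}
    for label, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            fields[label] = match.group(1)
    return fields


form = parse_query("from Enschede to Amsterdam via Utrecht on Friday at 9:30")
partial = parse_query("to Moscow")
```

Fixed keyword patterns like these only cover one way of phrasing a request, so a practical system would need to accept many more variants.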

48/50

DISCUSSION

1. Users like the interface

2. Users perform the tasks faster

3. Considerable query variation between subjects: no “one size fits all”!

49/50

CONCLUSIONS

Peer-to-peer is a viable approach to large-scale search

Peer-to-peer search will make Google, Yahoo, Bing and Yandex irrelevant ;)

50/50

PUBLICATIONS

Almer Tigelaar, Djoerd Hiemstra, and Dolf Trieschnigg, Search Result Caching in P2P Information Retrieval Networks, Proceedings of the 2nd Information Retrieval Facility Conference (IRFC), 2011.

Almer Tigelaar and Djoerd Hiemstra, Query-Based Sampling using Snippets, Proceedings of the SIGIR 2010 Workshop on Large-Scale Distributed Systems for Information Retrieval, 2010.

Kien Tjin-Kam-Jet, Dolf Trieschnigg, and Djoerd Hiemstra, Free-Text Search versus Complex Web Forms, Proceedings of the European Conference on Information Retrieval (ECIR), 2011.

51/50

ACKNOWLEDGEMENTS

Netherlands Organization for Scientific Research

Almer Tigelaar

Kien Tjin-Kam-Jet

Dolf Trieschnigg

53/50

“MAIL” RESULTS FROM YANDEX ?
