Improving the Efficiency of Multi-site Web Search Engines
Xiao Bai (xbai@yahoo-inc.com), Yahoo Labs
Joint work with Guillem Francès Medina, B. Barla Cambazoglu and Ricardo Baeza-Yates
July 15, 2014



Web search is difficult

›  Size of the Web
   •  130+ billion pages
   •  Constantly changing
›  Cost of data centers
   •  Hardware investment
   •  Energy consumption
›  Diversity of users
   •  Different information needs
   •  Little patience

Multi-site Web search engine

›  Fully replicated index
   •  Easy to implement
   •  Vertical scalability
›  Partially replicated index
   •  Faster response
   •  Horizontal scalability

Challenges in multi-site web search

›  Distributed Web crawling
   •  Which site is the best to crawl a page?
›  Index partitioning
   •  Which site is the best to index a page?
›  Query forwarding
   •  Which sites contain the best-matching pages?
›  Index replication
   •  Which pages to replicate in each site?
›  Distributed result caching
   •  Which site is the best to cache a query result?

[Figure: the Web feeds sites 1 to N; each site has its own crawler, indexer, and query processor.]

Goals: improve query locality and reduce query response time

System architecture

›  Geographically distributed search sites
›  Document-based index partition
   •  Easy to build, good load balancing, better fault tolerance
   •  One document is indexed in only one site
   •  Partitioned by document language, domain, server IP, etc.

Query forwarding

[Figure: query Q = w1 asking for the top-2 results. Four lists of scored documents match term w1: (d1, 0.87) (d2, 0.59) (d3, 0.32); (d4, 0.25) (d5, 0.18); (d7, 0.70) (d8, 0.23) (d9, 0.07); (d10, 0.11).]

›  Accurate query forwarding is important
   •  False negative forwarding decreases result quality
   •  False positive forwarding increases query response time and system workload
›  Threshold-based algorithms
   •  Thresholds are periodically exchanged or learnt from previous query processing
›  Improve result quality by serving documents indexed in remote sites
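To make the notions of false negative and false positive forwarding concrete, here is a minimal Python sketch built on the scores from the slide's example. The site names `S1` to `S4` and the helper names are assumptions for illustration; the slide does not label which list belongs to which site.

```python
import heapq

# (document, score) lists for term w1, copied from the slide's example.
# The site labels S1..S4 are hypothetical.
SITE_RESULTS = {
    "S1": [("d1", 0.87), ("d2", 0.59), ("d3", 0.32)],
    "S2": [("d4", 0.25), ("d5", 0.18)],
    "S3": [("d7", 0.70), ("d8", 0.23), ("d9", 0.07)],
    "S4": [("d10", 0.11)],
}


def global_top_k(site_results, k):
    """Merge all per-site lists and keep the k highest-scoring documents."""
    tagged = [
        (score, doc, site)
        for site, hits in site_results.items()
        for doc, score in hits
    ]
    return heapq.nlargest(k, tagged)


def sites_needed(site_results, local_site, k):
    """Remote sites that hold at least one document of the global top-k."""
    owners = {site for _, _, site in global_top_k(site_results, k)}
    return owners - {local_site}


def forwarding_errors(decided, needed):
    """Compare a forwarder's decision with the ideal set of sites."""
    return {
        "false_negatives": needed - decided,  # missed sites: lower result quality
        "false_positives": decided - needed,  # useless forwards: extra latency and load
    }
```

With the query issued at `S1` and k = 2, the global top-2 is d1 (held by `S1`) and d7 (held by `S3`), so the only useful forward is to `S3`. A forwarder that instead contacts `S2` makes one false positive and one false negative at once.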

Machine-learned query forwarder

§  m×m binary classifiers, one for each pair of sites
§  A single classifier
   ›  Pre-retrieval and post-retrieval classifiers: 100 decision trees
   ›  Pre-retrieval confidence threshold C

[Figure: ML query forwarder. For query q, the pre-retrieval classifier takes the pre-retrieval features and outputs <F, c>, a forwarding decision F with confidence c. If c > C (Yes), F is used. Otherwise (No), the local query processor supplies the local query score and the post-retrieval classifier produces F.]

Pre-retrieval features: term lengths, term IDFs, term scores, query language, query popularity, query performance

Post-retrieval features: local query score, LP forwarder decision
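The two-stage decision can be sketched as a small cascade. This is only an illustration of the control flow described on the slide, not the authors' implementation: the classifiers are passed in as plain callables, and the `toy_*` functions, site names, and numeric values are hypothetical stand-ins for the trained decision-tree models.

```python
from typing import Callable, FrozenSet, Tuple

Sites = FrozenSet[str]  # set of remote sites a query is forwarded to


def forward_decision(
    query: str,
    pre_classifier: Callable[[str], Tuple[Sites, float]],
    local_query_score: Callable[[str], float],
    post_classifier: Callable[[str, float], Sites],
    confidence_threshold: float,
) -> Tuple[Sites, str]:
    """Return the forwarding decision F and the stage that produced it."""
    decision, confidence = pre_classifier(query)
    if confidence > confidence_threshold:  # c > C: trust the pre-retrieval stage
        return decision, "pre-retrieval"
    # Otherwise process the query locally and decide with the local score.
    score = local_query_score(query)
    return post_classifier(query, score), "post-retrieval"


# Hypothetical stand-ins for the trained models.
def toy_pre(query: str) -> Tuple[Sites, float]:
    if len(query.split()) == 1:
        return frozenset({"S3"}), 0.9
    return frozenset(), 0.4


def toy_local_score(query: str) -> float:
    return 0.25


def toy_post(query: str, score: float) -> Sites:
    return frozenset({"S1"}) if score < 0.5 else frozenset()
```

For example, `forward_decision("w1", toy_pre, toy_local_score, toy_post, 0.8)` is settled by the pre-retrieval stage, while a two-term query falls through to the post-retrieval stage. The sketch leaves out the LP forwarder decision, which the slide lists as a second post-retrieval feature.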

Performance of machine-learned query forwarder

›  Accuracy
   •  1 - FN - FP
   [Plot: accuracy of the machine-learned forwarder vs. the linear-programming baseline]
›  Query locality
   •  Fraction of queries without forwarding
   [Plot: query locality of the machine-learned forwarder vs. the linear-programming baseline and the Oracle]

›  5 sites: 200M web pages + 5M training queries + 2M test queries
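The two measures can be computed from per-query forwarding decisions as in the sketch below. It assumes that FN and FP are rates over all test queries and that each query carries a single forward-or-not label; the function and argument names are illustrative.

```python
def forwarding_metrics(predicted, needed):
    """Accuracy (1 - FN - FP) and query locality for a list of queries.

    predicted[i]: True if the forwarder forwards query i.
    needed[i]:    True if query i really needs remote results.
    """
    n = len(needed)
    fn = sum(1 for p, t in zip(predicted, needed) if t and not p) / n
    fp = sum(1 for p, t in zip(predicted, needed) if p and not t) / n
    locality = sum(1 for p in predicted if not p) / n  # answered without forwarding
    return {"fn": fn, "fp": fp, "accuracy": 1 - fn - fp, "locality": locality}
```

For four queries with `predicted = [True, False, False, True]` and `needed = [True, True, False, False]`, one query is a false negative and one a false positive, giving an accuracy of 0.5 and a query locality of 0.5.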

Document replication

›  Objective
   •  Select the replicated subset of documents that maximizes the fraction of queries whose top-k best-matching documents are all indexed in the local site, under a given budget
›  Replication strategies
   •  Identical replication
      –  Global budget
   •  Individual replication
      –  Global budget
   •  Individual replication
      –  Local budget

[Figure: the three strategies side by side (Identical, Individual + Global budget, Individual + Local budget); labels include R(q), Si, and the budgets b%, b%, b%/4.]
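The objective can be evaluated for one site with a few lines of Python. This is a sketch under stated assumptions: queries are weighted by their frequency, and all names (`query_freq`, `top_k_docs`, and so on) are illustrative rather than taken from the talk.

```python
def query_locality(query_freq, top_k_docs, local_docs, replicated):
    """Fraction of query volume whose top-k documents are all available locally.

    query_freq: query -> number of times it was issued at this site
    top_k_docs: query -> list of its top-k best-matching documents
    local_docs: documents indexed at this site
    replicated: documents replicated to this site
    """
    available = set(local_docs) | set(replicated)
    total = sum(query_freq.values())
    local = sum(
        freq
        for query, freq in query_freq.items()
        if set(top_k_docs[query]) <= available
    )
    return local / total
```

For instance, if a site indexes d1, d2, d3 and receives q1 (top-2: d1, d7) three times and q2 (top-2: d4, d5) once, replicating d7 alone raises its locality from 0 to 0.75.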

Document selection heuristics

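›  0-1 knapsack problem
   •  Utility of document $d \in D \setminus D_i$ for site $S_i$:

$$
u_i(d) = \sum_{q_j \in Q_i}
\begin{cases}
\dfrac{\mathrm{freq}(q_j)}{R_i(q_j)} \times s(d), & \text{if } d \in R(q_j) \\
0, & \text{otherwise}
\end{cases}
$$

›  Application in different document replication strategies

Strategy                      Utility                                   Budget
Identical                     $u(d) = \sum_{1 \le i \le m} u_i(d)$      $\frac{b}{m}\% \times \mathrm{size}(D)$
Individual + Global budget    $u_i(d)$, $S_i$                           $b\% \times \mathrm{size}(D)$
Individual + Local budget     $u_i(d)$                                  $b\% \times \mathrm{size}(D_i)$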
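Once utilities are known, a 0-1 knapsack can be approximated greedily by taking documents in decreasing order of utility per unit of size. The sketch below uses that standard greedy heuristic. It is not necessarily the heuristic used in the talk, and the utility values, document sizes, and function names are illustrative assumptions.

```python
def identical_utility(per_site_utility):
    """u(d) = sum over sites of u_i(d), as used for identical replication."""
    total = {}
    for site_utility in per_site_utility.values():
        for doc, utility in site_utility.items():
            total[doc] = total.get(doc, 0.0) + utility
    return total


def greedy_select(utility, size, budget):
    """Pick documents by decreasing utility density until the budget is used.

    utility: doc -> utility value
    size:    doc -> index size of the document (same unit as budget)
    """
    ranked = sorted(utility, key=lambda d: utility[d] / size[d], reverse=True)
    chosen, used = [], 0
    for doc in ranked:
        if utility[doc] <= 0:
            continue  # replicating it cannot make any query local
        if used + size[doc] <= budget:
            chosen.append(doc)
            used += size[doc]
    return chosen
```

For the individual strategies the same routine would run over per-site utilities $u_i(d)$, either once per site with a local budget or over (document, site) pairs with a single global budget.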

Performance of document replication

›  Impact on query forwarding
   •  Query locality
›  Comparison of different strategies
   •  Query locality

[Plot legend: Individual+Local, Identical, Individual+Global]

Result caching

›  Improve query locality by caching previously processed query results
›  Basic assumptions
   •  Cache with “unlimited” size
   •  TTL-based invalidation
›  Cache strategies: where to cache a query?
   •  Local cache (state of the art)
      –  Mechanism
         §  A query is cached in the site it is issued to
         §  Local TTL
      –  Pros
         §  Easy to implement
      –  Cons
         §  Redundant processing for popular queries
   •  Global cache
      –  Mechanism
         §  A query is cached in all sites
         §  TTL w.r.t. the local site receiving the query
      –  Pros
         §  Highest cache hit rate
      –  Cons
         §  Redundant transmission among sites

Result caching strategies

›  Cache strategies: where to cache a query?
   •  Partial cache
      –  Mechanism
         §  A query is cached in the site it is issued to and the sites it is forwarded to
         §  TTL w.r.t. the local site receiving the query
      –  Pros
         §  Reduces redundant transmission among sites
      –  Cons
         §  Reduces cache hit rate
   •  Forward cache
      –  Mechanism
         §  A query is cached in the site it is issued to
         §  A pointer to the query is cached in the site receiving the forwarded query
         §  TTL w.r.t. the local site receiving the query
      –  Pros
         §  Further reduces redundant transmission
      –  Cons
         §  Increases query response time
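The four strategies differ only in where a result (or a pointer to it) is stored after the local site has processed a query. The following Python sketch models that placement with an unbounded, TTL-invalidated cache per site. The class, its method names, and the site labels are assumptions for illustration, not the system described in the talk.

```python
from typing import Dict, List, NamedTuple, Optional


class Entry(NamedTuple):
    result: Optional[List[str]]  # cached results, or None for a pointer entry
    pointer: Optional[str]       # site holding the results (forward cache only)
    expiry: float                # set from the local site's processing time


class MultiSiteCache:
    STRATEGIES = {"local", "global", "partial", "forward"}

    def __init__(self, sites, strategy, ttl):
        if strategy not in self.STRATEGIES:
            raise ValueError(f"unknown strategy: {strategy}")
        self.sites = list(sites)
        self.strategy = strategy
        self.ttl = ttl
        self.store: Dict[str, Dict[str, Entry]] = {s: {} for s in self.sites}

    def insert(self, local_site, forwarded_to, query, result, now):
        """Cache a query processed at local_site and forwarded to other sites."""
        expiry = now + self.ttl
        full = Entry(result, None, expiry)
        self.store[local_site][query] = full
        if self.strategy == "global":
            for site in self.sites:          # every site gets the full result
                self.store[site][query] = full
        elif self.strategy == "partial":
            for site in forwarded_to:        # only sites that saw the query
                self.store[site][query] = full
        elif self.strategy == "forward":
            for site in forwarded_to:        # pointer back to the local site
                self.store[site][query] = Entry(None, local_site, expiry)

    def lookup(self, site, query, now):
        """Return cached results, following one pointer hop if needed."""
        entry = self.store[site].get(query)
        if entry is None or entry.expiry <= now:
            return None
        if entry.pointer is not None:
            target = self.store[entry.pointer].get(query)
            if target is None or target.expiry <= now:
                return None
            return target.result             # costs an extra inter-site hop
        return entry.result
```

After `insert("S1", ["S2"], "w1", ["d1", "d7"], now=0)`, a later lookup at `S3` hits only under the global strategy, while a lookup at `S2` hits under global, partial, and forward, the last one through the pointer to `S1`.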

Performance of result caching

›  Comparison of different strategies
   •  Cache hit rate
›  Impact of global cache
   •  Query locality

User experience

›  Result quality
   •  Centralized top-10 as ground truth
      –  Overlap@p (O@p)
      –  NDCG@p (N@p)
      –  ExactMatchRate@p (E@p)
   •  Caching has no impact on result quality
›  Query response time
   •  Estimation
      –  Processing time: size of index
      –  Transmission time: geographical distance
   •  Setting
      –  Identical replication: 8%
      –  Global cache: TTL=2h
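Two of the quality measures can be sketched directly against a centralized ranking. The definitions below are common readings of the metric names and are assumptions, since the slide does not spell them out: Overlap@p is the share of the centralized top-p that the multi-site engine also returns in its top-p, and ExactMatchRate@p is the fraction of queries whose top-p lists are identical, order included. NDCG@p is left out because the slide does not say how gains are assigned.

```python
def overlap_at_p(result, truth, p):
    """Share of the centralized top-p found in the multi-site top-p."""
    return len(set(result[:p]) & set(truth[:p])) / p


def exact_match_at_p(result, truth, p):
    """True if both top-p lists contain the same documents in the same order."""
    return result[:p] == truth[:p]


def exact_match_rate(pairs, p):
    """Fraction of (result, truth) pairs that match exactly at depth p."""
    return sum(exact_match_at_p(r, t, p) for r, t in pairs) / len(pairs)
```

For a multi-site ranking `["d1", "d7", "d2"]` and a centralized ranking `["d1", "d2", "d7"]`, Overlap@2 is 0.5 and the lists match exactly at depth 1 but not at depth 2.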

Conclusions

›  First study on the interplay among the key components of multi-site search engines
›  Multi-site search engines are a very promising alternative to traditional search engines
   •  Query forwarding
      –  Almost the same query response time as the Oracle
      –  Result quality slightly below that of the Oracle
   •  Document replication
      –  Significantly reduces query response time
      –  Improves result quality
   •  Result caching
      –  Significantly reduces query response time
      –  No impact on result quality

Thank you!

Xiao Bai (xbai@yahoo-inc.com)
