type less, find more: fast autocompletion search with a succinct index

Type Less, Find More: Fast Autocompletion Search with a Succinct Index Holger Bast Max-Planck-Institut für Informatik Saarbrücken, Germany joint work with Ingmar Weber SIGIR 2006 in Seattle, USA, August 6 - 11


Page 1: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Holger Bast

Max-Planck-Institut für Informatik

Saarbrücken, Germany

joint work with Ingmar Weber

SIGIR 2006 in Seattle, USA, August 6 - 11

Page 2: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Basic Autocompletion

– saves typing

– no more information than necessary

salton

– find out about formulations used

autocomplete, autocompose

– error correction

autocomplit, autocompleet

It's useful

Page 3: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

It's more useful

Complete to phrases

– phrase voronoi diagram → add word voronoi_diagram to index

Complete to subwords

– compound word eigenproblem → add word problem to index

Complete to category names

– author Börkur Sigurbjörnsson → add sigurbjörnson:börkur::author börkur::sigurbjörnson:author

Faceted search

– add ct:conference:sigir

– add ct:author:Börkur_Sigurbjörnson

– add ct:year:2005

all via the same mechanism

Workshop on Faceted Search on Thursday
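The "same mechanism" idea can be illustrated with a small sketch: phrases and facets become ordinary (artificial) words in the vocabulary, so plain prefix completion finds them. The helper names `phrase_word`, `facet_word`, and `completions` are hypothetical, and the token formats are only inferred from the examples on this slide.

```python
def phrase_word(words):
    # Assumption: a phrase is indexed as one artificial word joined by "_".
    return "_".join(words)

def facet_word(category, value):
    # Assumption: facets use the "ct:<category>:<value>" pattern from the slide.
    return f"ct:{category}:{value.replace(' ', '_')}"

def completions(vocab, prefix):
    # Plain prefix completion over the vocabulary, artificial words included.
    return sorted(w for w in vocab if w.startswith(prefix))

vocab = {
    "voronoi",
    "diagram",
    phrase_word(["voronoi", "diagram"]),
    facet_word("conference", "sigir"),
    facet_word("author", "Börkur Sigurbjörnson"),
    facet_word("year", "2005"),
}

print(completions(vocab, "voronoi"))
print(completions(vocab, "ct:author:"))
```

Because the artificial words live in the same vocabulary as ordinary words, no separate code path is needed for phrase completion or faceted search in this sketch.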

Page 4: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Related Engines

Page 5: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Related Engines

Page 6: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Basic Problem Definition

Query

– a set D of documents (= hits for the first part of the query)

– a range W of words (= potential completions of last word)

Answer

– all documents D' from D, containing a word from W

– all words W' from W, contained in a document from D

Extensions (see paper)

– ranking (best hits from D' and best completions from W')

– positional information (proximity queries)
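The query/answer definition above can be written down as a brute-force reference, useful mainly for checking faster implementations. This is a sketch only: the function name `autocomplete` and the `doc_words` mapping (doc id to the set of words in that document) are assumptions made for illustration, and the data is a toy example.

```python
def autocomplete(D, W, doc_words):
    """Return (D', W') for a set D of doc ids and a set W of candidate words."""
    # D': documents from D that contain at least one word from W
    D_prime = sorted(d for d in D if doc_words[d] & W)
    # W': words from W that occur in at least one document from D
    W_prime = sorted(w for w in W if any(w in doc_words[d] for d in D))
    return D_prime, W_prime

# Toy collection (hypothetical contents)
doc_words = {
    3: {"sigir", "salton"},
    16: {"sigir", "forum"},
    18: {"sigir", "salton", "salvador"},
    21: {"salvador"},
}
D = {3, 16, 18}                       # hits for the first part of the query
W = {"salary", "salton", "salvador"}  # potential completions of the last word

print(autocomplete(D, W, doc_words))
```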

First try: inverted index (INV)

Page 7: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Processing 1-word queries with INV

For example, sigir*

D all documents

W all words matching sigir*

Iterate over all words from W

sigir Doc. 18, Doc. 53, Doc. 591, ...

sigir03 Doc. 3, Doc. 66, Doc. 765, ...

sigir04 Doc. 25, Doc. 98, Doc. 221, ...

sigirlist Doc. 67, Doc. 189, Doc. 221, ...

sigirforum Doc. 16, Doc. 110, Doc. 141, ...

Merge the document lists

D' Doc. 3, Doc. 16, Doc. 18, Doc. 25, …

Output all words from range as completions

W' sigir, sigir03, sigir04, sigirlist, …

Expensive!

Trivial for 1-word queries
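A minimal sketch of the 1-word case with an inverted index, assuming a sorted vocabulary and one sorted doc id list per word. The index contents reuse the doc ids shown on the slide (truncated where the slide has "..."), `sigmod` is an extra hypothetical word, and the helper names are made up for this example.

```python
import heapq
from bisect import bisect_left

# Toy inverted index: word -> sorted list of doc ids
INDEX = {
    "sigir": [18, 53, 591],
    "sigir03": [3, 66, 765],
    "sigir04": [25, 98, 221],
    "sigirforum": [16, 110, 141],
    "sigirlist": [67, 189, 221],
    "sigmod": [5, 18],
}
VOCAB = sorted(INDEX)

def word_range(prefix):
    # All words with the given prefix form a contiguous range of the sorted vocabulary.
    lo = bisect_left(VOCAB, prefix)
    hi = lo
    while hi < len(VOCAB) and VOCAB[hi].startswith(prefix):
        hi += 1
    return VOCAB[lo:hi]

def inv_one_word(prefix):
    W = word_range(prefix)
    # The expensive step: a multi-way merge of all lists in the range.
    D_prime = []
    for d in heapq.merge(*(INDEX[w] for w in W)):
        if not D_prime or D_prime[-1] != d:  # drop duplicates
            D_prime.append(d)
    # The trivial step: every word in the range is a completion.
    return D_prime, W

D_prime, W_prime = inv_one_word("sigir")
print(D_prime[:4], W_prime)
```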

Page 8: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Processing multi-word queries with INV

For example, sigir* sal*

D Doc. 3, Doc. 16, Doc. 18, Doc. 25, … (hits for sigir*)

W all words matching sal*

Iterate over all words from W

salary Doc. 8, Doc. 23, Doc. 291, ...

salesman Doc. 24, Doc. 36, Doc. 165, ...

salton Doc. 3, Doc. 18, Doc. 66, ...

salutation Doc. 56, Doc. 129, Doc. 251, ...

salvador Doc. 18, Doc. 21, Doc. 25, ...

Intersect each list with D, then merge

D' Doc. 3, Doc. 18, Doc. 25, …

Output all words with non-empty intersection

W' salton, salvador

Most intersections are empty, but INV has to compute them all!
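The multi-word case can be sketched as follows, using the lists from this slide (truncated to the doc ids shown). The function names and the `intersections` counter are illustrative assumptions; the counter just makes visible that one intersection is computed per potential completion, whether or not it turns out empty.

```python
def intersect(a, b):
    """Intersect two sorted doc id lists with a linear two-pointer scan."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def inv_multi_word(D, lists):
    hits, completions, intersections = set(), [], 0
    for word, docs in lists.items():
        intersections += 1            # one intersection per potential completion
        common = intersect(D, docs)
        if common:                    # only non-empty intersections contribute
            completions.append(word)
            hits.update(common)
    return sorted(hits), completions, intersections

SAL = {
    "salary": [8, 23, 291],
    "salesman": [24, 36, 165],
    "salton": [3, 18, 66],
    "salutation": [56, 129, 251],
    "salvador": [18, 21, 25],
}
D = [3, 16, 18, 25]  # hits for sigir*

print(inv_multi_word(D, SAL))
```

In this toy run, three of the five intersections are empty, yet all five are computed.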

Page 9: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

INV — Problems

Asymptotic time complexity is bad (for our problem)

– many intersections (one per potential completion)

– has to merge/sort (the non-empty intersections)

Still hard to beat INV in practice

– highly compressible

half the space on disk means half the time to read it

– INV has very good locality of access

the ratio random access time/sequential access time is 50,000 for disk, and still 100 for main memory

– simple code

instruction cache, branch prediction, etc.

Page 10: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

A Hybrid Index (HYB)

But this looks very wasteful

Basic Idea: have lists for ranges of words

salary – salvador Doc. 3, Doc. 16, Doc. 18, Doc. 25, ...

Problem: not enough to show completions

Solution: store the word(s) along with each doc id

salary – salvador Doc. 3, Doc. 16, Doc. 18, Doc. 25, ...

salary salvador salton salary

salton salvador
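A rough sketch of how a query could be answered from one such block, assuming the block is stored as (doc id, word) pairs sorted by doc id. The block contents below are hypothetical, and using a Python set for D is a simplification chosen for brevity, not a claim about the actual implementation.

```python
def hyb_query(D, block):
    """One sequential pass over a block of (doc id, word) pairs sorted by doc id."""
    D_set = set(D)
    hits, completions = [], set()
    for doc, word in block:
        if doc in D_set:
            if not hits or hits[-1] != doc:  # block is sorted, so hits stay sorted
                hits.append(doc)
            completions.add(word)
    return hits, sorted(completions)

# Hypothetical block for the word range salary - salvador
BLOCK = [
    (3, "salton"), (8, "salary"), (18, "salton"), (18, "salvador"),
    (21, "salvador"), (23, "salary"), (25, "salvador"),
]
D = [3, 16, 18, 25]

print(hyb_query(D, BLOCK))
```

The point of the sketch is that a single scan yields both the hits and the completions, with no per-word intersection and no merge step.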

Page 11: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

HYB — Details

HYB has a block for each word range, conceptually:

doc ids: 1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
words:   D A C A B A C A D A A B C A C A

Replace doc ids by gaps and words by frequency ranks:

gaps:  +1 +2 +0 +2 +0 +1 +1 +1 +0 +1 +2 +0 +0 +1 +1 +2
ranks: 3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st

Encode both gaps and ranks such that a value x takes about log2 x bits:

gap codes:  +0 → 0, +1 → 10, +2 → 110
rank codes: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110

encoded gaps:  10 110 0 110 0 10 10 10 0 10 110 0 0 10 10 110
encoded ranks: 111 0 10 0 110 0 10 0 111 0 0 110 10 0 10 0

An actual block of HYB

How well does it compress? Which block size?
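The block from this slide can be reproduced with a few lines of code. This is a sketch: the two code tables are copied from the slide's example rather than derived from a general universal code, and breaking frequency ties by first occurrence is an assumption that happens to match the slide (D before B).

```python
from collections import Counter

# Code tables taken from the slide's example
GAP_CODE = {0: "0", 1: "10", 2: "110"}
RANK_CODE = {1: "0", 2: "10", 3: "111", 4: "110"}

def encode_block(doc_ids, words):
    # Doc ids -> gaps to the previous doc id (the first gap is relative to 0)
    gaps = [d - prev for prev, d in zip([0] + doc_ids, doc_ids)]
    # Words -> ranks by descending frequency within the block
    counts = Counter(words)
    by_freq = sorted(counts, key=lambda w: (-counts[w], words.index(w)))
    rank = {w: r for r, w in enumerate(by_freq, start=1)}
    ranks = [rank[w] for w in words]
    gap_bits = " ".join(GAP_CODE[g] for g in gaps)
    rank_bits = " ".join(RANK_CODE[r] for r in ranks)
    return gaps, ranks, gap_bits, rank_bits

doc_ids = [1, 3, 3, 5, 5, 6, 7, 8, 8, 9, 11, 11, 11, 12, 13, 15]
words = "D A C A B A C A D A A B C A C A".split()

gaps, ranks, gap_bits, rank_bits = encode_block(doc_ids, words)
print(gap_bits)
print(rank_bits)
```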

Page 12: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

INV vs. HYB — Space Consumption

Theorem: The empirical entropy of INV is Σ ni ∙ (1/ln 2 + log2(n/ni))

Theorem: The empirical entropy of HYB with block size ε∙n is Σ ni ∙ ((1+ε)/ln 2 + log2(n/ni))

ni = number of documents containing i-th word, n = number of documents

            MEDICINE          WIKIPEDIA          TREC .GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
            with positions    with positions     no positions
raw size    452 MB            7.4 GB             426 GB
INV         13 MB             0.48 GB            4.6 GB
HYB         14 MB             0.51 GB            4.9 GB

Nice match of theory and practice
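To get a feel for the two entropy formulas, they can be evaluated directly. The sketch below uses made-up document frequencies; the function names are assumptions, and the results are in bits.

```python
import math

def entropy_inv(n, doc_freqs):
    # Sum over words of ni * (1/ln 2 + log2(n/ni))
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)

def entropy_hyb(n, doc_freqs, eps):
    # Sum over words of ni * ((1 + eps)/ln 2 + log2(n/ni)), for block size eps * n
    return sum(ni * ((1 + eps) / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)

n = 8               # number of documents (toy value)
doc_freqs = [4, 2]  # ni for two words (toy values)

h_inv = entropy_inv(n, doc_freqs)
h_hyb = entropy_hyb(n, doc_freqs, eps=0.1)
print(h_inv, h_hyb, h_hyb - h_inv)
```

The two formulas differ only by ε ∙ Σ ni / ln 2, so for small ε the overhead of HYB over INV stays small, which is in line with the measured sizes in the table.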

Page 13: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

INV vs. HYB — Query Time

Experiment: type ordinary queries from left to right

– sig, sigi, sigir, sigir sal, sigir salt, sigir salto, sigir salton

            MEDICINE              WIKIPEDIA             TREC .GOV
            44,015 docs           2,866,503 docs        25,204,013 docs
            263,817 words         6,700,119 words       25,263,176 words
            5,732 real queries    100 random queries    50 TREC queries
            with proximity        with proximity        no proximity
INV  avg    0.03 secs             0.17 secs             0.58 secs
INV  max    0.38 secs             2.27 secs             16.83 secs
HYB  avg    0.003 secs            0.05 secs             0.11 secs
HYB  max    0.06 secs             0.49 secs             0.86 secs

HYB better by an order of magnitude

For the theoretical analysis, see the paper
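The left-to-right typing experiment can be simulated by expanding each query into the sequence of autocompletion queries a user would trigger. This is a sketch: the function name is hypothetical, and starting each word at three characters is an assumption inferred from the example sequence "sig, sigi, sigir, sigir sal, ...".

```python
def keystroke_queries(query, min_prefix=3):
    """Expand a query into the prefix queries issued while typing it left to right."""
    typed, out = [], []
    for word in query.split():
        # Words shorter than min_prefix are issued once, in full.
        start = min(min_prefix, len(word))
        for end in range(start, len(word) + 1):
            out.append(" ".join(typed + [word[:end]]))
        typed.append(word)
    return out

for q in keystroke_queries("sigir salton"):
    print(q)
```

Each generated query would then be timed separately, which corresponds to a per-keystroke average and maximum.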

Page 14: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

System Design — High Level View

Debugging such an application is hell!

Compute Server: C++

Web Server: PHP

User Client: JavaScript

Page 15: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Summary of Results

Properties of HYB

– highly compressible (just like INV)

– fast prefix-completion queries (perfect locality of access)

– fast indexing (no full inversion necessary)

Autocompletion and more

– phrase and subword completion, semantic completion, XML support, …

– faceted search (Workshop Talk on Thursday)

– efficient DB joins: author[sigir sigmod] (NEW)

all with one and the same (efficient) mechanism

Page 37: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

INV vs. HYB — Space Consumption

Definition: empirical entropy H = optimal number of bits

Theorem: H(INV) = Σ ni ∙ (1/ln 2 + log2(n/ni))

Theorem: The empirical entropy of HYB with block size ε∙n is Σ ni ∙ ((1+ε)/ln 2 + log2(n/ni))

ni = number of documents containing i-th word, n = number of documents

            MED BOOKS         WIKIPEDIA          TREC .GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
raw size    452 MB            7.4 GB             426 GB
INV         13 MB             0.48 GB            4.6 GB
HYB         14 MB             0.51 GB            4.9 GB

Perfect match of theory and practice

Page 38: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

INV vs. HYB — Space Consumption

Theorem: Entropy(INV) = Σ ni ∙ (1/ln 2 + log2(n/ni))

Theorem: Entropy(HYB) = Σ ni ∙ ((1+ε)/ln 2 + log2(n/ni))

We define a notion of empirical entropy in the paper, in terms of ni = number of documents containing i-th word, n = number of documents

            MED BOOKS         WIKIPEDIA          TREC .GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
raw size    452 MB            7.4 GB             426 GB
INV         13 MB             0.48 GB            4.6 GB
HYB         14 MB             0.51 GB            4.9 GB

Perfect match of theory and practice

Page 39: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

HYB vs. INV — Query Time

            MED BOOKS         WIKIPEDIA          TREC .GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
INV  avg    0.03 secs         0.17 secs          0.58 secs
INV  max    0.38 secs         2.27 secs          16.83 secs
HYB  avg    0.003 secs        0.05 secs          0.11 secs
HYB  max    0.06 secs         0.49 secs          0.86 secs

Page 40: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Processing a 1-word Query with INV

Processing a 1-word query, e.g., sigir*

1. Iterate over all words matching sigir*

sigir Doc. 18, Doc. 53, Doc. 591, ...

sigir03 Doc. 3, Doc. 66, Doc. 765, ...

sigir04 Doc. 25, Doc. 98, Doc. 221, ...

sigir05 Doc. 57, Doc. 99, Doc. 110, ...

sigirlist Doc. 67, Doc. 189, Doc. 221, ...

sigirforum Doc. 16, Doc. 110, Doc. 141, ...

2. Merge the document lists

Hits Doc. 3, Doc. 16, Doc. 18, ...

Completions sigir, sigir03, sigir04, sigir05, ...

Page 41: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Processing sigir* sal with INV

Iterate over all words matching sigir*

sigir Doc. 18, Doc. 53, Doc. 591, ...

sigir03 Doc. 3, Doc. 66, Doc. 765, ...

sigir04 Doc. 25, Doc. 98, Doc. 221, ...

sigirlist Doc. 67, Doc. 189, Doc. 221, ...

sigirforum Doc. 16, Doc. 110, Doc. 141, ...

Merge the document lists

Hits D' Doc. 3, Doc. 16, Doc. 18, …

Output all words from range as completions

Completions W' sigir, sigir03, sigir05, …

Expensive!

Trivial for 1-word queries

Page 42: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Using an Inverted Index (INV)

salary Doc. 18, Doc. 53, Doc. 591, ...

salesman Doc. 3, Doc. 66, Doc. 765, ...

salient Doc. 25, Doc. 98, Doc. 221, ...

salton Doc. 57, Doc. 99, Doc. 110, ...

salutation Doc. 67, Doc. 189, Doc. 221, ...

salvador Doc. 16, Doc. 110, Doc. 141, ...

salvucci Doc. 18, Doc. 25, Doc. 765, ...

salzberg Doc. 53, Doc. 121, Doc. 187, ...

D Doc. 57, Doc. 87, Doc. 110, ...

W salary - salzberg

D' Doc. 57, Doc. 110, ...

W' salton, salvador

Problem 1: one intersection per potential completion

Problem 2: merging of non-empty intersections

Page 43: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

HYB — Details

HYB has a block for each word range. One block of HYB:

document ids: 1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
words:        D A C A B A C A D A A B C A C A

gaps:               +1 +2 +0 +2 +0 +1 +1 +1 +0 +1 +2 +0 +0 +1 +1 +2
ranks by frequency: 3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st

universal encoding: small gaps/ranks => short codes

gap codes:  +0 → 0, +1 → 10, +2 → 110
rank codes: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110

encoded gaps:  10 110 0 110 0 10 10 10 0 10 110 0 0 10 10 110
encoded ranks: 111 0 10 0 110 0 10 0 111 0 0 110 10 0 10 0

Page 44: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

INV vs. HYB — Query Time

            MED BOOKS         WIKIPEDIA          TREC .GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
INV  avg    0.03 secs         0.17 secs          0.58 secs
INV  max    0.38 secs         2.27 secs          16.83 secs
HYB  avg    0.003 secs        0.05 secs          0.11 secs
HYB  max    0.06 secs         0.49 secs          0.86 secs

avg = average time per keystroke

max = maximum time per keystroke (outliers removed)

Page 47: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Start with DEMO

autocomp, sig, sigir, sigir sal, sal

Page 48: Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Related Search Engine Features

Complete from precompiled list of queries

– Google Suggest

– AllTheWeb Livesearch

– …

Desktop Search engines

– Apple Spotlight

– Copernic Desktop Search

– …