TRANSCRIPT
Compressed Data Structures for Annotated Web Search
Soumen Chakrabarti
Sasidhar Kasturi, Bharath Balakrishnan
Ganesh Ramakrishnan, Rohit Saraf
http://soumen.in/doc/CSAW/
2
Searching the annotated Web

Search engines increasingly supplement “ten blue links” using a Web of objects, drawing on object catalogs such as:
• WordNet: basic types and common entities
• Wikipedia: millions of entities
• Freebase: tens of millions of entities
• Product catalogs, LinkedIn, IMDB, Zagat …

Several new capabilities are required:
• Recognizing and disambiguating entity mentions
• Indexing these mentions along with text
• Query execution and entity ranking
3
Lemmas and entities

In (Web) text, noisy and ambiguous lemmas are used to mention entities.
• Lemma = word or phrase
• The lemma-to-entity relation is many-to-many
• Goal: given a mention in context, find the correct entity in the catalog, if any
• A lemma is also called a “leaf” because we use a trie to detect mention phrases (see the sketch after the figure below)
[Figure: many-to-many lemma-to-entity map. Lemmas such as “Michael”, “Michael Jordan”, “Jordan”, “Big Apple”, “New York City”, “New York” map to entities such as the basketball player, the Berkeley professor, a country, a river, the city that never sleeps, and a state in the USA.]
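Purely to illustrate the trie-based spotting mentioned above, here is a minimal sketch in Java; the class names, tokenization, and longest-match policy are assumptions, not the authors' spotter.

```java
import java.util.*;

/** Minimal sketch: detect lemma phrases in a token stream by longest match
 *  against a trie of catalog lemmas. Hypothetical; not the paper's spotter. */
class LemmaTrie {
    static final class Node {
        Map<String, Node> next = new HashMap<>();
        String lemma;                      // non-null iff a lemma ends here
    }
    private final Node root = new Node();

    void add(String lemma) {
        Node cur = root;
        for (String tok : lemma.toLowerCase().split("\\s+"))
            cur = cur.next.computeIfAbsent(tok, t -> new Node());
        cur.lemma = lemma;
    }

    /** Return the longest lemma starting at token position i, or null. */
    String longestMatch(List<String> tokens, int i) {
        Node cur = root;
        String best = null;
        for (int j = i; j < tokens.size(); j++) {
            cur = cur.next.get(tokens.get(j).toLowerCase());
            if (cur == null) break;
            if (cur.lemma != null) best = cur.lemma;
        }
        return best;
    }
}
```

Each match (a “spot”) then becomes a candidate mention whose surrounding context supplies the features used for disambiguation below.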
4
Features for disambiguation

“After the UNC workshop, Jordan gave a tutorial on nonparametric Bayesian methods.”
“After a three-season career at UNC, Jordan emerged as a league star with his leaping ability and slam dunks.”

[Figure: context words such as “after”, “UNC”, “workshop”, “tutorial”, “nonparametric”, “Bayesian”, “season”, “league”, “leap”, “slam”, “dunk” become features.]

Millions of features; each mention context yields a feature vector x.
5
Inferring the correct entity

Each lemma is associated with a set of candidate entities.
For each lemma ℓ and each candidate entity e, learn a weight vector w(ℓ,e) in the same space as the feature vectors.
When deployed to resolve an ambiguity about lemma ℓ with context feature vector x, choose the entity
  argmax_e  w(ℓ,e) · x
i.e., a linear model scored by a dot product.
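A minimal sketch of that argmax in Java; the sparse-map representation and names such as `bestEntity` are assumptions, not the paper's code.

```java
import java.util.Map;

/** Sketch: choose argmax_e w(ℓ,e)·x over the candidate entities of a lemma.
 *  Sparse vectors are featureId -> weight maps; all names are hypothetical. */
class LinearDisambiguator {
    static int bestEntity(Map<Integer, Map<Integer, Float>> candidates,  // e -> w(ℓ,e)
                          Map<Integer, Float> x) {                       // context features
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, Map<Integer, Float>> cand : candidates.entrySet()) {
            double score = 0;
            for (Map.Entry<Integer, Float> f : x.entrySet()) {
                Float w = cand.getValue().get(f.getKey());
                if (w != null) score += w * f.getValue();                // dot product
            }
            if (score > bestScore) { bestScore = score; best = cand.getKey(); }
        }
        return best;                                                     // highest-scoring entity
    }
}
```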
6
The (ℓ, f, e) → w map

• Uncompressed, each key + value takes 12 + 4 bytes = 128 bits per entry
• ~500M entries → 8 GB just for the map
• No primitive type can hold the keys; with Java overheads, easily 20 GB RAM
• What happens when we go from ~2M to ~100M entities?
• Total marginal entropy: 33.6 bits per entry
• Can we get from 128 down to 33.6 bits and beyond?
• Must compress keys and values, and exploit correlations between them
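To make the arithmetic concrete, here is the raw layout the slide describes, sketched in Java; the field names are illustrative, not the paper's.

```java
/** Sketch of the raw layout: a (lemma, feature, entity) key of three 32-bit
 *  ints (12 bytes) plus a 32-bit float weight (4 bytes) = 128 bits per entry.
 *  The key exceeds 64 bits, so no Java primitive (long) can hold it; a boxed
 *  key object adds the per-object overheads the slide mentions. */
final class LfeEntry {
    final int lemmaId;    // ℓ
    final int featureId;  // f
    final int entityId;   // e
    final float weight;   // w
    LfeEntry(int l, int f, int e, float w) {
        lemmaId = l; featureId = f; entityId = e; weight = w;
    }
}
```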
7
Lossy encoding: signed hash

• No need to remember ℓ, f, e at all
• w cannot be easily compressed (all buckets have the same size, for easy hash indexing)
• The sign hash ensures expected values are preserved
• We study the resulting value distortion and its effect on disambiguation accuracy

[Figure: to insert weight w for key (ℓ, f, e), hash function #1 picks a bucket, hash function #2 picks a sign ±1, and ±w is accumulated into the bucket.]
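A minimal sketch of the signed-hash insert and lookup shown in the figure above; the hash mixing constants and class name are placeholders, not the system's actual hash functions.

```java
/** Sketch of a signed hash map: keys (ℓ,f,e) are never stored; ±w is
 *  accumulated into one of B buckets. The expected value of a lookup equals
 *  the inserted w; collisions add zero-mean noise. Hash mixing is illustrative. */
class SignHashMap {
    private final float[] bucket;

    SignHashMap(int numBuckets) { bucket = new float[numBuckets]; }

    private static int mix(long key, long seed) {
        long h = key * 0x9E3779B97F4A7C15L ^ seed;
        h ^= h >>> 31;
        return (int) (h & 0x7FFFFFFF);
    }

    private static long key(int l, int f, int e) {
        return ((long) l << 40) ^ ((long) f << 20) ^ e;
    }

    private int index(int l, int f, int e) {
        return mix(key(l, f, e), 1) % bucket.length;    // hash function #1: bucket
    }

    private int sign(int l, int f, int e) {
        return (mix(key(l, f, e), 2) & 1) == 0 ? 1 : -1; // hash function #2: ±1
    }

    void insert(int l, int f, int e, float w) {
        bucket[index(l, f, e)] += sign(l, f, e) * w;     // accumulate ±w into bucket
    }

    float lookup(int l, int f, int e) {
        return sign(l, f, e) * bucket[index(l, f, e)];   // re-apply sign on read
    }
}
```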
“Training through the collisions”

Linear multiclass SVM:
• Each class e has a model vector w_e
• From a spot, generate a feature vector x
• Predicted class (entity) is argmax_e w_e · x

Sign hash the feature space into B buckets:
• Map x to its hashed image and predict by the same argmax over dot products in the hashed space
• Hashing loses information, but training the SVM through the collisions compensates for it — this is essential
9
Lossless (ℓ, f ) → {e → w} organization

When scanning documents for disambiguation, we first encounter lemma ℓ and then the features f from the context around it.

• Initialize a score accumulator for each candidate entity e
• For each feature f in the context:
  • Probe the data structure with (ℓ, f )
  • Retrieve the sparse map {e → w}
  • For each entry in the map, update the entity scores
• Choose the top candidate entity

[Figure: the “LFE map” (LFEM) maps each pair (ℓ, f ) — e.g. (ℓ1, f1), (ℓ1, f2), … — to a sparse {e → w} map.]
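A sketch of that scoring loop in Java, where `LfeMap.probe` stands in for the compressed probe described on the following slides; all names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of disambiguation with the LFE map: accumulate per-entity scores
 *  from the sparse {e -> w} maps returned by probes on (ℓ, f). The LfeMap
 *  interface is a placeholder for the compressed structure. */
class LfemScorer {
    interface LfeMap {
        /** Sparse {shortEntityId -> weight} for key (lemmaId, featureId); may be empty. */
        Map<Integer, Float> probe(int lemmaId, int featureId);
    }

    static int disambiguate(LfeMap lfem, int lemmaId, int[] contextFeatures) {
        Map<Integer, Double> score = new HashMap<>();          // e -> accumulated score
        for (int f : contextFeatures) {
            for (Map.Entry<Integer, Float> ew : lfem.probe(lemmaId, f).entrySet())
                score.merge(ew.getKey(), (double) ew.getValue(), Double::sum);
        }
        // Choose the top candidate entity (or -1 if nothing scored).
        return score.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(-1);
    }
}
```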
10
Short entity IDs

• Millions of entities globally, but only a few candidates for a given lemma
• Use variable-length integer codes: the most frequent candidate gets the shortest ID, hence the shortest code

[Figure: for the lemma “Michael Jordan”, candidate entities are sorted by decreasing occurrence frequency in a reference corpus — basketball player; CBS, PepsiCo, Westinghouse exec; machine learning researcher; mycologist; racing driver; goalkeeper — and assigned short entity IDs 0–5 with respect to the lemma.]
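The slides do not say which variable-length code was used; as one concrete example, here is a minimal Elias γ encoder, which gives the shortest code (a single bit) to the smallest value.

```java
/** Sketch: Elias gamma coding, one example of a variable-length integer code.
 *  Small values get short codes; the value 1 costs a single bit. The slides do
 *  not specify which code the system actually used. */
class EliasGamma {
    /** Gamma code of n (n >= 1), returned as a bit string for clarity.
     *  A short entity ID s would be encoded as n = s + 1. */
    static String encode(int n) {
        if (n < 1) throw new IllegalArgumentException("gamma code needs n >= 1");
        int len = 32 - Integer.numberOfLeadingZeros(n);      // number of bits in n
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < len; i++) sb.append('0');        // unary length prefix
        for (int i = len - 1; i >= 0; i--)                   // binary value, MSB first
            sb.append((n >>> i) & 1);
        return sb.toString();
    }
}
// encode(1) = "1", encode(2) = "010", encode(3) = "011", encode(4) = "00100"
```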
11
Encoding of (ℓ, f ) → {e → w}

• e is stored as a short ID
• We used one particular code; others may be better
• For adjacent short IDs, we spend only one bit
• Record sizes are irregular, so within a segment we must read from the beginning to decompress

[Figure: the compressed stream is a concatenation of segments, one per lemma ℓ, each holding its (f, {e → w}) records; an index points to the start of the segment for each lemma ID.]
12
Random access on (ℓ, f )

• We already support random access on ℓ; the number of distinct ℓ is O(10 million)
• Cannot afford the time to decompress from the beginning of an ℓ block
• Cannot afford a (full) index array for (ℓ, f )
• Within each ℓ block, allocate sync points — an old technique in IR indexing
• New issues:
  • Outer allocation: dividing the total sync budget among the ℓ blocks
  • Inner allocation: tuning syncs to the measured (ℓ, f ) probe distribution
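A sketch of how a probe could use sync points, assuming each sync records the first feature ID and bit offset of its sub-segment (the real bit-level layout is in the paper).

```java
import java.util.Arrays;

/** Sketch of random access within one lemma's compressed block using sync
 *  points: jump to the nearest sync at or before feature f, then decode
 *  records sequentially. Decoder and layout are placeholders. */
class SyncedBlock {
    // For each sync point: the first feature id at the sync and its bit offset.
    private final int[] syncFeature;   // sorted ascending
    private final long[] syncBitPos;

    SyncedBlock(int[] syncFeature, long[] syncBitPos) {
        this.syncFeature = syncFeature;
        this.syncBitPos = syncBitPos;
    }

    /** Bit position from which to start sequential decoding to reach feature f. */
    long seek(int f) {
        int i = Arrays.binarySearch(syncFeature, f);
        if (i < 0) i = -i - 2;                 // last sync with feature <= f
        if (i < 0) i = 0;                      // before the first sync: start of block
        return syncBitPos[i];
        // A sequential decoder then reads records from this offset until it
        // reaches feature f (or passes it, meaning f is absent).
    }
}
```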
13
Inner sync point allocation policies

Say K_ℓ sync points are budgeted to lemma ℓ. To which features can we seek directly? For all others, we decode sequentially from the preceding sync.

• DynProg: optimal expected probe time, computed with a dynamic program
• Freq: allocate syncs at the features f with the largest probe probability p(f | ℓ)
• Equi: mark off segments with roughly equal numbers of bits
• EquiAndFreq: split the budget between the two

[Figure: example placement of syncs among features f1 … f5 within a lemma's block.]
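A sketch of the Equi policy, under the assumption that we know the compressed size of each (f, {e → w}) record; restricting sync positions to record boundaries is my simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the Equi inner policy: place k syncs so that consecutive syncs
 *  are separated by roughly equal numbers of bits. recordBits[i] is the
 *  compressed size of the i-th (f, {e->w}) record in the lemma's block. */
class EquiAllocator {
    static List<Integer> placeSyncs(long[] recordBits, int k) {
        long totalBits = 0;
        for (long b : recordBits) totalBits += b;
        double stride = (double) totalBits / k;       // target bits between syncs
        List<Integer> syncAt = new ArrayList<>();     // record indices that get a sync
        long seen = 0;
        double nextThreshold = 0;
        for (int i = 0; i < recordBits.length && syncAt.size() < k; i++) {
            if (seen >= nextThreshold) {              // crossed the next equi boundary
                syncAt.add(i);
                nextThreshold += stride;
            }
            seen += recordBits[i];
        }
        return syncAt;
    }
}
```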
14
Outer allocation policies

Given an overall budget of K syncs, how many syncs K_ℓ does each leaf (lemma) get?
• Inputs per leaf: hit probability p_ℓ and bits in the leaf's segment b_ℓ
• An analytical expression for the effect of the inner allocation can be intractable

• Hit: K_ℓ ∝ p_ℓ
• HitBit: K_ℓ ∝ p_ℓ b_ℓ
• SqrtHitBit: K_ℓ ∝ √(p_ℓ b_ℓ), assuming an equispaced inner allocation
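One way to see where the square root comes from, reconstructed here under the equispaced-inner-allocation assumption stated above (this derivation is my reading, not quoted from the paper):

```latex
% With K_\ell equispaced syncs in a segment of b_\ell bits, a probe lands in a
% random sub-segment and decodes about half of it sequentially, so the expected
% decode cost of one probe into leaf \ell is roughly b_\ell / (2 K_\ell).
% Weighting by the hit probability p_\ell and minimizing subject to the budget:
\min_{\{K_\ell\}} \sum_\ell p_\ell \frac{b_\ell}{2 K_\ell}
\quad \text{s.t.} \quad \sum_\ell K_\ell = K
% Setting the Lagrangian's derivative to zero gives
% p_\ell b_\ell / (2 K_\ell^2) = \lambda for all \ell, hence
K_\ell \;\propto\; \sqrt{p_\ell\, b_\ell}
```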
15
Experiments

• 500 million pages, mostly English, spam-free
• Catalog has about two million lemmas and entities

[Figure: experimental pipeline. A sampler splits the corpus into train and test folds; a spotter extracts train and test contexts; a smoother produces the smoothed ℓ,f distribution used as the ℓ,f workload; a disambiguation trainer and cross-validator produces the ℓ,f → (e,w) model; a compressor builds the L-F-E map; an annotator emits entity and type annotations over the “Payload” corpus (with Wikipedia as “Reference”); an indexer builds the annotation index.]

Our best policy compresses the LFEM down to only 18 bits/entry, compared to 33.6 bits/entry marginal entropy and 128 bits/entry raw data.
16
Inner policies compared

• Equi is close to optimal DynProg but fast to compute
• Freq is surprisingly bad: the probe distribution has a long tail
• Blending Equi and Freq is worse than Equi alone
• The relative order is stable as the sample size increases: the long tail again

[Figure: lookup cost (µs, lower is better) for DynProg, Equi, EquiAndFreq, and Freq, plotted against the sync budget (0.5M–3M) and against the number of sample docs (2M–10M).]
17
Diagnosis: Freq vs. Equi

• The plots show cumulative seek cost starting at each sync, collapsing back to zero at the next sync
• The features with the largest frequencies are not evenly placed
• Tail features in between lead to steep seek costs
• Equi never lets the seek cost get out of hand
• (How about permuting features? See the paper)

[Figure: cumulative seek cost over ~8M features for Equi (up to ~400) and Freq (up to ~20,000) — note the difference in scales.]
18
Outer policies compared

• Inner policy set to the best one (DynProg)
• SqrtHitBit is better than Bit, which is better than HitBit
• Not surprising, given that DynProg behaves closer to Equi than to Freq

[Figure: probe cost (µs) vs. sync budget (0.5M–3M) for Bit, HitBit, and SqrtHitBit.]
SignHash, no training through collisions

• Build w from a separate lossless training run, then store it in SignHash, which distorts it
• Most model values are severely distorted
• Give lossless and SignHash the same RAM: most keys collide
• Completely unacceptable accuracy (random guessing is far better)

[Figure: distortion of model values (roughly ±200) across relative ranks; accuracy (0.3–0.9) and collision rate vs. number of hash buckets (400–1900 million).]
SignHash, training through collisions

• Used PEGASOS stochastic gradient descent for training
• 77% of spots have the label “NA” (no annotation), so choosing NA for all spots gives 23% error
• 11% error via the lossless LFEM
• SignHash, given the same RAM as LFEM: 18% error
• Much better than no training, but a lot worse than lossless LFEM
• Surprising, given that LFEM currently uses plain old naïve Bayes
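A minimal sketch of a PEGASOS-style hinge-loss update applied directly in the sign-hashed space, so that training “sees” the collisions; the binary-class setup, hash functions, and step-size details are simplifications of mine, not the paper's multiclass trainer.

```java
/** Sketch of "training through the collisions": a PEGASOS-style SGD update on
 *  weights stored in the B-bucket sign-hashed space, so the learner adapts to
 *  collision noise. Binary labels and hashing constants are illustrative. */
class HashedPegasos {
    final float[] w;            // model lives in the hashed space
    final double lambda;        // regularization strength
    long t = 1;                 // update counter

    HashedPegasos(int buckets, double lambda) { w = new float[buckets]; this.lambda = lambda; }

    private int bucket(int feature) { return Math.floorMod(feature * 0x9E3779B1, w.length); }
    private int sign(int feature)   { return ((feature * 0x85EBCA6B) & 1) == 0 ? 1 : -1; }

    /** One PEGASOS step on example (features, y) with y in {-1, +1}. */
    void update(int[] features, int y) {
        double eta = 1.0 / (lambda * t++);              // PEGASOS step size
        double margin = 0;
        for (int f : features) margin += sign(f) * w[bucket(f)];
        for (int b = 0; b < w.length; b++) w[b] *= (1 - eta * lambda);  // shrink (regularize)
        if (y * margin < 1)                              // hinge loss is active
            for (int f : features) w[bucket(f)] += eta * y * sign(f);
    }
}
```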
21
Comparison with other systems

• Compared against downloaded software or network services
• A regression removes per-page and per-token overheads
• LFEM wins, largely because of syncs
• LFEM's RAM footprint is much smaller than that of the downloaded software

[Figure: page annotation time (ms) vs. number of spots per page, log-log, for Spotlight, WMiner, Zemanta, and LFEM.]

System      ms/spot
Spotlight   158
WMiner      21
Zemanta     9.5
LFEM        0.6
LFEM-sync   42
22
Conclusion

• Compressed in-memory multilevel maps for disambiguation
• Random access via tuned sync allocation
• >20 GB down to 1.15 GB
• Faster than public disambiguation systems
• Annotated 500M pages with 2M Wikipedia entities, plus indexing, on 408 cores in ~18 hours
• Future work: sparse models for better storage?
• Also in the paper: design of the compressed annotation index posting list