TRANSCRIPT
Compressed Data Structures for Annotated Web Search
Soumen Chakrabarti
Sasidhar Kasturi, Bharath Balakrishnan
Ganesh Ramakrishnan, Rohit Saraf
http://soumen.in/doc/CSAW/
2
Searching the annotated Web

Search engines increasingly supplement “ten blue links” using a Web of objects, drawing on object catalogs such as:
• WordNet: basic types and common entities
• Wikipedia: millions of entities
• Freebase: tens of millions of entities
• Product catalogs, LinkedIn, IMDB, Zagat …

Several new capabilities are required:
• Recognizing and disambiguating entity mentions
• Indexing these mentions along with text
• Query execution and entity ranking
3
Lemmas and entities

In (Web) text, noisy and ambiguous lemmas are used to mention entities.
• Lemma = word or phrase
• The lemma-to-entity relation is many-to-many
• Goal: given a mention in context, find the correct entity in the catalog, if any
• A lemma is also called a “leaf” because we use a trie to detect mention phrases (see the sketch after the figure below)
[Figure: many-to-many lemma-to-entity map. Lemmas such as “Michael”, “Michael Jordan”, “Jordan”, “Big Apple”, “New York City”, “New York” map to entities such as the basketball player, the Berkeley professor, a country, a river, the city that never sleeps, and a state in the USA.]
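Purely to illustrate the trie-based spotting mentioned above, here is a minimal sketch in Java; the class names, tokenization, and longest-match policy are assumptions, not the authors' spotter.

```java
import java.util.*;

/** Minimal sketch: detect lemma phrases in a token stream by longest match
 *  against a trie of catalog lemmas. Hypothetical; not the paper's spotter. */
class LemmaTrie {
    static final class Node {
        Map<String, Node> next = new HashMap<>();
        String lemma;                      // non-null iff a lemma ends here
    }
    private final Node root = new Node();

    void add(String lemma) {
        Node cur = root;
        for (String tok : lemma.toLowerCase().split("\\s+"))
            cur = cur.next.computeIfAbsent(tok, t -> new Node());
        cur.lemma = lemma;
    }

    /** Return the longest lemma starting at token position i, or null. */
    String longestMatch(List<String> tokens, int i) {
        Node cur = root;
        String best = null;
        for (int j = i; j < tokens.size(); j++) {
            cur = cur.next.get(tokens.get(j).toLowerCase());
            if (cur == null) break;
            if (cur.lemma != null) best = cur.lemma;
        }
        return best;
    }
}
```

Each match (a “spot”) then becomes a candidate mention whose surrounding context supplies the features used for disambiguation below.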
4
Features for disambiguation

“After the UNC workshop, Jordan gave a tutorial on nonparametric Bayesian methods.”
“After a three-season career at UNC, Jordan emerged as a league star with his leaping ability and slam dunks.”

[Figure: context words such as “after”, “UNC”, “workshop”, “tutorial”, “nonparametric”, “Bayesian”, “season”, “league”, “leap”, “slam”, “dunk” become features.]

Millions of features; each mention context yields a feature vector x.
5
Inferring the correct entity

Each lemma is associated with a set of candidate entities.
For each lemma ℓ and each candidate entity e, learn a weight vector w(ℓ,e) in the same space as the feature vectors.
When deployed to resolve an ambiguity about lemma ℓ with context feature vector x, choose the entity
  argmax_e  w(ℓ,e) · x
i.e., a linear model scored by a dot product.
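A minimal sketch of that argmax in Java; the sparse-map representation and names such as `bestEntity` are assumptions, not the paper's code.

```java
import java.util.Map;

/** Sketch: choose argmax_e w(ℓ,e)·x over the candidate entities of a lemma.
 *  Sparse vectors are featureId -> weight maps; all names are hypothetical. */
class LinearDisambiguator {
    static int bestEntity(Map<Integer, Map<Integer, Float>> candidates,  // e -> w(ℓ,e)
                          Map<Integer, Float> x) {                       // context features
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, Map<Integer, Float>> cand : candidates.entrySet()) {
            double score = 0;
            for (Map.Entry<Integer, Float> f : x.entrySet()) {
                Float w = cand.getValue().get(f.getKey());
                if (w != null) score += w * f.getValue();                // dot product
            }
            if (score > bestScore) { bestScore = score; best = cand.getKey(); }
        }
        return best;                                                     // highest-scoring entity
    }
}
```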
6
The (ℓ, f, e) → w map

• Uncompressed, each key + value takes 12 + 4 bytes = 128 bits per entry
• ~500M entries → 8 GB just for the map
• No primitive type can hold the keys; with Java overheads, easily 20 GB RAM
• What happens when we go from ~2M to ~100M entities?
• Total marginal entropy: 33.6 bits per entry
• Can we get from 128 down to 33.6 bits and beyond?
• Must compress keys and values, and exploit correlations between them
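To make the arithmetic concrete, here is the raw layout the slide describes, sketched in Java; the field names are illustrative, not the paper's.

```java
/** Sketch of the raw layout: a (lemma, feature, entity) key of three 32-bit
 *  ints (12 bytes) plus a 32-bit float weight (4 bytes) = 128 bits per entry.
 *  The key exceeds 64 bits, so no Java primitive (long) can hold it; a boxed
 *  key object adds the per-object overheads the slide mentions. */
final class LfeEntry {
    final int lemmaId;    // ℓ
    final int featureId;  // f
    final int entityId;   // e
    final float weight;   // w
    LfeEntry(int l, int f, int e, float w) {
        lemmaId = l; featureId = f; entityId = e; weight = w;
    }
}
```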
7
Lossy encoding: signed hash

• No need to remember ℓ, f, e at all
• w cannot be easily compressed (all buckets have the same size, for easy hash indexing)
• The sign hash ensures expected values are preserved
• We study the resulting value distortion and its effect on disambiguation accuracy

[Figure: to insert weight w for key (ℓ, f, e), hash function #1 picks a bucket, hash function #2 picks a sign ±1, and ±w is accumulated into the bucket.]
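A minimal sketch of the signed-hash insert and lookup shown in the figure above; the hash mixing constants and class name are placeholders, not the system's actual hash functions.

```java
/** Sketch of a signed hash map: keys (ℓ,f,e) are never stored; ±w is
 *  accumulated into one of B buckets. The expected value of a lookup equals
 *  the inserted w; collisions add zero-mean noise. Hash mixing is illustrative. */
class SignHashMap {
    private final float[] bucket;

    SignHashMap(int numBuckets) { bucket = new float[numBuckets]; }

    private static int mix(long key, long seed) {
        long h = key * 0x9E3779B97F4A7C15L ^ seed;
        h ^= h >>> 31;
        return (int) (h & 0x7FFFFFFF);
    }

    private static long key(int l, int f, int e) {
        return ((long) l << 40) ^ ((long) f << 20) ^ e;
    }

    private int index(int l, int f, int e) {
        return mix(key(l, f, e), 1) % bucket.length;    // hash function #1: bucket
    }

    private int sign(int l, int f, int e) {
        return (mix(key(l, f, e), 2) & 1) == 0 ? 1 : -1; // hash function #2: ±1
    }

    void insert(int l, int f, int e, float w) {
        bucket[index(l, f, e)] += sign(l, f, e) * w;     // accumulate ±w into bucket
    }

    float lookup(int l, int f, int e) {
        return sign(l, f, e) * bucket[index(l, f, e)];   // re-apply sign on read
    }
}
```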
“Training through the collisions”

Linear multiclass SVM:
• Each class e has a model vector w_e
• From a spot, generate a feature vector x
• Predicted class (entity) is argmax_e w_e · x

Sign hash the feature space into B buckets:
• Map x to its hashed image and predict by the same argmax over dot products in the hashed space
• Hashing loses information, but training the SVM through the collisions compensates for it — this is essential
9
Lossless (ℓ, f ) → {e → w} organization

When scanning documents for disambiguation, we first encounter lemma ℓ and then the features f from the context around it.

• Initialize a score accumulator for each candidate entity e
• For each feature f in the context:
  • Probe the data structure with (ℓ, f )
  • Retrieve the sparse map {e → w}
  • For each entry in the map, update the entity scores
• Choose the top candidate entity

[Figure: the “LFE map” (LFEM) maps each pair (ℓ, f ) — e.g. (ℓ1, f1), (ℓ1, f2), … — to a sparse {e → w} map.]
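A sketch of that scoring loop in Java, where `LfeMap.probe` stands in for the compressed probe described on the following slides; all names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of disambiguation with the LFE map: accumulate per-entity scores
 *  from the sparse {e -> w} maps returned by probes on (ℓ, f). The LfeMap
 *  interface is a placeholder for the compressed structure. */
class LfemScorer {
    interface LfeMap {
        /** Sparse {shortEntityId -> weight} for key (lemmaId, featureId); may be empty. */
        Map<Integer, Float> probe(int lemmaId, int featureId);
    }

    static int disambiguate(LfeMap lfem, int lemmaId, int[] contextFeatures) {
        Map<Integer, Double> score = new HashMap<>();          // e -> accumulated score
        for (int f : contextFeatures) {
            for (Map.Entry<Integer, Float> ew : lfem.probe(lemmaId, f).entrySet())
                score.merge(ew.getKey(), (double) ew.getValue(), Double::sum);
        }
        // Choose the top candidate entity (or -1 if nothing scored).
        return score.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(-1);
    }
}
```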
10
Short entity IDs

• Millions of entities globally, but only a few candidates for a given lemma
• Use variable-length integer codes: the most frequent candidate gets the shortest ID, hence the shortest code

[Figure: for the lemma “Michael Jordan”, candidate entities are sorted by decreasing occurrence frequency in a reference corpus — basketball player; CBS, PepsiCo, Westinghouse exec; machine learning researcher; mycologist; racing driver; goalkeeper — and assigned short entity IDs 0–5 with respect to the lemma.]
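The slides do not say which variable-length code was used; as one concrete example, here is a minimal Elias γ encoder, which gives the shortest code (a single bit) to the smallest value.

```java
/** Sketch: Elias gamma coding, one example of a variable-length integer code.
 *  Small values get short codes; the value 1 costs a single bit. The slides do
 *  not specify which code the system actually used. */
class EliasGamma {
    /** Gamma code of n (n >= 1), returned as a bit string for clarity.
     *  A short entity ID s would be encoded as n = s + 1. */
    static String encode(int n) {
        if (n < 1) throw new IllegalArgumentException("gamma code needs n >= 1");
        int len = 32 - Integer.numberOfLeadingZeros(n);      // number of bits in n
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < len; i++) sb.append('0');        // unary length prefix
        for (int i = len - 1; i >= 0; i--)                   // binary value, MSB first
            sb.append((n >>> i) & 1);
        return sb.toString();
    }
}
// encode(1) = "1", encode(2) = "010", encode(3) = "011", encode(4) = "00100"
```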
11
Encoding of (ℓ, f ) → {e → w}

• e is stored as a short ID
• We used one particular code; others may be better
• For adjacent short IDs, we spend only one bit
• Record sizes are irregular, so within a segment we must read from the beginning to decompress

[Figure: the compressed stream is a concatenation of segments, one per lemma ℓ, each holding its (f, {e → w}) records; an index points to the start of the segment for each lemma ID.]
12
Random access on (ℓ, f )

• We already support random access on ℓ; the number of distinct ℓ is O(10 million)
• Cannot afford the time to decompress from the beginning of an ℓ block
• Cannot afford a (full) index array for (ℓ, f )
• Within each ℓ block, allocate sync points — an old technique in IR indexing
• New issues:
  • Outer allocation: dividing the total sync budget among the ℓ blocks
  • Inner allocation: tuning syncs to the measured (ℓ, f ) probe distribution
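A sketch of how a probe could use sync points, assuming each sync records the first feature ID and bit offset of its sub-segment (the real bit-level layout is in the paper).

```java
import java.util.Arrays;

/** Sketch of random access within one lemma's compressed block using sync
 *  points: jump to the nearest sync at or before feature f, then decode
 *  records sequentially. Decoder and layout are placeholders. */
class SyncedBlock {
    // For each sync point: the first feature id at the sync and its bit offset.
    private final int[] syncFeature;   // sorted ascending
    private final long[] syncBitPos;

    SyncedBlock(int[] syncFeature, long[] syncBitPos) {
        this.syncFeature = syncFeature;
        this.syncBitPos = syncBitPos;
    }

    /** Bit position from which to start sequential decoding to reach feature f. */
    long seek(int f) {
        int i = Arrays.binarySearch(syncFeature, f);
        if (i < 0) i = -i - 2;                 // last sync with feature <= f
        if (i < 0) i = 0;                      // before the first sync: start of block
        return syncBitPos[i];
        // A sequential decoder then reads records from this offset until it
        // reaches feature f (or passes it, meaning f is absent).
    }
}
```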
13
Inner sync point allocation policies

Say K_ℓ sync points are budgeted to lemma ℓ. To which features can we seek directly? For all others, we decode sequentially from the preceding sync.

• DynProg: optimal expected probe time, computed with a dynamic program
• Freq: allocate syncs at the features f with the largest probe probability p(f | ℓ)
• Equi: mark off segments with roughly equal numbers of bits
• EquiAndFreq: split the budget between the two

[Figure: example placement of syncs among features f1 … f5 within a lemma's block.]
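A sketch of the Equi policy, under the assumption that we know the compressed size of each (f, {e → w}) record; restricting sync positions to record boundaries is my simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the Equi inner policy: place k syncs so that consecutive syncs
 *  are separated by roughly equal numbers of bits. recordBits[i] is the
 *  compressed size of the i-th (f, {e->w}) record in the lemma's block. */
class EquiAllocator {
    static List<Integer> placeSyncs(long[] recordBits, int k) {
        long totalBits = 0;
        for (long b : recordBits) totalBits += b;
        double stride = (double) totalBits / k;       // target bits between syncs
        List<Integer> syncAt = new ArrayList<>();     // record indices that get a sync
        long seen = 0;
        double nextThreshold = 0;
        for (int i = 0; i < recordBits.length && syncAt.size() < k; i++) {
            if (seen >= nextThreshold) {              // crossed the next equi boundary
                syncAt.add(i);
                nextThreshold += stride;
            }
            seen += recordBits[i];
        }
        return syncAt;
    }
}
```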
14
Outer allocation policies

Given an overall budget of K syncs, how many syncs K_ℓ does each leaf (lemma) get?
• Inputs per leaf: hit probability p_ℓ and bits in the leaf's segment b_ℓ
• An analytical expression for the effect of the inner allocation can be intractable

• Hit: K_ℓ ∝ p_ℓ
• HitBit: K_ℓ ∝ p_ℓ b_ℓ
• SqrtHitBit: K_ℓ ∝ √(p_ℓ b_ℓ), assuming an equispaced inner allocation
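One way to see where the square root comes from, reconstructed here under the equispaced-inner-allocation assumption stated above (this derivation is my reading, not quoted from the paper):

```latex
% With K_\ell equispaced syncs in a segment of b_\ell bits, a probe lands in a
% random sub-segment and decodes about half of it sequentially, so the expected
% decode cost of one probe into leaf \ell is roughly b_\ell / (2 K_\ell).
% Weighting by the hit probability p_\ell and minimizing subject to the budget:
\min_{\{K_\ell\}} \sum_\ell p_\ell \frac{b_\ell}{2 K_\ell}
\quad \text{s.t.} \quad \sum_\ell K_\ell = K
% Setting the Lagrangian's derivative to zero gives
% p_\ell b_\ell / (2 K_\ell^2) = \lambda for all \ell, hence
K_\ell \;\propto\; \sqrt{p_\ell\, b_\ell}
```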
15
Experiments

• 500 million pages, mostly English, spam-free
• Catalog has about two million lemmas and entities

[Figure: experimental pipeline. A sampler splits the corpus into train and test folds; a spotter extracts train and test contexts; a smoother produces the smoothed ℓ,f distribution used as the ℓ,f workload; a disambiguation trainer and cross-validator produces the ℓ,f → (e,w) model; a compressor builds the L-F-E map; an annotator emits entity and type annotations over the “Payload” corpus (with Wikipedia as “Reference”); an indexer builds the annotation index.]

Our best policy compresses the LFEM down to only 18 bits/entry, compared to 33.6 bits/entry marginal entropy and 128 bits/entry raw data.
16
Inner policies compared

• Equi is close to optimal DynProg but fast to compute
• Freq is surprisingly bad: the probe distribution has a long tail
• Blending Equi and Freq is worse than Equi alone
• The relative order is stable as the sample size increases: the long tail again

[Figure: lookup cost (µs, lower is better) for DynProg, Equi, EquiAndFreq, and Freq, plotted against the sync budget (0.5M–3M) and against the number of sample docs (2M–10M).]
17
Diagnosis: Freq vs. Equi

• The plots show cumulative seek cost starting at each sync, collapsing back to zero at the next sync
• The features with the largest frequencies are not evenly placed
• Tail features in between lead to steep seek costs
• Equi never lets the seek cost get out of hand
• (How about permuting features? See the paper)

[Figure: cumulative seek cost over ~8M features for Equi (up to ~400) and Freq (up to ~20,000) — note the difference in scales.]
18
Outer policies compared

• Inner policy set to the best one (DynProg)
• SqrtHitBit is better than Bit, which is better than HitBit
• Not surprising, given that DynProg behaves closer to Equi than to Freq

[Figure: probe cost (µs) vs. sync budget (0.5M–3M) for Bit, HitBit, and SqrtHitBit.]
SignHash, no training through collisions

• Build w from a separate lossless training run, then store it in SignHash, which distorts it
• Most model values are severely distorted
• Give lossless and SignHash the same RAM: most keys collide
• Completely unacceptable accuracy (random guessing is far better)

[Figure: distortion of model values (roughly ±200) across relative ranks; accuracy (0.3–0.9) and collision rate vs. number of hash buckets (400–1900 million).]
SignHash, training through collisions

• Used PEGASOS stochastic gradient descent for training
• 77% of spots have the label “NA” (no annotation), so choosing NA for all spots gives 23% error
• 11% error via the lossless LFEM
• SignHash, given the same RAM as LFEM: 18% error
• Much better than no training, but a lot worse than lossless LFEM
• Surprising, given that LFEM currently uses plain old naïve Bayes
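A minimal sketch of a PEGASOS-style hinge-loss update applied directly in the sign-hashed space, so that training “sees” the collisions; the binary-class setup, hash functions, and step-size details are simplifications of mine, not the paper's multiclass trainer.

```java
/** Sketch of "training through the collisions": a PEGASOS-style SGD update on
 *  weights stored in the B-bucket sign-hashed space, so the learner adapts to
 *  collision noise. Binary labels and hashing constants are illustrative. */
class HashedPegasos {
    final float[] w;            // model lives in the hashed space
    final double lambda;        // regularization strength
    long t = 1;                 // update counter

    HashedPegasos(int buckets, double lambda) { w = new float[buckets]; this.lambda = lambda; }

    private int bucket(int feature) { return Math.floorMod(feature * 0x9E3779B1, w.length); }
    private int sign(int feature)   { return ((feature * 0x85EBCA6B) & 1) == 0 ? 1 : -1; }

    /** One PEGASOS step on example (features, y) with y in {-1, +1}. */
    void update(int[] features, int y) {
        double eta = 1.0 / (lambda * t++);              // PEGASOS step size
        double margin = 0;
        for (int f : features) margin += sign(f) * w[bucket(f)];
        for (int b = 0; b < w.length; b++) w[b] *= (1 - eta * lambda);  // shrink (regularize)
        if (y * margin < 1)                              // hinge loss is active
            for (int f : features) w[bucket(f)] += eta * y * sign(f);
    }
}
```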
21
Comparison with other systems

• Compared against downloaded software or network services
• A regression removes per-page and per-token overheads
• LFEM wins, largely because of syncs
• LFEM's RAM footprint is much smaller than that of the downloaded software

[Figure: page annotation time (ms) vs. number of spots per page, log-log, for Spotlight, WMiner, Zemanta, and LFEM.]

System      ms/spot
Spotlight   158
WMiner      21
Zemanta     9.5
LFEM        0.6
LFEM-sync   42
22
Conclusion

• Compressed in-memory multilevel maps for disambiguation
• Random access via tuned sync allocation
• >20 GB down to 1.15 GB
• Faster than public disambiguation systems
• Annotated 500M pages with 2M Wikipedia entities, plus indexing, on 408 cores in ~18 hours
• Future work: sparse models for better storage?
• Also in the paper: design of the compressed annotation index posting list