lots of facets, fast

Anne Veling, [email protected], May 26th 2011

introduction

Anne Veling
• Freelance Search Architect
• Lucene Trainer

ProQuest New York Times

visualization

data
• 1851 up to 2006: almost 60k newspapers

How to give a semantic overview
• Context, where am I
• Detail

Exploration and Discovery

zoom

Present all newspapers on one canvas
Dynamic zooming and panning
Search interface
• for discovery

Front-end by Q42
• HTML5 app
• iPad app

Not yet live

architecture

[Architecture diagram. Components shown: text, images, Tile Generator, Indexer, Solr index, Web Server, Solr server, facet plugin, tiles, client]

tiling

Newspaper images, old ones scanned
• TIFF format
• Wrinkles, coffee stains

Tile generator
• Convert to JPEG
• One virtual canvas of 512 Gpixel
• Multiple layers, 3M tiles: ~100 GB in 11 levels

search

25,072,989 articles
867M Solr index
DataImportHandler
• Memory issue: it loads all XML URLs into memory first
• Solved by indexing in batches

Special
• Nothing stored, not even IDs
• We need nothing returned from search…
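The batching fix mentioned above can be illustrated with a minimal sketch. This is not the DataImportHandler configuration used in the talk; `indexBatch` and the URLs are hypothetical stand-ins. The point is only that walking a lazy iterator in groups keeps a single batch in memory at a time.

```scala
object BatchIndexing {
  // Hypothetical stand-in for "send these documents to the index".
  // Returns the number of documents in the batch.
  def indexBatch(urls: Seq[String]): Int = urls.size

  // Consumes the URL iterator lazily in groups of batchSize,
  // so the full URL list is never materialized at once.
  def indexInBatches(urls: Iterator[String], batchSize: Int): List[Int] =
    urls.grouped(batchSize).map(indexBatch).toList

  def main(args: Array[String]): Unit = {
    // Made-up URLs, for illustration only.
    val urls = Iterator.tabulate(25)(i => s"file:///articles/$i.xml")
    println(indexInBatches(urls, batchSize = 10))
  }
}
```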

[Figure: query, results, and facets drawn over the document id range 0 … maxDoc]

faceting memory

Store each facet as a BitSet over 25M articles
• 58k facets x 25M docs x 1 bit = 169 GB (memory!)

So we use DocSet from Solr
• Sparse bit array -> now fits in 1 GB of memory
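The memory figures above can be checked with a small worked calculation. The dense number follows directly from the slide. The sparse estimate rests on two assumptions that are not in the talk: every article belongs to exactly one facet at each of four date levels, and a sparse set costs 4 bytes (one Int) per member.

```scala
object FacetMemory {
  val docs: Long   = 25000000L // ~25M articles (from the slide)
  val facets: Long = 58000L    // ~58k facets (from the slide)

  // Dense representation: one bit per document per facet.
  val denseBytes: Long = facets * docs / 8

  // Assumed sparse layout: one 4-byte entry per document per date level
  // (decade, year, month, day), regardless of how many facets exist.
  val levels: Long = 4L
  val sparseBytes: Long = levels * docs * 4

  // Converts a byte count to GiB (2^30 bytes).
  def gib(bytes: Long): Double = bytes.toDouble / (1L << 30)

  def main(args: Array[String]): Unit = {
    println(f"dense:  ${gib(denseBytes)}%.1f GiB")
    println(f"sparse: ${gib(sparseBytes)}%.2f GiB")
  }
}
```

Under these assumptions the dense layout lands at roughly the 169 GB on the slide, while the sparse layout stays well under 1 GB.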

faceting performance

Facet initialization
• Takes ~1.5 minutes
• Cached

Facet evaluation
• Runtime!
• #docs x #facets

query performance

Facet initialization/creation
• Solr LRU cache
• Creation of all facets: ~72 s

Runtime faceting
• Runtime evaluation out of the box: 71 seconds…

/select/?q=Amsterdam&version=2.2&start=0&rows=10&indent=on&facet.date=thedate&facet.date.start=1850-01-01T00:00:00Z&facet.date.end=2007-01-01T00:00:00Z&facet.date.gap=%2B1DAY&facet=true

Client-side bottleneck vs Server-side

<filterCache class="solr.FastLRUCache" size="70000" initialSize="512" autowarmCount="0"/>

Improved performance to ~300 ms for the “Amsterdam” [1825] query!
• 2.3 MB output…

<requestHandler name="/zoomr" class="com.proquest.zoom.ZoomrRequestHandler"></requestHandler>

Custom JSON output
• Base 36 encoded heatmap

01111111111111111122111222777986878768885568855899beddbcebbadabcbfgffggjmkgilrrwxwzuonpb9noolnljjjkkhhllllkjgipmdimlbbhkahf77987afghhihjihjikjikifeefgppsomf8000
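A string like the one above can be produced by emitting one base-36 digit per heatmap bucket. The sketch below assumes counts are scaled linearly to 0..35 relative to the largest bucket; the talk does not say how counts were mapped, so `encode` and `decode` are illustrative names, not the plugin's actual code.

```scala
object Heatmap {
  // Scales each count linearly to 0..35 (assumed mapping) and emits
  // one base-36 digit (0-9, a-z) per bucket.
  def encode(counts: Seq[Int]): String = {
    val max = if (counts.isEmpty) 0 else counts.max
    counts.map { c =>
      val level = if (max == 0) 0 else (c.toLong * 35 / max).toInt
      Character.forDigit(level, 36)
    }.mkString
  }

  // Client-side inverse: each character back to a 0..35 intensity.
  def decode(encoded: String): Seq[Int] =
    encoded.map(ch => Character.digit(ch, 36))

  def main(args: Array[String]): Unit = {
    println(encode(Seq(0, 1, 35, 70)))
  }
}
```

One character per bucket keeps the payload at a fixed, small size per facet, which matters once the client becomes the bottleneck.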

runtime facet optimization

60,656 facets
Worst case, counted in DocSet.exists(doc) calls
• Originally: 25M x 60k = 1.5E12 checks, 60k per doc
• Now: on average 0.5x the facets at each level = 34.5 per doc

16 decades

160 years

1,920 months

58,560 days
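The per-document estimate can be reproduced from these level counts. The sketch below uses the slide's own rule of thumb (half of each level's candidates are tested before the match); the object and method names are illustrative.

```scala
object FacetChecks {
  // Candidates per level, derived from the counts on the slide:
  // 16 decades, 10 years per decade, 12 months per year,
  // and 58,560 / 1,920 = 30.5 days per month on average.
  val fanOut: Seq[Double] = Seq(16.0, 160.0 / 16, 1920.0 / 160, 58560.0 / 1920)

  // Total number of facets across all four levels.
  val totalFacets: Int = 16 + 160 + 1920 + 58560

  // Expected checks per document when half of each level is tested on average.
  def expectedChecks(fan: Seq[Double]): Double = fan.map(_ * 0.5).sum

  def main(args: Array[String]): Unit = {
    println(totalFacets)
    println(expectedChecks(fanOut))
    println(expectedChecks(Seq(16.0, 10.0, 12.0, 31.0)))
  }
}
```

With 30.5 days per month the estimate is 34.25 checks per document; taking 31 days per month gives the 34.5 quoted on the slide. Either way it replaces roughly 60k checks per document.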

optimization

Custom facet runtime Collector
• Break if facet matched: single value per doc per facet, each doc has only 1 day
• Top-down facet selection: decade – year – month – day

Performance for 1850 docs and 60k docs improved from 300 ms to 10 ms

Custom optimized heatmap JSON
Bottleneck now in the client/canvas/JS
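The break-on-match, top-down selection can be sketched without any Lucene or Solr dependency. Everything here is an illustrative assumption: `Facet`, `Counter` and `collect` are hypothetical names, and a plain `Set[Int]` stands in for a Solr DocSet.

```scala
import scala.collection.mutable

object TopDownFacets {
  // A facet, the ids of its matching documents, and its finer-grained children
  // (decade -> year -> month -> day).
  final case class Facet(name: String, docs: Set[Int], children: Seq[Facet] = Nil)

  // Accumulates facet counts and the number of membership checks performed.
  final class Counter {
    val hits: mutable.Map[String, Int] =
      mutable.Map.empty[String, Int].withDefaultValue(0)
    var checks: Int = 0
  }

  // Tests siblings in order and stops at the first match (a document has a
  // single value per level), then descends only into the matched facet.
  def collect(doc: Int, level: Seq[Facet], counter: Counter): Unit = {
    val matched = level.find { facet =>
      counter.checks += 1
      facet.docs.contains(doc) // stands in for DocSet.exists(doc)
    }
    matched.foreach { facet =>
      counter.hits(facet.name) += 1
      collect(doc, facet.children, counter)
    }
  }
}
```

Calling `collect` once per matching document yields counts for every level of the hierarchy while only touching the siblings along one path, which is where the drop from about 60k to a few dozen checks per document comes from.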

show us or it didn’t happen

Web Application iPad App

zooming

facet heatmap

“television”

“inflation”

conclusions

Great exploratory UI
Use domain knowledge to optimize for performance
• If you can

Next
• Bring it live on the Web and in the App Store
• Using it for 1.2M books/CDs/DVDs of Belgium
• More search options
• Multipage

enhancement suggestions

Lucene Collector
• def collect(doc: Int): Boolean

Solr SingleValueFacet
• Break after first find
• Automatic order based on #counts?

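// Written against the proposed Collector signature, where collect returns a Boolean.
class ExistsCollector extends Collector {
  var exists = false

  // Record the hit and return false to stop collecting after the first match.
  def collect(doc: Int) = {
    exists = true
    false
  }

  def acceptsDocsOutOfOrder() = true
  def setNextReader(reader: IndexReader, base: Int): Unit = {}
  def setScorer(scorer: Scorer): Unit = {}
}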
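To show what the suggested Boolean return value would buy, here is a self-contained sketch that does not depend on Lucene. `StoppableCollector` and `search` are hypothetical stand-ins for the proposed API and the search loop, not existing Lucene classes.

```scala
object EarlyExit {
  // Hypothetical collector interface: returning false asks the loop to stop.
  trait StoppableCollector {
    def collect(doc: Int): Boolean
  }

  // Only records whether anything matched, so it stops after the first hit.
  final class ExistsCollector extends StoppableCollector {
    var exists = false
    def collect(doc: Int): Boolean = {
      exists = true
      false
    }
  }

  // Stand-in for the search loop: feeds matching doc ids to the collector
  // until it declines, and returns how many documents were visited.
  def search(matches: Iterator[Int], collector: StoppableCollector): Int = {
    var visited = 0
    var keepGoing = true
    while (keepGoing && matches.hasNext) {
      visited += 1
      keepGoing = collector.collect(matches.next())
    }
    visited
  }
}
```

With this shape an existence check visits at most one document, however many documents match the query.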

lessons learned

Java Graphics has limitations for large fonts (>26,000)

Handling large data sets is tricky
• Indexing
• Copying

There’s technology and there’s corporate agendas

You can always make things 10x faster
• Lucene is ridiculously fast if you configure it well
• Using domain knowledge can get you far

thank you

[email protected]

@anneveling
