lots of facets, fast
DESCRIPTION
We built a web application for a well-known US newspaper: a maps-like zooming interface on top of the 60,000 newspapers published since 1850, with Solr running over the 28,000,000 articles to draw an interactive heatmap on top of it. Using domain knowledge, we sped up the out-of-the-box faceting solution by orders of magnitude, which gave us a great visual way of exploring trends in historical newspapers.
TRANSCRIPT
lots of facets, fast
Anne Veling, May 26th 2011
introduction
Anne Veling
• Freelance Search Architect
• Lucene Trainer
ProQuest New York Times
visualization
data
• 1851 up to 2006: almost 60k newspapers
How to give semantic overview
• Context, where am I
• Detail
Exploration and Discovery
zoom
Present all newspapers on one canvas
Dynamic zooming and panning
Search interface
• for discovery
Front-end by Q42
• HTML5 app
• iPad app
Not yet live
architecture
[Architecture diagram. Components: text, images, Tile Generator, tiles, Indexer, Solr index, Solr server with facet plugin, Web Server, client]
tiling
Newspaper images, old ones scanned
• TIFF format
• Wrinkles, coffee stains
Tile generator
• Convert to JPG
• One virtual canvas of 512 Gpixel
• Multiple layers: 3M tiles, ~100 GB in 11 levels
search
25,072,989 articles
867M Solr index
DataImportHandler
• Issue with memory: it loads all XML URLs in memory first
• Solved by indexing in batches
Special
• Nothing stored, not even IDs
• We need nothing returned from search…
[Diagram: query results and facets as sets of document IDs, from 0 to maxDoc]
faceting memory
Store each facet as BitSet over 25M articles
• 58k facets x 25M docs x 1 bit = 169 GB (memory!)
So we use DocSet from Solr
• Sparse bit array -> now fits in 1 GB memory
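The memory figures above can be checked with a rough back-of-the-envelope sketch. It assumes a dense layout of one bit per (facet, document) pair, and a sparse layout that stores each document ID once per level (decade, year, month, day) as a 4-byte int. The sparse layout is an illustrative assumption, not a description of Solr's DocSet internals; all names are hypothetical.

```scala
object FacetMemory {
  val GiB: Double = 1024.0 * 1024.0 * 1024.0

  // Dense layout: one bit for every (facet, document) pair.
  def denseBytes(facets: Long, docs: Long): Long = facets * docs / 8

  // Sparse layout (assumption): each doc ID stored once per level as a 4-byte int.
  def sparseBytes(docs: Long, levels: Int): Long = docs * levels * 4L

  def main(args: Array[String]): Unit = {
    val dense = denseBytes(58000L, 25000000L)
    val sparse = sparseBytes(25000000L, 4)
    println(f"dense:  ${dense / GiB}%.1f GiB")
    println(f"sparse: ${sparse / GiB}%.2f GiB")
  }
}
```

Under these assumptions the dense layout needs roughly 169 GiB, while the sparse one stays well below 1 GiB.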
faceting performance
Facet initialization
• Takes ~1.5 minutes
• Cached
Facet evaluation
• Runtime!
• #docs x #facets
query performance
Facet initialization/creation
• Solr LRU cache
• Creation of all facets: ~72 s
Runtime faceting
• Runtime evaluation out of the box: 71 seconds…
/select/?q=Amsterdam&version=2.2&start=0&rows=10&indent=on&facet.date=thedate&facet.date.start=1850-01-01T00:00:00Z&facet.date.end=2007-01-01T00:00:00Z&facet.date.gap=%2B1DAY&facet=true
Client-side bottleneck vs Server-side
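The out-of-the-box request above asks Solr for one date bucket per day. As a rough illustration of how many buckets that is, this sketch counts the days between facet.date.start and facet.date.end. The object and method names are made up for this example.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object DateBuckets {
  // Number of +1DAY buckets from start (inclusive) to end (exclusive).
  def dayBuckets(start: LocalDate, end: LocalDate): Long =
    ChronoUnit.DAYS.between(start, end)

  def main(args: Array[String]): Unit = {
    // Same range as facet.date.start / facet.date.end in the request above.
    val n = dayBuckets(LocalDate.of(1850, 1, 1), LocalDate.of(2007, 1, 1))
    println(s"day buckets: $n")
  }
}
```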
<filterCache class="solr.FastLRUCache" size="70000" initialSize="512" autowarmCount="0"/>
Improved performance to ~300 ms for “Amsterdam” [1825] query!
• 2.3 MB output…
<requestHandler name="/zoomr" class="com.proquest.zoom.ZoomrRequestHandler"></requestHandler>
Custom JSON output
• Base 36 encoded heatmap
01111111111111111122111222777986878768885568855899beddbcebbadabcbfgffggjmkgilrrwxwzuonpb9noolnljjjkkhhllllkjgipmdimlbbhkahf77987afghhihjihjikjikifeefgppsomf8000
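A string like the one above carries one base-36 character (0-9, a-z) per heatmap cell. The slides do not say how counts are mapped to the 36 levels, so the sketch below assumes simple linear scaling against the maximum count. The `Heatmap` object and its methods are hypothetical.

```scala
object Heatmap {
  // Encodes each count as one base-36 character. Counts are scaled linearly
  // so that the largest count maps to 'z' (level 35); this scaling is an assumption.
  def encode(counts: Seq[Int]): String = {
    val max = if (counts.isEmpty) 0 else counts.max
    counts.map { c =>
      val level = if (max == 0) 0 else (c.toLong * 35 / max).toInt
      Character.forDigit(level, 36)
    }.mkString
  }

  // Decodes a base-36 heatmap string back into levels 0..35.
  def decode(s: String): Seq[Int] = s.map(ch => Character.digit(ch, 36))
}
```

One character per cell keeps the payload far smaller than a JSON array of numbers, at the cost of reducing every count to 36 intensity levels.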
runtime facet optimization
60,656 facets
Worst case: number of DocSet.exists(doc) checks
• Originally: 25M x 60k = 1.5E12 checks, 60k per doc
• Now: average 0.5x for each level = 34.5 per doc
16 decades
160 years
1,920 months
58,560 days
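The numbers on this slide can be reproduced with a little arithmetic. The sketch assumes a fan-out of 16 decades, 10 years per decade, 12 months per year and up to 31 days per month (the 31 is an assumption that makes the 34.5 come out), and that on average half of the siblings at each level are tested before one matches.

```scala
object FacetChecks {
  // Facet counts per level as listed on the slide: decades, years, months, days.
  val facetsPerLevel: Seq[Int] = Seq(16, 160, 1920, 58560)

  // Fan-out when walking top-down: siblings to test at each level (31 is an assumption).
  val fanOut: Seq[Int] = Seq(16, 10, 12, 31)

  // Flat evaluation: every facet is tested for every document.
  val flatChecksPerDoc: Int = facetsPerLevel.sum

  // Top-down with early exit: on average half the siblings per level are tested.
  val topDownChecksPerDoc: Double = fanOut.map(_ * 0.5).sum

  def main(args: Array[String]): Unit = {
    println(s"flat: $flatChecksPerDoc checks per doc")
    println(s"top-down: $topDownChecksPerDoc checks per doc")
  }
}
```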
optimization
Custom facet runtime Collector
• Break if facet matched: single value per doc per facet (each doc has only 1 day)
• Top-down facet selection: decade > year > month > day
Performance for 1850 docs and 60k docs improved from 300 ms to 10 ms
Custom optimized heatmap JSON
Bottleneck now in the client/canvas/JS
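The two ideas above, breaking at the first matching facet and selecting facets top-down, can be sketched as follows. This is a minimal illustration on plain Scala collections, not the ZoomrRequestHandler code. `FacetNode` and `TopDownFaceting` are hypothetical names, and a `BitSet` stands in for Solr's DocSet.

```scala
import scala.collection.immutable.BitSet
import scala.collection.mutable

// Hypothetical facet node: a label, the matching doc IDs, and finer-grained children.
final case class FacetNode(label: String, docs: BitSet, children: Seq[FacetNode] = Nil)

object TopDownFaceting {
  // Counts hits per facet label. For each matching doc, walk down the tree
  // (decade, year, month, day) and stop at the first sibling containing the doc,
  // which is valid because each doc has exactly one value per level.
  def count(roots: Seq[FacetNode], hits: Iterable[Int]): Map[String, Int] = {
    val counts = mutable.Map.empty[String, Int].withDefaultValue(0)
    for (doc <- hits) {
      var level = roots
      while (level.nonEmpty) {
        level.find(_.docs.contains(doc)) match { // find stops at the first match
          case Some(node) =>
            counts(node.label) += 1
            level = node.children
          case None =>
            level = Nil // doc matches no facet at this level
        }
      }
    }
    counts.toMap
  }
}
```

A small usage example with one month and two days:

```scala
val day1 = FacetNode("1851-09-18", BitSet(0, 1))
val day2 = FacetNode("1851-09-19", BitSet(2))
val month = FacetNode("1851-09", BitSet(0, 1, 2), Seq(day1, day2))
val counts = TopDownFaceting.count(Seq(month), Seq(0, 2))
```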
show us or it didn’t happen
Web Application
iPad App
zooming
facet heatmap
“television”
“inflation”
conclusions
Great exploratory UI
Use domain knowledge to optimize for performance
• If you can
Next
• Bring it live on the Web and in App Store
• Using it for 1.2M books/CDs/DVDs of Belgium
• More search options
• Multipage
enhancement suggestions
Lucene Collector
• def collect(doc: Int): Boolean
Solr SingleValueFacet
• Break after first find
• Automatic order based on #counts?
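// Sketch against the proposed Collector API, where collect returns a Boolean.
class ExistsCollector extends Collector {
  var exists = false

  // Assumes a false return value tells the searcher to stop after the first hit.
  def collect(doc: Int) = {
    exists = true
    false
  }

  def acceptsDocsOutOfOrder() = true
  def setNextReader(reader: IndexReader, base: Int) {}
  def setScorer(scorer: Scorer) {}
}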
lessons learned
Java Graphics has limitations for large fonts (>26,000)
Handling large data sets is tricky
• Indexing
• Copying
There’s technology and there’s corporate agendas
You can always make things 10x faster
• Lucene is ridiculously fast, if you configure it well
• Using domain knowledge can get you far