![Page 1: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/1.jpg)
![Page 2: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/2.jpg)
Solr and Lucene @ AOL
SEAN TIMM, CHIEF ARCHITECT, AOL ADVERTISING
![Page 3: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/3.jpg)
1999
• Believe, Cher and Livin’ la Vida Loca, Ricky Martin
• The Matrix and The Phantom Menace
• Windows 98 Second Edition
• AltaVista, Northern Light, Yahoo, ODP, Inktomi
– Google
• PPC Text search ads invented 1998
– Banner ads
![Page 4: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/4.jpg)
A Brief History of Search @ AOL
• Acquired PLS in 1998
• AOL Search used ODP
• Site Search
• Local Search
• Built into AOL Server
• CPL
– VSM then BM25
– Phrase, numeric, date, text, and proximity boosting
– Conflation classes (like synonyms)
![Page 5: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/5.jpg)
Relevance
• Precision/recall
• “free alcohol” vs. “alcohol free”
• Lawyer versus Attorney
• Iron and ironic same stem (Porter)
• Beyonce vs. Beyoncé
• Eagles
– Bird, sports teams, band, AMC Eagle
• F 15, F-15, F15
• FREAK
[Figure: Relevant vs. Retrieved]
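Precision and recall can be computed from the two sets named on the slide (relevant documents and retrieved documents). A minimal sketch; the document ids are hypothetical and only illustrate the arithmetic:

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids, for illustration only.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5"}
precision, recall = precision_recall(relevant, retrieved)
# 2 of the 3 retrieved documents are relevant; 2 of the 4 relevant ones were found.
```

Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved.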
![Page 6: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/6.jpg)
The Dawn of Solr
• Prohibitively expensive to continue CPL development
• Complicated deployment
• 2005: Investigating migration to Lucene
• 2006: CNET open sourced Solr
![Page 7: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/7.jpg)
Contributions
• Local Lucene/Solr (superseded by SpatialSearch)
• Query Timeout
• Data Import Handler (DIH)
• Numerous smaller patches
• Committers: Noble Paul, Shalin Mangar, Patrick O’Leary
![Page 8: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/8.jpg)
Contributing to Solr/Lucene
• Learn
–Join the mailing lists
• [email protected]
• [email protected]
–Read search and Solr related blogs
–The #solr IRC channel on freenode
![Page 9: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/9.jpg)
Contributing to Solr/Lucene
• Help others
–Answer questions.
–Improve documentation in the code, the wiki, or the website.
–Make improvements to the Solr Admin UI.
![Page 10: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/10.jpg)
Contributing to Solr/Lucene
• Confirm a bug
• Submit a patch for a reported bug or feature request
• Improve a patch
• Try out a patch and see if it works
![Page 11: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/11.jpg)
Contributing to Solr/Lucene
• Submit your own tickets
– Bug
– Feature request
• Start with solr-user@lucene
• Discuss on dev@lucene
• Create Jira ticket, ideally with patches and unit tests
• Yonik’s Law of Patches:
– A half-baked patch in Jira, with no documentation, no tests, and no backwards compatibility is better than no patch at all.
![Page 12: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/12.jpg)
Applications
• MapQuest (SpatialSearch)
• Mail
• AIM
• AOL Search
• Site Search
• News Search
• RUM
• Sarah Palin e-mails (admin)
• Demand
• Wikipedia article pattern detection
![Page 13: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/13.jpg)
MapQuest Discover
![Page 14: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/14.jpg)
Travel Blogs
![Page 15: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/15.jpg)
MQ Local Search
![Page 16: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/16.jpg)
Related Searches
![Page 17: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/17.jpg)
Bipartite graph snippet
![Page 18: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/18.jpg)
Related Searches Graph
“The Eagles”
The band
NFL
Boston College
Hotel California
Tribute
![Page 19: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/19.jpg)
Related Searches
• Simple query
– User
• New York Library
– Solr query
• Lower case
• Prefer exact match “new york library”
• Use phrase slop to allow terms in same order and near each other, e.g., new york city public library
• primeQuery:“new york library” OR “new york library”~3
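The query construction on this slide can be sketched in a few lines. This is an illustration, not the code used at AOL: the function name, the character stripping, and the parenthesized field grouping are assumptions. The grouping applies `primeQuery` to both clauses, whereas on the slide the second clause has no field prefix and would go to the default search field.

```python
def build_related_query(user_query, field="primeQuery", slop=3):
    """Build an exact-phrase OR sloppy-phrase query string in Lucene syntax."""
    # Remove characters that would break out of the quoted phrase.
    cleaned = user_query.replace("\\", " ").replace('"', " ")
    # Lower-case and collapse runs of whitespace.
    phrase = " ".join(cleaned.lower().split())
    exact = f'"{phrase}"'
    sloppy = f'"{phrase}"~{slop}'
    return f"{field}:({exact} OR {sloppy})"


query = build_related_query("New York Library")
# query == 'primeQuery:("new york library" OR "new york library"~3)'
```

A document containing the exact phrase satisfies both clauses, which is one way an exact match can end up scored above a sloppy match.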
![Page 20: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/20.jpg)
Wikipedia Traffic Correlation Schema
<field name="title" type="string" indexed="true" stored="true" required="true" />
<field name="title_norm" type="string" indexed="true" stored="true" required="true" />
<field name="total_pvs" type="long" indexed="true" stored="true" required="true" />
<!-- Dynamic field definitions. If a field name is not found, dynamicFields
will be used if the name matches any of the patterns.
RESTRICTION: the glob-like pattern in the name attribute must have
a "*" only at the start or the end.
EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
Longer patterns will be matched first. if equal size patterns
both match, the first appearing in the schema will be used. -->
<!-- trend direction. field name contains date string, e.g., "trend_20110622" -->
<dynamicField name="trend_*" type="int" indexed="true" stored="true"/>
<!-- page views. field name contains date string, e.g., "pvs_20110622" -->
<dynamicField name="pvs_*" type="long" indexed="true" stored="true"/>
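A document for this schema carries one `pvs_*` and one `trend_*` field per day, with the date embedded in the field name. A minimal sketch of building such a document as a plain dictionary; the helper name, the lower-casing used for `title_norm`, the summing used for `total_pvs`, and the `1`/`-1` trend encoding are all assumptions, since the slides do not define them:

```python
from datetime import date


def traffic_doc(title, daily):
    """Build a document dict; `daily` maps a date to (page_views, trend)."""
    doc = {
        "title": title,
        "title_norm": title.lower(),  # assumption: normalization is not specified
        "total_pvs": sum(pvs for pvs, _ in daily.values()),  # assumption
    }
    for day, (pvs, trend) in daily.items():
        stamp = day.strftime("%Y%m%d")  # e.g. "20110622"
        doc[f"pvs_{stamp}"] = pvs       # matches dynamicField "pvs_*"
        doc[f"trend_{stamp}"] = trend   # matches dynamicField "trend_*"
    return doc


# Hypothetical values, for illustration only.
doc = traffic_doc(
    "Solr",
    {date(2011, 6, 22): (1200, 1), date(2011, 6, 23): (900, -1)},
)
```

Because the field names match the `pvs_*` and `trend_*` patterns, new dates need no schema change.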
![Page 21: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/21.jpg)
Temporal Traffic Correlation of Wikipedia Page Views
![Page 22: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/22.jpg)
Sarah Palin E-mail Stats
• 13,177 documents
• 4 hours from receiving data to production install
• ~150 K requests per day at launch
• Now about 6-7 K requests per day
• Running on 3 VMs in two different data centers behind a NetScaler
![Page 23: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/23.jpg)
![Page 24: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/24.jpg)
Faceting and Clustering
![Page 25: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/25.jpg)
Huffington Post Comments
• Solr 4
• Uses Solr Cloud
• Single shard
• ReplicationFactor 3
• Real-time
• 90 days of comments
• Tested up to 100 writes / second
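A single-shard collection with three replicas can be requested through the SolrCloud Collections API. The sketch below only builds the request URL; the host, port, and collection name are hypothetical, and the slides do not say how the HuffPost collection was actually created:

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards=1, replication_factor=3):
    """Return a Collections API CREATE URL for the given layout."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical host and collection name.
url = create_collection_url("http://localhost:8983/solr", "comments")
```

With one shard every replica holds the full index, so the replication factor here buys redundancy and read capacity rather than more index capacity.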
![Page 26: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/26.jpg)
More HuffPost comments
• Used by editors and moderators
– Topic investigation
– Troll detection
• Config
– Special features: search for emoticons, prefer exact match, date boosting
• Hack-a-thon comment clustering, timeline, and summarization
![Page 27: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/27.jpg)
Solr Comments Architecture
Message Queue
MongoDB
MongoIngestor
Solr Ingestor
Solr Cloud
Uses SolrJ CloudSolrServer
Tools Server
JuLiA
![Page 28: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/28.jpg)
Relevance in Solr
• “free alcohol” vs. “alcohol free”
– Phrase queries and phrase slop
• Lawyer versus Attorney
– SynonymFilterFactory
• Iron and ironic
– Kstem, or Lemmatization via the SynonymFilterFactory instead of Snowball/Porter
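Why phrase queries separate “free alcohol” from “alcohol free” can be shown with a toy matcher. This is a deliberately simplified model, not Lucene's algorithm: it requires the terms in order and counts only the extra positions between them, whereas Lucene's slop can also allow reordered terms at a higher cost. The function name is hypothetical.

```python
def ordered_phrase_match(doc_tokens, phrase_tokens, slop=0):
    """True if the phrase terms occur in order with at most `slop` extra positions."""
    if not phrase_tokens:
        return False
    for start, token in enumerate(doc_tokens):
        if token != phrase_tokens[0]:
            continue
        pos = start
        matched = True
        for term in phrase_tokens[1:]:
            try:
                # Earliest occurrence of the next term after the current position.
                pos = doc_tokens.index(term, pos + 1)
            except ValueError:
                matched = False
                break
        # Extra positions = actual span minus the span of an exact phrase.
        if matched and (pos - start) - (len(phrase_tokens) - 1) <= slop:
            return True
    return False


doc = "alcohol free beer".split()
library = "new york city public library".split()
```

Here `ordered_phrase_match(doc, ["alcohol", "free"])` is true while `ordered_phrase_match(doc, ["free", "alcohol"])` is false, although a bag-of-words match would accept both.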
![Page 29: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/29.jpg)
Relevance in Solr
• Beyonce vs. Beyoncé
– Various Folding Filters
• Eagles
– Boost on other fields, such as popularity, publish date
– Use related searches, facets, or clustering
• F 15, F-15, F15
– WordDelimiterFilter
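The effect of folding and delimiter handling can be illustrated outside Solr. The sketch below is a toy normalization, not the behaviour of Solr's folding filters or WordDelimiterFilter (which splits tokens on delimiters and case/number transitions and can emit several forms); the function names are hypothetical:

```python
import re
import unicodedata


def fold_accents(text):
    """Strip combining marks so that 'Beyoncé' and 'Beyonce' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_key(text):
    """Fold accents, lower-case, and drop everything except ASCII letters and digits."""
    folded = fold_accents(text).lower()
    return re.sub(r"[^a-z0-9]+", "", folded)


variants = {index_key(v) for v in ["F 15", "F-15", "F15"]}
```

All three spellings collapse to the single key `f15`, which is the outcome the slide is after: any of the variants typed by a user should find documents written with any other.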
![Page 30: Solr At AOL, Presented by Sean Timm at SolrExchage DC](https://reader035.vdocuments.site/reader035/viewer/2022062513/55509923b4c9058b208b47f5/html5/thumbnails/30.jpg)
Bringing a New Search Project Online
• Understand the domain
• Ingest (sample) data
• Clean data
• Repeat
• Relevance testing
• Scale out
• Launch/Success