Natural Language Search in Solr
![Page 2: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/2.jpg)
Agenda
• An approach to natural language search in Solr
• Main points
  • Solr-UIMA integration module
  • Custom Lucene analyzers for UIMA
  • OSS NLP algorithms in Lucene/Solr
  • Orchestrating blocks to build a sample system able to understand natural language queries
• Results
![Page 3: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/3.jpg)
My Background
• Software engineer at Sourcesense
  • Enterprise search consultant
• Member of the Apache Software Foundation
  • UIMA
  • Clerezza
  • Stanbol
  • DirectMemory
  • ...
![Page 4: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/4.jpg)
Google in ‘99
![Page 5: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/5.jpg)
Google today
![Page 6: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/6.jpg)
Google today
![Page 7: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/7.jpg)
The Challenge
• Improved recall/precision
  • ‘articles about science’ (concepts)
  • ‘movies by K. Spacey’ vs ‘movies with K. Spacey’
• Easier experience for non-expert users
  • ‘people working at Google’ - ‘cities near London’
• Horizontal domains (e.g. Google)
• Vertical domains
![Page 8: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/8.jpg)
Hurdles
• Understanding the text of documents and user queries
• Extracting domain-specific and domain-wide entities and concepts
• Index/search performance
![Page 9: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/9.jpg)
Use Case
• Search engine for an online movies magazine
• Solr based
• Non-technical users
• Time / cost
  • Solr 3.x setup: 2 mins
  • NLS setup / tweak: 5 days
• Expecting
  • improved recall / precision
  • more time (clicks) on site ($)
![Page 10: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/10.jpg)
Online movies magazine
![Page 11: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/11.jpg)
General approach
• Natural language processing
• Processing documents at indexing time
  • document text analysis
  • write enriched text in (dedicated) fields
  • add custom types / payloads to terms
• Processing queries at searching time
  • query analysis
  • higher boosts to entities/concepts
  • in-sentence search
  • ...
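
The indexing-time half of this approach can be sketched in plain Python, outside of Solr. This is only an illustration: the gazetteer `KNOWN_PEOPLE`, the field names `text` and `people`, and the function `enrich` are hypothetical, and a dictionary lookup stands in for a real NLP entity extractor.

```python
# Toy gazetteer standing in for a real named-entity recognizer (assumption).
KNOWN_PEOPLE = {"kevin spacey", "sam mendes"}

def enrich(doc):
    """Return a copy of doc with recognized people written to a dedicated field."""
    text = doc["text"].lower()
    found = sorted(name for name in KNOWN_PEOPLE if name in text)
    enriched = dict(doc)          # leave the input document untouched
    enriched["people"] = found    # hypothetical dedicated field for entities
    return enriched

doc = {"id": "1", "text": "American Beauty, directed by Sam Mendes, stars Kevin Spacey."}
enriched_doc = enrich(doc)
```

A query can then target or boost the dedicated field separately from the raw text, which is the point of writing the enrichment to its own field.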
![Page 12: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/12.jpg)
NLP
• AI discipline
  • Computers understanding and managing information written in human language
• Analyze text at various levels
• Incrementally enrich / give structure
• Extract concepts and named entities
![Page 13: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/13.jpg)
Technical detail
• NLP algorithms plugged via Apache UIMA
• Indexing time
  • UpdateProcessor plugin (solr/contrib/uima)
  • Custom tokenizers/filters
• Search time
  • Custom QParserPlugin
![Page 14: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/14.jpg)
Why Apache UIMA?
• OASIS standard for UIM (unstructured information management)
• Apache TLP (top-level project) since March 2010
• Deploy pipelines of Analysis Engines
• AEs (Analysis Engines) wrap NLP algorithms
• Scaling capabilities
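
The idea of a pipeline of Analysis Engines can be sketched without UIMA itself. In the snippet below a plain dict named `cas` stands in for UIMA's shared analysis structure, and `token_annotator`, `capitalized_annotator` and `run_pipeline` are hypothetical names; none of this is the UIMA API. It only shows engines running in order, each adding annotations that later engines can build on.

```python
import re

def token_annotator(cas):
    """Add (begin, end) offsets for every word-like token."""
    cas["tokens"] = [(m.start(), m.end()) for m in re.finditer(r"\w+", cas["text"])]

def capitalized_annotator(cas):
    """Mark tokens starting with an uppercase letter; relies on 'tokens' being present."""
    text = cas["text"]
    cas["capitalized"] = [(b, e) for b, e in cas["tokens"] if text[b].isupper()]

def run_pipeline(text, engines):
    """Run each engine in order over one shared structure and return it."""
    cas = {"text": text}
    for engine in engines:
        engine(cas)
    return cas

cas = run_pipeline("Kevin Spacey stars in the movie",
                   [token_annotator, capitalized_annotator])
```

Annotations are stored as offsets into the original text rather than as copies of it, so several engines can describe the same span independently.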
![Page 15: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/15.jpg)
NLP and OSS
• Sentence Split
  • OpenNLP, UIMA Addons, StanfordNLP
• PoS tagging
  • OpenNLP, UIMA Addons, StanfordNLP
• Chunking/Parsing
  • OpenNLP, StanfordNLP
• NER
  • OpenNLP, UIMA Addons, Stanbol, StanfordNLP
• Clustering/Classifying
  • Mahout, OpenNLP, StanfordNLP
• ...
![Page 16: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/16.jpg)
Solr NLS architecture
![Page 17: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/17.jpg)
UIMA Update Processor
![Page 18: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/18.jpg)
Lucene analysis & UIMA
• Type: denotes lexical types for tokens
• Payload: a byte array stored at each term position
• Tokenize / filter tokens covered by a certain annotation type
• Store UIMA annotations’ features in types / payloads
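
A rough model of how a token can carry a type and a payload, written in plain Python rather than against the Lucene API. The `Token` class, the type names `person` and `movie`, and the choice of packing a confidence score into the payload are all assumptions made for illustration.

```python
import struct
from dataclasses import dataclass

@dataclass
class Token:
    term: str
    type: str = "word"     # lexical type of the token
    payload: bytes = b""   # opaque bytes stored alongside the term position

def annotate(terms, annotations):
    """Attach a type and a packed confidence to terms covered by an annotation."""
    tokens = []
    for term in terms:
        if term in annotations:
            type_name, confidence = annotations[term]
            # Encode the confidence as a 4-byte big-endian float (assumed encoding).
            tokens.append(Token(term, type_name, struct.pack(">f", confidence)))
        else:
            tokens.append(Token(term))
    return tokens

def keep_type(tokens, wanted):
    """Filter step: keep only tokens of the wanted type."""
    return [t for t in tokens if t.type == wanted]

tokens = annotate(["spacey", "stars", "in", "seven"],
                  {"spacey": ("person", 0.5), "seven": ("movie", 0.25)})
people = keep_type(tokens, "person")
confidence = struct.unpack(">f", people[0].payload)[0]
```

The type drives filtering at analysis time, while the payload travels with the term so that it could influence scoring at search time.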
![Page 19: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/19.jpg)
UIMA type-aware tokenizer
![Page 20: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/20.jpg)
Solr NLS QParser
• Analyze user query
• Extract (and query on) concepts / entities
• Use types/PoS in the query for
  • boosting terms
  • synonym expansion
• Search within sentences
• Faceting / clustering using entities
• Identify ‘place queries’ and expand Solr spatial queries (for filtering / boosting)
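
One way to picture the query-side rewriting is a small function that turns a natural language query into a boosted Lucene-syntax query string. This is a sketch under assumptions, not the parser from the talk: the field names `text`, `director` and `cast`, the mapping in `ROLE_FIELDS`, the gazetteer `KNOWN_PEOPLE` and the boost value are all hypothetical.

```python
# Hypothetical mapping from a preposition to the field it most likely refers to.
ROLE_FIELDS = {"by": "director", "with": "cast"}
# Toy gazetteer standing in for real entity extraction on the query (assumption).
KNOWN_PEOPLE = {"k. spacey"}

def rewrite(query, entity_boost=3.0):
    """Rewrite a natural language query into a boosted Lucene-syntax query."""
    q = query.lower().strip()
    for preposition, field in ROLE_FIELDS.items():
        marker = f" {preposition} "
        if marker in q:
            head, name = q.split(marker, 1)
            if name in KNOWN_PEOPLE:
                # Search the recognized entity in its role field, with a higher boost.
                return f'text:({head}) {field}:"{name}"^{entity_boost}'
    # No entity recognized: fall back to a plain text query.
    return f"text:({q})"

by_query = rewrite("movies by K. Spacey")
with_query = rewrite("movies with K. Spacey")
plain_query = rewrite("articles about science")
```

Here ‘movies by K. Spacey’ and ‘movies with K. Spacey’ end up targeting different fields, which a plain keyword query would not distinguish.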
![Page 21: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/21.jpg)
Scaling architecture
![Page 22: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/22.jpg)
Performance
• Basic (in memory)
  • slower with NRT indexing
  • search could be significantly impacted
• ReST (SimpleServer)
  • faster
  • need to explicitly digest results
• UIMA-AS
  • fast also with NRT indexing
  • fast search
  • scales nicely with lots of data
![Page 23: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/23.jpg)
DisMax vs NLS
![Page 24: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/24.jpg)
Wrap up
• General purpose architecture
• Generally improved recall / precision
• The accuracy of the NLP algorithms makes the difference
• Lots of OSS alternatives
• Performance can be kept good
![Page 25: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/25.jpg)
Sources
• Resources
  • http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/
  • https://github.com/tteofili/le11-nls
• Links
  • http://wiki.apache.org/solr/SolrUIMA
  • http://googleblog.blogspot.com/2010/01/helping-computers-understand-language.html
![Page 26: Natural Language Search in Solr](https://reader034.vdocuments.site/reader034/viewer/2022051515/54c663e94a795941198b4575/html5/thumbnails/26.jpg)
Thanks
• http://www.sourcesense.com
• @tteofili