in solr multilingual searchdata-con.org/wp-content/uploads/2014/09/david-troiano...approaches to...
TRANSCRIPT
Talk Overview
● The problem we’re trying to solve
● Natural language processing (NLP)
● Approaches to multilingual search in Solr
The Goal
Build a search engine where:
● document corpus spans multiple languages
  ○ potentially mixed-language documents
● queries within a language, or potentially spanning multiple
NLP Meets Search (Querying)
query: “clinton speaking” → NLP pipeline → Terms: clinton, speak → Inverted Index
Inverted Index
term       document IDs
...        ...
clinton    …, 123, ...
...        ...
speak      …, 123, ...
NLP Meets Search (Indexing)
Document 123: Bill Clinton spoke about ... → NLP pipeline → Terms: bill, clinton, speak, about → Inverted Index
Inverted Index
term       document IDs
...        ...
clinton    …, 123, ...
...        ...
speak      …, 123, ...
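NLP Meets Search
[Diagram: indexing and querying paths through the NLP pipeline, labeled “Solr”]
Indexing: Document 123: Bill Clinton spoke about ... → NLP pipeline → Terms: bill, clinton, speak, about → Inverted Index
Querying: query: “clinton speaking” → NLP pipeline → Terms: clinton, speak → Inverted Index
Inverted Index
term       document IDs
...        ...
clinton    …, 123, ...
...        ...
speak      …, 123, ...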
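The flow in the diagram can be sketched in a few lines of Python. This is a toy illustration, not how Solr or Lucene is implemented: the `LEMMAS` table, the whitespace tokenizer, and the second document (456) are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical stand-in for the NLP pipeline: lowercase, split on
# whitespace, and map a few inflected forms to a dictionary form.
LEMMAS = {"spoke": "speak", "speaking": "speak", "speaks": "speak"}


def analyze(text):
    """Return the normalized terms for a piece of text."""
    return [LEMMAS.get(tok, tok) for tok in text.lower().split()]


def build_index(docs):
    """Map each term to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every analyzed query term."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


docs = {
    123: "Bill Clinton spoke about search",
    456: "Hillary Clinton wrote a book",  # made-up second document
}
index = build_index(docs)
print(search(index, "clinton speaking"))
```

The same `analyze` function runs at index time and at query time, which is why “speaking” in the query can match “spoke” in document 123.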
Language Detection
● Often required when indexing
● Typically not used at query time
  ○ Lower accuracy on short strings
  ○ Sometimes unsolvable even to humans, e.g., named entities
  ○ End user applications often know query language upstream of search engine
  ○ No readily available plugin pattern (in Solr)
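Why short strings and named entities are hard can be seen with a deliberately simple detector. The sketch below scores text by stopword overlap; the tiny `STOPWORDS` lists are hypothetical, and real detectors use much richer statistical models.

```python
# Hypothetical, tiny stopword lists used only for illustration.
STOPWORDS = {
    "en": {"the", "and", "of", "about", "is"},
    "es": {"el", "la", "y", "de", "es"},
}


def detect_language(text):
    """Return the language with the most stopword hits, or None if unclear."""
    tokens = text.lower().split()
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == best]
    # No evidence at all, or a tie between languages: refuse to guess.
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


print(detect_language("Bill Clinton spoke about the economy"))
print(detect_language("el presidente habla de la economia"))
print(detect_language("clinton"))
```

A full document sentence gives the detector something to work with, while the one-word query “clinton” carries no language signal at all.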
Tokenization
● Breaking text into words
● Particularly difficult with CJK languages
  ○ Find the words: 帰国後ハーバード大学に入学を認められていたもの
Decompounding
● Breaking compound words into subcomponents
● Common in German, Dutch, Korean
  ○ Samstagmorgen → Samstag, morgen
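One simple way to decompound is a dictionary lookup that tries to cover the whole word with known entries. The sketch below assumes a tiny hypothetical `lexicon`; production decompounders also handle linking elements and ambiguity, which this does not.

```python
def decompound(word, lexicon):
    """Split word into lexicon entries, or return [word] if no full split exists."""
    word = word.lower()

    def split(rest):
        if not rest:
            return []
        # Try the longest known prefix first, then back off to shorter ones.
        for end in range(len(rest), 0, -1):
            prefix = rest[:end]
            if prefix in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [prefix] + tail
        return None  # no way to cover the remainder with lexicon entries

    parts = split(word)
    return parts if parts else [word]


lexicon = {"samstag", "morgen"}  # hypothetical two-entry dictionary
print(decompound("Samstagmorgen", lexicon))
print(decompound("Haus", lexicon))
```

Indexing both components lets a query for “Samstag” find a document that only contains “Samstagmorgen”.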
Normalization
● Reduce word form variations to a canonical representation
● Critical for recall
● Two approaches
  ○ Stemming
  ○ Lemmatization
Normalization: Lemmatization
● Map words to their dictionary form via morphological analysis
● spoke, speaks, speaking → speak
● Higher precision and recall compared to stemming
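The difference shows up on irregular forms. In the sketch below, `naive_stem` is a made-up suffix stripper (not Porter or any real stemmer) and `LEMMA_TABLE` is a hypothetical three-entry dictionary standing in for morphological analysis.

```python
# Hypothetical lemma dictionary; a real lemmatizer derives this from
# morphological analysis rather than a hand-written table.
LEMMA_TABLE = {"spoke": "speak", "speaks": "speak", "speaking": "speak"}


def naive_stem(word):
    """Strip a couple of common English suffixes (illustration only)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the lemma table, falling back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for word in ("spoke", "speaks", "speaking"):
    print(word, naive_stem(word), lemmatize(word))
```

Suffix stripping leaves “spoke” untouched, so a query for “speaking” would miss it; the dictionary lookup maps all three forms to “speak”.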
Solr Intro
● Apache open source project
● Search server framework built on Lucene
● Document-oriented, scalable, flexible
● Widely used in large production systems
  ○ eBay, Instagram, Netflix, ...
NLP Within Solr
● Maximal precision / recall requires NLP pipeline per language
● NLP pipeline specified within Solr field type
● Index / query strategies in Solr
  ○ field per language
  ○ core per language
  ○ other
Field per language
schema.xml
<field name="content_cjk" type="text_cjk" indexed="true" stored="true" />
<field name="content_eng" type="text_eng" indexed="true" stored="true" />
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
</analyzer>
</fieldType>
query
http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng
Field per language
http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng
● q=serie%20a: the user’s query text, “serie a”
● defType=edismax: selects the Extended DisMax query parser
● qf=content_cjk%20content_eng: the fields to search, one per language
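A client might assemble this request as follows. This is a minimal sketch: the base URL `http://localhost:8983/solr/articles` and the helper name are assumptions for illustration, while the parameter names and values come from the query above.

```python
from urllib.parse import quote, urlencode


def build_field_per_language_query(base_url, user_query, fields):
    """Build a select URL that searches one field per language with edismax."""
    params = {
        "q": user_query,
        "defType": "edismax",
        # qf takes a space-separated list of field names.
        "qf": " ".join(fields),
    }
    # quote_via=quote encodes spaces as %20 rather than "+".
    return base_url + "/select?" + urlencode(params, quote_via=quote)


url = build_field_per_language_query(
    "http://localhost:8983/solr/articles",  # hypothetical Solr location
    "serie a",
    ["content_cjk", "content_eng"],
)
print(url)
```

Adding a language then means adding one field to the schema and one name to the `fields` list.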
Core per language
CJK core’s schema.xml
<field name="content" type="text_cjk" indexed="true" stored="true" multiValued="true"/>
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
</analyzer>
</fieldType>
query
http://.../select?q=content:serie%20a&shards=<url>/articles_cjk,<url>/articles_eng
Core per language
http://.../select?q=content:serie%20a&shards=<url>/articles_cjk,<url>/articles_eng
● q=content:serie%20a: the query, against the same field name (content) in every core
● shards=<url>/articles_cjk,<url>/articles_eng: the per-language cores to search; Solr distributes the query to them and merges the results
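With a core per language, documents have to be routed to the right core at index time, and the query has to name every core in `shards`. The sketch below shows both steps under assumptions: the language codes, the `language` key on each document, and the `localhost:8983/solr` location are hypothetical; only the core names come from the query above.

```python
from collections import defaultdict

# Hypothetical mapping from a detected language code to a core name.
CORE_BY_LANGUAGE = {
    "zho": "articles_cjk",
    "jpn": "articles_cjk",
    "kor": "articles_cjk",
    "eng": "articles_eng",
}


def route_documents(docs):
    """Group documents into per-core batches based on their language."""
    batches = defaultdict(list)
    for doc in docs:
        core = CORE_BY_LANGUAGE[doc["language"]]
        # Every core uses the same field name, "content".
        batches[core].append({"id": doc["id"], "content": doc["content"]})
    return dict(batches)


def shards_param(base_url, cores):
    """Build the comma-separated value for the shards parameter."""
    return ",".join(f"{base_url}/{core}" for core in cores)


docs = [
    {"id": "1", "language": "jpn", "content": "ケネディはマサチューセッツ"},
    {"id": "2", "language": "eng", "content": "Bill Clinton spoke about ..."},
]
batches = route_documents(docs)
cores = sorted(set(CORE_BY_LANGUAGE.values()))
print(sorted(batches))
print(shards_param("localhost:8983/solr", cores))
```

Routing depends on knowing each document’s language, which is one reason language detection is often required when indexing.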
An Alternative Approach
All languages in a single field
● requires a custom meta field type that applies per-language concrete field type(s)
● patch submitted to Solr
cf. Solr In Action / Trey Grainger https://github.com/treygrainger/solr-in-action
An Alternative Approach
query: “[en, es]clinton”
Inspect [en, es], apply English and Spanish field types to “clinton”, merge results
Terms: clinton, speak
Inverted Index
term       document IDs
...        ...
clinton    …, 123, ...
...        ...
speak      …, 123, ...
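The “inspect, apply, merge” step could look roughly like this. It is a sketch of the idea only, not the submitted Solr patch or the Solr In Action code: the toy analyzers are hypothetical, and “merge results” is interpreted here as merging the terms each language’s analysis produces.

```python
import re

# Matches a leading "[en, es]"-style language list followed by the text.
PREFIX = re.compile(r"^\[([^\]]*)\](.*)$", re.DOTALL)


def parse_multilingual_query(raw):
    """Split '[en, es]clinton' into (['en', 'es'], 'clinton')."""
    match = PREFIX.match(raw)
    if not match:
        return [], raw
    langs = [code.strip() for code in match.group(1).split(",") if code.strip()]
    return langs, match.group(2)


def analyze_all(raw, analyzers):
    """Run the text through each requested analyzer and merge the terms."""
    langs, text = parse_multilingual_query(raw)
    terms = []
    for lang in langs:
        for term in analyzers[lang](text):
            if term not in terms:  # keep first occurrence, drop duplicates
                terms.append(term)
    return terms


# Hypothetical per-language analyzers standing in for Solr field types.
ANALYZERS = {
    "en": lambda text: [tok.lower() for tok in text.split()],
    "es": lambda text: [tok.lower().rstrip("s") for tok in text.split()],
}

print(parse_multilingual_query("[en, es]clinton"))
print(analyze_all("[en, es]recetas", ANALYZERS))
```

When the analyzers agree, as they do for “clinton”, the merged list holds a single term; when they differ, the field ends up with one variant per language.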
An Alternative Approach
● Results scoring potentially worse than other approaches
● IDF (inverse document frequency) thrown off with single field
  ○ e.g., soy common in Spanish, relatively rare in English
  ○ Consider a query for “soy dessert recipe” against a corpus of English and Spanish recipes
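A small calculation makes the IDF effect concrete. The document counts below are invented for illustration, and the formula is one common smoothed form of IDF rather than a claim about what any particular Solr version computes.

```python
import math


def idf(num_docs, doc_freq):
    """One common smoothed IDF form: log(N / (df + 1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1


# Invented counts: "soy" is rare in English recipes, common in Spanish ones.
english_docs, english_soy = 1000, 20
spanish_docs, spanish_soy = 1000, 400

idf_english_only = idf(english_docs, english_soy)
idf_spanish_only = idf(spanish_docs, spanish_soy)
idf_single_field = idf(english_docs + spanish_docs, english_soy + spanish_soy)

print(round(idf_english_only, 2))
print(round(idf_spanish_only, 2))
print(round(idf_single_field, 2))
```

In the combined field, “soy” looks far more common than it is in English alone, so it contributes less to the score of an English “soy dessert recipe” query than it would in an English-only field or core.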
Enhancing NLP Pipeline
● Limitations of NLP in Solr out of the box
  ○ Poor precision / performance of CJK tokenization
  ○ Poor precision / recall of stemmers (no lemmatizers)
  ○ Poor recall due to lack of decompounding
Rosette to the rescue!
CJK Tokenization
ケネディはマサチューセッツ
● Rosette: ケネディ, は, マサチューセッツ
● Bigrams: ケネ, ネデ, ディ, ィは, はマ, マサ, サチ, チュ, ュー, ーセ, セッ, ッツ
● How does this impact precision, recall, index size, speed?
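The bigram list above can be reproduced with plain overlapping character pairs, which gives a feel for the index-size part of the question. This sketch is not a reimplementation of `solr.CJKBigramFilterFactory`; it only generates the bigrams as listed on the slide.

```python
import unicodedata


def char_bigrams(text):
    """Return every overlapping pair of adjacent characters in text."""
    # NFC keeps characters such as デ as a single code point each.
    text = unicodedata.normalize("NFC", text)
    return [text[i:i + 2] for i in range(len(text) - 1)]


text = "ケネディはマサチューセッツ"
word_tokens = ["ケネディ", "は", "マサチューセッツ"]  # segmentation from the slide
bigram_tokens = char_bigrams(text)

print(len(word_tokens), len(bigram_tokens))
print(bigram_tokens)
```

Thirteen characters yield twelve bigrams against three word tokens, and most of those bigrams (for example ィは) straddle word boundaries and match nothing a user would search for.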
Rosette In Solr
<fieldType name="text_zho" class="solr.TextField">
<analyzer type="index">
<tokenizer
class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
rootDirectory="<rootDir>"
language="zho" />
<filter
class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
rootDirectory="<rootDir>"
language="zho" />
</analyzer>
</fieldType>
cf. http://www.basistech.com/search-essentials/
Wrapping Up
● Multilingual search is everywhere
● Solr as a search platform
● Search quality hinges on quality of NLP tools