Multilingual Search in Solr


Talk Overview

● The problem we’re trying to solve
● Natural language processing (NLP)
● Approaches to multilingual search in Solr

A Multilingual Search Example

The Goal

Build a search engine where:

● Document corpus spans multiple languages
  ○ Potentially mixed-language documents
● Queries are within a single language, or potentially span multiple languages

NLP Meets Search (Querying)

query: “clinton speaking” → NLP pipeline → Terms: clinton, speak

Inverted Index

term       document IDs
...        ...
clinton    …, 123, ...
...        ...
speak      …, 123, ...

NLP Meets Search (Indexing)

Document 123: “Bill Clinton spoke about ...” → NLP pipeline → Terms: bill, clinton, speak, about

Inverted Index

term       document IDs
...        ...
clinton    …, 123, ...
...        ...
speak      …, 123, ...

NLP Meets Search

In Solr:

Document 123: “Bill Clinton spoke about ...” → NLP pipeline → Terms: bill, clinton, speak, about

query: “clinton speaking” → NLP pipeline → Terms: clinton, speak

Inverted Index

term       document IDs
...        ...
clinton    …, 123, ...
...        ...
speak      …, 123, ...
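The indexing and querying flow on these slides can be sketched in a few lines of Python. This is only an illustration, not how Solr or Lucene is implemented: `analyze`, `add_document`, `search`, and the tiny `LEMMAS` table are hypothetical names standing in for a real NLP pipeline. The key point it shows is that the same pipeline must run on both documents and queries so that their terms line up.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for real normalization (assumption, not Solr's behavior).
LEMMAS = {"spoke": "speak", "speaking": "speak", "speaks": "speak"}

def analyze(text):
    """Toy NLP pipeline: lowercase, split on word characters, normalize."""
    tokens = re.findall(r"\w+", text.lower())
    return [LEMMAS.get(tok, tok) for tok in tokens]

# term -> set of document IDs
index = defaultdict(set)

def add_document(doc_id, text):
    """Index time: run the pipeline and record each term's document ID."""
    for term in analyze(text):
        index[term].add(doc_id)

def search(query):
    """Query time: run the same pipeline, then intersect the postings."""
    terms = analyze(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

add_document(123, "Bill Clinton spoke about ...")
print(analyze("Bill Clinton spoke about ..."))
print(search("clinton speaking"))
```

Because “spoke” and “speaking” both normalize to “speak”, the query matches document 123 even though neither surface form appears in both texts.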


The NLP Pipeline

● Language Detection
● Tokenization
● Decompounding
● Normalization

Language Detection

● Often required when indexing
● Typically not used at query time
  ○ Lower accuracy on short strings
  ○ Sometimes unsolvable even for humans, e.g., named entities
  ○ End-user applications often know the query language upstream of the search engine
  ○ No readily available plugin pattern (in Solr)

Tokenization

● Breaking text into words
● Particularly difficult with CJK languages
  ○ Find the words: 帰国後ハーバード大学に入学を認められていたもの

Decompounding

● Breaking compound words into subcomponents
● Common in German, Dutch, Korean
  ○ Samstagmorgen → Samstag, morgen
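A minimal sketch of dictionary-based decompounding, assuming a small hand-written lexicon. The `decompound` function and `LEXICON` set are hypothetical and only handle a two-part split; production decompounders use much larger dictionaries and handle linking elements and multi-part compounds.

```python
# Hypothetical lexicon of known word components (assumption for illustration).
LEXICON = {"samstag", "morgen"}

def decompound(word, lexicon=LEXICON):
    """Split a compound into two known components, preferring the longest head.

    Returns the lowercased word unchanged if no split is found.
    """
    lower = word.lower()
    for i in range(len(lower) - 1, 0, -1):
        head, tail = lower[:i], lower[i:]
        if head in lexicon and tail in lexicon:
            return [head, tail]
    return [lower]

print(decompound("Samstagmorgen"))
print(decompound("Solr"))
```

Indexing the components alongside (or instead of) the compound lets a query for “Samstag” match a document that only contains “Samstagmorgen”.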

Normalization

● Reduce word form variations to a canonical representation
● Critical for recall
● Two approaches
  ○ Stemming
  ○ Lemmatization

Normalization: Stemming

● Simple rule-based approach
● “chop off the end”
  ○ arsenal, arsenic → arsen
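The “chop off the end” idea can be sketched with a toy suffix stripper. This is not the Porter stemmer or any stemmer shipped with Solr; the `SUFFIXES` list and the minimum stem length of three are assumptions chosen to reproduce the slide's example.

```python
# Toy suffix list, checked in order (assumption for illustration only).
SUFFIXES = ("ing", "al", "ic", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower

for w in ("arsenal", "arsenic", "speaking", "spoke"):
    print(w, "->", stem(w))
```

The sketch shows both failure modes of rule-based stemming: unrelated words (“arsenal”, “arsenic”) collapse to the same stem, which hurts precision, while the irregular form “spoke” is left untouched and never meets “speak”, which hurts recall.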

Normalization: Lemmatization

● Map words to their dictionary form via morphological analysis
● spoke, speaks, speaking → speak
● Higher precision and recall compared to stemming
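As a rough illustration of lemmatization, the sketch below uses a hand-written lookup table in place of real morphological analysis. `LEMMA_TABLE` and `lemmatize` are hypothetical; an actual lemmatizer derives these mappings from a morphological dictionary and, often, part-of-speech context.

```python
# Hand-written table standing in for morphological analysis (assumption).
LEMMA_TABLE = {
    "spoke": "speak",
    "speaks": "speak",
    "speaking": "speak",
}

def lemmatize(word):
    """Return the dictionary form if known, otherwise the lowercased word."""
    lower = word.lower()
    return LEMMA_TABLE.get(lower, lower)

for w in ("spoke", "speaks", "speaking", "arsenal", "arsenic"):
    print(w, "->", lemmatize(w))
```

Irregular forms such as “spoke” map to “speak”, and unrelated words such as “arsenal” and “arsenic” stay distinct.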


Solr Intro

● Apache open source project
● Search server framework built on Lucene
● Document-oriented, scalable, flexible
● Widely used in large production systems
  ○ eBay, Instagram, Netflix, ...

NLP Within Solr

● Maximal precision / recall requires an NLP pipeline per language
● NLP pipeline specified within a Solr field type
● Index / query strategies in Solr
  ○ field per language
  ○ core per language
  ○ other

Field per language

schema.xml

<field name="content_cjk" type="text_cjk" indexed="true" stored="true" />
<field name="content_eng" type="text_eng" indexed="true" stored="true" />

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>

query

http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng

Field per language (query parameters)

● q=serie%20a: the user's query, “serie a”
● defType=edismax: use the Extended DisMax query parser
● qf=content_cjk%20content_eng: search both per-language fields
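A small sketch of how a client might assemble the field-per-language query. The base URL (`localhost:8983`) and the helper name `field_per_language_url` are assumptions for illustration; the parameter names `q`, `defType`, and `qf` come from the slides.

```python
from urllib.parse import urlencode, quote

# Assumed Solr location; replace with the real host and core.
BASE_URL = "http://localhost:8983/solr/articles/select"

def field_per_language_url(query, fields, base_url=BASE_URL):
    """Build an edismax query that searches every per-language field."""
    params = {
        "q": query,
        "defType": "edismax",
        "qf": " ".join(fields),  # query fields are space-separated
    }
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return base_url + "?" + urlencode(params, quote_via=quote)

url = field_per_language_url("serie a", ["content_cjk", "content_eng"])
print(url)
```

Adding a language under this approach means adding a field to the schema and one more entry to the `fields` list.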

Core per language

CJK core’s schema.xml

<field name="content" type="text_cjk" indexed="true" stored="true" multiValued="true"/>

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>

query

http://.../select?q=content:serie%20a&shards=<url>/articles_cjk,<url>/articles_eng

Core per language (query parameters)

● q=content:serie%20a: the query, against the content field that every core defines
● shards=<url>/articles_cjk,<url>/articles_eng: the per-language cores to search; Solr merges their results
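The core-per-language request can be assembled the same way. The host names and the helper `core_per_language_url` below are hypothetical; only the `q` and `shards` parameters and the core names `articles_cjk` and `articles_eng` come from the slides.

```python
from urllib.parse import urlencode, quote

def core_per_language_url(base_url, query, shard_urls):
    """Build a distributed query that fans out to one core per language."""
    params = {
        "q": query,
        "shards": ",".join(shard_urls),  # comma-separated list of cores
    }
    # Keep ':', ',' and '/' readable, as in the URL on the slide.
    return base_url + "?" + urlencode(params, safe=":,/", quote_via=quote)

url = core_per_language_url(
    "http://localhost:8983/solr/articles_eng/select",  # assumed entry point
    "content:serie a",
    [
        "localhost:8983/solr/articles_cjk",  # assumed shard addresses
        "localhost:8983/solr/articles_eng",
    ],
)
print(url)
```

Here adding a language means creating a new core and appending its address to the shard list, while each core's schema stays simple.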

Approach Comparison

              Field per language    Core per language
Simplicity
Speed

An Alternative Approach

All languages in a single field
● Requires a custom meta field type that applies per-language concrete field type(s)
● Patch submitted to Solr

cf. Solr In Action / Trey Grainger https://github.com/treygrainger/solr-in-action

An Alternative Approach

query: “[en, es]clinton” → Inspect [en, es], apply English and Spanish field types to “clinton”, merge results → Terms: clinton, speak

Inverted Index

term       document IDs
...        ...
clinton    …, 123, ...
...        ...
speak      …, 123, ...
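One way to picture the language-prefix handling is the sketch below. It is not the patch referenced on the earlier slide: `parse_query`, `analyze_multilingual`, and the per-language `ANALYZERS` tables are hypothetical stand-ins for the concrete field types a meta field type would delegate to.

```python
import re

# Toy per-language normalization tables (assumptions for illustration).
ANALYZERS = {
    "en": {"speaking": "speak"},
    "es": {"hablando": "hablar"},
}

PREFIX = re.compile(r"^\[([^\]]*)\]\s*(.*)$")

def parse_query(raw):
    """Split '[en, es]clinton' into (['en', 'es'], 'clinton')."""
    match = PREFIX.match(raw)
    if match is None:
        return [], raw
    langs = [code.strip() for code in match.group(1).split(",") if code.strip()]
    return langs, match.group(2)

def analyze_multilingual(raw):
    """Apply each requested language's analysis and merge the terms."""
    langs, text = parse_query(raw)
    merged = []
    for lang in langs:
        table = ANALYZERS.get(lang, {})
        for token in text.lower().split():
            term = table.get(token, token)
            if term not in merged:  # keep first occurrence, drop duplicates
                merged.append(term)
    return merged

print(analyze_multilingual("[en, es]clinton"))
print(analyze_multilingual("[en, es]hablando"))
```

When the analyses agree, as for “clinton”, the merged output is a single term; when they differ, the field ends up holding the union of every language's terms.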

An Alternative Approach

● Results scoring potentially worse than other approaches
● IDF thrown off with single field
  ○ e.g., “soy” is common in Spanish, relatively rare in English
  ○ Consider a query for “soy dessert recipe” against a corpus of English and Spanish recipes
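The IDF effect can be seen with a toy corpus. The documents below are invented for illustration, and the formula is the textbook log(N / df), not Lucene's exact scoring function.

```python
import math

# Invented example documents (assumption for illustration).
english_docs = [
    "soy dessert recipe",
    "chocolate dessert recipe",
    "apple pie recipe",
]
spanish_docs = [
    "soy un cocinero",
    "yo soy de madrid",
    "soy feliz con el postre",
]

def idf(term, docs):
    """Textbook inverse document frequency: log(N / df)."""
    df = sum(1 for doc in docs if term in doc.lower().split())
    return math.log(len(docs) / df) if df else float("inf")

idf_english_only = idf("soy", english_docs)
idf_mixed = idf("soy", english_docs + spanish_docs)
print(f"English-only field: {idf_english_only:.3f}")
print(f"Single mixed field: {idf_mixed:.3f}")
```

In the English-only field “soy” is rare and therefore a strong signal for the recipe query. In the mixed field the Spanish verb form inflates its document frequency, so the term that best identifies the wanted recipes contributes less to the score.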

Enhancing NLP Pipeline

● Limitations of NLP in Solr out of the box
  ○ Poor precision / performance of CJK tokenization
  ○ Poor precision / recall of stemmers (no lemmatizers)
  ○ Poor recall due to lack of decompounding

Rosette to the rescue!

CJK Tokenization

ケネディはマサチューセッツ

● Rosette: ケネディ, は, マサチューセッツ
● Bigrams: ケネ, ネデ, ディ, ィは, はマ, マサ, サチ, チュ, ュー, ーセ, セッ, ッツ
● How does this impact precision, recall, index size, speed?
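To get a feel for the index-size question, the sketch below produces plain overlapping character bigrams for the example string and compares the count with the three tokens listed on the slide. It is a simplified illustration, not the exact behavior of Solr's CJK bigram filter; `char_bigrams` is a hypothetical helper.

```python
import unicodedata

def char_bigrams(text):
    """Return every overlapping pair of adjacent characters."""
    # Normalize so that voiced kana such as 'デ' count as one character.
    s = unicodedata.normalize("NFC", text)
    return [s[i:i + 2] for i in range(len(s) - 1)]

text = "ケネディはマサチューセッツ"
bigrams = char_bigrams(text)
rosette_tokens = ["ケネディ", "は", "マサチューセッツ"]  # as listed on the slide

print(len(bigrams), "bigram terms:", ", ".join(bigrams))
print(len(rosette_tokens), "word tokens:", ", ".join(rosette_tokens))
```

Bigrams produce several times as many terms, and most of them (for example “ィは”, which straddles a word boundary) are not words, so they can match documents that merely share adjacent characters.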

Rosette In Solr

<fieldType name="text_zho" class="solr.TextField">
  <analyzer type="index">
    <tokenizer
        class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
        rootDirectory="<rootDir>"
        language="zho" />
    <filter
        class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
        rootDirectory="<rootDir>"
        language="zho" />
  </analyzer>
</fieldType>

cf. http://www.basistech.com/search-essentials/

Wrapping Up

● Multilingual search is everywhere
● Solr as a search platform
● Search quality hinges on quality of NLP tools

Contact

Dave [email protected]