Semantic & Multilingual Strategies in Lucene/Solr Trey Grainger Director of Engineering, Search & Analytics @CareerBuilder

Upload: trey-grainger

Post on 02-Jul-2015


DESCRIPTION

When searching on text, choosing the right CharFilters, Tokenizer, stemmers, and other TokenFilters for each supported language is critical. Additional tools of the trade include language detection through UpdateRequestProcessors, parts of speech analysis, entity extraction, stopword and synonym lists, relevancy differentiation for exact vs. stemmed vs. conceptual matches, and identification of statistically interesting phrases per language. For multilingual search, you also need to choose between several strategies such as: searching across multiple fields, using a separate collection per language combination, or combining multiple languages in a single field (custom code is required for this and will be open sourced). These all have their own strengths and weaknesses depending upon your use case. This talk will provide a tutorial (with code examples) on how to pull off each of these strategies as well as compare and contrast the different kinds of stemmers, review the precision/recall impact of stemming vs. lemmatization, and describe some techniques for extracting meaningful relationships between terms to power a semantic search experience per-language. Come learn how to build an excellent semantic and multilingual search system using the best tools and techniques Lucene/Solr has to offer!

TRANSCRIPT

Page 1: Semantic & Multilingual Strategies in Lucene/Solr

Semantic & Multilingual

Strategies in Lucene/Solr

Trey Grainger, Director of Engineering, Search & Analytics

@CareerBuilder

Page 2: Semantic & Multilingual Strategies in Lucene/Solr

Outline

• Introduction

• Text Analysis Refresher

• Language-specific text Analysis

• Multilingual Search Strategies

• Automatic Language Identification

• Semantic Search Strategies (understanding “meaning”)

• Conclusion

Page 3: Semantic & Multilingual Strategies in Lucene/Solr

About Me

Trey Grainger, Director of Engineering, Search & Analytics

Joined CareerBuilder in 2007 as Software Engineer

MBA, Management of Technology – GA Tech

BA, Computer Science, Business, & Philosophy – Furman University

Mining Massive Datasets (in progress) - Stanford University

Fun outside of CB:

• Co-author of Solr in Action, plus several research papers

• Frequent conference speaker

• Founder of Celiaccess.com, the gluten-free search engine

• Lucene/Solr contributor

Page 4: Semantic & Multilingual Strategies in Lucene/Solr

At CareerBuilder, Solr Powers...

Page 5: Semantic & Multilingual Strategies in Lucene/Solr

Text Analysis Refresher

Page 6: Semantic & Multilingual Strategies in Lucene/Solr

Text Analysis Refresher

A text field in Lucene/Solr has an Analyzer containing:

① Zero or more CharFilters: each takes incoming text and “cleans it up” before it is tokenized

② One Tokenizer: splits incoming text into a Token Stream containing zero or more Tokens

③ Zero or more TokenFilters: each examines and optionally modifies each Token in the Token Stream

*From Solr in Action, Chapter 6
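The three stages above can be sketched in a few lines of Python. This is only an illustration of the order of operations, not how Lucene implements analysis: the function names, the regular expression, and the tiny stop list are all assumptions made for this example.

```python
import re

# Illustrative stop list (an assumption, not Solr's stopwords_en.txt)
STOPWORDS = {"the", "a", "of"}

def strip_html(text):
    """CharFilter stand-in: clean up raw text before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenize(text):
    """Tokenizer stand-in: split the cleaned text into tokens."""
    return text.split()

def lowercase(tokens):
    """TokenFilter stand-in: modify each token."""
    return [t.lower() for t in tokens]

def remove_stopwords(tokens):
    """TokenFilter stand-in: drop tokens found in the stop list."""
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more char filters, one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze("<b>The</b> Adventures of Tom Sawyer",
                 [strip_html], whitespace_tokenize,
                 [lowercase, remove_stopwords])
print(tokens)
```

Swapping the order of the two token filters would change the result, which is why filter order matters in a real analysis chain as well.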

Page 10: Semantic & Multilingual Strategies in Lucene/Solr

Language-specific Text Analysis

Page 11: Semantic & Multilingual Strategies in Lucene/Solr

Example English Analysis Chains

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="lang/en_protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Page 12: Semantic & Multilingual Strategies in Lucene/Solr

Per-language Analysis Chains

*Some of the 32 different language configurations in Appendix B of Solr in Action

Page 14: Semantic & Multilingual Strategies in Lucene/Solr

Which Stemmer do I choose?

*From Solr in Action, Chapter 14

Page 15: Semantic & Multilingual Strategies in Lucene/Solr

Common English Stemmers

*From Solr in Action, Chapter 14

Page 16: Semantic & Multilingual Strategies in Lucene/Solr

When Stemming goes awry

Fixing Stemming Mistakes:

• Unfortunately, every stemmer has problem cases that aren’t handled as you would expect

• Thankfully, stemmers can be overridden

• KeywordMarkerFilter: protects a list of terms you specify from being stemmed

• StemmerOverrideFilter: applies a list of custom term mappings you specify

Alternate strategy:

• Use Lemmatization (root-form analysis) instead of Stemming

• Commercial vendors help tremendously in this space (see http://www.basistech.com/case-study-career-builder/)

• The Hunspell stemmer enables dictionary-based support of varying quality in over 100 languages
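A rough Python sketch of how the two override mechanisms interact with a stemmer. The suffix stripper here is a deliberately naive toy (not Porter or KStem), and the protected list and override map are hypothetical stand-ins for the files you would configure on KeywordMarkerFilter and StemmerOverrideFilter.

```python
def naive_stem(token):
    """Toy suffix-stripping stemmer for illustration only (NOT Porter or KStem)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Hypothetical protected-words list (the role KeywordMarkerFilter plays)
PROTECTED = {"news"}
# Hypothetical custom mappings (the role StemmerOverrideFilter plays)
OVERRIDES = {"mice": "mouse"}

def stem_with_overrides(tokens):
    """Apply custom mappings first, skip protected terms, stem everything else."""
    stemmed = []
    for token in tokens:
        if token in OVERRIDES:
            stemmed.append(OVERRIDES[token])
        elif token in PROTECTED:
            stemmed.append(token)
        else:
            stemmed.append(naive_stem(token))
    return stemmed

result = stem_with_overrides(["news", "mice", "searches", "indexing"])
print(result)
```

Without the protected list, the toy stemmer would turn "news" into "new", which is the kind of problem case the overrides exist to fix.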

Page 17: Semantic & Multilingual Strategies in Lucene/Solr

Stemming vs. Lemmatization

• Stemming: algorithmic manipulation of text, based upon common per-language rules

• Lemmatization: finds the dictionary form of a term (lemma means “root”)

- dramatically improves precision (only matching terms that “should” match), while not significantly impacting recall (all terms that should match do match).

*From Solr in Action, Chapter 14
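To make the precision difference concrete, here is a toy comparison in Python. Both functions are assumptions written for this example: `toy_stem` is a crude rule-based suffix stripper, and `LEMMAS` is a hypothetical two-entry dictionary standing in for a real lemmatizer's lexicon.

```python
def toy_stem(token):
    """Crude rule-based stemmer (illustrative; not a real Solr stemmer)."""
    for suffix in ("ation", "ous", "al"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token

# Hypothetical dictionary mapping inflected forms to their lemma
LEMMAS = {"generals": "general", "mice": "mouse"}

def lemmatize(token):
    """Dictionary lookup; words that are already lemmas map to themselves."""
    return LEMMAS.get(token, token)

words = ["general", "generous", "generation"]
stems = {toy_stem(w) for w in words}
lemmas = {lemmatize(w) for w in words}
print(stems, lemmas)
```

The rule-based stemmer collapses three unrelated words into one index term, so a search for any of them matches all three. The dictionary approach keeps them distinct while still mapping "generals" to "general".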

Page 18: Semantic & Multilingual Strategies in Lucene/Solr

Multilingual Search Strategies

Page 19: Semantic & Multilingual Strategies in Lucene/Solr

Multilingual Search Strategies

How do you handle:

…a different language per document?

…multiple languages in the same document?

…multiple languages in the same field?

Strategies:

1) Separate field per language

2) Separate collection/core per language

3) All languages in one field

Page 20: Semantic & Multilingual Strategies in Lucene/Solr

Strategy 1: Separate field per language

*From Solr in Action, Chapter 14

Page 21: Semantic & Multilingual Strategies in Lucene/Solr

Separate field per language

<field name="id" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="content_english" type="text_english" indexed="true" stored="true" />
<field name="content_french" type="text_french" indexed="true" stored="true" />
<field name="content_spanish" type="text_spanish" indexed="true" stored="true" />

<fieldType name="text_english" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_spanish" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball"/>
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_french" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
  </analyzer>
</fieldType>

schema.xml

*From Solr in Action, Chapter 14

Page 22: Semantic & Multilingual Strategies in Lucene/Solr

Separate field per language: one language per document

<doc>
  <field name="id">1</field>
  <field name="title">The Adventures of Huckleberry Finn</field>
  <field name="content_english">YOU don't know about me without you have read
    a book by the name of The Adventures of Tom Sawyer; but that ain't no
    matter. That book was made by Mr. Mark Twain, and he told the truth,
    mainly. There was things which he stretched, but mainly he told the truth.
  </field>
</doc>
<doc>
  <field name="id">2</field>
  <field name="title">Les Misérables</field>
  <field name="content_french">Nul n'aurait pu le dire; tout ce qu'on savait,
    c'est que, lorsqu'il revint d'Italie, il était prêtre.
  </field>
</doc>
<doc>
  <field name="id">3</field>
  <field name="title">Don Quixote</field>
  <field name="content_spanish">Demasiada cordura puede ser la peor de las
    locuras, ver la vida como es y no como debería de ser.
  </field>
</doc>

Query:

http://localhost:8983/solr/field-per-language/select?

fl=title&

defType=edismax&

qf=content_english content_french content_spanish&

q="he told the truth"

OR "il était prêtre"

OR "ver la vida como es"

Response:

{
  "response":{"numFound":3,"start":0,"docs":[
    {"title":["The Adventures of Huckleberry Finn"]},
    {"title":["Don Quixote"]},
    {"title":["Les Misérables"]}]
  }
}

*From Solr in Action, Chapter 14
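The query on this slide is shown unencoded for readability. As a minimal sketch, the same request could be assembled and URL-encoded in Python as follows; the helper name `build_query_url` is made up for this example, while the host, core name, fields, and phrases come from the slide.

```python
from urllib.parse import urlencode

def build_query_url(base_url, phrases, fields):
    """Build an edismax request that searches every language field at once."""
    params = {
        "fl": "title",
        "defType": "edismax",
        # qf lists one field per language, separated by spaces
        "qf": " ".join(fields),
        # each phrase is quoted, and the phrases are OR'ed together
        "q": " OR ".join('"%s"' % p for p in phrases),
    }
    return base_url + "?" + urlencode(params)

url = build_query_url(
    "http://localhost:8983/solr/field-per-language/select",
    ["he told the truth", "il était prêtre", "ver la vida como es"],
    ["content_english", "content_french", "content_spanish"],
)
print(url)
```

Because each field has its own analyzer, the same query string is analyzed once per field, so every phrase is run through the English, French, and Spanish chains.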

Page 23: Semantic & Multilingual Strategies in Lucene/Solr

Separate field per language: multiple languages per document

Query 1: http://localhost:8983/solr/field-per-language/select?fl=title&defType=edismax&qf=content_english content_french content_spanish&q="wisdom"

Query 2: http://localhost:8983/solr/field-per-language/select?...&q="sabiduría"

Query 3: http://localhost:8983/solr/field-per-language/select?...&q="sagesse"

Response: (same for queries 1–3)

{
  "response":{"numFound":1,"start":0,"docs":[{"title":["Proverbs"]}]}
}

Documents:

<doc>
  <field name="id">4</field>
  <field name="title">Proverbs</field>
  <field name="content_spanish"> No la abandones y ella velará sobre
    ti, ámala y ella te protegerá. Lo principal es la sabiduría; adquiere
    sabiduría, y con todo lo que obtengas adquiere inteligencia.
  </field>
  <field name="content_english">Do not forsake wisdom, and she
    will protect you; love her, and she will watch over you. Wisdom is supreme;
    therefore get wisdom. Though it cost all you have, get understanding.
  </field>
  <field name="content_french">N'abandonne pas la sagesse, et elle te
    gardera, aime-la, et elle te protégera. Voici le début de la sagesse:
    acquiers la sagesse, procure-toi le discernement au prix de tout
    ce que tu possèdes.
  </field>
</doc>

*From Solr in Action, Chapter 14

Page 24: Semantic & Multilingual Strategies in Lucene/Solr

Summary: Separate field per language

*From Solr in Action, Chapter 14

Page 25: Semantic & Multilingual Strategies in Lucene/Solr

Strategy 2: Separate collection per language

*From Solr in Action, Chapter 14

Page 26: Semantic & Multilingual Strategies in Lucene/Solr

Separate collection per language: schema.xml

*From Solr in Action, Chapter 14

Page 27: Semantic & Multilingual Strategies in Lucene/Solr

Separate collection per language:

Indexing & Querying

Indexing:

cd $SOLR_IN_ACTION/example-docs/

java -Durl=http://localhost:8983/solr/english/update -jar post.jar ch14/documents/english.xml

java -Durl=http://localhost:8983/solr/spanish/update -jar post.jar ch14/documents/spanish.xml

java -Durl=http://localhost:8983/solr/french/update -jar post.jar ch14/documents/french.xml

Query (collections in Solr Cloud):

http://localhost:8983/solr/aggregator/select?
shards=english,spanish,french&
df=content&
q=query in any language here

Query (specific cores):

http://localhost:8983/solr/aggregator/select?
shards=localhost:8983/solr/english,localhost:8983/solr/spanish,localhost:8983/solr/french&
df=content&
q=query in any language here

Documents:

All documents just have a single “content” field. The documents get routed to a different language-specific Solr collection based upon the language of the content field.

*From Solr in Action, Chapter 14
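The routing step described above is left to the indexing client. A minimal Python sketch of that client-side logic, assuming the language code of each document is already known; the `COLLECTIONS` table, the function names, and the choice of English as the default collection are assumptions for illustration.

```python
# Host and collection names as used on the slide
SOLR_BASE = "http://localhost:8983/solr"
COLLECTIONS = {"en": "english", "es": "spanish", "fr": "french"}

def update_url(language, default="english"):
    """Pick the update endpoint of the collection matching a language code.

    Falling back to a default collection for unknown languages is an
    assumption of this sketch, not something the slides prescribe.
    """
    return "%s/%s/update" % (SOLR_BASE, COLLECTIONS.get(language, default))

def shards_param():
    """Build the shards value that lets one request search every core."""
    return ",".join("localhost:8983/solr/%s" % name
                    for name in COLLECTIONS.values())

print(update_url("fr"))
print(shards_param())
```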

Page 28: Semantic & Multilingual Strategies in Lucene/Solr

Summary: Separate index per language

*From Solr in Action, Chapter 14

Page 29: Semantic & Multilingual Strategies in Lucene/Solr

Strategy 3: One Field for all languages

*From Solr in Action, Chapter 14

Page 30: Semantic & Multilingual Strategies in Lucene/Solr

One Field for all languages: Feature Status

• Note: This feature is not yet committed to Solr

• I’m working on it in my free time. Currently it supports:

• Update Request Processor which can automatically detect the languages of documents and choose the correct analyzers

• Field Type which allows dynamically choosing one or more analyzers on a per-field (indexing) and per-term (querying) basis.

• The current code from Solr in Action is freely available on GitHub.

• There is a JIRA ticket open to ultimately contribute this back to Solr: SOLR-6492

• Some work is still necessary to make querying more user friendly.

Page 31: Semantic & Multilingual Strategies in Lucene/Solr

One Field for all languages

Step 1: Define Multilingual Field

schema.xml:

<fieldType name="multilingual_text" class="sia.ch14.MultiTextField" sortMissingLast="true" defaultFieldType="text_general" fieldMappings="en:text_english,es:text_spanish,fr:text_french,de:text_german"/> [1]

<field name="text" type="multilingual_text" indexed="true" multiValued="true" />

[1] Note that "text_english", "text_spanish", "text_french", and "text_german" refer to field types defined elsewhere in the schema.xml

[2] Uses the "defaultFieldType", in this case "text_general", defined elsewhere in schema.xml

Step 2: Index documents

<add><doc>…
  <field name="text">general keywords</field> [2]
  <field name="text">en,es|the school, las escuelas</field>…
</doc></add>

<add><doc>…
  <field name="text">en|the school</field>
  <field name="text">es|las escuelas</field>…
</doc></add>

Step 3: Search

http://localhost:8983/solr/collection1/select?
q=es|escuela OR en,es,de|school OR school [2]
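The language prefix syntax used in Steps 2 and 3 can be illustrated with a short Python sketch. This is not the MultiTextField code from Solr in Action; it only mimics the behavior the slide describes, and the function name and the handling of unknown prefixes are assumptions.

```python
# Mappings and default copied from the fieldType definition on the slide
FIELD_MAPPINGS = {
    "en": "text_english",
    "es": "text_spanish",
    "fr": "text_french",
    "de": "text_german",
}
DEFAULT_FIELD_TYPE = "text_general"

def resolve_field_types(value):
    """Split 'en,es|some text' into (list of field types, text).

    Values without a recognized language prefix use the default field
    type. Treating an unknown prefix as plain text is an assumption of
    this sketch.
    """
    prefix, separator, text = value.partition("|")
    codes = [code.strip() for code in prefix.split(",")]
    if separator and all(code in FIELD_MAPPINGS for code in codes):
        return [FIELD_MAPPINGS[code] for code in codes], text
    return [DEFAULT_FIELD_TYPE], value

print(resolve_field_types("en,es|the school, las escuelas"))
print(resolve_field_types("general keywords"))
```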

Page 32: Semantic & Multilingual Strategies in Lucene/Solr

One Field For All Languages: Stacked Token Streams

1) English Field 2) Spanish Field

3) English + Spanish combined in Multilingual Text Field

multilingual_text

① For each language requested, the appropriate field type is chosen

② The input text is passed separately to the Analyzer chain for each field type

③ The resulting Token Streams from each Analyzer chain are stacked into a unified Token Stream based upon their position increments

*Screenshot from Solr in Action, Chapter 14
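Stacking by position increment can be sketched as follows. Token streams are modeled as lists of (term, position increment) pairs, and the two input streams are written by hand to represent what an English and a Spanish chain might emit for "the schools"; they are illustrative values, not real analyzer output, and the merge logic is an assumption rather than the actual implementation.

```python
def absolute_positions(stream):
    """Convert (term, position_increment) pairs into (position, term) pairs."""
    position, positioned = 0, []
    for term, increment in stream:
        position += increment
        positioned.append((position, term))
    return positioned

def stack_streams(streams):
    """Merge several token streams into one, ordered by absolute position.

    Tokens sharing a position are emitted one after another; every token
    after the first at a position gets a position increment of 0.
    """
    merged = []
    for stream in streams:
        merged.extend(absolute_positions(stream))
    merged.sort(key=lambda item: item[0])  # stable sort keeps stream order
    stacked, last_position = [], 0
    for position, term in merged:
        stacked.append((term, position - last_position))
        last_position = position
    return stacked

# Hand-written example streams for the input "the schools"
english = [("school", 2)]               # "the" removed as a stopword, then stemmed
spanish = [("the", 1), ("schools", 1)]  # not a Spanish stopword, left unstemmed

stacked = stack_streams([english, spanish])
print(stacked)
```

A position increment of 0 places "schools" at the same position as "school", which is what lets phrase queries match regardless of which language's form was indexed.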

Page 33: Semantic & Multilingual Strategies in Lucene/Solr

Strategy 3: All languages in one field

*See Solr in Action, Chapter 14

Page 34: Semantic & Multilingual Strategies in Lucene/Solr

Automatic Language Identification

Page 35: Semantic & Multilingual Strategies in Lucene/Solr

Identifying languages in documents

solrconfig.xml
...
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="invariants">
      <str name="langid.fl">content,content_lang1,content_lang2,content_lang3</str>
      <str name="langid.langField">language</str>
      <str name="langid.langsField">languages</str>
      ...
    </lst>
  </processor>
  ...
</updateRequestProcessorChain>
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="invariants">
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>
...

schema.xml
...
<field name="language" type="string" indexed="true" stored="true" />
<field name="languages" type="string" indexed="true" stored="true" multiValued="true"/>
...

*See Solr in Action, Chapter 14

Page 36: Semantic & Multilingual Strategies in Lucene/Solr

Identifying languages in documents

Sending documents:

cd $SOLR_IN_ACTION/example-docs/

java -Durl=http://localhost:8983/solr/langid/update -jar post.jar ch14/documents/langid.xml

Query:

http://localhost:8983/solr/langid/select?
q=*:*&
fl=title,language,languages

Results:

[{
  "title":"The Adventures of Huckleberry Finn",
  "language":"en",
  "languages":["en"]},
 {
  "title":"Les Misérables",
  "language":"fr",
  "languages":["fr"]},
 {
  "title":"Don Quixote",
  "language":"es",
  "languages":["es"]},
 {
  "title":"Proverbs",
  "language":"fr",
  "languages":["fr", "en", "es"]}]

*See Solr in Action, Chapter 14

Page 37: Semantic & Multilingual Strategies in Lucene/Solr

Mapping data to language-specific fields

solrconfig.xml
...
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="invariants">
      <str name="langid.fl">content</str>
      <str name="langid.langField">language</str>
      <str name="langid.map">true</str>
      <str name="langid.map.fl">content</str>
      <str name="langid.whitelist">en,es,fr</str>
      <str name="langid.map.lcmap">en:english es:spanish fr:french</str>
      <str name="langid.fallback">en</str>
    </lst>
  </processor>
  ...
</updateRequestProcessorChain>
...

Indexed Documents:

[{
  "title":"The Adventures of Huckleberry Finn",
  "language":"en",
  "content_english":["YOU don't know about me without..."]},
 {
  "title":"Les Misérables",
  "language":"fr",
  "content_french":["Nul n'aurait pu le dire; tout ce..."]},
 {
  "title":"Don Quixote",
  "language":"es",
  "content_spanish":["Demasiada cordura puede ser la peor..."]}]

*See Solr in Action, Chapter 14
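The effect of the mapping parameters above can be approximated in a few lines of Python. This sketch takes the detected language as an input instead of running a detector, and it applies the whitelist as a simple filter after detection, which is a simplification; the function name and signature are hypothetical.

```python
def map_language_field(doc, detected, field="content", lang_field="language",
                       whitelist=("en", "es", "fr"), fallback="en", lcmap=None):
    """Rename `field` to a language-specific field, e.g. content -> content_french.

    `detected` is the language code a detector would have produced. Codes
    outside the whitelist fall back to `fallback` (a simplification of how
    the real processor applies its whitelist).
    """
    lcmap = lcmap or {"en": "english", "es": "spanish", "fr": "french"}
    language = detected if detected in whitelist else fallback
    mapped = dict(doc)  # leave the caller's document untouched
    mapped[lang_field] = language
    mapped["%s_%s" % (field, lcmap.get(language, language))] = mapped.pop(field)
    return mapped

doc = {"title": "Les Misérables", "content": "Nul n'aurait pu le dire; tout ce..."}
mapped = map_language_field(doc, "fr")
print(mapped)
```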

Page 38: Semantic & Multilingual Strategies in Lucene/Solr

Semantic Strategies

Page 39: Semantic & Multilingual Strategies in Lucene/Solr

The need for Semantic Search

User’s Query: machine learning research and development Portland, OR software engineer AND hadoop java

Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java)

Semantic Query Parsing: "machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java

Semantically Expanded Query: ("machine learning"^10 OR "data scientist" OR "data mining" OR "computer vision") AND ("research and development"^10 OR "r&d") AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo}) AND ("software engineer"^10 OR "software developer") AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)

Page 40: Semantic & Multilingual Strategies in Lucene/Solr

Semantic Search Architecture – Query Parsing

1) Generate Model of Domain-specific phrases

• Can mine query logs or actual text of documents for significant phrases within your domain [1]

2) Feed known phrases to SolrTextTagger [2] (uses Lucene FST for high-throughput term lookups)

3) Use SolrTextTagger to perform entity extraction on incoming queries (optionally, documents can be tagged as well)

4) Shown on next slide: pass extracted entities to a Query Augmentation phase to rewrite the query with enhanced semantic understanding (synonyms, related keywords, related categories, etc.)

[1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.

[2] https://github.com/OpenSextant/SolrTextTagger
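As a rough illustration of step 3, the sketch below tags known phrases in a query with a greedy longest-match scan over a Python set. It is a stand-in for the idea only: SolrTextTagger uses a Lucene FST rather than this loop, and the phrase list, function name, and `max_len` parameter are assumptions made for the example.

```python
# Hypothetical phrase model (in practice mined from query logs or documents)
KNOWN_PHRASES = {
    "machine learning",
    "research and development",
    "portland, or",
    "software engineer",
}

def tag_phrases(query, known_phrases, max_len=4):
    """Greedy longest-match tagging: quote known multi-word phrases."""
    tokens = query.split()
    tagged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n > 1 and candidate.lower() in known_phrases:
                tagged.append('"%s"' % candidate)
                i += n
                break
        else:
            # No phrase starts here; keep the single token as-is
            tagged.append(tokens[i])
            i += 1
    return tagged

query = ("machine learning research and development Portland, OR "
         "software engineer AND hadoop java")
tagged = tag_phrases(query, KNOWN_PHRASES)
# Drop the user's explicit operator and AND all recognized units together
parsed = " AND ".join(t for t in tagged if t != "AND")
print(parsed)
```

Because "Portland, OR" is recognized as a phrase before any operator handling, the "OR" in it is no longer mistaken for a boolean operator.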

Page 41: Semantic & Multilingual Strategies in Lucene/Solr

Semantic Search Architecture – Query Augmentation

Keywords: machine learning

Search Behavior, Application Behavior, etc.

Job Title Classifier, Skills Extractor, Job Level Classifier, etc.

Clustering relationships

Semantic Query Augmentation

Known keyword phrases: java developer, machine learning, registered nurse

Related Phrases:

machine learning: { data mining .9, matlab .8, data scientist .75, artificial intelligence .7, neural networks .55 }

Common Job Titles:

machine learning: { software engineer .65, data manager .3, data scientist .25, hadoop engineer .2 }

Related Occupations:

machine learning: {
15-1031.00 .58 Computer Software Engineers, Applications
15-1011.00 .55 Computer and Information Scientists, Research
15-1032.00 .52 Computer Software Engineers, Systems Software }

Modified Query:

keywords:((machine learning)^10 OR { AT_LEAST_2: ("data mining"^0.9, matlab^0.8, "data scientist"^0.75, "artificial intelligence"^0.7, "neural networks"^0.55) }) { BOOST_TO_TOP: (job_title:("software engineer" OR "data manager" OR "data scientist" OR "hadoop engineer")) }
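A simplified Python sketch of the augmentation step, using the related-phrase weights from this slide. It emits plain Lucene boost syntax (term^weight joined with OR) rather than the AT_LEAST_2 and BOOST_TO_TOP pseudo-syntax shown above, and the `min_weight` cutoff is a hypothetical parameter added for illustration.

```python
# Related-phrase weights as listed on the slide
RELATED_PHRASES = {
    "machine learning": [
        ("data mining", 0.9),
        ("matlab", 0.8),
        ("data scientist", 0.75),
        ("artificial intelligence", 0.7),
        ("neural networks", 0.55),
    ],
}

def quote(term):
    """Quote multi-word terms so they are searched as phrases."""
    return '"%s"' % term if " " in term else term

def augment(phrase, related, original_boost=10, min_weight=0.6):
    """Expand a phrase into a boosted OR clause of related terms.

    min_weight is an illustrative cutoff: weaker relationships are dropped.
    """
    clauses = ["%s^%d" % (quote(phrase), original_boost)]
    for term, weight in related.get(phrase, []):
        if weight >= min_weight:
            clauses.append("%s^%s" % (quote(term), weight))
    return "(" + " OR ".join(clauses) + ")"

expanded = augment("machine learning", RELATED_PHRASES)
print(expanded)
```

Keeping the original phrase at a much higher boost than its expansions means exact matches still rank first, while related documents fill in behind them.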

Page 42: Semantic & Multilingual Strategies in Lucene/Solr

Differentiating related terms

Synonyms:

cpa => certified public accountant
rn => registered nurse
r.n. => registered nurse

Ambiguous Terms*:

driver => driver (trucking) ~80%
driver => driver (software) ~20%

Related Terms:

r.n. => nursing, bsn
hadoop => mapreduce, hive, pig

*differentiated based upon user and query context

Page 43: Semantic & Multilingual Strategies in Lucene/Solr

Semantic Search “under the hood”

Page 44: Semantic & Multilingual Strategies in Lucene/Solr

2014 Publications & Presentations

Books:

● Solr in Action - A comprehensive guide to implementing scalable search using Apache Solr

Research papers:

● Towards a Job title Classification System

● Augmenting Recommendation Systems Using a Model of Semantically-related Terms Extracted from User Behavior

● sCooL: A system for academic institution name normalization

● Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific jargon

● PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems

● SKILL: A System for Skill Identification and Normalization

Speaking Engagements:

● WSDM 2014 Workshop: “Web-Scale Classification: Classifying Big Data from the Web”

● Atlanta Solr Meetup

● Atlanta Big Data Meetup

● The Second International Symposium on Big Data and Data Analytics

● Lucene/Solr Revolution 2014

● RecSys 2014

● IEEE Big Data Conference 2014

Page 45: Semantic & Multilingual Strategies in Lucene/Solr

Conclusion

• Language analysis options for each language are very configurable

• There are multiple strategies for handling multilingual content based upon your use case

• When in doubt, automatic language detection can be easily leveraged in your indexing pipeline

• The next generation of query/relevancy improvements will be able to understand the intent of the user.

Page 46: Semantic & Multilingual Strategies in Lucene/Solr

Contact Info

Yes, WE ARE HIRING @CareerBuilder. Come talk with me if you are interested…

Trey Grainger

@treygrainger

http://solrinaction.com

Conference discount (43% off): lusorevcftw

Other presentations: http://www.treygrainger.com