Evolving the Optimal Relevancy Ranking Model at Dice.com
TRANSCRIPT
October 11-14, 2016 • Boston, MA
Evolving The Optimal Relevancy Scoring Model at Dice.com
Simon Hughes
Chief Data Scientist, Dice.com
• Chief Data Scientist at Dice.com and DHI, under Yuri Bykov
• Dice.com – leading US job board for IT professionals
• Twitter handle: https://twitter.com/hughes_meister
Who Am I?
• Dice Skills pages - http://www.dice.com/skills
• New Dice Careers Mobile App
Key Projects
• PhD candidate at DePaul University, studying NLP and machine learning
• Thesis topic – Detecting causality in scientific explanatory essays
PhD
• Look under https://github.com/DiceTechJobs
• Set of Solr plugins: https://github.com/DiceTechJobs/SolrPlugins
• Tutorial for this talk: https://github.com/DiceTechJobs/RelevancyTuning
Open Source GitHub Repositories
1. Approaches to Relevancy Tuning
2. Automated Relevancy Tuning – using Reinforcement Learning
3. Feedback Loops - Dangers of Closed Loop Learning Systems
Overview
• Last year I talked about conceptual search and how that could be used to improve recall
• This year I want to focus on techniques to improve precision
• Novelty
Motivations for Talk
Finding the Optimal Search Engine Configuration
• Most companies initially approach this with a very ad hoc and manual process:
• Follow ‘best practices’ and make some initial educated guesses as to the best settings
• Manually tune the parameters on a number of key user queries
• The search engine parameters should be tuned to reflect how your users search
• Relevancy is a hard-to-define concept, but it is whatever your users consider an optimal search experience, so it should be informed by their search behavior
Relevancy Tuning
What Solr Configuration Options Influence Relevancy?
Solr and Lucene provide many configuration options that impact search relevancy, including:
• Which query parser – dismax, edismax, the Lucene parser, etc.
• Field boosts – qf parameter
• Phrase boosts – pf, pf2, pf3 parameters
• Minimum should match – mm parameter
• Similarity class – default similarity, BM25, TF-IDF, custom, or one of many others
• Boost queries – boost, bf, bq, etc.
• Edismax tie parameter – recommended value ≈ 0.1
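To make these parameters concrete, here is a minimal sketch of an edismax request built in Python. The field names (title, skills, description) and the boost values are purely illustrative, not Dice's actual configuration:

```python
from urllib.parse import urlencode

# Hypothetical edismax configuration for a job-search index.
# Field names and boost weights here are made up for illustration.
params = {
    "defType": "edismax",
    "q": "java developer",
    "qf": "title^3 skills^2 description^1",  # field boosts
    "pf": "title^5",                         # whole-query phrase boost
    "pf2": "skills^3",                       # bigram phrase boost
    "mm": "2<75%",                           # minimum-should-match rule
    "tie": "0.1",                            # recommended tie-breaker value
    "bf": "recip(ms(NOW,post_date),3.16e-11,1,1)",  # freshness boost function
}
query_string = urlencode(params)
print(query_string)
```

Every one of these keys is a candidate dimension for the automated tuning discussed later in the talk.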
• To tune your search parameters, you can gather a dataset of relevancy judgements
• For a set of important queries, the dataset contains the top results returned, each annotated for relevancy
• This dataset can be collected using domain experts and a user interface designed for this task
• Commercial Examples:
• Quepid – developed by OpenSource Connections
• Fusion UI Relevancy Workbench – part of the Fusion offering from Lucidworks
The ‘Golden’ Test Collection
• An alternative to manually collecting relevancy judgements is to collect them directly from your users
• For each user search on the site, capture:
• User’s query, and timestamp
• Any filters applied
• Result impressions and clicks
• You can then turn this into a test collection by assuming that the results that people click on are more relevant than those they don’t
• The time spent on the results page is also a strong indicator of how relevant that result was to the original search
Search Log Capture
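A minimal sketch of turning captured logs into graded judgements, assuming each log entry is a (query, doc_id, clicked, dwell_seconds) tuple. The grading thresholds and data below are illustrative only:

```python
from collections import defaultdict

# Made-up click log: (query, doc_id, clicked, dwell_seconds)
log = [
    ("java developer", "job1", True, 120),
    ("java developer", "job2", True, 5),
    ("java developer", "job3", False, 0),
    ("java developer", "job1", True, 90),
]

judgements = defaultdict(int)
for query, doc_id, clicked, dwell in log:
    if clicked and dwell >= 30:          # long dwell: likely relevant
        judgements[(query, doc_id)] += 2
    elif clicked:                        # quick bounce: weak signal
        judgements[(query, doc_id)] += 1

print(dict(judgements))
# job1 accumulates the strongest relevance signal
```

Aggregating over many sessions like this smooths out individual noisy clicks.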
• Now that you have a test collection, you can use it to tune your search engine configuration
• You can measure the relevancy of a set of searches against that collection using standard IR metrics, such as:
• MAP (Mean Average Precision)
• Precision at K (precision computed over the top k documents retrieved)
• NDCG (Normalized Discounted Cumulative Gain)
• Regression testing – build a set of regression tests to ensure configuration changes improve relevancy without breaking certain queries
• Manually tuning search configurations is still a time-consuming and inefficient process
• Is there a better way?
Relevancy Tuning with a Test Collection
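Two of the metrics above can be computed in a few lines; MAP is then simply the mean of average precision across the query set. A minimal, self-contained sketch with toy data:

```python
def precision_at_k(ranked, relevant, k):
    """Precision over the top k retrieved documents."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Average of precision@i at each rank i where a relevant doc appears."""
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

# Toy example: 4 results returned, 2 judged relevant
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, 2))   # 0.5
print(average_precision(ranked, relevant))   # (1/1 + 2/3) / 2 = 0.8333...
```

NDCG follows the same pattern but discounts gains logarithmically by rank, which rewards putting the most relevant results highest.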
1. Supervised machine learning?
• No – you cannot optimize your search configuration this way without a computable gradient
2. Grid search?
• Perform a brute-force search over the range of possible configuration parameters
• Very slow and inefficient – not able to learn which ranges of settings work best
3. Black box optimization algorithms?
• Optimization algorithms exist that attempt to find the optimum value of an unknown function in as few iterations as possible
• Perform a much smarter search of the parameter space than grid search
Automated Relevancy Tuning Approaches
• Use an optimization algorithm to optimize a ‘black box’ function
• Black box function – provide the optimization algorithm with a function that takes a set of parameters as inputs and computes a score
• The black box algorithm will then try to choose parameter settings that optimize the score
• This can be thought of as a form of reinforcement learning
• These algorithms will intelligently search the space of possible search configurations to arrive at a solution
• Example algorithms include Bayesian Optimization, Simulated Annealing, and Genetic Algorithms (hence talk title)
Black Box Optimization Algorithms
Example Black Box Function for Search Relevancy
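The shape of such a function is simple: it takes candidate parameters in and returns a relevancy score out. Below is a toy, self-contained version where the "search engine" is a weighted field-match scorer and the metric is precision@1; in a real setup the ranking step would be a query against Solr with the candidate weights passed as qf boosts. All data is made up for illustration:

```python
# Toy judged collection: documents with token sets per field,
# and a query -> relevant-doc-ids mapping.
docs = {
    "job1": {"title": {"java"}, "skills": {"spring", "java"}},
    "job2": {"title": {"python"}, "skills": {"java"}},
}
judgements = {"java": {"job1"}}

def score(doc, query_terms, title_boost, skills_boost):
    """Weighted count of query terms matching each field."""
    return (title_boost * len(doc["title"] & query_terms)
            + skills_boost * len(doc["skills"] & query_terms))

def black_box(title_boost, skills_boost):
    """Rank the judged docs with the candidate boosts; return mean precision@1."""
    hits = 0
    for query, relevant in judgements.items():
        terms = set(query.split())
        ranked = sorted(
            docs,
            key=lambda d: score(docs[d], terms, title_boost, skills_boost),
            reverse=True,
        )
        hits += 1 if ranked[0] in relevant else 0
    return hits / len(judgements)

print(black_box(3.0, 1.0))  # boosting title matches ranks job1 first -> 1.0
```

An optimizer only ever sees the inputs and the returned score, which is exactly what makes this a "black box".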
• There are some excellent, mature libraries for doing this sort of thing, e.g.:
• DEAP – Distributed Evolutionary Algorithms in Python (hence the talk title)
• Scikit-Optimize – general optimization library built by a team at CERN headed by Tim Head
• These libraries are very easy to use; however, getting them to optimize your search configuration is a little trickier
• They tend to work better when optimizing a small set of parameters at a time – 1 to 4 works well
• Achieved a 5% improvement in MAP@5 for our MLT (More Like This) configuration. A/B testing changes to search before EOY
Making it Work
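The loop these libraries automate can be illustrated without any dependency. The sketch below uses a plain random search over two boost weights against a made-up stand-in for the relevancy score (with a known optimum near title=3, skills=1); Scikit-Optimize's Bayesian optimization or DEAP's evolutionary algorithms do the same thing far more intelligently, choosing each new candidate from what previous evaluations revealed:

```python
import random

# Stand-in for the black-box relevancy function; in practice this would
# run your judged queries against the search engine and return e.g. MAP.
def relevancy(title_boost, skills_boost):
    return -((title_boost - 3.0) ** 2 + (skills_boost - 1.0) ** 2)

random.seed(42)
best_params, best_score = None, float("-inf")
for _ in range(500):  # a few hundred evaluations per parameter set
    candidate = (random.uniform(0, 5), random.uniform(0, 5))
    s = relevancy(*candidate)
    if s > best_score:
        best_params, best_score = candidate, s

print(best_params, best_score)  # lands near the optimum (3.0, 1.0)
```

Each evaluation here is cheap; against a real index each one is a full run of thousands of judged queries, which is why parallelizing and running overnight matters.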
• To optimize a large set of search parameters – start with the most important ones and optimize those while keeping the rest fixed
• If you are using search logs to optimize the search configuration, use a large number of searches (at least a few thousand) to ensure you are performing a robust enough test
• For most search collections of a reasonable size, running these optimizations over your search collection will take time – set it up on a server, parallelize where possible and leave running overnight
• Typically you will want to allow the algorithm to try a few hundred variations of each parameter set at least to find a good range of settings
• Ideally – first optimize your search configuration against a set of relevancy judgements acquired from domain experts, deploy to production, and use the search logs to further tune against your users' search behavior
Making it Work
• As with any machine learning problem, it is essential to use one dataset to learn from, and a second separate dataset to validate your results – prevents ‘overfitting’
• Overfitting in this context means the search parameters are so over-tuned to your initial dataset that the search engine performs worse on new data than with the current configuration
• Once you have an optimal set of configuration parameters that you are happy with, evaluate it on a second set of relevancy judgements to ensure the same performance gains are seen there also
• This applies to both manual and automatic tuning of the search engine configuration. Humans can overfit a dataset just as easily as an algorithm can
Use a Separate Testing Dataset to Validate Improvements
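A minimal sketch of the split, assuming `judged_queries` is a list of (query, judgements) pairs; the dummy entries and the 80/20 ratio are illustrative:

```python
import random

# Dummy judged queries standing in for a real test collection
judged_queries = [("q%d" % i, {}) for i in range(100)]

random.seed(1)
random.shuffle(judged_queries)           # avoid any ordering bias
split = int(len(judged_queries) * 0.8)
train, test = judged_queries[:split], judged_queries[split:]

print(len(train), len(test))  # 80 20
```

Tune (manually or automatically) against `train` only, and report the final numbers from `test`.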
• Auto-tune other Solr parameters – phrase slop, mm settings, the similarity class used
• You can evolve a more optimal ranking function:
• Either tweak the settings of the existing ranking functions (see SweetSpotSimilarityFactory class)
• Or use Genetic Programming to evolve a better ranking function for your dataset
• Genetic Programming is an evolutionary algorithm that can evolve programs and equations
• There are some relevant papers on this, including a good introductory paper (though not very recent)
Some Other Things to Try
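Full genetic programming evolves whole ranking expressions as trees; the much simpler sketch below evolves just two boost weights, but shows the same select / crossover / mutate cycle that DEAP automates. The fitness function is a made-up stand-in with a known optimum at (3, 1):

```python
import random

def fitness(individual):
    """Stand-in for relevancy on a judged collection; optimum at (3, 1)."""
    t, s = individual
    return -((t - 3.0) ** 2 + (s - 1.0) ** 2)

random.seed(7)
pop = [[random.uniform(0, 5), random.uniform(0, 5)] for _ in range(30)]

for generation in range(40):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:10]                  # selection: keep the fittest third
    children = []
    while len(children) < 20:
        a, b = random.sample(survivors, 2)
        child = [random.choice(pair) for pair in zip(a, b)]  # crossover
        if random.random() < 0.3:                            # mutation
            i = random.randrange(2)
            child[i] += random.gauss(0, 0.2)
        children.append(child)
    pop = survivors + children

best = max(pop, key=fitness)
print(best)  # converges close to (3.0, 1.0)
```

In genetic programming the individuals would be expression trees combining tf, idf, field length, and so on, rather than flat weight vectors, but the evolutionary loop is identical.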
• Building a Machine Learned Ranking system is a premature optimization if you haven’t first optimized your search configuration
• Relevancy tuning and MLR both primarily optimize for precision over recall, due to the nature of the training data
• For techniques to improve recall, see conceptual / semantic search:
• Simon Hughes - “Conceptual Search” (Revolution 2015)
• Trey Grainger - “Enhancing Relevancy Through Personalization and Semantic Search” (Revolution 2013)
• Doug Turnbull and John Berryman - Chapter 11 of Relevant Search
Things to Consider
Feedback Loops – Dangers of Closed Loop Learning Systems
[Diagram: users interact with the system → produce data → machine learning → model]
Building a Machine Learning System
1. Users interact with the system to produce data
2. Machine learning algorithms turn that data into a model
What happens if the model's predictions influence the users' behavior?
[Diagram: users interact with the system → produce data → machine learning → model → model changes user behavior]
Positive Feedback Loop
1. Users interact with the system to produce data
2. Machine learning algorithms turn that data into a model
3. The model changes user behavior, modifying its own future training data
1. Isolate a subset of data from being influenced by the model, use this data to train the system
• E.g. leave a small proportion of user searches un-ranked by the MLR model
• E.g. generate a subset of recommendations at random, or by using an unsupervised model
2. Use a reinforcement learning model instead (such as a multi-armed bandit) - the system will dynamically adapt to the users’ behavior, balancing exploring different hypotheses with exploiting what it’s learned to produce accurate predictions
Preventing Positive Feedback Loops
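A minimal epsilon-greedy multi-armed bandit illustrates option 2: each "arm" could be a candidate ranking configuration; with probability epsilon we explore a random arm, otherwise we exploit the arm with the best observed click-through rate. The true CTRs below are simulated, for illustration only:

```python
import random

true_ctr = [0.02, 0.05, 0.11]   # hidden payoff of each arm (unknown to the bandit)
clicks = [0, 0, 0]
shows = [0, 0, 0]
epsilon = 0.1                   # fraction of traffic reserved for exploration

random.seed(0)
for _ in range(20000):
    if random.random() < epsilon or 0 in shows:
        arm = random.randrange(3)                                 # explore
    else:
        arm = max(range(3), key=lambda a: clicks[a] / shows[a])   # exploit
    shows[arm] += 1
    clicks[arm] += 1 if random.random() < true_ctr[arm] else 0

best_arm = max(range(3), key=lambda a: clicks[a] / shows[a])
print(best_arm, [round(c / s, 3) for c, s in zip(clicks, shows)])
```

Because exploration never stops, the bandit keeps generating training data the model has not shaped, which is exactly what breaks the feedback loop.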
THE END
• Thank you for listening
• Any questions?