Evolving the Optimal Relevancy Ranking Model at Dice.com
TRANSCRIPT
October 11-14, 2016 • Boston, MA
Evolving The Optimal Relevancy Scoring Model at Dice.com
Simon Hughes
Chief Data Scientist, Dice.com
• Chief Data Scientist at Dice.com and DHI, under Yuri Bykov
• Dice.com – leading US job board for IT professionals
• Twitter handle: https://twitter.com/hughes_meister
Who Am I?
• Dice Skills pages - http://www.dice.com/skills
• New Dice Careers Mobile App
Key Projects
• PhD candidate at DePaul University, studying NLP and machine learning
• Thesis topic – Detecting causality in scientific explanatory essays
PhD
• Look under https://github.com/DiceTechJobs
• Set of Solr plugins: https://github.com/DiceTechJobs/SolrPlugins
• Tutorial for this talk: https://github.com/DiceTechJobs/RelevancyTuning
Open Source GitHub Repositories
1. Approaches to Relevancy Tuning
2. Automated Relevancy Tuning – using Reinforcement Learning
3. Feedback Loops - Dangers of Closed Loop Learning Systems
Overview
• Last year I talked about conceptual search and how that could be used to improve recall
• This year I want to focus on techniques to improve precision
• Novelty
Motivations for Talk
Finding the Optimal Search Engine Configuration
• Most companies initially approach this with a very ad hoc and manual process:
• Follow ‘best practices’ and make some initial educated guesses as to the best settings
• Manually tune the parameters on a number of key user queries
• The search engine parameters should be tuned to reflect how your users search
• Relevancy is a hard-to-define concept, but it is whatever your users consider an optimal search experience, so it should be informed by their search behavior
Relevancy Tuning
What Solr Configuration Options Influence Relevancy?
Solr and Lucene provide many configuration options that impact search relevancy, including:
• Which query parser – dismax, edismax, the Lucene parser, etc.
• Field boosts – qf parameter
• Phrase boosts – pf, pf2, pf3 parameters
• Minimum should match – mm parameter
• Similarity class – default similarity, BM25, TF-IDF, custom, or one of many others
• Boost queries – boost, bf, bq, etc.
• Edismax tie parameter – recommended value ≈ 0.1
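To make these parameters concrete, here is a minimal sketch of an edismax request built in Python. The field names (title, skills, description) and the boost values are purely illustrative, not Dice's actual configuration:

```python
from urllib.parse import urlencode

# Hypothetical edismax configuration for a job-search index.
# Field names and boost weights here are made up for illustration.
params = {
    "defType": "edismax",
    "q": "java developer",
    "qf": "title^3 skills^2 description^1",  # field boosts
    "pf": "title^5",                         # whole-query phrase boost
    "pf2": "skills^3",                       # bigram phrase boost
    "mm": "2<75%",                           # minimum-should-match rule
    "tie": "0.1",                            # recommended tie-breaker value
    "bf": "recip(ms(NOW,post_date),3.16e-11,1,1)",  # freshness boost function
}
query_string = urlencode(params)
print(query_string)
```

Every one of these keys is a candidate dimension for the automated tuning discussed later in the talk.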
• To tune your search parameters, you can gather a dataset of relevancy judgements
• For a set of important queries, the dataset contains the top results returned, each annotated for relevancy
• This dataset can be collected using domain experts and a user interface designed for this task
• Commercial Examples:
• Quepid – developed by OpenSource Connections
• Fusion UI Relevancy Workbench – part of the Fusion offering from Lucidworks
The ‘Golden’ Test Collection
• An alternative to manually collecting relevancy judgements is to collect them directly from your users
• For each user search on the site, capture:
• User’s query, and timestamp
• Any filters applied
• Result impressions and clicks
• You can then turn this into a test collection by assuming that the results that people click on are more relevant than those they don’t
• The time spent on the results page is also a strong indicator of how relevant that result was to the original search
Search Log Capture
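A minimal sketch of turning captured logs into graded judgements, assuming each log entry is a (query, doc_id, clicked, dwell_seconds) tuple. The grading thresholds and data below are illustrative only:

```python
from collections import defaultdict

# Made-up click log: (query, doc_id, clicked, dwell_seconds)
log = [
    ("java developer", "job1", True, 120),
    ("java developer", "job2", True, 5),
    ("java developer", "job3", False, 0),
    ("java developer", "job1", True, 90),
]

judgements = defaultdict(int)
for query, doc_id, clicked, dwell in log:
    if clicked and dwell >= 30:          # long dwell: likely relevant
        judgements[(query, doc_id)] += 2
    elif clicked:                        # quick bounce: weak signal
        judgements[(query, doc_id)] += 1

print(dict(judgements))
# job1 accumulates the strongest relevance signal
```

Aggregating over many sessions like this smooths out individual noisy clicks.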
• Now that you have a test collection, you can use it to tune your search engine configuration
• You can measure the relevancy of a set of searches against that collection using standard IR metrics, such as:
• MAP (Mean Average Precision)
• Precision at K (precision computed over the top k documents retrieved)
• NDCG (Normalized Discounted Cumulative Gain)
• Regression testing – build a set of regression tests to ensure configuration changes improve relevancy without breaking certain queries
• Manually tuning search configurations is still a time-consuming and inefficient process
• Is there a better way?
Relevancy Tuning with a Test Collection
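Two of the metrics above can be computed in a few lines; MAP is then simply the mean of average precision across the query set. A minimal, self-contained sketch with toy data:

```python
def precision_at_k(ranked, relevant, k):
    """Precision over the top k retrieved documents."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Average of precision@i at each rank i where a relevant doc appears."""
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

# Toy example: 4 results returned, 2 judged relevant
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, 2))   # 0.5
print(average_precision(ranked, relevant))   # (1/1 + 2/3) / 2 = 0.8333...
```

NDCG follows the same pattern but discounts gains logarithmically by rank, which rewards putting the most relevant results highest.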
1. Supervised machine learning?
• No – you cannot optimize your search configuration this way without a computable gradient
2. Grid search?
• Perform a brute-force search over the range of possible configuration parameters
• Very slow and inefficient – not able to learn which ranges of settings work best
3. Black box optimization algorithms?
• Optimization algorithms exist that attempt to find the optimum value of an unknown function in as few iterations as possible
• Perform a much smarter search of the parameter space than grid search
Automated Relevancy Tuning Approaches
• Use an optimization algorithm to optimize a ‘black box’ function
• Black box function – provide the optimization algorithm with a function that takes a set of parameters as inputs and computes a score
• The black box algorithm will then try to choose parameter settings that optimize the score
• This can be thought of as a form of reinforcement learning
• These algorithms will intelligently search the space of possible search configurations to arrive at a solution
• Example algorithms include Bayesian Optimization, Simulated Annealing, and Genetic Algorithms (hence talk title)
Black Box Optimization Algorithms
Example Black Box Function for Search Relevancy
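The shape of such a function is simple: it takes candidate parameters in and returns a relevancy score out. Below is a toy, self-contained version where the "search engine" is a weighted field-match scorer and the metric is precision@1; in a real setup the ranking step would be a query against Solr with the candidate weights passed as qf boosts. All data is made up for illustration:

```python
# Toy judged collection: documents with token sets per field,
# and a query -> relevant-doc-ids mapping.
docs = {
    "job1": {"title": {"java"}, "skills": {"spring", "java"}},
    "job2": {"title": {"python"}, "skills": {"java"}},
}
judgements = {"java": {"job1"}}

def score(doc, query_terms, title_boost, skills_boost):
    """Weighted count of query terms matching each field."""
    return (title_boost * len(doc["title"] & query_terms)
            + skills_boost * len(doc["skills"] & query_terms))

def black_box(title_boost, skills_boost):
    """Rank the judged docs with the candidate boosts; return mean precision@1."""
    hits = 0
    for query, relevant in judgements.items():
        terms = set(query.split())
        ranked = sorted(
            docs,
            key=lambda d: score(docs[d], terms, title_boost, skills_boost),
            reverse=True,
        )
        hits += 1 if ranked[0] in relevant else 0
    return hits / len(judgements)

print(black_box(3.0, 1.0))  # boosting title matches ranks job1 first -> 1.0
```

An optimizer only ever sees the inputs and the returned score, which is exactly what makes this a "black box".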
• There are some excellent, mature libraries for doing this sort of thing, e.g.:
• DEAP – Distributed Evolutionary Algorithms in Python (hence the talk title)
• Scikit-Optimize – general optimization library built by a team at CERN headed by Tim Head
• These libraries are very easy to use; however, getting them to optimize your search configuration is a little trickier
• They tend to work better when optimizing a small set of parameters at a time – 1 to 4 works well
• Achieved a 5% improvement in MAP@5 for our MLT (More Like This) configuration. A/B testing changes to search before EOY
Making it Work
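The loop these libraries automate can be illustrated without any dependency. The sketch below uses a plain random search over two boost weights against a made-up stand-in for the relevancy score (with a known optimum near title=3, skills=1); Scikit-Optimize's Bayesian optimization or DEAP's evolutionary algorithms do the same thing far more intelligently, choosing each new candidate from what previous evaluations revealed:

```python
import random

# Stand-in for the black-box relevancy function; in practice this would
# run your judged queries against the search engine and return e.g. MAP.
def relevancy(title_boost, skills_boost):
    return -((title_boost - 3.0) ** 2 + (skills_boost - 1.0) ** 2)

random.seed(42)
best_params, best_score = None, float("-inf")
for _ in range(500):  # a few hundred evaluations per parameter set
    candidate = (random.uniform(0, 5), random.uniform(0, 5))
    s = relevancy(*candidate)
    if s > best_score:
        best_params, best_score = candidate, s

print(best_params, best_score)  # lands near the optimum (3.0, 1.0)
```

Each evaluation here is cheap; against a real index each one is a full run of thousands of judged queries, which is why parallelizing and running overnight matters.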
• To optimize a large set of search parameters – start with the most important ones and optimize those while keeping the rest fixed
• If you are using search logs to optimize the search configuration, use a large number of searches (at least a few thousand) to ensure you are performing a robust enough test
• For most search collections of a reasonable size, running these optimizations over your search collection will take time – set it up on a server, parallelize where possible and leave running overnight
• Typically you will want to allow the algorithm to try a few hundred variations of each parameter set at least to find a good range of settings
• Ideally – first optimize your search configuration against a set of relevancy judgements acquired from domain experts, deploy to production, and use the search logs to further tune against your users' search behavior
Making it Work
• As with any machine learning problem, it is essential to use one dataset to learn from, and a second separate dataset to validate your results – prevents ‘overfitting’
• Overfitting in this context means the search parameters are so over-tuned to your initial dataset that the search engine performs worse on new data than with the current configuration
• Once you have an optimal set of configuration parameters that you are happy with, evaluate it on a second set of relevancy judgements to ensure the same performance gains are seen there also
• This applies to both manual and automatic tuning of the search engine configuration. Humans can overfit a dataset just as easily as an algorithm can
Use a Separate Testing Dataset to Validate Improvements
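A minimal sketch of the split, assuming `judged_queries` is a list of (query, judgements) pairs; the dummy entries and the 80/20 ratio are illustrative:

```python
import random

# Dummy judged queries standing in for a real test collection
judged_queries = [("q%d" % i, {}) for i in range(100)]

random.seed(1)
random.shuffle(judged_queries)           # avoid any ordering bias
split = int(len(judged_queries) * 0.8)
train, test = judged_queries[:split], judged_queries[split:]

print(len(train), len(test))  # 80 20
```

Tune (manually or automatically) against `train` only, and report the final numbers from `test`.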
• Auto-tune other Solr parameters – phrase slop, mm settings, the similarity class used
• You can evolve a more optimal ranking function:
• Either tweak the settings of the existing ranking functions (see SweetSpotSimilarityFactory class)
• Or use Genetic Programming to evolve a better ranking function for your dataset
• Genetic Programming is an evolutionary algorithm that can evolve programs and equations
• There are some relevant papers on this, including a good introductory paper (though not very recent)
Some Other Things to Try
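Full genetic programming evolves whole ranking expressions as trees; the much simpler sketch below evolves just two boost weights, but shows the same select / crossover / mutate cycle that DEAP automates. The fitness function is a made-up stand-in with a known optimum at (3, 1):

```python
import random

def fitness(individual):
    """Stand-in for relevancy on a judged collection; optimum at (3, 1)."""
    t, s = individual
    return -((t - 3.0) ** 2 + (s - 1.0) ** 2)

random.seed(7)
pop = [[random.uniform(0, 5), random.uniform(0, 5)] for _ in range(30)]

for generation in range(40):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:10]                  # selection: keep the fittest third
    children = []
    while len(children) < 20:
        a, b = random.sample(survivors, 2)
        child = [random.choice(pair) for pair in zip(a, b)]  # crossover
        if random.random() < 0.3:                            # mutation
            i = random.randrange(2)
            child[i] += random.gauss(0, 0.2)
        children.append(child)
    pop = survivors + children

best = max(pop, key=fitness)
print(best)  # converges close to (3.0, 1.0)
```

In genetic programming the individuals would be expression trees combining tf, idf, field length, and so on, rather than flat weight vectors, but the evolutionary loop is identical.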
• Building a Machine Learned Ranking system is a premature optimization if you haven’t first optimized your search configuration
• Relevancy tuning and MLR both primarily optimize for precision over recall, due to the nature of the training data
• For techniques to improve recall, see conceptual / semantic search:
• Simon Hughes - “Conceptual Search” (Revolution 2015)
• Trey Grainger - “Enhancing Relevancy Through Personalization and Semantic Search” (Revolution 2013)
• Doug Turnbull and John Berryman - Chapter 11 of Relevant Search
Things to Consider
Feedback Loops – Dangers of Closed Loop Learning Systems
[Diagram: users interact with the system → produce data → machine learning → model]
Building a Machine Learning System
1. Users interact with the system to produce data
2. Machine learning algorithms turn that data into a model
What happens if the model's predictions influence the users' behavior?
[Diagram: users interact with the system → produce data → machine learning → model → model changes user behavior]
Positive Feedback Loop
1. Users interact with the system to produce data
2. Machine learning algorithms turn that data into a model
3. The model changes user behavior, modifying its own future training data
1. Isolate a subset of data from being influenced by the model, use this data to train the system
• E.g. leave a small proportion of user searches un-ranked by the MLR model
• E.g. generate a subset of recommendations at random, or by using an unsupervised model
2. Use a reinforcement learning model instead (such as a multi-armed bandit) - the system will dynamically adapt to the users’ behavior, balancing exploring different hypotheses with exploiting what it’s learned to produce accurate predictions
Preventing Positive Feedback Loops
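A minimal epsilon-greedy multi-armed bandit illustrates option 2: each "arm" could be a candidate ranking configuration; with probability epsilon we explore a random arm, otherwise we exploit the arm with the best observed click-through rate. The true CTRs below are simulated, for illustration only:

```python
import random

true_ctr = [0.02, 0.05, 0.11]   # hidden payoff of each arm (unknown to the bandit)
clicks = [0, 0, 0]
shows = [0, 0, 0]
epsilon = 0.1                   # fraction of traffic reserved for exploration

random.seed(0)
for _ in range(20000):
    if random.random() < epsilon or 0 in shows:
        arm = random.randrange(3)                                 # explore
    else:
        arm = max(range(3), key=lambda a: clicks[a] / shows[a])   # exploit
    shows[arm] += 1
    clicks[arm] += 1 if random.random() < true_ctr[arm] else 0

best_arm = max(range(3), key=lambda a: clicks[a] / shows[a])
print(best_arm, [round(c / s, 3) for c, s in zip(clicks, shows)])
```

Because exploration never stops, the bandit keeps generating training data the model has not shaped, which is exactly what breaks the feedback loop.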
THE END
• Thank you for listening
• Any questions?