Real World Machine Learning at Orbitz, Strata 2011

Machine Learning at Orbitz Robert Lancaster and Jonathan Seidman Strata 2011 February 02 | 2011

Upload: jonathan-seidman

Post on 28-Nov-2014


DESCRIPTION

Slides from the "Real World Applications Panel: Machine Learning and Decision Support" at Strata 2011

TRANSCRIPT

Page 1: Real World Machine Learning at Orbitz, Strata 2011

Machine Learning at Orbitz

Robert Lancaster and Jonathan Seidman

Strata 2011

February 2, 2011

Page 2: Real World Machine Learning at Orbitz, Strata 2011


Launched: 2001, Chicago, IL

Page 3: Real World Machine Learning at Orbitz, Strata 2011

Why Start the Machine Learning Team at Orbitz?

•  The team was created in 2009 with the goal of applying machine learning techniques to improve the customer experience.

•  For example:

– Hotel sort optimization: How can we improve the ranking of hotel search results in order to show consumers hotels that more closely match their preferences?

– Cache optimization: Can we intelligently cache hotel rates in order to optimize the performance of hotel searches?

– Personalization/segmentation: Can we show targeted search results to specific consumer segments?


Page 4: Real World Machine Learning at Orbitz, Strata 2011

Data Challenges

•  The team immediately faced challenges getting access to data:

– The required analysis needs access to large amounts of data on user interaction with the site.

–  This data is available in web analytics logs, but required fields were not available in our data warehouse because of size considerations.

– Even worse, we had no archive of the data beyond several days.

– Size constraints aside, getting new data added to the data warehouse takes considerable time and effort.


Page 5: Real World Machine Learning at Orbitz, Strata 2011

New Data Infrastructure to Address These Challenges

•  Hadoop provides a solution to these challenges by:

– Providing long-term storage of the entire raw dataset without placing constraints on how that data is processed.

– Allowing us to immediately take advantage of new web analytics data added to the site.

– Providing a platform for efficient analysis of data, as well as preparation of data for input to external processes for further analysis.

•  Hive was added to the infrastructure to provide structure over the prepared data, facilitating ad-hoc queries and selection of specific data sets for analysis.

•  Data stored in Hive not only supports machine learning efforts, but also provides metrics to analysts not available through other sources.


Page 6: Real World Machine Learning at Orbitz, Strata 2011

New Data Infrastructure – Cont’d

•  Hadoop and Hive are now being used by the machine learning team to:

– Extract data from logs for hotel sort and cache optimization analyses.

– Distribute complex cross-validation and performance evaluation operations.

– Extract data for clustering.

•  Hadoop and Hive have also gained rapid adoption in the organization beyond the machine learning team: evaluating page download performance, searching production logs, keyword analysis, etc.
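The slides do not show how cross-validation was distributed. As a minimal sketch of why it distributes well, the Python below splits the data into k folds and scores each fold independently, so each fold could run as a separate task on a cluster. The function names and the placeholder "model" (predicting the mean training label) are assumptions for illustration, not Orbitz's implementation.

```python
from statistics import mean


def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k disjoint, nearly equal folds."""
    return [list(range(i, n_rows, k)) for i in range(k)]


def evaluate_fold(fold_id, folds, labels):
    """Train on every fold except fold_id, then score on fold_id.

    Placeholder model (an assumption): predict the mean training label.
    The score is the mean absolute error on the held-out fold.
    """
    held_out = set(folds[fold_id])
    train_labels = [y for i, y in enumerate(labels) if i not in held_out]
    prediction = mean(train_labels)
    return mean(abs(labels[i] - prediction) for i in folds[fold_id])


def cross_validate(labels, k=5):
    """Return the average score and the per-fold scores."""
    folds = k_fold_indices(len(labels), k)
    # Each call depends only on its own fold id, so the k evaluations
    # are independent units of work that could run in parallel.
    scores = [evaluate_fold(f, folds, labels) for f in range(k)]
    return mean(scores), scores
```

Because no fold needs the result of another, the loop in `cross_validate` maps naturally onto one task per fold.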


Page 7: Real World Machine Learning at Orbitz, Strata 2011

Use Case – Hotel Cache Optimization

Overview:

Search methodology:

•  Subset of total properties in a location (1 page at a time).

•  Get “just enough” information to present to consumers.

Caching:

•  Reduces impact on suppliers (maintain “look-to-book” ratio).

•  Reduces latency.

•  Increases “coverage.”

Optimization Goal:

Improve the customer experience (reduce latency, increase coverage) when searching for hotel rates while controlling impact on suppliers (maintain look-to-book).


Page 8: Real World Machine Learning at Orbitz, Strata 2011

Hotel Cache Optimization – Early Attempts

Early approaches were well-intentioned, but were not driven by analysis of the available data. For example:

Theory: High amount of thrashing leads to eviction of more useful cache entries.

Attempted Solution: Increase cache size.

Result: No increase in measured coverage.

Problem: No actual analysis on required cache size.

Theory: Locally managed inventory represents “free” information and can be requested without limit to improve coverage.

Attempted Solution: Don’t cache locally managed inventory. Increase the amount of local inventory requested with each user search.

Result: No increase in measured coverage.

Problem: Locally managed inventory doesn’t represent a large percentage of total inventory and is already highly preferenced.


Page 9: Real World Machine Learning at Orbitz, Strata 2011

Hotel Cache Optimization – Data Driven Approaches

Data Driven Approaches:

Traffic Partitioning: Identify the subset of traffic that is most efficient and optimize that subset through prefetching and increased bursting.

TTL Optimization: Use historic logs of availability and rate change information to predict volatility of hotel rates and optimize cache TTL.
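The slides do not describe the volatility model used. One simple way to illustrate TTL optimization is to estimate, from the timestamps at which a hotel's rate was observed to change, the average time between changes, and to cache for a fraction of that interval. Everything in this sketch (function names, the `safety_factor`, and the TTL bounds) is a hypothetical choice, not a value from the presentation.

```python
from datetime import datetime, timedelta


def mean_time_between_changes(change_times):
    """Average gap between consecutive observed rate changes, or None."""
    ordered = sorted(change_times)
    if len(ordered) < 2:
        return None
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)


def choose_ttl(change_times, safety_factor=0.5,
               min_ttl=timedelta(minutes=5), max_ttl=timedelta(hours=24)):
    """Pick a cache TTL from historic rate-change timestamps.

    Volatile rates (short gaps) get a short TTL; stable rates get a long one.
    All defaults are illustrative assumptions.
    """
    gap = mean_time_between_changes(change_times)
    if gap is None:
        # Too little history to estimate volatility: assumed conservative
        # choice of the shortest TTL.
        return min_ttl
    ttl = gap * safety_factor
    return max(min_ttl, min(max_ttl, ttl))


# A rate that changed every 4 hours is cached for 2 hours.
history = [datetime(2011, 1, 1, h) for h in (0, 4, 8)]
ttl = choose_ttl(history)
```

A production model would likely use more signal than the mean gap (market, lead time, day of week), but the shape of the decision is the same: predicted volatility in, TTL out.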


Page 10: Real World Machine Learning at Orbitz, Strata 2011

Hotel Cache Optimization – Traffic Distribution

[Chart: Queries, Searches, Reverse Running Total (Searches), and Reverse Running Total (Queries), plotted as percentages (0% to 100%) over x-axis values 1 to 20. Labeled values: 2.78%, 34.30%, 31.87%, 71.67%.]

72% of queries are singletons and make up nearly a third of total search volume.

A small number of queries (3%) make up more than a third of search volume.
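A distribution like the one summarized above can be computed from a log with one record per search. The sketch below assumes each search has already been reduced to a hashable query key (for example market, dates, and party size; the exact key is an assumption) and reports, for each repeat frequency, the share of distinct queries and the share of total searches.

```python
from collections import Counter


def traffic_distribution(searches):
    """Summarize how often queries repeat.

    searches: iterable of hashable query keys, one entry per search.
    Returns one row per frequency, in ascending order of frequency.
    """
    counts = Counter(searches)               # query -> number of searches
    total_queries = len(counts)
    total_searches = sum(counts.values())
    by_frequency = Counter(counts.values())  # frequency -> distinct queries
    rows = []
    for freq in sorted(by_frequency):
        n_queries = by_frequency[freq]
        rows.append({
            "frequency": freq,
            "pct_queries": 100.0 * n_queries / total_queries,
            "pct_searches": 100.0 * n_queries * freq / total_searches,
        })
    return rows


# Toy log: three singleton queries and one query searched three times.
rows = traffic_distribution(["a", "b", "c", "d", "d", "d"])
```

In the toy log, singletons are 75% of queries but only 50% of searches, the same kind of skew the slide reports for real traffic.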

Page 11: Real World Machine Learning at Orbitz, Strata 2011

Optimize Hotel Cache – Traffic Partitioning

Evaluate possible mechanisms for determining most frequent queries.

Favor mechanisms that give a high search/query ratio for the greatest percentage of search volume.

Test for stability of mechanism across multiple time periods.

Partition Strategy   Description                                             Pct Queries   Pct Searches   Searches/Query
Baseline             All traffic                                             100.00%       100.00%        2.19
Top 50               Top 50 searched markets                                 14.88%        26.76%         3.94
Heuristic            Top 50 searched markets, weekend stay within 1 month    0.87%         8.52%          21.4
Enumeration          Queries repeated 5 or more times                        3.45%         28.80%         18.29
Prediction           TBD                                                     TBD           TBD            TBD
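The three metric columns are related: a partition's searches/query equals the baseline ratio times its share of searches divided by its share of queries, e.g. 2.19 × 26.76 / 14.88 ≈ 3.94 for Top 50. The sketch below shows one way such metrics and the stability test could be computed, using the enumeration strategy as an example. The query set is built from one period and evaluated on another; the function names and toy data are assumptions for illustration.

```python
from collections import Counter


def frequent_queries(searches, min_count=5):
    """Enumeration strategy: queries seen at least min_count times."""
    counts = Counter(searches)
    return {q for q, n in counts.items() if n >= min_count}


def partition_metrics(searches, partition):
    """Return (pct_queries, pct_searches, searches_per_query) for the
    part of the traffic whose query key is in `partition`."""
    counts = Counter(searches)
    selected = [n for q, n in counts.items() if q in partition]
    pct_queries = 100.0 * len(selected) / len(counts)
    pct_searches = 100.0 * sum(selected) / sum(counts.values())
    per_query = sum(selected) / len(selected) if selected else 0.0
    return pct_queries, pct_searches, per_query


# Stability check on toy data: build the partition from period 1,
# then measure how well it still performs on period 2.
period1 = ["chi"] * 6 + ["nyc"] * 5 + ["lax"] * 2 + ["sfo"]
period2 = ["chi"] * 4 + ["nyc"] * 2 + ["lax"] * 3 + ["bos"]
partition = frequent_queries(period1)
metrics = partition_metrics(period2, partition)
```

A mechanism is stable if the metrics measured on the later period stay close to those measured on the period used to build the partition.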


Page 12: Real World Machine Learning at Orbitz, Strata 2011

Conclusions and Lessons Learned

•  Start with a manageable problem (ease of measuring success, availability of data, etc.)

•  Avoid thinking of the machine learning team as an R&D organization.

•  Instead, foster machine learning approaches throughout the organization:

– Embed resources on actual feature teams.

– Machine learning study groups, etc.
