Hadoop and Hive at Orbitz, Hadoop World 2010

TRANSCRIPT

Page 1: Hadoop and Hive at Orbitz, Hadoop World 2010

Hadoop and Hive at Orbitz

Jonathan Seidman and Ramesh Venkataramaiah

Hadoop World 2010

Page 2: Hadoop and Hive at Orbitz, Hadoop World 2010

Agenda

•  Orbitz Worldwide

•  The challenge of big data at Orbitz

•  Hadoop as a solution to the data challenge

•  Applications of Hadoop and Hive at Orbitz – improving hotel sort

•  Sample analysis and data trends

•  Other uses of Hadoop and Hive at Orbitz

•  Lessons learned and conclusion


Page 3: Hadoop and Hive at Orbitz, Hadoop World 2010


Launched: 2001, Chicago, IL

Page 4: Hadoop and Hive at Orbitz, Hadoop World 2010


Orbitz… “…poster children for Hadoop”

Page 5: Hadoop and Hive at Orbitz, Hadoop World 2010

Data Challenges at Orbitz

On Orbitz alone we do millions of searches and transactions daily, which leads to hundreds of gigabytes of log data every day.

So how do we store and process all of this data?


Page 6: Hadoop and Hive at Orbitz, Hadoop World 2010


[Chart: $ per managed TB for conventional storage, labeled “utterly redonkulous amounts of money”]

Page 7: Hadoop and Hive at Orbitz, Hadoop World 2010


[Chart: $ per managed TB, comparing “utterly redonkulous amounts of money” with “more reasonable amounts of money”]

Page 8: Hadoop and Hive at Orbitz, Hadoop World 2010

•  Adding data to our data warehouse also requires a lengthy plan/implement/deploy cycle.

•  Because of the expense and time, our data teams need to be very judicious about which data gets added. This means that potentially valuable data may not be saved.

•  We needed a solution that would allow us to economically store and process the growing volumes of data we collect.


Page 9: Hadoop and Hive at Orbitz, Hadoop World 2010


Hadoop brings our cost per TB down to $1500 (or even less)

Page 10: Hadoop and Hive at Orbitz, Hadoop World 2010

•  Important to note that Hadoop is not a replacement for a data warehouse, but rather a complement to it.

•  That said, Hadoop offers benefits beyond just cost.


Page 11: Hadoop and Hive at Orbitz, Hadoop World 2010


Page 12: Hadoop and Hive at Orbitz, Hadoop World 2010


How can we improve hotel ranking?

Hey! Let’s use machine learning!

All the cool kids are doing it!

Page 13: Hadoop and Hive at Orbitz, Hadoop World 2010

Requires data – lots of data

•  Web analytics software providing session data about user behavior.

•  Unfortunately the specific data fields we needed weren’t loaded into our data warehouse, and to make things worse, the only available archive of raw logs went back just a few days.

•  We decided to turn to Hadoop to provide a long-term archive for these logs.

•  Storing raw data in HDFS provides access to data not available elsewhere, for example “hotel impression” data: 115004,1,70.00;35217,2,129.00;239756,3,99.00;83389,4,99.00
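The slide doesn’t spell out the field semantics, but assuming each semicolon-separated entry is hotel_id,position,price, a minimal Python parsing sketch might look like this:

    from typing import List, NamedTuple

    class Impression(NamedTuple):
        hotel_id: str   # assumed meaning of the first field
        position: int   # assumed display position in the results list
        price: float    # assumed nightly price

    def parse_impressions(record: str) -> List[Impression]:
        """Parse one impression record like '115004,1,70.00;35217,2,129.00'."""
        impressions = []
        for entry in record.strip().split(";"):
            hotel_id, position, price = entry.split(",")
            impressions.append(Impression(hotel_id, int(position), float(price)))
        return impressions

For example, parse_impressions("115004,1,70.00;35217,2,129.00") yields two Impression tuples, one per hotel shown.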


Page 14: Hadoop and Hive at Orbitz, Hadoop World 2010

Now we need to process the data…

•  Extract data from raw Webtrends logs for input to a trained classification process.

•  Logs provide input to MapReduce processing, which extracts the required fields (sketched after this list).

•  Previous process used a series of Perl and Bash scripts to extract data serially.

•  Comparison of performance

– A month’s worth of data

– Manual process took 109m14s

– MapReduce process took 25m58s
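As a rough sketch of what the MapReduce side could look like with Hadoop Streaming (the log layout, delimiter, and field positions below are assumptions, not the actual Orbitz job):

    #!/usr/bin/env python
    # Hypothetical Hadoop Streaming mapper: emit only the fields the
    # classification process needs from each tab-delimited log line.
    import sys

    WANTED = [0, 3, 7]  # assumed positions of the needed fields

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(WANTED):
            continue  # skip truncated lines rather than failing the task
        print("\t".join(fields[i] for i in WANTED))

Launched with something like hadoop jar hadoop-streaming.jar -input raw_logs/ -output extracted/ -mapper extract_fields.py -file extract_fields.py (paths hypothetical), the input splits are processed in parallel, which is where the roughly 4x speedup over the serial Perl/Bash scripts comes from.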


Page 15: Hadoop and Hive at Orbitz, Hadoop World 2010

Processing Flow – Step 1


Page 16: Hadoop and Hive at Orbitz, Hadoop World 2010

Processing Flow – Step 2


Page 17: Hadoop and Hive at Orbitz, Hadoop World 2010

Processing Flow – Step 3


Page 18: Hadoop and Hive at Orbitz, Hadoop World 2010

Processing Flow – Step 4


Page 19: Hadoop and Hive at Orbitz, Hadoop World 2010

Processing Flow – Step 5


Page 20: Hadoop and Hive at Orbitz, Hadoop World 2010

Processing Flow – Step 6


Page 21: Hadoop and Hive at Orbitz, Hadoop World 2010

Once data is in Hive…

•  Provides input data to machine learning processes.

•  Used to create data exports for further analysis with R scripts, allowing us to derive more complex statistics and visualizations of our data (a minimal export sketch follows this list).

•  Provides useful metrics, many of which were unavailable with our existing data stores.

•  Used for aggregating data for import into our data warehouse for creation of new data cubes, providing analysts access to data unavailable in existing data cubes.
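As a sketch of the export step, assuming a hypothetical hotel_impressions table (only the standard hive -e CLI usage is real):

    import subprocess

    # Hypothetical aggregate: impression counts by display position,
    # written as tab-separated rows for downstream analysis in R.
    query = """
    SELECT position, COUNT(*) AS impressions
    FROM hotel_impressions
    GROUP BY position
    """

    with open("impressions_by_position.tsv", "w") as out:
        subprocess.run(["hive", "-e", query], stdout=out, check=True)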


Page 22: Hadoop and Hive at Orbitz, Hadoop World 2010

Statistical Analysis: Infrastructure and Dataset


•  Hive + R platform for query processing and statistical analysis.

•  R – open-source statistics package with visualization.

•  Hive Dataset:

– Customer hotel bookings on our sites and user ratings of hotels.

•  Investigation:

– Is there built-in data bias? Any lurking variables?

– What approximations and biases exist?

– Are variables pair-wise correlated? (see the sketch after this list)

– Are there macro patterns?
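A sketch of the pair-wise correlation check, assuming the Hive output has been exported to a local file with hypothetical numeric columns such as price, user_rating, and length_of_stay:

    import pandas as pd

    df = pd.read_csv("bookings_sample.tsv", sep="\t")  # hypothetical export
    # Pearson correlations across all numeric columns; large off-diagonal
    # values flag candidate relationships (or lurking variables) to inspect.
    print(df.corr(numeric_only=True))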

Page 23: Hadoop and Hive at Orbitz, Hadoop World 2010

Statistical Analysis - Positional Bias


•  The lurking variable is… positional bias.

•  Top positions are invariably picked the most.

•  Aim to position the best-ranked hotels at the top based on customer search criteria and user ratings.
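Positional bias of this kind can be quantified with a simple group-by; the file and column names here are assumptions:

    import pandas as pd

    df = pd.read_csv("impressions_with_bookings.tsv", sep="\t")
    # Fraction of impressions at each display position that led to a
    # booking; a strong skew toward position 1 indicates positional bias.
    print(df.groupby("position")["booked"].mean())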

Page 24: Hadoop and Hive at Orbitz, Hadoop World 2010

Statistical Analysis - Kernel Density


•  User ratings of hotels.

•  Histograms are strongly affected by the number of bins used.

•  Kernel density plots are usually a much more effective way to overcome this limitation.
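A small sketch of the comparison, assuming user ratings have been exported to a plain text file with one rating per line:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import gaussian_kde

    ratings = np.loadtxt("user_ratings.txt")  # hypothetical export
    xs = np.linspace(ratings.min(), ratings.max(), 200)

    # The histogram's shape shifts with the bin count; the kernel density
    # estimate gives a smooth view that avoids that sensitivity.
    plt.hist(ratings, bins=10, density=True, alpha=0.4, label="histogram (10 bins)")
    plt.plot(xs, gaussian_kde(ratings)(xs), label="kernel density")
    plt.legend()
    plt.savefig("ratings_density.png")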

Page 25: Hadoop and Hive at Orbitz, Hadoop World 2010

Statistical Analysis - Exploratory correlation


Page 26: Hadoop and Hive at Orbitz, Hadoop World 2010

Statistical Analysis - More seasonal variations


•  Customer hotel stays get longer during the summer months.

•  This could help in designing season-aware search.

•  Outliers removed.
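One way to produce such a view, assuming hypothetical checkin and stay_nights columns and a standard 1.5×IQR outlier filter:

    import pandas as pd

    df = pd.read_csv("bookings.tsv", sep="\t", parse_dates=["checkin"])
    q1, q3 = df["stay_nights"].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Drop outliers outside the usual 1.5 * IQR whiskers, then look at
    # median stay length by check-in month.
    trimmed = df[df["stay_nights"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    print(trimmed.groupby(trimmed["checkin"].dt.month)["stay_nights"].median())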

Page 27: Hadoop and Hive at Orbitz, Hadoop World 2010

Analysis: takeaways…


•  The cost of cleaning and processing data is significant.

•  Beware the tendency to create stories out of noise.

•  “The median is not the message”; find macro patterns first.

•  If the data originated from a website, watch for hidden bias in data collection.

Page 28: Hadoop and Hive at Orbitz, Hadoop World 2010

Lessons Learned

•  Make sure you’re using the appropriate tool – avoid the temptation to start throwing all of your data into Hadoop when a relational store may be a better choice.

•  Expect the unexpected in your data. When processing billions of records it’s inevitable that you’ll encounter at least one bad record that will blow up your processing (a defensive-parsing sketch follows this list).

•  To get buy-in from upper management, present a long-term, unstructured data growth story and explain how this will help harness long-tail opportunities.
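A minimal sketch of defensive record handling in a streaming job; the three-field record format is hypothetical, while the reporter:counter stderr convention is standard Hadoop Streaming:

    import sys

    bad = 0
    for line in sys.stdin:
        try:
            user_id, url, timestamp = line.rstrip("\n").split("\t")
        except ValueError:  # wrong field count: count it, don't crash
            bad += 1
            continue
        print(f"{user_id}\t{url}")

    # Hadoop Streaming picks up this stderr line as a counter increment.
    sys.stderr.write(f"reporter:counter:quality,bad_records,{bad}\n")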


Page 29: Hadoop and Hive at Orbitz, Hadoop World 2010

Lessons Learned (continued)

•  Hadoop’s limited security model creates challenges when trying to deploy Hadoop in the enterprise.

•  Configuration currently seems to be a black art. It can be difficult to understand which parameters to set and how to determine an optimal configuration.

•  Watch your memory use. Sloppy programming practices will bite you when your code needs to process large volumes of data.
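For example, aggregates should be computed in a streaming fashion rather than by materializing every value; both of these compute a mean, but only the first survives billions of records:

    def mean_streaming(values):
        # Constant memory: keep only the running total and count.
        total, n = 0.0, 0
        for v in values:
            total += float(v)
            n += 1
        return total / n if n else 0.0

    def mean_materialized(values):
        # O(n) memory: builds the full list first and will exhaust the
        # heap on large inputs.
        data = [float(v) for v in values]
        return sum(data) / len(data)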


Page 30: Hadoop and Hive at Orbitz, Hadoop World 2010

Hadoop is a virus…


Page 31: Hadoop and Hive at Orbitz, Hadoop World 2010

Just a few more examples of how Hadoop is being used at Orbitz…

•  Measuring page download performance: using web analytics logs as input, a set of MapReduce scripts are used to derive detailed client-side performance metrics which allow us to track trends in page download times (a small sketch follows this list).

•  Searching production logs: an effort is underway to utilize Hadoop to store and process our large volume of production logs, allowing developers and analysts to perform tasks such as troubleshooting production issues.

•  Cache analysis: extraction and aggregation of data to provide input to analyses intended to improve the performance of data caches utilized by our web sites.
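As a sketch of the first of these, a hypothetical mapper/reducer pair that averages client-side download time per page (the two-field log layout is an assumption):

    import sys
    from itertools import groupby

    def mapper(lines):
        # Emit (page, download_ms) from each tab-delimited log line.
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                yield fields[0], float(fields[1])

    def reducer(pairs):
        # Average download time per page. Input must arrive sorted by
        # key, as Hadoop guarantees between the map and reduce phases.
        for page, group in groupby(pairs, key=lambda kv: kv[0]):
            times = [ms for _, ms in group]
            yield page, sum(times) / len(times)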


Page 32: Hadoop and Hive at Orbitz, Hadoop World 2010

Applications of Hadoop at Orbitz are just beginning…

•  We’re in the process of quadrupling the capacity of our production cluster.

•  Multiple teams are working on new applications of Hadoop.

•  We continue to explore the use of associated tools – HBase, Pig, Flume, etc.


Page 33: Hadoop and Hive at Orbitz, Hadoop World 2010

References

•  Hadoop project: http://hadoop.apache.org/

•  Hive project: http://hadoop.apache.org/hive/

•  Hive – A Petabyte Scale Data Warehouse Using Hadoop: http://i.stanford.edu/~ragho/hive-icde2010.pdf

•  Hadoop: The Definitive Guide, Tom White, O’Reilly, 2009

•  Why Model?, J. Epstein, 2008

•  Beautiful Data, T. Segaran & J. Hammerbacher, 2009

•  Karmasphere Developer Study: http://www.karmasphere.com/images/documents/Karmasphere-HadoopDeveloperResearch.pdf


Page 34: Hadoop and Hive at Orbitz, Hadoop World 2010

Contact

•  Jonathan Seidman:

–  [email protected]

– @jseidman

– Chicago area Hadoop User Group: http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/

•  Ramesh Venkataramaiah:

–  [email protected]
