HBaseCon 2012 | HBase, the Use Case in eBay Cassini

HBase, the Use Case in eBay Cassini. Thomas Pan, Principal Software Engineer, eBay Marketplaces

Upload: cloudera-inc

Post on 09-May-2015


DESCRIPTION

eBay Marketplaces has been working hard on its next-generation search infrastructure and software system, code-named Cassini. The new search engine processes over 250 million search queries and serves more than 2 billion page views each day. Its indexing platform is based on Apache Hadoop and Apache HBase. Apache HBase is a distributed persistence layer built on Hadoop that supports billions of updates per day. Its easy sharding, fast writes and table scans, very fast bulk data load, and natural integration with Hadoop are the cornerstones of successful continuous index builds. We will share the technical details with the audience, along with the difficulties and challenges that we have gone through and are still facing in the process.

TRANSCRIPT

Page 1: HBaseCon 2012 | HBase, the Use Case in eBay Cassini

HBase, the Use Case in eBay Cassini

Thomas Pan

Principal Software Engineer

eBay Marketplaces

Page 2:

eBay Marketplaces

97 million active buyers and sellers worldwide

200+ million items in more than 50,000 categories

2 billion page views each day

9 petabytes of data in our Hadoop and Teradata clusters

250 million queries each day to our search engine

Page 3:

Cassini: eBay’s new Search Engine

Entirely new codebase

World-class, from a world-class team

Platform for ranking innovation

Four major tracks, 100+ engineers

Likely launch in 2012

Page 4:

Indexing in Cassini

Index with more data and more history

More computationally expensive work at index-time (and less at query-time)

Ability to rescore and reclassify entire site inventory

The entire site inventory is stored in HBase

Indexes are built via MapReduce jobs and stored in HDFS

Build the entire site inventory in hours

Page 7:

HBase Table Data Import

Bulk Load:
  Batch processing on demand or every couple of hours
  Load a large amount of data quickly

PUT:
  Near real time updates
  Better for updating small amount of data
  Read after PUT for better random read performance

Page 8:

HBase Tables

3 major tables: active items, completed items and sellers

15TB data

3600 pre-split regions per table with auto-split disabled

3 column families with maximum 200 columns

Automatic major compaction disabled

RowKey is bit reversal of document id (unsigned 64-bit integer)
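
The row key and pre-split choices above can be illustrated with a minimal sketch. The slide does not say why the document id is bit-reversed; a common reason is that consecutive ids then land far apart in the key space, so writes spread across pre-split regions. Everything below beyond "bit reversal of an unsigned 64-bit id" and "3600 regions" is an assumption made for illustration: the 8-byte big-endian key encoding, the evenly spaced split points, and the helper names `reverse_bits64`, `row_key`, `split_points` and `region_index` are all hypothetical.

```python
import bisect
import struct
from typing import List

MASK64 = (1 << 64) - 1


def reverse_bits64(doc_id: int) -> int:
    """Reverse the bit order of an unsigned 64-bit integer."""
    if not 0 <= doc_id <= MASK64:
        raise ValueError("doc_id must fit in an unsigned 64-bit integer")
    # Render as exactly 64 binary digits, reverse the string, parse it back.
    return int(format(doc_id, "064b")[::-1], 2)


def row_key(doc_id: int) -> bytes:
    """Assumed encoding: 8 bytes, big-endian, so byte order matches numeric order."""
    return struct.pack(">Q", reverse_bits64(doc_id))


def split_points(num_regions: int) -> List[bytes]:
    """Assumed layout: num_regions - 1 evenly spaced boundaries over the key space."""
    step = (1 << 64) // num_regions
    return [struct.pack(">Q", i * step) for i in range(1, num_regions)]


def region_index(key: bytes, splits: List[bytes]) -> int:
    """Index of the region whose key range contains `key`."""
    return bisect.bisect_right(splits, key)


splits = split_points(3600)
for doc_id in range(1, 5):
    key = row_key(doc_id)
    print(doc_id, key.hex(), region_index(key, splits))
```

With this layout, sequential document ids map to widely separated regions instead of piling onto the last one, which is the usual motivation for a reversed or hashed key when auto-split is disabled.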

Page 9:

Indexing Job Pipeline

Full table scan

Run every couple of hours

Page 10:

Numbers

Data import:
  Bulk data import: 30 minutes for 500 million full rows
  Random write: ~200,000,000 rows per day
  1.2 TB data daily import

Scan performance:
  Scan speed: 2004 rows per second per region server (average version 3), 465 rows per second per region server (average version 10)
  Scan speed with filters: 325~353 rows per second per region server
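
To put the figures above on a common per-second scale, here is a small back-of-the-envelope calculation. It is only arithmetic on the slide's numbers and assumes decimal terabytes and load spread evenly over the day; real traffic is unlikely to be that uniform, and the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Bulk load: 500 million full rows in 30 minutes.
bulk_rows_per_sec = 500_000_000 / (30 * 60)

# Random writes: ~200 million rows per day, averaged over the whole day.
put_rows_per_sec = 200_000_000 / SECONDS_PER_DAY

# Daily import volume: 1.2 TB per day (assuming 1 TB = 10**12 bytes), in MB/s.
import_mb_per_sec = 1.2e12 / SECONDS_PER_DAY / 1e6

# Scan slowdown when the average version count goes from 3 to 10.
scan_slowdown = 2004 / 465

print(f"bulk load:     {bulk_rows_per_sec:,.0f} rows/s")
print(f"random write:  {put_rows_per_sec:,.0f} rows/s (daily average)")
print(f"daily import:  {import_mb_per_sec:.1f} MB/s (daily average)")
print(f"scan slowdown: {scan_slowdown:.1f}x")
```

On these averages, bulk load moves rows roughly two orders of magnitude faster than the sustained PUT rate, which is consistent with the slides using bulk load for large batches and PUT for small near-real-time updates.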

Page 11:

Operations

Monitoring: Ganglia, Nagios, OpenTSDB

Testing:
  Unit test and regression test
  HBaseTestingUtility for unit test
  Standalone HBase for regression test (mvn verify)
  Cluster level Fault Injection Tests [HBASE-4925]

Region balancer

Manual major compaction

Page 12:

Operations (Cont’d)

Disable swap

Substantially increase the file descriptor limit and xciever count

Metrics to watch for:
  jvm.DataNode.metrics.threadRunnable with netstat: connection leakage
  hbase.regionserver.compactionQueueSize: major/minor compactions
  dfs.datanode.blockReports_avg_time: data block reporting (for too many data blocks)
  network_report: network bandwidth usage (for data locality)
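
The first two host-level items on this slide (swap disabled, a high file descriptor limit) could be checked with a script along the following lines. This is a sketch rather than anything from the talk: it assumes a Linux host with `/proc/meminfo`, the threshold `REQUIRED_NOFILE` is a hypothetical value chosen for illustration, and the function names are made up. The slides do not state which limits eBay used.

```python
import re
import resource  # Unix-only standard library module

# Hypothetical threshold for illustration; the slides give no number.
REQUIRED_NOFILE = 32768


def swap_total_kb(meminfo_text: str) -> int:
    """Parse the SwapTotal value (in kB) from /proc/meminfo-style text."""
    match = re.search(r"^SwapTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("SwapTotal line not found")
    return int(match.group(1))


def nofile_ok(soft_limit: int, required: int = REQUIRED_NOFILE) -> bool:
    """True if the soft open-file limit is unlimited or at least `required`."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= required


def report() -> None:
    """Print the two checks for the current process on a Linux host."""
    with open("/proc/meminfo") as handle:
        swap_kb = swap_total_kb(handle.read())
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("swap disabled:", swap_kb == 0)
    print("file descriptor limit ok:", nofile_ok(soft))
```

Calling `report()` on a region server or DataNode host prints both checks. Note that it inspects the limits of the calling process, so it should be run as the same user and from the same environment that starts the Hadoop and HBase daemons.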

Page 13:

Community Acknowledgement

Eli Collins

Kannan Muthukkaruppan

Karthik Ranganathan

Konstantin Shvachko

Lars George

Michael Stack

Ted Yu

Todd Lipcon