HBaseCon 2013 Application Track – Case Study
Experian Marketing Services: ETL for HBase

© 2013 Experian Limited. All rights reserved.




Manoj Khanwalkar, Chief Architect, Experian Marketing Services, New York

Govind Asawa, Big Data Architect, Experian Marketing Services, New York

Who We Are


1. About Experian Marketing Services

2. Why HBase

3. Why custom ETL

4. ETL solution features

5. Performance

6. Case Study

7. Conclusion

Agenda


Experian Marketing Services

1 billion+ messages daily: email and social digital marketing messages; 100% surge in volume during peak season

2,000+ institutional clients: across all verticals

9 regions, 24/7: platforms operating globally

500+ terabytes of data: clients need 1 to 7 years of marketing data, depending on the vertical

200+ big queries: complicated queries on 200+ million records, with 400+ columns for segmentation

2,000+ data export jobs: clients need daily incremental activity data


• A traditional RDBMS-based solution is very challenging and cost-prohibitive at this scale of operations.

• In a SaaS-based multi-tenancy model, we require schema flexibility to support thousands of clients with their individual requirements.

• In the majority of cases, key-based lookups (including range scans and filters) satisfy our data extraction requirements, and these are well supported by HBase.

• HBase shards automatically and scales horizontally.

• HBase provides a Java API that can be integrated with Experian's other systems.


Why HBase


Why develop an Integrator toolkit?

Connectivity
• Ability to ingest and read data from HBase and MongoDB
• Connectors for cloud computing
• Support for REST and other industry-standard APIs

Environment
• Supports the SaaS model
• Dynamically handles data input changes (number of fields and new fields)
• Integrates with other systems seamlessly, improving time to market

Cost
• Licensing
• Resources required to develop, administer, and maintain a solution
• Major ETL vendors do not support HBase
• An ETL solution needs extensive development when data structures change, which negates the advantages offered by a NoSQL solution


Integrator Architecture

[Diagram: Integrator architecture. Components as labeled on the slide:]

• Source Systems: Third Party, JMS, Database, SaaS, Files, RDBMS, HBase
• Connectors: CSV Reader, Event Listener, Message Broker, File Watcher
• Data Ingester: Processor, Parser Factory, Parser, Key Generator, Loader (RDBMS Loader, HBase Loader)
• Container: Metadata, Analyzer, Loader, Aggregator, Extractor (Query Output, Aggregate Aware, Stamping Transform)
• Target Systems: RDBMS, HBase, MongoDB


Extractor Architecture

[Diagram: Extractor data flow]

• Inputs to the Integrator: Send Data, Click Data, Bounce Data, TXN Data
• The Integrator writes Metadata, Detailed data, and Aggregates to HDFS/HBase
• The Extractor, with its Query Optimizer, reads HBase and serves the Web Server, Reporting, and Analytics


Data ingestion from multiple sources:
• Flat files
• NoSQL
• RDBMS (through JDBC)
• SaaS (Salesforce etc.)
• Messaging, and any system providing event streaming

Ability to de-normalize the fact table while ingesting data:
• The number of lookup tables can be configured

Near-real-time generation of aggregate tables:
• The number of aggregate tables can be configured
• HBase counters are used to keep the aggregated sum/count
• Aggregates can concurrently be populated in an RDBMS of choice


Integrator & Extractor
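As a rough sketch of the counter-based aggregation above: in production the counters are atomic HBase column increments, but a concurrent map can stand in to show the shape of the logic. The aggregate key mirrors table A1 from the appendix; the class and method names are ours, not the Integrator's.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch of near-real-time aggregation with counters. In the real pipeline the
// counter lives in HBase (atomic column increments); a concurrent map stands in
// for it here. The aggregate key mirrors table A1 from the appendix:
// campaign_id|date|gender|state|country. All names are illustrative.
public class AggregateCounters {
    private final Map<String, LongAdder> a1 = new ConcurrentHashMap<>();

    // Called once per ingested fact record; bumps the matching A1 bucket.
    public void ingest(String campaignId, String date, String gender,
                       String state, String country) {
        String key = String.join("|", campaignId, date, gender, state, country);
        a1.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    public long count(String key) {
        LongAdder c = a1.get(key);
        return c == null ? 0 : c.sum();
    }

    public static void main(String[] args) {
        AggregateCounters agg = new AggregateCounters();
        agg.ingest("11", "01/01/13", "M", "CA", "USA");
        agg.ingest("11", "01/01/13", "M", "CA", "USA");
        System.out.println(agg.count("11|01/01/13|M|CA|USA")); // 2
    }
}
```

Because the increment is atomic, many ingester threads can feed the same aggregate bucket concurrently, which is the property the slide relies on when it populates aggregates in near real time.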


Transformation of a column value to another value:
• Add columns by transformation
• Drop columns from the input stream if no persistence is required

Data filter capability:
• Drop records while ingesting the base table
• Drop records during aggregation

Aggregate-aware optimized query execution:
• Query performance: analyze the columns requested in the user's query and determine, from the count metadata, the table with the minimum record count that can satisfy the request
• Transparent: no user intervention or knowledge of the schema is required
• Optimizer: conceptually similar to an RDBMS query-plan optimizer, with the concept extended to NoSQL databases
• Metadata management: metadata integrated with the ETL process can be used by a variety of applications


Integrator & Extractor
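The transform and filter features above can be sketched as one small record-processing step: reject a record, derive a new column, and drop a column that should not be persisted. This is an illustration only; the record layout, column names, and helper signatures are assumptions, not the Integrator's actual API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Sketch of the transform/filter step: filter a record, add a column by
// transformation, and drop a column from the input stream. A record is
// modeled as a column map; all column names are illustrative.
public class TransformSketch {
    // Returns null when the record is filtered out (dropped at ingest time).
    public static Map<String, String> process(Map<String, String> record,
                                              Function<String, String> domainToRegion,
                                              Predicate<Map<String, String>> keep) {
        if (!keep.test(record)) {
            return null;
        }
        Map<String, String> out = new HashMap<>(record);
        // Add a column by transformation (e.g. derive a region from rcpt_domain)
        out.put("region", domainToRegion.apply(out.get("rcpt_domain")));
        // Drop a column that does not need to be persisted
        out.remove("ip");
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> rec = new HashMap<>();
        rec.put("rcpt_domain", "gmail.com");
        rec.put("dsn_status", "success");
        rec.put("ip", "192.168.6.23");

        Map<String, String> out = process(rec,
                d -> d.endsWith(".com") ? "US" : "OTHER",
                r -> "success".equals(r.get("dsn_status")));
        System.out.println(out); // ip dropped, region added
    }
}
```

The same predicate shape covers both filter cases on the slide: one predicate applied while ingesting the base table, another applied during aggregation.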


Framework:
• The solution is based on Spring as a lightweight container, with a framework built around it to standardize the process lifecycle and to let any arbitrary functionality reside in the container by implementing a Service interface.
• The container runs in batch-processing or daemon mode.
• In daemon mode, it uses the Java 7 File Watcher API to react to files placed in a specified directory for processing.

Metadata catalogue:
• Metadata about every HBase table into which data is ingested is stored
• For each table, the primary key, columns, and a record counter are stored
• HBase's count is a brute-force scan and an expensive API call; this can be avoided if metadata is published at ingestion time
• Avoids expensive queries that can bring the cluster to its knees
• Provides faster query performance


Integrator
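The daemon mode described above leans on the Java 7 WatchService. Here is a minimal sketch, assuming a single drop directory and returning the name of the first file created there; the directory, timeout, and method names are illustrative, not the framework's real API.

```java
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

// Minimal sketch of daemon mode: watch a drop directory with the Java 7
// WatchService and return the name of the next file created in it.
public class FileWatcherSketch {
    public static String awaitNextFile(Path dir, long timeoutSec) throws Exception {
        try (WatchService ws = FileSystems.getDefault().newWatchService()) {
            dir.register(ws, StandardWatchEventKinds.ENTRY_CREATE);
            WatchKey key = ws.poll(timeoutSec, TimeUnit.SECONDS);
            if (key == null) {
                return null; // nothing arrived before the timeout
            }
            for (WatchEvent<?> ev : key.pollEvents()) {
                return ev.context().toString(); // file name relative to dir
            }
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("ingest");
        // Simulate a producer dropping a file shortly after the watcher starts.
        new Thread(() -> {
            try {
                Thread.sleep(200);
                Files.createFile(dir.resolve("send_data.csv"));
            } catch (Exception ignored) { }
        }).start();
        System.out.println(awaitNextFile(dir, 10));
    }
}
```

A production daemon would loop on `take()` and hand each file to the parser pipeline instead of returning after one event; the sketch uses `poll` with a timeout so it terminates.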


• We used a 20-node cluster in production; each node had 24 cores, with a 10 GigE network backbone.

• We observed a throughput of 1.3 million records inserted into HBase per minute per node.

• The framework allowed us to run the ETL process on multiple machines, providing horizontal scalability.

• Most of our queries returned in at most a few seconds.


Integrator – System Performance


• Our experience shows that HBase offers a cost-effective, performant solution for managing our data explosion while meeting the increasingly sophisticated analytical and reporting requirements of clients.

• The ETL framework allows us to leverage HBase and its features while improving developer productivity.

• The framework gives us the ability to roll out new functionality with minimal time to market.

• The metadata catalogue optimizes queries and improves cluster performance.

• A select count() on a big HBase table takes minutes or hours and can bring the cluster to its knees; the Integrator's metadata returns counts, along with the primary key and columns, in milliseconds.


Conclusion


• Case Study


Appendix


HBase Schema & Record

Fact Table: send

ClientID | CampaignID | Timelogged | UserID | Origdomain | Rcptdomain | DSNstatus | Bouncecat | IP | Timequeued
1 | 11 | 01/01/13 | 21 | abc.com | gmail.com | success | | 192.168.6.23 | 01/01/2013
2 | 12 | 01/02/13 | 31 | xyz.com | yahoo.com | success | bad-mailbox | 112.168.6.23 | 01/01/2013

Send Record (input):
client_id,campaign_id,time_logged,user_id,orig_domain,rcpt_domain,dsn_status,bounce_cat,ip,time_queued
1,11,01/01/2013,21,abc.com,gmail.com,success,192.168.6.23,01/01/2013


HBase Schema & Record

Fact Table: activity

ClientID | CampaignID | Timelogged | UserID | Origdomain | Rcptdomain | City | Event type | IP | Sendtime
1 | 11 | 01/01/13 | 21 | abc.com | gmail.com | SFO | Open | 192.168.6.23 | 01/01/2013
2 | 12 | 01/04/13 | 31 | xyz.com | yahoo.com | LA | Click | 112.168.6.23 | 01/01/2013

Activity Record (input):
client_id,campaign_id,event_time,user_id,event_type
1,11,01/01/2013,21,open


HBase Schema & Record

Dimension Table: demographics

Client ID | User ID | Date | Age | Gender | State | City | Zip | Country | Flag
1 | 11 | 01/01/13 | 21 | M | CA | SFO | 94087 | USA | Y
2 | 12 | 01/02/13 | 31 | M | CA | SFO | 94087 | USA | N

Dimension Table: ip

IP | Date | Domain | State | Country | City
192.168.6.23 | 01/01/2013 | gmail.com | CA | USA | SFO
112.168.6.23 | 01/02/2013 | abc.edu | NJ | USA | Newark


HBase Schema & Record

Aggregate Table: A1

Campaign ID | Date | Gender | State | Country | Count
11 | 01/01/13 | M | CA | USA | 5023
12 | 01/02/13 | M | CA | USA | 74890

Aggregate Table: A2

Client ID | Date | Gender | State | Country | Count
1 | 01/01/13 | M | CA | USA | 742345
2 | 01/02/13 | M | CA | USA | 1023456


Metadata

Metadata Table

Table Name | Primary Key | Columns | Count
demographics | Client_id, Campaign_id, Date | Client_id, Campaign_id, Date, Age, Gender, State, City, Country, Flag | 10,000,000
A1 | Campaign_id, Date | Campaign_id, Date, Gender, State, Country, Count | 1,000,000
A2 | Client_id, Date | Client_id, Date, Gender, State, Country, Count | 500,000


User query without Extractor aggregate awareness:

• Select client_id, state, count from demographics

• Query execution: the query runs against the demographics table, which has 300,000,000 rows

User query with Extractor aggregate awareness:

• Select client_id, state, count from demographics

• Query execution:

– Step 1: the Extractor parses the list of columns from the query

– Step 2: the Extractor finds the list of tables that have these columns; in this example it finds two tables, demographics and A1, that can satisfy the request

– Step 3: the Extractor decides which table best satisfies the query, based on the number of rows in each table; here table A1 has fewer rows than demographics, so A1 is selected

– Step 4: the query executes against table A1 with the appropriate user-specified where clause


Query Execution in Action
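The steps above can be sketched as one lookup against the metadata catalogue: among the tables whose columns cover the query's columns, pick the one with the fewest rows. The metadata values mirror the appendix tables; the class and method names are hypothetical, not the Extractor's real code, and the example query uses campaign_id so that it resolves cleanly against A1.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of aggregate-aware table selection: from the metadata catalogue
// (table -> columns, table -> row count), choose the smallest table whose
// columns cover the query. Metadata values mirror the appendix slides.
public class AggregateAwareSketch {
    static final Map<String, Set<String>> COLUMNS = new HashMap<>();
    static final Map<String, Long> ROW_COUNT = new HashMap<>();
    static {
        COLUMNS.put("demographics", new HashSet<>(Arrays.asList(
            "client_id", "campaign_id", "date", "age", "gender",
            "state", "city", "country", "flag")));
        ROW_COUNT.put("demographics", 10_000_000L);
        COLUMNS.put("A1", new HashSet<>(Arrays.asList(
            "campaign_id", "date", "gender", "state", "country", "count")));
        ROW_COUNT.put("A1", 1_000_000L);
        COLUMNS.put("A2", new HashSet<>(Arrays.asList(
            "client_id", "date", "gender", "state", "country", "count")));
        ROW_COUNT.put("A2", 500_000L);
    }

    // Steps 2-3: among tables containing every requested column, take the
    // one with the fewest rows; null when no table qualifies.
    public static String chooseTable(List<String> queryColumns) {
        String best = null;
        for (Map.Entry<String, Set<String>> e : COLUMNS.entrySet()) {
            if (e.getValue().containsAll(queryColumns)
                    && (best == null || ROW_COUNT.get(e.getKey()) < ROW_COUNT.get(best))) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(chooseTable(Arrays.asList("campaign_id", "state", "count"))); // A1
        System.out.println(chooseTable(Arrays.asList("age", "gender")));                 // demographics
    }
}
```

Step 4 (rewriting and running the query against the chosen table) is omitted; the point is that the choice is driven entirely by catalogue metadata, with no user knowledge of the schema.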


• Bloom filters were enabled at the row level so HBase could skip files efficiently.

• We used HBase filters extensively in Scans to filter out as much data as possible on the server side.

• We defined aggregates judiciously so queries could be answered without forcing HBase into large file scans.

• We used a key concatenation aligned to the expected search patterns, so HBase could do an exact match or an efficient key range scan when a partial key was provided.


HBase Design Considerations
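The key-concatenation point above can be illustrated with a sorted map standing in for HBase's lexicographically ordered row keys: a composite, zero-padded key supports both an exact lookup and a range scan on a leading key portion. The field names, widths, and delimiter are assumptions for illustration, not the production key layout.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of composite row-key design: build the key from the fields users
// search by most, so an exact key gives a point lookup and a leading partial
// key gives a tight range scan. A sorted map models HBase's row-key order.
public class RowKeySketch {
    // client_id|campaign_id|yyyyMMdd, zero-padded so lexicographic order
    // matches logical order.
    public static String rowKey(int clientId, int campaignId, String yyyyMMdd) {
        return String.format("%06d|%06d|%s", clientId, campaignId, yyyyMMdd);
    }

    // Range scan over all rows sharing a leading key portion, analogous to
    // an HBase Scan bounded by [prefix, prefix + high byte).
    public static NavigableMap<String, String> prefixScan(
            NavigableMap<String, String> table, String prefix) {
        return table.subMap(prefix, true, prefix + "\uffff", false);
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put(rowKey(1, 11, "20130101"), "send-a");
        table.put(rowKey(1, 11, "20130102"), "send-b");
        table.put(rowKey(1, 12, "20130101"), "send-c");
        table.put(rowKey(2, 11, "20130101"), "send-d");

        // Exact match on the full key
        System.out.println(table.get(rowKey(1, 12, "20130101"))); // send-c
        // Partial key: everything for client 1, campaign 11
        System.out.println(prefixScan(table, "000001|000011|").size()); // 2
    }
}
```

Putting the most selective, most frequently supplied fields first is what makes the partial-key scans on the slide cheap; a key that led with the date instead would also concentrate writes on one region, the hotspot problem mentioned under tuning.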


• We didn't use MapReduce in our ETL framework, for the following reasons:

– Overhead of MapReduce-based processes

– Need for real-time access to data

– Every file had different header metadata, and in MapReduce we had difficulty passing the header metadata to each Map process

– Avoiding intermediate reads and writes to the HDFS file system


HBase Design Considerations


• We broke input and output processing into separate threads, and allocated many more threads to output processing to compensate for the relative processing speeds.

• We batched writes to HBase to reduce the number of calls to the server.

• We turned off the WAL in HBase, since we could always reprocess the file in the rare case of a failure.

• We used primitives and arrays in the code where feasible, instead of Java objects and collections, to reduce the memory footprint and the pressure on the garbage collector.


HBase Tuning
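The batching point above can be sketched as a small buffered writer: accumulate records locally and flush them in one call once the batch fills. The HBase client of that era offered the same idea through its client-side write buffer; this stand-alone version, with names of our own invention, just shows the batching logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of batched writes: buffer records and flush in one round trip once
// the batch fills, cutting per-record calls to the server. In real code the
// flusher would issue one multi-Put call to HBase; here it is pluggable.
public class BatchedWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> flusher; // e.g. one batched RPC to HBase
    private final List<T> buffer = new ArrayList<>();

    public BatchedWriter(int batchSize, Consumer<List<T>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    public void write(T record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Also called at shutdown to drain any partial batch.
    public void flush() {
        if (!buffer.isEmpty()) {
            flusher.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        List<Integer> flushSizes = new ArrayList<>();
        BatchedWriter<String> w = new BatchedWriter<>(100, b -> flushSizes.add(b.size()));
        for (int i = 0; i < 250; i++) {
            w.write("record-" + i);
        }
        w.flush(); // drain the remainder
        System.out.println(flushSizes); // [100, 100, 50]
    }
}
```

The final explicit `flush()` matters: without it the trailing partial batch (here 50 records) would sit in the buffer, which is the same care the real write buffer requires before closing the table.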


• We increased the client write buffer size to several megabytes.

• To avoid hotspots and get the best data retrieval, we designed a composite primary key; the key design let us access data by providing an exact key, or range-scan on a leading portion of the key.

• We found that too many filters on a scan gives diminishing returns, and beyond some point it degrades overall scan performance.


HBase Tuning


Thank you. For more information, please contact:

Manoj Khanwalkar, Chief Architect
[email protected]

Govind Asawa, Big Data Architect
[email protected]