Not your Dad’s Old HBase
Gilad Moscovitch - Senior Consultant, UXC PS
@moscovig
Yaniv Rodenski - Principal Consultant, UXC PS
@YRodenski
Agenda
Our use cases
Introduction to Apache Phoenix
The first use case - retrospective
Managing a large-scale graph with TitanDB
The second use case - retrospective
The Cable Company
Our story starts with a cable company that grew:
Over a decade ago, bought an ISP
Bought a mobile network
Started new ventures such as VOD and VoIP
Our Dataset
Billions of records (PB scale)
Countless formats:
Multiple systems
Network equipment
Devices
Dynamic data model
New devices are introduced frequently (on average every two weeks)
New demands are introduced even more frequently
Challenges
The Oracle Data Warehouse and ODI could not handle the load
ETL developers could not handle the load; the ETL team became a bottleneck
Not all data types arrive at the warehouse
We had to prioritise due to a lack of ETL developers
Incompatibility with the existing data model
Changes to the data model would take an average of a month
Even when data was loaded, analysts were not aware of the new tables, and we ended up with an unusable schema
More Challenges
New data models that are not a good fit for SQL databases:
Sparse data
Geospatial data
Full text
Graph
Need to ask harder questions that require heavy processing:
Machine learning
Breaking Out
The new data platform was Hadoop-based
Using CDH (at that time the most advanced option)
Trying to reuse existing components of the platform as much as possible
Challenge #1: Early Data Access
Giving analysts, BI developers and business access to raw data
For this use case we reviewed a few tools, including Apache Phoenix
Apache Phoenix - SQL on HBase
Apache Phoenix is a relational database layer over HBase with a difference:
Table metadata is stored in an HBase table and versioned; snapshot queries over prior versions automatically use the correct schema
Secondary indexes
Dynamic columns with schema on read
Views - indexed and updatable
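As a rough illustration of what that looks like from client code, here is a minimal sketch using Phoenix's standard JDBC driver; the table, columns and ZooKeeper host are made up for the example and are not the actual schema from this project.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixDynamicColumnExample {
    public static void main(String[] args) throws Exception {
        // The Phoenix thick JDBC driver connects through the HBase cluster's ZooKeeper quorum
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement()) {

            // Base table: only the columns every event shares are declared up front
            stmt.execute("CREATE TABLE IF NOT EXISTS DEVICE_EVENTS ("
                    + " DEVICE_ID VARCHAR NOT NULL,"
                    + " EVENT_TIME TIMESTAMP NOT NULL,"
                    + " EVENT_TYPE VARCHAR"
                    + " CONSTRAINT PK PRIMARY KEY (DEVICE_ID, EVENT_TIME))");

            // A dynamic column (SIGNAL_STRENGTH) is declared at upsert time - schema on read
            stmt.execute("UPSERT INTO DEVICE_EVENTS (DEVICE_ID, EVENT_TIME, EVENT_TYPE,"
                    + " SIGNAL_STRENGTH INTEGER)"
                    + " VALUES ('dvr-42', NOW(), 'heartbeat', 87)");
            conn.commit();

            // The same dynamic column is declared again when reading it back
            ResultSet rs = stmt.executeQuery(
                    "SELECT DEVICE_ID, SIGNAL_STRENGTH"
                    + " FROM DEVICE_EVENTS (SIGNAL_STRENGTH INTEGER)"
                    + " WHERE EVENT_TYPE = 'heartbeat'");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getInt(2));
            }
        }
    }
}
```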
Challenge #1: Results
In addition to Phoenix we also looked at Hive and Impala
Spark SQL, Presto and Drill were not considered due to immaturity
Impala was chosen
Schema on read was important
Hive on CDH doesn’t support Tez
Apache Phoenix was overkill; it is better suited to being a database than a warehouse
Challenge #2: Family Time
Clients are never represented by a single entity:
Households
Businesses
Clients have multiple devices generating data:
Home and mobile phones
IP addresses for devices
DVRs
Titan - A Distributed Graph
Titan is a scalable graph database
Optimized for storing and querying graphs
Runs on top of:
Cassandra
HBase
DynamoDB
BerkeleyDB
Support for geo, numeric range, and full-text search via:
Elasticsearch
Solr
Supports Gremlin - a graph querying DSL - via:
TinkerPop
Gremlin over HTTP
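A minimal sketch of how the household/device model from Challenge #2 maps to code, assuming Titan 1.0 with the TinkerPop 3 APIs on an HBase backend; the labels, property names and configuration values are illustrative, not the project's actual ontology.

```java
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class HouseholdGraphExample {
    public static void main(String[] args) throws Exception {
        // Open Titan against the HBase storage backend (hostname is a placeholder)
        TitanGraph graph = TitanFactory.build()
                .set("storage.backend", "hbase")
                .set("storage.hostname", "zk-host")
                .open();

        // A tiny slice of the ontology: a household that owns two devices
        Vertex household = graph.addVertex("household");
        household.property("householdId", "H-1001");

        Vertex phone = graph.addVertex("device");
        phone.property("type", "mobile-phone");

        Vertex dvr = graph.addVertex("device");
        dvr.property("type", "dvr");

        household.addEdge("owns", phone);
        household.addEdge("owns", dvr);
        graph.tx().commit();

        // Gremlin traversal: which devices belong to household H-1001?
        GraphTraversalSource g = graph.traversal();
        g.V().has("household", "householdId", "H-1001")
                .out("owns")
                .values("type")
                .forEachRemaining(System.out::println);

        graph.close();
    }
}
```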
Challenge #2: Testing Stage
HBase vs Cassandra benchmark + sanity check
Simulation for 1 billion Vertices
Sanity check - OK
Not much difference in loading or querying time between the two stores
HBase chosen because of the existing infrastructure
Retrospective: 1 billion vertices on an empty graph didn't really simulate anything.
Challenge #2: POC Stage
Initializing an untuned HBase cluster on all 24 nodes of the existing cluster
Hosted side by side with MapReduce and Impala
Developing an initial ontology for the largest data source together with a developer from the client application team
Developing MapReduce jobs for loading hundreds of GB a day according to the ontology
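For context, a loader along these lines could start with a mapper like the hypothetical sketch below, which keys raw records by household so a reducer can write each household's vertices and edges to Titan in a single transaction; the field layout and class names are assumptions, not the project's actual job.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Maps one raw usage record to (householdId, record) so each reducer
// sees all of a household's records and can write its vertices and
// "owns"/"generated" edges into Titan in one transaction.
public class OntologyLoadMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text householdKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hypothetical input layout: householdId|deviceId|deviceType|payload...
        String[] fields = line.toString().split("\\|");
        if (fields.length < 3) {
            return; // skip malformed records rather than failing the job
        }
        householdKey.set(fields[0]);
        context.write(householdKey, line);
    }
}
```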
POC Performance
Input data was stored in hourly directories, so at first we scheduled the MapReduce job for each hour.
An hour's worth of data took about 40 minutes to process and load.
Later on we scheduled the MapReduce job for a whole day at a time; loading a full day took about half a day.
Retrospective: such long MapReduce jobs create new challenges - they hold lots of reducers for a long time and are not fun to re-run in case of cluster failures.
Performance Tuning
HBase didn't handle the load; the symptoms included:
HBase write-blocking compactions
Retired region servers
Tuning performed:
Region split size - split after 11 GB
Memstore flush size tuning
GC Tuning
Java heap size decreased from 32 GB to 16 GB
Daily major compaction for the graph table
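For illustration only, the knobs above roughly correspond to the HBase 1.x client sketch below; in practice these sizes live in hbase-site.xml, and the table name, flush size and connection details here are placeholders rather than the project's actual values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class GraphTableMaintenance {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Normally set in hbase-site.xml; shown here only to name the knobs:
        // split regions at ~11 GB, and tune the memstore flush size (value is illustrative).
        conf.setLong("hbase.hregion.max.filesize", 11L * 1024 * 1024 * 1024);
        conf.setLong("hbase.hregion.memstore.flush.size", 256L * 1024 * 1024);

        // Kick off the daily major compaction of the graph table explicitly,
        // instead of relying on HBase's automatic scheduling.
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            admin.majorCompact(TableName.valueOf("titan_graph")); // placeholder table name
        }
    }
}
```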
Retrospective: We had to statically partition into two different clusters: one for HBase, and one for everything else
Today
The main graph ingests:
~1.7 billion edges
~1.7 billion vertices
The main graph size is 20TB
20 region servers
Rebuilding the graph on average every 3 months for new ontology
New data sources are added within a day by one (awesome) developer
Using a web-based UI tool for graph exploration
Retrospective: Titan on HBase works pretty well for those sizes
Summary
HBase is a versatile datastore
Apache Phoenix modernises HBase with a semi-relational SQL layer
Titan provides powerful graph capabilities
Never be naive about Big Data tools - they will bite you, badly