Not your Dad’s Old HBase
Gilad Moscovitch - Senior Consultant, UXC PS
@moscovig
Yaniv Rodenski - Principal Consultant, UXC PS
@YRodenski
Agenda
Our use cases
Introduction to Apache Phoenix
The first use case - retrospective
Managing a large-scale graph with TitanDB
The second use case - retrospective
The Cable Company
Our story starts with a cable company that grew:
Over a decade ago, bought an ISP
Bought a mobile network
Started new ventures such as VOD and VoIP
Our Dataset
Billions of records (PB scale)
Countless formats:
Multiple systems
Network equipment
Devices
Dynamic data model
New devices are introduced frequently (on average every two weeks)
New demands are introduced even more frequently
Challenges
The Oracle Data Warehouse and ODI could not handle the load
ETL developers could not handle the load; the ETL team became a bottleneck
Not all data types arrive at the warehouse
We had to prioritise due to a lack of ETL developers
Incompatibility with the existing data model
Changes to the data model would take an average of a month
Even when data was loaded, analysts were not aware of the new tables, and we ended up with an unusable schema
More Challenges
New data models that are not a good fit for SQL databases:
Sparse data
Geospatial data
Full text
Graph
Need to ask harder questions that require heavy processing:
Machine learning
Breaking Out
The new data platform was Hadoop-based
Using CDH (at that time the most advanced option)
Trying to reuse existing components of the platform as much as possible
Challenge #1: Early Data Access
Giving analysts, BI developers and business access to raw data
For this use case we reviewed a few tools, including Apache Phoenix
Apache Phoenix - SQL on HBase
Apache Phoenix is a relational database layer over HBase with a difference:
Table metadata is stored in an HBase table and versioned; snapshot queries over prior versions automatically use the correct schema
Secondary indexes
Dynamic columns with schema on read
Views - indexed and updatable
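As a rough illustration of what that looks like from client code, here is a minimal sketch using Phoenix's standard JDBC driver; the table, columns and ZooKeeper host are made up for the example and are not the actual schema from this project.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixDynamicColumnExample {
    public static void main(String[] args) throws Exception {
        // The Phoenix thick JDBC driver connects through the HBase cluster's ZooKeeper quorum
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement()) {

            // Base table: only the columns every event shares are declared up front
            stmt.execute("CREATE TABLE IF NOT EXISTS DEVICE_EVENTS ("
                    + " DEVICE_ID VARCHAR NOT NULL,"
                    + " EVENT_TIME TIMESTAMP NOT NULL,"
                    + " EVENT_TYPE VARCHAR"
                    + " CONSTRAINT PK PRIMARY KEY (DEVICE_ID, EVENT_TIME))");

            // A dynamic column (SIGNAL_STRENGTH) is declared at upsert time - schema on read
            stmt.execute("UPSERT INTO DEVICE_EVENTS (DEVICE_ID, EVENT_TIME, EVENT_TYPE,"
                    + " SIGNAL_STRENGTH INTEGER)"
                    + " VALUES ('dvr-42', NOW(), 'heartbeat', 87)");
            conn.commit();

            // The same dynamic column is declared again when reading it back
            ResultSet rs = stmt.executeQuery(
                    "SELECT DEVICE_ID, SIGNAL_STRENGTH"
                    + " FROM DEVICE_EVENTS (SIGNAL_STRENGTH INTEGER)"
                    + " WHERE EVENT_TYPE = 'heartbeat'");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getInt(2));
            }
        }
    }
}
```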
Challenge #1: Results
In addition to Phoenix we also looked at Hive and Impala
Spark SQL, Presto and Drill were not considered due to immaturity
Impala was chosen
Schema on read was important
Hive on CDH doesn’t support Tez
Apache Phoenix was overkill; it is better suited to being a database than a warehouse
Challenge #2: Family Time
Clients are never represented by a single entity:
Households
Businesses
Clients have multiple devices generating data:
Home and mobile phones
IP addresses for devices
DVRs
Titan - A Distributed Graph
Titan is a scalable graph database
Optimized for storing and querying graphs
Runs on top of:
Cassandra
HBase
DynamoDB
BerkeleyDB
Support for geo, numeric range, and full-text search via:
Elasticsearch
Solr
Supports Gremlin - a graph querying DSL - via:
TinkerPop
Gremlin over HTTP
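A minimal sketch of how the household/device model from Challenge #2 maps to code, assuming Titan 1.0 with the TinkerPop 3 APIs on an HBase backend; the labels, property names and configuration values are illustrative, not the project's actual ontology.

```java
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class HouseholdGraphExample {
    public static void main(String[] args) throws Exception {
        // Open Titan against the HBase storage backend (hostname is a placeholder)
        TitanGraph graph = TitanFactory.build()
                .set("storage.backend", "hbase")
                .set("storage.hostname", "zk-host")
                .open();

        // A tiny slice of the ontology: a household that owns two devices
        Vertex household = graph.addVertex("household");
        household.property("householdId", "H-1001");

        Vertex phone = graph.addVertex("device");
        phone.property("type", "mobile-phone");

        Vertex dvr = graph.addVertex("device");
        dvr.property("type", "dvr");

        household.addEdge("owns", phone);
        household.addEdge("owns", dvr);
        graph.tx().commit();

        // Gremlin traversal: which devices belong to household H-1001?
        GraphTraversalSource g = graph.traversal();
        g.V().has("household", "householdId", "H-1001")
                .out("owns")
                .values("type")
                .forEachRemaining(System.out::println);

        graph.close();
    }
}
```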
Challenge #2: Testing Stage
HBase vs Cassandra benchmark + sanity check
Simulation for 1 billion Vertices
Sanity check - OK
Not much difference in loading or querying time between the two stores
HBase chosen because of the existing infrastructure
Retrospective: 1 billion vertices on an empty graph didn't really simulate anything.
Challenge #2: POC Stage
Initializing an untuned HBase cluster on all 24 nodes of the existing cluster
Hosted side by side with MapReduce and Impala
Developing an initial ontology for the largest data source together with a developer from the client application team
Developing MapReduce jobs for loading hundreds of GB a day according to the ontology
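For context, a loader along these lines could start with a mapper like the hypothetical sketch below, which keys raw records by household so a reducer can write each household's vertices and edges to Titan in a single transaction; the field layout and class names are assumptions, not the project's actual job.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Maps one raw usage record to (householdId, record) so each reducer
// sees all of a household's records and can write its vertices and
// "owns"/"generated" edges into Titan in one transaction.
public class OntologyLoadMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text householdKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hypothetical input layout: householdId|deviceId|deviceType|payload...
        String[] fields = line.toString().split("\\|");
        if (fields.length < 3) {
            return; // skip malformed records rather than failing the job
        }
        householdKey.set(fields[0]);
        context.write(householdKey, line);
    }
}
```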
POC Performance
Input data was stored in hourly directories, so at first we scheduled the MapReduce job for each hour.
An hour's worth of data took about 40 minutes to process and load.
Later on we scheduled the MapReduce job for a whole day at a time; loading a full day took about half a day.
Retrospective: such long MapReduce jobs create new challenges - they hold lots of reducers for a long time and are not fun to re-run in case of cluster failures.
Performance Tuning
HBase didn't handle the load; the symptoms included:
HBase write-blocking compactions
Retired region servers
Tuning performed:
Region split size - split after 11 GB
Memstore flush size tuning
GC Tuning
Java heap size decreased from 32 GB to 16 GB
Daily major compaction for the graph table
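For illustration only, the knobs above roughly correspond to the HBase 1.x client sketch below; in practice these sizes live in hbase-site.xml, and the table name, flush size and connection details here are placeholders rather than the project's actual values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class GraphTableMaintenance {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Normally set in hbase-site.xml; shown here only to name the knobs:
        // split regions at ~11 GB, and tune the memstore flush size (value is illustrative).
        conf.setLong("hbase.hregion.max.filesize", 11L * 1024 * 1024 * 1024);
        conf.setLong("hbase.hregion.memstore.flush.size", 256L * 1024 * 1024);

        // Kick off the daily major compaction of the graph table explicitly,
        // instead of relying on HBase's automatic scheduling.
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            admin.majorCompact(TableName.valueOf("titan_graph")); // placeholder table name
        }
    }
}
```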
Retrospective: We had to statically partition into two different clusters: one for HBase, and one for everything else
Today
The main graph ingests:
~1.7 billion edges
~1.7 billion vertices
The main graph size is 20TB
20 region servers
Rebuilding the graph on average every 3 months for new ontology
New data sources are added within a day by one (awesome) developer
Using a web-based UI tool for graph exploration
Retrospective: Titan on HBase works pretty well for those sizes
Summary
HBase is a versatile datastore
Apache Phoenix modernises HBase with a semi-relational SQL layer
Titan provides powerful graph capabilities
Never be naive about Big Data tools - they will bite you, badly