spark + hbase

Spark + HBase: Bringing HBase Data Efficiently into Spark with DataFrame Support
Zhan Zhang, Software Engineer
04/08/2016

Upload: dataworks-summithadoop-summit

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Spark + HBase

Page 2: Spark + HBase

Page 2 © Hortonworks Inc. 2014

About Zhan Zhang

Zhan Zhang (Software Engineer at Hortonworks)

Currently Focuses on Apache Spark, Hadoop, etc.

Contributes to Apache Spark, YARN, HBase, Ambari, etc.

Experience in Computer Networks, Distributed Systems, and Machine Learning Platforms

Page 3: Spark + HBase

Why Revamp the Existing HBase Connector?

Limited Spark Support in HBase Upstream
– Scalability
– RDD Level, but Spark Is Moving to DataFrame/Dataset
– Data Loss and Data Duplication

Stability
– Correctness
– Stability Impact with Co-processor
– Serialized RDD Lineage to HBase
– Maintenance Overhead: Internal Hacks

Page 4: Spark + HBase

What Improvements Have We Made?

Combine Spark and HBase
– Spark Catalyst Engine for Query Plan and Optimization
– HBase for Fast Access KV Store
– Implement Standard External Data Source with Built-in Filter

High Performance
– Data Locality: Move Computation to Data
– Partition Pruning: Tasks Only Performed in the RS (Region Server) Holding the Requested Data
– Column Pruning / Predicate Pushdown: Reduce Network Overhead

Full-Fledged DataFrame Support
– Spark-SQL
– Language-Integrated Query

Run on Top of Existing HBase Tables
– Native Support for Java Primitive Types

Page 5: Spark + HBase

More …

Composite Key

Avro Format

Customized Serdes

Page 6: Spark + HBase

Usage - Define the Catalog
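The catalog shown on this slide is not preserved in the transcript. As a rough sketch only, the snippet below builds a catalog in the JSON layout used by the Hortonworks Spark-HBase connector (assumed here to be the connector being presented). The table name, column names, and column families are hypothetical, and the two helper functions are illustrative, not part of any connector API.

```python
import json

# Hypothetical table and columns. The layout (table / rowkey / columns, with
# the pseudo column family "rowkey" marking row-key fields) is an assumption
# modeled on the Hortonworks Spark-HBase connector's JSON catalog.
catalog = {
    "table": {"namespace": "default", "name": "table1"},
    "rowkey": "key",
    "columns": {
        "col0": {"cf": "rowkey", "col": "key", "type": "string"},
        "col1": {"cf": "cf1", "col": "col1", "type": "boolean"},
        "col2": {"cf": "cf2", "col": "col2", "type": "double"},
    },
}


def rowkey_columns(cat):
    """DataFrame column names that map onto the HBase row key."""
    return [name for name, spec in cat["columns"].items()
            if spec["cf"] == "rowkey"]


def column_families(cat):
    """Sorted HBase column families referenced by the catalog."""
    return sorted({spec["cf"] for spec in cat["columns"].values()
                   if spec["cf"] != "rowkey"})


# The catalog is handed to the data source as a JSON string.
catalog_json = json.dumps(catalog, indent=2)
```

Each DataFrame column is mapped to an HBase column family and qualifier with a declared type, which is what lets the connector run on top of an existing table.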

Page 7: Spark + HBase

Usage - Write to HBase

Page 8: Spark + HBase

Usage - Construct DataFrame

Page 9: Spark + HBase

Usage - Language-Integrated Query

Page 10: Spark + HBase

Usage - Spark SQL

Page 11: Spark + HBase

Usage - With Other Data Sources

Page 12: Spark + HBase

Page 13: Spark + HBase

Page 14: Spark + HBase

Spark HBase Connector Architecture

Page 15: Spark + HBase

Byte Array Order: SHORT/INT/LONG

0 1 2 … MAX MIN … -2 -1

WHERE X <= 2

WHERE X >= -2
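The byte order above can be checked with a short sketch. It assumes values are stored as big-endian two's-complement bytes (the layout HBase's `Bytes.toBytes(int)` produces), so negative numbers sort after all non-negative ones. The function names are illustrative only and are not the connector's code.

```python
import struct

INT_MIN, INT_MAX = -2**31, 2**31 - 1


def int_to_bytes(x):
    """Encode a signed 32-bit int as 4 big-endian two's-complement bytes."""
    return struct.pack(">i", x)


def byte_ranges_le(bound):
    """Inclusive byte ranges that cover WHERE X <= bound."""
    if bound < 0:
        return [(int_to_bytes(INT_MIN), int_to_bytes(bound))]
    # Non-negative part [0, bound] plus every negative value [MIN, -1].
    return [(int_to_bytes(0), int_to_bytes(bound)),
            (int_to_bytes(INT_MIN), int_to_bytes(-1))]


def byte_ranges_ge(bound):
    """Inclusive byte ranges that cover WHERE X >= bound."""
    if bound >= 0:
        return [(int_to_bytes(bound), int_to_bytes(INT_MAX))]
    # Every non-negative value [0, MAX] plus the negative part [bound, -1].
    return [(int_to_bytes(0), int_to_bytes(INT_MAX)),
            (int_to_bytes(bound), int_to_bytes(-1))]


def matches(x, ranges):
    """True if x's encoding falls inside any of the byte ranges."""
    encoded = int_to_bytes(x)
    return any(lo <= encoded <= hi for lo, hi in ranges)


# Sorting by encoded bytes reproduces the order drawn on the slide.
byte_order = sorted([-1, 2, INT_MIN, 0, -2, INT_MAX, 1], key=int_to_bytes)
```

Under this encoding, both `WHERE X <= 2` and `WHERE X >= -2` turn into two separate byte ranges, which is why a single predicate can become more than one scan.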

Page 16: Spark + HBase

Implementation

Partition Pruning:
– Split into Multiple Ranges, e.g., WHERE X < 2

Data Locality:
– Each RDD Partition Has a Preferred Location

Column Pruning:
– Only Required Columns in Scan/BulkGet

Predicate Pushdown:
– HBase Built-in Filters

Scan/BulkGets:
– Grouped by Region Server
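Two of the points above, partition pruning and grouping gets by region server, can be sketched in a few lines. The region layout, server names, and function names below are hypothetical, and the logic is a simplified illustration rather than the connector's implementation.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical region layout: (start_key, region_server), sorted by start key.
# Each region covers [start_key, next_start_key); the first starts at b"".
REGIONS = [(b"", "rs1"), (b"g", "rs2"), (b"n", "rs1"), (b"t", "rs3")]
START_KEYS = [start for start, _ in REGIONS]


def region_index(row_key):
    """Index of the region whose key range contains row_key."""
    return bisect_right(START_KEYS, row_key) - 1


def prune_regions(start, stop):
    """Regions overlapping the scan range [start, stop), with start < stop."""
    first = region_index(start)
    # A region overlaps only if its start key is strictly below stop.
    last = bisect_left(START_KEYS, stop) - 1
    return REGIONS[first:last + 1]


def group_gets(row_keys):
    """Group point lookups by the region server that hosts each key."""
    grouped = defaultdict(list)
    for key in row_keys:
        _, server = REGIONS[region_index(key)]
        grouped[server].append(key)
    return dict(grouped)
```

For example, `prune_regions(b"h", b"n")` touches only the region starting at `b"g"`, so only one task is needed, and `group_gets` lets all keys hosted by the same server be fetched together.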

Page 17: Spark + HBase

Page 18: Spark + HBase

Page 19: Spark + HBase

BACK UP

Page 20: Spark + HBase

Kerberos Cluster

Kerberos Ticket

Token Retrieval and Renewal

Long Running Service

Page 21: Spark + HBase

FLOAT/DOUBLE: IEEE-754

0.0 0.2 … MAX -0.0 … -2.0 … MIN

WHERE X <= 2.0D

WHERE X >= -2.0D
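The same check can be run for doubles. The sketch assumes raw big-endian IEEE-754 bytes (the layout HBase's `Bytes.toBytes(double)` produces), where the sign bit makes every negative value sort after every positive one, and negative values sort in reverse numeric order. NaN is ignored, and the function names are illustrative only.

```python
import struct
import sys

DBL_MAX = sys.float_info.max
INF = float("inf")


def double_to_bytes(x):
    """Encode a double as 8 big-endian IEEE-754 bytes."""
    return struct.pack(">d", x)


def byte_ranges_le(bound):
    """Inclusive byte ranges that cover WHERE X <= bound (NaN ignored)."""
    if bound < 0:
        # More negative values have larger encodings, up to -infinity.
        return [(double_to_bytes(bound), double_to_bytes(-INF))]
    return [(double_to_bytes(0.0), double_to_bytes(abs(bound))),
            (double_to_bytes(-0.0), double_to_bytes(-INF))]


def byte_ranges_ge(bound):
    """Inclusive byte ranges that cover WHERE X >= bound (NaN ignored)."""
    if bound > 0:
        return [(double_to_bytes(bound), double_to_bytes(INF))]
    # -abs(bound) is bound itself when negative, and -0.0 when bound is zero.
    return [(double_to_bytes(0.0), double_to_bytes(INF)),
            (double_to_bytes(-0.0), double_to_bytes(-abs(bound)))]


def matches(x, ranges):
    """True if x's encoding falls inside any of the byte ranges."""
    encoded = double_to_bytes(x)
    return any(lo <= encoded <= hi for lo, hi in ranges)


# Sorting by encoded bytes: positives ascending, then negatives descending.
byte_order = sorted([-2.0, DBL_MAX, -0.0, 2.0, -DBL_MAX, 0.0, 0.2, -0.2],
                    key=double_to_bytes)
```

Note that `-0.0` gets its own position at the start of the negative block, so a range built for `X <= 2.0` or `X >= -2.0` has to begin the negative part at the encoding of `-0.0`.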

Page 22: Spark + HBase

HBase Meta Table