
Scalable Data Warehousing on Hadoop

Zsolt Fekete

Budapest Dataforum, 2017

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda: Scalable Data Warehousing on Hadoop

Hadoop Ecosystem

Hive

Solution Architecture


Apache Hadoop History

Google papers

– 2003, GFS distributed filesystem

– 2004, MapReduce computation model/system

– Key idea: distributed computation on commodity hardware

2006, Yahoo! implements Hadoop, makes it open source

– Hadoop = HDFS + MapReduce

Big Data hype starts


Apache Hive History – The Beginning

2007, Facebook started developing Hive: petabyte scale SQL over Hadoop

2008, Hive became open source

Designed for batch processing

Tables cannot be modified, no update, no delete

SQL compiled to MapReduce

Performance limitations of MapReduce
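
To illustrate what "SQL compiled to MapReduce" means, here is a minimal Python sketch. It is not Hive's actual compiler output; it only evaluates the equivalent of SELECT region, SUM(total_spend) FROM customers GROUP BY region as explicit map, shuffle and reduce phases. The sample rows and all function names are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical input rows: (user_id, region, total_spend).
ROWS = [
    (1, "East", 5131),
    (2, "East", 27828),
    (3, "West", 55493),
    (4, "West", 7193),
    (5, "East", 18193),
]


def map_phase(rows):
    """Emit (key, value) pairs; the GROUP BY column becomes the key."""
    for _user_id, region, total_spend in rows:
        yield region, total_spend


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Apply the aggregate (SUM) to the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def run_query(rows):
    """Chain the three phases, like a single MapReduce job."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_query(ROWS))
```

In a real MapReduce job the shuffle moves data across the network and each job writes its results to disk before the next one starts, which is one commonly cited source of the performance limitations noted above.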


What do we expect from an EDW solution?

Scalable storage: HDFS

Fast, scalable SQL engine: Apache Hive

Security

– Authentication, Authorization: Kerberos, Apache Ranger, Apache Knox

– Encrypted storage, encrypted communication: HDFS TDE, wire encryption

– Data governance: Apache Atlas

BI, cubes, data science: Apache Spark, Apache Zeppelin, Druid

Monitoring, configuration, deployment: Apache Ambari

Data ingestion: Apache Sqoop, Apache Storm

Data Lifecycle management: Apache Falcon


Data Access

Example: Ranger, Per-User Row Filtering by Region in Hive

User ID  Region  Total Spend
1        East    5,131
2        East    27,828
3        West    55,493
4        West    7,193
5        East    18,193

Original Query:

SELECT * from CUSTOMERS

WHERE total_spend > 10000

Query Rewrites based on Dynamic Ranger Policies

User 2 (East Region), Dynamic Rewrite:

AND region = 'east'

User 1 (West Region), Dynamic Rewrite:

AND region = 'west'
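
The per-user rewrite above can be sketched in a few lines of Python. This is not Ranger's implementation or API; it is a string-level illustration under simplifying assumptions. The policy table, the user names `user1` and `user2`, and the function `rewrite_query` are hypothetical.

```python
# Hypothetical per-user row-filter policies: user name -> extra predicate.
ROW_FILTER_POLICIES = {
    "user1": "region = 'west'",
    "user2": "region = 'east'",
}

ORIGINAL_QUERY = "SELECT * FROM customers WHERE total_spend > 10000"


def rewrite_query(query: str, user: str) -> str:
    """Append the user's row-filter predicate to a query.

    Simplification: assumes a single SELECT with an optional WHERE clause
    and no trailing GROUP BY / ORDER BY.
    """
    predicate = ROW_FILTER_POLICIES.get(user)
    if predicate is None:
        return query  # no policy for this user: query is unchanged
    # Use AND if the query already filters rows, otherwise start a WHERE.
    keyword = "AND" if "where" in query.lower().split() else "WHERE"
    return f"{query} {keyword} {predicate}"


if __name__ == "__main__":
    for user in ("user1", "user2"):
        print(user, "->", rewrite_query(ORIGINAL_QUERY, user))
```

The point of the design is that the user submits the same query text; the filter is attached by policy on the server side, so it cannot be bypassed by editing the query.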


Hortonworks Data Platform (HDP)

Enterprise ready integration of 100% Open Source projects

HDFS, Hive, Ranger, Atlas, Knox, Sqoop, Ambari, Spark, Zeppelin, Druid, etc…

Why Apache Software Foundation?

Cloud solutions:

– Azure HDInsight

– Cloudbreak (for AWS, Azure, Google Cloud)

– Hortonworks Data Cloud (HDC) on AWS marketplace



Hadoop Ecosystem

Hive



Apache Hive Story

May, 2013: release 0.11.0

– ORC format, Facebook 300+ PB

April, 2014: release 0.13.0

– Apache Tez, vectorization, up to 100x performance improvement

November, 2014: release 0.14.0

– ACID: insert, update, delete

June, 2016: release 2.1.0

– LLAP: Low Latency Analytical Processing


Hive ACID Production-Ready with HDP 2.5

Tested at multi-TB scale using TPC-H benchmark.

– Reliably ingest 400GB+ per day within a partition.

– 10TB+ raw data in a single partition.

– Simultaneous ingest, delete and query.

70+ stabilization improvements.

Supported:

– SQL INSERT, UPDATE, DELETE.

– Streaming API.

HDP-2.6: SQL MERGE under development (HIVE-10924).
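
As a rough illustration of what a SQL MERGE combines into one statement, the Python sketch below applies a batch of source rows to a target table: matched rows are updated or deleted, unmatched rows are inserted. It models only the row-level semantics, not Hive's transactional storage; the function `merge`, the `deleted` flag and the sample rows are hypothetical.

```python
def merge(target, source, key="id"):
    """Apply source rows to target rows, keyed on `key`.

    Mirrors the general shape of a SQL MERGE:
      WHEN MATCHED AND source.deleted     THEN DELETE
      WHEN MATCHED                        THEN UPDATE
      WHEN NOT MATCHED AND NOT deleted    THEN INSERT
    """
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        k = row[key]
        if k in merged:
            if row.get("deleted"):
                del merged[k]          # matched + delete flag: remove the row
            else:
                merged[k].update(row)  # matched: overwrite changed columns
        elif not row.get("deleted"):
            merged[k] = dict(row)      # not matched: insert as a new row
    return sorted(merged.values(), key=lambda r: r[key])


if __name__ == "__main__":
    target = [
        {"id": 1, "region": "East"},
        {"id": 2, "region": "East"},
        {"id": 3, "region": "West"},
    ]
    source = [
        {"id": 2, "region": "West"},   # update
        {"id": 3, "deleted": True},    # delete
        {"id": 4, "region": "East"},   # insert
    ]
    print(merge(target, source))
```

Without MERGE, the same change set needs separate INSERT, UPDATE and DELETE statements.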

Notable Improvements

[Figure: Query Time versus Data Size. Series: Runtime for All Queries (s), Total Compressed Data. Axes: Time (s), 0 to 10000; data size, 0 MB to 5 TB; dates 5.24.16 to 6.1.16.]

[Figure: Times for Inserts and Deletes. Series: time_insert_lineitem, time_insert_orders, time_delete_lineitem, time_delete_orders. Axes: Time (s), 0 to 9000; dates 5.23.16 to 6.1.16.]


Apache Hive: Journey to SQL:2011 Analytics

Data Types                       SQL Features                           File Formats      HDP 2.6
Numeric                          Core SQL Features                      Columnar          ACID MERGE
FLOAT, DOUBLE                    Date, Time and Arithmetical Functions  ORCFile           Multi Subquery
DECIMAL                          INNER, OUTER, CROSS and SEMI Joins     Parquet           Scalar Subqueries
INT, TINYINT, SMALLINT, BIGINT   Derived Table Subqueries               Text              Non-Equijoins
BOOLEAN                          Correlated + Uncorrelated Subqueries   CSV               INTERSECT / EXCEPT
String                           UNION ALL                              Logfile
CHAR, VARCHAR                    UDFs, UDAFs, UDTFs                     Nested / Complex  Recursive CTEs
BLOB (BINARY), CLOB (String)     Common Table Expressions               Avro              NOT NULL Constraints
Date, Time                       UNION DISTINCT                         JSON              Default Values
DATE, TIMESTAMP, Interval Types  Advanced Analytics                     XML               Multi Table Transactions
Complex Types                    OLAP and Windowing Functions           Custom Formats
ARRAY / MAP / STRUCT / UNION     OLAP: Partition, Order by UDAF         Other Features
Nested Data Analytics            CUBE and Grouping Sets                 XPath Analytics
Nested Data Traversal            ACID Transactions
Lateral Views                    INSERT / UPDATE / DELETE
Procedural Extensions            Constraints
HPL/SQL                          Primary / Foreign Key (Non Validated)

Legend: HDP 2.5, HDP 2.6, Projected: HDP 3.0

Track Hive SQL:2011 Complete: HIVE-13554


Hive 2 with LLAP: Architecture Overview

[Diagram: SQL queries arrive over ODBC / JDBC at HiveServer2 (Query Endpoint). Query Coordinators dispatch work to LLAP Daemons running in the YARN Cluster. Each LLAP Daemon hosts Query Executors and an In-Memory Cache (Shared Across All Users). Deep Storage: HDFS and Compatible (S3, WASB, Isilon).]


Hive 2 with LLAP: 7x Performance Boost at 10 TB Scale

[Figure: HDP 2.5 with LLAP: 7x Performance Improvement Across All Query Types (10 TB, 10x d2.8xlarge EC2 Nodes, TPC-DS Queries). One bar group per TPC-DS query. Axes: Query Runtime (s), 0 to 1600; Improvement vs. HDP 2.5 (Ratio), 0 to 35. Series: Runtime (s), Improvement versus HDP 2.4.]


Hive 2 with LLAP: 25+x Performance Boost: Interactive / 1TB Scale

Hive 2 with LLAP averages 26x faster than Hive 1

[Figure: per-query comparison. Series: Hive 1 / Tez Time (s), Hive 2 / LLAP Time (s), Speedup (x Factor). Axes: Query Time (s) (Lower is Better), 0 to 50; Speedup (x Factor), 0 to 250.]


Apache Hive vs. Apache Impala at 10TB

Highlights

10TB scale on 10 identical AWS nodes.

Hive and Impala showed similar times on most smaller queries.

Hive scaled better, with many queries completing in <2m where Impala ran to timeout (3000s).


Apache Hive vs. Presto on a partitioned 1TB dataset

Highlights

Presto lacks basic performance optimizations like dynamic partition pruning.

On a real dataset / workload Presto performs poorly without full re-writes.

Example: Query 55 without re-writes = 185.17s, with re-writes = 16s. LLAP = 1.37s.
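
Dynamic partition pruning, mentioned above, can be sketched as follows. This is a toy Python model, not the implementation in Hive or any other engine: the filter on a small dimension table is evaluated first at run time, and only the fact-table partitions whose keys survive are scanned. The table contents and the function name are hypothetical.

```python
# Hypothetical fact table partitioned by date key: partition -> sale amounts.
SALES_PARTITIONS = {
    "2016-05-30": [10, 20],
    "2016-05-31": [5],
    "2016-06-01": [7, 8, 9],
}

# Hypothetical dimension table: date key -> month name.
DATE_DIM = {
    "2016-05-30": "May",
    "2016-05-31": "May",
    "2016-06-01": "June",
}


def total_sales_for_month(month):
    """Return (total, partitions_scanned) for sales joined to DATE_DIM."""
    # Step 1: evaluate the dimension filter first, at run time.
    wanted_keys = {key for key, m in DATE_DIM.items() if m == month}
    # Step 2: scan only the fact partitions whose key survived the filter.
    scanned = [key for key in SALES_PARTITIONS if key in wanted_keys]
    total = sum(sum(SALES_PARTITIONS[key]) for key in scanned)
    return total, scanned


if __name__ == "__main__":
    print(total_sales_for_month("June"))
```

An engine without this optimization would read every partition of the fact table and discard non-matching rows after the join, unless the query is rewritten by hand to filter on the partition column directly.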


Apache Hive: Fast Facts

Most Queries Per Hour: 100,000 Queries Per Hour (Yahoo Japan)

Analytics Performance: 100 Million rows/s Per Node (with Hive LLAP)

Largest Hive Warehouse: 300+ PB Raw Storage (Facebook)

Largest Cluster: 4,500+ Nodes (Yahoo)


Roadmap At A Glance: Scalable DW in Hadoop

Releases: HDP 2.5 (GA), HDP 2.6 (GA), Beyond HDP 2.6

Fast BI, HDP 2.5 (GA)
• LLAP Technical Preview (25x performance improvements)
• Primary Key / Foreign Key

Fast BI, HDP 2.6 (GA)
• LLAP GA
• Vectorized Decimal
• SSD Cache
• Cache Text Data in LLAP

Fast BI, Beyond HDP 2.6
• Materialized Views
• Druid tables as Hive Indexes for fast drill-down
• Fine-Grained Resource Management

SQL / EDW, HDP 2.5 (GA)
• OLAP Improvements: Multi partition and ordering keys, order by aggregations

SQL / EDW, HDP 2.6 (GA)
• ACID MERGE
• SQL: Cross Product, Multi Subquery, TPC-DS Complete

SQL / EDW, Beyond HDP 2.6
• Column NOT NULL / Defaults
• Surrogate Key Generation
• Multi-Statement Transactions
• Improved HPL/SQL
• Better Unicode support

Cloud
• Cloud Templates for ETL and Presentation Layers
• LLAP Template for Hortonworks Data Cloud
• Full ACID support for S3 / WASB
• Replication / DR

Operations
• Grafana Dashboards
• Hive View: DBA Tooling
• Tez UI: Hive-Oriented Search
• Activity Monitoring
• Schema Recommendations

Current


What Is Apache Hive Now?

Apache Hive is a SQL data warehouse infrastructure that delivers fast, scalable SQL processing on Hadoop and in the Cloud.

Features:

• Extensive SQL:2011 Support

• ACID Transactions

• In-Memory Caching

• Cost-Based Optimizer

• User-Based Dynamic Security

• Replication and Disaster Recovery

• JDBC and ODBC Support

• Compatible with every major BI Tool

• Proven at 300+ PB Scale



Hadoop Ecosystem

Hive



Typical Legacy EDW Implementations: Before Connected Data Platforms


Typical Legacy EDW Implementations: End State Post EDW Optimization


Scalable Data Warehousing on Hadoop

Capabilities and Applications

Batch SQL
• ETL
• Reporting
• Data Mining
• Deep Analytics

OLAP / Cube
• Multidimensional Analytics
• MDX Tools
• Excel

Interactive SQL
• Reporting
• BI Tools: Tableau, Microstrategy, Cognos

Sub-Second SQL
• Ad-Hoc
• Drill-Down
• BI Tools: Tableau, Excel

ACID / MERGE
• Continuous Ingestion from Operational DBMS
• Slowly Changing Dimensions

Legend: Existing, Development, Emerging

Core Platform
• Scale-Out Storage
• Petabyte Scale Processing
• Core SQL Engine
• Apache Tez: Scalable Distributed Processing
• Advanced Cost-Based Optimizer
• Connectivity
• Advanced Security
• JDBC / ODBC
• Comprehensive SQL:2011 Coverage
• MDX


New: Hortonworks EDW Optimization Solution

Syncsort: High-Performance Data Movement

Hadoop: Scalable Storage and Compute

Hive LLAP: High Performance SQL Data Mart

AtScale Intelligence Platform: OLAP Cubes for Higher Performance

Source Data Systems

Fast, scalable SQL analytics

Intelligent in-memory caching

Define OLAP cubes for 10x faster queries

Unified semantic layer for all BI tools

High performance data import from all major EDW platforms

Pre-aggregated data ... Or, full-fidelity data

Hortonworks EDW Optimization Solution makes analytics on Hadoop easier than ever


Accelerate Analytics with AtScale

• Analyze data directly in HDP.

• Use any BI Tool.

• Unified Semantic Layer.

• Support directly from Hortonworks.


Overview: Syncsort DMX-h

• Simple drag-and-drop ETL pipelines.

• Connects to all major data sources in addition to Hadoop.

• Integrated with Ranger, Atlas integration in development.


Adopting Hadoop EDW solution

New technology is needed for data processing that fits better in the Hadoop ecosystem

– Unstructured data

– Computing an inverted index

– Etc…

Archiving data

– HDFS is a low cost storage solution

– On par with tape backup solutions

Keep more data

– Longer time window

– No need to reduce data

Move cold data from EDW to Hadoop


Adopting Hadoop EDW solution

Offload ETL jobs from the current EDW

– Save CPU in the existing EDW deployment, focus it on the truly critical tasks

After adopting Hadoop storage

– Possible to add new data sources

Analysis is still possible

– Hive LLAP

– AtScale

– Integration with BI tools

– Druid


Thanks for Your Attention!

Questions?
