Scalable Data Warehousing on Hadoop Zsolt Fekete Budapest Dataforum, 2017


Page 1: Scalable Data Warehousing on Hadoop - BI Consultingbiconsulting.hu/letoltes/2017budapestdata/fekete...Scalable Data Warehousing on Hadoop Hadoop Ecosystem Hive ... –ORC format, Facebook

Scalable Data Warehousing on Hadoop

Zsolt Fekete

Budapest Dataforum, 2017


2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda: Scalable Data Warehousing on Hadoop

Hadoop Ecosystem

Hive

Solution Architecture


Apache Hadoop History

Google papers

– 2003, GFS distributed filesystem

– 2004, Map-Reduce computation model/system

– Key idea: distributed computation on commodity hardware

2006, Yahoo! implements Hadoop, makes it open source

– Hadoop = HDFS + MapReduce

Big Data hype starts


Apache Hive History – The Beginning

2007, Facebook started developing Hive: petabyte scale SQL over Hadoop

2008, Hive became open source

Designed for batch processing

Tables cannot be modified, no update, no delete

SQL compiled to MapReduce

Performance limitations of MapReduce
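To make "SQL compiled to MapReduce" concrete, here is a minimal sketch of how a query such as `SELECT region, SUM(total_spend) FROM customers GROUP BY region` maps onto map, shuffle and reduce phases. It is a single-process illustration in plain Python, not Hive's actual compiler output; the sample rows and all function names are assumptions made for this example.

```python
from collections import defaultdict

# Illustrative rows in the shape (user_id, region, total_spend).
ROWS = [
    (1, "East", 5131),
    (2, "East", 27828),
    (3, "West", 55493),
    (4, "West", 7193),
    (5, "East", 18193),
]


def map_phase(rows):
    """Emit one (key, value) pair per input row, as a mapper would."""
    for _user_id, region, total_spend in rows:
        yield region, total_spend


def shuffle_phase(pairs):
    """Group values by key; the framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate the values of each key, as a reducer computing SUM() would."""
    return {key: sum(values) for key, values in groups.items()}


def sum_spend_by_region(rows):
    """GROUP BY region with SUM(total_spend), expressed as map/shuffle/reduce."""
    return reduce_phase(shuffle_phase(map_phase(rows)))
```

In a real cluster each phase runs on many machines and intermediate results are written to disk between stages, which is one source of the performance limitations noted above.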


What do we expect from an EDW solution?

Scalable storage: HDFS

Fast, scalable SQL engine: Apache Hive

Security

– Authentication, Authorization: Kerberos, Apache Ranger, Apache Knox

– Encrypted storage, encrypted communication: HDFS TDE, wire encryption

– Data governance: Apache Atlas

BI, cubes, data science: Apache Spark, Apache Zeppelin, Druid

Monitoring, configuration, deployment: Apache Ambari

Data ingestion: Apache Sqoop, Apache Storm

Data Lifecycle management: Apache Falcon


Data Access

User ID   Region   Total Spend
1         East           5,131
2         East          27,828
3         West          55,493
4         West           7,193
5         East          18,193

Example: Ranger, Per-User Row Filtering by Region in Hive

Original Query:

SELECT * from CUSTOMERS
WHERE total_spend > 10000

Query Rewrites based on Dynamic Ranger Policies

User 2 (East Region), Dynamic Rewrite:

SELECT * from CUSTOMERS
WHERE total_spend > 10000
AND region = 'east'

User 1 (West Region), Dynamic Rewrite:

SELECT * from CUSTOMERS
WHERE total_spend > 10000
AND region = 'west'
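The effect of per-user row filtering can be imitated in a few lines. The sketch below is a toy, assuming an in-memory SQLite table in place of Hive and a hypothetical `ROW_FILTERS` dictionary in place of Ranger policies; it only appends a predicate to a query string and is not how Ranger itself is implemented.

```python
import sqlite3

# Hypothetical policy store: one row-filter predicate per user.
ROW_FILTERS = {
    "user1": "region = 'west'",
    "user2": "region = 'east'",
}

# The query as the user wrote it (it already contains a WHERE clause).
ORIGINAL_QUERY = "SELECT user_id FROM customers WHERE total_spend > 10000"


def make_demo_db():
    """Create an in-memory table holding the five sample customers."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (user_id INTEGER, region TEXT, total_spend INTEGER)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [
            (1, "east", 5131),
            (2, "east", 27828),
            (3, "west", 55493),
            (4, "west", 7193),
            (5, "east", 18193),
        ],
    )
    return conn


def rewrite(query, user):
    """Append the user's row-filter predicate to the query."""
    return f"{query} AND {ROW_FILTERS[user]}"


def run_as(conn, user, query):
    """Run the rewritten query and return the matching user IDs, sorted."""
    rows = conn.execute(rewrite(query, user)).fetchall()
    return sorted(user_id for (user_id,) in rows)
```

Both users submit the same `ORIGINAL_QUERY`, but each sees only the rows of their own region, because the filter is attached on the server side rather than in the client's SQL.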


Hortonworks Data Platform (HDP)

Enterprise ready integration of 100% Open Source projects

HDFS, Hive, Ranger, Atlas, Knox, Sqoop, Ambari, Spark, Zeppelin, Druid, etc…

Why Apache Software Foundation?

Cloud solutions:

– Azure HDInsight

– Cloudbreak (for AWS, Azure, Google Cloud)

– Hortonworks Data Cloud (HDC) on AWS marketplace


Agenda: Scalable Data Warehousing on Hadoop

Hadoop Ecosystem

Hive

Solution Architecture


Apache Hive Story

May, 2013: release 0.11.0

– ORC format, Facebook 300+ PB

April, 2014: release 0.13.0

– Apache Tez, vectorization, up to 100x performance improvement

November, 2014: release 0.14.0

– ACID: insert, update, delete

June, 2016: release 2.1.0

– LLAP: Low Latency Analytical Processing


Hive ACID Production-Ready with HDP 2.5

Tested at multi-TB scale using TPC-H benchmark.

– Reliably ingest 400GB+ per day within a partition.

– 10TB+ raw data in a single partition.

– Simultaneous ingest, delete and query.

70+ stabilization improvements.

Supported:

– SQL INSERT, UPDATE, DELETE.

– Streaming API.

HDP-2.6: SQL MERGE under development (HIVE-10924).
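As a rough illustration of what a MERGE statement adds on top of separate INSERT, UPDATE and DELETE statements, the sketch below applies a batch of change records to a target table in one pass. It models tables as Python dictionaries and uses a hypothetical `deleted` flag on change records; it shows the semantics only and says nothing about how Hive executes MERGE.

```python
def merge_into(target, changes):
    """Return a new table with the changes applied, MERGE-style.

    target:  dict mapping id -> row dict
    changes: list of row dicts, each with an 'id' key and an optional
             'deleted' flag (a hypothetical convention for this sketch)
    """
    result = dict(target)
    for change in changes:
        key = change["id"]
        row = {k: v for k, v in change.items() if k != "deleted"}
        if key in result and change.get("deleted"):
            # WHEN MATCHED AND deleted THEN DELETE
            del result[key]
        elif key in result:
            # WHEN MATCHED THEN UPDATE
            result[key] = {**result[key], **row}
        elif not change.get("deleted"):
            # WHEN NOT MATCHED THEN INSERT
            result[key] = row
    return result
```

For example, merging the changes `[{"id": 1, "region": "North"}, {"id": 2, "deleted": True}, {"id": 3, "region": "South"}]` into a table holding rows 1 and 2 updates row 1, removes row 2 and inserts row 3, leaving the input table untouched.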

Notable Improvements

[Figure: Query Time versus Data Size. Series: Runtime for All Queries (s), Total Compressed Data. Axes: Time (s), 0 to 10000; data size, 0 MB to 5 TB; dates 5.24.16 to 6.1.16.]

[Figure: Times for Inserts and Deletes. Series: time_insert_lineitem, time_insert_orders, time_delete_lineitem, time_delete_orders. Axes: Time (s), 0 to 9000; dates 5.23.16 to 6.1.16.]


Apache Hive: Journey to SQL:2011 Analytics

Data Types

– Numeric: FLOAT, DOUBLE; DECIMAL; INT, TINYINT, SMALLINT, BIGINT; BOOLEAN
– String: CHAR, VARCHAR; BLOB (BINARY), CLOB (String)
– Date, Time: DATE, TIMESTAMP, Interval Types
– Complex Types: ARRAY / MAP / STRUCT / UNION
– Nested Data Analytics: Nested Data Traversal; Lateral Views
– Procedural Extensions: HPL/SQL

SQL Features

– Core SQL Features: Date, Time and Arithmetical Functions; INNER, OUTER, CROSS and SEMI Joins; Derived Table Subqueries; Correlated + Uncorrelated Subqueries; UNION ALL; UDFs, UDAFs, UDTFs; Common Table Expressions; UNION DISTINCT
– Advanced Analytics: OLAP and Windowing Functions; OLAP: Partition, Order by UDAF; CUBE and Grouping Sets
– ACID Transactions: INSERT / UPDATE / DELETE
– Constraints: Primary / Foreign Key (Non Validated)

File Formats

– Columnar: ORCFile; Parquet
– Text: CSV; Logfile
– Nested / Complex: Avro; JSON; XML; Custom Formats
– Other Features: XPath Analytics

HDP 2.6

– ACID MERGE
– Multi Subquery
– Scalar Subqueries
– Non-Equijoins
– INTERSECT / EXCEPT
– Recursive CTEs
– NOT NULL Constraints
– Default Values
– Multi Table Transactions

Legend (colour coding not preserved here): HDP 2.5; HDP 2.6; Projected: HDP 3.0

Track Hive SQL:2011 Complete: HIVE-13554


Hive 2 with LLAP: Architecture Overview

[Diagram: LLAP architecture]

– SQL Queries arrive over ODBC / JDBC at HiveServer2 (Query Endpoint)
– Query Coordinators (three shown)
– YARN Cluster: LLAP Daemons (four shown), each with Query Executors
– In-Memory Cache (Shared Across All Users)
– Deep Storage: HDFS and Compatible (S3, WASB, Isilon)
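The role of a shared in-memory cache in front of deep storage can be sketched as follows. This is a toy least-recently-used cache, chosen for simplicity and not taken from LLAP; the class name, the `read_from_storage` callback and the string keys are all assumptions of this example.

```python
from collections import OrderedDict


class SharedChunkCache:
    """Toy LRU cache that sits between query executors and deep storage."""

    def __init__(self, capacity, read_from_storage):
        self.capacity = capacity
        self.read_from_storage = read_from_storage  # slow path, e.g. an HDFS read
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        """Return the chunk for key, reading deep storage only on a miss."""
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = self.read_from_storage(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return value
```

Because the cache is shared, a chunk loaded for one user's query is served from memory when another user's query touches the same data.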


Hive 2 with LLAP: 7x Performance Boost at 10 TB Scale

[Figure: HDP 2.5 with LLAP: 7x Performance Improvement Across All Query Types (10 TB, 10x d2.8xlarge EC2 Nodes, TPC-DS Queries). Series: Runtime (s); Improvement versus HDP 2.4. Axes: Query Runtime (s), 0 to 1600; Improvement Vs. HDP 2.5 (Ratio), 0 to 35. X-axis, in chart order: query52, query12, query55, query82, query79, query79, query91, query73, query66, query58, query49, query48, query42, query3, query7, query43, query45, query19, query20, query26, query46, query89, query25, query93, query90, query34, query15, query13, query85, query39, query27, query40, query32, query98, query84, query87, query68, query96, query17, query21, query50, query88, query71, query64, query76.]


Hive 2 with LLAP: 25+x Performance Boost: Interactive / 1TB Scale

[Figure: Hive 2 with LLAP averages 26x faster than Hive 1. Series: Hive 1 / Tez Time (s), Hive 2 / LLAP Time (s), Speedup (x Factor). Axes: Speedup (x Factor), 0 to 50; Query Time (s) (Lower is Better), 0 to 250.]


Apache Hive vs. Apache Impala at 10TB

10TB scale on 10 identical AWS nodes.

Hive and Impala showed similar times on most smaller queries.

Hive scaled better, with many queries completing in <2m where Impala ran to timeout (3000s).

Highlights


Apache Hive vs. Presto on a partitioned 1TB dataset.

Presto lacks basic performance optimizations like dynamic partition pruning.

On a real dataset / workload Presto performs poorly without full re-writes.

Example: Query 55 without re-writes = 185.17s, with re-writes = 16s. LLAP = 1.37s.
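Dynamic partition pruning means that the partitions of a large fact table to read are decided at run time, from the result of filtering a small dimension table, instead of scanning every partition and discarding rows afterwards. The sketch below shows the idea on in-memory data; the table contents, names and the `season` attribute are invented for illustration and do not come from the benchmark.

```python
# Fact table stored as partitions keyed by month; rows are (item_id, amount).
SALES_PARTITIONS = {
    "2017-01": [(1, 10.0), (2, 5.0)],
    "2017-02": [(1, 7.5)],
    "2017-03": [(3, 2.0), (1, 1.0)],
}

# Small dimension table describing each month.
DATE_DIM = [
    {"month": "2017-01", "season": "winter"},
    {"month": "2017-02", "season": "winter"},
    {"month": "2017-03", "season": "spring"},
]


def total_sales_with_pruning(partitions, date_dim, predicate):
    """Sum sales for the months selected by a predicate on the dimension table.

    Returns (total, scanned), where scanned lists the partitions actually read.
    """
    # Step 1: filter the small dimension table first.
    wanted = {row["month"] for row in date_dim if predicate(row)}

    # Step 2: read only the fact partitions whose key survived the filter.
    scanned = [key for key in sorted(partitions) if key in wanted]
    total = 0.0
    for key in scanned:
        total += sum(amount for _item_id, amount in partitions[key])
    return total, scanned
```

With a predicate selecting only spring, a single partition is read; without pruning, all three would be scanned and most rows thrown away after the join.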

Highlights


Apache Hive: Fast Facts

Most Queries Per Hour: 100,000 Queries Per Hour (Yahoo Japan)

Analytics Performance: 100 Million rows/s Per Node (with Hive LLAP)

Largest Hive Warehouse: 300+ PB Raw Storage (Facebook)

Largest Cluster: 4,500+ Nodes (Yahoo)


Roadmap At A Glance

Scalable DW in Hadoop

Columns: HDP 2.5 (GA), HDP 2.6 (GA), Beyond HDP 2.6

Fast BI

– HDP 2.5 (GA): LLAP Technical Preview (25x performance improvements); Primary Key / Foreign Key
– HDP 2.6 (GA): LLAP GA; Vectorized Decimal; SSD Cache; Cache Text Data in LLAP
– Beyond HDP 2.6: Materialized Views; Druid tables as Hive Indexes for fast drill-down; Fine-Grained Resource Management

SQL / EDW

– HDP 2.5 (GA): OLAP Improvements: Multi partition and ordering keys, order by aggregations
– HDP 2.6 (GA): ACID MERGE; SQL: Cross Product, Multi Subquery, TPC-DS Complete
– Beyond HDP 2.6: Column NOT NULL / Defaults; Surrogate Key Generation; Multi-Statement Transactions; Improved HPL/SQL; Better Unicode support

Cloud

– Cloud Templates for ETL and Presentation Layers
– LLAP Template for Hortonworks Data Cloud
– Full ACID support for S3 / WASB
– Replication / DR

Operations

– Grafana Dashboards
– Hive View: DBA Tooling
– Tez UI: Hive-Oriented Search
– Activity Monitoring
– Schema Recommendations

Current


What Is Apache Hive Now?

Apache Hive is a SQL data warehouse infrastructure that delivers fast, scalable SQL processing on Hadoop and in the Cloud.

Features:

• Extensive SQL:2011 Support

• ACID Transactions

• In-Memory Caching

• Cost-Based Optimizer

• User-Based Dynamic Security

• Replication and Disaster Recovery

• JDBC and ODBC Support

• Compatible with every major BI Tool

• Proven at 300+ PB Scale


Agenda: Scalable Data Warehousing on Hadoop

Hadoop Ecosystem

Hive

Solution Architecture


Typical Legacy EDW Implementations: Before Connected Data Platforms


Typical Legacy EDW Implementations: End State Post EDW Optimization


Scalable Data Warehousing on Hadoop

Capabilities and their applications:

– Batch SQL: ETL; Reporting; Data Mining; Deep Analytics
– OLAP / Cube: Multidimensional Analytics; MDX Tools; Excel
– Interactive SQL: Reporting; BI Tools: Tableau, Microstrategy, Cognos
– Sub-Second SQL: Ad-Hoc; Drill-Down; BI Tools: Tableau, Excel
– ACID / MERGE: Continuous Ingestion from Operational DBMS; Slowly Changing Dimensions

Core Platform:

– Scale-Out Storage
– Petabyte Scale Processing
– Core SQL Engine
– Apache Tez: Scalable Distributed Processing
– Advanced Cost-Based Optimizer
– Connectivity
– Advanced Security
– JDBC / ODBC
– Comprehensive SQL:2011 Coverage
– MDX

Legend (colour coding not preserved here): Existing; Development; Emerging


New: Hortonworks EDW Optimization Solution

Syncsort: High-Performance Data Movement

Hadoop: Scalable Storage and Compute

Hive LLAP: High Performance SQL Data Mart

AtScale Intelligence Platform: OLAP Cubes for Higher Performance

Source Data Systems

Fast, scalable SQL analytics

Intelligent in-memory caching

Define OLAP cubes for 10x faster queries

Unified semantic layer for all BI tools

High performance data import from all major EDW platforms

Pre-aggregated data

... Or, full-fidelity data

Hortonworks EDW Optimization Solution makes analytics on Hadoop easier than ever


Accelerate Analytics with AtScale

• Analyze data directly in HDP.

• Use any BI Tool.

• Unified Semantic Layer.

• Support directly from Hortonworks.


Overview: Syncsort DMX-h

• Simple drag-and-drop ETL pipelines.

• Connects to all major data sources in addition to Hadoop.

• Integrated with Ranger, Atlas integration in development.


Adopting Hadoop EDW solution

New technology is needed for data processing that fits better in the Hadoop ecosystem

– Unstructured data

– Computing an inverted index

– Etc…

Archiving data

– HDFS is a low cost storage solution

– On par with tape backup solutions

Keep more data

– Longer time window

– No need to reduce data

Move cold data from EDW to Hadoop


Adopting Hadoop EDW solution

Offload ETL jobs from the current EDW

– Save CPU in the existing EDW deployment, so it can focus on the truly critical tasks

After adopting Hadoop storage

– Possible to add new data sources

Analysis is still possible

– Hive LLAP

– AtScale

– Integration with BI tools

– Druid


Thanks for the Attention!

Questions?