Future of Hadoop Analytics

© 2014 by The 451 Group. All rights reserved

Introducing the Total Data Warehouse

Matthew Aslett, Research Director, Data Management and Analytics, 451 Research


Matthew Aslett

• Research Director, Data Platforms and Analytics

• [email protected]

• www.twitter.com/maslett

Responsible for data management and analytics research agenda

Focus on operational and analytic databases, including NoSQL, NewSQL, and Hadoop

With 451 Research since 2007


Company Overview

One company with 3 operating divisions

Syndicated research, advisory, professional services, datacenter certification, and events

Global focus

270+ staff

1,500+ client organizations: enterprises, vendors, service providers, and investment firms

Organic growth and growth through acquisition


The rise of Apache Hadoop has been driven largely by demand for more flexible approaches to data management and analytics, overcoming the limitations of traditional analytic databases and their adherence to strictly defined schema.

Hadoop is largely complementary to existing data warehouse deployments

However, there is clear evidence that at least some workloads are being migrated from existing enterprise data warehouses to Hadoop

E.g. Teradata’s CEO noted in October 2013 that, on average, 20% of the total ETL workload on Teradata data warehouses could potentially move to Hadoop (4‐8% of the total Teradata data warehouse workload)

That has driven many people to question the extent to which Hadoop will replace the data warehouse

Hadoop and the data warehouse
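The Teradata figures quoted above imply a share of ETL in the overall warehouse workload that the slide leaves unstated. A quick check of the arithmetic, assuming the 4‐8% range is simply 20% of the ETL share (the variable names are illustrative and not from the source):

```python
# Share of the ETL workload that could move to Hadoop (from the slide).
movable_share_of_etl = 0.20

# Resulting share of the total warehouse workload (from the slide): 4% to 8%.
movable_share_of_total = (0.04, 0.08)

# Implied ETL share of the total workload: share_of_total / share_of_etl.
implied_etl_share = tuple(
    round(total / movable_share_of_etl, 2) for total in movable_share_of_total
)

print(implied_etl_share)  # (0.2, 0.4)
```

Under that assumption, ETL would account for roughly 20% to 40% of the total Teradata data warehouse workload.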


Survey conducted: Sept/Oct 2013

Sample: 98

[Chart: responses to the survey question ‘Describe the relationship between Hadoop and the enterprise data warehouse within your organization’. Response categories: Hadoop not yet used; Hadoop for workloads not previously on EDW; Temporarily offloading workloads to Hadoop; Permanently migrating workloads to Hadoop; Hadoop replacing EDW]

Two‐thirds of Hadoop engagement is currently non‐threatening or additive to existing data warehouse deployments


Asking whether Hadoop will replace the data warehouse frames the question incorrectly: it assumes that a ‘data warehouse’ is by default based on an analytic relational database

A data warehouse as an enterprise platform for storing, processing and analyzing data could be based on an analytic database, Hadoop, or a combination of the two

Hadoop is primarily used to handle unstructured and semi‐structured data that is not a good fit, in terms of economics and data formats, for analytic databases

The future analytic data‐processing landscape will be a hybrid of analytic databases and Hadoop, each used where appropriate for the individual analytic use case.

Hadoop replacing the data warehouse?


There are various phrases used to describe this hybrid landscape. In keeping with our ‘Total Data’ terminology, we call this the Total Data Warehouse

The primary platforms in a Total Data Warehouse are expected to be analytic databases and Hadoop

However, we also expect to see the Total Data Warehouse comprise other data storage and processing platforms:
• Exploratory analytics/discovery platforms
• Search
• Graph processing
• Stream processing
• Machine learning
• Log processing
• NoSQL databases
• NewSQL databases

Introducing the Total Data Warehouse


The Total Data Warehouse

[Diagram labels. Analytic workloads: pre‐defined reporting, ad hoc analytics, statistical analytics, predictive analytics, machine learning, MapReduce, search‐based analytics, graph analytics, stream processing, operational intelligence, log processing. Platforms: analytic database, (New) SQL database, NoSQL, exploratory analytics platform, Hadoop Distributed File System, YARN. Data and consumers: structured data, multi‐structured data, applications]




‘Data gravity’ suggests that processing resources will migrate to the platform that stores the most data, or perhaps the most important data

The balance of power is currently with the analytic database

However, Hadoop’s flexibility to support data‐processing engines beyond MapReduce could tip the balance in its favor in the long term

Apache YARN enables multiple versions of MapReduce and allows HDFS to support data‐processing frameworks in addition to MapReduce:
• Native SQL analytics
• Stream processing
• Graph processing
• Bulk synchronous parallel computing
• Machine learning

Apache Spark provides an in‐memory platform supporting high‐performance processing and multiple data processing engines

Data gravity and the Total Data Warehouse


Teradata’s Unified Data Architecture and QueryGrid: enables querying of data in Teradata Database, Aster Database and Hortonworks

Pivotal’s Big Data Suite: Pivotal HD Hadoop distribution, Greenplum Database, GemFire distributed data grid and HAWQ SQL‐on‐Hadoop query engine

Cirro offers a federated approach to performing joins and query processing across multiple sources of data including relational database and Hadoop

Microsoft PolyBase enables SQL Server 2012 PDW analysts to query data in Hadoop using Microsoft’s T‐SQL
• PolyBase is only available as part of the Microsoft Analytics Platform System (APS)
• APS is an appliance that combines SQL Server 2012 PDW with Microsoft’s HDInsight distribution of Apache Hadoop
• APS is also the only way that customers can adopt the SQL Server 2012 PDW data warehousing environment
• For Microsoft at least, Hadoop is an integral part of the next‐generation data warehouse

Example Total Data Warehouses


SQL‐on‐Hadoop engines clearly have a role to play in enabling the Total Data Warehouse:
• SQL‐based querying of data in HDFS
• Federation of queries across multiple data platforms

SQL‐on‐Hadoop initiatives exploded in recent years as a means of uniting the large army of trained SQL analysts with the flexible data storage and processing capabilities of Hadoop

But SQL‐on‐Hadoop engines are not created equal:
• Batch SQL‐on‐Hadoop
• Interactive SQL‐on‐Hadoop
• SQL‐and‐Hadoop
• Operational SQL‐on‐Hadoop

And the various offerings within those categories are differentiated

The role of SQL‐on‐Hadoop


SQL on/and Hadoop

Approach | Details | Examples
Batch SQL‐on‐Hadoop | Native SQL‐like processing of data in HDFS (via MR/Tez) | Hive on MapReduce
Interactive SQL‐on‐Hadoop | Specialist SQL‐based query engine running on Hadoop | Apache Drill, Cloudera Impala, Hive on Tez, Spark SQL
SQL‐and‐Hadoop | Federated querying of data in Hadoop and RDBMS | Teradata, Microsoft, Oracle, IBM
Operational SQL‐on‐Hadoop | Operational database that stores data in HDFS | Splice Machine, Trafodion


SQL on Hadoop examples

Approach | Key features
Hive on Tez | Faster native querying than Hive on MapReduce, HiveQL compatibility, extreme‐scale data joins
Apache Drill | ANSI SQL; Hadoop, MongoDB, Cassandra, Riak, etc.; consume JSON data, query hierarchical data
Cloudera Impala | High performance ad hoc processing, HiveQL compatibility, Parquet file format
Spark SQL | In‐memory SQL processing, Catalyst query optimizer, replacing Shark (Hive on Spark)


Hadoop is largely complementary to existing data warehouse deployments

The future analytic data‐processing landscape will be a hybrid of analytic databases and Hadoop; we call this the Total Data Warehouse

‘Data gravity’ suggests that processing resources will migrate to the platform that stores the most data, or perhaps the most important data

The balance of power is currently with the analytic database, but Hadoop’s flexibility could tip the balance in its favor in the long term

SQL‐on‐Hadoop engines clearly have a role to play in enabling the Total Data Warehouse

But SQL‐on‐Hadoop engines are not created equal

Conclusion


Questions? [email protected] @maslett


Self Service Data Exploration with Apache Drill

© 2014 MapR Technologies

The MapR Distribution including Apache Hadoop

Exponential Growth

500+ Customers

Premier Investors

• >2x annual bookings
• 80% of accounts expand 3X
• 90% software licenses
• < 1% lifetime churn
• > $1B in incremental revenue generated by 1 customer

Big Data

Riding the Wave with Hadoop

The Big Data Platform of Choice


The Power of the Open Source Community

[Diagram: MapR Distribution stack]

Layers: Apache Hadoop and OSS ecosystem (execution engines; data governance and operations); Management; Security; YARN; MapR Data Platform (MapR-DB, MapR-FS)

Component categories: Batch; Streaming; NoSQL & Search; ML, Graph; SQL; Provisioning & coordination; Workflow & Data Governance; Data Integration & Access

Components shown: Pig, Cascading, Spark, MapReduce v1 & v2, Tez*, Spark Streaming, Storm*, HBase, Solr, Accumulo*, Mahout, MLLib, GraphX, Hive, Impala, Shark, Drill, Juju, Savannah*, Whirr, ZooKeeper, Oozie, Falcon*, Sentry*, Knox*, Sqoop, Flume, HttpFS, Hue

* Certification/support planned for 2014


[Chart: total data stored, 1980 to 2020, structured data vs. unstructured data]

Unstructured data will account for more than 80% of the data collected by organizations

Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data


Today’s Data Comes in Different Shapes…

Social Media

Messages

Audio

Sensors

Mobile Data

Email

Clickstream


Distance to Data

[Diagram: layers between the business and the data]

Business (analysts, developers)

Modeling and transformations

Hive and other SQL-on-Hadoop

“Plumbing” development: MapReduce

Data

Existing approaches require a middleman (IT)


Distance to Data

[Diagram: the same layers, with Data Agility bringing the business closer to the data]

Business (analysts, developers)

Data Agility

Modeling and transformations

Hive and other SQL-on-Hadoop

“Plumbing” development: MapReduce

Data

Existing approaches require a middleman (IT)


Why Improve Distance to Data?

• Enable rapid data exploration and application development

• IT should provide a valuable service without “getting in the way”

• Can’t add DBAs to keep up with the exponential data growth

• Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users

Reduce the burden on IT

Improve time to value


APACHE DRILL

• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications

40+ contributors

150+ years of experience building databases and distributed systems


Evolution Towards Self-Service Data Exploration

Approach | Data Modeling and Transformation | Data Visualization
Traditional BI w/ RDBMS | IT-driven | IT-driven
Self-Service BI w/ RDBMS | IT-driven | Self-service
SQL-on-Hadoop | IT-driven | Self-service
Self-Service Data Exploration (zero-day analytics) | Not needed | Self-service


MapR Optimized Data Architecture

[Diagram labels]

Sources: relational, SaaS, mainframe; documents, emails; log files, clickstreams; sensors; blogs, tweets, link data

Data Movement; Data Access

Data warehouse

MapR Distribution for Hadoop: Data Transformation, Enrichment and Integration
• Batch / Search (MR, Spark, Hive, Pig, …)
• Streaming (Spark Streaming, Storm)
• NoSQL ODBMS (HBase, Accumulo, …)
• MapR Data Platform: MapR-DB, MapR-FS

Analytics: search; schema-less data exploration; BI, reporting; ad-hoc integrated analytics

Operational Apps: recommendations, fraud detection, logistics

Optimized Data Architecture; Machine Learning


(1) Self-Describing Data is Ubiquitous

Flat files in DFS
• Complex data (Thrift, Avro, protobuf)
• Columnar data (Parquet, ORC)
• Loosely defined (JSON)
• Traditional files (CSV, TSV)

Data stored in NoSQL stores
• Relational-like (rows, columns)
• Sparse data (NoSQL maps)
• Embedded blobs (JSON)
• Document stores (nested objects)

    {
      name: {
        first: Michael,
        last: Smith
      },
      hobbies: [ski, soccer],
      district: Los Altos
    }
    {
      name: {
        first: Jennifer,
        last: Gates
      },
      hobbies: [sing],
      preschool: CCLC
    }


(2) Drill’s Data Model is Flexible

[Diagram: data formats placed on two axes, flat to complex and fixed schema to schema-less, with flexibility increasing along both. Formats shown: CSV, TSV; Parquet, Avro; JSON, BSON; HBase]

RDBMS/SQL-on-Hadoop table

Name | Gender | Age
Michael | M | 6
Jennifer | F | 3

Apache Drill table

    {
      name: {
        first: Michael,
        last: Smith
      },
      hobbies: [ski, soccer],
      district: Los Altos
    }
    {
      name: {
        first: Jennifer,
        last: Gates
      },
      hobbies: [sing],
      preschool: CCLC
    }


(3) Drill Supports Schema Discovery On-The-Fly

Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)

Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
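To make the idea of discovering a schema on the fly concrete, here is a minimal Python sketch that infers field paths and value types directly from self-describing records, with no schema declared beforehand. It is a toy illustration only and does not reflect how Drill implements schema discovery; the function name `discover_schema` is hypothetical, and the records are the slide's two examples rewritten as valid JSON.

```python
import json

# Two self-describing records with different fields, modeled on the slide's example.
records = [
    json.loads('{"name": {"first": "Michael", "last": "Smith"}, '
               '"hobbies": ["ski", "soccer"], "district": "Los Altos"}'),
    json.loads('{"name": {"first": "Jennifer", "last": "Gates"}, '
               '"hobbies": ["sing"], "preschool": "CCLC"}'),
]


def discover_schema(rows):
    """Map each dotted field path seen in the data to the set of type names observed."""
    schema = {}

    def walk(value, path):
        if isinstance(value, dict):
            # Descend into nested objects, extending the dotted path.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            schema.setdefault(path, set()).add(type(value).__name__)

    for row in rows:
        walk(row, "")
    return schema


schema = discover_schema(records)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

The resulting schema is the union of the fields seen so far, so `district` and `preschool` both appear even though each occurs in only one record.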


Quick Tour: Self-Service Data Exploration with Apache Drill


Zero to Results in 2 Minutes (3 Commands)

Install

    $ tar xzf apache-drill.tar.gz

Launch shell (embedded mode)

    $ apache-drill/bin/sqlline -u jdbc:drill:zk=local

Query

    0: jdbc:drill:zk=local> SELECT count(*) AS incidents, columns[1] AS category
    FROM dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv`
    GROUP BY columns[1]
    ORDER BY incidents DESC;

Results

    +------------+------------+
    | incidents | category |
    +------------+------------+
    | 8372 | LARCENY/THEFT |
    | 4247 | OTHER OFFENSES |
    | 3765 | NON-CRIMINAL |
    | 2502 | ASSAULT |
    ...
    35 rows selected (0.847 seconds)
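For readers less familiar with SQL, the query above amounts to counting rows per value of the second CSV column and sorting by count. A rough Python equivalent is sketched below; the three sample rows are hypothetical stand-ins, since the real incidents file is not reproduced in the slides.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the incidents file; the real CSV is not reproduced here.
sample_csv = io.StringIO(
    "1001,LARCENY/THEFT,Monday\n"
    "1002,ASSAULT,Monday\n"
    "1003,LARCENY/THEFT,Tuesday\n"
)

# Without a declared schema each row is just a list of values,
# so the category is addressed by position, like columns[1] in the query.
incidents = Counter(row[1] for row in csv.reader(sample_csv))

# Equivalent of ORDER BY incidents DESC.
for category, count in incidents.most_common():
    print(count, category)
```

The point of the comparison is that no table definition or load step precedes the query: the file is addressed directly and its columns by position.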


Data Source is in the Query

    SELECT timestamp, message
    FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
    WHERE errorLevel > 2

The table reference has three parts:

A storage engine instance (dfs1)
- DFS
- HBase
- Hive Metastore/HCatalog

A workspace (logs)
- Sub-directory
- Hive database

A table (`AppServerLogs/2014/Jan/p001.parquet`)
- pathnames
- HBase table
- Hive table


Query Directory Trees

    # Query file: How many errors per level in Jan 2014?
    SELECT errorLevel, count(*)
    FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
    GROUP BY errorLevel;

    # Query directory sub-tree: How many errors per level?
    SELECT errorLevel, count(*)
    FROM dfs.logs.`/AppServerLogs`
    GROUP BY errorLevel;

    # Query some partitions: How many errors per level by month from 2012?
    SELECT errorLevel, count(*)
    FROM dfs.logs.`/AppServerLogs`
    WHERE dirs[1] >= 2012
    GROUP BY errorLevel, dirs[2];
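The third query treats directory names as if they were columns. The Python sketch below illustrates that idea on a small temporary tree: every row read from a file carries the directory names on its path, which can then be filtered and grouped. The layout, file name and `scan` helper are hypothetical, and the sketch indexes directories from zero, whereas the slide writes `dirs[1]` for the year and `dirs[2]` for the month.

```python
import tempfile
from collections import Counter
from pathlib import Path


def scan(root):
    """Yield (dirs, line) per line of every .log file under root.

    dirs holds the directory names between root and the file.
    """
    root = Path(root)
    for path in sorted(root.rglob("*.log")):
        dirs = path.relative_to(root).parts[:-1]
        for line in path.read_text().splitlines():
            yield dirs, line


# Build a tiny hypothetical tree: <root>/<year>/<month>/part.log,
# with one error level per line.
with tempfile.TemporaryDirectory() as root:
    layout = {("2011", "Dec"): "3\n", ("2012", "Jan"): "1\n3\n", ("2013", "Feb"): "3\n"}
    for (year, month), body in layout.items():
        folder = Path(root, year, month)
        folder.mkdir(parents=True)
        (folder / "part.log").write_text(body)

    # Keep only partitions from 2012 onward and count rows per (errorLevel, month).
    counts = Counter(
        (int(line), dirs[1]) for dirs, line in scan(root) if int(dirs[0]) >= 2012
    )

print(sorted(counts.items()))
```

A real engine could go further and skip the 2011 directory without reading it; this sketch filters only after listing the files.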


Works with HBase and Embedded Blobs

    # Query an HBase table directly (no schemas)
    SELECT cf1.month, cf1.year
    FROM hbase.table1;

    # Embedded JSON value inside column profileBlob inside column family cf1 of the HBase table users
    SELECT profile.name, count(profile.children)
    FROM (
      SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile
      FROM hbase.users
    )
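The second query above decodes a JSON document stored as raw bytes in a column and then reads fields from inside it. A minimal Python sketch of that two-step pattern follows. The sample rows and the `convert_from_json` helper are hypothetical, and the sketch assumes the intent is to report how many children each profile lists, which is only one reading of `count(profile.children)`.

```python
import json

# Hypothetical rows shaped like an HBase table: a row key plus raw bytes per column.
users = [
    {"rowkey": b"u1", "cf1": {"profileBlob": b'{"name": "Ann", "children": ["Bo", "Cy"]}'}},
    {"rowkey": b"u2", "cf1": {"profileBlob": b'{"name": "Dan", "children": []}'}},
]


def convert_from_json(raw):
    """Decode UTF-8 bytes holding a JSON document into Python objects."""
    return json.loads(raw.decode("utf-8"))


# Inner query: expose the decoded blob as a `profile` column.
profiles = [convert_from_json(row["cf1"]["profileBlob"]) for row in users]

# Outer query: project fields from inside the decoded document.
result = [(profile["name"], len(profile["children"])) for profile in profiles]
print(result)
```

Once decoded, the blob behaves like any other nested record, so the outer step needs no knowledge of how the data was stored.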


Combine Data Sources on the Fly

    # Join log directory with JSON file (user profiles) to identify the name and email address for anyone associated with an error message.
    SELECT DISTINCT users.name, users.emails.work
    FROM dfs.logs.`/data/logs` logs,
         dfs.users.`/profiles.json` users
    WHERE logs.uid = users.id AND
          logs.errorLevel > 5;

    # Join a Hive table and an HBase table (without Hive metadata) to determine the number of tweets per user
    SELECT users.name, count(*) as tweetCount
    FROM hive.social.tweets tweets,
         hbase.users users
    WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8')
    GROUP BY tweets.userId;
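The first join above can be pictured as follows: filter the log records, look up the matching user profile by key, and keep distinct name and email pairs. The Python sketch below uses a dictionary lookup, which is one common join strategy and not necessarily the one Drill would choose; all sample records are hypothetical.

```python
# Hypothetical records standing in for the log directory and the profiles file.
logs = [
    {"uid": 1, "errorLevel": 7},
    {"uid": 2, "errorLevel": 3},
    {"uid": 1, "errorLevel": 9},
]
users = [
    {"id": 1, "name": "Ann", "emails": {"work": "ann@example.com"}},
    {"id": 2, "name": "Dan", "emails": {"work": "dan@example.com"}},
]

# Build side: index the user profiles by the join key.
users_by_id = {user["id"]: user for user in users}

# Probe side: filter the logs, look up the matching profile,
# and de-duplicate the output pairs (the DISTINCT in the query).
matches = {
    (users_by_id[log["uid"]]["name"], users_by_id[log["uid"]]["emails"]["work"])
    for log in logs
    if log["errorLevel"] > 5 and log["uid"] in users_by_id
}
print(sorted(matches))
```

The nested `emails.work` field is read directly from the profile record, mirroring the dotted path in the query.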


Summary

• Enable rapid data exploration and application development while reducing the burden on IT
• Apache Drill 0.5 available now
• Get involved
  – Download and play: http://incubator.apache.org/drill/
  – Ask questions: [email protected]
  – Contribute: http://github.com/apache/incubator-drill/
  – Join the Drill team at MapR
• Email [email protected]
• www.mapr.com/careers
