future of-hadoop-analytics

© 2014 MapR Technologies 1 © 2014 MapR Technologies

Upload: mapr-data-technologies

Post on 28-Jan-2018


Category:

Technology



TRANSCRIPT

Page 1: Future of-hadoop-analytics


Page 2: Future of-hadoop-analytics

© 2014 by The 451 Group. All rights reserved 

Introducing the Total Data Warehouse

Matthew Aslett, Research Director, Data Management and Analytics, 451 Research

Page 3: Future of-hadoop-analytics


Matthew Aslett
• Research Director, Data Platforms and Analytics
• [email protected]
• www.twitter.com/maslett

Responsible for data management and analytics research agenda

Focus on operational and analytic databases, including NoSQL, NewSQL, and Hadoop

With 451 Research since 2007

Page 4: Future of-hadoop-analytics


Company Overview

One company with 3 operating divisions

Syndicated research, advisory, professional services, datacenter certification, and events

Global focus

270+ staff

1,500+ client organizations: enterprises, vendors, service providers, and investment firms

Organic growth and growth through acquisition

Page 5: Future of-hadoop-analytics


The rise of Apache Hadoop has been driven largely by demand for more flexible approaches to data management and analytics
• Overcoming the limitations of traditional analytic databases and their adherence to strictly defined schema

Hadoop is largely complementary to existing data warehouse deployments

However, there is clear evidence that at least some workloads are being migrated from existing enterprise data warehouses to Hadoop

E.g. Teradata’s CEO noted in October 2013 that, on average, 20% of the total ETL workload on Teradata data warehouses could potentially move to Hadoop (4‐8% of the total Teradata data warehouse workload)

That has driven many people to question the extent to which Hadoop will replace the data warehouse

Hadoop and the data warehouse
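The Teradata figures quoted on this slide can be cross-checked with simple arithmetic: if 20% of the ETL workload corresponds to 4-8% of the total workload, then ETL itself must account for roughly 20-40% of the total. The sketch below only restates that division; the function name is hypothetical and the inputs are the percentages quoted on the slide.

```python
def implied_etl_share_pct(movable_pct_of_etl: float, movable_pct_of_total: float) -> float:
    """Return ETL's implied share of the total warehouse workload, in percent.

    If `movable_pct_of_etl` percent of ETL equals `movable_pct_of_total`
    percent of the total workload, then
    ETL share = movable_pct_of_total / movable_pct_of_etl.
    """
    return 100 * movable_pct_of_total / movable_pct_of_etl


# Figures from the slide: 20% of ETL is 4-8% of the total workload.
low = implied_etl_share_pct(20, 4)
high = implied_etl_share_pct(20, 8)
print(f"ETL is roughly {low:.0f}% to {high:.0f}% of the total workload")
```

Read this way, the quoted range says more about how much of a warehouse's work is ETL than about how much of it Hadoop can absorb.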

Page 6: Future of-hadoop-analytics


Survey conducted: Sept/Oct 2013
Sample: 98

Hadoop and the data warehouse

Survey question: Describe the relationship between Hadoop and the enterprise data warehouse within your organization

[Chart: share of respondents per response category; values not recoverable from the transcript]
• Hadoop not yet used
• Hadoop for workloads not previously on EDW
• Temporarily offloading workloads to Hadoop
• Permanently migrating workloads to Hadoop
• Hadoop replacing EDW

Two‐thirds of Hadoop engagement is currently non‐threatening or additive to existing data warehouse deployments

Page 7: Future of-hadoop-analytics


The question of Hadoop replacing the data warehouse is framed incorrectly: it assumes that a ‘data warehouse’ is by default based on an analytic relational database

A data warehouse as an enterprise platform for storing, processing and analyzing data could be based on an analytic database, Hadoop, or a combination of the two

Hadoop is primarily used to handle unstructured and semi-structured data that is not a good fit, in terms of economics and data formats, for analytic databases

The future analytic data-processing landscape will be a hybrid of analytic databases and Hadoop, each used where appropriate for the individual analytic use case.

Hadoop replacing the data warehouse?

Page 8: Future of-hadoop-analytics


There are various phrases used to describe this hybrid landscape; in keeping with our ‘Total Data’ terminology, we call this the Total Data Warehouse

The primary platforms in a Total Data Warehouse are expected to be analytic databases and Hadoop

However, we also expect to see the Total Data Warehouse comprise other data storage and processing platforms
• Exploratory analytics/discovery platforms
• Search
• Graph processing
• Stream processing
• Machine learning
• Log processing
• NoSQL databases
• NewSQL databases

Introducing the Total Data Warehouse

Page 9: Future of-hadoop-analytics


The Total Data Warehouse

[Diagram: the Total Data Warehouse as a set of storage and processing platforms. Labels recoverable from the slide:]

Analytic workloads: pre-defined reporting, ad hoc analytics, statistical analytics, predictive analytics, machine learning, MapReduce, search-based analytics, graph analytics, stream processing, log processing, operational intelligence

Platforms: analytic database, (New)SQL database, NoSQL, exploratory analytics platform, Hadoop Distributed File System, YARN

Data labels: structured data, multi-structured data, applications

Page 10: Future of-hadoop-analytics


‘Data gravity’ suggests that processing resources will migrate to the platform that stores the most data, or perhaps the most important data 

The balance of power is currently with the analytic database

However, Hadoop’s flexibility to support data‐processing engines beyond MapReduce could tip the balance in its favor in the long term

Apache YARN enables multiple versions of MapReduce, and enables HDFS to support data-processing frameworks in addition to MapReduce
• Native SQL analytics
• Stream processing
• Graph processing
• Bulk synchronous parallel computing
• Machine learning

Apache Spark provides an in‐memory platform supporting high‐performance processing and multiple data processing engines

Data gravity and the Total Data Warehouse

Page 11: Future of-hadoop-analytics


Teradata’s Unified Data Architecture and QueryGrid ‐ enables querying of data in Teradata Database, Aster Database and Hortonworks

Pivotal’s Big Data Suite ‐ HD Hadoop distribution/Greenplum Database/GemFire distributed data grid and HAWQ SQL‐on‐Hadoop query engine

Cirro offers a federated approach to performing joins and query processing across multiple sources of data including relational database and Hadoop 

Microsoft PolyBase enables SQL Server 2012 PDW analysts to query data in Hadoop using Microsoft’s T-SQL
• PolyBase is only available as part of the Microsoft Analytics Platform System (APS)
• APS is an appliance that combines SQL Server 2012 PDW with Microsoft’s HDInsight distribution of Apache Hadoop
• APS is also the only way that customers can adopt the SQL Server 2012 PDW data warehousing environment
• For Microsoft at least, Hadoop is an integral part of the next-generation data warehouse

Example Total Data Warehouses

Page 12: Future of-hadoop-analytics


SQL-on-Hadoop engines clearly have a role to play in enabling the Total Data Warehouse
• SQL-based querying of data in HDFS
• Federation of queries across multiple data platforms

SQL‐on‐Hadoop initiatives exploded in recent years as a means of uniting the large army of trained SQL analysts with the flexible data storage and processing capabilities of Hadoop

But SQL-on-Hadoop engines are not created equal
• Batch SQL-on-Hadoop
• Interactive SQL-on-Hadoop
• SQL-and-Hadoop
• Operational SQL-on-Hadoop

And the various offerings within those categories are differentiated

The role of SQL‐on‐Hadoop

Page 13: Future of-hadoop-analytics


SQL on/and Hadoop

Approach: Batch SQL-on-Hadoop
Details: Native SQL-like processing of data in HDFS (via MR/Tez)
Examples: Hive on MapReduce

Approach: Interactive SQL-on-Hadoop
Details: Specialist SQL-based query engine running on Hadoop
Examples: Apache Drill, Cloudera Impala, Hive on Tez, Spark SQL

Approach: SQL-and-Hadoop
Details: Federated querying of data in Hadoop and RDBMS
Examples: Teradata, Microsoft, Oracle, IBM

Approach: Operational SQL-on-Hadoop
Details: Operational database that stores data in HDFS
Examples: Splice Machine, Trafodion

Page 14: Future of-hadoop-analytics


SQL on Hadoop examples

Approach: Hive on Tez
Key features: Faster native querying than Hive on MapReduce, HiveQL compatibility, extreme-scale data joins

Approach: Apache Drill
Key features: ANSI SQL; Hadoop, MongoDB, Cassandra, Riak, etc.; consume JSON data, query hierarchical data

Approach: Cloudera Impala
Key features: High performance ad hoc processing, HiveQL compatibility, Parquet file format

Approach: Spark SQL
Key features: In-memory SQL processing, Catalyst query optimizer, replacing Shark (Hive on Spark)

Page 15: Future of-hadoop-analytics


Hadoop is largely complementary to existing data warehouse deployments

The future analytic data-processing landscape will be a hybrid of analytic databases and Hadoop; we call this the Total Data Warehouse

‘Data gravity’ suggests that processing resources will migrate to the platform that stores the most data, or perhaps the most important data 

The balance of power is currently with the analytic database, but Hadoop’s flexibility could tip the balance in its favor in the long term

SQL‐on‐Hadoop engines clearly have a role to play in enabling the Total Data Warehouse

But SQL‐on‐Hadoop engines are not created equal

Conclusion

Page 16: Future of-hadoop-analytics


Questions? [email protected] @maslett

Page 17: Future of-hadoop-analytics


Self Service Data Exploration with Apache Drill

Page 18: Future of-hadoop-analytics


The MapR Distribution including Apache Hadoop

Big Data: Riding the Wave with Hadoop
The Big Data Platform of Choice

• Exponential growth
• 500+ customers
• Premier investors
• >2x annual bookings
• 80% of accounts expand 3X
• 90% software licenses
• < 1% lifetime churn
• > $1B in incremental revenue generated by 1 customer

Page 19: Future of-hadoop-analytics


The Power of the Open Source Community

[Diagram: the Apache Hadoop and OSS ecosystem on top of the MapR Data Platform. Labels recoverable from the slide:]

Section headings: APACHE HADOOP AND OSS ECOSYSTEM; EXECUTION ENGINES; DATA GOVERNANCE AND OPERATIONS; Management; Security

Category labels: Batch; Streaming; NoSQL & Search; ML, Graph; SQL; Provisioning & coordination; Workflow & Data Governance; Data Integration & Access

Components: YARN, MapReduce v1 & v2, Pig, Cascading, Spark, Tez*, Spark Streaming, Storm*, HBase, Solr, Accumulo*, Mahout, MLLib, GraphX, Hive, Impala, Shark, Drill, Juju, Savannah*, Whirr, ZooKeeper, Oozie, Falcon*, Sqoop, Flume, HttpFS, Hue, Sentry*, Knox*

MapR Data Platform: MapR-DB, MapR-FS

* Certification/support planned for 2014

Page 20: Future of-hadoop-analytics


Unstructured data will account for more than 80% of the data collected by organizations

[Chart: Total Data Stored, 1980 to 2020, structured data versus unstructured data]

Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data

Page 21: Future of-hadoop-analytics


Today’s Data Comes in Different Shapes…

Social Media

Messages

Audio

Sensors

Mobile Data

Email

Clickstream

Page 22: Future of-hadoop-analytics


Distance to Data

Existing approaches require a middleman (IT)

[Diagram: two paths from the business to the data]
• MapReduce: Business (analysts, developers) -> “Plumbing” development -> Data
• Hive and other SQL-on-Hadoop: Business (analysts, developers) -> Modeling and transformations -> Data

Page 23: Future of-hadoop-analytics


Distance to Data

Existing approaches require a middleman (IT)

[Diagram: three paths from the business to the data]
• MapReduce: Business (analysts, developers) -> “Plumbing” development -> Data
• Hive and other SQL-on-Hadoop: Business (analysts, developers) -> Modeling and transformations -> Data
• Data Agility: Business (analysts, developers) -> Data

Page 24: Future of-hadoop-analytics


Why Improve Distance to Data?

• Enable rapid data exploration and application development

• IT should provide a valuable service without “getting in the way”

• Can’t add DBAs to keep up with the exponential data growth

• Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users

Improve time to value
Reduce the burden on IT

Page 25: Future of-hadoop-analytics


APACHE DRILL
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications

40+ contributors
150+ years of experience building databases and distributed systems

Page 26: Future of-hadoop-analytics


Evolution Towards Self-Service Data Exploration

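Approach: Traditional BI w/ RDBMS
Data Modeling and Transformation: IT-driven
Data Visualization: IT-driven

Approach: Self-Service BI w/ RDBMS
Data Modeling and Transformation: IT-driven
Data Visualization: Self-service

Approach: SQL-on-Hadoop
Data Modeling and Transformation: IT-driven
Data Visualization: Self-service

Approach: Self-Service Data Exploration
Data Modeling and Transformation: Not needed
Data Visualization: Self-service

Zero-day analytics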

Page 27: Future of-hadoop-analytics


MapR Optimized Data Architecture

[Diagram: Optimized Data Architecture. Labels recoverable from the slide:]

Sources: relational, SaaS, mainframe; documents, emails; log files, clickstreams, sensors; blogs, tweets, link data

Stages: Data Movement; Data Transformation, Enrichment and Integration; Data Access

MAPR DISTRIBUTION FOR HADOOP: Batch / Search (MR, Spark, Hive, Pig, …); Streaming (Spark Streaming, Storm); NoSQL ODBMS (HBase, Accumulo, …)

MapR Data Platform: MapR-DB, MapR-FS

DATA WAREHOUSE

Data access: Analytics; Search; Schema-less data exploration; BI, reporting; Ad-hoc integrated analytics

Operational Apps: Recommendations; Fraud Detection; Logistics

Machine Learning

Page 28: Future of-hadoop-analytics


(1) Self-Describing Data is Ubiquitous

Flat files in DFS
• Complex data (Thrift, Avro, protobuf)
• Columnar data (Parquet, ORC)
• Loosely defined (JSON)
• Traditional files (CSV, TSV)

Data stored in NoSQL stores
• Relational-like (rows, columns)
• Sparse data (NoSQL maps)
• Embedded blobs (JSON)
• Document stores (nested objects)

Example records:

    {
      "name": {"first": "Michael", "last": "Smith"},
      "hobbies": ["ski", "soccer"],
      "district": "Los Altos"
    }
    {
      "name": {"first": "Jennifer", "last": "Gates"},
      "hobbies": ["sing"],
      "preschool": "CCLC"
    }

Page 29: Future of-hadoop-analytics


(2) Drill’s Data Model is Flexible

[Diagram: data formats positioned on two axes, flat to complex and fixed schema to schema-less, with flexibility increasing along both. Formats shown: CSV, TSV, Parquet, Avro, JSON, BSON, HBase]

RDBMS/SQL-on-Hadoop table:

    Name      Gender  Age
    Michael   M       6
    Jennifer  F       3

Apache Drill table:

    {
      "name": {"first": "Michael", "last": "Smith"},
      "hobbies": ["ski", "soccer"],
      "district": "Los Altos"
    }
    {
      "name": {"first": "Jennifer", "last": "Gates"},
      "hobbies": ["sing"],
      "preschool": "CCLC"
    }

Page 30: Future of-hadoop-analytics


(3) Drill Supports Schema Discovery On-The-Fly

Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)

Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
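To make "schema on the fly" concrete, here is a minimal Python sketch that derives a schema while reading self-describing records, instead of requiring one to be declared first. It illustrates the idea only and is not how Drill implements schema discovery; the function names and the dotted-path notation are assumptions made for this example, and the two records are the ones shown on the earlier slides.

```python
import json

# The two self-describing records from the slides, as JSON text.
RECORDS = [
    '{"name": {"first": "Michael", "last": "Smith"},'
    ' "hobbies": ["ski", "soccer"], "district": "Los Altos"}',
    '{"name": {"first": "Jennifer", "last": "Gates"},'
    ' "hobbies": ["sing"], "preschool": "CCLC"}',
]


def field_types(value, prefix=""):
    """Yield (dotted_path, type_name) for every leaf of a decoded JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from field_types(child, path)
    elif isinstance(value, list):
        # List elements share one path, marked with a trailing "[]".
        for child in value:
            yield from field_types(child, prefix + "[]")
    else:
        yield prefix, type(value).__name__


def discover_schema(lines):
    """Merge the fields seen in each record into one path -> types mapping."""
    schema = {}
    for line in lines:
        for path, type_name in field_types(json.loads(line)):
            schema.setdefault(path, set()).add(type_name)
    return schema


schema = discover_schema(RECORDS)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

Note that `district` appears only in the first record and `preschool` only in the second. The discovered schema is the union of both, which is what a fixed, pre-declared table definition cannot express without being altered.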

Page 31: Future of-hadoop-analytics


Quick Tour
Self-Service Data Exploration with Apache Drill

Page 32: Future of-hadoop-analytics


Zero to Results in 2 Minutes (3 Commands)

Install:

    $ tar xzf apache-drill.tar.gz

Launch shell (embedded mode):

    $ apache-drill/bin/sqlline -u jdbc:drill:zk=local

Query:

    0: jdbc:drill:zk=local>
    SELECT count(*) AS incidents, columns[1] AS category
    FROM dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv`
    GROUP BY columns[1]
    ORDER BY incidents DESC;

Results:

    +------------+------------+
    | incidents | category |
    +------------+------------+
    | 8372 | LARCENY/THEFT |
    | 4247 | OTHER OFFENSES |
    | 3765 | NON-CRIMINAL |
    | 2502 | ASSAULT |
    ...
    35 rows selected (0.847 seconds)
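For readers unfamiliar with querying a headerless CSV by column position, the following Python sketch performs the same logical operation as the query on this slide: group rows by their second field and count them, most frequent first. The sample rows are hypothetical stand-ins for the SFPD file, and `count_by_column` is a name invented for this illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the second field plays the role of columns[1].
SAMPLE_CSV = """\
140001,LARCENY/THEFT,Monday
140002,ASSAULT,Monday
140003,LARCENY/THEFT,Tuesday
140004,NON-CRIMINAL,Tuesday
140005,LARCENY/THEFT,Wednesday
"""


def count_by_column(text, index):
    """Count rows per distinct value of the field at `index`, largest first."""
    rows = csv.reader(io.StringIO(text))
    counts = Counter(row[index] for row in rows if len(row) > index)
    return counts.most_common()


incidents = count_by_column(SAMPLE_CSV, 1)
for category, total in incidents:
    print(total, category)
```

As in the slide's query, no schema is declared anywhere: the file is addressed purely by field position.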

Page 33: Future of-hadoop-analytics


Data Source is in the Query

    SELECT timestamp, message
    FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
    WHERE errorLevel > 2

The table reference names three things:
• A storage engine instance (here dfs1): DFS, HBase, Hive Metastore/HCatalog
• A workspace (here logs): sub-directory, Hive database
• A table (here the backquoted path): pathnames, HBase table, Hive table
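The three-part naming used in the FROM clause (storage engine instance, optional workspace, table) can be illustrated by splitting a reference into its parts. This is a sketch for explanation only, not Drill's parser; `TableRef`, `parse_table_ref` and the regular expression are assumptions introduced here.

```python
import re
from typing import NamedTuple, Optional


class TableRef(NamedTuple):
    storage: str              # storage engine instance, e.g. dfs1, hbase, hive
    workspace: Optional[str]  # sub-directory or Hive database, if present
    table: str                # path name, HBase table or Hive table


# storage . [workspace .] table, where table is a bare word or a backquoted path.
_PATTERN = re.compile(
    r"^(?P<storage>\w+)\.(?:(?P<workspace>\w+)\.)?(?P<table>`[^`]+`|\w+)$"
)


def parse_table_ref(ref: str) -> TableRef:
    """Split a dotted table reference into storage, workspace and table."""
    match = _PATTERN.match(ref.strip())
    if match is None:
        raise ValueError(f"unrecognised table reference: {ref!r}")
    return TableRef(match["storage"], match["workspace"], match["table"].strip("`"))


print(parse_table_ref("dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`"))
print(parse_table_ref("hbase.table1"))
```

The workspace is optional in this sketch because the deck's own examples include both two-part (`hbase.table1`) and three-part (`hive.social.tweets`) references.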

Page 34: Future of-hadoop-analytics


Query Directory Trees

    -- Query file: How many errors per level in Jan 2014?
    SELECT errorLevel, count(*)
    FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
    GROUP BY errorLevel;

    -- Query directory sub-tree: How many errors per level?
    SELECT errorLevel, count(*)
    FROM dfs.logs.`/AppServerLogs`
    GROUP BY errorLevel;

    -- Query some partitions: How many errors per level by month from 2012?
    SELECT errorLevel, count(*)
    FROM dfs.logs.`/AppServerLogs`
    WHERE dirs[1] >= 2012
    GROUP BY errorLevel, dirs[2];
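The idea behind the third query, treating directory names as queryable values, can be sketched in Python: walk the tree, expose each file's directory levels alongside its records, then filter and group on them. This is an illustration under stated assumptions rather than Drill's mechanism: it reads JSON-lines files instead of Parquet to stay within the standard library, the 1-based `dirs` mapping mirrors the `dirs[1]`/`dirs[2]` notation on the slide, and the function names and sample layout are hypothetical.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def scan(root: Path):
    """Yield (dirs, record) for every JSON-lines file under root.

    dirs maps 1-based directory depth to directory name, so for
    root/2014/Jan/part0001.json it is {1: "2014", 2: "Jan"}.
    """
    for path in sorted(root.rglob("*.json")):
        parts = path.relative_to(root).parts[:-1]
        dirs = {depth: name for depth, name in enumerate(parts, start=1)}
        with path.open() as handle:
            for line in handle:
                if line.strip():
                    yield dirs, json.loads(line)


def errors_per_level(root: Path, min_year: int) -> Counter:
    """Count records per (errorLevel, month) for years >= min_year."""
    counts = Counter()
    for dirs, record in scan(root):
        if len(dirs) < 2:
            continue  # file is not inside a year/month directory
        if int(dirs[1]) >= min_year:
            counts[(record["errorLevel"], dirs[2])] += 1
    return counts


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "AppServerLogs"
    # Hypothetical layout: (year, month) -> error levels logged in that month.
    layout = {("2011", "Dec"): [1], ("2012", "Jan"): [2, 2, 3], ("2014", "Jan"): [3]}
    for (year, month), levels in layout.items():
        folder = root / year / month
        folder.mkdir(parents=True)
        (folder / "part0001.json").write_text(
            "\n".join(json.dumps({"errorLevel": level}) for level in levels)
        )
    result = errors_per_level(root, min_year=2012)
    print(sorted(result.items()))
```

Because the grouping key is the month directory alone, as in the slide's `GROUP BY errorLevel, dirs[2]`, records from January 2012 and January 2014 fall into the same group.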

Page 35: Future of-hadoop-analytics


Works with HBase and Embedded Blobs

    -- Query an HBase table directly (no schemas)
    SELECT cf1.month, cf1.year
    FROM hbase.table1;

    -- Embedded JSON value inside column profileBlob inside column family cf1 of the HBase table users
    SELECT profile.name, count(profile.children)
    FROM (
      SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile
      FROM hbase.users
    )
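What the embedded-blob query relies on is decoding raw bytes into a structured value at read time. The Python sketch below shows that step in isolation, assuming a toy in-memory table shaped like row key -> column family -> column -> bytes. The data, the `convert_from_json` helper and the per-profile child count are all hypothetical; this is not Drill's `CONVERT_FROM` implementation, and the count is one reading of what the slide's query is meant to compute.

```python
import json

# Hypothetical rows: row key -> {column family -> {column -> raw bytes}}.
USERS = {
    b"u1": {"cf1": {"profileBlob": b'{"name": "Ann", "children": ["Bo", "Cy"]}'}},
    b"u2": {"cf1": {"profileBlob": b'{"name": "Raj", "children": []}'}},
}


def convert_from_json(raw: bytes):
    """Decode a UTF-8 JSON blob into Python objects."""
    return json.loads(raw.decode("utf-8"))


def children_per_profile(table):
    """Return {profile name: number of children} from the embedded blobs."""
    result = {}
    for row in table.values():
        profile = convert_from_json(row["cf1"]["profileBlob"])
        result[profile["name"]] = len(profile.get("children", []))
    return result


print(children_per_profile(USERS))
```

The store itself only ever sees opaque bytes; the structure exists solely in the reader, which is why no schema has to be registered for the blob column.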

Page 36: Future of-hadoop-analytics


Combine Data Sources on the Fly

    -- Join log directory with JSON file (user profiles) to identify the name
    -- and email address for anyone associated with an error message.
    SELECT DISTINCT users.name, users.emails.work
    FROM dfs.logs.`/data/logs` logs,
         dfs.users.`/profiles.json` users
    WHERE logs.uid = users.id AND
          logs.errorLevel > 5;

    -- Join a Hive table and an HBase table (without Hive metadata) to determine the number of tweets per user
    SELECT users.name, count(*) as tweetCount
    FROM hive.social.tweets tweets,
         hbase.users users
    WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8')
    GROUP BY tweets.userId;
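The first join on this slide can be restated as a small hash join in Python, which may help readers see what "combining sources on the fly" amounts to logically: index one source by key, stream the other, filter, and de-duplicate. The log entries and user profiles below are invented sample data, and `users_with_errors` is a hypothetical name; nothing here reflects how Drill plans or executes the join.

```python
# Hypothetical records standing in for the log directory and profiles.json.
LOGS = [
    {"uid": 1, "errorLevel": 7, "message": "disk full"},
    {"uid": 1, "errorLevel": 9, "message": "disk failed"},
    {"uid": 2, "errorLevel": 3, "message": "slow response"},
    {"uid": 3, "errorLevel": 6, "message": "timeout"},
]
USERS = [
    {"id": 1, "name": "Ann", "emails": {"work": "ann@example.com"}},
    {"id": 2, "name": "Raj", "emails": {"work": "raj@example.com"}},
]


def users_with_errors(logs, users, min_level):
    """Return distinct (name, work email) pairs for users with severe errors."""
    users_by_id = {user["id"]: user for user in users}  # build side of the join
    matches = set()                                      # a set gives DISTINCT
    for entry in logs:                                   # probe side of the join
        user = users_by_id.get(entry["uid"])
        if user is not None and entry["errorLevel"] > min_level:
            matches.add((user["name"], user["emails"]["work"]))
    return sorted(matches)


print(users_with_errors(LOGS, USERS, min_level=5))
```

Ann appears once despite two qualifying log entries, and the entry for uid 3 is dropped because it has no matching profile, which matches inner-join semantics.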

Page 37: Future of-hadoop-analytics


Summary
• Enable rapid data exploration and application development while reducing the burden on IT
• Apache Drill 0.5 available now
• Get involved
  – Download and play: http://incubator.apache.org/drill/
  – Ask questions: [email protected]
  – Contribute: http://github.com/apache/incubator-drill/
  – Join the Drill team at MapR
    • Email [email protected]
    • www.mapr.com/careers