Future of Hadoop Analytics
© 2014 MapR Technologies
© 2014 by The 451 Group. All rights reserved
Introducing the Total Data Warehouse
Matthew Aslett, Research Director, Data Management and Analytics, 451 Research
Matthew Aslett
• Research Director, Data Platforms and Analytics
[email protected]
www.twitter.com/maslett
Responsible for data management and analytics research agenda
Focus on operational and analytic databases, including NoSQL, NewSQL, and Hadoop
With 451 Research since 2007
Company Overview
One company with 3 operating divisions
Syndicated research, advisory, professional services, datacenter certification, and events
Global focus
270+ staff
1,500+ client organizations: enterprises, vendors, service providers, and investment firms
Organic growth and growth through acquisition
The rise of Apache Hadoop has been driven largely by demand for more flexible approaches to data management and analytics, overcoming the limitations of traditional analytic databases and their adherence to strictly defined schemas
Hadoop is largely complementary to existing data warehouse deployments
However, there is clear evidence that at least some workloads are being migrated from existing enterprise data warehouses to Hadoop
E.g. Teradata’s CEO noted in October 2013 that, on average, 20% of the total ETL workload on Teradata data warehouses could potentially move to Hadoop (4‐8% of the total Teradata data warehouse workload)
That has driven many people to question the extent to which Hadoop will replace the data warehouse
Hadoop and the data warehouse
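The two Teradata percentages quoted above can be cross-checked with a little arithmetic. The sketch below only rearranges the slide's own figures; the implied ETL share it prints is a derived value, not a number stated in the presentation, and the function name is chosen here for illustration.

```python
# Figures quoted on the slide (Teradata CEO, October 2013).
ETL_SHARE_THAT_COULD_MOVE = 0.20            # 20% of the ETL workload
TOTAL_SHARE_THAT_COULD_MOVE = (0.04, 0.08)  # 4-8% of the total workload


def implied_etl_share(total_share: float, etl_fraction: float) -> float:
    """Return the share of the total workload that ETL must represent.

    If `etl_fraction` of the ETL workload equals `total_share` of the whole
    workload, then ETL as a whole is total_share / etl_fraction of the total.
    """
    return total_share / etl_fraction


low, high = (implied_etl_share(share, ETL_SHARE_THAT_COULD_MOVE)
             for share in TOTAL_SHARE_THAT_COULD_MOVE)
print(f"Implied ETL share of total workload: {low:.0%} to {high:.0%}")
```

In other words, the two figures are consistent only if ETL accounts for roughly a fifth to two fifths of the work done on those warehouses.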
Survey conducted: Sept/Oct 2013. Sample: 98
Hadoop and the data warehouse
Survey question: Describe the relationship between Hadoop and the enterprise data warehouse within your organization
[Chart] Response categories:
• Hadoop not yet used
• Hadoop for workloads not previously on EDW
• Temporarily offloading workloads to Hadoop
• Permanently migrating workloads to Hadoop
• Hadoop replacing EDW
Two‐thirds of Hadoop engagement is currently non‐threatening or additive to existing data warehouse deployments
The question of whether Hadoop will replace the data warehouse is framed incorrectly: it assumes that a ‘data warehouse’ is by default based on an analytic relational database
A data warehouse as an enterprise platform for storing, processing and analyzing data could be based on an analytic database, Hadoop, or a combination of the two
Hadoop is primarily used to handle unstructured and semi-structured data that is not a good fit, in terms of economics and data formats, for analytic databases
The future analytic data-processing landscape will be a hybrid of analytic databases and Hadoop, each used where appropriate for the individual analytic use case
Hadoop replacing the data warehouse?
There are various phrases used to describe this hybrid landscape. In keeping with our ‘Total Data’ terminology, we call this the Total Data Warehouse
The primary platforms in a Total Data Warehouse are expected to be analytic databases and Hadoop
However, we also expect to see the Total Data Warehouse comprise other data storage and processing platforms:
• Exploratory analytics/discovery platforms
• Search
• Graph processing
• Stream processing
• Machine learning
• Log processing
• NoSQL databases
• NewSQL databases
Introducing the Total Data Warehouse
The Total Data Warehouse
[Diagram] Platform labels: analytic database; (New)SQL database; NoSQL; Hadoop Distributed File System; YARN; exploratory analytics platform
[Diagram] Data labels: structured data; multi-structured data; applications
[Diagram] Workload labels: pre-defined reporting; ad hoc analytics; statistical analytics; predictive analytics; machine learning; MapReduce; search-based analytics; graph analytics; stream processing; operational intelligence; log processing
‘Data gravity’ suggests that processing resources will migrate to the platform that stores the most data, or perhaps the most important data
The balance of power is currently with the analytic database
However, Hadoop’s flexibility to support data‐processing engines beyond MapReduce could tip the balance in its favor in the long term
Apache YARN supports multiple versions of MapReduce and enables HDFS to support data-processing frameworks in addition to MapReduce:
• Native SQL analytics
• Stream processing
• Graph processing
• Bulk synchronous parallel computing
• Machine learning
Apache Spark provides an in‐memory platform supporting high‐performance processing and multiple data processing engines
Data gravity and the Total Data Warehouse
Teradata’s Unified Data Architecture and QueryGrid ‐ enables querying of data in Teradata Database, Aster Database and Hortonworks
Pivotal’s Big Data Suite ‐ HD Hadoop distribution/Greenplum Database/GemFire distributed data grid and HAWQ SQL‐on‐Hadoop query engine
Cirro offers a federated approach to performing joins and query processing across multiple sources of data including relational database and Hadoop
Microsoft PolyBase enables SQL Server 2012 PDW analysts to query data in Hadoop using Microsoft’s T-SQL
• PolyBase is only available as part of the Microsoft Analytics Platform System (APS)
• APS is an appliance that combines SQL Server 2012 PDW with Microsoft’s HDInsight distribution of Apache Hadoop
• APS is also the only way that customers can adopt the SQL Server 2012 PDW data warehousing environment
• For Microsoft at least, Hadoop is an integral part of the next-generation data warehouse
Example Total Data Warehouses
SQL-on-Hadoop engines clearly have a role to play in enabling the Total Data Warehouse
• SQL-based querying of data in HDFS
• Federation of queries across multiple data platforms
SQL‐on‐Hadoop initiatives exploded in recent years as a means of uniting the large army of trained SQL analysts with the flexible data storage and processing capabilities of Hadoop
But SQL-on-Hadoop engines are not created equal
• Batch SQL-on-Hadoop
• Interactive SQL-on-Hadoop
• SQL-and-Hadoop
• Operational SQL-on-Hadoop
And the various offerings within those categories are differentiated
The role of SQL‐on‐Hadoop
SQL on/and Hadoop
Approach | Details | Examples
Batch SQL-on-Hadoop | Native SQL-like processing of data in HDFS (via MR/Tez) | Hive on MapReduce
Interactive SQL-on-Hadoop | Specialist SQL-based query engine running on Hadoop | Apache Drill, Cloudera Impala, Hive on Tez, Spark SQL
SQL-and-Hadoop | Federated querying of data in Hadoop and RDBMS | Teradata, Microsoft, Oracle, IBM
Operational SQL-on-Hadoop | Operational database that stores data in HDFS | Splice Machine, Trafodion
SQL on Hadoop examples
Approach | Key features
Hive on Tez | Faster native querying than Hive on MapReduce, HiveQL compatibility, extreme-scale data joins
Apache Drill | ANSI SQL; Hadoop, MongoDB, Cassandra, Riak, etc.; consume JSON data, query hierarchical data
Cloudera Impala | High performance ad hoc processing, HiveQL compatibility, Parquet file format
Spark SQL | In-memory SQL processing, Catalyst query optimizer, replacing Shark (Hive on Spark)
Hadoop is largely complementary to existing data warehouse deployments
The future analytic data-processing landscape will be a hybrid of analytic databases and Hadoop; we call this the Total Data Warehouse
‘Data gravity’ suggests that processing resources will migrate to the platform that stores the most data, or perhaps the most important data
The balance of power is currently with the analytic database, but Hadoop’s flexibility could tip the balance in its favor in the long term
SQL‐on‐Hadoop engines clearly have a role to play in enabling the Total Data Warehouse
But SQL‐on‐Hadoop engines are not created equal
Conclusion
Questions? [email protected] @maslett
Self Service Data Exploration with Apache Drill
The MapR Distribution including Apache Hadoop
Big Data: Riding the Wave with Hadoop, the Big Data Platform of Choice
Exponential growth, 500+ customers, premier investors
• >2x annual bookings
• 80% of accounts expand 3X
• 90% software licenses
• <1% lifetime churn
• >$1B in incremental revenue generated by 1 customer
The Power of the Open Source Community
[Diagram: Apache Hadoop and OSS ecosystem on the MapR Data Platform (MapR-FS, MapR-DB), with Management]
Section labels: Execution engines; Data governance and operations
Group labels: Batch; Streaming; NoSQL & Search; ML, Graph; SQL; Security; Provisioning & coordination; Workflow & Data Governance; Data Integration & Access
Components shown: YARN, MapReduce v1 & v2, Pig, Cascading, Spark, Tez*, Spark Streaming, Storm*, HBase, Solr, Accumulo*, Mahout, MLLib, GraphX, Hive, Impala, Shark, Drill, Sentry*, Knox*, Juju, Savannah*, Whirr, ZooKeeper, Oozie, Falcon*, Sqoop, Flume, HttpFS, Hue
* Certification/support planned for 2014
Unstructured data will account for more than 80% of the data collected by organizations
[Chart: total data stored, structured data vs. unstructured data, 1980 to 2020]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
Today’s Data Comes in Different Shapes…
Social Media
Messages
Audio
Sensors
Mobile Data
Clickstream
Distance to Data
Existing approaches require a middleman (IT)
• MapReduce: business (analysts, developers) → “plumbing” development → data
• Hive and other SQL-on-Hadoop: business (analysts, developers) → modeling and transformations → data
Distance to Data
Existing approaches require a middleman (IT)
• MapReduce: business (analysts, developers) → “plumbing” development → data
• Hive and other SQL-on-Hadoop: business (analysts, developers) → modeling and transformations → data
• Data Agility: business (analysts, developers) → data
Why Improve Distance to Data?
• Enable rapid data exploration and application development
• IT should provide a valuable service without “getting in the way”
• Can’t add DBAs to keep up with the exponential data growth
• Minimize “unnecessary work” so IT can focus on value-added activities and become a partner to the business users
Goals: improve time to value; reduce the burden on IT
APACHE DRILL
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics & operational applications
40+ contributors
150+ years of experience building databases and distributed systems
Evolution Towards Self-Service Data Exploration
Approach | Data Modeling and Transformation | Data Visualization
Traditional BI w/ RDBMS | IT-driven | IT-driven
Self-Service BI w/ RDBMS | IT-driven | Self-service
SQL-on-Hadoop | IT-driven | Self-service
Self-Service Data Exploration (zero-day analytics) | Not needed | Self-service
MapR Optimized Data Architecture
[Diagram: Optimized Data Architecture]
Sources: relational, SaaS, mainframe; documents, emails; log files, clickstreams; sensors; blogs, tweets, link data
Data movement: sources feed the MapR Distribution for Hadoop and the data warehouse
MapR Distribution for Hadoop: Batch / Search (MR, Spark, Hive, Pig, …); Streaming (Spark Streaming, Storm); NoSQL ODBMS (HBase, Accumulo, …); MapR Data Platform (MapR-FS, MapR-DB)
Data transformation, enrichment and integration
Data access: analytics; search; schema-less data exploration; BI, reporting; ad-hoc integrated analytics
Operational apps: recommendations; fraud detection; logistics; machine learning
(1) Self-Describing Data is Ubiquitous
Flat files in DFS
• Complex data (Thrift, Avro, protobuf)
• Columnar data (Parquet, ORC)
• Loosely defined (JSON)
• Traditional files (CSV, TSV)
Data stored in NoSQL stores
• Relational-like (rows, columns)
• Sparse data (NoSQL maps)
• Embedded blobs (JSON)
• Document stores (nested objects)
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}
(2) Drill’s Data Model is Flexible
[Diagram: data formats placed on two axes of flexibility, flat to complex and fixed schema to schema-less: CSV/TSV, Parquet/Avro, HBase, JSON/BSON]
RDBMS/SQL-on-Hadoop table:
Name | Gender | Age
Michael | M | 6
Jennifer | F | 3
Apache Drill table:
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}
(3) Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (schema on write, schema before read)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (schema on the fly)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
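To make the idea of discovering a schema from self-describing data concrete, here is a minimal Python sketch. It is an illustration only and not how Drill is implemented: the function name discover_schema is hypothetical, and the two records are the examples from the earlier slide rewritten as Python dictionaries.

```python
from typing import Any, Dict, Iterable, Set

# The two self-describing records from the earlier slide, as Python dicts.
RECORDS = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]


def discover_schema(records: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each dotted field path to the set of value types observed for it."""
    schema: Dict[str, Set[str]] = {}

    def visit(value: Any, path: str) -> None:
        if isinstance(value, dict):
            # Descend into nested objects, extending the dotted path.
            for key, child in value.items():
                visit(child, f"{path}.{key}" if path else key)
        else:
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        visit(record, "")
    return schema


schema = discover_schema(RECORDS)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

No schema is declared before reading: the field set (including district, which appears in only one record, and preschool, which appears only in the other) is learned from the data itself.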
Quick Tour: Self-Service Data Exploration with Apache Drill
Zero to Results in 2 Minutes (3 Commands)
Install:
$ tar xzf apache-drill.tar.gz
Launch shell (embedded mode):
$ apache-drill/bin/sqlline -u jdbc:drill:zk=local
Query:
0: jdbc:drill:zk=local>
SELECT count(*) AS incidents, columns[1] AS category
FROM dfs.`/tmp/SFPD_Incidents_-_Previous_Three_Months.csv`
GROUP BY columns[1]
ORDER BY incidents DESC;
Results:
+------------+------------+
| incidents  | category   |
+------------+------------+
| 8372       | LARCENY/THEFT |
| 4247       | OTHER OFFENSES |
| 3765       | NON-CRIMINAL |
| 2502       | ASSAULT |
...
35 rows selected (0.847 seconds)
Data Source is in the Query
SELECT timestamp, message
FROM dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`
WHERE errorLevel > 2
The FROM clause names, in order:
• A storage engine instance (dfs1 in the query): DFS, HBase, Hive Metastore/HCatalog
• A workspace (logs in the query): sub-directory, Hive database
• A table (the backtick-quoted path in the query): pathnames, HBase table, Hive table
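The three-part naming on this slide can be illustrated with a small Python sketch that splits the reference from the FROM clause into storage engine instance, workspace and table. The DataSource type and parse_data_source function are hypothetical names used only for this illustration; they are not part of Drill.

```python
from typing import NamedTuple


class DataSource(NamedTuple):
    storage_engine: str
    workspace: str
    table: str


def parse_data_source(reference: str) -> DataSource:
    """Split 'engine.workspace.`table`' into its three parts.

    The table is backtick-quoted, so dots inside it (for example a
    '.parquet' extension) are not treated as separators.
    """
    prefix, sep, rest = reference.partition("`")
    if not sep or not rest.endswith("`"):
        raise ValueError(f"expected a backtick-quoted table in {reference!r}")
    parts = prefix.rstrip(".").split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected engine.workspace before the table in {reference!r}")
    return DataSource(parts[0], parts[1], rest[:-1])


source = parse_data_source("dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet`")
print(source)
```

The point of the slide is that nothing else needs to be registered up front: the reference alone says which engine to ask, where to look, and what to read.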
Query Directory Trees
# Query file: How many errors per level in Jan 2014?
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
GROUP BY errorLevel;

# Query directory sub-tree: How many errors per level?
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs`
GROUP BY errorLevel;

# Query some partitions: How many errors per level by month from 2012?
SELECT errorLevel, count(*)
FROM dfs.logs.`/AppServerLogs`
WHERE dirs[1] >= 2012
GROUP BY errorLevel, dirs[2];
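The partition query can be mimicked in plain Python to show how directory levels act as filter and grouping keys. This is a sketch under stated assumptions: it assumes dirs[1] and dirs[2] refer to the first and second directory levels below the queried path, as the slide's example paths suggest, and the file names and error levels below are invented for illustration.

```python
from collections import Counter
from pathlib import PurePosixPath

ROOT = PurePosixPath("/AppServerLogs")

# Hypothetical files and the error levels recorded in each one.
FILES = {
    "/AppServerLogs/2011/Dec/part0001.parquet": [1, 3],
    "/AppServerLogs/2012/Jan/part0001.parquet": [2, 2, 5],
    "/AppServerLogs/2014/Jan/part0001.parquet": [2, 4],
}


def dir_level(path: str, level: int) -> str:
    """Return the level-th directory (1-based) between ROOT and the file."""
    directories = PurePosixPath(path).relative_to(ROOT).parts[:-1]
    return directories[level - 1]


# Mirrors: WHERE dirs[1] >= 2012 GROUP BY errorLevel, dirs[2]
counts = Counter()
for path, error_levels in FILES.items():
    if int(dir_level(path, 1)) >= 2012:
        month = dir_level(path, 2)
        for error_level in error_levels:
            counts[(error_level, month)] += 1

for key in sorted(counts):
    print(key, counts[key])
```

As in the slide's query, grouping only on the month level merges January 2012 with January 2014; adding the year level to the grouping key would keep them apart.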
Works with HBase and Embedded Blobs
# Query an HBase table directly (no schemas)
SELECT cf1.month, cf1.year
FROM hbase.table1;

# Embedded JSON value inside column profileBlob inside column family cf1 of the HBase table users
SELECT profile.name, count(profile.children)
FROM (
  SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile
  FROM hbase.users
)
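What the embedded-blob query does can be sketched in Python: decode a JSON value stored as bytes in a column, then project fields from the decoded structure. The table contents and the helper name convert_from_json are hypothetical, and counting the children per profile is only one reading of the slide's count(profile.children).

```python
import json

# Hypothetical HBase-like rows: row key -> column family -> column -> bytes.
USERS = {
    b"u1": {"cf1": {"profileBlob": b'{"name": "Michael", "children": ["Ann", "Bo"]}'}},
    b"u2": {"cf1": {"profileBlob": b'{"name": "Jennifer", "children": []}'}},
}


def convert_from_json(blob: bytes) -> dict:
    """Decode a UTF-8 JSON blob into a nested Python structure."""
    return json.loads(blob.decode("utf-8"))


# Inner query: decode the blob. Outer query: project fields from the result.
profiles = [convert_from_json(row["cf1"]["profileBlob"]) for row in USERS.values()]
result = [(profile["name"], len(profile["children"])) for profile in profiles]
print(result)
```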
Combine Data Sources on the Fly
# Join log directory with JSON file (user profiles) to identify the name and email address for anyone associated with an error message.
SELECT DISTINCT users.name, users.emails.work
FROM dfs.logs.`/data/logs` logs,
     dfs.users.`/profiles.json` users
WHERE logs.uid = users.id AND
      logs.errorLevel > 5;

# Join a Hive table and an HBase table (without Hive metadata) to determine the number of tweets per user
SELECT users.name, count(*) as tweetCount
FROM hive.social.tweets tweets,
     hbase.users users
WHERE tweets.userId = convert_from(users.rowkey, 'UTF-8')
GROUP BY tweets.userId;
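The first join above can be mirrored in a few lines of Python to show its logic: match each log record to a profile by user id, keep only records above the error threshold, and return distinct name and work-email pairs. The sample records and email addresses are invented for illustration.

```python
# Hypothetical log records and user profiles.
LOGS = [
    {"uid": 1, "errorLevel": 7},
    {"uid": 2, "errorLevel": 3},
    {"uid": 1, "errorLevel": 9},
]
USERS = [
    {"id": 1, "name": "Michael", "emails": {"work": "michael@example.com"}},
    {"id": 2, "name": "Jennifer", "emails": {"work": "jennifer@example.com"}},
]

# Index the profiles by id so each log record can be matched in one lookup.
users_by_id = {user["id"]: user for user in USERS}

# WHERE logs.uid = users.id AND logs.errorLevel > 5, then SELECT DISTINCT
# (a set removes the duplicate produced by the two errors for user 1).
matches = {
    (users_by_id[log["uid"]]["name"], users_by_id[log["uid"]]["emails"]["work"])
    for log in LOGS
    if log["errorLevel"] > 5 and log["uid"] in users_by_id
}
print(sorted(matches))
```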
Summary
• Enable rapid data exploration and application development while reducing the burden on IT
• Apache Drill 0.5 available now
• Get involved
  – Download and play: http://incubator.apache.org/drill/
  – Ask questions: [email protected]
  – Contribute: http://github.com/apache/incubator-drill/
  – Join the Drill team at MapR
    • Email [email protected]
    • www.mapr.com/careers