Page 1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Turbocharging Your Data Science with HAWQ on the Hortonworks Data Platform
We Do Hadoop
Page 2
Your Hosts
Michael Cucchi (@mikecucchi)
• Sr. Director of Outbound Product for Pivotal's Data, Mobile, and IoT solutions
• 20 years of engineering, management, and marketing experience in the high-tech industry
Matt Morgan (@forwardtension)
• Vice President, Global Product Marketing
• 20-year history as a marketing and product executive in cloud, SaaS, and big data businesses
Page 3
Hortonworks Quick Facts
Establish Hadoop as the Foundational Technology of the Modern Enterprise Data Architecture
Year Founded: In 2011, 24 engineers from the original Hadoop team at Yahoo! spun out to form Hortonworks.
Ticker Symbol: NASDAQ: HDP
Headquarters: Santa Clara, CA
Business Model: Open Source Software Support Subscriptions, Training and Consulting Services
Non-GAAP Billings: Grew from zero to over $120 million on an annualized basis in 11 quarters
Subscription Customers: 437 in 11 quarters, with 105 added in Q1 2015 alone
Support: 24×7, global web and telephone support
Partners: 1100 joint engineering, strategic reseller, technology, and system integrator partners
Employees: 650+
Global Operations: 17 countries
#1: 28 out of 86 Apache Hadoop committers. Hortonworks employs the largest group of Hadoop committers under one roof, more than twice any other company.
#1: 165 Apache committer seats for projects in HDP. Our committers work in 20+ projects on the data access, management, security, operations, and governance needs of the enterprise, more than twice any other company.
The Forrester Wave™ Big Data Hadoop Solutions: We are recognized as a leader in Hadoop by Forrester Research based on the strengths of our offerings and strategy.
Page 4
Traditional Systems Under Pressure
Challenges
• Constrains data to app
• Can’t manage new data
• Costly to scale
[Figure: business value of traditional data (ERP, CRM, SCM) versus new data (clickstream, geolocation, web data, Internet of Things, docs and emails, server logs); data volume grows from 2.8 zettabytes in 2012 to 40 zettabytes in 2020; laggards versus industry leaders]
Page 5
Early Hadoop: The Start of a Modern Data Architecture
Apache Hadoop is an open source data platform for managing large volumes of high-velocity, high-variety data
• Built by Yahoo! to be the heartbeat of its ad & search business
• Donated to Apache Software Foundation in 2005 with rapid adoption by large web properties & early adopter enterprises
• Incredibly disruptive to current platform economics
Traditional Hadoop Advantages
✓ Manages new data paradigm
✓ Handles data at scale
✓ Cost effective
✓ Open source
Traditional Hadoop Had Limitations
• Batch-only architecture with limited analytic options
• Single purpose clusters, specific data sets
• Difficult to integrate with existing investments
• Not enterprise-grade
[Figure: early Hadoop stack: Application; Batch Processing (MapReduce); Storage (HDFS)]
Page 6
Today: Modern Data Architecture Unifies Data & Processing
Modern Data Architecture
• Enables applications to access all your enterprise data through an efficient centralized platform
• Supported by a centralized approach to governance, security and operations
• Versatile enough to handle any application and dataset, no matter the size or type
[Figure: sources (existing systems: ERP, CRM, SCM; clickstream; web & social; geolocation; sensor & machine; server logs; unstructured) feed HDFS (Hadoop Distributed File System) and YARN: Data Operating System, which runs batch, interactive, real-time and partner ISV workloads alongside EDW and MPP systems; analytics consumers are data marts, business analytics, visualization & dashboards, and applications]
Page 7
Partnerships Enrich the Hadoop Ecosystem
Deep Partnerships
Hortonworks engages in deep engineered relationships with the leaders in the data center, such as EMC, Microsoft, Teradata, Red Hat, HP, SAS & SAP
Broad Partnerships
Over 1100 partners work with us to certify their applications to work with Hadoop so they can extend big data to their users
[Figure: modern data architecture (sources, HDFS, YARN: Data Operating System, EDW, interactive / real-time / batch / partner ISV workloads, analytics consumers) framed by partner categories: operational tools, dev & data tools, infrastructure]
Page 8
Hadoop Adoption Follows a Predictable Journey
Cost optimization, new analytic apps, and ultimately a data lake
Page 9
Hadoop Driver: Cost optimization
HDP helps you reduce costs and optimize the value associated with your EDW
• Archive data off EDW: move rarely used data to Hadoop as an active archive, store more data longer
• Offload costly ETL processes: free your EDW to perform high-value functions like analytics & operations, not ETL
• Enrich the value of your EDW: use Hadoop to refine new data sources, such as web and machine data, for new analytical context
[Figure: sources (existing systems: ERP, CRM, SCM; clickstream; web & social; geolocation; sensor & machine; server logs; unstructured) flow into HDP 2.2, which performs ELT and holds cold data, deeper archive & new sources; the enterprise data warehouse (MPP, in-memory) holds hot data; analytics consumers are data marts, business analytics, visualization & dashboards]
Page 10
Hadoop Driver: Advanced analytic applications
Single View: Improve acquisition & retention
• HDP enables a single view of each customer, allowing organizations to provide targeted, personalized customer experiences.
• Single view reduces attrition, improves cross-sell and improves customer satisfaction.
Predictive Analytics: Identify next best action
• HDP captures, stores and processes large volumes of data streaming from connected devices.
• Stream processing and data science help introduce new analytics for real-time and batch analysis.
Data Discovery: Uncover new findings
• HDP allows exploration of new data types and large data sets that were previously too big to capture, store & process.
• Unlock insights from data such as clickstream, geo-location, sensor, server log, social, text and video data.
Page 11
360° Customer View Boosts Sales at Home Supply Retailer
Retail: Major home improvement retailer
Why Hadoop? Single View
New Analytic Applications: Clickstream, Unstructured and Structured Data
Problem: Lack of a unified customer record across all channels clouded targeting for marketing campaigns
• No “golden record” for analytics on customer buying behavior across all channels
• Data repositories on web traffic, POS transactions and in-home services existed in isolation from each other
• Data storage costs were increasing, without a corresponding increase in value
Solution: HDP data lake drives golden customer record, targeted marketing, and reduction in data storage expenses
• Golden record enables targeted, personalized marketing with higher success rates
• Data warehouse offload saved millions of dollars in recurring expense
• Price optimization versus competitors → several millions in top-line revenue growth
Page 12
Responsive Patient Treatment with Real-time Monitoring of Vitals
Healthcare: Public university teaching hospital
Why Hadoop? Predictive Analytics
New Analytic Applications: Sensor, Social Data & ETL Offload
Problem: Inability to store and access sufficient data for medical decision support in real time
• 9 million patient records on a legacy system were neither searchable nor retrievable
• Cohort selection for research projects was slow, despite abundance of data
• Clinicians had minimal access to historical data gathered across all patients
Solution: Unified data lake improves patient health, speeds research
• Legacy system retired immediately, saving $500K in annual recurring expense
• Records stored with patient identification for clinical use, same data presented anonymously to researchers for cohort selection
• Wireless patches transmit vital signs, algorithms notify doctors of high-risk patterns
• Heart patients weigh themselves from home, algorithms notify doctors about unsafe weight changes and recommend a visit to the clinic
Page 13
Hadoop Driver: Enabling the Data Lake
Journey to the Data Lake with Hadoop
Drivers:
1. Cost Optimization
2. Advanced Analytic Apps
Goal:
• Centralized Architecture
• Data-driven Business
Data Lake Definition
• Centralized Architecture: multiple applications on a shared data set with consistent levels of service
• Any App, Any Data: multiple applications accessing all data, affording new insights and opportunities
• Unlocks ‘Systems of Insight’: advanced algorithms and applications used to derive new value and optimize existing value
[Figure: journey to the data lake plotted on SCALE and SCOPE axes, leading to DATA LAKE and Systems of Insight]
Page 14
Case Study: 12-Month Hadoop Evolution at TrueCar
12-month execution plan (Data Platform Capabilities over time)
• June 2013: Begin Hadoop execution
• July 2013: Hortonworks partnership
• Aug 2013: Training & dev begins
• Nov 2013: Production cluster, 60 nodes, 2 PB
• Dec 2013: Three production apps (3 total)
• Jan 2014: 40% of dev staff proficient
• Feb 2014: Three more production apps (6 total)
• May 2014: IPO
12-Month Results at TrueCar
• Six production Hadoop applications
• Sixty nodes / 2 PB data
• Storage costs / compute costs from $19/GB to $0.12/GB
“We addressed our data platform capabilities strategically as a precursor to IPO.”
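To put the quoted per-GB figures in perspective, here is a small arithmetic sketch in Python. It uses only the two cost numbers from the slide; the variable names are illustrative, and the slide does not say how the costs were measured.

```python
# Per-GB cost figures quoted on the TrueCar slide.
cost_before = 19.00   # USD per GB before the Hadoop rollout
cost_after = 0.12     # USD per GB after the Hadoop rollout

# How many times cheaper a gigabyte became.
reduction_factor = cost_before / cost_after
# Share of the original per-GB cost that was removed, in percent.
percent_saved = (1 - cost_after / cost_before) * 100

print(f"{reduction_factor:.0f}x cheaper, {percent_saved:.1f}% saved per GB")
```

Run as written, this prints `158x cheaper, 99.4% saved per GB`.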
Page 15
Hortonworks Data Platform
Hadoop for the Enterprise
Page 16
HDP Makes Hadoop Enterprise-Ready
Hortonworks Data Platform: multi-tenant data platform built on a centralized architecture of shared enterprise services
Key benefits
• Consolidates all data sets
• Delivers real-time insights
• Integrates with data center
• Scalable and affordable
[Figure: HDP stack: existing applications, new analytics and partner applications; data access (batch, interactive, real-time); YARN: data operating system (resource management); storage; with governance, security and operations as shared services]
Page 17
HDP Makes Hadoop Pervasive
• Any application: batch, interactive, and real-time
• Any data: existing and new datasets
• Anywhere: complete range of deployment options (commodity, appliance, cloud)
[Figure: existing applications, new analytics and partner applications over data access (batch, interactive, real-time) and YARN: data operating system]
Page 18
An “Any Application” Example: Spark in HDP
Delivering a production-ready experience for Spark applications
• Centralized Resource Management: integrated with YARN
• Consistent Operations: provisioned and managed by Ambari
• Comprehensive Security: runs within secure clusters
• Deployable Anywhere: Windows, Linux, on-premises or cloud; consistent Cloudbreak launch experience
[Figure: HDP stack: YARN: data operating system (resource management); storage; governance, security and operations]
Page 19
An “Anywhere” Example: Cloudbreak and HDP
Cloudbreak
1. Pick a Blueprint
2. Choose a Cloud
3. Launch HDP!
Example Ambari Blueprints:
• IoT Apps (Storm, HBase, Hive)
• BI / Analytics (Hive)
• Data Science (Spark)
• Dev / Test (all HDP services)
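As a rough illustration of what "pick a blueprint" involves, the sketch below builds a document in the general shape of an Ambari blueprint (a `Blueprints` header plus a list of `host_groups`) and prints it as JSON. The blueprint name, host group names, cardinalities and component selection are assumptions made for illustration; the slides do not give the contents of the actual Data Science blueprint, and no Cloudbreak or Ambari API is called here.

```python
import json

def make_blueprint(name, stack_version, host_groups):
    """Build an Ambari-blueprint-shaped dict.

    host_groups maps a group name to (cardinality, [component names]).
    """
    return {
        "Blueprints": {
            "blueprint_name": name,
            "stack_name": "HDP",
            "stack_version": stack_version,
        },
        "host_groups": [
            {
                "name": group,
                "cardinality": cardinality,
                "components": [{"name": c} for c in components],
            }
            for group, (cardinality, components) in host_groups.items()
        ],
    }

# Hypothetical "data science" layout: one master, three workers.
blueprint = make_blueprint(
    "data-science-sketch",
    "2.2",
    {
        "master": ("1", ["NAMENODE", "RESOURCEMANAGER", "SPARK_JOBHISTORYSERVER"]),
        "worker": ("3", ["DATANODE", "NODEMANAGER", "SPARK_CLIENT"]),
    },
)

print(json.dumps(blueprint, indent=2))
```

Keeping the blueprint as plain data is what lets the same definition be launched on different clouds: the choice of cloud is supplied separately at launch time.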
Page 20
A Leader in Hadoop
The Forrester Wave™ Big Data Hadoop Solutions Q1 2014
“Hortonworks loves and lives open source innovation”
World Class Support and Services: Hortonworks' Customer Support received a maximum score and was significantly higher than both Cloudera and MapR
Page 21
Pivotal in the Modern Data Architecture
[Figure: modern data architecture with Pivotal products. Sources: OLTP, ERP, CRM systems; documents & emails; web logs, click streams; social networks; machine generated; sensor data; geo-location data. Data systems: repositories (EDW, MPP, RDBMS), with Greenplum, GemFire and HAWQ alongside YARN, plus data access, data management, governance & integration, security and operations. Applications: statistical analysis; BI / reporting, ad hoc analysis; interactive web & mobile applications; enterprise applications. Operations tools: provision, manage & monitor. Dev & data tools: build & test. Infrastructure: on premise, cloud, appliance.]
22 © Copyright 2014 Pivotal. All rights reserved.
Turbo Charging Data Science with HAWQ
23 © 2015 Pivotal Software, Inc. All rights reserved.
Pivotal By the Numbers
• Founded April 2013
• 1700+ employees
• Funded by EMC, VMware, and GE
• Hundreds of customers
• Pivotal Data: >$100M in data software bookings in 2014
• Pivotal Cloud Foundry: fastest revenue growth in an open source project in history; >$40M in first year for Pivotal Cloud Foundry in 2014 (subscription)
• Big Data, Cloud Platform, Agile
Page 24
Software is Eating the World
Data Is Fueling Software
Page 25
The Data Divide
BIG DATA CHASM
• 70% of data generated by customers
• 80% of data stored
• 3% prepared for analysis
• 0.5% being analyzed
• <0.5% being operationalized
Page 26
Pivotal Business Data Lake Architecture
[Figure: data flows from Sources through the Ingestion Tier (real-time ingestion, micro batch ingestion, batch ingestion) into the Distillation Tier and Processing Tier (HDFS storage for unstructured and structured data, in-memory, MPP database; real-time, micro batch and mega batch processing; SQL, NoSQL and MapReduce), then to the Insights Tier (real-time, interactive and batch insights; SQL query interfaces) and the Action Tier; system monitoring, system management and workflow management span all tiers]
Page 27
The Data Driven Enterprise Journey
STORE
• Structured
• Unstructured
• High Volume
• High Velocity
ANALYZE
• Predictive Analytics
• Machine Learning
• Advanced Data Science
• Realtime Analytics
DEVELOP
• Advanced Analytic Pipelines
• Realtime Analytical Applications
• Global Scale Data-Driven Applications
• Enterprise, Consumer, IoT, and Mobile
INNOVATE
• Agile Dev Expertise
• DevOps
• Hybrid Cloud
• Continuous Delivery
• Closed Loop Applications
[Figure labels: Agile Development; Big Data Predictive Analytics; Enterprise PaaS]
Page 28
Technical Observations
• SQL is today and will remain the most valuable workload on Hadoop
• While Hadoop continues to mature, focused MPP SQL will remain important
• Scale-out in-memory processing will have significant enterprise adoption and impact into the future
• Streaming and Machine Learning will continue to gain value
• Open Source is becoming critical to enterprise investment decisions
Page 29
Pivotal BDS + Hortonworks HDP = The Complete Solution
[Figure: HDP together with Pivotal Data Engineering, Pivotal Labs and Pivotal Data Science]
Page 30
SQL on Hadoop Ecosystem (HAWQ)
Challenges | Requirements
Complex joins not supported | Complex joins at performance
Advanced analytics support | Advanced analytics at scale within SQL
Interactive query latency issues | Fast interactive queries on large data
Ad-hoc query performance issues | Strong ad-hoc query support in optimizer
SQL analytic query coverage issues | Full analytic SQL compliance
Concurrent query throughput issues | High query throughput for mixed workloads
Page 31
HAWQ: Enterprise Class SQL on Hadoop
• Leverages market leading Greenplum technology
• 100% ANSI SQL Compliant for analytic workloads
• Advanced cost-based query optimizer
• Highest performing SQL on Hadoop
• Polymorphic storage with advanced compression
• Industry differentiating data federation with PXF*
• Built-in advanced analytics for data science (MADlib)
• Supports all major HDFS file formats (Avro, Parquet, HDFS)
• Integrated with leading analytical tools out-of-the-box
*PXF = Pivotal eXtension Framework
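The "built-in advanced analytics" bullet refers to MADlib, whose routines are invoked from SQL. The sketch below only assembles the text of a MADlib linear-regression training call; it does not connect to a database. The table and column names (`houses`, `price`, `tax`, `bath`, `size`) are hypothetical, and running the statement would need a HAWQ cluster with MADlib installed plus a PostgreSQL-compatible client, which is assumed rather than shown.

```python
def linregr_train_sql(source, model, target, features):
    """Return a MADlib linear-regression training statement as SQL text."""
    # MADlib takes the independent variables as an array expression;
    # the leading 1 adds an intercept term to the model.
    independent = "ARRAY[1, " + ", ".join(features) + "]"
    return (
        f"SELECT madlib.linregr_train('{source}', '{model}', "
        f"'{target}', '{independent}');"
    )

# Hypothetical table and columns, for illustration only.
sql = linregr_train_sql("houses", "houses_model", "price", ["tax", "bath", "size"])
print(sql)
```

The point of the design is that training runs inside the database, next to the data, so nothing has to be exported to a separate analytics tool first. Plain string formatting is used here for readability; real code should not build SQL from untrusted input this way.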
Page 32
Business Benefits (HAWQ)
Feature | Benefit
Rich and compliant SQL dialect | Powerful and portable SQL apps; leverage large SQL-based ecosystems
TPC-DS compliance | Enable a wide range of use cases; avoid surprises in production
Flexible/efficient joins at linear scale | Off-load EDW workloads at a much lower cost
Deep analytics + machine learning | Predictive/advanced learning use cases at scale
Data federation capabilities | Build use cases with diverse/external data assets without data movement
High availability and fault tolerance | Off-load business critical workloads from EDW
Native Hadoop file format support | Reduce ETL and data movement = lower costs
Page 33
Pivotal Query Optimizer (PQO)
For HAWQ and Greenplum Database
Turns a SQL query into an execution plan
• Leading cost-based optimizer for big data
• Applies all possible optimizations at the same time
– Considers many more plan alternatives
– Optimizes a wider range of queries
– Optimizes memory usage
• New extensible code base
– Rapid adoption of emerging technologies
PIVOTAL VALUE-ADDED FUNCTIONALITY
Page 34
Configuring and Managing HAWQ with Ambari
• Install HAWQ/PXF Ambari plugin RPM
• Restart Ambari
• Add HAWQ/PXF service like any other Hadoop component
Page 35
Pivotal eXtension Framework (PXF)
• Enables connectivity between HAWQ and other services (Hive, HBase)
• Provides an extensible framework to add support for custom services
• Operates as a separate service in Hadoop
Industry differentiators
• Low latency on large data sets
• Extensible and customizable
• Considers cost model of federated sources
[Figure: HAWQ connects through PXF to HDFS (Hadoop Distributed File System), Hive, HBase and other services]
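In practice, PXF connectivity is usually expressed as an external table whose location is a `pxf://` URI naming a profile. The sketch below only builds such a DDL statement as text. The host name, port, path, table and columns are all assumptions for illustration (the PXF port in particular varies by release), and `HdfsTextSimple` is used as an example profile for delimited text on HDFS.

```python
def pxf_external_table(table, columns, host, port, path, profile):
    """Return a CREATE EXTERNAL TABLE statement that reads through PXF."""
    # Column definitions, e.g. "id int, amount float8".
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in columns)
    # The profile tells PXF which connector to use for the data source.
    location = f"pxf://{host}:{port}/{path}?PROFILE={profile}"
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols})\n"
        f"LOCATION ('{location}')\n"
        f"FORMAT 'TEXT' (DELIMITER ',');"
    )

# Hypothetical host, port, path and schema.
ddl = pxf_external_table(
    "ext_sales",
    [("id", "int"), ("amount", "float8")],
    host="namenode",
    port=51200,
    path="data/sales",
    profile="HdfsTextSimple",
)
print(ddl)
```

Once such a table exists, it can be queried and joined like any other table, which is how federation works without moving the data first.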
Page 36
Data Driven Journey with Pivotal Big Data Suite
STORE
• Structured
• Unstructured
• High Volume
• High Velocity
ANALYZE
• Predictive Analytics
• Machine Learning
• Advanced Data Science
• Realtime Analytics
DEVELOP
• Advanced Analytic Pipelines
• Realtime Analytical Applications
• Global Scale Data-Driven Applications
• Enterprise, Consumer, IoT, and Mobile
INNOVATE
• Agile Dev Expertise
• DevOps
• Hybrid Cloud
• Continuous Delivery
• Closed Loop Applications
[Figure labels: Agile Development; Big Data Predictive Analytics; Enterprise PaaS]
[Figure: products and services shown along the journey: Spring XD, Spark, Pivotal HD & Open Data Platform, Pivotal Greenplum Database, Pivotal HAWQ, Pivotal GemFire, Redis, RabbitMQ, Spring IO, Groovy, Pivotal BDS on PCF, Pivotal Cloud Foundry, Pivotal Labs, Data Science, Data Engineering]
Page 37
Putting it All Together
[Figure: data feeds, transactional apps and analytic apps connected through real-time data, a data stream pipeline, an HDFS data lake, distributed computing, advanced analytics, and expert systems & machine learning]
Page 38
Putting it All Together
[Figure: the same architecture mapped to products: GemFire for real-time data; Spring XD for the stream pipeline (Ingest, Filter, Enrich, Sink); HAWQ and GPDB for analytics; serving data feeds, transactional apps and analytic apps]
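The Ingest, Filter, Enrich, Sink stages can be pictured as a chain of small functions, each consuming the previous stage's output. The sketch below is plain Python and deliberately does not use Spring XD's own stream definitions; the record format, the validity bounds and the location lookup are all hypothetical.

```python
def ingest(lines):
    """Ingest: parse raw 'sensor_id,value' lines into records."""
    for line in lines:
        sensor_id, value = line.split(",")
        yield {"sensor": sensor_id, "value": float(value)}

def keep_valid(records):
    """Filter: drop readings outside a plausible range (hypothetical bounds)."""
    return (r for r in records if 0.0 <= r["value"] <= 100.0)

def enrich(records, locations):
    """Enrich: attach a location looked up by sensor id."""
    for r in records:
        yield {**r, "location": locations.get(r["sensor"], "unknown")}

def sink(records):
    """Sink: collect records; a real sink would write to HDFS or a database."""
    return list(records)

raw = ["s1,21.5", "s2,-4.0", "s3,48.0"]
stored = sink(enrich(keep_valid(ingest(raw)), {"s1": "plant-a"}))
print(stored)
```

Because each stage is a generator, records flow through one at a time rather than being held in memory between stages, which mirrors how a streaming pipeline behaves.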
Page 39
Demo: HAWQ on HDP (bit.ly/HAWQonHDPVideo)
Tutorial: HAWQ on Sandbox (bit.ly/HAWQonHDPTutorial)
Page 40
© 2015 Open Data Platform initiative. All rights reserved.
THE OPEN DATA PLATFORM INITIATIVE
Page 41
Introducing The Open Data Platform Initiative
Page 42
A shared industry effort to help promote and advance the state of Apache Hadoop® and Big Data technologies for the Enterprise
Page 43
The Open Data Platform will accelerate the delivery of Big Data solutions by providing a well-defined platform called ‘The ODP Core’
Page 44
The ODP Core
▪ The ODP Core is the kernel over which the industry can build enterprise-class Apache Hadoop® solutions
– Simplifying development of interoperable technologies
▪ Created by the ODP Developer Community
– A team of cross industry technical experts
– Individual, or member company developers – anyone can participate
▪ Using an open and transparent planning and release process that follows the Apache Way
– Interoperability within and beyond the ODP Core drives a broad set of use cases and rapid market growth
Page 45
Delivering Enterprise Requirements & Real-world Experience
ODP Member Companies
• Diverse representation of the Big Data eco-system
– End users, ISVs, Systems Integrators, Distribution vendors, etc.
– Any company can join the Open Data Platform
• A forum for the Enterprise to define its Big Data requirements
– Industry groups (SIGs) to align on common industry practices and challenges
• Direct feedback and participation in the ODP Core
– Real world experience determining what is Enterprise grade
Page 46
A Simple Beginning For The ODP Core
▪ The ODP Core is starting with a small number of projects
– Enables a rapid start for the Initiative and an industry driven definition
▪ All members decide how the ODP Core evolves
– All members are responsible for choosing projects to include in the ODP Core
– Platinum, Gold and Silver member companies = One Member / One Vote
ODP Core Initial Projects: HDFS, YARN, MapReduce, Ambari
✓ Deployable Hadoop configuration
✓ Improves interoperability
✓ Gives customers more freedom
✓ Follows the Apache Way
Page 47
Quickly Showing Value To The Industry
Hortonworks, IBM, Pivotal and Infosys Harmonize on Open Data Platform Vision to Accelerate Big Data Solutions
Common core: Apache Hadoop 2.6, Apache Ambari
Distributions shown: HDP 2.2, Open Platform 4.0 with Apache Hadoop, Pivotal HD 3.0, IIP
Key benefits
• Improves ecosystem interoperability
• Unlocks customer choice
• Eliminates wasteful guesswork
• Respects the Apache way
Page 48
How You Can Participate
▪ Anybody can join the ODP
– Company memberships start at $1k
▪ Have a direct voice into the future of big data
▪ Help us define priorities to solve your challenges
▪ Join your peers and accelerate industry solutions
▪ Contribute people, tests, and code to accelerate executing on the vision
ODP: enabling Big Data solutions to flourish atop a common core platform