Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights
DESCRIPTION
Pivotal has set up and operationalized a 1,000-node Hadoop cluster called the Analytics Workbench. It takes special setup and skills to manage such a large deployment. This session shares how we set it up and how to manage it. After this session you will be able to:
Objective 1: Understand what it takes to operationalize a 1,000-node Hadoop cluster.
Objective 2: Understand how to set up a large Hadoop deployment and manage its day-to-day challenges.
Objective 3: Identify the tools needed to solve the challenges of managing a large Hadoop cluster.
TRANSCRIPT
1 © Copyright 2013 EMC Corporation. All rights reserved.
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights
SK Krishnamurthy
Traditional Enterprise Analytics Process
The Fundamental Paradigm Shift
Internet age and exploding data growth
Enterprises leverage new data sources to identify emerging trends and opportunities
Traditional database tools not able to cope
Enter Hadoop
Flexible
Scalable
Inexpensive
Fault-tolerant
Rapidly Adopted
Platform for Big Data
Evolution of Process with Hadoop
HDFS Economics Have Changed the Game
[Chart: Big Data Platform Price/TB, 2008 to 2013, y-axis $0 to $80,000; series: Big Data DB, Hadoop]
Big Data RDBMS pricing will ultimately converge with Hadoop pricing.
The price per TB of Big Data RDBMS has been consistently eroding over time.
Hadoop pricing has increased slightly over time as vendors have injected value-added services into the ecosystem.
7 © Copyright 2013 Pivotal. All rights reserved.
Where We’re Going
Big Data Platform
Analytical Query Operational Intelligence
In-Memory DB
Run-Time Applications
In-Memory Objects
Enterprise Data Warehouse
RDBMS
Continues to serve as system of record
HDFS
Data Staging Platform
Data Mgmt. Services
Data Visualization
Compliance and financial reporting
Traditional BI/Reporting
Pivotal Data Platform
Data Visualization
Stream Ingestion
Streaming Services
Flexible Deployment Model
Portable, Elastic, HW Abstracted, Manageable, “Consumer” grade
Deploy: Public Cloud, On Premise, Private Cloud
PIVOTAL HD The world’s most powerful Hadoop distribution
Pivotal HD
World’s first true SQL processing for enterprise-ready Hadoop
100% Apache Hadoop-based platform
Virtualization and cloud ready with VMware and Isilon
Scale tested in the 1,000-node Pivotal Analytics Workbench
Available as a software-only or appliance-based solution
Backed by EMC’s global, 24x7 support infrastructure
Pivotal Hadoop Distributions
100% Open Source Compatible
GPHD: Apache Hadoop 1.x
Pivotal HD: Apache Hadoop 2.x
Pivotal HD Components
• HDFS – The Hadoop Distributed File System, which acts as the storage layer for Hadoop
• MapReduce – Parallel processing framework used for data computation in Hadoop
• Hive – Structured data warehouse implementation for data in HDFS that provides a SQL-like interface to Hadoop
• Pig – High-level procedural language for data pipeline/data flow processing in Hadoop
• HBase – NoSQL key-value data store on top of HDFS
• Mahout – Library of scalable machine-learning algorithms
• Spring Hadoop – Integrates the Spring framework into Hadoop
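To make the MapReduce entry above concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce steps using word count. It is plain Python and does not use Hadoop or any Pivotal API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, not names from the slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as a framework would."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over in-memory lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    sample = ["hadoop stores data in hdfs", "hadoop processes data with mapreduce"]
    print(word_count(sample))
```

In a real cluster the map and reduce functions run on many nodes and the shuffle moves data across the network; the sketch only shows the data flow between the three steps.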
Pivotal HD Value-Added Components
GPHD Includes… / Pivotal HD Adds the Following to GPHD…
• Installation and Configuration Manager (ICM) – cluster installation, upgrade, and expansion tools.
• GP Command Center – visual interface for cluster health, system metrics, and job monitoring.
• Hadoop Virtualization Extension (HVE) – enhances Hadoop to support virtual node awareness and enables greater cluster elasticity.
• GP Data Loader – parallel loading infrastructure that supports “line speed” data loading into HDFS.
• Isilon Integration – extensively tested at scale with guidelines for compute-heavy, storage-heavy, and balanced configurations.
• Advanced Database Services (HAWQ) – high-performance, “True SQL” query interface running within the Hadoop cluster.
• Extensions Framework (GPXF) – support for HAWQ interfaces on external data providers (HBase, Avro, etc.).
• Advanced Analytics Functions (MADlib) – ability to access parallelized machine-learning and data-mining functions at scale.
Pivotal Core Components & Versions
Component       GPHD 1.2 Core Distribution   Pivotal HD Enterprise
Hadoop          1.0.3                        2.0.2
HBase           0.92.1                       0.94.2
Hive            0.8.1                        0.9.1
Mahout          0.6                          0.8.0
Pig             0.9.2                        0.10.0
Zookeeper       3.3.5                        3.4.3
Flume           1.2.0                        1.2.0
Sqoop           1.4.1                        1.4.1
Spring Hadoop
Pivotal HD Architecture
Apache: HDFS; HBase; Pig, Hive, Mahout; MapReduce; Sqoop; Flume; Resource Management & Workflow (YARN, Zookeeper)
Pivotal HD Enterprise: Deploy, Configure, Monitor, Manage; Command Center; Hadoop Virtualization (HVE); Data Loader
HAWQ – Advanced Database Services: Xtension Framework; Catalog Services; Query Optimizer; Dynamic Pipelining; ANSI SQL + Analytics
DataLoader
Streams
Push
Pull
Connectors
Flume
HDFS
DataLoader
Data Source Registration
Copy Strategy
Optimization
Web GUI and CLI
Data Destination Registration
Data Copy
Job Management
Data Processing
REST APIs
Files
HDFS
NFS
HTTP
FTP
Local
Command Center: Simple and complete cluster management
Deploy, Configure, Monitor, Manage, Analyze
Install and configure Hadoop components and services
Centralized interface for Pivotal HD cluster monitoring, diagnostics, and management
Live and historical Hadoop system metrics analysis
Command Center – Monitor, Manage, and Analyze
Host-, application-, and job-level performance monitoring across the entire Pivotal HD cluster
Visualize and analyze live and historical Hadoop cluster information through the Command Center Dashboard
Quick diagnostics of functional or performance issues
Hadoop Virtualization Extensions (HVE)
• HVE enables Hadoop to support more effective virtual deployments
• This creates the opportunity to provision and scale the compute and storage processes independently, resulting in:
• Much better resource utilization
• Improved resource allocation and consumption
• Support for multi-tenancy
HAWQ
HAWQ: The Crown Jewels of Greenplum
SQL compliant
World-class query optimizer
Interactive query
Horizontal scalability
Robust data management
Common Hadoop formats
Deep analytics
HAWQ: High-Performance Query Processing
Interactive and true ANSI SQL support
Multi-petabyte horizontal scalability
Cost-based parallel query optimizer
Programmable analytics
HAWQ: Enterprise-Class Database Services & Management
Scatter-gather data loading
Row and column storage
Workload management
Multi-level partitioning
3rd-party tool & open client interfaces
HAWQ: Pre-integrated Deep Analytics
Performance via fully parallelized implementation
Consistent, user-friendly SQL interfaces
Ease of data preparation
Pre-integrated MADlib support:
– Linear Regression
– Logistic Regression
– Multinomial Logistic Regression
– K-Means
– Association Rules
– PLDA (useful for topic modeling)
GPDB – Components
GPDB
Query Engine Catalog Service
Local File System
Resource Management
GPXF
Planner Optimizer
Executor Transaction Manager
HAWQ – Components
GPSQL
Query Engine Catalog Service
HDFS
Resource Management
GPXF
Planner Optimizer
Executor Transaction Manager
How HAWQ Works
Clients: JDBC/ODBC, SQL Console
SELECT beer, price FROM Bars b, Sells s WHERE b.name = s.bar AND b.city = 'San Francisco'
HDFS Namenode, HAWQ Master Host: Query Parser, Query Optimizer
HDFS Datanode, HAWQ Segment Host (repeated across nodes): Query Executor
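The join on this slide can be tried locally to see what the query returns. The sketch below runs it against an in-memory SQLite database, which stands in for HAWQ purely for illustration. The table layouts (`Bars(name, city)`, `Sells(bar, beer, price)`) and the sample rows are assumptions inferred from the column names in the query; the slide does not show the schema. An `ORDER BY` is appended only to make the output deterministic.

```python
import sqlite3

# The query text from the slide, with straight quotes around the literal.
QUERY = """
SELECT beer, price
FROM Bars b, Sells s
WHERE b.name = s.bar AND b.city = 'San Francisco'
"""


def run_demo() -> list:
    """Create hypothetical Bars/Sells tables, run the slide's join, return rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Bars (name TEXT PRIMARY KEY, city TEXT);
        CREATE TABLE Sells (bar TEXT, beer TEXT, price REAL);
        INSERT INTO Bars VALUES ('Anchor Tap', 'San Francisco'),
                                ('Harbor Pub', 'Oakland');
        INSERT INTO Sells VALUES ('Anchor Tap', 'Steam Ale', 6.0),
                                 ('Anchor Tap', 'Porter', 7.5),
                                 ('Harbor Pub', 'Lager', 5.0);
        """
    )
    # ORDER BY is added here only so the result order is stable.
    rows = conn.execute(QUERY + " ORDER BY beer").fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for beer, price in run_demo():
        print(beer, price)
```

Only beers sold by bars located in San Francisco are returned; the Oakland bar's row is filtered out by the join condition and the city predicate.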
How HAWQ Works (same diagram, with labels added on successive slides)
Optimization Context, Cost Model, Resources, Parse Tree, Metadata
Execution Plan
Dynamic Pipelining™
HAWQ Deployment
ODBC/JDBC Driver
Master Servers & Name Nodes: Query planning & dispatch
Segment Servers & Data Nodes: Query processing & data storage
External Sources: Loading, streaming, etc.
Dynamic Pipelining
HDFS
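The split between a master that plans and dispatches and segments that process local data can be illustrated with a toy scatter-gather aggregation. This is a sketch under stated assumptions, not HAWQ's implementation: the segment data, the function names, and the use of threads to stand in for segment hosts are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Row = Tuple[str, float]  # (beer, price)

# Hypothetical data: each inner list stands in for rows stored on one segment host.
SEGMENTS: List[List[Row]] = [
    [("Porter", 7.5), ("Lager", 5.0)],
    [("Lager", 5.5), ("Stout", 8.0)],
    [("Porter", 7.0)],
]


def segment_execute(rows: List[Row]) -> Dict[str, Tuple[float, int]]:
    """Segment side: compute a partial (sum, count) per beer over local rows."""
    partial: Dict[str, Tuple[float, int]] = {}
    for beer, price in rows:
        total, count = partial.get(beer, (0.0, 0))
        partial[beer] = (total + price, count + 1)
    return partial


def master_dispatch(segments: List[List[Row]]) -> Dict[str, float]:
    """Master side: send the same work to every segment, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(segment_execute, segments))
    merged: Dict[str, Tuple[float, int]] = {}
    for partial in partials:
        for beer, (total, count) in partial.items():
            m_total, m_count = merged.get(beer, (0.0, 0))
            merged[beer] = (m_total + total, m_count + count)
    # Final step: turn merged (sum, count) pairs into average prices.
    return {beer: total / count for beer, (total, count) in merged.items()}


if __name__ == "__main__":
    print(master_dispatch(SEGMENTS))
```

Sending partial sums and counts rather than raw rows keeps the data moved to the master small, which is the usual reason for pushing aggregation down to the nodes that hold the data.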
Xtension Framework
An advanced version of GPDB external tables
Enables combining HAWQ data and Hadoop data in a single query
Supports connectors for HDFS, HBase, and Hive
Provides an extensible framework API to enable custom connector development for other data sources
HAWQ Benchmarks
User intelligence: 4.2 vs. 198 (47X)
Sales analysis: 8.7 vs. 161 (19X)
Click analysis: 2.0 vs. 415 (208X)
Data exploration: 2.7 vs. 1,285 (476X)
BI drill down: 2.8 vs. 1,815 (648X)
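The multipliers on this slide appear to be the ratio of the second figure to the first for each workload. The transcript does not preserve the column labels or units, so that reading is an assumption; the short check below recomputes each ratio and confirms it matches the published multiplier after rounding.

```python
import math

# (workload, first figure, second figure) exactly as listed on the slide.
BENCHMARKS = [
    ("User intelligence", 4.2, 198),
    ("Sales analysis", 8.7, 161),
    ("Click analysis", 2.0, 415),
    ("Data exploration", 2.7, 1285),
    ("BI drill down", 2.8, 1815),
]


def speedup(first: float, second: float) -> int:
    """Ratio of the second figure to the first, rounded half up."""
    return math.floor(second / first + 0.5)


SPEEDUPS = {name: speedup(first, second) for name, first, second in BENCHMARKS}

if __name__ == "__main__":
    for name, factor in SPEEDUPS.items():
        print(f"{name}: {factor}X")
```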
Pivotal Analytics Workbench (AWB)
Commitment to Accelerating Innovation & Contributing to the Apache Community
• Multi-million dollar investment by Pivotal and partners in a 1,000-node, 24-petabyte cluster to facilitate innovation and conduct regular integration/scale testing of Apache Hadoop
• Full-time, dedicated integration team onboarding projects and validating each release of Apache Hadoop at scale
• Contributing back our results and findings to the open source community as well as incorporating them into the continued development of Pivotal HD
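As a rough sense of scale, the figures above work out to about 24 TB per node. The sketch below does that arithmetic and also estimates how much unique data would fit if the 24 PB is raw capacity and HDFS keeps its default of three replicas per block. Both the decimal unit conversion and the replication setting are assumptions; the slides do not say whether 24 PB is raw or usable, or how the Analytics Workbench is configured.

```python
NODES = 1_000           # from the slide
CLUSTER_PB = 24         # from the slide
TB_PER_PB = 1_000       # assumption: decimal units
REPLICATION_FACTOR = 3  # assumption: HDFS default for dfs.replication


def per_node_tb(cluster_pb: float, nodes: int) -> float:
    """Average storage per node in TB."""
    return cluster_pb * TB_PER_PB / nodes


def unique_data_pb(raw_pb: float, replication: int) -> float:
    """Unique data that fits when every block is stored `replication` times."""
    return raw_pb / replication


if __name__ == "__main__":
    print(per_node_tb(CLUSTER_PB, NODES), "TB per node")
    print(unique_data_pb(CLUSTER_PB, REPLICATION_FACTOR), "PB of unique data")
```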
“Real” Hadoop Cluster
Leveraging Full Power of the Family
Pivotal Sessions at EMC World

The Pivotal Platform: A Purpose-Built Platform for Big-Data-Driven Applications
  Presenter: Josh Klahr
  Dates/Times: Tue 5:30 - 6:30, Palazzo E; Wed 11:30 - 12:30, Delfino 4005

Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action
  Presenter: Noelle Sio
  Dates/Times: Tue 10:00 - 11:00, Lando 4205; Thu 8:30 - 9:30, Palazzo F

Pivotal: Operationalizing 1000-node Hadoop Cluster – Analytics Workbench
  Presenters: Clinton Ooi, Bhavin Modi
  Dates/Times: Tue 11:30 - 12:30, Palazzo L; Thu 10:00 - 11:00 am, Delfino 4001A

Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Insights
  Presenter: SK Krishnamurthy
  Dates/Times: Mon 4:00 - 5:00, Lando 4201 A; Tue 4:00 - 5:00, Palazzo M

Pivotal: Big & Fast Data – Merging Real-Time Data and Deep Analytics
  Presenter: Michael Crutcher
  Dates/Times: Mon 1:00 - 2:00, Lando 4201 A; Wed 10:00 - 11:00, Palazzo M

Pivotal: Virtualize Big Data to Make The Elephant Dance
  Presenters: June Yang, Dan Baskette
  Dates/Times: Mon 11:30 - 12:30, Marcello 4401A; Wed 4:00 - 5:00, Palazzo E

Hadoop Design Patterns
  Presenter: Don Miner
  Dates/Times: Mon 2:30 - 3:30, Palazzo F; Wed 8:30 - 9:30, Delfino 4005