pivotal: virtualize big data to make the elephant dance

34
1 © Copyright 2013 EMC Corporation. All rights reserved. Virtualize Big Data to Make the Elephant Dance June Yang, Senior Director of Product Management, VMWare Dan Baskett, Senior Consultant Technologist, Pivotal

Upload: emc-academic-alliance

Post on 26-Jan-2015

104 views

Category:

Technology


1 download

DESCRIPTION

Big Data and virtualization are two of the hottest trends in the industry today, yet the full potential for bringing the two together has not been fully realized. In this session, learn how virtualization brings the advantages of greater elasticity, stronger isolation for multi-tenancy, and a click HA protection to Hadoop, while maintaining the comparable performance to Hadoop on physical machines. Objective 1: Understand the benefits of virtualizing Hadoop. After this session you will be able to: Objective 2: Understand how to get started with Pivotal HD Hadoop . Objective 3: Understand where to find more information.

TRANSCRIPT

Page 1: Pivotal: Virtualize Big Data to Make the Elephant Dance

1 © Copyright 2013 EMC Corporation. All rights reserved.

Virtualize Big Data to Make the Elephant Dance

June Yang, Senior Director of Product Management, VMWare Dan Baskett, Senior Consultant Technologist, Pivotal

Page 2: Pivotal: Virtualize Big Data to Make the Elephant Dance

2 © Copyright 2013 EMC Corporation. All rights reserved.

Unstructured Data is exploding… Hadoop is driving growth

Unstructured data driving growth Hadoop adoption is ramping

2011 2012 2013 2014 2015 2016 2017 2018 2019 2020

Structured Unstructured

Complex unstructured data forecasted to outpace structured

relational data by 10x by 2020

Evaluating53%In-

production23%

Piloting18%

Testing2%

Don't know2%

Other2%

Source: Forrester Survey of 60 CIOs , September 2011

• Unstructured data explosion and Hadoop capabilities causing CIOs to reconsider Enterprise data strategy

• Gartner predicts +800% data growth over next 5 years • Hadoop’s ability to process raw data at cost presents intriguing value prop for CIOs

Page 3: Pivotal: Virtualize Big Data to Make the Elephant Dance

3 © Copyright 2013 EMC Corporation. All rights reserved.

Log Processing / Click Stream Analytics

Machine Learning / sophisticated data mining

Web crawling / text processing

Extract Transform Load (ETL) replacement

Image / XML message processing

Broad Application of Hadoop Technology

General archiving / compliance

Financial Services

Mobile / Telecom

Internet Retailer

Scientific Research

Pharmaceutical / Drug Discovery

Social Media

Vertical Industries Use Cases

Hadoop is a platform that will revolutionize how Enterprises handle data

Page 4: Pivotal: Virtualize Big Data to Make the Elephant Dance

4 © Copyright 2013 EMC Corporation. All rights reserved.

The Big Data Journey in the Enterprise

Stage 3: Cloud Analytics Platform • Serve many departments

• Often part of mission critical workflow • Fully integrated with analytics/BI tools

Stage1: Hadoop Piloting • Often start with line of business • Try 1 or 2 use cases to explore

the value of Hadoop

Stage 2: Hadoop Production

• Serve a few departments • More use cases

• Growing # and size of clusters • Core Hadoop + components

10’s 100’s 0 node

Integrated

Scale

Page 5: Pivotal: Virtualize Big Data to Make the Elephant Dance

5 © Copyright 2013 EMC Corporation. All rights reserved.

Deploy Hadoop Clusters in Minutes

Page 6: Pivotal: Virtualize Big Data to Make the Elephant Dance

6 © Copyright 2013 EMC Corporation. All rights reserved.

One click to scale out your cluster on the fly

Page 7: Pivotal: Virtualize Big Data to Make the Elephant Dance

7 © Copyright 2013 EMC Corporation. All rights reserved.

Customize your Hadoop/Hbase Cluster

Customize with Cluster Specification File

Page 8: Pivotal: Virtualize Big Data to Make the Elephant Dance

8 © Copyright 2013 EMC Corporation. All rights reserved.

Cluster Spec File Details

Resource configuration

Cluster Specification File "groups":[ { "name":"master", "roles":[ "hadoop_namenode", "hadoop_jobtracker”], "storage": { "type": "SHARED”, sizeGB": 20}, "instance_type":MEDIUM, "instance_num":1, "ha":true}, {"name":"worker", "roles":[ "hadoop_datanode", "hadoop_tasktracker" ], "instance_type":SMALL, "instance_num":5, "ha":false …

Storage configuration Choice of shared storage or Local disk

High availability option

# of Hadoop nodes

Page 9: Pivotal: Virtualize Big Data to Make the Elephant Dance

9 © Copyright 2013 EMC Corporation. All rights reserved.

Your Choice of Hadoop Distributions and Tools Community Projects Distributions

• Flexibility to choose and try out major distributions • Support for multiple projects • Open architecture to welcome industry participation • Contributing Hadoop Virtualization Extensions (HVE) to open source

community

Page 10: Pivotal: Virtualize Big Data to Make the Elephant Dance

10 © Copyright 2013 EMC Corporation. All rights reserved.

Proactive monitoring with VCOPs Proactively monitoring through VCOPs Gain comprehensive visibility Eliminate manual processes with intelligent automation Proactively manage operations Alternatively, use monitoring tools like Nagios, Ganglia

Page 11: Pivotal: Virtualize Big Data to Make the Elephant Dance

11 © Copyright 2013 EMC Corporation. All rights reserved.

Beyond day 1 - Automation of Hadoop Cluster lifecycle management

Deploy

Customize

Load data

Execute jobs

Tune configuration

Scaling

Page 12: Pivotal: Virtualize Big Data to Make the Elephant Dance

12 © Copyright 2013 EMC Corporation. All rights reserved.

The Big Data Journey in the Enterprise

Stage1: Hadoop Piloting Rapid deployment

On the fly cluster resizing Choice of Hadoop distros

Automation of cluster lifecycle

Stage 2: Hadoop Production • Serve a few departments

• More use cases • Growing # and size of clusters

• Core Hadoop + components

Integrated

10’s 100’s 0 node Scale

Page 13: Pivotal: Virtualize Big Data to Make the Elephant Dance

13 © Copyright 2013 EMC Corporation. All rights reserved.

Achieve HA for the Entire Hadoop Stack

HDFS (Hadoop Distributed File System)

HBase (Key-Value store)

MapReduce (Job Scheduling/Execution System)

Pig (Data Flow) Hive (SQL)

BI Reporting ETL Tools

Man

agem

ent

Ser

ver

Zook

eepr

(C

oord

inat

ion)

HCatalog

RDBMS

Namenode

Jobtracker

Hive MetaDB Hcatalog MDB

Server • vSphere HA is battle-tested high availability technology • Single mechanism to achieve HA for the entire Hadoop stack • One click to enable HA and/or FT

Page 14: Pivotal: Virtualize Big Data to Make the Elephant Dance

14 © Copyright 2013 EMC Corporation. All rights reserved.

Challenges of Running Hadoop in Enterprises

Production

Test

Experimentation

Dept A: recommendation engine Dept B: ad targeting

Production

Test

Experimentation

Log files

Social data Transaction data Historical cust behavior

Pain Points: 1. Cluster sprawling 2. Redundant common data in

separate clusters 3. Difficult use the right tool for

the right problem 4. Peak compute and I/O

resource is limited to number of nodes in each independent cluster

NoSQL Real time SQL …

On the horizon…

Page 15: Pivotal: Virtualize Big Data to Make the Elephant Dance

15 © Copyright 2013 EMC Corporation. All rights reserved.

What if you can…

Experimentation

Production recommendation engine

Production Ad Targeting

Test/Dev

Production

Test

Production

Test

Experimentation

Recommendation engine Ad targeting

Experimentation

One physical platform to support multiple virtual big data clusters

Page 16: Pivotal: Virtualize Big Data to Make the Elephant Dance

16 © Copyright 2013 EMC Corporation. All rights reserved.

Bigger is Better

Hadoop is linearly scalable, more nodes, better performance, for the same job, it will take

– 2 hour to complete on a 50 node cluster – 1 hour to complete on a 100 node cluster – 30 min to complete on a 200 node cluster

Page 17: Pivotal: Virtualize Big Data to Make the Elephant Dance

17 © Copyright 2013 EMC Corporation. All rights reserved.

You may ask

What about differentiated SLAs – For production Hadoop jobs, need to ensure high priority

– Lower priority of experimental Hadoop jobs.

Will I have a noisy neighbor problems with shared infrastructure approach?

Page 18: Pivotal: Virtualize Big Data to Make the Elephant Dance

18 © Copyright 2013 EMC Corporation. All rights reserved.

VM Containers with Isolation are a Tried and Tested Approach

Host Host Host Host Host Host

VMware vSphere + Serengeti

Host

Hungry Workload 1 Reckless Workload 2

Noisy Workload 3

Page 19: Pivotal: Virtualize Big Data to Make the Elephant Dance

19 © Copyright 2013 EMC Corporation. All rights reserved.

Shared infrastructure: Three big types of Isolation are Required

Resource Isolation • Control the greedy noisy neighbor

• Reserve resources to meet needs

Version Isolation • Allow concurrent OS, App, Distro versions

Security Isolation • Provide privacy between users/groups

• Runtime and data privacy required

Host Host Host Host Host Host

VMware vSphere + Serengeti

Host

Page 20: Pivotal: Virtualize Big Data to Make the Elephant Dance

20 © Copyright 2013 EMC Corporation. All rights reserved.

With virtualization, you can have your cake and eat it too

One physical platform to support multiple virtual big data clusters

– Share data to minimize copying – Single infrastructure to

maintain – Bigger cluster for better

performance – Share hardware resource to

achieve higher utilization

Virtualization ensures strong isolation between clusters.

– Resource isolation. – Failure isolation – Configure isolation – Security isolation

Compute layer

Data layer

VMware vSphere + Serengeti

High Priority

Low Priority Experimentation

Production recommendation engine

Production Ad Targeting

Test/Dev

Page 21: Pivotal: Virtualize Big Data to Make the Elephant Dance

21 © Copyright 2013 EMC Corporation. All rights reserved.

Storage

Elastic Hadoop with Virtualization

Compute Combined Storage/Compute Storage

T1 T2 VM VM VM

VM VM

VM

Unmodified Hadoop node in a VM VM lifecycle

determined by Datanode

Limited elasticity

Separate Compute from Storage Separate compute

from data Stateless compute Elastic compute

Separate Virtual Compute Clusters per tenant Separate virtual compute Compute cluster per tenant Stronger VM-grade security

and resource isolation

Hadoop Node

Page 22: Pivotal: Virtualize Big Data to Make the Elephant Dance

22 © Copyright 2013 EMC Corporation. All rights reserved.

Scale in/out Hadoop dynamically Deploy separate compute clusters for different tenants sharing HDFS.

Commission/decommission task trackers according to priority and available resources

Experimentation Dynamic resourcepool

Data layer

Production recommendation engine

Compute layer Compute VM

Compute VM

Compute VM

Compute VM

Compute VM

Compute VM

Compute VM

Compute VM

Compute VM

Compute VM

Compute VM

Compute VM

Compute VM

Compute VM

Compute VM

Experimentation Production

Compute VM

Job Tracker

Job Tracker

VMware vSphere + Serengeti

Page 23: Pivotal: Virtualize Big Data to Make the Elephant Dance

23 © Copyright 2013 EMC Corporation. All rights reserved.

The Big Data Journey in the Enterprise

Stage1: Hadoop Piloting Rapid deployment

On the fly cluster resizing Choice of Hadoop distros

Automation of cluster lifecycle

Stage 2: Hadoop Production High Availability Consolidation

Differentiated SLAs Elastic Scaling

Integrated Stage 3: Cloud Analytics Platform • Serve many departments

• Often part of mission critical workflow • Fully integrated with analytics/BI tools

10’s 100’s 0 node Scale

Page 24: Pivotal: Virtualize Big Data to Make the Elephant Dance

24 © Copyright 2013 EMC Corporation. All rights reserved.

Cloud Analytics Platform

E T L

Real Time Streams

HDFS

Real Time Structured Database

Data Warehouse

Unstructured and Batch Processing

Stream Processing

Compute Storage Networking Cloud Infrastructure

Automated Models

Business Intelligence

Machine Learning CETAS

Data Visualization

Page 25: Pivotal: Virtualize Big Data to Make the Elephant Dance

25 © Copyright 2013 EMC Corporation. All rights reserved.

Big Data Tools and Characteristics Framework Scale of

data Scale of Cluster

Computable Data?

Local Disks?

Map-reduce: Hadoop

100s PB 10s to 1,000s Yes Yes, for cost, bandwidth and availability

Big-SQL: HawQ,, Aster Data, Impala, …

PB’s 10s to 100s Some Yes, for cost and bandwidth

No-SQL: Cassandra, hBase, …

Trilions Of rows

10s to 100s Some Yes, for cost and availability

In-Memory: Redis, Gemfire, Membase, …

Billions of rows 10s-100s Yes Primarily Memory

Page 26: Pivotal: Virtualize Big Data to Make the Elephant Dance

26 © Copyright 2013 EMC Corporation. All rights reserved.

Choose a platform that… Allows user to pick the right tools at the right

time Put resources where needed based on SLA policy

Page 27: Pivotal: Virtualize Big Data to Make the Elephant Dance

27 © Copyright 2013 EMC Corporation. All rights reserved.

Ad hoc data mining

In-house Hadoop as a Service – (Hadoop + Hadoop)

Compute layer

Data layer

HDFS

Host Host Host Host Host Host

Production recommendation engine

Production ETL of log files

VMware vSphere + Serengeti

HDFS

Page 28: Pivotal: Virtualize Big Data to Make the Elephant Dance

28 © Copyright 2013 EMC Corporation. All rights reserved.

Hadoop batch analysis

Integrated Big Data Production – (Mixed big data workloads)

HDFS

Host Host Host Host Host Host

HBase real-time queries

NoSQL – Cassandra key-value store

MPP DBMS – Analysis of structured data

Compute layer

Data layer

VMware vSphere + Serengeti

Page 29: Pivotal: Virtualize Big Data to Make the Elephant Dance

29 © Copyright 2013 EMC Corporation. All rights reserved.

Short-lived Hadoop compute cluster

Integrated Hadoop and Webapps – (Big Data + Other Workloads)

HDFS

Host Host Host Host Host Host

Web servers for ecommerce site

Compute layer

Data layer

Hadoop compute cluster

VMware vSphere + Serengeti

Page 30: Pivotal: Virtualize Big Data to Make the Elephant Dance

30 © Copyright 2013 EMC Corporation. All rights reserved.

The Big Data Journey in the Enterprise

Stage1: Hadoop Piloting Rapid deployment

On the fly cluster resizing Choice of Hadoop distros

Automation of cluster lifecycle

Stage 2: Hadoop Production

High Availability Consolidation

Differentiated SLAs Elastic Scaling

Integrated

Stage 3: Cloud Analytics Platform Mixed workloads

Right tool at the right time Flexible and elastic infrastrure

10’s 100’s 0 node Scale

Page 31: Pivotal: Virtualize Big Data to Make the Elephant Dance

31 © Copyright 2013 EMC Corporation. All rights reserved.

Learn More Download and try Serengeti

– projectserengeti.org • VMware Hadoop site

– vmware.com/hadoop

• Hadoop performance on vSphere white paper

– http://www.vmware.com/files/pdf/techpaper/hadoop-vsphere51-32hosts.pdf

• Hadoop virtualization extensions (HVE) Whitepaper

– http://www.vmware.com/files/pdf/techpaper/hadoop-vsphere51-32hosts.pdf

Page 32: Pivotal: Virtualize Big Data to Make the Elephant Dance

32 © Copyright 2013 EMC Corporation. All rights reserved.

Thank You!

June Yang Senior Director, VMware [email protected]

Dan Baskette Senior Consultant Technologist [email protected]

Page 33: Pivotal: Virtualize Big Data to Make the Elephant Dance

33 © Copyright 2013 EMC Corporation. All rights reserved.

Pivotal Sessions at EMC World Session Presenter Dates/Times The Pivotal Platform: A Purpose-Built Platform for Big-Data-Driven Applications

Josh Klahr Tue 5:30 - 6:30, Palazzo E Wed 11:30 - 12:30, Delfino 4005

Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action

Noelle Sio Tue 10:00 - 11:00, Lando 4205 Thu 8:30 - 9:30, Palazzo F

Pivotal: Operationalizing 1000-node Hadoop Cluster – Analytics Workbench

Clinton Ooi Bhavin Modi

Tue 11:30 - 12:30, Palazzo L Thu 10:00- 11:00 am, Delfino 4001A

Pivotal: for Powerful Processing of Unstructured Data For Valuable Insights

SK Krishnamurthy

Mon 4:00 - 5:00, Lando 4201 A Tue 4:00 - 5:00, Palazzo M

Pivotal: Big & Fast data – merging real-time data and deep analytics

Michael Crutcher

Mon 1:00 - 2:00, Lando 4201 A Wed 10:00 - 11:00, Palazzo M

Pivotal: Virtualize Big Data to Make The Elephant Dance June Yang Dan Baskette

Mon 11:30 - 12:30, Marcello 4401A Wed 4:00 - 5:00, Palazzo E

Hadoop Design Patterns Don Miner Mon 2:30 - 3:30, Palazzo F Wed 8:30 - 9:30, Delfino 4005

Page 34: Pivotal: Virtualize Big Data to Make the Elephant Dance