key trends in big data and new reference architecture from hewlett packard enterprise / gilles...

A modern, flexible approach to Hadoop implementation

HPE Big Data Reference Architecture

Gilles Noisette

HPE EMEA Big Data Center Of Excellence

November 2015

Agenda

• Big Data IT infrastructure trends

• Hadoop Evolution & Architecture trends

• Hadoop YARN Labelling

• Hadoop Storage Tiering

• New HPE Architecture approach to Big Data

• HPE Big Data Reference Architecture

• Scaling Hadoop more efficiently

• HPE BDRA Components

• HPE BDRA in a virtualized context

• HPE Big Data Architecture long term view

IT infrastructures must evolve to handle Big Data demands

• Multiple silos with multiple copies of the same data

• Difficult to standardize on a consistent server architecture

• Less elastic than other virtualized or converged infrastructure

• Large scale makes density, cost and power problematic

Challenges

The Analytic Cycle

The Pace of Change


And how people are buying Hadoop is also changing…

Hadoop YARN Labelling

Running applications on a particular set of nodes

YARN Labelling (Node-labels / Hadoop 2.6 / jira YARN-796)

Capability to create groups of similar nodes, so that applications with different workloads each run on the most appropriate group of nodes

• Admin tags nodes with labels (e.g.: GPU, Storm)

− One node can have more than one label (e.g.: GPU, m710)

• Applications can include labels in container requests
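
The matching described above can be illustrated with a minimal Python sketch. It is a toy model of label-based placement, not the YARN API: the node names and the `eligible_nodes` helper are hypothetical, and only the labels (Storm, GPU, m710, Analytic, XL170r) come from the slides.

```python
from typing import Dict, List, Set

# Hypothetical node names; the labels mirror the examples given on the slides.
NODE_LABELS: Dict[str, Set[str]] = {
    "nodemanager-1": {"Storm"},
    "nodemanager-2": {"GPU", "m710"},
    "nodemanager-3": {"Analytic", "XL170r"},
}


def eligible_nodes(node_labels: Dict[str, Set[str]], requested: str) -> List[str]:
    """Return the nodes (sorted by name) whose label set contains the requested label."""
    return sorted(node for node, labels in node_labels.items() if requested in labels)


# An application that asks for a GPU is only offered the GPU-labelled node.
gpu_nodes = eligible_nodes(NODE_LABELS, "GPU")
print(gpu_nodes)
```

In a real cluster the scheduler performs this filtering; the sketch only shows why tagging nodes lets a container request land on the right hardware.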

Enabling the next generation of Hadoop applications...

[Diagram: an Application Master requests "I want a GPU". Three NodeManagers carry labels: [Storm]; [GPU, m710] on an HPE Moonshot cartridge; [Analytic, XL170r] on HPE Apollo blades.]

YARN Labels are used in production

YARN Labelling case studies

Vinod Vavilapalli – @Tshooter

Yahoo! uses machines with GPUs on #Hadoop clusters (#YARN) to model

'beautiful' images on Flickr. #hadoopsummit

1:43 AM - 16 Apr 2015

Vinod Vavilapalli – @Tshooter

.@pcnudde talking about #Yahoo using custom #Hadoop #YARN apps together

with Node labels / High CPU machines for learning. #hadoopsummit

1:49 AM - 16 Apr 2015

Yahoo uses YARN labels

eBay's cluster uses YARN labels to

• Separate Machine Learning workloads from regular workloads

• Restrict licensed software to some machines

• Enable GPU workloads

• Separate organizational workloads

– Mayank Bansal, eBay

Hadoop Storage Tiering

Hadoop Architecture trends

HDFS Tiering / Heterogeneous Storage Tiers (HDFS-2832)

Allows a single cluster to have multiple storage tiers such as ARCHIVE, DISK, SSD, RAM-disk.

Awareness of storage media allows HDFS to make better decisions about the placement of block data, with input from applications. Replicas can be distributed according to the performance and durability requirements of the data.

• Phase2:

–HDFS-5682 - Application APIs for heterogeneous storage

–HDFS-7228 - SSD storage tier

–HDFS-5851 - Memory as a storage tier

HDFS Archival Storage Design (HDFS-6584)

– Introduces a new concept of storage policies. To accommodate future storage technology and different cluster characteristics, cluster administrators will be able to modify the predefined storage policies and/or define custom storage policies.

– Data policy names: Very Hot, Hot, Warm, Luke Warm, Cold

eBay uses tiered storage for its Hadoop cluster

HDFS Tiering case study

A 40 PB / 2000-node cluster was getting full

HDFS Tiering features

• Data resides on the same cluster, in standard HDFS

• Data can easily move to and from the archive

• Tiered storage is operated using storage types and storage policies

• Archival policy is based on access pattern

– Antony Benoy, eBay

[Diagram: one HDFS cluster with a DISK tier (40 PB / 2000 nodes) and an ARCHIVAL tier (10 PB / 48 nodes)]

Hadoop gets asymmetric

but I thought we were taking the work to the data…

[Diagram: labels isolate applications on designated nodes]

• Hot: all replicas on DISK

• Warm: 1 replica on DISK, others on ARCHIVE

• Cold: all replicas on ARCHIVE
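
The Hot, Warm and Cold placement rules above can be written down as a small Python sketch. This is an illustration only, not HDFS code: the `replica_storage_types` function is hypothetical, and the default replication factor of 3 is an assumption (it matches the usual HDFS default).

```python
from typing import List


def replica_storage_types(policy: str, replication: int = 3) -> List[str]:
    """Storage type used for each replica under the Hot/Warm/Cold rules."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    name = policy.lower()
    if name == "hot":
        # All replicas on DISK.
        return ["DISK"] * replication
    if name == "warm":
        # One replica on DISK, the others on ARCHIVE.
        return ["DISK"] + ["ARCHIVE"] * (replication - 1)
    if name == "cold":
        # All replicas on ARCHIVE.
        return ["ARCHIVE"] * replication
    raise ValueError(f"unknown policy: {policy}")


for policy_name in ("Hot", "Warm", "Cold"):
    print(policy_name, replica_storage_types(policy_name))
```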

[Diagram: Hadoop cluster made of DISK and ARCHIVE storage nodes]

YARN Labels

Allows applications running in YARN containers to be constrained to designated nodes in the cluster

HDFS Tiering

Allows the creation of pools of storage for SSD, HDD, Archive and RAM-disk, leveraging different server configurations

What about data locality?

New complementary approach to address Big Data demands

Storage Optimized Servers

Benefits of HPE Big Data Reference Architecture

HPE Moonshot and Apollo servers address a variety of enterprise big data needs

• Cluster consolidation: multiple big data environments can directly access a shared pool of data

• Flexibility to scale: scale compute and storage independently

• Maximum elasticity: rapidly provision compute without affecting storage

• Breakthrough economics: significantly better density, cost and power through workload optimized components

DFSIO testing on Big Data Reference Architecture

Better numbers with optimized IO servers for HDFS

HPE Big Data Reference Architecture

Hadoop and its ecosystem take advantage of the BDRA

[Diagram labels: Ethernet; Network Switches (East-West Networking); Impala]

HPE Hadoop Traditional vs HPE Big Data Reference Architecture

2X Hadoop MapReduce performance with the same footprint

2.5X HBase performance with the same footprint

Note: Comparison configuration is ProLiant DL380 Gen9 servers

2 x Higher Density

2.4 x Memory Density

46% Less Power (Watts)


1.5PB configuration example

Comparable Hadoop performance and raw compute (SpecInt) power

BDRA compared to 2U rackmount:

• Acquisition cost: 3% lower

• Power: 54% lower

• Density (total rack U): 2x density

• 5 year power/cooling savings (assuming $0.20/kWh): $472K
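
As a sanity check on the savings figure, the short Python sketch below works backwards from $472K over 5 years at $0.20/kWh to the average power-and-cooling load that would have to be saved. It assumes continuous operation (24 x 365 hours per year) and a constant electricity price; neither assumption is stated in the slides, and the function name is made up for the example.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: equipment runs continuously, leap days ignored


def implied_average_saving_kw(total_savings_usd: float, years: float, usd_per_kwh: float) -> float:
    """Average load (kW) that must be saved to reach the given cost savings."""
    kwh_saved = total_savings_usd / usd_per_kwh
    return kwh_saved / (years * HOURS_PER_YEAR)


saving_kw = implied_average_saving_kw(472_000, 5, 0.20)
print(f"Implied average saving: {saving_kw:.1f} kW")
```

Under these assumptions the figure corresponds to roughly 54 kW of combined power and cooling saved on average.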

Independent scaling of compute and storage

[ HPE ProLiant DL380 Gen9 ] vs [ HPE Moonshot for Computing + HPE ProLiant Apollo 4200 for Storage ]

HPE Big Data Reference Architecture configurations compared with the Traditional Architecture, from HOT to COLD:

• 2.8x compute, 97% of the storage capacity, 4x the memory

• 1.6x compute, 1.5x the storage capacity, 2.5x the memory

• 90% of the compute, 2.1x the storage capacity, 1.5x the memory

HPE BDRA Components


Hadoop performance density: more than 2 times better. Power consumption: 0.5x

HPE Big Data Reference Architecture

Scale-Out Building blocks

Storage optimized servers

• HPE Apollo Scalable System: cost-effective industry standard storage server, purpose built for big data, with converged infrastructure that offers high density, energy-efficient storage

HPE Network Switches

• East-West Networking

Compute optimized servers

• HPE Moonshot System with 45 x m710 Compute nodes

• HPE Apollo 2200 with 4 x XL170r Gen9 High Compute nodes

HPE Moonshot 1500

• 45 hot-plug cartridges

− 1-node cartridges = 45 servers in a chassis

− 4-node cartridges = 180 servers in a chassis

• 2 internal switches

− HP Moonshot-45G (45 x 1Gb ports)

− HP Moonshot-180G (180 x 1Gb ports)

− HP Moonshot-45XG (45 x 10Gb ports)

Cartridge options:

• m400: web cache, 64-bit ARM

• m700: remote PCs, XenDesktop

• m710p: Big Data, Hadoop, video transcoding

• m800: real-time analytics, telecom, finance

• m350: web hosting, 180 servers in 4.3U

• m300: full web infrastructure in a single chassis, dedicated hosting

45 low-power Hadoop compute nodes per enclosure!

Big Data Compute Node

Big Data Storage Node

HPE Apollo 4200 - Bringing Big Data storage server density to enterprise

Big Data Storage Node for Backup or Archival

HPE Apollo 4510 - Very High density Big Data storage server

Scalable density, lower TCO, workload optimized

• Rack-scale storage server density: up to 5.44 PB in a 42U rack

• Cost effective: 68 LFF HDDs/SSDs in a 4U server chassis for low-cost, power- and space-efficient solutions

• Configuration flexibility: balance capacity, cost and throughput with flexible options for disks, CPUs, I/O and interconnects
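
The rack-level figure can be cross-checked with a few lines of Python. The sketch assumes the rack is filled with as many 4U chassis as fit in 42U and that 1 PB = 1000 TB; the per-drive capacity is derived from the slide's numbers, not quoted from it.

```python
RACK_U = 42                # rack height, from the slide
CHASSIS_U = 4              # chassis height, from the slide
DRIVES_PER_CHASSIS = 68    # LFF drives per chassis, from the slide
RACK_CAPACITY_PB = 5.44    # quoted capacity per rack, from the slide

# Assumption: the rack holds as many whole 4U chassis as fit in 42U.
chassis_per_rack = RACK_U // CHASSIS_U
drives_per_rack = chassis_per_rack * DRIVES_PER_CHASSIS

# Assumption: decimal units, 1 PB = 1000 TB.
tb_per_drive = RACK_CAPACITY_PB * 1000 / drives_per_rack

print(chassis_per_rack, "chassis,", drives_per_rack, "drives,",
      round(tb_per_drive, 2), "TB per drive")
```

With these assumptions, 10 chassis hold 680 drives, so the 5.44 PB figure implies about 8 TB per drive and leaves 2U of the rack free.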

HPE BDRA in a Virtualized context

Usage example

HPE BDRA used for multi-tenancy or Hadoop as a Service

Multi-tenancy and Hadoop as a Service are often based on a virtualized environment. Both are made easier by separating the data processing service from the storage management service, as this brings

– Better workload isolation between YARN applications

– More flexibility by scaling compute and storage independently

– Full elasticity on the computing side

– Rapid provisioning and decommissioning of compute without affecting storage
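
The last point can be pictured with a toy Python model in which compute and storage are tracked as two separate pools. The `Cluster` class and all node names are hypothetical; the sketch only illustrates that changing the compute pool leaves the storage pool, and therefore the data, untouched.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    """Toy model: compute and storage nodes are kept in separate pools."""
    storage_nodes: List[str]
    compute_nodes: List[str] = field(default_factory=list)

    def provision_compute(self, name: str) -> None:
        # Adding compute does not touch the storage pool.
        self.compute_nodes.append(name)

    def decommission_compute(self, name: str) -> None:
        # Removing compute does not touch the storage pool either.
        self.compute_nodes.remove(name)


cluster = Cluster(storage_nodes=["storage-1", "storage-2"])
storage_before = list(cluster.storage_nodes)

cluster.provision_compute("compute-vm-1")
cluster.provision_compute("compute-vm-2")
cluster.decommission_compute("compute-vm-1")

print(cluster.compute_nodes, cluster.storage_nodes == storage_before)
```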


HPE BDRA used in a fully Elastic Virtualized environment

Compute and Storage nodes are virtualized in a different manner

[Diagram labels: BDRA Compute Nodes; Virtualization Hosts; Hadoop Compute Node; VM; DK; 3PAR F400; BDRA Storage Node; Host; Hadoop DataNode; Ext4]

Summary and HPE Big Data Architecture long term view

HPE Big Data Reference Architecture

– The HPE BDRA is a complementary Hadoop reference architecture that brings

• Elasticity: extreme elasticity brought to Hadoop

• Flexibility: adaptive architecture that makes IT more responsive

• Efficiency: scale compute and storage independently

– It takes advantage of new Hadoop trends and features like

• Hadoop YARN Labels

• Hadoop HDFS Tiering

– The target customers are

• Mature Hadoop customers who want to consolidate clusters

• People who need virtualization, multi-tenancy or elasticity, or who want to build a smart Data Lake

• People who want to optimize density and power consumption (breakthrough economics)

– The BDRA works with fully standard Hadoop stacks (no patches, not proprietary)

• Cloudera Enterprise 5

• Hortonworks Data Platform 2

• MapR M5

HPE BDRA Optimized Compute & Storage nodes

Support multiple compute and storage blocks

Converged Infrastructure benefits for Big Data

Hadoop Node Labels feature (jira YARN-796)

• Combined with the HPE Big Data Reference Architecture, compute nodes

can be dynamically assigned as there is no need for data repartitioning

• HPE contributed IP into the Hadoop trunk, working with Hortonworks

• Allows scheduling of YARN containers to specific pools of nodes

HPE BDRA CI for Big Data long term view

Evolve to support multiple compute and storage blocks

Multi-temperature storage using HDFS Tiering and ObjectStores

Workload Optimized compute nodes to accelerate various big data software

Thank you !

Learn more on how your organization can benefit from HPE Big Data Reference Architecture

HPE Big Data Reference Architecture: Overview

HPE Big Data Reference Architecture: Hortonworks implementation

HPE Big Data Reference Architecture: Cloudera implementation

HPE Big Data Reference Architecture: MapR implementation

Running HBase on the HPE Big Data Reference Architecture

http://www.hpe.com/go/hadoop

http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA5-6141ENW



http://www8.hp.com/h20195/V2/GetDocument.aspx?docname=4AA5-7447ENW

http://www8.hp.com/h20195/v2/GetDocument.aspx?docname=4AA5-8757ENW
