key trends in big data and new reference architecture from hewlett packard enterprise / gilles...
TRANSCRIPT
A modern, flexible approach to Hadoop implementation
HPE Big Data Reference Architecture
Gilles Noisette
HPE EMEA Big Data Center Of Excellence
November 2015
Agenda
• Big Data IT infrastructure trends
• Hadoop Evolution & Architecture trends
• Hadoop YARN Labelling
• Hadoop Storage Tiering
• New HPE Architecture approach to Big Data
• HPE Big Data Reference Architecture
• Scaling Hadoop more efficiently
• HPE BDRA Components
• HPE BDRA in a virtualized context
• HPE Big Data Architecture long term view
IT infrastructures must evolve to handle Big Data demands
• Multiple silos with multiple copies of the same data
• Difficult to standardize on a consistent server architecture
• Less elastic than other virtualized or converged infrastructure
• Large scale makes density, cost and power problematic
Challenges
The Analytic Cycle
The Pace of Change
And how people are buying Hadoop is also changing…
Hadoop YARN Labelling
Running applications on a particular set of nodes
YARN Labelling (Node-labels / Hadoop 2.6 / jira YARN-796)
Capability to create groups of similar nodes so that different types of applications, each with a different workload, run on the most appropriate group of nodes
• Admin tags nodes with labels (e.g.: GPU, Storm)
− One node can have more than one label (e.g.: GPU, m710)
• Applications can include labels in container requests
Enabling the next Generation of Hadoop Applications . . .
[Diagram: an Application Master requests "I want a GPU". The candidate nodes are NodeManager [Storm], NodeManager [GPU, m710] (HPE Moonshot cartridge) and NodeManager [Analytic, XL170r] (HPE Apollo blades).]
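The label mechanism described above can be illustrated with a small sketch. This is a toy model in Python, not the YARN scheduler or its API: the `Node` class and the `nodes_matching` function are hypothetical names, and the host names are invented. Only the labels come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A NodeManager host and the labels an admin has tagged it with."""
    name: str
    labels: set = field(default_factory=set)


def nodes_matching(nodes, requested_label):
    """Return the nodes that carry the label named in a container request."""
    return [n for n in nodes if requested_label in n.labels]


# Labels taken from the slide; host names are illustrative only.
cluster = [
    Node("node-a", {"Storm"}),
    Node("node-b", {"GPU", "m710"}),          # one node, more than one label
    Node("node-c", {"Analytic", "XL170r"}),
]

# The Application Master says "I want a GPU".
gpu_nodes = nodes_matching(cluster, "GPU")
print([n.name for n in gpu_nodes])
```

A request for a label that no node carries returns an empty list in this sketch. How a real scheduler handles that case depends on its configuration.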
YARN Labels are used in production
YARN Labelling case studies
Vinod Vavilapalli – @Tshooter
Yahoo! uses machines with GPUs on #Hadoop clusters (#YARN) to model
'beautiful' images on Flickr. #hadoopsummit
1:43 AM - 16 Apr 2015
Vinod Vavilapalli – @Tshooter
.@pcnudde talking about #Yahoo using custom #Hadoop #YARN apps together
with Node labels / High CPU machines for learning. #hadoopsummit
1:49 AM - 16 Apr 2015
Yahoo uses YARN labels
eBay clusters use YARN labels to
• Separate Machine Learning workloads from regular workloads
• Restrict licensed software to specific machines
• Enable GPU workloads
• Separate organizational workloads
– Mayank Bansal, eBay
Hadoop Storage Tiering
Hadoop Architecture trends
HDFS Tiering / Heterogeneous Storage Tiers (HDFS-2832)
Allows a single cluster to have multiple storage tiers such as ARCHIVE, DISK, SSD, RAM-disk.
Awareness of storage media allows HDFS to make better decisions about the placement of block data, with input from applications. Replicas can be distributed according to the data's performance and durability requirements.
• Phase 2:
– HDFS-5682 - Application APIs for heterogeneous storage
– HDFS-7228 - SSD storage tier
– HDFS-5851 - Memory as a storage tier
HDFS Archival Storage Design (HDFS-6584)
– Introduces a new concept of storage policies. To accommodate future storage technology and different cluster characteristics, cluster administrators will be able to modify the predefined storage policies and/or define custom storage policies.
– Data policy names: Very Hot, Hot, Warm, Lukewarm, Cold
eBay uses tiered storage for its Hadoop cluster
HDFS Tiering case study
The 40 PB / 2000-node cluster was getting full
HDFS Tiering features
• Data resides on the same cluster in standard HDFS
• Data can easily move back and forth between the cluster and the archive
• Tiered storage is operated using storage types and storage policies
• Archival policy is based on access pattern
– Antony Benoy, eBay
[Diagram: a single HDFS cluster with a DISK tier (40 PB / 2000 nodes) and an ARCHIVAL tier (10 PB / 48 nodes).]
Hadoop gets asymmetric
but I thought we were taking the work to the data…
[Diagram: a Hadoop cluster in which node labels isolate applications onto designated nodes, and storage is split between DISK and ARCHIVE.]
Storage policies:
• Hot: all replicas on DISK
• Warm: 1 replica on DISK, others on ARCHIVE
• Cold: all replicas on ARCHIVE
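The Hot, Warm and Cold policies above each map a file's replicas to storage types. The following Python sketch makes that mapping explicit. It is an illustration only, not HDFS code: `replica_storage_types` is a hypothetical helper, and the replication factor of 3 is an assumed example value.

```python
def replica_storage_types(policy, replication=3):
    """Return the storage type for each replica under a given policy.

    Follows the slide: Hot = all replicas on DISK, Warm = 1 replica on
    DISK and the others on ARCHIVE, Cold = all replicas on ARCHIVE.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if policy == "Hot":
        return ["DISK"] * replication
    if policy == "Warm":
        return ["DISK"] + ["ARCHIVE"] * (replication - 1)
    if policy == "Cold":
        return ["ARCHIVE"] * replication
    raise ValueError(f"unknown policy: {policy}")


for policy in ("Hot", "Warm", "Cold"):
    print(policy, replica_storage_types(policy))
```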
YARN Labels
Allows applications running in YARN containers to be constrained to designated nodes in the cluster
HDFS Tiering
Allows the creation of pools of storage for SSD, HDD, Archive and RAM-disk, leveraging different server configurations
What about data locality?
New complementary approach to address Big Data demands
Storage Optimized Servers
Benefits of HPE Big Data Reference Architecture
HPE Moonshot and Apollo servers address a variety of enterprise big data needs
• Cluster consolidation: multiple big data environments can directly access a shared pool of data
• Flexibility to scale: scale compute and storage independently
• Maximum elasticity: rapidly provision compute without affecting storage
• Breakthrough economics: significantly better density, cost and power through workload optimized components
DFSIO testing on Big Data Reference Architecture
Better numbers with optimized IO Servers for HDFS
HPE Big Data Reference Architecture
Hadoop and its ecosystem take advantage of the BDRA
[Diagram labels: Ethernet network switches (East-West networking), Impala]
HPE Hadoop Traditional vs HPE Big Data Reference Architecture
2X Hadoop MapReduce performance with the same footprint
2.5X HBase performance with the same footprint
Note: Comparison configuration is ProLiant DL380 Gen9 servers
• 2x higher density
• 2.4x memory density
• 46% less power (Watts)
Traditional architecture versus Big Data Reference Architecture
1.5PB configuration example
Comparable Hadoop performance and raw compute (SpecInt) power
BDRA compared to 2U rackmount:
• Acquisition cost: 3% lower
• Power: 54% lower
• Density (total rack U): 2x density
• 5 year power/cooling savings (assume $.20/kWh): $472K
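To put the savings figure above in perspective, it can be converted into an average power reduction. The Python sketch below does this under two assumptions that the slide does not state: the $472K is purely energy cost at the flat $0.20/kWh rate, and the equipment runs continuously for five 365-day years.

```python
savings_usd = 472_000        # 5-year power/cooling savings, from the slide
price_per_kwh = 0.20         # electricity price assumed on the slide
hours = 5 * 365 * 24         # assumption: continuous operation, no leap days

energy_kwh = savings_usd / price_per_kwh   # energy implied by the savings
average_kw = energy_kwh / hours            # average continuous reduction

print(f"{energy_kwh:,.0f} kWh over 5 years")
print(f"{average_kw:.1f} kW average reduction")
```

Under those assumptions, the figure corresponds to an average power and cooling reduction of roughly 54 kW.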
Independent scaling of compute and storage
[ HPE ProLiant DL380 Gen9 ] vs [ HPE Moonshot for Computing + HPE ProLiant Apollo 4200 for Storage ]
[Figure: configurations on an axis labelled HOT to COLD]
HPE Big Data Reference Architecture configurations, relative to the Traditional Architecture:
• 2.8x compute, 97% of the storage capacity, 4x the memory
• 1.6x compute, 1.5x the storage capacity, 2.5x the memory
• 90% of the compute, 2.1x the storage capacity, 1.5x the memory
HPE BDRA Components
Hadoop performance density more than 2 times better, power consumption 0.5x
HPE Big Data Reference Architecture
Scale-Out Building blocks
Storage optimized servers
• HPE Apollo Scalable System: cost-effective industry standard storage server, purpose built for big data, with converged infrastructure that offers high density, energy-efficient storage
HPE Network Switches
• East-West networking
Compute optimized servers
• HPE Moonshot System with 45 x m710 compute nodes
• HPE Apollo 2200 with 4 x XL170r Gen9 high compute nodes
HPE Moonshot 1500
[Figure: front and rear views of the chassis]
45 hot-plug cartridges
• 1-node cartridges = 45 servers in a chassis
• 4-node cartridges = 180 servers in a chassis
2 internal switches
• HP Moonshot-45G (45 x 1Gb ports)
• HP Moonshot-180G (180 x 1Gb ports)
• HP Moonshot-45XG (45 x 10Gb ports)
Cartridges and target workloads:
• m400: web cache, 64-bit ARM
• m700: remote PCs, XenDesktop
• m710p: Big Data, Hadoop, video transcoding
• m800: real-time analytics, telecom, finance
• m350: web hosting, 180 servers in 4.3U
• m300: full web infrastructure in a single chassis, dedicated hosting
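The chassis figures quoted above (45 cartridges, 1 or 4 nodes per cartridge, 180 servers in 4.3U) follow from simple multiplication. The Python sketch below reproduces them. The constant and function names are illustrative, and the servers-per-U value is derived here rather than quoted from the slides.

```python
CARTRIDGES_PER_CHASSIS = 45   # hot-plug cartridges, from the slide
CHASSIS_HEIGHT_U = 4.3        # from "180 servers in 4.3U"


def servers_per_chassis(nodes_per_cartridge):
    """Servers in one chassis for cartridges with the given node count."""
    return CARTRIDGES_PER_CHASSIS * nodes_per_cartridge


for nodes in (1, 4):
    servers = servers_per_chassis(nodes)
    per_u = servers / CHASSIS_HEIGHT_U
    print(f"{nodes}-node cartridges: {servers} servers, {per_u:.1f} per U")
```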
45 low-power Hadoop compute nodes per enclosure!
Big Data Compute Node
Big Data Storage Node
HPE Apollo 4200 - Bringing Big Data storage server density to enterprise
Big Data Storage Node for Backup or Archival
HPE Apollo 4510 - Very high density Big Data storage server
Scalable density
Lower TCO
Workload optimized
Rack-scale storage server density
Up to 5.44 PB in 42U rack
Rack-scale extreme density – 5.44 PB per rack!
Cost effective
68 LFF HDDs/SSDs in 4U server chassis for low-cost, power & space efficient solutions
Configuration flexibility
Balance capacity, cost and throughput with flexible options for disks, CPUs, I/O and interconnects
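The 5.44 PB per rack figure is consistent with the 68-drive, 4U chassis described above, as the Python sketch below shows. It relies on assumptions the slides do not state: 8 TB drives, decimal units (1 PB = 1000 TB), and ten chassis filling 40U of the 42U rack.

```python
RACK_U = 42                  # rack height, from the slide
CHASSIS_U = 4                # chassis height, from the slide
DRIVES_PER_CHASSIS = 68      # LFF drives per chassis, from the slide
DRIVE_TB = 8                 # assumption: drive size is not given on the slide

chassis_per_rack = RACK_U // CHASSIS_U              # whole chassis that fit
drives_per_rack = chassis_per_rack * DRIVES_PER_CHASSIS
raw_pb = drives_per_rack * DRIVE_TB / 1000          # decimal petabytes

print(f"{chassis_per_rack} chassis, {drives_per_rack} drives, {raw_pb:.2f} PB raw")
```

This is raw capacity. Usable HDFS capacity would be lower once replication is taken into account.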
HPE BDRA in a Virtualized context
Usage example
HPE BDRA used for multi-tenancy or Hadoop as a Service
Multi-tenancy or Hadoop as a Service, often based on a virtualized environment, is made easier by separating the data processing service from the storage management service, as this brings:
– Better workload isolation between YARN applications
– More flexibility by scaling compute and storage independently
– Full elasticity on the computing side
– Rapid provisioning and decommissioning of compute without affecting storage
HPE BDRA used in a fully Elastic Virtualized environment
Compute and Storage nodes are virtualized in a different manner
[Diagram labels: Virtualization Hosts, Hadoop Compute Node (VM), BDRA Compute Nodes, 3PAR F400, Hadoop DataNode, Host, VM, Ext4 disks (DK), BDRA Storage Node]
Summarizing & HPE Big Data Architecture long term view
HPE Big Data Reference Architecture
– The HPE BDRA is a complementary Hadoop reference architecture that brings
• Elasticity: extreme elasticity brought to Hadoop
• Flexibility: adaptive architecture that makes IT more responsive
• Efficiency: scale compute and storage independently
– It takes advantage of new Hadoop trends and features like
• Hadoop YARN Labels
• Hadoop HDFS Tiering
– The target customers are
• Mature Hadoop customers who want to consolidate clusters
• People who need virtualization, multi-tenancy or elasticity, or who want to build a smart Data Lake
• People who want to optimize density and power consumption (breakthrough economics)
– The BDRA works with fully standard Hadoop stacks (no patches, not proprietary)
• Cloudera Enterprise 5
• Hortonworks Data Platform 2
• MapR M5
HPE BDRA Optimized Compute & Storage nodes
Support multiple compute and storage blocks
Converged Infrastructure benefits for Big Data
Hadoop Node Labels feature (jira YARN-796)
• Combined with the HPE Big Data Reference Architecture, compute nodes
can be dynamically assigned as there is no need for data repartitioning
• HPE contributed IP into the Hadoop trunk, working with Hortonworks
• Allows scheduling of YARN containers to specific pools of nodes
HPE BDRA CI for Big Data long term view
Evolve to support multiple compute and storage blocks
Multi-temperature storage using HDFS Tiering and ObjectStores
Workload Optimized compute nodes to accelerate various big data software
Thank you!
Learn more on how your organization can benefit from HPE Big Data Reference Architecture
HPE Big Data Reference Architecture: Overview
HPE Big Data Reference Architecture: Hortonworks implementation
HPE Big Data Reference Architecture: Cloudera implementation
HPE Big Data Reference Architecture: MapR implementation
Running HBase on the HPE Big Data Reference Architecture
http://www.hpe.com/go/hadoop