System Software Considerations for Cloud Computing on Big Data
Michael Kozuch
Intel Labs Pittsburgh
March 17, 2011
Outline
1. Background: Open Cirrus
2. Cluster software stack
3. Big Data
4. Power
5. Recent news
Open Cirrus
Open Cirrus* Cloud Computing Testbed
Sponsored by HP, Intel, and Yahoo! (with additional support from NSF)
14 sites currently, target of around 20 in the next two years
Collaboration between industry and academia, sharing
• hardware infrastructure
• software infrastructure
• research
• applications and data sets
Sites include MIMOS*, ETRI*, ISPRAS*, KIT*, UIUC*, IDA*, CMU*, China Mobile*, China Telecom*, CESGA*, and GaTech*
Objectives
–Foster systems research around cloud computing
–Enable federation of heterogeneous datacenters
–Vendor-neutral open-source stacks and APIs for the cloud
–Expose research community to enterprise level requirements
–Capture realistic traces of cloud workloads
Each Site
–Runs its own research and technical teams
–Contributes individual technologies
–Operates some of the global services
Independently-managed sites… providing a cooperative research testbed
http://opencirrus.org
Intel BigData Cluster
• Network: 45 Mb/s T3 link to the Internet; each rack has a 48 Gb/s switch with 1 Gb/s point-to-point links to its nodes
• Key: rXrY = row X rack Y; rXrYcZ = row X rack Y chassis Z
• Mixed rack: 20 nodes with 1 single-core Xeon [Irwindale], 6 GB DRAM, 366 GB disk; 10 nodes with 2 dual-core Xeon 5160 [Woodcrest], 4 GB RAM, 2 x 75 GB disks; 10 nodes with 2 quad-core Xeon E5345 [Clovertown], 8 GB DRAM, 2 x 150 GB disks
• Blade racks (r2r1c1-4, r2r2c1-4): 40 nodes each, 2 quad-core Xeon E5345 [Clovertown], 8 GB DRAM, 2 x 150 GB disks
• 1U racks (x2): 15 nodes each, 2 quad-core Xeon E5420 [Harpertown], 8 GB DRAM, 2 x 1 TB disks
• 2U racks (x3: r1r3, r1r4, r2r3): 15 nodes each, 2 quad-core Xeon E5440 [Harpertown], 8 GB DRAM, 6 x 1 TB disks
• 2U racks (x2: r3r2, r3r3): 15 nodes each, 2 quad-core Xeon E5520 [Nehalem-EP], 16 GB DRAM, 6 x 1 TB disks
• Mobile rack: 8 1U nodes, 2 quad-core Xeon E5440 [Harpertown], 16 GB DRAM, 2 x 1 TB disks
• 12-node rack (r1r2): 2 six-core Xeon X5650 [Westmere-EP], 48 GB DRAM, 6 x 0.5 TB disks
• 3U storage rack (r1r5): 5 storage nodes with 12 x 1 TB disks each
• Totals: 210 nodes, 1508 cores, 2344 GB DRAM, 818 spindles, 646 TB of storage
Cloud Software Stack
Cloud Software Stack – Key Learnings
• Enable use of application frameworks (Hadoop, Maui-Torque)
• Enable general IaaS use
• Provide Big Data storage service
• Enable physical resource allocation
Stack layers (bottom to top): Resource Allocator, IaaS, Storage Service, Application Frameworks
Why Physical?
1. Virtualization overhead
2. Access to physical resources
3. Security issues
Zoni Functionality
• Allocation – assignment of physical resources to users
• Isolation – allow multiple mini-clusters to co-exist without interference
• Provisioning – booting of a specified OS
• Management – out-of-band (OOB) power management
• Debugging – OOB console access
(Diagram: Zoni partitions the cluster into isolated domains, e.g. Domain 0 and Domain 1, each with its own server pools, PXE/DNS/DHCP services, and gateway.)
Provides each project with a mini-datacenter
Isolation of experiments
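For illustration, a minimal sketch of the kind of interface such a physical allocation service could expose; the class and method names below are hypothetical, not Zoni's actual API.

```python
# Hypothetical sketch of a Zoni-style physical resource allocator.
# Class and method names are illustrative, not the actual Zoni API.
from dataclasses import dataclass, field


@dataclass
class Node:
    hostname: str
    cores: int
    dram_gb: int
    allocated_to: str | None = None  # project that owns this node, if any


@dataclass
class PhysicalAllocator:
    nodes: list[Node] = field(default_factory=list)

    def allocate(self, project: str, count: int) -> list[Node]:
        """Assign `count` free nodes to `project` (allocation + isolation + provisioning)."""
        free = [n for n in self.nodes if n.allocated_to is None]
        if len(free) < count:
            raise RuntimeError("not enough free nodes")
        chosen = free[:count]
        for n in chosen:
            n.allocated_to = project                   # record ownership
            self._isolate(n, project)                  # e.g. VLAN / domain assignment
            self._provision(n, os_image="linux-base")  # PXE-boot the specified OS
        return chosen

    def release(self, project: str) -> None:
        """Return a project's nodes to the free pool and power them down out of band."""
        for n in self.nodes:
            if n.allocated_to == project:
                n.allocated_to = None
                self._power_off(n)                     # OOB power management (e.g. IPMI)

    # Placeholders standing in for PXE/DHCP, VLAN, and IPMI-style operations.
    def _isolate(self, node: Node, project: str) -> None: ...
    def _provision(self, node: Node, os_image: str) -> None: ...
    def _power_off(self, node: Node) -> None: ...


# Example: carve an 8-node mini-cluster out of a 15-node pool for one project.
pool = PhysicalAllocator([Node(f"r1r1n{i}", cores=8, dram_gb=8) for i in range(15)])
mini_cluster = pool.allocate("hadoop-experiment", count=8)
print([n.hostname for n in mini_cluster])
```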
Intel BigData Cluster Dashboard
Big Data
Example Applications
Application | Big Data | Algorithms | Compute Style
Scientific study (e.g. earthquake study) | Ground model | Earthquake simulation, thermal conduction, … | HPC
Internet library search | Historic web snapshots | Data mining | MapReduce
Virtual world analysis | Virtual world database | Data mining | TBD
Language translation | Text corpuses, audio archives, … | Speech recognition, machine translation, text-to-speech, … | MapReduce & HPC
Video search | Video data | Object/gesture identification, face recognition, … | MapReduce
There has been more video uploaded to YouTube in the last 2 months than if ABC, NBC,
and CBS had been airing content 24/7/365 continuously since 1948. - Gartner
Big Data
Interesting applications are data hungry
The data grows over time
The data is immobile – 100 TB @ 1 Gb/s ≈ 10 days (see the sketch below)
Compute comes to the data
Big Data clusters are the new libraries
The value of a cluster is its data
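A quick back-of-the-envelope check of the 100 TB @ 1 Gb/s figure, assuming decimal terabytes and a fully utilized link:

```python
# How long does it take to move 100 TB over a 1 Gb/s link?
# Assumes decimal units (1 TB = 10**12 bytes) and 100% link utilization.
data_bytes = 100 * 10**12        # 100 TB
link_bps = 1 * 10**9             # 1 Gb/s
seconds = data_bytes * 8 / link_bps
print(seconds / 86400)           # ~9.3 days, i.e. roughly 10 days
```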
Example Motivating Application: Online Processing of Archival Video
• Research project: Develop a context recognition system that is 90% accurate over 90% of your day
• Leverage a combination of low- and high-rate sensing for perception
• Federate many sensors for improved perception
• Big Data: Terabytes of archived video from many egocentric cameras
• Example query 1: “Where did I leave my briefcase?”
  – Sequential search through all video streams [Parallel Camera]
• Example query 2: “Now that I’ve found my briefcase, track it”
  – Cross-cutting search among related video streams [Parallel Time]
Big Data System Requirements
Provide high-performance execution over Big Data repositories
Many spindles, many CPUs
Parallel processing
Enable multiple services to access a repository concurrently
Enable low-latency scaling of services
Enable each service to leverage its own software stack
IaaS, file-system protections where needed
Enable slow resource scaling for growth
Enable rapid resource scaling for power/demand
Scaling-aware storage
Storing the Data – Choices
Model 1: Separate compute servers and storage servers
– Compute and storage can scale independently
– Many opportunities for reliability
Model 2: Co-located compute/storage servers
– No compute resources are under-utilized
– Potential for higher throughput
Cluster Model
(Diagram: R racks, each holding N server nodes with p cores and d disks, connect through top-of-rack (TOR) switches and a cluster switch to the external network; the model parameters are the bandwidths BWdisk, BWnode, and BWswitch.)
The cluster switch quickly becomes the bottleneck.
Local computation is crucial.
I/O Throughput Analysis
(Plot: data throughput in Gb/s, 0 to 6000, for Disk-1G, SSD-1G, Disk-10G, and SSD-10G configurations under random placement versus location-aware placement; location-aware placement wins by 3.6X, 11X, 3.5X, and 9.2X respectively.)
Configuration: 20 racks of 20 2-disk servers; BWswitch = 10 Gbps
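The gap can be illustrated with a toy bottleneck model; this is my own simplification with assumed per-disk and per-NIC bandwidths, not the analysis behind the plotted speedups. With location-aware placement every node streams from its local disks, while with random placement most reads must funnel through the rack uplinks.

```python
# Toy bottleneck model for aggregate read throughput (Gb/s).
# Simplified illustration only; the bandwidth values below are assumptions.

def location_aware_throughput(racks, nodes, disks, bw_disk, bw_node):
    """Location-aware placement: every node reads from its own disks."""
    per_node = min(disks * bw_disk, bw_node)
    return racks * nodes * per_node

def random_throughput(racks, nodes, disks, bw_disk, bw_node, bw_switch):
    """Random placement: most reads traverse the rack uplinks (bw_switch each),
    so the aggregate is capped by the total uplink bandwidth."""
    local = location_aware_throughput(racks, nodes, disks, bw_disk, bw_node)
    return min(local, racks * bw_switch)

# Assumed parameters: 20 racks of 20 two-disk servers, 10 Gb/s rack uplinks,
# ~0.8 Gb/s per disk, 1 Gb/s node NICs.
cfg = dict(racks=20, nodes=20, disks=2, bw_disk=0.8, bw_node=1.0)
la = location_aware_throughput(**cfg)
rnd = random_throughput(**cfg, bw_switch=10.0)
print(f"location-aware: {la:.0f} Gb/s, random: {rnd:.0f} Gb/s, gain: {la / rnd:.1f}x")
```

Even this crude model shows the rack uplinks, rather than the disks, capping random-placement throughput.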
Data Location Information
Issues:
• Many different file system possibilities (HDFS, PVFS, Lustre, etc.)
• Many different application framework possibilities
• Consumers could be virtualized
Solution:
• Standard cluster-wide Data Location Service
• Resource Telemetry Service to evaluate scheduling choices
• Enables virtualized location info and file system agnosticism
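For illustration, a minimal sketch of how a location-aware framework might consume such a service; the interface below is hypothetical and only assumes a per-block replica lookup.

```python
# Hypothetical Data Location Service interface (names are illustrative).
# A location-aware runtime asks where a file's blocks live and schedules
# work on (or near) the nodes that hold replicas.
from collections import defaultdict


class DataLocationService:
    """Maps (file path, block index) to the hostnames holding a replica.
    In a real deployment this would be backed by the underlying DFS
    (HDFS, PVFS, Lustre, ...) and translated for virtualized consumers."""

    def __init__(self):
        self._locations = {}  # (path, block) -> [hostnames]

    def register(self, path, block, hosts):
        self._locations[(path, block)] = list(hosts)

    def locate(self, path, block):
        return self._locations.get((path, block), [])


def schedule_tasks(dls, path, num_blocks):
    """Greedy location-aware scheduling: run each block's task on a replica host."""
    assignment = defaultdict(list)
    for b in range(num_blocks):
        hosts = dls.locate(path, b)
        target = hosts[0] if hosts else "any-node"  # fall back to any node
        assignment[target].append(b)
    return dict(assignment)


dls = DataLocationService()
dls.register("/video/cam01.dat", 0, ["r1r1n3", "r2r3n7"])
dls.register("/video/cam01.dat", 1, ["r1r1n4", "r3r2n2"])
print(schedule_tasks(dls, "/video/cam01.dat", num_blocks=2))
```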
Exposing Location Information
(Diagram: (a) non-virtualized case: a location-aware (LA) application and LA runtime run over a DFS on the host OS, with a Data Location Service and Resource Telemetry Service exposing location information; (b) virtualized case: the LA application, LA runtime, and DFS run in guest OSes inside virtual machines on a VM runtime and VMM, with the same services exposing location information through the virtualization layer.)
Power
“A Taxonomy and Survey of Energy-Efficient Data Centers and Cloud Computing Systems,” Anton Beloglazov, Rajkumar Buyya, Young Choon Lee, and Albert Zomaya
(System) Efficiency
Demand Scaling / Power Proportionality
Power Proportionality and Big Data
(Plot: number of blocks stored on node i versus node number i, for i = 1…100 and 10K blocks in the Hadoop Filesystem; possible power savings for the layouts shown: ~0% and ~66%; optimal: ~95%.)
Rabbit Filesystem
A reliable, power-proportional filesystem for Big Data workloads
Simple Strategy: Maintain a “primary replica”
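A minimal sketch of the primary-replica idea follows; it is a simplified illustration of the strategy, not Rabbit's actual layout algorithm. One complete copy of every block is confined to a small set of nodes, so the remaining nodes can be powered off without making any block unavailable.

```python
# Simplified illustration of a primary-replica block layout (not Rabbit's exact algorithm).
# One full copy of every block lives on the first `primary` nodes, so all other
# nodes can be powered off while every block remains readable.
import random


def place_blocks(num_blocks, num_nodes, replicas=3, primary=10):
    layout = {i: [] for i in range(num_nodes)}      # node -> blocks stored
    for b in range(num_blocks):
        first = b % primary                          # primary copy on a small node set
        others = random.sample(
            [n for n in range(num_nodes) if n != first], replicas - 1)
        for n in [first] + others:
            layout[n].append(b)
    return layout


layout = place_blocks(num_blocks=10_000, num_nodes=100)
# With only the 10 primary nodes powered on, every block is still available,
# so up to 90 of the 100 nodes can be switched off when load is low.
covered = {b for n in range(10) for b in layout[n]}
print(len(covered))  # 10000: the primary set alone covers all blocks
```

A more careful layout can additionally arrange the secondary replicas so that power scales down gradually rather than in a single step.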
Recent News
Recent News
• “Intel Labs to Invest $100 Million in U.S. University Research”
  – Over five years
• Intel Science and Technology Centers
  – 3+2 year sponsored research
  – Half-dozen or more by 2012
  – Each can have a small number of Intel research staff on site
• New ISTC focusing on cloud computing possible
Tentative Research Agenda Framing
Potential Research Questions
Software stack
• Is physical allocation an interesting paradigm for the public cloud?
• What are the right interfaces between the layers?
• Can multi-variable optimization work across layers?
Big Data
• Can a hybrid cloud-HPC file system provide best-of-both-worlds?
• How should the file system deal with heterogeneity?
• What are the right file system sharing models for the cloud?
• Can physical resources be taken from the FS and given back?
Potential Research Questions
Power
• Can storage service power be reduced without reducing availability?
• How should a power-proportional FS maintain a good data layout?
Federation
• Which applications can cope with limited bandwidth between sites?
• What are the optimal ways to join data across clusters?
• How necessary is federation?
How should compute, storage, and power be managed to optimize for performance, energy, and fault-tolerance?
Backup
Scaling – Power Proportionality
Demand scaling presents perf./power trade-off
• Our servers: 250W loaded, 150W idle, 10W off, 200s setup
Research underway for scaling cloud applications
• Control theory
• Load prediction
• Autoscaling
Scaling beyond single tier less well-understood
(Diagram: requests arrive at rate λ at a cloud-based app.)
Note: proportionality issue is orthogonal to FAWN design
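These server figures imply a simple break-even calculation for demand scaling; the sketch below additionally assumes (my assumption) that the 200 s setup runs at roughly full load power.

```python
# Break-even idle time for powering a server off instead of leaving it idle.
# Figures from the slide: 250 W loaded, 150 W idle, 10 W off, 200 s setup.
# Assumption (mine): the 200 s setup runs at roughly full load power.
P_LOADED, P_IDLE, P_OFF, T_SETUP = 250.0, 150.0, 10.0, 200.0

def energy_idle(t):    # ride out a gap of t seconds while idle
    return P_IDLE * t

def energy_off(t):     # power off for the gap, then pay the setup cost on wake-up
    return P_OFF * t + P_LOADED * T_SETUP

# Break-even gap length: P_IDLE * t = P_OFF * t + P_LOADED * T_SETUP
t_break_even = P_LOADED * T_SETUP / (P_IDLE - P_OFF)
print(t_break_even)    # ~357 s: shorter idle gaps are cheaper to ride out while idle
```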
Scaling – Power Proportionality
Project 1: Multi-tier power management
• E.g. Facebook
Project 2: Multi-variable optimization
Project 3: Collective optimization
• Open Cirrus may have key role
(Diagram: request streams (λ) arriving at a stack of physical resources managed by a resource allocator, e.g. Zoni; an IaaS layer, e.g. Tashi; and a distributed file system, e.g. Rabbit.)