The Cloud & Its Impact on IT

© 2013 IBM Corporation. December 16th 2014. Anand Haridass, Senior Technical Staff Member, STG Strategy, IBM India. The Cloud & Its Impact On High Performance Computing

Upload: anand-haridass

Post on 16-Apr-2017

Category: Devices & Hardware

TRANSCRIPT

Page 1: The Cloud & Its Impact on IT

© 2013 IBM Corporation

December 16th 2014

Anand Haridass

Senior Technical Staff Member

STG Strategy

IBM India

The Cloud & Its Impact On High Performance Computing

Page 2: The Cloud & Its Impact on IT

Overview

• What is Cloud Computing?
• Scale-up vs. Scale-out
• Open Hardware
• Cloud Data Centers
• Power
• Bandwidth
• Disaggregation

Page 3: The Cloud & Its Impact on IT


Page 4: The Cloud & Its Impact on IT

What is Cloud Computing?

“Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”

Source: NIST Special Publication 800-146, Cloud Computing Synopsis and Recommendations

Characteristics
• On-demand self-service
• Broad network access
• Resource pooling
• Rapid elasticity
• Measured service

Services
• Infrastructure as a Service (IaaS)
• Platform as a Service (PaaS)
• Software as a Service (SaaS)

Deployment Models
• Private Cloud
• Public Cloud
• Hybrid Cloud

Page 5: The Cloud & Its Impact on IT

On Premise vs. IaaS vs. PaaS vs. SaaS

[Figure: the same nine-layer stack (Networking, Storage, Servers, Virtualization, O/S, Middleware, Runtime, Data, Applications) drawn four times, once each for Traditional On-Premises, Infrastructure as a Service, Platform as a Service and Software as a Service. Legend: Client Manages / Vendor Manages in Cloud.]

• Toward on-premises: Customization; higher cost; slower time to value
• Toward SaaS: Standardization; OPEX savings; faster time to value

Page 6: The Cloud & Its Impact on IT

“Cloud” Services are here to stay

Cloud revenue opportunity in 2017: ~$57 billion (~$30 billion in ’14) @ 24% CAGR*

* Source: GMV 2H13
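The growth figures on this slide can be sanity-checked with the standard compound-growth formula. This is a minimal sketch, assuming the ~$30 billion figure is for 2014 and that growth compounds annually for three years to 2017; the function names are illustrative and do not come from the slides.

```python
def project_value(start: float, cagr: float, years: int) -> float:
    """Compound a starting value forward at a constant annual growth rate."""
    return start * (1.0 + cagr) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    """Annual growth rate that takes `start` to `end` over `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # Slide figures: ~$30B in 2014, 24% CAGR, target year 2017 (3 years out).
    projected = project_value(30.0, 0.24, 3)
    print(f"Projected 2017 revenue: ~${projected:.1f}B")
    print(f"CAGR implied by $30B -> $57B: {implied_cagr(30.0, 57.0, 3):.1%}")
```

Under these assumptions the two slide numbers agree with each other to within rounding.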

Page 7: The Cloud & Its Impact on IT

Traditional vs. “Cloudified” Hardware

• “Scale-Up”
  • Symmetrical Multiprocessing Systems
  • Large shared memory machines
  • Expensive to scale beyond a certain size
  • 4 / 8 / 16 / 32 sockets
  • 4U / 10U / Rack Sized Systems

• “Scale-Out”
  • Loosely coupled systems
  • “Infinite” Scale
  • Mostly 1 & 2 sockets
  • 1U / 2U Form Factor (0.33 / 0.5 / 1 wide)

[Figure: Scale Up (vertical axis) vs. Scale Out (horizontal axis). Labels: Mainframes, SMPs, Tower Servers, Standard Rack Servers, Optimized Rack Servers, Blade Servers, Loosely Coupled Systems. Annotations: “Easy to Program, Hard to Scale beyond” and “Easy to Scale, Hard to Program”.]

• Significant changes in programming & application paradigms: Hadoop/HDFS / NoSQL DBs ...
• Open source software community driven: Linux / OpenStack ...

Page 8: The Cloud & Its Impact on IT

What this has translated to ..

• Lots of ODM/OEM vendors
  • Build with ‘cheap’ commodity hardware over ‘exotic’ hardware
• With ‘inexpensive’ commodity components, failure is a way of life
  • Redundancy (N+1) moved out of hardware ($)
  • Software stack needs to plan for failure
  • Significant work on systems resilient to storage failures
• Example*
  • Disk drives: 4 to 6% annual failure rate (AFR); 5% AFR translates to MTBF of …
  • Servers: 2 to 4% AFR; 3% AFR translates to MTBF of 292K hrs (33 yrs)
  • In a datacenter with 64,000 servers with 2 disks each, 5 servers & 17 disks fail daily!

Stringent SLA requirements
• Applications should continue to function even if the underlying physical hardware fails / is removed / replaced
• Envision a failure (Everything WILL Fail At Scale), then work backwards

* Source: James Hamilton, Amazon
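The arithmetic behind the failure example can be reproduced in a few lines. This is a rough sketch, assuming a constant failure rate (so MTBF is approximately hours-per-year divided by AFR) and failures spread evenly over 365 days; the function names are illustrative, and the 3% and 5% rates are the midpoints quoted on the slide.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def mtbf_hours(afr: float) -> float:
    """Approximate MTBF in hours for a given annual failure rate (0..1).

    Assumes a constant failure rate, which is a simplification.
    """
    return HOURS_PER_YEAR / afr


def expected_daily_failures(units: int, afr: float) -> float:
    """Expected number of units failing per day in a fleet of `units`."""
    return units * afr / 365.0


if __name__ == "__main__":
    servers = 64_000
    disks = servers * 2  # two disks per server, as on the slide

    server_mtbf = mtbf_hours(0.03)
    print(f"Server MTBF at 3% AFR: {server_mtbf:,.0f} hrs "
          f"({server_mtbf / HOURS_PER_YEAR:.0f} yrs)")
    print(f"Servers failing per day: {expected_daily_failures(servers, 0.03):.1f}")
    print(f"Disks failing per day:   {expected_daily_failures(disks, 0.05):.1f}")
```

Even with a per-server MTBF measured in decades, fleet size alone makes daily failures routine, which is why the software stack has to plan for them.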

Page 9: The Cloud & Its Impact on IT

What this has translated to … Evolving

Accelerating examples:
• Google Spanner, Omega, …
• Microsoft Bing, Azure Storage
• Various Amazon S3, Glacier, …
• Alibaba ecommerce services
• IBM Watson

• The pendulum swinging the other way ...
• Cloud vendors drive huge volumes (100K+ to a few million servers)
• Now seeing ‘customization’ for different workloads
• Accelerators / Flash / SSDs
• More aggressive hardware-software co-optimization
• Open ‘Hardware’

[Figure labels: Google Spanner / Microsoft Bing / Catapult]

Page 10: The Cloud & Its Impact on IT

Open Compute Project

http://opencompute.org/

Page 11: The Cloud & Its Impact on IT

OpenPOWER

The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise, investment, and server-class intellectual property to serve the evolving needs of customers.

Opening the architecture to give the industry the ability to innovate across the full Hardware and Software stack

• Simplify system design with alternative architecture
• Includes SOC design, Bus Specifications, Reference Designs, FW, OS and Open Source Hypervisor
• Little Endian Linux to ease the migration of software to POWER

Driving an expansion of enterprise class Hardware and Software stack for the data center

Building a complete ecosystem to provide customers with the flexibility to build servers best suited to the Power architecture

Page 12: The Cloud & Its Impact on IT


Growing Community

Boards / Systems

I/O / Storage / Acceleration

Chip / SOC

System / Software / Services

Implementation / HPC / Research

Complete member list at www.openpowerfoundation.org

Oct/2014

Page 13: The Cloud & Its Impact on IT

Ecosystem Enablement

A modern development environment is emerging based on tools and services

Power Open Source Software Stack Components

[Figure: System Operating Environment Software Stack, drawn as three tiers]

Software (Existing Open Source Software Communities)
• Cloud Software
• Operating System / KVM
• Standard Operating Environment (System Mgmt): XCAT
• 44,000 packages now available

Firmware (New OSS Community)
• OpenPOWER Firmware

Hardware (OpenPOWER Technology)
• “Standard POWER Products” – 2014: POWER8, CAPP, CAPI over PCIe
• “Custom POWER SoC” – Future: Customizable; Framework to Integrate System IP on Chip; Industry IP License Model

Multiple Options to Design with POWER Technology Within OpenPOWER

Page 14: The Cloud & Its Impact on IT

OpenPOWER – Find out more!

www.tyan.com/campaign/openpower/
openpowerfoundation.org/

The Google reference board
• Two single-chip modules (SCM)
• Four modified SATA ports
• Google use only

http://www.enterprisetech.com/2014/04/28/inside-google-tyan-power8-server-boards/

Page 15: The Cloud & Its Impact on IT

Cloud Scale Data Center

Component          Sub-Components
Servers*           CPU, memory, disk
Infrastructure*    UPS, Cooling, Power Distribution
Power (W)          Electric Utility Costs
Networking         Switches/Links/Transit

* 3 yr amortization on servers & 15 yrs on infrastructure

Investing in a Datacenter (250 million - 1 billion+)

High scale
• 10’s to 100’s of thousands of servers

Geo-distribution
• 10s to 100s of DCs

Stringent availability & performance requirements
• 99.9th percentile SLAs
• Cost per transaction / cost per data “unit”

Complexity
• Lots of components: load balancers, operating system, middleware, virtualization/containers, switches, servers, racks …

Page 16: The Cloud & Its Impact on IT

Let’s drill down on Power (Watts)

• Datacenter power usage
  • Total global data center power use ~320 TWh (Data Center Dynamics Focus, Nov 2012)
  • Data centers account for 1.8% of total global electricity use

• PUE is defined as the ratio of total facilities energy to IT equipment energy (not a perfect metric, but that’s a discussion for another day*)

• Datacenter efficiency
  • Average DC efficiency: PUE over 2.0 (Source: EPA)
  • Lots of high-end cloud services in the 1.2 to 1.5 range
  • Lowers computing cost & better for the environment (Corporate Responsibility)
  • Best of Breed < 1.1 (1.05 … 1.08)
    • Outside air cooled (no chillers), so seeing DCs in low temp belts
    • Sea / river water cooled

* The Green Grid White Paper #49 – “PUE: A comprehensive examination of the metric”
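The PUE definition above translates directly into code. The sketch below is illustrative only: the 1,000 kWh IT load is a made-up figure, the PUE values are the ones quoted on the slide, and the function names are not from the original deck.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    return total_facility_kwh / it_equipment_kwh


def overhead_kwh(it_equipment_kwh: float, pue_value: float) -> float:
    """Energy spent on everything except IT equipment (cooling, UPS losses, ...)."""
    return it_equipment_kwh * (pue_value - 1.0)


if __name__ == "__main__":
    it_load_kwh = 1000.0  # hypothetical IT load, for illustration only
    for label, value in [("average DC", 2.0), ("high-end cloud", 1.2), ("best of breed", 1.08)]:
        print(f"{label:>15}: PUE {value:.2f} -> "
              f"{overhead_kwh(it_load_kwh, value):.0f} kWh overhead per {it_load_kwh:.0f} kWh of IT load")
```

Read this way, moving from PUE 2.0 to 1.2 cuts non-IT energy from 100% of the IT load to 20% of it.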

• Baseline server failure rate = 2%
• 40C inlet (cold aisle temperature) degrades this by 1.65x
• Enhanced failure rate due to temperature = 3.3%
• More economical to run hotter ??

Source: Published by ASHRAE TC 9.9; Roger Schmidt, IBM
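The temperature penalty is a simple multiplication, and it becomes more tangible when applied to a fleet. A minimal sketch follows, using the 2% baseline and 1.65x factor from the slide; the 64,000-server fleet size is borrowed from the earlier failure example purely for illustration, and the function names are assumptions.

```python
def scaled_failure_rate(baseline_afr: float, temperature_factor: float) -> float:
    """Annual failure rate after applying a temperature degradation factor."""
    return baseline_afr * temperature_factor


def extra_failures_per_year(servers: int, baseline_afr: float, temperature_factor: float) -> float:
    """Additional expected server failures per year caused by running hotter."""
    hot_afr = scaled_failure_rate(baseline_afr, temperature_factor)
    return servers * (hot_afr - baseline_afr)


if __name__ == "__main__":
    hot = scaled_failure_rate(0.02, 1.65)
    print(f"Failure rate at 40C inlet: {hot:.1%}")
    extra = extra_failures_per_year(64_000, 0.02, 1.65)
    print(f"Extra failures per year in a 64,000-server fleet: {extra:.0f}")
```

Whether running hotter is more economical then depends on comparing the cost of those extra replacements against the cooling energy saved, which the slide leaves as an open question.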

Page 17: The Cloud & Its Impact on IT

Source: http://mvdirona.com/jrh/talksAndPapers/JamesHamilton_IntelDCSGg.pdf

Page 18: The Cloud & Its Impact on IT

Rack Level Power

[Figure: Data Center IT Power Trends: Maximum kW/Rack (survey)]

• Water Advantages
  • Order of magnitude lower unit thermal resistance
  • 3500X heat carrying capacity
  • Lower temperature
    • Lower power (less leakage)
    • Better reliability

• Water Disadvantages
  • Added complexity
  • Added cost (but not necessarily cost/performance)
  • The perception of water cooling

Rear Door Heat Exchanger

[Figure: rear door heat exchanger, shown with water flow on and water flow off]

• Eliminate rack heat exhaust
• Same dimensions as standard rear door, 4” deep
• Liquid cooling at the rack is 75%-95% more efficient than air cooling by a CRAC
• No electrical or moving parts
• No condensation

Source: Roger Schmidt, IBM

Page 19: The Cloud & Its Impact on IT

Server Level Power

• Maximize Utilization
  • Typical datacenter utilization ~10-15%
  • Virtualization / Containers: get it to 80-90% (depending on performance SLA)
• Fans/Blowers (~20-30% of Server Power)
• Understand Regulator efficiency (op. zone)
• Energy Proportional Computing
  • Idle Power % of Max Power ?
• Performance/Watt metrics
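Why utilization and idle power matter so much can be seen with a toy power model. The sketch below assumes server power rises linearly from idle to maximum with utilization, which is a common simplification and not something stated on the slide; the 50% idle fraction is likewise a hypothetical value chosen for illustration.

```python
def power_fraction(utilization: float, idle_fraction: float) -> float:
    """Power drawn as a fraction of max power, under a linear model.

    utilization:   0..1 share of peak throughput actually delivered
    idle_fraction: idle power as a fraction of max power (hypothetical input)
    """
    return idle_fraction + (1.0 - idle_fraction) * utilization


def relative_energy_per_work(utilization: float, idle_fraction: float) -> float:
    """Energy per unit of work, normalized so a fully loaded server equals 1.0."""
    return power_fraction(utilization, idle_fraction) / utilization


if __name__ == "__main__":
    idle = 0.5  # assumed: idle power is 50% of max power
    for util in (0.10, 0.15, 0.80, 0.90):
        print(f"utilization {util:.0%}: power {power_fraction(util, idle):.0%} of max, "
              f"energy per unit work {relative_energy_per_work(util, idle):.2f}x")
```

Under this model a server at 10% utilization still draws more than half of its peak power, so each unit of work costs several times the energy it would on a well-utilized or truly energy-proportional machine.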

Fan-based Power Optimization

[Figure: power consumption (vertical axis, increasing) vs. microprocessor temperature (horizontal axis, increasing as fan speed decreases). Curves: microprocessor leakage power, fan power, total power.]

Source: Ryan Waite, Microsoft
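The trade-off in the figure (fan power falls and leakage power rises as the processor is allowed to run hotter) can be illustrated with a small numerical search for the minimum of the total. Every constant below is hypothetical: the cube-law fan term follows the fan affinity laws, but the mapping from temperature to fan speed, the exponential leakage curve and all wattages are invented for this sketch and are not taken from the slide.

```python
import math


def fan_power_w(cpu_temp_c: float) -> float:
    """Fan power in watts; allowing a hotter CPU lets the fan spin slower.

    Hypothetical mapping: full speed at 45C, falling linearly to a 20% floor.
    Fan power scales with the cube of speed (fan affinity laws).
    """
    speed = min(1.0, max(0.2, (95.0 - cpu_temp_c) / 50.0))
    return 60.0 * speed ** 3


def leakage_power_w(cpu_temp_c: float) -> float:
    """Leakage power in watts, growing exponentially with temperature (assumed curve)."""
    return 20.0 * math.exp(0.02 * (cpu_temp_c - 45.0))


def total_power_w(cpu_temp_c: float) -> float:
    """Sum of the two competing terms shown in the figure."""
    return fan_power_w(cpu_temp_c) + leakage_power_w(cpu_temp_c)


def best_operating_point(temps_c):
    """Temperature in `temps_c` that minimizes total power."""
    return min(temps_c, key=total_power_w)


if __name__ == "__main__":
    best = best_operating_point(range(45, 86))
    print(f"Lowest total power at about {best}C: {total_power_w(best):.1f} W")
    for t in (45, best, 85):
        print(f"{t}C: fan {fan_power_w(t):.1f} W, leakage {leakage_power_w(t):.1f} W")
```

With these made-up curves the minimum falls between the two extremes, which is the point of the figure: neither maximum cooling nor minimum fan speed gives the lowest total power.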

Page 20: The Cloud & Its Impact on IT

Processor Level Power

• Very Aggressive DVFS
  • Transients / switching currents (typically over-predicted) – inductive noise
  • Model / validate accurately
  • ‘Guard-band’ voltage margins – power wastage
  • Emergency brakes (CPM)

• Rapid Power / Thermal Cycling
  • CTE mismatch between silicon and organic packages
  • Electromigration effects / Reliability

• Circuit Techniques - Resonant Clocking
  • Leverages inductive clock grid

• Leverage technology (deep trench capacitance)
  • Enables significant on-chip decap (~20uF)
  • Mid/high-freq noise significantly reduced

Source: Dale Becker, IBM

Page 21: The Cloud & Its Impact on IT

Networking

• Networking is THE biggest issue that Cloud DCs are grappling with
• Cost take-down not following ‘Moore’s Law’: big push for SDN
• Intra-datacenter (East-West) traffic increasing
  • 44% CAGR in DC traffic, 76% within DC (Cisco 2012*)
  • 80%+ of Google traffic now internal facing (B. Koley, Google, OI Conf. 2012)
  • Every 1kb of external traffic entering the datacenter generates 930kb of internal traffic (N. Farrington, Facebook, OI Conf. 2013)

Source: Marc Taubenblatt, IBM
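Two of the traffic figures lend themselves to quick back-of-the-envelope calculations. The sketch below assumes the 44% CAGR holds steady and that the 930:1 internal amplification applies uniformly to all external traffic; both are simplifications, and the function names are illustrative.

```python
import math


def doubling_time_years(cagr: float) -> float:
    """Years for traffic to double at a constant compound annual growth rate."""
    return math.log(2.0) / math.log(1.0 + cagr)


def internal_traffic(external_units: float, amplification: float = 930.0) -> float:
    """Internal (East-West) traffic generated per unit of external traffic.

    The default of 930 is the per-1kb figure quoted on the slide.
    """
    return external_units * amplification


if __name__ == "__main__":
    print(f"At 44% CAGR, DC traffic doubles every {doubling_time_years(0.44):.1f} years")
    print(f"1 unit of external traffic -> {internal_traffic(1.0):.0f} units of internal traffic")
```

A doubling time of under two years, combined with that amplification factor, is why the intra-datacenter network becomes the dominant scaling concern.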

Page 22: The Cloud & Its Impact on IT

The Case for Optics

[Figure: Copper vs. Optics. Labels: “Up to 80Km for Ethernet, 100Gbps”; “Bandwidth at low power”.]

• Increasing benefits with optical, but products generally cost more than copper
• Optics less expensive when integrated with silicon - Silicon Photonics
  • Photonics integrated into silicon base
  • Reduces cost and provides higher bandwidth

Page 23: The Cloud & Its Impact on IT

The Case for Optics

Cloud Network
Speed        10 → 40 → 100 → 400 Gbps/drawer
Distance     m’s to km’s
Protocol     Commodity* / Standards
Packaging    Mid Card / Card edge

Electrical getting harder with increased data rate as loss increases

Source: Marc Taubenblatt, IBM

Page 24: The Cloud & Its Impact on IT

Why Disaggregation

Today
• Balanced Compute/Network/Storage
• Refreshed to optimize sub-component (cost / performance / bottlenecks)

[Figure: server resources (Storage, Memory, Network and I/O, GPU / Accelerators, CPUs) split into Compute Node(s) (Mem, CPU, NIC), Storage Node(s), GPU Node(s) and Memory Node(s), joined by a Patch Panel / Switch and/or Shared “NIC” and a Network Connectivity Fabric]

Disaggregation
• Flexible, Composable (workload optimized)
• Enables higher density compute architecture
• Software defined deployment
• Improved utilization, TCO & TCA
• Independent technology refresh cycles

Page 25: The Cloud & Its Impact on IT

Disaggregation

[Figure: levels of integration and candidate interconnects]

• Tight Integration (< few U)
• Rack Level Integration
• POD/DC Level Integration

• Electrical Switch
• Optical Switch
• Electrical? Optics?

Page 26: The Cloud & Its Impact on IT


SoftLayer DataCenter

QUESTIONS ?