Warehouse-Scale Computing Mu Li, Kiryong Ha 10/17/2012 15-740 Computer Architecture

TRANSCRIPT

Page 1:

Warehouse-Scale Computing

Mu Li, Kiryong Ha

10/17/2012, 15-740 Computer Architecture

Page 2:

Overview

• Motivation: explore architectural issues as computing moves toward the cloud

1. Impact of sharing memory-subsystem resources (LLC, memory bandwidth, ...)

2. Maximizing resource utilization by co-locating applications without hurting QoS

3. Inefficiencies of traditional processors running scale-out workloads

Page 3:

Overview

Paper | Problem | Approach
The Impact of Memory Subsystem Resource Sharing on Datacenter Applications | Sharing in the memory subsystem | Software
Bubble-Up | Resource utilization | Software
Clearing the Clouds | Inefficiencies for scale-out workloads | Software
Scale-Out Processors | Improve scale-out workload performance | Hardware

Page 4:

Impact of memory subsystem sharing

Page 5:

Impact of memory subsystem sharing

• Motivation & problem definition
– Machines are multi-core and multi-socket.
– For better utilization, applications should share the Last-Level Cache (LLC) / Front-Side Bus (FSB).

It is important to understand the memory-sharing interactions between (datacenter) applications.

Page 6:

Impact of thread-to-core mapping

– Sharing cache, separate FSBs (XX..XX..)
– Sharing cache, sharing FSBs (XXXX....)
– Separate caches, separate FSBs (X.X.X.X.)
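The three patterns above can be read as core-occupancy strings. A minimal sketch, assuming a hypothetical 8-core, two-socket machine (cores 0-3 and 4-7) where each socket shares one LLC and one FSB; 'X' marks a core running one of the application's four threads, '.' an idle core:

```python
def mapping_to_cores(pattern: str) -> list[int]:
    """Return the core indices assigned to threads for a TTC pattern."""
    return [i for i, c in enumerate(pattern) if c == "X"]

# Sharing cache, separate FSBs: two threads on each socket's shared LLC.
print(mapping_to_cores("XX..XX.."))  # [0, 1, 4, 5]
# Sharing cache, sharing FSBs: all four threads packed onto one socket.
print(mapping_to_cores("XXXX...."))  # [0, 1, 2, 3]
# Separate caches, separate FSBs: threads spread across cache domains.
print(mapping_to_cores("X.X.X.X."))  # [0, 2, 4, 6]
```

On Linux, such a mapping could then be enforced by pinning threads with `os.sched_setaffinity` or `taskset`; the core numbering and socket layout here are assumptions for illustration.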

Page 7:

Impact of thread-to-core mapping

– Performance varies by up to 20%.
– Each application has a different trend.
– TTC behavior changes depending on the co-located application.

<Figure: CONTENT ANALYZER co-located with other applications>

Page 8:

Observation
1. Performance can swing significantly based simply on how application threads are mapped to cores.

2. The best TTC mapping depends on the co-located program.

3. Application characteristics that impact performance: memory bus usage, cache-line sharing, cache footprint.
– Ex) CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint, so it works better if it does not share the LLC and FSB. STITCHER uses even more bus bandwidth, so a co-located CONTENT ANALYZER will contend with it on the FSB.
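The characteristics above suggest a simple pairwise heuristic. This is an illustrative sketch, not the paper's algorithm; the app names, profile numbers, LLC size, and thresholds are all hypothetical:

```python
def prefer_separate(app_a: dict, app_b: dict,
                    llc_bytes: int = 8 << 20,
                    bus_capacity: float = 1.0) -> dict:
    """Judge whether a pair of apps should avoid sharing the LLC and the FSB,
    based on combined cache footprint, combined bus usage, and line sharing."""
    separate_llc = app_a["footprint"] + app_b["footprint"] > llc_bytes
    separate_fsb = app_a["bus_usage"] + app_b["bus_usage"] > bus_capacity
    # Heavy cache-line sharing between the two apps argues for a shared LLC.
    if app_a.get("shares_lines_with") == app_b["name"]:
        separate_llc = False
    return {"separate_llc": separate_llc, "separate_fsb": separate_fsb}

# Hypothetical profiles: high bus usage and large footprints on both sides.
content_analyzer = {"name": "content_analyzer", "bus_usage": 0.6,
                    "footprint": 6 << 20}
stitcher = {"name": "stitcher", "bus_usage": 0.7, "footprint": 4 << 20}
print(prefer_separate(content_analyzer, stitcher))
# combined footprint (10 MB) and bus usage (1.3) both exceed capacity,
# so the heuristic keeps both the LLC and the FSB separate
```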

Page 9:

Increasing Utilization in Warehouse-Scale Computers via Co-location

Page 10:

Increasing Utilization via Co-location
• Motivation
– Cloud providers want higher resource utilization.
– However, overprovisioning is used to ensure performance isolation for latency-sensitive tasks, which lowers utilization.

Precise prediction of shared-resource interference is needed for better utilization without violating QoS.

<Figure: Google web search QoS when co-located with other products>

Page 11:

Bubble-Up Methodology
1. QoS sensitivity curve
– Measure the application's sensitivity by iteratively increasing the amount of pressure on the memory subsystem.

2. Bubble score
– Measure the amount of pressure the application causes on a "reporter".

<Figures: sensitivity curve for Bigtable; sensitivity curve; pressure score>
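The first step can be sketched as a lookup on the measured curve: given a co-runner's bubble score, predict the latency-sensitive application's QoS by interpolating between measured (bubble size, QoS) points. A minimal sketch; the curve values below are made up for illustration:

```python
def predict_qos(curve: list[tuple[float, float]], bubble_score: float) -> float:
    """Linearly interpolate normalized QoS at a given bubble size (MB)."""
    curve = sorted(curve)
    if bubble_score <= curve[0][0]:
        return curve[0][1]
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if bubble_score <= x1:
            t = (bubble_score - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return curve[-1][1]  # beyond the largest measured bubble

# Hypothetical curve: QoS degrades as the bubble grows past the LLC size.
curve = [(0, 1.00), (2, 0.98), (4, 0.90), (8, 0.70)]
print(predict_qos(curve, 3))  # midway between 0.98 and 0.90 -> 0.94
```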

Page 12:

Better Utilization
• Now we know:
1) how QoS changes with bubble size (QoS sensitivity curve), and
2) how much pressure an application puts on others (bubble score).

Applications can therefore be co-located while estimating the resulting change in QoS.

<Figure: utilization improvement with search-render under each QoS target>
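Putting the two measurements together gives an admission check: co-locate a batch application with a latency-sensitive one only if the QoS predicted from the batch app's bubble score stays above the policy threshold. An illustrative sketch, not a production policy; the curve and scores are hypothetical:

```python
def admit(qos_at_bubble: dict[int, float], bubble_score: int,
          qos_threshold: float = 0.98) -> bool:
    """Allow co-location iff predicted QoS meets the threshold."""
    return qos_at_bubble[bubble_score] >= qos_threshold

# Hypothetical sensitivity of websearch, sampled at integer bubble sizes (MB).
websearch_curve = {1: 1.00, 2: 0.99, 3: 0.97, 4: 0.92}
print(admit(websearch_curve, 2))  # True: predicted QoS 0.99 >= 0.98
print(admit(websearch_curve, 3))  # False: predicted QoS 0.97 < 0.98
```

Under a 98% QoS policy, a batch co-runner with bubble score 2 is admitted while one with score 3 is rejected; relaxing the threshold admits more co-runners and raises utilization.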

Page 13:

Scale-out workload

Page 14:

Scale-out workload
• Examples:
– Data Serving
– MapReduce
– Media Streaming
– SAT Solver
– Web Frontend
– Web Search

Page 15:

Execution-time breakdown

• A major part of execution time is spent waiting on cache misses: a clear micro-architectural mismatch.

Page 16:

Frontend inefficiencies
• Cores idle due to high instruction-cache miss rates.
• L2 caches increase average instruction-fetch latency.
• Excessive LLC capacity leads to long instruction-fetch latency.
• How to improve?
– Bring instructions closer to the cores.

Page 17:

Core inefficiencies
• Low instruction-level parallelism precludes effectively using the full core width.
• Low memory-level parallelism underutilizes reorder buffers and load-store queues.
• How to improve?
– Run many things together: multi-threaded, multi-core architectures.

Page 18:

Data-access inefficiencies

• A large LLC consumes area but does not improve performance.

• Simple data prefetchers are ineffective.

• How to improve?
– Shrink the LLC, leaving room for more processors.

Page 19:

Bandwidth inefficiencies
• Lack of data sharing makes coherence and chip-wide connectivity largely unnecessary.
• Off-chip bandwidth exceeds needs by an order of magnitude.
• How to improve?
– Scale back the on-chip interconnect and off-chip memory bus to make room for more processors.

Page 20:

Scale-out processors
• So: the LLC, interconnect, and memory bus are too large, while there are not enough processors.
• Here comes a better design:

Improves throughput by 5x-6.5x!
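The trade the slides describe, shrinking the LLC and spending the reclaimed die area on more cores, can be illustrated with back-of-envelope arithmetic. All the area numbers below are hypothetical, not the paper's model:

```python
def cores_after_rebalance(chip_mm2: float, core_mm2: float,
                          llc_mm2: float, other_mm2: float) -> int:
    """Cores that fit in the area left after LLC and uncore/IO are placed."""
    free = chip_mm2 - llc_mm2 - other_mm2
    return int(free // core_mm2)

# Hypothetical 200 mm^2 chip with 10 mm^2 cores and 40 mm^2 of uncore/IO:
# shrinking a 60 mm^2 LLC to 15 mm^2 frees room for four more cores.
before = cores_after_rebalance(200, 10, llc_mm2=60, other_mm2=40)  # 10 cores
after = cores_after_rebalance(200, 10, llc_mm2=15, other_mm2=40)   # 14 cores
print(before, after)
```

More cores per chip is only part of the reported 5x-6.5x throughput gain; the rest comes from the frontend, core, and bandwidth fixes on the earlier slides.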

Page 21:

• Q&A or Discussion

Page 22:

Supplement slides

Page 23:

Datacenter Applications

Application | Metric | Type
content analyzer | throughput | latency-sensitive
bigtable | average latency | latency-sensitive
websearch | queries per second | latency-sensitive
stitcher | | batch
protobuf | | batch

– All are Google production applications.

Page 24:

Key takeaways
• TTC behavior is mostly determined by:

– Memory bus usage (for FSB sharing)

– Data sharing: cache-line sharing

– Cache footprint: use last-level cache misses to estimate footprint size

• Example
– CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint, so it works better if it does not share the LLC and FSB.

– STITCHER actually uses more bus bandwidth, so it is better for CONTENT ANALYZER not to share an FSB with it.

Page 25:

• 1% prediction error on average

<Figure: prediction accuracy for pairwise co-locations of Google applications>