TRANSCRIPT

Warehouse-Scale Computing
Mu Li, Kiryong Ha
10/17/2012, 15-740 Computer Architecture
Overview
• Motivation
– Explore architectural issues as computing moves toward the cloud
1. Impact of sharing memory subsystem resources (LLC, memory bandwidth, ...)
2. Maximize resource utilization by co-locating applications without hurting QoS
3. Inefficiencies of traditional processors running scale-out workloads
Overview
Paper | Problem | Approach
The Impact of Memory Subsystem Resource Sharing on Datacenter Applications | Sharing in the memory subsystem | Software
Bubble-Up | Resource utilization | Software
Clearing the Clouds | Inefficiencies for scale-out workloads | Software
Scale-out Processors | Improve scale-out workload performance | Hardware
Impact of memory subsystem sharing
• Motivation & problem definition
– Machines are multi-core and multi-socket
– For better utilization, applications must share the Last Level Cache (LLC) / Front Side Bus (FSB)
⇒ It is important to understand the memory-sharing interactions between (datacenter) applications
Impact of thread-to-core mapping
– Sharing cache, separate FSBs (XX..XX..)
– Sharing cache, sharing FSBs (XXXX....)
– Separate caches, separate FSBs (X.X.X.X.)
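The three mapping diagrams above can be read as core masks. A minimal sketch, assuming an 8-core, 2-socket machine where consecutive core pairs share an LLC/FSB (the helper name and layout are illustrative, not from the paper):

```python
# Parse a thread-to-core (TTC) mapping diagram such as "XX..XX.."
# into the list of core IDs to pin the application's threads to.
# 'X' marks a core running an application thread; '.' marks an idle core.
def cores_for_mapping(mapping):
    return [core for core, slot in enumerate(mapping) if slot == "X"]

# The three mappings from the slide:
print(cores_for_mapping("XX..XX.."))  # [0, 1, 4, 5] sharing cache, separate FSBs
print(cores_for_mapping("XXXX...."))  # [0, 1, 2, 3] sharing cache, sharing FSBs
print(cores_for_mapping("X.X.X.X."))  # [0, 2, 4, 6] separate caches, separate FSBs

# On Linux, each thread would then be pinned with something like
# os.sched_setaffinity(0, set_of_cores) before starting work.
```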
Impact of thread-to-core mapping
– Performance varies by up to 20%
– Each application has a different trend
– TTC behavior changes depending on the co-located application
<CONTENT ANALYZER co-located with other applications>
Observations
1. Performance can swing significantly based simply on how application threads are mapped to cores.
2. The best TTC mapping changes depending on the co-located program.
3. Application characteristics that impact performance:
– Memory bus usage, cache line sharing, cache footprint
– Ex) CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint ⇒ works better if it does not share the LLC and FSB. STITCHER uses more bus bandwidth, so a co-located CONTENT ANALYZER will suffer contention on the FSB.
Increasing Utilization in Warehouse-Scale Computers via Co-location
Increasing Utilization via Co-location
• Motivation
– Cloud providers want higher resource utilization.
– However, overprovisioning is used to ensure performance isolation for latency-sensitive tasks, which lowers utilization.
⇒ Need precise prediction of shared-resource interference for better utilization without violating QoS.
<Google's web search QoS when co-located with other products>
Bubble-Up Methodology
1. QoS sensitivity curve
– Measure the application's sensitivity by iteratively increasing the amount of pressure on the memory subsystem
2. Bubble score
– Measure the amount of pressure the application causes on a reporter
<sensitivity curve for Bigtable> <sensitivity curve> <pressure score>
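Step 1 can be sketched as follows; `run_with_bubble` stands in for actually running the latency-sensitive application alongside a memory-pressure bubble of the given size, and the QoS numbers are synthetic, not measurements from the paper:

```python
# Step 1: build a QoS sensitivity curve by iteratively increasing the
# bubble size and recording the application's QoS, normalized to the
# interference-free baseline.
def measure_sensitivity_curve(run_with_bubble, bubble_sizes_mb):
    baseline = run_with_bubble(0)  # QoS with no memory pressure
    return {size: run_with_bubble(size) / baseline for size in bubble_sizes_mb}

# Synthetic stand-in: QoS (e.g., queries per second) degrades as the
# bubble's working set grows -- purely illustrative numbers.
def fake_run_with_bubble(size_mb):
    return max(100.0 - 1.5 * size_mb, 10.0)

curve = measure_sensitivity_curve(fake_run_with_bubble, [0, 5, 10, 15, 20])
print(curve[0])   # 1.0 by construction
print(curve[20])  # 0.7: QoS drops to 70% under a 20 MB bubble
```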
Better Utilization
• Now we know:
1) how QoS changes depending on bubble size (QoS sensitivity curve)
2) how much pressure the application puts on others (bubble score)
⇒ Can co-locate applications while estimating the change in QoS
<utilization improvement with search-render under each QoS target>
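The co-location decision can be sketched like this; the sensitivity curve and the 90% QoS target are hypothetical numbers, and the lookup uses the largest measured bubble not exceeding the score (the actual curves are continuous):

```python
# Predict application A's QoS when co-located with application B:
# look up A's sensitivity curve at B's bubble score.
def predict_qos(sensitivity_curve, bubble_score):
    qos = 1.0
    for size in sorted(sensitivity_curve):
        if size <= bubble_score:
            qos = sensitivity_curve[size]
        else:
            break
    return qos

# Co-locate only if the predicted QoS stays above the target.
def can_colocate(sensitivity_curve, bubble_score, qos_target=0.90):
    return predict_qos(sensitivity_curve, bubble_score) >= qos_target

# Hypothetical sensitivity curve for a latency-sensitive application:
# bubble size (MB) -> normalized QoS.
curve = {0: 1.00, 5: 0.98, 10: 0.93, 15: 0.85, 20: 0.75}

print(can_colocate(curve, 10))  # True: predicted QoS 0.93 >= 0.90
print(can_colocate(curve, 20))  # False: predicted QoS 0.75 < 0.90
```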
Scale-out Workloads
• Examples:
– Data Serving
– MapReduce
– Media Streaming
– SAT Solver
– Web Frontend
– Web Search
Execution-time breakdown
• A major part of execution time is spent waiting for cache misses
⇒ A clear micro-architectural mismatch
Frontend inefficiencies
• Cores idle due to high instruction-cache miss rates
• L2 caches increase average instruction-fetch latency
• Excessive LLC capacity leads to long instruction-fetch latency
• How to improve?
– Bring instructions closer to the cores
Core inefficiencies
• Low instruction-level parallelism precludes effectively using the full core width
• Low memory-level parallelism underutilizes reorder buffers and load-store queues
• How to improve?
– Run many things together: multi-threaded, multi-core architecture
Data-access inefficiencies
• A large LLC consumes area but does not improve performance
• Simple data prefetchers are ineffective
• How to improve?
– Reduce the LLC, leaving room for processors
Bandwidth inefficiencies
• Lack of data sharing deprecates coherence and connectivity
• Off-chip bandwidth exceeds needs by an order of magnitude
• How to improve?
– Scale back the on-chip interconnect and off-chip memory bus to make room for processors
Scale-out processors
• So: too much LLC, interconnect, and memory-bus capacity, but not enough processors
• Here comes a better design:
– Improves throughput by 5x-6.5x!
• Q&A or Discussion
Supplementary slides
Datacenter Applications
Application | Metric | Type
content analyzer | throughput | latency-sensitive
bigtable | average latency | latency-sensitive
websearch | queries per second | latency-sensitive
stitcher | - | batch
protobuf | - | batch
- Google's production applications
Key takeaways
• TTC behavior is mostly determined by:
– Memory bus usage (for FSB sharing)
– Data sharing: cache line sharing
– Cache footprint: use last-level cache misses to estimate footprint size
• Example
– CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint ⇒ works better if it does not share the LLC and FSB
– STITCHER actually uses more bus bandwidth, so it is better for CONTENT ANALYZER not to share the FSB with STITCHER
• 1% prediction error on average
Prediction accuracy for pairwise co-locations of Google applications
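The takeaways above suggest a simple co-location heuristic; the thresholds, field names, and per-application numbers below are hypothetical, not values from the paper:

```python
# Decide whether an application pair should be mapped to separate
# LLCs/FSBs (X.X.X.X.) rather than sharing them, based on the three
# characteristics named above.
def prefer_separate_llc_fsb(app_a, app_b):
    """Each app: 'bus_usage' in [0, 1], 'cache_sharing' in [0, 1],
    'footprint_mb'. High combined bus usage, or large footprints with
    little data sharing, favor separate caches and buses."""
    contention = app_a["bus_usage"] + app_b["bus_usage"]
    sharing = min(app_a["cache_sharing"], app_b["cache_sharing"])
    footprint = app_a["footprint_mb"] + app_b["footprint_mb"]
    return contention > 1.0 or (sharing < 0.2 and footprint > 8)

content_analyzer = {"bus_usage": 0.7, "cache_sharing": 0.1, "footprint_mb": 6}
stitcher = {"bus_usage": 0.8, "cache_sharing": 0.3, "footprint_mb": 4}
light_batch = {"bus_usage": 0.2, "cache_sharing": 0.5, "footprint_mb": 2}

print(prefer_separate_llc_fsb(content_analyzer, stitcher))  # True: heavy combined bus usage
print(prefer_separate_llc_fsb(light_batch, light_batch))    # False: little contention
```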