TRANSCRIPT

Warehouse-Scale Computing
Mu Li, Kiryong Ha
10/17/2012, 15-740 Computer Architecture
Overview
• Motivation
– Explore architectural issues as computing moves toward the cloud
1. Impact of sharing memory subsystem resources (LLC, memory bandwidth, ...)
2. Maximize resource utilization by co-locating applications without hurting QoS
3. Inefficiencies of traditional processors running scale-out workloads
Overview
Paper | Problem | Approach
The Impact of Memory Subsystem Resource Sharing on Datacenter Applications | Sharing in the memory subsystem | Software
Bubble-Up | Resource utilization | Software
Clearing the Clouds | Inefficiencies for scale-out workloads | Software
Scale-out Processors | Improve scale-out workload performance | Hardware
Impact of memory subsystem sharing
• Motivation & problem definition
– Machines are multi-core and multi-socket
– For better utilization, applications must share the Last Level Cache (LLC) / Front Side Bus (FSB)
⇒ It is important to understand the memory-sharing interactions between (datacenter) applications
Impact of thread-to-core mapping
– Sharing cache, separate FSBs (XX..XX..)
– Sharing cache, sharing FSBs (XXXX....)
– Separate caches, separate FSBs (X.X.X.X.)
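The three mapping diagrams above can be read as core masks. A minimal sketch, assuming an 8-core, 2-socket machine where consecutive core pairs share an LLC/FSB (the helper name and layout are illustrative, not from the paper):

```python
# Parse a thread-to-core (TTC) mapping diagram such as "XX..XX.."
# into the list of core IDs to pin the application's threads to.
# 'X' marks a core running an application thread; '.' marks an idle core.
def cores_for_mapping(mapping):
    return [core for core, slot in enumerate(mapping) if slot == "X"]

# The three mappings from the slide:
print(cores_for_mapping("XX..XX.."))  # [0, 1, 4, 5] sharing cache, separate FSBs
print(cores_for_mapping("XXXX...."))  # [0, 1, 2, 3] sharing cache, sharing FSBs
print(cores_for_mapping("X.X.X.X."))  # [0, 2, 4, 6] separate caches, separate FSBs

# On Linux, each thread would then be pinned with something like
# os.sched_setaffinity(0, set_of_cores) before starting work.
```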
Impact of thread-to-core mapping
– Performance varies by up to 20%
– Each application has a different trend
– TTC behavior changes depending on the co-located application
<CONTENT ANALYZER co-located with other applications>
Observations
1. Performance can swing significantly based simply on how application threads are mapped to cores.
2. The best TTC mapping changes depending on the co-located program.
3. Application characteristics that impact performance:
– Memory bus usage, cache line sharing, cache footprint
– Ex) CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint ⇒ works better if it does not share the LLC and FSB. STITCHER uses more bus bandwidth, so a co-located CONTENT ANALYZER will suffer contention on the FSB.
Increasing Utilization in Warehouse-Scale Computers via Co-location
Increasing Utilization via Co-location
• Motivation
– Cloud providers want higher resource utilization.
– However, overprovisioning is used to ensure performance isolation for latency-sensitive tasks, which lowers utilization.
⇒ Need precise prediction of shared-resource interference for better utilization without violating QoS.
<Google's web search QoS when co-located with other products>
Bubble-Up Methodology
1. QoS sensitivity curve
– Measure the application's sensitivity by iteratively increasing the amount of pressure on the memory subsystem
2. Bubble score
– Measure the amount of pressure the application causes on a reporter
<sensitivity curve for Bigtable> <sensitivity curve> <pressure score>
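Step 1 can be sketched as follows; `run_with_bubble` stands in for actually running the latency-sensitive application alongside a memory-pressure bubble of the given size, and the QoS numbers are synthetic, not measurements from the paper:

```python
# Step 1: build a QoS sensitivity curve by iteratively increasing the
# bubble size and recording the application's QoS, normalized to the
# interference-free baseline.
def measure_sensitivity_curve(run_with_bubble, bubble_sizes_mb):
    baseline = run_with_bubble(0)  # QoS with no memory pressure
    return {size: run_with_bubble(size) / baseline for size in bubble_sizes_mb}

# Synthetic stand-in: QoS (e.g., queries per second) degrades as the
# bubble's working set grows -- purely illustrative numbers.
def fake_run_with_bubble(size_mb):
    return max(100.0 - 1.5 * size_mb, 10.0)

curve = measure_sensitivity_curve(fake_run_with_bubble, [0, 5, 10, 15, 20])
print(curve[0])   # 1.0 by construction
print(curve[20])  # 0.7: QoS drops to 70% under a 20 MB bubble
```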
Better Utilization
• Now we know:
1) how QoS changes depending on bubble size (QoS sensitivity curve)
2) how much pressure the application puts on others (bubble score)
⇒ Can co-locate applications while estimating the change in QoS
<utilization improvement with search-render under each QoS target>
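The co-location decision can be sketched like this; the sensitivity curve and the 90% QoS target are hypothetical numbers, and the lookup uses the largest measured bubble not exceeding the score (the actual curves are continuous):

```python
# Predict application A's QoS when co-located with application B:
# look up A's sensitivity curve at B's bubble score.
def predict_qos(sensitivity_curve, bubble_score):
    qos = 1.0
    for size in sorted(sensitivity_curve):
        if size <= bubble_score:
            qos = sensitivity_curve[size]
        else:
            break
    return qos

# Co-locate only if the predicted QoS stays above the target.
def can_colocate(sensitivity_curve, bubble_score, qos_target=0.90):
    return predict_qos(sensitivity_curve, bubble_score) >= qos_target

# Hypothetical sensitivity curve for a latency-sensitive application:
# bubble size (MB) -> normalized QoS.
curve = {0: 1.00, 5: 0.98, 10: 0.93, 15: 0.85, 20: 0.75}

print(can_colocate(curve, 10))  # True: predicted QoS 0.93 >= 0.90
print(can_colocate(curve, 20))  # False: predicted QoS 0.75 < 0.90
```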
Scale-out Workloads
• Examples:
– Data Serving
– MapReduce
– Media Streaming
– SAT Solver
– Web Frontend
– Web Search
Execution-time breakdown
• A major part of execution time is spent waiting for cache misses
⇒ A clear micro-architectural mismatch
Frontend inefficiencies
• Cores idle due to high instruction-cache miss rates
• L2 caches increase average instruction-fetch latency
• Excessive LLC capacity leads to long instruction-fetch latency
• How to improve?
– Bring instructions closer to the cores
Core inefficiencies
• Low instruction-level parallelism precludes effectively using the full core width
• Low memory-level parallelism underutilizes reorder buffers and load-store queues
• How to improve?
– Run many things together: multi-threaded, multi-core architecture
Data-access inefficiencies
• A large LLC consumes area but does not improve performance
• Simple data prefetchers are ineffective
• How to improve?
– Reduce the LLC, leaving room for processors
Bandwidth inefficiencies
• Lack of data sharing deprecates coherence and connectivity
• Off-chip bandwidth exceeds needs by an order of magnitude
• How to improve?
– Scale back the on-chip interconnect and off-chip memory bus to make room for processors
Scale-out processors
• So: too much LLC, interconnect, and memory-bus capacity, but not enough processors
• Here comes a better design:
– Improves throughput by 5x-6.5x!
• Q&A or Discussion
Supplementary slides
Datacenter Applications
Application | Metric | Type
content analyzer | throughput | latency-sensitive
bigtable | average latency | latency-sensitive
websearch | queries per second | latency-sensitive
stitcher | - | batch
protobuf | - | batch
- Google's production applications
Key takeaways
• TTC behavior is mostly determined by:
– Memory bus usage (for FSB sharing)
– Data sharing: cache line sharing
– Cache footprint: use last-level cache misses to estimate footprint size
• Example
– CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint ⇒ works better if it does not share the LLC and FSB
– STITCHER actually uses more bus bandwidth, so it is better for CONTENT ANALYZER not to share the FSB with STITCHER
• 1% prediction error on average
Prediction accuracy for pairwise co-locations of Google applications
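The takeaways above suggest a simple co-location heuristic; the thresholds, field names, and per-application numbers below are hypothetical, not values from the paper:

```python
# Decide whether an application pair should be mapped to separate
# LLCs/FSBs (X.X.X.X.) rather than sharing them, based on the three
# characteristics named above.
def prefer_separate_llc_fsb(app_a, app_b):
    """Each app: 'bus_usage' in [0, 1], 'cache_sharing' in [0, 1],
    'footprint_mb'. High combined bus usage, or large footprints with
    little data sharing, favor separate caches and buses."""
    contention = app_a["bus_usage"] + app_b["bus_usage"]
    sharing = min(app_a["cache_sharing"], app_b["cache_sharing"])
    footprint = app_a["footprint_mb"] + app_b["footprint_mb"]
    return contention > 1.0 or (sharing < 0.2 and footprint > 8)

content_analyzer = {"bus_usage": 0.7, "cache_sharing": 0.1, "footprint_mb": 6}
stitcher = {"bus_usage": 0.8, "cache_sharing": 0.3, "footprint_mb": 4}
light_batch = {"bus_usage": 0.2, "cache_sharing": 0.5, "footprint_mb": 2}

print(prefer_separate_llc_fsb(content_analyzer, stitcher))  # True: heavy combined bus usage
print(prefer_separate_llc_fsb(light_batch, light_batch))    # False: little contention
```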