Colorado Computer Architecture Research Group

Architectural Support for Enhanced SMT Job Scheduling

Alex Settle, Joshua Kihm, Andy Janiszewski, Daniel A. Connors
University of Colorado at Boulder
Colorado Architecture Research Group
Introduction
- Shared memory systems limit the performance of SMT processors
- Threads continuously compete for shared cache resources; interference between threads causes workload slowdown
- Detecting thread interference is a challenge for real systems: it requires low-level cache monitoring, and the run-time data is difficult to exploit
- Goal: design the performance monitoring hardware required to capture thread-interference information that can be exposed to the operating system scheduler to improve workload performance
Simultaneous Multithreading (SMT)
- Concurrently executes instructions from different contexts
- Thread-level parallelism (TLP) improves instruction-level parallelism (ILP) and utilization of the base processor
- Intel Pentium 4 Xeon
  - Two-level cache hierarchy with an instruction trace cache
  - 8 KB data cache: 4-way associative, 64 bytes per line
  - 512 KB unified L2 cache: 8-way associative, 64 bytes per line
  - 2-way SMT
Inter-thread Interference
- Competition for shared resources: the memory system (buses, physical cache storage), fetch and issue queues, and functional units
- Threads evict cache data belonging to other threads, increasing cache misses and diminishing processor utilization
- Inter-thread kick-outs (ITKO), measured in the simulator: the thread ID of the evicted cache line is compared to that of the new cache line
- Increased ITKO leads to decreased IPC
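The ITKO measurement above can be illustrated in code. This is a minimal sketch, not the paper's simulator: a direct-mapped cache is assumed for brevity, and all names are hypothetical. On each fill, the thread ID tagged on the evicted line is compared with the incoming thread's ID.

```python
# Minimal sketch of inter-thread kick-out (ITKO) counting.
# Assumption: direct-mapped cache; each line is tagged with the thread
# that installed it. Names and parameters are illustrative.

class ITKOCache:
    def __init__(self, num_sets, line_size=64):
        self.num_sets = num_sets
        self.line_size = line_size
        self.lines = {}          # set index -> (tag, thread_id)
        self.itko_count = 0      # evictions caused by a different thread

    def access(self, address, thread_id):
        block = address // self.line_size
        index = block % self.num_sets
        tag = block // self.num_sets
        victim = self.lines.get(index)
        if victim is not None and victim[0] != tag:
            # Eviction: count an ITKO only when the victim line
            # belonged to another thread.
            if victim[1] != thread_id:
                self.itko_count += 1
        self.lines[index] = (tag, thread_id)

cache = ITKOCache(num_sets=8)
cache.access(0x0000, thread_id=0)   # thread 0 installs a line in set 0
cache.access(0x2000, thread_id=1)   # thread 1 maps to the same set: ITKO
print(cache.itko_count)             # -> 1
```

A self-eviction (the same thread replacing its own line) is deliberately not counted, which matches the definition of ITKO as *inter-thread* interference.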
ITKO to IPC Correlation (Level 3 Cache)
[Figure: IPC recorded for each phase interval. A high ITKO rate leads to a significant drop in IPC; the large variability in IPC over the workload lifetime reflects cache interference.]
Related Work: the interference problem has been addressed at different levels
- Compiler
  - [Kumar, Tullsen; MICRO '02]: procedure placement optimization; workload fixed at compile time
  - [J. Lo; MICRO '97]: tailoring compiler optimizations for SMT; effects of traditional optimizations on SMT performance; static optimizations
- Operating system
  - [Snavely, Tullsen; ASPLOS '00]: symbiotic job scheduling; profile-based, with a simulated OS and architecture
  - [J. Lo; ISCA '98]: data cache address remapping; workload-dependent, database applications
- Microarchitecture
  - [Brown; MICRO '01]: issue-policy feedback from the memory system; improves fetch and issue resource allocation but does not tackle inter-thread interference
Motivation
- Improve performance by reducing inter-thread interference
- A multi-faceted problem: dependent on thread pairings, occurring at cache-line granularity, and difficult to detect at runtime
- OS scheduling decisions affect microarchitecture performance, observed on both a simulator and a real system
- Observation: cache access footprints vary over program lifetimes, and accesses are concentrated in small cache regions
Concentration of L2-Cache Accesses
- Cache access and miss footprints vary across program phases
- Intervals with high access and miss rates are concentrated in small physical regions of the cache (green, red)
- Current performance counters cannot detect that activity is concentrated in small regions
Cache Use Map: Runtime Monitoring
[Figure: cache use map; spatial locality on the vertical axis, temporal locality on the horizontal axis]
Benchmark Pairings: ITKO
[Figure: ITKO heat map for the pairings gzip/mesa, equake/perl, mesa/perl, gzip/perl, mesa/equake, and gzip/equake. Yellow represents very high interference.]
- Interference is dependent on the job mix
Performance Guided Scheduling Theory
[Figure: phase-by-phase schedules for equake, gzip, perl, and mesa, annotated with total ITKOs for the best static schedule vs. the dynamic schedule: 2.91 M vs. 2.55 M; 2.91 M vs. 2.90 M; 7.30 M vs. 6.70 M; 2.91 M vs. 2.55 M]
- In each phase, the scheduler selects the jobs with the least interference
Solution to Inter-thread Interference
- Predict future interference: capture inter-thread interference behavior by introducing cache-line activity counters
- Expose the information to the operating system; current schedulers use symmetric multiprocessing (SMP) algorithms for SMT processors
- Activity-based job scheduler: schedule for minimal inter-thread interference
Activity Vectors
- Interface between the OS and the microarchitecture
- Divide the cache into super sets; an access counter is assigned to each super set
- One vector bit corresponds to each counter; a bit is set when its counter exceeds a threshold (Xi > 1024, Xi > 2048, or Xi > 4096)
- Job scheduler: compare the active vector with the jobs in the run queue and select the job with the fewest common set bits
[Figure: two example columns of super-set access counts (7949, 4271, 3678, 1760, 2511, 2204, 1474 and 1234, 526, 876, 1635, 1067, 1137, 1220, 254) thresholded into activity-vector bits]
- Thresholds are established through static analysis: the global median across all benchmarks
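The vector construction and run-queue comparison above can be sketched as follows. The counter values follow the slide's example, but the threshold choice, job names, and candidate vectors are illustrative assumptions, not data from the paper.

```python
# Sketch of activity-vector construction and job selection.
# One counter per cache super set; the threshold and vectors are illustrative.

def activity_vector(counters, threshold):
    """One bit per super-set counter; a bit is set when its counter
    exceeds the threshold."""
    return [1 if c > threshold else 0 for c in counters]

def select_job(active_vector, run_queue):
    """Pick the job whose vector shares the fewest set bits with the
    running thread's vector, i.e. the least expected interference."""
    def overlap(job_vector):
        return sum(a & b for a, b in zip(active_vector, job_vector))
    return min(run_queue, key=lambda job: overlap(run_queue[job]))

# Example super-set access counts, thresholded at 1024:
counters = [1234, 526, 876, 1635, 1067, 1137, 1220, 254]
active = activity_vector(counters, 1024)   # [1, 0, 0, 1, 1, 1, 1, 0]

run_queue = {
    "gzip": [1, 0, 0, 1, 1, 1, 0, 0],   # heavy overlap with `active`
    "mesa": [0, 1, 1, 0, 0, 0, 0, 1],   # activity in disjoint regions
}
print(select_job(active, run_queue))       # -> mesa
```

Counting the common set bits is what makes the comparison cheap enough for a scheduler hot path: it is a bitwise AND followed by a population count in hardware terms.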
Vector Prediction (Simulator)
- Use the last vector to approximate the next vector
- Average accuracy of 91%; simple and effective

Activity Vector   Use Predictability   Miss Predictability
D-Cache           82.3%                93.6%
I-Cache           94.9%                90.3%
L2-Cache          93.8%                94.6%
Average           90.3%                92.8%
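The last-value prediction scheme above can be sketched as follows. Accuracy here is measured per bit over a trace of interval vectors; the trace data is illustrative, not from the paper.

```python
# Last-value vector prediction: the vector observed in the previous
# interval is the prediction for the next interval. Trace is illustrative.

def predict_accuracy(vector_trace):
    """Average per-bit accuracy of predicting each interval's vector
    with the vector from the interval before it."""
    matches = total = 0
    for prev, cur in zip(vector_trace, vector_trace[1:]):
        matches += sum(p == c for p, c in zip(prev, cur))
        total += len(cur)
    return matches / total

trace = [
    [1, 1, 1, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],   # differs from the previous vector in 3 bits
    [1, 1, 0, 0, 0, 0, 0, 0],   # identical to the previous vector
]
print(predict_accuracy(trace))  # -> 0.8125 (13 of 16 bit predictions correct)
```

Because cache footprints are phase-stable for long stretches, this trivial predictor reaches the roughly 91% average accuracy reported above without any history tables.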
OS Scheduling Algorithm
[Figure: a physical processor with CPU 0 and CPU 1. Run queue 0 holds perlbmk, gzip, mesa, and an OS task; run queue 1 holds mcf, ammp, parser, twolf, and an OS task. The twolf vector is compared (CMP) against the candidate vectors.]
- Weighted sum of the vectors at each cache level; vectors from the L2 are given the highest weight
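The weighted comparison can be sketched as below. The per-level vectors and the weight values are hypothetical; the slide only states that a weighted sum is taken over the levels and that L2 vectors receive the highest weight.

```python
# Sketch of the weighted vector comparison across cache levels.
# Assumption: illustrative weights with L2 weighted highest, per the slide.

WEIGHTS = {"dcache": 1, "icache": 1, "l2": 4}

def interference_score(running, candidate):
    """Weighted sum over cache levels of the bit overlap between the
    running thread's vectors and a candidate job's vectors."""
    score = 0
    for level, weight in WEIGHTS.items():
        overlap = sum(a & b for a, b in zip(running[level], candidate[level]))
        score += weight * overlap
    return score

# Hypothetical 4-bit vectors for the running thread and two candidates:
running = {"dcache": [1, 1, 0, 0], "icache": [0, 1, 0, 0], "l2": [1, 0, 1, 0]}
jobs = {
    "twolf":  {"dcache": [1, 0, 0, 1], "icache": [0, 0, 1, 0], "l2": [0, 1, 0, 1]},
    "parser": {"dcache": [1, 1, 0, 0], "icache": [0, 1, 0, 0], "l2": [1, 0, 1, 0]},
}
best = min(jobs, key=lambda j: interference_score(running, jobs[j]))
print(best)  # -> twolf (its L2 vector is disjoint from the running thread's)
```

Weighting the L2 most heavily reflects that an L2 conflict costs far more cycles than a conflict in the small first-level caches.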
Activity Vector Procedure
- Real system: a modified Linux 2.6.0 kernel, tested on an Intel P4 Xeon with Hyper-Threading
- Emulated activity counter registers: vectors are generated off-line with the Valgrind memory simulator (text-file output) and copied into kernel memory space
- Activate the vector scheduler, then time and run the workloads

Program Phase   D-cache Vector   L2-cache Vector
0               11100110         00111011
1               11000000         01111000
2               00111101         11010000
N               11100001         00011100

[Figure: simulator comprising the vector hardware and a simulated OS]
Workloads (Xeon)
WL1: gzip.vpr.gcc.mesa.art.mcf.equake.crafty
WL2: parser.gap.vortex.bzip2.vpr.mesa.crafty.mcf
WL3: mesa.twolf.vortex.gzip.gcc.art.crafty.vpr
WL4: gzip.twolf.vpr.bzip2.gcc.gap.mesa.parser
WL5: equake.crafty.mcf.parser.art.gap.mesa.vortex
WL6: twolf.bzip2.vortex.gap.parser.crafty.equake.mcf
- Eight SPEC CPU2000 jobs per workload, combining integer and floating-point applications, run to completion in parallel with OS-level jobs
Comparison of Scheduling Algorithms
[Figure: scheduler selection analysis; the percentage of schedule invocations (40% to 100%) classified as Good, Middle, or Bad for workloads WL_1 through WL_6]
- Default Linux vs. activity-based: more than 30% of the default scheduler's decisions could have been improved by the activity-based scheduler
Activity Vector Performance (Xeon)
[Figure: percent improvement (0% to 7%) in execution time and L2 misses for WL_1 through WL_6 and their average]
Comparing Activity Vectors to Existing Performance Counters - Simulation
Benchmark Mix % Diff.
164.gzip, 164.gzip, 181.mcf, 183.equake 0.0%
164.gzip, 164.gzip, 188.ammp, 300.twolf 12.0%
164.gzip, 177.mesa, 181.mcf, 183.equake 0.0%
164.gzip, 177.mesa, 183.equake, 183.equake 0.0%
164.gzip, 197.parser, 253.perlbmk, 300.twolf 44.4%
177.mesa, 177.mesa, 197.parser, 300.twolf 11.1%
177.mesa, 181.mcf, 253.perlbmk, 256.bzip2 0.0%
177.mesa, 188.ammp, 253.perlbmk, 300.twolf 59.5%
177.mesa, 197.parser, 197.parser, 256.bzip2 96.2%
181.mcf, 181.mcf, 256.bzip2, 256.bzip2 0.0%
181.mcf, 183.equake, 253.perlbmk, 300.twolf 4.0%
181.mcf, 253.perlbmk, 253.perlbmk, 256.bzip2 0.0%
183.equake, 188.ammp, 188.ammp, 256.bzip2 11.1%
188.ammp, 188.ammp, 197.parser, 197.parser 96.2%
188.ammp, 300.twolf, 300.twolf, 300.twolf 8.0%
197.parser, 197.parser, 253.perlbmk, 256.bzip2 0.0%
Average 22.5%
On average, the activity-based schedule makes different decisions than the performance-counter-based schedule 22.5% of the time.
ITKO Reduction (Simulation)

Benchmarks   % ITKO Reduction   % IPC Gain
gzip.gzip.mcf.equake 54.0% 3.6%
gzip.gzip.ammp.twolf 10.5% 4.5%
gzip.mesa.mcf.equake 39.5% 3.0%
gzip.mesa.equake.equake 47.0% 2.4%
mesa.mesa.parser.twolf 10.3% 4.8%
mcf.equake.perlbmk.twolf 1.7% 3.0%
mcf.perlbmk.perlbmk.bzip2 13.0% 12.1%
ammp.twolf.twolf.twolf 1.9% 6.1%
Average 22% 5%
Contributions
- Interference analysis of cache accesses
- Fine-grained performance counters: a general-purpose, adaptable optimization
- Microarchitecture behavior exposed to the OS; workload-independent
- Tested on a real SMT machine: implemented in the Linux kernel on a 2-way SMT core
Activity-Based Scheduling Summary
- Prevents inter-thread interference: monitors cache access behavior, co-schedules jobs with expected low interference, and adapts to phased workload behavior
- Performance improvements: a greater than 30% opportunity to improve the default Linux scheduling decisions, a 22% reduction in inter-thread interference, and a 5% improvement in execution time
Super Set Size
- What happens when the number of super sets is changed?
Performance Challenges
- Interference is difficult to detect: inter-thread interference is a multi-faceted problem
- It occurs at cache-line granularity, varies temporally with benchmark memory requests, and depends on thread pairings
- OS scheduling decisions affect performance
- Current systems: increased cache associativity; could use PMU register feedback
Activity Vectors (detail)
- Interface between the OS and the microarchitecture; divide the cache into super sets
- Access counters are assigned to each super set; one vector bit corresponds to each counter, set when its threshold (Xi > 1024) is exceeded
- Job scheduler: compare the active vector with the jobs in the run queue and select the job with the fewest common set bits
[Figure: example counters (1234, 526, 876, 1635, 1067, 1137, 1000, 254) thresholded into vector bits; overlapping set bits indicate expected interference, disjoint bits expect no interference]
OS Scheduling
- OS scheduling matters when there are more jobs than hardware contexts
- Current schedulers use symmetric multiprocessing (SMP) algorithms for SMT processors
- Proposed work: in each time interval, co-schedule jobs whose cache accesses fall in different regions of the cache