
Colorado Computer Architecture Research Group

Architectural Support for Enhanced SMT Job Scheduling

Alex Settle, Joshua Kihm, Andy Janiszewski, Daniel A. Connors
University of Colorado at Boulder


Introduction

- Shared memory systems of SMT processors limit performance
  - Threads continuously compete for shared cache resources
  - Interference between threads causes workload slowdown
- Detecting thread interference is a challenge for real systems
  - Requires low-level cache monitoring
  - Run-time data is difficult to exploit
- Goal: design the performance monitoring hardware required to capture thread interference information that can be exposed to the operating system scheduler to improve workload performance


Simultaneous Multithreading (SMT)

- Concurrently executes instructions from different contexts
  - Exploits thread-level parallelism (TLP)
  - Improves instruction-level parallelism (ILP)
  - Improves utilization of the base processor
- Intel Pentium 4 Xeon
  - Two-level cache hierarchy
    - Instruction trace cache
    - 8 KB data cache, 4-way associative, 64 bytes per line
    - 512 KB unified L2 cache, 8-way associative, 64 bytes per line
  - 2-way SMT


Inter-thread Interference

- Competition for shared resources
  - Memory system: buses, physical cache storage
  - Fetch and issue queues
  - Functional units
- Threads evict cache data belonging to other threads
  - The resulting increase in cache misses diminishes processor utilization
- Inter-thread kick-outs (ITKO)
  - Measured in the simulator: the thread ID of each evicted cache line is compared to that of the incoming line (a minimal sketch follows below)
  - Increased ITKO leads to decreased IPC
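As an illustration of the ITKO measurement above, here is a minimal sketch of an eviction path in a cache simulator that charges a kick-out whenever a valid line installed by one thread is replaced by a different thread. The cache_line structure, the fill() routine, and the two-context configuration are assumptions made for the sketch, not the paper's simulator code.

```c
/* Sketch of ITKO (inter-thread kick-out) counting in a cache
 * simulator's eviction path. All names here are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define NUM_THREADS 2

struct cache_line {
    int      valid;
    int      thread_id;   /* hardware context that installed the line */
    uint64_t tag;
};

static uint64_t itko[NUM_THREADS];   /* kick-outs suffered per thread */

/* Called when a miss from `new_thread` replaces `victim`. */
static void fill(struct cache_line *victim, uint64_t new_tag, int new_thread)
{
    /* An ITKO is charged when a valid line owned by another thread
     * is evicted by the incoming thread. */
    if (victim->valid && victim->thread_id != new_thread)
        itko[victim->thread_id]++;

    victim->valid     = 1;
    victim->thread_id = new_thread;
    victim->tag       = new_tag;
}

int main(void)
{
    struct cache_line line = { 0 };
    fill(&line, 0x100, 0);   /* thread 0 installs the line  */
    fill(&line, 0x200, 1);   /* thread 1 evicts it -> ITKO  */
    printf("ITKOs against thread 0: %llu\n",
           (unsigned long long)itko[0]);
    return 0;
}
```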


ITKO to IPC Correlation (Level 3 Cache)

- IPC recorded for each phase interval
- A high ITKO rate leads to a significant drop in IPC
- Large variability in IPC over the workload lifetime, caused by cache interference


Related Work

Different levels of addressing the interference problem:

- Compiler
  - [Kumar, Tullsen; MICRO '02] Procedure placement optimization; workload fixed at compile time
  - [J. Lo; MICRO '97] Tailoring compiler optimizations for SMT; effects of traditional optimizations on SMT performance; static optimizations
- Operating system
  - [Snavely, Tullsen; ASPLOS '00] Symbiotic job scheduling; profile based, with simulated OS and architecture
  - [J. Lo; ISCA '98] Data cache address remapping; workload dependent, database applications
- Microarchitecture
  - [Brown; MICRO '01] Issue policy feedback from the memory system; improved fetch and issue resource allocation, but does not tackle inter-thread interference


Motivation

- Improve performance by reducing inter-thread interference
- Interference is a multi-faceted problem
  - Dependent on thread pairings
  - Occurs at low-level, cache-line granularity
  - Difficult to detect at runtime
- OS scheduling decisions affect microarchitecture performance
  - Observed on both the simulator and a real system
- Observation
  - Cache access footprints vary over program lifetimes
  - Accesses are concentrated in small cache regions


Concentration of L2-Cache Access

- Cache access and miss footprints vary across program phases
- Intervals with high access and miss rates are concentrated in small physical regions of the cache (green, red)
- Current performance counters cannot detect that activity is concentrated in small regions


Cache Use Map: Runtime Monitoring

[Figure: cache use map; spatial locality on the vertical axis, temporal locality on the horizontal axis]


Benchmark Pairings ITKO

[Figure: ITKO maps for the pairings gzip/mesa, equake/perl, mesa/perl, gzip/perl, mesa/equake, and gzip/equake]

- Yellow represents very high interference
- Interference is dependent on the job mix


Performance Guided Scheduling Theory

[Figure: phase-by-phase schedules for equake, gzip, perl, and mesa, with total ITKOs per schedule. Best static: 2.91 million vs. dynamic: 2.55 million; best static: 2.91 million vs. dynamic: 2.90 million; best static: 7.30 million vs. dynamic: 6.70 million]

- Each phase, the scheduler selects the jobs with the least interference (see the sketch below)
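The static-versus-dynamic comparison on this slide can be sketched as follows, assuming a table of predicted ITKOs per job pairing per phase; the pairing count, phase count, and ITKO numbers are invented for illustration and do not reproduce the slide's data.

```c
/* Sketch of "performance guided scheduling": a dynamic scheduler picks
 * the least-interfering pairing every phase, while the best static
 * schedule must keep one pairing throughout. Values are illustrative. */
#include <stdio.h>

#define PHASES   4
#define PAIRINGS 3   /* e.g. (equake,gzip)+(perl,mesa), (equake,perl)+(gzip,mesa), ... */

int main(void)
{
    /* itko[p][ph]: predicted ITKOs for pairing p during phase ph. */
    long itko[PAIRINGS][PHASES] = {
        { 900, 400, 700, 300 },
        { 500, 600, 200, 800 },
        { 650, 350, 450, 550 },
    };

    long best_static = -1, dynamic = 0;

    for (int p = 0; p < PAIRINGS; p++) {          /* best single pairing */
        long total = 0;
        for (int ph = 0; ph < PHASES; ph++)
            total += itko[p][ph];
        if (best_static < 0 || total < best_static)
            best_static = total;
    }
    for (int ph = 0; ph < PHASES; ph++) {         /* best pairing per phase */
        long best = itko[0][ph];
        for (int p = 1; p < PAIRINGS; p++)
            if (itko[p][ph] < best)
                best = itko[p][ph];
        dynamic += best;
    }
    printf("best static ITKOs: %ld, dynamic ITKOs: %ld\n", best_static, dynamic);
    return 0;
}
```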


Solution to Inter-thread Interference

- Predict future interference
  - Capture inter-thread interference behavior
  - Introduce cache-line activity counters
- Expose the information to the operating system
  - Current schedulers use symmetric multiprocessing (SMP) algorithms for SMT processors
- Activity-based job scheduler
  - Schedule for minimal inter-thread interference


Activity Vectors

- Interface between the OS and the microarchitecture
- Divide the cache into super sets
  - An access counter is assigned to each super set
  - One vector bit corresponds to each counter
  - A bit is set when its counter exceeds a threshold
- Job scheduler
  - Compares the active vector with the jobs in the run queue
  - Selects the job with the fewest common set bits (see the sketch below)

[Figure: per-super-set access counters (e.g., 7949, 4271, 3678, 1760, 2511, 2204, 1474 and 1234, 526, 876, 1635, 1067, 1137, 1220, 254) compared against thresholds (Xi > 1024, Xi > 2048, Xi > 4096) to produce the activity vector bits]

- Thresholds established through static analysis
  - Global median across all benchmarks
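A minimal sketch of the vector construction and run-queue comparison described above, assuming eight super sets, a single 1024-access threshold, and a small candidate list; the real hardware's super-set count, threshold values, and run-queue interface are not reproduced here.

```c
/* Sketch of activity-vector construction and job selection: one counter
 * per cache super set, a bit set when the counter exceeds a threshold,
 * and the scheduler picking the job with the fewest common set bits. */
#include <stdint.h>
#include <stdio.h>

#define SUPER_SETS 8

/* Collapse per-super-set access counters into a one-bit-per-set vector. */
static uint8_t make_vector(const unsigned counters[SUPER_SETS],
                           unsigned threshold)
{
    uint8_t v = 0;
    for (int i = 0; i < SUPER_SETS; i++)
        if (counters[i] > threshold)
            v |= (uint8_t)(1u << i);
    return v;
}

/* Pick the run-queue job with the fewest activity bits in common
 * with the currently running job (least expected interference). */
static int pick_job(uint8_t running, const uint8_t queue[], int njobs)
{
    int best = 0, best_overlap = SUPER_SETS + 1;
    for (int j = 0; j < njobs; j++) {
        int overlap = __builtin_popcount(running & queue[j]);
        if (overlap < best_overlap) {
            best_overlap = overlap;
            best = j;
        }
    }
    return best;
}

int main(void)
{
    /* Counter values taken from the slide's example figure. */
    unsigned counters[SUPER_SETS] = {1234, 526, 876, 1635, 1067, 1137, 1220, 254};
    uint8_t running = make_vector(counters, 1024);   /* bits 0,3,4,5,6 set */
    uint8_t queue[] = { 0xF0, 0x0F, 0x39 };          /* hypothetical candidates */

    printf("running vector: 0x%02x\n", running);
    printf("co-schedule job %d\n", pick_job(running, queue, 3));
    return 0;
}
```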


Vector Prediction - Simulator

- Use the last vector to approximate the next vector (a scoring sketch follows below)
- Average accuracy of 91%
- Simple and effective

Activity Vector   Use Predictability   Miss Predictability
D-Cache           82.3%                93.6%
I-Cache           94.9%                90.3%
L2-Cache          93.8%                94.6%
Average           90.3%                92.8%
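A small sketch of the last-value predictor, assuming 8-bit activity vectors scored bit by bit across intervals; the trace values are invented, and the bitwise scoring is only an assumption about how the predictability figures might be computed.

```c
/* Sketch of last-value vector prediction: the vector observed in the
 * previous interval is the prediction for the next one. Illustrative. */
#include <stdint.h>
#include <stdio.h>

#define SUPER_SETS 8

int main(void)
{
    /* Hypothetical per-interval D-cache activity vectors. */
    uint8_t trace[] = { 0xE6, 0xE6, 0xC0, 0x3D, 0x3D, 0x3D, 0xE1 };
    int intervals = sizeof(trace) / sizeof(trace[0]);
    int correct_bits = 0, total_bits = 0;

    for (int t = 1; t < intervals; t++) {
        uint8_t predicted = trace[t - 1];          /* last-value prediction */
        uint8_t wrong = predicted ^ trace[t];      /* mispredicted bits */
        correct_bits += SUPER_SETS - __builtin_popcount(wrong);
        total_bits   += SUPER_SETS;
    }
    printf("bitwise accuracy: %.1f%%\n", 100.0 * correct_bits / total_bits);
    return 0;
}
```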


OS Scheduling Algorithm

[Figure: two run queues holding perlbmk, gzip, mesa, OS tasks, mcf, ammp, parser, and twolf, feeding CPU 0 and CPU 1 of one physical processor; the twolf vector is compared against the candidates' vectors]

- Weighted sum of vectors at each cache level (see the sketch below)
- Vectors from the L2 are given the highest weight
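A minimal sketch of the weighted combination, assuming 8-bit vectors per cache level and weights of 1, 2, and 4 for the I-cache, D-cache, and L2; only the "L2 weighted highest" rule comes from the slide, while the specific weights and vector values are illustrative assumptions.

```c
/* Sketch of combining activity vectors from each cache level into a
 * single interference estimate, with the L2 vector weighted most. */
#include <stdint.h>
#include <stdio.h>

struct job_vectors {
    uint8_t icache;
    uint8_t dcache;
    uint8_t l2;
};

/* Weighted count of overlapping activity bits between two jobs. */
static int interference_score(const struct job_vectors *a,
                              const struct job_vectors *b)
{
    return 1 * __builtin_popcount(a->icache & b->icache) +
           2 * __builtin_popcount(a->dcache & b->dcache) +
           4 * __builtin_popcount(a->l2     & b->l2);     /* L2 weighted highest */
}

int main(void)
{
    /* Per-level activity vectors (illustrative values). */
    struct job_vectors twolf  = { 0x18, 0xE6, 0x3B };
    struct job_vectors parser = { 0x81, 0x3D, 0xD0 };
    struct job_vectors mcf    = { 0x18, 0xC0, 0x78 };

    /* The scheduler would co-schedule twolf with whichever candidate
     * yields the lower weighted overlap. */
    printf("twolf+parser: %d\n", interference_score(&twolf, &parser));
    printf("twolf+mcf:    %d\n", interference_score(&twolf, &mcf));
    return 0;
}
```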


Activity Vector Procedure

- Real system
  - Modified Linux kernel 2.6.0
  - Tested on an Intel P4 Xeon with Hyper-Threading
  - Emulated activity counter registers
- Generate vectors off-line
  - Valgrind memory simulator with text file output (a parsing sketch follows below)
  - Copy the vectors into kernel memory space
- Activate the vector scheduler
- Time and run the workloads
- Simulator: vector hardware with a simulated OS

Program Phase   D-cache Vector   L2-cache Vector
0               11100110         00111011
1               11000000         01111000
2               00111101         11010000
N               11100001         00011100
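A user-space sketch of the off-line path, assuming the simulator emits one line per program phase containing the phase number and the 8-bit D-cache and L2 vectors (as in the table above); the file name, exact file format, and the hand-off to kernel memory space are assumptions rather than the actual tooling.

```c
/* Sketch of parsing off-line generated activity vectors (one line per
 * program phase) into binary form before handing them to the kernel. */
#include <stdint.h>
#include <stdio.h>

#define MAX_PHASES 1024

struct phase_vectors {
    uint8_t dcache;
    uint8_t l2;
};

/* Convert an 8-character bit string such as "11100110" into a byte. */
static uint8_t parse_bits(const char *s)
{
    uint8_t v = 0;
    for (int i = 0; i < 8 && s[i]; i++)
        if (s[i] == '1')
            v |= (uint8_t)(1u << (7 - i));
    return v;
}

int main(void)
{
    static struct phase_vectors table[MAX_PHASES];
    char dvec[16], l2vec[16];
    int phase, n = 0;

    FILE *f = fopen("gzip.vectors", "r");   /* hypothetical file name */
    if (!f) {
        perror("fopen");
        return 1;
    }
    /* Each line: <phase> <8-bit D-cache vector> <8-bit L2 vector> */
    while (n < MAX_PHASES &&
           fscanf(f, "%d %15s %15s", &phase, dvec, l2vec) == 3) {
        table[n].dcache = parse_bits(dvec);
        table[n].l2     = parse_bits(l2vec);
        n++;
    }
    fclose(f);

    printf("loaded %d phases; phase 0 L2 vector = 0x%02x\n",
           n, n ? table[0].l2 : 0);
    return 0;
}
```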


Workloads - Xeon

WL1   gzip.vpr.gcc.mesa.art.mcf.equake.crafty
WL2   parser.gap.vortex.bzip2.vpr.mesa.crafty.mcf
WL3   mesa.twolf.vortex.gzip.gcc.art.crafty.vpr
WL4   gzip.twolf.vpr.bzip2.gcc.gap.mesa.parser
WL5   equake.crafty.mcf.parser.art.gap.mesa.vortex
WL6   twolf.bzip2.vortex.gap.parser.crafty.equake.mcf

- 8 SPEC 2000 jobs per workload
- Combination of integer and floating-point applications
- Run to completion in parallel with OS-level jobs


Comparison of Scheduling Algorithms

[Figure: "Scheduler Selection Analysis" - percentage of schedule invocations rated good, middle, or bad for workloads Wl_1 through Wl_6]

- Default Linux vs. activity-based scheduling
- More than 30% of the default scheduler's decisions could have been improved by the activity-based scheduler


Activity Vector Performance - Xeon

[Figure: "Performance Improvement" - percent improvement (0-7%) in execution time and L2 misses for workloads WL_1 through WL_6 and their average]


Comparing Activity Vectors to Existing Performance Counters - Simulation

Benchmark Mix                                     % Diff.
164.gzip, 164.gzip, 181.mcf, 183.equake           0.0%
164.gzip, 164.gzip, 188.ammp, 300.twolf           12.0%
164.gzip, 177.mesa, 181.mcf, 183.equake           0.0%
164.gzip, 177.mesa, 183.equake, 183.equake        0.0%
164.gzip, 197.parser, 253.perlbmk, 300.twolf      44.4%
177.mesa, 177.mesa, 197.parser, 300.twolf         11.1%
177.mesa, 181.mcf, 253.perlbmk, 256.bzip2         0.0%
177.mesa, 188.ammp, 253.perlbmk, 300.twolf        59.5%
177.mesa, 197.parser, 197.parser, 256.bzip2       96.2%
181.mcf, 181.mcf, 256.bzip2, 256.bzip2            0.0%
181.mcf, 183.equake, 253.perlbmk, 300.twolf       4.0%
181.mcf, 253.perlbmk, 253.perlbmk, 256.bzip2      0.0%
183.equake, 188.ammp, 188.ammp, 256.bzip2         11.1%
188.ammp, 188.ammp, 197.parser, 197.parser        96.2%
188.ammp, 300.twolf, 300.twolf, 300.twolf         8.0%
197.parser, 197.parser, 253.perlbmk, 256.bzip2    0.0%
Average                                           22.5%

- On average, the activity-vector schedule makes different decisions than the performance-counter-based schedule about 23% of the time


ITKO Reduction - Simulation

Benchmarks                   % ITKO Reduction   % IPC Gain
gzip.gzip.mcf.equake         54.0%              3.6%
gzip.gzip.ammp.twolf         10.5%              4.5%
gzip.mesa.mcf.equake         39.5%              3.0%
gzip.mesa.equake.equake      47.0%              2.4%
mesa.mesa.parser.twolf       10.3%              4.8%
mcf.equake.perlbmk.twolf     1.7%               3.0%
mcf.perlbmk.perlbmk.bzip2    13.0%              12.1%
ammp.twolf.twolf.twolf       1.9%               6.1%
Average                      22%                5%


Contributions

- Interference analysis of cache accesses
- Introduce fine-grained performance counters
- General-purpose, adaptable optimization
  - Exposes the microarchitecture to the OS
  - Workload independent
- Tested on a real SMT machine
  - Implemented in the Linux kernel
  - 2-way SMT core


Activity Based Scheduling Summary

- Prevents inter-thread interference
  - Monitors cache access behavior
  - Co-schedules jobs with expected low interference
  - Adapts to phased workload behavior
- Performance improvements
  - Greater than 30% opportunity to improve the default Linux scheduling decisions
  - 22% reduction in inter-thread interference
  - 5% improvement in execution time


Thank You


Super Set Size

- What happens when we change the number of super sets used?
- Can we include a graph here? Slide 17, once we have the data…
- May want to include the tree chart


Performance Challenges

- Difficult to detect interference
- Inter-thread interference is a multi-faceted problem
  - Occurs at low-level, cache-line granularity
  - Temporal variability in benchmark memory requests
  - Dependent on thread pairings
- OS scheduling decisions affect performance
- Current systems
  - Increased cache associativity
  - Could use PMU register feedback


Activity Vectors

- Interface between the OS and the microarchitecture
- Divide the cache into super sets
  - An access counter is assigned to each super set
  - One vector bit corresponds to each counter
  - A bit is set when its counter exceeds a threshold
- Job scheduler
  - Compares the active vector with the jobs in the run queue
  - Selects the job with the fewest common set bits

[Figure: per-super-set access counters (1234, 526, 876, 1635, 1067, 1137, 1000, 254) compared against the Xi > 1024 threshold to produce the vector bits; overlapping set bits indicate "expect interference", disjoint bits indicate "expect no interference"]


OS Scheduling

- OS scheduling is important when there are more jobs than hardware contexts
- Current schedulers use symmetric multiprocessing (SMP) algorithms for SMT processors
- Proposed work
  - For each time interval, co-schedule jobs whose cache accesses are in different regions


- Prevent jobs from running together during program phases where they exhibit high degrees of cache interference

Program Phase   D-cache Vector   L2-cache Vector
0               11100110         00111011
1               11000000         01111000
2               00111101         11010000
N               11100001         00011100