Rapid Identification of Architectural Bottlenecks via Precise Event Counting. John Demme, Simha Sethumadhavan, Columbia University, {jdd,simha}@cs.columbia.edu

Post on 23-Feb-2016

Page 1: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Rapid Identification of Architectural Bottlenecks via Precise Event Counting

John Demme, Simha Sethumadhavan
Columbia University

{jdd,simha}@cs.columbia.edu

Page 2: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

2002

CASTL: Computer Architecture and Security Technologies Lab 2

[Pie chart: Language Popularity. Slices: Java, C, C++, Visual Basic, Perl, PHP, Javascript, Delphi, Python, Lisp, Scheme, C#, Objective-C, Other]

Platforms

Source: TIOBE Index http://www.tiobe.com/index.php/tiobe_index

Page 3: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

2011

[Pie chart: Language Popularity. Slices: Java, C, C++, Visual Basic, Perl, PHP, Javascript, Delphi, Python, Lisp, Ada, Scheme, C#, Objective-C, Ruby, Lua, Go, Other]

Source: TIOBE Index http://www.tiobe.com/index.php/tiobe_index

Platforms

Multicore

Moore’s Law

Page 4: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

HOW CAN WE POSSIBLY KEEP UP?

Page 5: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Architectural Lifecycle

Performance Data Collection

Human Analysis

Architectural Improvement

Page 6: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Performance Data Collection

• Analytical Models
  – Fast, but questionable accuracy
• Simulation
  – Often the gold standard
  – Very detailed information
  – Very slow
• Production Hardware (performance counters)
  – Very fast
  – Not very detailed

Page 7: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Performance Data Collection

• Analytical Models
  – Fast, but questionable accuracy
• Simulation
  – Often the gold standard
  – Very detailed information
  – Very slow
• Production Hardware (Performance Counters)
  – Very fast
  – Not very detailed
  – Relatively detailed

Page 8: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

ACCURACY, PRECISION & PERTURBATION

A comparison of performance monitoring techniques and the uncertainty principle

Page 9: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Accuracy, Precision & Perturbation

• In normal execution, program interacts with microarchitecture as expected

[Figure: timeline showing normal program execution alongside the corresponding machine state (cache, branch predictor, etc.)]

Page 10: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Precise Instrumentation

• When instrumentation is inserted, the machine state is disrupted and measurements are inaccurate

[Figure: timeline of monitored program execution with instrumentation at the start of mutex_lock, mutex_unlock, and barrier_wait, comparing the “correct” machine state with the measured machine state (cache, branch predictor, etc.)]

Page 11: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Performance Counter SW Landscape

• Precise (heavyweight)
  – Reads counters whenever program or instrumentation requests a read
  – Examples: PAPI, perf_event
  – Overhead: proportional to # of reads; PAPI: 1048ns, Perf_event: 262ns

Page 12: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Sampling vs. Instrumentation

[Figure: two timelines. Sampled program execution is interrupted every n cycles; traditional instrumented program execution reads counters at the start of mutex_lock, mutex_unlock, and barrier_wait.]

• Traditional instrumentation like polling
• Sampling uses interrupts

Page 13: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Performance Counter SW Landscape

• Sampling
  – Interrupts every n cycles and extrapolates
  – Examples: vTune, OProfile
  – Overhead: inversely proportional to n; up to 20%, usually much less
• Precise (heavyweight)
  – Reads counters whenever program or instrumentation requests a read
  – Examples: PAPI, perf_event
  – Overhead: proportional to # of reads; PAPI: 1048ns, Perf_event: 262ns

Page 14: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

The Problem with Sampling

Sample Interrupt

Is this a critical section?

Page 15: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Corrected with Precision

Read counter

Read counter
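
The contrast between the last two slides can be illustrated with a toy model. The sketch below is not LiMiT or any real profiler: the workload shape (30-cycle critical sections separated by 970 free cycles), the sampling period, and all function names are assumptions chosen for illustration. It compares reading a cycle counter at region boundaries with extrapolating from periodic samples.

```python
# Toy model with made-up numbers: each iteration spends CS_CYCLES inside a
# critical section, then FREE_CYCLES outside it.
CS_CYCLES = 30
FREE_CYCLES = 970
ITERATIONS = 100

def precise_cs_cycles():
    """Read the cycle counter at lock and unlock; sum the exact deltas."""
    total, now = 0, 0
    for _ in range(ITERATIONS):
        start = now            # counter read at start of critical section
        now += CS_CYCLES
        total += now - start   # counter read at end of critical section
        now += FREE_CYCLES
    return total

def sampled_cs_cycles(period, offset=0):
    """Interrupt every `period` cycles; credit `period` cycles per hit."""
    iteration = CS_CYCLES + FREE_CYCLES
    end = ITERATIONS * iteration
    hits = sum(1 for t in range(offset, end, period)
               if t % iteration < CS_CYCLES)
    return hits * period

exact = precise_cs_cycles()                            # 3000
aligned = sampled_cs_cycles(period=1000)               # 100000: every sample hits
shifted = sampled_cs_cycles(period=1000, offset=100)   # 0: every sample misses
```

In this deliberately regular workload the sampled estimate swings between all of execution and none of it depending on phase. Real programs are less regular, but short regions remain hard to attribute from samples alone.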

Page 16: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

But, Precision Adds Overhead

[Figure: timeline of monitored program execution comparing the “correct” machine state with the measured machine state (cache, branch predictor, etc.)]

Page 17: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Instrumentation Adds Perturbation

• If instrumentation sections are short, perturbation is reduced and measurements become more accurate

[Figure: timeline of monitored program execution with short instrumentation sections, comparing the “correct” machine state with the measured machine state (cache, branch predictor, etc.)]

Page 18: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Performance Counter SW Landscape

• Sampling
  – Interrupts every n cycles and extrapolates
  – Examples: vTune, OProfile
  – Overhead: inversely proportional to n; up to 20%, usually much less
• Precise: reads counters whenever program or instrumentation requests a read
  – Heavyweight
    • Examples: PAPI, perf_event
    • Overhead: proportional to # of reads; PAPI: 1048ns, Perf_event: 262ns
  – Lightweight

Page 19: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Performance Counter SW Landscape

• Sampling
  – Interrupts every n cycles and extrapolates
  – Examples: vTune, OProfile
  – Overhead: inversely proportional to n; up to 20%, usually much less
• Precise: reads counters whenever program or instrumentation requests a read
  – Heavyweight
    • Examples: PAPI, perf_event
    • Overhead: proportional to # of reads; PAPI: 1048ns, Perf_event: 262ns
  – Lightweight
    • Examples: LiMiT
    • Overhead: proportional to # of reads; 11ns

Page 20: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Related Work

• No recent papers for better precise counting
  – Original PAPI paper: Browne et al. 2000
  – Some software, none offering LiMiT’s features
• Characterizing performance counters
  – Weaver & Dongarra 2010
• Sampling
  – Counter multiplexing techniques
    • Mytkowicz et al. 2007
    • Azimi et al. 2005
  – Trace Alignment
    • Mytkowicz et al. 2006

Page 21: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

REDUCING COUNTER READ OVERHEADS

Implementing lightweight, precise monitoring

Page 22: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Why Precision is Slow

Avoid system calls to avoid overhead

• Perfmon2 & Perf_event
  – Program requests counter read
  – System Call
  – Kernel reads counter and returns result
  – System Ret
  – Program uses value
• LiMiT
  – Program reads counter
  – Program uses value

Why is this so hard?

Page 23: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

A Self-Monitoring Process

Page 24: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Run, process, run

[Animation: hardware counters for L1 Misses, Branches, and Cycles advance as the process runs]

Page 25: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Overflow

[Animation: the L1 Misses, Branches, and Cycles counters keep advancing until one of them reaches 100, its limit in this example, and overflows]

Psst!

Page 26: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Overflow

[Animation: on overflow the hardware counter resets to 0 and 100 is added to the matching entry in the Overflow Space, a second set of L1 Misses, Branches, and Cycles values]

Page 27: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Modified Read

[Animation: hardware counters (L1 Misses, Branches, Cycles) and the Overflow Space. The modified read adds the hardware counter value, 20, to the Overflow Space value, 100: 20 + 100 = 120]
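
The modified read on this slide can be sketched as follows. This is a toy model in Python, not LiMiT's implementation: the class name, the wrap value of 100 (taken from the slide's example, not from any real counter width), and the loop standing in for the overflow interrupt are all assumptions.

```python
WRAP = 100  # toy counter range used in the slides; real counters are far wider

class ToyCounter:
    def __init__(self):
        self.hw = 0        # hardware counter register
        self.overflow = 0  # per-process "overflow space" kept in memory

    def count(self, events):
        """Advance the counter, spilling each wrap into the overflow space."""
        self.hw += events
        while self.hw >= WRAP:     # stands in for the overflow interrupt
            self.hw -= WRAP
            self.overflow += WRAP

    def read(self):
        """Modified read: overflow space plus the live hardware value."""
        return self.overflow + self.hw

counter = ToyCounter()
counter.count(120)
value = counter.read()   # hw == 20, overflow == 100, so value == 120
```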

Page 28: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Overflow During Read

[Animation: the hardware counter holds 99 and the Overflow Space holds 0; the program reads 99 from the hardware counter]

Page 29: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Overflow!

[Animation: before the read completes, the hardware counter overflows to 0 and the Overflow Space becomes 100; the program still holds the value 99 it read earlier]

Page 30: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Atomicity Violation!

[Animation: the program adds the Overflow Space value to the value it read before the overflow: 99 + 100 = 199]

Page 31: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

OS Detection & Correction

[Animation: the same overflow again: the hardware counter resets to 0 and the Overflow Space becomes 100 while the program holds 99]

Page 32: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

OS Detection & Correction

[Animation: the OS notices the interrupted read and replaces the 99 the program was holding with 0]

Looks like he was reading that…

Page 33: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Atomicity Violation Corrected

[Animation: the program adds the Overflow Space value to the corrected value: 0 + 100 = 100]
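
The overflow-during-read sequence can be replayed with a small Python sketch. It is a toy model under stated assumptions, not LiMiT's code: the two-step read, the `in_read` flag the OS inspects, and all names are hypothetical, and the slides do not say how the real kernel detects an interrupted read.

```python
WRAP = 100  # toy counter range from the slides

class Process:
    def __init__(self, hw):
        self.hw = hw          # hardware counter register
        self.overflow = 0     # overflow space in memory
        self.reg = None       # value the program has read so far
        self.in_read = False  # hypothetical marker the OS can inspect

def one_more_event(p, os_fixes_up):
    """Count one event; on overflow, optionally repair an in-flight read."""
    p.hw += 1
    if p.hw >= WRAP:
        p.hw -= WRAP
        p.overflow += WRAP
        if os_fixes_up and p.in_read:
            p.reg = p.hw      # replace the stale value (99 becomes 0)

def read_counter(p, os_fixes_up):
    p.in_read = True
    p.reg = p.hw                    # step 1: read the hardware counter
    one_more_event(p, os_fixes_up)  # overflow lands between the two steps
    value = p.reg + p.overflow      # step 2: add the overflow space
    p.in_read = False
    return value

broken = read_counter(Process(hw=99), os_fixes_up=False)  # 99 + 100 = 199
fixed = read_counter(Process(hw=99), os_fixes_up=True)    # 0 + 100 = 100
```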

So what does all this effort buy us?

Page 34: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Time to collect 3×10^7 readings

Time     PAPI     Perf_event   LiMiT    Speedup
User     1.26s    0.53s        0.34s    3.7x / 1.56x
System   30.10s   7.30s        0        ∞
Wall     31.44s   7.87s        0.34s    92x / 23.1x

Average LiMiT Readout

Number of instructions   5
Number of cycles         37.14
Time                     11.3 ns
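
As a consistency check, the per-read costs quoted on the earlier landscape slides follow from the wall-clock times and the number of readings. The short Python calculation below uses only figures from this slide; the variable names are illustrative.

```python
READINGS = 3e7  # number of counter readings collected

# Wall-clock seconds from the table above.
wall_s = {"PAPI": 31.44, "perf_event": 7.87, "LiMiT": 0.34}

# Cost of one reading in nanoseconds.
ns_per_read = {name: t / READINGS * 1e9 for name, t in wall_s.items()}

# Wall-clock speedup of LiMiT over each heavyweight interface.
speedup = {name: wall_s[name] / wall_s["LiMiT"]
           for name in ("PAPI", "perf_event")}

# Clock rate implied by the average readout: 37.14 cycles in 11.3 ns.
implied_ghz = 37.14 / 11.3

for name, ns in ns_per_read.items():
    print(f"{name}: {ns:.1f} ns per read")
```

The results come out at roughly 1048 ns, 262 ns, and 11.3 ns per read, and speedups of about 92x and 23.1x, matching the figures quoted in the slides.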

Page 35: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

LiMiT Enables Detailed Study

• Short counter reads decrease perturbation
• Little perturbation allows detailed study of
  – Short synchronization regions
  – Short function calls
• Three Case Studies
  – Synchronization in production web applications
    • Not presented here, see paper
  – Synchronization changes in MySQL over time
  – User/Kernel code behavior in runtime libraries

Page 36: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

CASE STUDY: LONGITUDINAL STUDY OF LOCKING BEHAVIOR IN MYSQL

Has MySQL gotten better since the advent of multi-cores?

Page 37: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Evolution of Locking in MySQL

• Questions to answer
  – Has MySQL gotten better at locking?
  – What techniques have been used?
• Methodology
  – Intercept pthread locking calls
  – Count overheads and critical sections
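
The methodology above (intercept locking calls, then attribute time to locking, lock-held, and unlocking regions) can be sketched in Python. This is only an analogy under stated assumptions: the study intercepts pthread calls and reads hardware counters with LiMiT, whereas this sketch wraps `threading.Lock` and uses `time.perf_counter_ns()` as a stand-in counter. The class name `InstrumentedLock` is hypothetical.

```python
import threading
import time

class InstrumentedLock:
    """Wraps threading.Lock and attributes elapsed time to three regions."""

    def __init__(self):
        self._lock = threading.Lock()
        self._acquired_at = 0
        self.ns = {"locking": 0, "held": 0, "unlocking": 0}
        self.acquisitions = 0

    def acquire(self):
        t0 = time.perf_counter_ns()      # counter read at start of lock call
        self._lock.acquire()
        t1 = time.perf_counter_ns()      # counter read once the lock is held
        self.ns["locking"] += t1 - t0
        self.acquisitions += 1
        self._acquired_at = t1

    def release(self):
        t0 = time.perf_counter_ns()      # counter read at start of unlock call
        self.ns["held"] += t0 - self._acquired_at
        self._lock.release()
        # This update happens outside the lock; multi-threaded use would
        # need per-thread storage or separate synchronization.
        self.ns["unlocking"] += time.perf_counter_ns() - t0

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc):
        self.release()

lock = InstrumentedLock()
for _ in range(1000):
    with lock:
        pass  # critical section under study
```

Each acquire/release pair costs three counter reads, which is why a read cost of about a microsecond would dominate critical sections that last only a few hundred cycles.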

Page 38: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

MySQL Synchronization Times

[Stacked bar chart: percentage of execution spent Free, Locking, Lock Held, and Unlocking for MySQL 4.1 (2004), MySQL 5.0 (2005), MySQL 5.1 (2008), and MySQL 5.5 (Beta, 2009); y-axis: Percentage of Execution, 0% to 100%]

Page 39: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

MySQL Critical Sections

[Chart: Overall Time With Lock Held and Avg. Lock Hold Time for MySQL 4.1 (2004), MySQL 5.0 (2005), MySQL 5.1 (2008), and MySQL 5.5 (Beta, 2009); left axis: Percentage of Execution with Lock Held, 0% to 45%; right axis: Average Number of Cycles Lock is Held, 0 to 1400]

Page 40: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Number of Locks in MySQL

[Chart: Dynamic Locks and Static Locks for MySQL 4.1 (2004), MySQL 5.0 (2005), MySQL 5.1 (2008), and MySQL 5.5 (Beta, 2009); Dynamic Locks axis: 0E+00 to 6E+08; Static Locks axis: 0E+00 to 4E+05]

Page 41: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Observations & Implications

• Coarser granularity, better performance
  – Total critical section time has decreased
  – Average CS times have increased
  – Number of locks has decreased
• Performance counters useful for software engineering studies

Page 42: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

CASE STUDY: KERNEL/USERSPACE OVERHEADS IN RUNTIME LIBRARY

Does code in the kernel and runtime library behave?

Page 43: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Full System Analysis w/o Simulation

• Questions to answer
  – How much time do system applications spend in runtime libraries?
  – How well do they perform in them? Why?
• Methodology
  – Intercept common libc, libm and libpthread calls
  – Count user-/kernel-space events during the calls
  – Break down by purpose (I/O, Memory, Pthread)
• Applications
  – MySQL, Apache
• Intel Nehalem Microarchitecture
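
The interception and breakdown steps can be sketched in Python. This is a loose analogy, not the study's tooling: it wraps ordinary Python functions where the study intercepts libc, libm, and libpthread calls, and it uses `resource.getrusage` (Unix only) for user and kernel CPU time where the study reads hardware event counters. The decorator name `intercept` and the two sample functions are hypothetical.

```python
import functools
import os
import resource
from collections import defaultdict

# category -> [user_seconds, kernel_seconds]
totals = defaultdict(lambda: [0.0, 0.0])

def intercept(category):
    """Wrap a call and charge its user/kernel CPU time to `category`."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            before = resource.getrusage(resource.RUSAGE_SELF)
            try:
                return fn(*args, **kwargs)
            finally:
                after = resource.getrusage(resource.RUSAGE_SELF)
                totals[category][0] += after.ru_utime - before.ru_utime
                totals[category][1] += after.ru_stime - before.ru_stime
        return wrapper
    return decorate

@intercept("I/O")
def write_devnull(data):
    with open(os.devnull, "wb") as f:
        f.write(data)

@intercept("Memory")
def allocate(n):
    return bytearray(n)

for _ in range(200):
    write_devnull(b"x" * 4096)
    allocate(1 << 16)

for category, (user_s, kernel_s) in totals.items():
    print(f"{category}: user {user_s:.4f}s, kernel {kernel_s:.4f}s")
```

The timer behind `getrusage` is much coarser than a hardware counter, so short calls may register as zero. That limitation is the kind of gap a low-overhead counter read is meant to close.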

Page 44: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Execution Cycles in Library Calls

[Stacked bar chart: Pthreads, Memory, and I/O library calls for MySQL (User), MySQL (Kernel), Apache (User), and Apache (Kernel); y-axis: Percentage of Total Cycles, 0% to 50%]

Page 45: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

MySQL Clocks per Instruction

[Bar chart: categories User, Kernel, Libc, Program; y-axis: Clocks per Instruction, 0 to 2]

Page 46: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

L3 Cache MPKI

[Bar chart: L3 MPKI for I/O, Memory, and Pthreads calls; MySQL (User), MySQL (Kernel), and Apache (User) on an axis from 0 to 2, and Apache (Kernel) on a separate axis from 0 to 35]

Page 47: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

I-Cache Stall Cycles

[Bar chart: I/O, Memory, and Pthreads calls for MySQL (User), MySQL (Kernel), Apache (User), and Apache (Kernel); y-axis: Percentage of Total Cycles, 0.0% to 3.0%; two bars carry the labels 22.4% and 12.0%]

Page 48: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Observations & Implications

• Apache is fundamentally I/O bound
  – Optimization of the I/O subsystem necessary
• Kernel code suffers from I-Cache stalls
  – Speculation: bad interrupt instruction prefetching
• LiMiT yields detailed performance data
  – Not as accurate or detailed as simulation
  – But gathered in hours rather than weeks

Page 49: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

CONCLUSIONS

Research Methodology Implications, Closing thoughts

Page 50: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Conclusions

• Implications from case studies
  – MySQL’s multicore experience helped scalability
  – Performance counting for non-architecture
  – Libraries and kernels perform very differently
  – I/O subsystems can be slow
• Research Methodology
  – LiMiT can provide detailed results quickly
  – Simulators are more detailed but slow
  – Opportunity to build microbenchmarks
    • Identify bottlenecks with counters
    • Verify representativeness with counters
    • Then simulate

Page 51: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

QUESTIONS?

Page 52: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

BACKUP SLIDES

Man down! Need backup!

Page 53: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Performance Evaluation Methods

                      Accuracy  Precision  Speed  Cost
Simulators            ↑         ↑          ↓      ↑/↓
Analytical Models     ?         ?          ↑      ↓
Prototype Hardware    ↑         ↑          ↑      ↑
Production Hardware   ↑/↓       ↑/↓        ↑      ↓

Accuracy and Precision are traded off

• Production hardware provides performance counters
• However, existing interfaces make accuracy/precision tradeoff difficult

Page 54: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Sampling vs. LiMiT

[Figure: two timelines. Sampled program execution is interrupted every n cycles; LiMiT instrumented program execution reads counters at the start of mutex_lock, mutex_unlock, and barrier_wait.]

Page 55: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Another process runs

[Animation: while another process runs, the counters (relabeled Miles, Pushups, Situps) keep advancing]

Page 56: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Fix: Virtualization

[Animation: counters labeled Miles, Pushups, Situps; the Miles counter shows 30 after another process has run]

“30 Miles! I did pretty well today.”

“No you didn’t.”
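
Counter virtualization can be sketched as saving and restoring the counter on every context switch. The Python toy below is an assumption-laden illustration, not LiMiT's kernel code: the `Core` class, the process names, and the event counts (7 and 23) are made up to echo the cartoon.

```python
class Core:
    """One hardware counter shared by whatever process is running."""

    def __init__(self, virtualize):
        self.virtualize = virtualize
        self.counter = 0
        self.current = None
        self.saved = {}   # per-process counter values kept by the OS

    def run(self, pid, events):
        """Schedule `pid`, count `events`, return what `pid` reads."""
        if self.virtualize and pid != self.current:
            if self.current is not None:
                self.saved[self.current] = self.counter  # save outgoing
            self.counter = self.saved.get(pid, 0)        # restore incoming
        self.current = pid
        self.counter += events
        return self.counter

shared = Core(virtualize=False)
shared.run("me", 7)
shared.run("other", 23)
polluted = shared.run("me", 0)   # 30: includes the other process's events

private = Core(virtualize=True)
private.run("me", 7)
private.run("other", 23)
correct = private.run("me", 0)   # 7: only this process's own events
```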

Page 57: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Avoiding Communication

[Animation: two sets of counters labeled Miles, Pushups, Situps]

Page 58: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

LiMiT Operation

Page 59: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

RDTSC

Page 60: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

MySQL Instrumentation Overhead

[Bar chart: MySQL Execution Cycles (User Time) with no instrumentation, LiMiT, perf_event, and PAPI; axis from 0.00E+00 to 2.50E+12]

Page 61: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

CASE STUDY A: LOCKING IN WEB WORKLOADS

How does web-related software use locks?

Page 62: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Locking on the Web

• Questions to answer
  – Is locking a significant concern?
  – How can architects help?
  – Are traditional benchmarks similar?
• Methodology
  – Intercept pthread mutex calls, time w/ LiMiT
• Applications
  – Firefox
  – Apache
  – MySQL
  – PARSEC

Page 63: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Execution Time by Region

[Stacked bar chart: percentage of total user cycles spent Free, Lock, Lock Held, and Unlock for Firefox, Apache, Parsec, and MySQL measured with LiMiT, and for Apache, Parsec, and MySQL measured with PAPI; y-axis: Percentage of Total User Cycles, 0% to 100%]

Page 64: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Locking Statistics

                               Firefox  Apache  PARSEC  MySQL
Avg. Lock Held Time (cycles)   789      149     118     1076
Dynamic Locks per 10k Cycles   3.24     1.12    0.545   3.18
Static Locks                   57       1       17      13853

Page 65: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Observations & Implications

• Applications like Firefox and MySQL use locks differently from Apache and PARSEC
  – Many notions of synchronization based on scientific computing probably don’t apply
• Locking overheads up to 8 - 13%
  – More efficient mechanisms may be helpful
  – But, 13% is upper bound on speedup
• MySQL has some very long critical sections
  – Prime targets for micro-arch optimization
  – If they run faster, MySQL scales better

Page 66: Rapid Identification of Architectural Bottlenecks via Precise Event Counting

Hardware Enhancements

• 64-bit Reads and Writes
  – Overflows are primary source of complexity
  – 64-bit counters w/ full read/write eliminates it
• Destructive Reads
  – Difference = 2 reads, store, load & subtract
  – Destructive read difference = 2 reads
• Combined Reads
  – x86 counter read requires 2 instructions
  – Combining should reduce overhead
• AMD’s Lightweight Profiling Proposal
  – Really good, depending on microarchitecture
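
The destructive-read proposal can be illustrated with a toy counter in Python. This sketch models the idea only: `ToyCounter`, its methods, and the event count of 42 are hypothetical, and no existing hardware interface is implied.

```python
class ToyCounter:
    def __init__(self):
        self.value = 0

    def tick(self, events):
        self.value += events

    def read(self):
        """Ordinary read: returns the running total."""
        return self.value

    def destructive_read(self):
        """Hypothetical read that returns the total and resets it to zero."""
        value, self.value = self.value, 0
        return value

def region_of_interest(counter):
    counter.tick(42)   # stand-in for the events in the measured region

# Conventional difference: read, store, (region), read, load, subtract.
c1 = ToyCounter()
c1.tick(1000)                    # earlier, unrelated events
start = c1.read()                # read + store
region_of_interest(c1)
delta_plain = c1.read() - start  # read + load + subtract

# Destructive difference: two reads, nothing to store or subtract.
c2 = ToyCounter()
c2.tick(1000)
c2.destructive_read()            # first read clears the counter
region_of_interest(c2)
delta_destructive = c2.destructive_read()
```

Both approaches report the same 42 events; the destructive version simply moves the subtraction into the read itself.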