
Page 1: courses.csail.mit.edu/6.888/spring13/lectures/L2-multicore.pdf

6.888 PARALLEL AND HETEROGENEOUS COMPUTER ARCHITECTURE

SPRING 2013

LECTURE 2

ILP, DLP AND TLP IN MODERN MULTICORES

DANIEL SANCHEZ AND JOEL EMER

Page 2:

Review: ILP Challenges

Clock frequency: getting close to pipelining limits

Clocking overheads, CPI degradation

Branch prediction & memory latency limit the practical benefits of out-of-order execution

Power grows superlinearly with higher clock & more OOO logic

Design complexity grows exponentially with issue width

Limited ILP → must exploit TLP and DLP

Thread-Level Parallelism: Multithreading and multicore

Data-Level Parallelism: SIMD


6.888 Spring 2013 - Sanchez and Emer - L02

Page 3:

Review: Memory Hierarchy

Caching: Reduce latency, energy, BW of memory accesses

Why multilevel?

Why not just on-chip memories?

How does parallelism impact latency/BW constraints?

Prefetching: Trade latency for bandwidth, energy, and capacity (pollution)


[Figure: four-core memory hierarchy]

Level        | Size          | Latency      | Bandwidth
L1I / L1D    | 64KB each     | 4 cycles     | 72GB/s
L2 (private) | 256KB/core    | 8 cycles     | 48GB/s
L3 (shared)  | 6MB           | 26-31 cycles | 96GB/s (24GB/s/core)
Main memory  | 16GB          | 150 cycles   | 21GB/s
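One way to read these numbers is through average memory access time (AMAT), which weights each level's latency by the fraction of accesses it serves. A minimal sketch; the latencies follow the figure, but the hit rates are illustrative assumptions, not measurements:

```python
# AMAT for the hierarchy above. Latencies (cycles) follow the figure;
# hit rates are illustrative assumptions.
LEVELS = [(4, 0.90),    # L1: 4 cycles
          (8, 0.80),    # L2: 8 cycles
          (30, 0.75),   # L3: ~26-31 cycles
          (150, 1.00)]  # main memory always "hits"

def amat(levels):
    """Expected access latency: each level serves its hits; misses fall through."""
    expected, p_reach = 0.0, 1.0
    for latency, hit_rate in levels:
        expected += p_reach * hit_rate * latency
        p_reach *= 1.0 - hit_rate
    return expected

print(round(amat(LEVELS), 2))  # 5.44 cycles under these assumed rates
```

With high L1 hit rates the hierarchy behaves close to L1 speed; adding parallelism mostly stresses the bandwidth numbers rather than these latencies.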

Page 4:

Flynn’s Taxonomy

              | Single instruction | Multiple instruction
Single data   | SISD               | MISD (?)
Multiple data | SIMD               | MIMD


Page 5:

SIMD Processing

Same instruction sequence applies to multiple elements

Vector processing → amortize instruction costs (fetch, decode, …) across multiple operations

Requires regular data parallelism (no or minimal divergence)

Exploiting SIMD:

Explicit & low-level, using vector intrinsics

Explicit & high-level, convey parallel semantics (e.g., foreach)

Implicitly: Parallelizing compiler infers loop dependencies

How easy is this in C++? Java?


Page 6:

SIMD Implementations

Modern CPUs: SIMD extensions & wider regs

SSE: 128-bit operands (4x32-bit or 2x64-bit)

AVX (2011): 256-bit operands (8x32-bit or 4x64-bit)

LRB (upcoming): 512-bit operands

Explicit SIMD: Parallelization performed at compile time

GPUs: Architected for SIMD from the ground up

32 to 64 32-bit floats

Implicit SIMD: Scalar binary; multiple instances always run in lockstep

How to handle divergence?
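A common answer to the divergence question is predication: all lanes execute both sides of a branch in lockstep, and a per-lane mask selects which result each lane commits. A pure-Python sketch of the idea (the lane values and operations are made up for illustration; this is not any particular GPU's ISA):

```python
# Lockstep execution of a divergent branch across SIMD lanes.
# Both paths run for every lane; the mask decides what is committed.
def simd_select(mask, then_fn, else_fn, lanes):
    then_results = [then_fn(v) for v in lanes]  # all lanes run the "then" path
    else_results = [else_fn(v) for v in lanes]  # all lanes run the "else" path
    return [t if m else e
            for m, t, e in zip(mask, then_results, else_results)]

# Per-lane branch: if v > 0 then v * 2 else -v
lanes = [1, -2, 3, -4]
mask = [v > 0 for v in lanes]
print(simd_select(mask, lambda v: v * 2, lambda v: -v, lanes))  # [2, 2, 6, 4]
```

The cost is visible in the sketch: under divergence every lane pays for both paths, which is why SIMD wants regular data parallelism.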


Page 7:

Multithreading: Options

Motivation: Hardware is underutilized on stalls → use TLP to increase utilization

CGMT, SMT typically increase throughput with moderate cost, maintain single-thread performance

FGMT typically gains throughput and simplicity at the expense of single-thread performance


Page 8:

Example 1: SMT (Nehalem)

SMT design choices: For each component,

Replicate, partition statically, or share

Tradeoffs? Complexity, utilization, interference & fairness

Example: Intel Nehalem

4-wide superscalar, 2-way SMT

Replicated: Register file, RAS predictor, large-page ITLB

Partitioned: Load buffer, store buffer, ROB, small-page ITLB

Shared: Instruction window, execution units, predictors, caches, DTLBs

SMT policies:

Fetch policies: Utilization vs fairness

Long-latency stall tolerance: Flushing vs stalling


[See: “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor”, Tullsen et al., ISCA 1996]

Page 9:

Example 2: FGMT (Niagara)

4 threads/core, round-robin scheduling

No branch prediction, minimal bypasses → more stalls

Small L1 caches (can tolerate higher L1 miss rates)

But L2 is still large… performance with long-latency stalls?


Page 10:

Example 3: Extreme FGMT (Tera MTA)

Use FGMT to hide all instruction latencies

Worst-case instruction latency is 128 cycles → 128 threads

Benefits: no interlocks, no bypass, and no cache

Problem: single-thread performance

GPUs also exploit high FGMT for latency tolerance (e.g., Fermi, 48-way MT)

Throughput-oriented functional units: Longer latency, deeply pipelined

Throughput-oriented memory system: Small caches, aggressive memory scheduler


[Figure: FGMT pipeline diagram. Instructions from threads T1-T8 (InstT1…InstT8) issue round-robin, each advancing through pipeline stages A-H, so the pipeline stays full without interlocks]
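The sizing argument behind the 128-thread design can be sketched with a simple steady-state model. It assumes round-robin issue, one instruction per thread per turn, and a uniform instruction latency; real workloads are messier, so treat it as a back-of-the-envelope check:

```python
# Steady-state issue-slot utilization under round-robin FGMT:
# a thread can issue again only after its previous instruction
# completes, so T threads can cover a latency of up to T cycles.
def utilization(num_threads, instr_latency):
    return min(1.0, num_threads / instr_latency)

print(utilization(128, 128))  # 1.0: Tera MTA's design point
print(utilization(4, 128))    # 0.03125: too few threads to hide the latency
```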

Page 11:

Multithreading: How Many Threads?

With more HW threads:

Larger/multiple register files

Replicated & partitioned resources → lower utilization, lower single-thread performance

Shared resources → utilization vs. interference and thrashing

Impact of MT/MC on memory hierarchy?


[“Many-Core vs. Many-Thread Machines: Stay Away From the Valley”, Guz et al., CAL 2009]

Page 12:

Amdahl’s Law

Amdahl’s Law: If a change improves a fraction f of the workload by a factor K, the total speedup is:

Not only valid for performance!

Energy, complexity, …

I/D/TLP techniques make different tradeoffs between K and f

SIMD vs. MIMD: how do f and K compare?


Speedup = Time_before / Time_after = 1 / ((1 - f) + f/K)
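The formula is easy to sanity-check in a few lines (the f and K values below are illustrative):

```python
# Amdahl's Law: improving a fraction f of the workload by a factor K.
def amdahl_speedup(f, K):
    return 1.0 / ((1.0 - f) + f / K)

print(round(amdahl_speedup(0.9, 10), 2))   # 5.26
print(round(amdahl_speedup(0.9, 1e9), 2))  # 10.0: capped by the serial 10%
```

The same shape holds for energy or complexity: once K is large, the unimproved fraction dominates.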

Page 13:

Amdahl’s Law in the Multicore Era

[Hill & Marty, CACM 08]

Should we focus on a single approach to extract parallelism?

At what point should we trade ILP for TLP?

Assume a resource-limited multi-core

N base core equivalents (BCEs) due to area or power constraints

A 1-BCE core gives performance 1

An R-BCE core gives performance perf(R)

Assuming perf(R) = sqrt(R) in the following plots (Pollack’s rule)

How should we design the multi-core?

Select type & number of cores

Assume caches & interconnect stay roughly constant

Assume no application scaling (or equal scaling for seq/par portions)



Page 14:

Three Multicore Approaches

Approach       | Large cores (R BCEs/core)                   | Simple cores (1 BCE/core)
Symmetric CMP  | N/R cores; Seq: Perf(R), Par: (N/R)*Perf(R) | -
Asymmetric CMP | 1 core; Seq: Perf(R), Par: Perf(R)          | N-R cores; Par: N-R
Dynamic CMP    | 1 core; Seq: Perf(R), Par: -                | N cores; Par: N

Example (N = 16 BCEs):

Symmetric: 16 1-BCE cores, or 4 4-BCE cores

Asymmetric: 1 4-BCE core & 12 1-BCE cores

Dynamic: adapt between 16 1-BCE cores and 1 16-BCE core



Page 15:

Amdahl’s Law x3

Symmetric speedup  = 1 / ( (1 - F)/Perf(R) + F*R / (Perf(R)*N) )

Asymmetric speedup = 1 / ( (1 - F)/Perf(R) + F / (Perf(R) + N - R) )

Dynamic speedup    = 1 / ( (1 - F)/Perf(R) + F / N )
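These three formulas can be checked directly against the data points quoted on the following slides; a sketch, assuming Pollack's rule Perf(R) = sqrt(R):

```python
from math import sqrt

# Hill & Marty speedup models (N BCEs total, large core of R BCEs).
def perf(r):
    return sqrt(r)  # Pollack's rule

def symmetric(f, n, r):
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def asymmetric(f, n, r):
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

def dynamic(f, n, r):
    return 1.0 / ((1 - f) / perf(r) + f / n)

# Data points quoted on the N = 256 plots:
print(round(symmetric(0.9, 256, 28), 1))  # 26.7
print(round(asymmetric(0.99, 256, 41)))   # 166
print(round(dynamic(0.99, 256, 256)))     # 223
```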



Page 16:


Symmetric Multicore Chip

N = 256 BCEs

[Plot: Symmetric speedup vs. R BCEs/core, R = 1…256, one curve per F in {0.5, 0.9, 0.975, 0.99, 0.999}; results in parentheses are for N = 16]

F=0.9: R=28 (vs. 2), Cores=9 (vs. 8), Speedup=26.7 (vs. 6.7). Core enhancements!

F=0.999: R=1 (vs. 1), Cores=256 (vs. 16), Speedup=204 (vs. 16). More cores!

F=0.99: R=3 (vs. 1), Cores=85 (vs. 16), Speedup=80 (vs. 13.9). Core enhancements & more cores!

Higher N → higher R (more ILP) for fixed F
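The "best R grows with N" observation can be reproduced by sweeping R; a sketch, again assuming Perf(R) = sqrt(R):

```python
from math import sqrt

# Find the R that maximizes symmetric speedup for a given F and N.
def symmetric(f, n, r):
    return 1.0 / ((1 - f) / sqrt(r) + f * r / (sqrt(r) * n))

def best_r(f, n):
    return max(range(1, n + 1), key=lambda r: symmetric(f, n, r))

r = best_r(0.9, 256)
print(r, 256 // r, round(symmetric(0.9, 256, r), 1))  # 28 9 26.7
# The same sweep with n = 16 yields R = 2, the value in parentheses above.
```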


Page 17:


Asymmetric Multicore Chip

N = 256 BCEs

[Plot: Asymmetric speedup vs. R BCEs (size of the single large core), R = 1…256, one curve per F in {0.5, 0.9, 0.975, 0.99, 0.999}]

F=0.9: R=118 (vs. 28), Cores=139 (vs. 9), Speedup=65.6 (vs. 26.7)

F=0.99: R=41 (vs. 3), Cores=216 (vs. 85), Speedup=166 (vs. 80)

(Results in parentheses: the symmetric multicore, N = 256)

Better speedups than symmetric

Software complexity?


Page 18:


Dynamic Multicore Chip

N = 256 BCEs

Results in parentheses refer to the asymmetric multicore (N = 256)

Dynamic offers even higher speedups than asymmetric

SW and HW complexity?

[Plot: Dynamic speedup vs. R BCEs, R = 1…256, one curve per F in {0.5, 0.9, 0.975, 0.99, 0.999}. Note: #cores is always N = 256]

F=0.99: R=256 (vs. 41), Cores=256 (vs. 216), Speedup=223 (vs. 166)


Page 19:

Readings for Wednesday

1. Is Dark Silicon Useful?

2. Dark Silicon and the End of Multicore Scaling

3. Single-Chip Heterogeneous Computing

