Analytical Performance Modeling of Hierarchical Interconnect Fabrics
Nikita Nikitin, Javier de San Pedro, Josep Carmona and Jordi Cortadella
Universitat Politècnica de Catalunya
Supported by Intel Corporation
International Symposium on Networks-on-Chip (NOCS) 2012, Copenhagen, Denmark
Outline
• Introduction
– Hierarchical Chip Multiprocessors (CMPs)
– Performance modeling for CMPs
– The cyclic dependency between latency and traffic
• Analytical performance modeling
– Modeling traffic
– Modeling latency
– Methods to resolve the dependency
• Results and conclusions
NOCS'12 Universitat Politècnica de Catalunya 2
The trends in CMP design
• Hundreds of computing units per chip
– Smaller, simpler, more power-efficient cores
• Advanced memory management
– Larger on-chip cache
– Increasing interconnect (IC) bandwidth
• Tiled architecture
[Figure: tiled CMP with a 4x4 mesh of routers (R) and two memory controllers. Each tile holds a core (C), L1 and L2 caches, and a router.]
Hierarchical interconnects
[Figure: tiled CMP with hierarchical interconnect. Each mesh router (R) attaches through a network interface (NI) to a local interconnect (IC: bus or ring) shared by cores with L1 caches (C+L1), L2 caches, an L3 cache and a directory (Dir). Two memory controllers attach to the mesh.]
• Exploit locality of memory references*
* “Design and Evaluation of a Hierarchical On-Chip Interconnect for Next-Generation CMPs”, R.Das et al., HPCA, 2009
Design of CMP architecture
• Goal: efficient use of chip resources
– Maximize performance
– Fit area/power/thermal budget
• Multidimensional exploration space (#cores / cache size / memory hierarchy / IC topologies / …)
• Means: automated design space exploration
– Analytical performance models are essential
[Figure: CMP built from clusters (cores C, L3, directory D) attached to routers (R) through local interconnects (IC), with memory controllers (MC)]
Contention modeling
• Contention impacts CMP performance
• Crucial for evaluating hierarchical interconnects
– Is the required bandwidth sustainable?
[Figure: tiled CMP with hierarchical interconnect, annotated with the design questions below]
# of wires? Router architecture? Local IC topology?
Motivational example
48 cores, 16 cache modules, in three configurations:
(a) 8x8 mesh
(b) 4x4 mesh with bus clusters
(c) 2x2 mesh with bus clusters
[Figure: layouts of (a), (b) and (c) with a legend for core, cache and IC, and a bar chart of throughput (IPC, 0 to 10) for each configuration, estimated with no contention and with contention]
Estimation without contention is very inaccurate!
Analytical modeling of CMP performance
• Analytical models for ICs:
– Latency L as a function of traffic λ
– λ defined by the workload
Emphasis: λ depends on L!
• This work: resolve the cyclic dependency of traffic and latency
– Formulate λ as a function of L
– Add existing model for L(λ)
– Resolve the system efficiently
[Figure: each core i of Core1 … CoreN sends traffic λi to the memory subsystem and observes latency Li; L and λ together determine IPC (throughput)]
Modeling memory traffic
Parameters of core executing some workload:
1. Ideal Cycles Per Instruction (CPI)
2. # Memory references Per Instruction (MPI)
Real performance of in-order core:
Traffic to memory (probability of a memory reference per cycle):
Equation terms labeled on the slide: average latency of memory access, memory access penalty
[Figure: core sending traffic λ to the memory subsystem and observing latency L]
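The relation between core performance and memory traffic on this slide can be sketched in code. This is a minimal sketch under an assumed in-order stall model, not necessarily the authors' exact formulation: every memory reference stalls the core for the average latency L, so the effective CPI is the ideal CPI plus MPI × L (the memory access penalty), and the traffic λ is MPI divided by that CPI. The names `core_cpi`, `traffic_rate` and `cpi0` are illustrative. The sample values assume the case-study parameters (IPC0 = 2.0, hence an ideal CPI of 0.5, and MPI = 0.5).

```python
def core_cpi(cpi0, mpi, latency):
    """Effective CPI of an in-order core (assumed model): ideal CPI plus memory access penalty."""
    return cpi0 + mpi * latency


def traffic_rate(cpi0, mpi, latency):
    """Memory references issued per cycle (lambda), assumed to be MPI / CPI."""
    return mpi / core_cpi(cpi0, mpi, latency)


if __name__ == "__main__":
    # Longer memory latency slows the core, which in turn lowers the traffic it generates.
    for latency in (2.0, 10.0, 40.0):
        lam = traffic_rate(cpi0=0.5, mpi=0.5, latency=latency)
        print(f"L = {latency:5.1f} cycles -> lambda = {lam:.4f} references/cycle")
```

Under this assumption λ falls as L grows, which is the dependency the rest of the talk resolves.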
Modeling average memory latency
• Average latency of memory requests for a core:
Latencies are calculated using
- Cache latencies
- Interconnect topology
- Routing algorithm (XY)
Probabilities are calculated using
- Miss ratio dependency on cache size
[Figure: two plots of application miss ratio versus cache size (Mb, 0 to 10). Example points: 15% miss in 64K L1, 5% miss in 1M L2.]
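A worked example of how latencies and probabilities combine into an average latency. This is a hedged sketch, not the slide's own equation: it assumes the miss ratios are global (the fraction of all references that miss a cache of that size, as read off a miss-ratio curve) and it leaves out interconnect latency. The cache latencies (2 cycles for 64K, 6 cycles for 1M) and the 100-cycle memory controller latency are taken from the case-study parameters. `average_latency` is an illustrative name.

```python
def average_latency(levels, memory_latency):
    """Average memory access latency in cycles.

    levels: list of (access_latency_cycles, global_miss_ratio), ordered L1, L2, ...
    A reference pays the latency of every level it reaches.
    """
    total = 0.0
    reach = 1.0  # probability that a reference reaches the current level
    for latency, miss_ratio in levels:
        total += reach * latency
        reach = miss_ratio  # global miss ratio: share of all references that go further
    return total + reach * memory_latency


if __name__ == "__main__":
    # 15% of references miss the 64K L1, 5% miss the 1M L2 and go to memory.
    print(average_latency([(2, 0.15), (6, 0.05)], memory_latency=100))
```

With these numbers the sum is 2 + 0.15 × 6 + 0.05 × 100 = 7.9 cycles, before any interconnect or contention delay is added.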
Modeling contention latency
[Figure: mesh NoC of routers (R) with clusters (CL) and memory controllers (MC); each bus-based cluster contains cores (C), an L3 cache, a directory (D) and a network interface (NI)]
Delays in queues are defined by extending the M/G/1 queuing model:
“An Analytical Approach for Network-on-Chip Performance Analysis”, Ogras et al., TCAD, 2010 (Best Paper Award)
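For reference, the plain M/G/1 mean waiting time (the Pollaczek-Khinchine formula) that such models start from can be sketched as below. This is the textbook single-queue formula only; the router-level extension by Ogras et al. is not reproduced here. The function name and parameters are illustrative.

```python
import math


def mg1_waiting_time(arrival_rate, mean_service, service_cv=0.0):
    """Mean time spent waiting in an M/G/1 queue (Pollaczek-Khinchine).

    arrival_rate: Poisson arrival rate (requests per cycle)
    mean_service: mean service time (cycles)
    service_cv:   coefficient of variation of the service time
                  (0 for deterministic service, 1 for exponential service)
    """
    rho = arrival_rate * mean_service  # utilization
    if rho >= 1.0:
        return math.inf  # the queue is saturated
    return rho * mean_service * (1.0 + service_cv ** 2) / (2.0 * (1.0 - rho))


if __name__ == "__main__":
    for rate in (0.05, 0.10, 0.15, 0.19):
        print(f"rate = {rate:.2f} -> wait = {mg1_waiting_time(rate, 5.0):.2f} cycles")
```

The waiting time grows without bound as utilization approaches 1, which is why hop-count estimates alone become inaccurate at high traffic.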
The cyclic dependency of L and λ
System of non-linear equations
• Solve using numerical methods
• General methods are very slow
– 10x10 mesh (10K vars./eqns.)
– MATLAB timeout after a few hours
• Proposed methods:
– Fixed-point iteration
– Bisection search for λ
Analytical model for latency: any “black-box” model for L(λ)!
Fixed-point iteration
+ Fast (10x10 mesh in several ms)
+ Converges to the exact solution
– May not converge for high λ
[Figure: L(λ), the characteristic of the IC, and λ(L), the characteristic of the cores/workload, plotted as average latency L (cycles, 0 to 50) versus average traffic rate λ (flits/cycle, 0 to 0.2), with the hop-count latency marked]
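A minimal scalar sketch of fixed-point iteration between the two models. Everything here is a toy under stated assumptions: `latency` is a hypothetical black-box L(λ) (hop-count latency plus an M/D/1 queueing delay on one shared resource), `traffic` is an assumed in-order model λ(L) = MPI / (CPI0 + MPI·L), and the real system has one equation per component rather than a single scalar.

```python
import math


def latency(lam, hop_latency=10.0, service=4.0):
    """Hypothetical black-box L(lambda): hop-count latency plus M/D/1 queueing delay."""
    rho = lam * service  # utilization of the single shared resource
    if rho >= 1.0:
        return math.inf  # saturated
    return hop_latency + rho * service / (2.0 * (1.0 - rho))


def traffic(lat, cpi0=0.5, mpi=0.5):
    """Assumed in-order traffic model lambda(L) = MPI / (CPI0 + MPI * L)."""
    return mpi / (cpi0 + mpi * lat)


def fixed_point(latency_fn, traffic_fn, hop_latency, tol=1e-9, max_iter=1000):
    """Alternate the two models, starting from the hop-count latency.

    Returns (lambda, L), or None if the iteration has not settled after max_iter steps.
    """
    lam = traffic_fn(hop_latency)
    for _ in range(max_iter):
        lat = latency_fn(lam)
        new_lam = traffic_fn(lat)
        if abs(new_lam - lam) < tol:
            return new_lam, lat
        lam = new_lam
    return None


if __name__ == "__main__":
    print(fixed_point(latency, traffic, hop_latency=10.0))
    # A slower shared resource saturates at lower traffic; here the iteration does not settle.
    print(fixed_point(lambda lam: latency(lam, service=12.0), traffic, hop_latency=10.0))
```

In this toy the first call settles near λ = 1/12 and L = 11 cycles, while the second keeps jumping between a saturated and an idle estimate, a small-scale picture of the "may not converge for high λ" caveat.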
Bisection search for λ
– As fast as fixed-point iteration
– Always converges to an approximate solution (good for homogeneous clusters)
[Figure: L(λ), the characteristic of the IC, and λ(L), the characteristic of the cores/workload, plotted as average latency L (cycles, 0 to 50) versus average traffic rate λ (flits/cycle, 0 to 0.2), with the points λ=0 and λ(Lhop-count) marked]
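A minimal scalar sketch of the bisection idea, assuming the search interval is bounded by the two marked points: λ = 0 and the traffic the cores would generate at hop-count latency. As before, `latency` is a hypothetical black-box L(λ) and `traffic` an assumed in-order model, so this illustrates the technique rather than the authors' implementation.

```python
import math


def latency(lam, hop_latency=10.0, service=12.0):
    """Hypothetical black-box L(lambda): hop-count latency plus M/D/1 queueing delay."""
    rho = lam * service
    if rho >= 1.0:
        return math.inf  # saturated
    return hop_latency + rho * service / (2.0 * (1.0 - rho))


def traffic(lat, cpi0=0.5, mpi=0.5):
    """Assumed in-order traffic model lambda(L) = MPI / (CPI0 + MPI * L)."""
    return mpi / (cpi0 + mpi * lat)


def bisect_traffic(latency_fn, traffic_fn, hop_latency, tol=1e-9):
    """Find lambda with traffic_fn(latency_fn(lambda)) == lambda by bisection.

    At lambda = 0 the cores would inject more than 0; at lambda(L_hop) they would
    inject no more than that, so the crossing lies inside the interval.
    """
    lo, hi = 0.0, traffic_fn(hop_latency)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if traffic_fn(latency_fn(mid)) > mid:
            lo = mid  # cores would generate more traffic than mid: solution is above
        else:
            hi = mid  # includes the saturated case, where traffic_fn returns 0
    lam = 0.5 * (lo + hi)
    return lam, latency_fn(lam)


if __name__ == "__main__":
    print(bisect_traffic(latency, traffic, hop_latency=10.0))
```

With a service time of 12 cycles the upper bound of the interval is already past saturation, yet the search still closes in on the crossing at λ = 0.05 and L = 19 cycles, because it only needs the sign of λ(L(λ)) − λ.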
Performance of analytical methods
| Test | Mesh | Cont. lat. | Num. of var./eqn. | Runtime (sec): MATLAB | Runtime (sec): Fixed-Point | Runtime (sec): Bisection |
|---|---|---|---|---|---|---|
| T1 | 2 x 2 | 5% | 236 | 0.023 | 0.001 | 0.001 |
| T2 | 4 x 4 | 13% | 1224 | 1.412 | 0.001 | 0.002 |
| T3 | 6 x 6 | 8% | 3108 | 30.831 | 0.002 | 0.003 |
| T4 | 8 x 8 | 12% | 6128 | 408.539 | 0.006 | 0.010 |
| T5 | 10 x 10 | 23% | 10260 | Timeout (1hr) | 0.010 | 0.012 |
| T6 | 10 x 10 | 46% | 10260 | Timeout (1hr) | 0.022 | 0.015 |
| T7 | 10 x 10 | 55% | 10260 | Timeout (1hr) | NA | 0.016 |
Case study: performance exploration
| Parameter | Value |
|---|---|
| Chip area | 350 mm² |
| Core area | 1.25 mm² |
| Core IPC0 | 2.0 |
| MPI | 0.5 |
| L1 size | 64, 128 Kb |
| L2 size | 64 Kb to 3 Mb |
| Memory density | 1 mm² / Mb |
| Mesh dimensions | 2x2 to 16x16 |
| MC latency | 100 cycles |

[Figure: miss ratio (0 to 0.25) versus cache size (Mb, 0 to 10)]
1062 configurations explored

| Cache size | 64K | 128K | 256K | 512K | 1M | 2M | 4M | 8M |
|---|---|---|---|---|---|---|---|---|
| Area* (mm²) | 0.063 | 0.125 | 0.25 | 0.5 | 1.0 | 2.0 | 4.0 | 8.0 |
| Latency (cycles) | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Simulation environment
• Verify model by simulation
• Cycle-accurate NoC simulator
– On top of BookSim 2.0
• Extensions
– Hierarchical networks
– Bus topologies
– Probabilistic state machines for cores and memories
[Figure: network simulation with a global (mesh) level and a local (bus, ring, …) level; nodes include cores, L3 cache memory and memory controllers]
Faithfulness of the model
[Figure: throughput (IPC, 0 to 35) of the explored configurations, sorted in descending order of throughput, for modeling and simulation]
• Average difference in throughput is about 10%
• Corresponds to the error of the latency model
Best-throughput ordering
Simulation time: 5.5 hours
Modeling time: 16.8 sec (>1000x faster)
[Figure: number of best configurations by analysis that include the N best configurations by simulation, plotted against N. Curves: static latency (labeled "No contention"), full latency (labeled "With contention") and ideal (simulation). One panel covers N up to 1000, a second zooms in on N up to 60. Labeled points: (1; 33), (4; 44), (1; 2), (4; 6), (50; 64).]
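The ordering metric on this slide can be read as: for the N best configurations by simulation, how many of the best configurations by analysis must be kept so that all N are included. A hedged sketch of that reading follows. The function name and the toy scores are illustrative, not data from the study.

```python
def analysis_depth(sim_scores, model_scores, n):
    """Smallest K such that the top-K configurations by model contain the top-n by simulation.

    sim_scores[i] and model_scores[i] are the throughputs of configuration i;
    n must be between 1 and the number of configurations.
    """
    by_sim = sorted(range(len(sim_scores)), key=lambda i: -sim_scores[i])
    by_model = sorted(range(len(model_scores)), key=lambda i: -model_scores[i])
    wanted = set(by_sim[:n])
    for k, cfg in enumerate(by_model, start=1):
        wanted.discard(cfg)
        if not wanted:
            return k
    raise ValueError("n exceeds the number of configurations")


if __name__ == "__main__":
    sim = [5.0, 4.0, 3.0, 2.0, 1.0]    # toy simulated throughputs
    model = [4.0, 5.0, 1.0, 2.0, 3.0]  # toy modeled throughputs
    for n in (1, 2, 3):
        print(n, analysis_depth(sim, model, n))
```

A perfect model gives a depth of exactly N (the ideal line). The stated speedup also checks out: 5.5 hours is 19,800 seconds, and 19,800 / 16.8 is roughly 1,180.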
Conclusions
• Analytical modeling of contention in CMPs is essential
• There is a cyclic dependency between the latency and the traffic of memory requests
• This dependency can be efficiently resolved using numerical methods (fixed-point, bisection)
• Precision of the model is significantly improved
• Current work: out-of-order cores, heterogeneity
Backup
Fixed-point convergence issues
Sufficient for convergence of :
[Figure: L(λ) and λ(L) curves, average latency L (cycles, 0 to 50) versus average traffic rate λ (flits/cycle, 0 to 0.2), with the hop-count latency marked]
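The condition itself is not preserved on this slide. As a hedged reconstruction of the standard argument, consider the scalar iteration that alternates the two models and assume the in-order traffic model Λ(L) = MPI / (CPI0 + MPI·L) with a non-decreasing latency model L(λ). The symbols g, Λ and CPI0 are notation introduced here for illustration.

```latex
\begin{align*}
\lambda_{k+1} &= g(\lambda_k), \qquad g(\lambda) = \Lambda\bigl(L(\lambda)\bigr), \qquad
\Lambda(L) = \frac{\mathrm{MPI}}{\mathrm{CPI}_0 + \mathrm{MPI}\cdot L} \\
\frac{d\Lambda}{dL} &= -\frac{\mathrm{MPI}^2}{\bigl(\mathrm{CPI}_0 + \mathrm{MPI}\cdot L\bigr)^2} = -\Lambda(L)^2
\quad\Longrightarrow\quad
\bigl|g'(\lambda)\bigr| = \Lambda\bigl(L(\lambda)\bigr)^2 \, L'(\lambda) \\
(\lambda^*)^2 \, L'(\lambda^*) &< 1
\quad\Longrightarrow\quad
\bigl|g'(\lambda^*)\bigr| < 1
\quad\text{(local convergence to the solution } \lambda^* = g(\lambda^*)\text{)}
\end{align*}
```

Under these assumptions the iteration is safe where the latency curve is flat and fails where it is steep, that is, close to saturation.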
Bisection search
[Figure: latency model and traffic model]
Average latency calculation
• Average Memory Access Time (AMAT):
Best configuration
- 6x6 mesh, 36 clusters, 5 cores/cluster
- total 180 cores with 64K L1, 256K L2
- 68Mb total shared L3
Throughput = 30.81 IPC
[Figure: the best configuration, a 6x6 mesh of routers (R) with four memory controllers. Each router connects through a network interface (NI) to a bus-based cluster of five cores with L1 caches (C+L1), their L2 caches, a shared L3 and a directory (Dir).]
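As a quick check of the area budget, the best configuration can be costed with the case-study parameters (1.25 mm² per core, 1 mm² per Mb of memory, 350 mm² chip). This is a hedged sketch: it assumes private L1 and L2 caches per core and ignores the area of routers, buses, directories and memory controllers. `config_area_mm2` is an illustrative name.

```python
def config_area_mm2(cores, core_area, l1_mb, l2_mb, l3_total_mb, mm2_per_mb=1.0):
    """Area of cores plus caches (assumed model; interconnect area is not counted)."""
    per_core = core_area + (l1_mb + l2_mb) * mm2_per_mb  # core with private L1 and L2
    return cores * per_core + l3_total_mb * mm2_per_mb


if __name__ == "__main__":
    # 180 cores, 64K L1 (1/16 Mb), 256K L2 (1/4 Mb), 68 Mb of shared L3 in total.
    area = config_area_mm2(cores=180, core_area=1.25, l1_mb=0.0625, l2_mb=0.25,
                           l3_total_mb=68.0)
    print(f"{area} mm2 used of a 350 mm2 budget")
```

The total is 180 × 1.5625 + 68 = 349.25 mm², so under these assumptions the configuration fills the 350 mm² budget almost exactly.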
Runtime: Modeling vs Simulation
[Figure: runtime (seconds, logarithmic axis from 0.001 to 1000) versus number of components (cores + memories) in the CMP (0 to 1100), for analytical modeling and simulation]
Modeling a CMP with ~700 components in 1 second