Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Richard C. Murphy
Sandia National Laboratories
DOE Institute for Advanced [email protected]
March 11, 2009
Can We Continue to Build Supercomputers Out of Processors Optimized for Laptops?
The Memory Wall
• Historically:
– Processor cycle time decreased FASTER than memory access latency
• Technologically, this may not hold as cycle times go flat
• Definitely holds as memory hierarchies become more complex
– Multicore exacerbates this problem
• Enhancements to “compute” are not as effective as decreasing memory latency for problems that don’t fit in cache!
[Figure: DFS IPC (y-axis: IPC, 0 to 2.5) for Base, Branch, and FU (Processor Enhancements) and Half Latency and Low Latency (Memory Enhancements), each shown Without Prefetching and With Prefetching]
Motivating Example
It is singular how soon we lose the impression of what ceases to be constantly before us.
- Lord Byron
Consider THE Textbook 5-Stage Pipeline
[Figure: pipeline stages IF, REG, EX, MEM, WB]
• The Stages Are:
– IF: Instruction Fetch
– REG: Register File Read
– EX: Execute
– MEM: Memory Access
– WB: Write Back (to register file)
• Real pipelines are more complex
– We will assume each stage takes one clock cycle
Add An L1 Cache
[Figure: the 5-stage pipeline with separate L1 instruction and L1 data caches]
An L2 Cache
[Figure: the pipeline with L1 instruction and L1 data caches, backed by a unified L2 cache]
And Main Memory
[Figure: the pipeline with L1 instruction and L1 data caches, a unified L2 cache, and main memory]
Now we have a computer (circa 1995)
This looks a lot like the basic building block for today’s machines, but they have
- Many of these
- Deeper pipelines
- More complex memory hierarchies
But latency’s complicated...
[Figure: the pipeline and memory hierarchy annotated with access latencies]
- Main Memory: O(100ns) - 200 clocks at 2GHz
- Unified L2: O(10ns) - 20 clocks
- L1 Inst. / L1 Data: O(2ns) - 4 clocks
- Pipeline stage: O(<<1ns) - 1 clock
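To see how these latencies combine, here is a minimal sketch of an average memory access time calculation. It reads the slide's figures as L1 = 4 clocks, L2 = 20 clocks, and main memory = 200 clocks, and it assumes a simple serial lookup in which each miss adds the next level's latency. The hit rates are hypothetical inputs, not measurements from the talk.

```c
#include <assert.h>

/* Latencies in clocks at 2 GHz, as read from the slide (an interpretation). */
enum { L1_CLOCKS = 4, L2_CLOCKS = 20, MEM_CLOCKS = 200 };

/* Average clocks per memory access for hypothetical L1 and L2 hit rates.
 * Assumes a miss at one level simply adds the latency of the next level. */
double average_access_clocks(double l1_hit_rate, double l2_hit_rate) {
    assert(l1_hit_rate >= 0.0 && l1_hit_rate <= 1.0);
    assert(l2_hit_rate >= 0.0 && l2_hit_rate <= 1.0);
    double l1_miss = 1.0 - l1_hit_rate;
    double l2_miss = 1.0 - l2_hit_rate;
    return L1_CLOCKS + l1_miss * (L2_CLOCKS + l2_miss * MEM_CLOCKS);
}
```

Under this model an access that always hits in L1 costs 4 clocks, while one that misses both caches costs 224 clocks, which is the regime the rest of the talk explores.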
Example Code - Sparse Matrix Vector Multiply

    void matvec(int nnz, int n, double A_values[], int A_indices[], int A_offsets[], double x[], double y[]) {
      // Computes y = A*x: A is a sparse square matrix of dimension n w/ nnz nonzeros,
      // x is a vector of known values and y (on exit) contains the result of A*x.
      // nnz - Number of nonzero entries in sparse matrix A.
      // n - Dimension of A, x and y.
      // A_values - Nonzero matrix values stored contiguously row by row.
      // A_indices - Column indices with matrix entries stored in A_values.
      // A_offsets - Offsets of each row into A_values and A_indices.
      // x - Input vector.
      // y - Output vector.
      int jstart = 0;
      int jstop = A_offsets[0];
      for (int i=0; i<n; ++i) {
        jstart = jstop;
        jstop = A_offsets[i+1];
        double sum = 0.0;
        for (int j=jstart; j<jstop; ++j)
          sum += A_values[j]*x[A_indices[j]];
        y[i] = sum;
      }
      return;
    }

Thanks to Mike Heroux, SNL for making this a real example
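As a usage sketch, the routine can be exercised on a small matrix stored in the row-by-row layout its comments describe. The 3x3 matrix, the vector, and the helper name `example_matvec` are made up for illustration. `matvec` is repeated here, with the parameter spelled `A_offsets` throughout and one added sanity check, so that the block compiles on its own.

```c
#include <assert.h>

/* y = A*x for a sparse matrix stored row by row, following the slide. */
void matvec(int nnz, int n, double A_values[], int A_indices[],
            int A_offsets[], double x[], double y[]) {
    /* Added check (not on the slide): the offsets must span all nonzeros. */
    assert(A_offsets[n] - A_offsets[0] == nnz);
    int jstop = A_offsets[0];
    for (int i = 0; i < n; ++i) {
        int jstart = jstop;
        jstop = A_offsets[i + 1];
        double sum = 0.0;
        for (int j = jstart; j < jstop; ++j)
            sum += A_values[j] * x[A_indices[j]];
        y[i] = sum;
    }
}

/* Hypothetical example data:
 *     | 2 0 1 |       | 1 |
 * A = | 0 3 0 |,  x = | 2 |
 *     | 4 0 5 |       | 3 |
 */
void example_matvec(double y[3]) {
    double A_values[]  = {2.0, 1.0, 3.0, 4.0, 5.0};
    int    A_indices[] = {0, 2, 1, 0, 2};
    int    A_offsets[] = {0, 2, 3, 5};   /* row i spans [offsets[i], offsets[i+1]) */
    double x[]         = {1.0, 2.0, 3.0};
    matvec(5, 3, A_values, A_indices, A_offsets, x, y);
}
```

Calling `example_matvec(y)` fills `y` with 5, 6, and 19, matching the product worked out by hand.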
Example Code - Sparse Matrix Vector Multiply
Consider just the inner loop

    for (int j=jstart; j<jstop; ++j)
      sum += A_values[j]*x[A_indices[j]];

Pseudo-Assembly
sum += A_values[j]*x[A_indices[j]];
• Assume 32-bit address space
• Registers
– Base Address of A_values in $A_values
– Base Address of A_indices in $A_indices
– Base Address of x in $x
• Basic Values
– sum in $sum
– j in $j
• Temporary values labeled $t0...$tn
• Temporary Addresses labeled $a0...$an

    ; compute j*4 for pointer math
    slli $j_tmp, $j, 2            ; j*4
    ; $t0 <= A_values[j]
    add  $a0, $A_values, $j_tmp
    load $t0, 0($a0)
    ; $t1 <= A_indices[j]
    add  $a1, $A_indices, $j_tmp
    load $t1, 0($a1)
    slli $t1, $t1, 2              ; $t1*4
    ; $t2 <= x[$t1]
    add  $a2, $x, $t1
    load $t2, 0($a2)
    ; $t3 <= $t0 * $t2
    mulf $t3, $t0, $t2
    ; add to sum
    addf $sum, $sum, $t3

Observations
• Real Floating Point Apps don’t do many FLOPs
– 2 FLOPs (mulf, addf)
– 5 integer ops to compute addresses (slli, add, add, slli, add)
– 3 memory instructions (loads)
• More if you have to write back sum before the end of the loop
• Real apps at Sandia
– Usually < 10% FP
– 40-50% mem
– 30-40% integer
– 10% branch
[Figure: Mean Instruction Mix (y-axis: percent, 0 to 100) for Sandia FP, SPEC FP, Sandia Int, and SPEC INT, broken down into Integer ALU, FP, Branch, Load, and Store]
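The inner-loop counts above can be turned into percentages with a few lines of arithmetic. This is only a sketch: the struct and function names are made up for illustration, and the counts cover the ten instructions of the pseudo-assembly listing, leaving out the loop branch and the index increment.

```c
#include <assert.h>

/* Instruction counts for one pass through the inner-loop listing. */
struct op_mix {
    int flops;    /* mulf, addf */
    int int_ops;  /* slli, add, add, slli, add */
    int mem_ops;  /* three loads */
};

/* Counts taken from the Observations slide. */
struct op_mix inner_loop_mix(void) {
    struct op_mix m = {2, 5, 3};
    return m;
}

/* Share of `count` in the whole mix, as a percentage from 0 to 100. */
double percent_of_mix(struct op_mix m, int count) {
    int total = m.flops + m.int_ops + m.mem_ops;
    assert(total > 0);
    return 100.0 * count / total;
}
```

For the listing this gives 20% floating point, 50% integer, and 30% memory, so even this kernel spends most of its instructions on address arithmetic and loads rather than FLOPs.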
Observations (continued)
• Arun Rodrigues points out that we’ve talked a lot about floating point “resilience”, and the potential ability to live with less accuracy, but your state bits are equally likely to be in error:
– Inaccurate integer address calculations would be bad (40% of instructions)
– Inaccurate branching would be worse (10% of instructions)
• if(my_fp_calculation_was_in_error)...
– I don’t even know what an inaccurate load or store would be (40% of instructions), but it would be bad
– Inaccurate instructions could really cause havoc (100% of instructions)
• As a computer architect, the idea of giving the wrong answer to improve performance makes me feel dirty
Observations (continued)
• If the arrays accessed with stride one are large enough, those accesses likely miss both caches and go to memory
– A_values[j]
– A_indices[j]
• x[A_indices[j]] introduces a data-dependent memory reference
– A_indices[j] is required before the x value can be loaded
– Exhibits some temporal and spatial locality
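The dependence can be made explicit in C by splitting the index load from the load it feeds. The sketch below also issues a software prefetch for the `x` element a few iterations ahead. That is a common technique for indirect accesses but not something the slides propose. `__builtin_prefetch` is a GCC and Clang builtin, and the function name and `PREFETCH_DISTANCE` value are hypothetical.

```c
#include <assert.h>

#ifndef PREFETCH_DISTANCE
#define PREFETCH_DISTANCE 8   /* hypothetical value; would need tuning per machine */
#endif

/* Dot product of one sparse row with x, for nonzeros jstart..jstop-1. */
double row_dot_prefetch(int jstart, int jstop, const double A_values[],
                        const int A_indices[], const double x[]) {
    assert(jstart <= jstop);
    double sum = 0.0;
    for (int j = jstart; j < jstop; ++j) {
#if defined(__GNUC__)
        /* A_indices is read with stride one, so looking ahead in it is cheap;
         * the hint asks the hardware to start fetching the x element early. */
        if (j + PREFETCH_DISTANCE < jstop)
            __builtin_prefetch(&x[A_indices[j + PREFETCH_DISTANCE]], 0, 1);
#endif
        int col = A_indices[j];        /* index load: stride one */
        sum += A_values[j] * x[col];   /* dependent load: address needs col */
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same as the plain loop. Whether it helps depends on the matrix structure and the machine.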
Cycle 1
[Figure: the pipeline and memory hierarchy (L1 Inst., L1 Data, Unified L2, Main Memory) stepping through the inner-loop pseudo-assembly below; the same diagram and listing accompany every cycle slide]

    ; compute j*4 for pointer math
    slli $j_tmp, $j, 2            ; j*4
    ; $t0 <= A_values[j]
    add  $a0, $A_values, $j_tmp
    load $t0, 0($a0)
    ; $t1 <= A_indices[j]
    add  $a1, $A_indices, $j_tmp
    load $t1, 0($a1)
    slli $t1, $t1, 2              ; $t1*4
    ; $t2 <= x[$t1]
    add  $a2, $x, $t1
    load $t2, 0($a2)
    ; $t3 <= $t0 * $t2
    mulf $t3, $t0, $t2
    ; add to sum
    addf $sum, $sum, $t3

Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
load $t0, 0($a0) will MISS to memory!
Cycle 7
load (1)
Cycle 8
load (2)
load $t1, 0($a1) will MISS to memory!
Cycle 9
slli depends on A_indices[j]
load (3)
load (1) / slli pending
Cycle 9
add depends on A_indices[j]
load (4)
load (2) / slli / add pending
Cycle 10
load depends on A_indices[j]
load (3) / slli / add / load pending
load (1)
Cycle 11
mulf depends on both loads!
load (4) / slli / add / load pending
Next Iteration (with perfect branch prediction)
load (2)
Cycle 12
addf depends on the mulf
load (3)
load (1) / slli / add / load / mulf / addf pending
Next Iteration (with perfect branch prediction)
Cycle 13
addf depends on the mulf
load (4)
load (2) / slli / add / load / mulf / addf pending
Next Iteration (with perfect branch prediction)
Cycle 29
load (1)
load (18) / slli / add / load / mulf / addf pending
Finally, the first load misses to memory
Cycle 30
load (2)
load (19) / slli / add / load / mulf / addf pending
Finally, the first load misses to memory
Cycle 31
load (3)
load (1) / slli / add / load / mulf / addf pending
Finally, the first load misses to memory
Cycle 229
load (197) / slli / add / load / mulf / addf pending
The first load returns to finish executing
Cycle 230
load (198) / slli / add / load / mulf / addf pending
The first load returns to finish executing
Cycle 231
load (199) / slli / add / load / mulf / addf pending
The first load returns to finish executing
Cycle 232
add / load / mulf / addf pending
The second load completes
Cycle 233
load / mulf / addf pending
Finally, the first load is retired!
Cycle 234
mulf / addf pending
The second load retires...
Cycle 235
mulf / addf pending
Now we issue the data-dependent load
Cycle 236
load (1) / mulf / addf pending
And wait...
Cycle 239
load (1) / mulf / addf pending
And wait... and wait...
Cycle 258
load (1) / mulf / addf pending
And wait... and wait... and wait...
Cycle 457
mulf / addf pending
Cycle 458
addf pending
The mulf is ~4 cycles
![Page 44: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/44.jpg)
Cycle 461
[Diagram: IF REG EX MEM WB pipeline; L1 Inst., L1 Data, Unified L2, Main Memory]
; compute j*4 for pointer math
slli $j_tmp, $j, 2 ; j*4
; $t0 <= A_values[j]
add $a0, $A_values, $j_tmp
load $t0, 0($a0)
; $t1 <= A_indices[j]
add $a1, $A_indices, $j_tmp
load $t1, 0($a1)
slli $t1, $t1, 2 ; $t1*4
; $t2 <= x[$t1]
add $a2, $x, $t1
load $t2, 0($a2)
; $t3 <= $t0 * $t2
mulf $t3, $t0, $t2
; add to sum
addf $sum, $sum, $t3
The addf is ~2 cycles
![Page 45: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/45.jpg)
Cycle 462
[Diagram: IF REG EX MEM WB pipeline; L1 Inst., L1 Data, Unified L2, Main Memory]
; compute j*4 for pointer math
slli $j_tmp, $j, 2 ; j*4
; $t0 <= A_values[j]
add $a0, $A_values, $j_tmp
load $t0, 0($a0)
; $t1 <= A_indices[j]
add $a1, $A_indices, $j_tmp
load $t1, 0($a1)
slli $t1, $t1, 2 ; $t1*4
; $t2 <= x[$t1]
add $a2, $x, $t1
load $t2, 0($a2)
; $t3 <= $t0 * $t2
mulf $t3, $t0, $t2
; add to sum
addf $sum, $sum, $t3
The addf is ~2 cycles
![Page 46: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/46.jpg)
Cycle 463
[Diagram: IF REG EX MEM WB pipeline; L1 Inst., L1 Data, Unified L2, Main Memory]
; compute j*4 for pointer math
slli $j_tmp, $j, 2 ; j*4
; $t0 <= A_values[j]
add $a0, $A_values, $j_tmp
load $t0, 0($a0)
; $t1 <= A_indices[j]
add $a1, $A_indices, $j_tmp
load $t1, 0($a1)
slli $t1, $t1, 2 ; $t1*4
; $t2 <= x[$t1]
add $a2, $x, $t1
load $t2, 0($a2)
; $t3 <= $t0 * $t2
mulf $t3, $t0, $t2
; add to sum
addf $sum, $sum, $t3
![Page 47: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/47.jpg)
Cycle 464
[Diagram: IF REG EX MEM WB pipeline; L1 Inst., L1 Data, Unified L2, Main Memory]
; compute j*4 for pointer math
slli $j_tmp, $j, 2 ; j*4
; $t0 <= A_values[j]
add $a0, $A_values, $j_tmp
load $t0, 0($a0)
; $t1 <= A_indices[j]
add $a1, $A_indices, $j_tmp
load $t1, 0($a1)
slli $t1, $t1, 2 ; $t1*4
; $t2 <= x[$t1]
add $a2, $x, $t1
load $t2, 0($a2)
; $t3 <= $t0 * $t2
mulf $t3, $t0, $t2
; add to sum
addf $sum, $sum, $t3
![Page 48: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/48.jpg)
Cycle 465
[Diagram: IF REG EX MEM WB pipeline; L1 Inst., L1 Data, Unified L2, Main Memory]
; compute j*4 for pointer math
slli $j_tmp, $j, 2 ; j*4
; $t0 <= A_values[j]
add $a0, $A_values, $j_tmp
load $t0, 0($a0)
; $t1 <= A_indices[j]
add $a1, $A_indices, $j_tmp
load $t1, 0($a1)
slli $t1, $t1, 2 ; $t1*4
; $t2 <= x[$t1]
add $a2, $x, $t1
load $t2, 0($a2)
; $t3 <= $t0 * $t2
mulf $t3, $t0, $t2
; add to sum
addf $sum, $sum, $t3
DONE!
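For orientation, the ten instructions traced above appear to be one iteration of a sparse dot-product inner loop. The following is a minimal Python sketch of that loop, assuming the names in the assembly comments (`A_values`, `A_indices`, `x`, `sum`) map onto plain arrays and an accumulator; the function name and sample data are illustrative and not from the slides.

```python
def sparse_row_dot(a_values, a_indices, x):
    """Accumulate a_values[j] * x[a_indices[j]] over the nonzeros of one row."""
    total = 0.0
    for j in range(len(a_values)):
        t0 = a_values[j]    # load $t0, 0($a0)
        t1 = a_indices[j]   # load $t1, 0($a1)
        t2 = x[t1]          # load $t2, 0($a2): address depends on the previous load
        total += t0 * t2    # mulf, then addf
    return total


# Illustrative data only: one row with three nonzeros.
row_values = [2.0, 0.5, 4.0]
row_indices = [0, 3, 1]
vector = [1.0, 10.0, 100.0, 8.0]
result = sparse_row_dot(row_values, row_indices, vector)
```

The third load cannot even compute its address until the index load returns, which is why the arithmetic at the end of the iteration spends most of its time waiting.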
![Page 49: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/49.jpg)
Observations
• Completing 10 instructions took 465 cycles!
– Nearly 1/4 of a microsecond of real time at 2 GHz
• The FP multiplication took only 2 ns
• Yes, I can unroll the loop and pipeline multiple iterations
– However, only so many loads can be outstanding at a time
• I haven’t talked about multicore/cache coherency yet
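The arithmetic behind these figures can be checked with a short sketch. It assumes only the 2 GHz clock, the 465-cycle total, and the ~4-cycle multiply quoted on the slides; the helper name `cycles_to_ns` is made up for illustration.

```python
CLOCK_HZ = 2.0e9  # 2 GHz, as quoted on the slide


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles * 1e9 / clock_hz


total_ns = cycles_to_ns(465)   # whole 10-instruction iteration
mulf_ns = cycles_to_ns(4)      # the floating-point multiply alone
mulf_share = 4 / 465           # fraction of the iteration spent multiplying
```

Under these numbers the iteration takes about 232.5 ns, the multiply about 2 ns, and the multiply accounts for under 1% of the elapsed time.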
![Page 50: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/50.jpg)
What about a cache coherent model?
[Diagram: one core (IF REG EX MEM WB pipeline; L1 Inst., L1 Data, Unified L2) attached to Main Memory]
![Page 51: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/51.jpg)
What about a cache coherent model?
[Diagram: two cores, each with an IF REG EX MEM WB pipeline, L1 Inst., L1 Data, and Unified L2; Core 0’s Memory and Core 1’s Memory; 25 ns on-chip network between the cores]
load $t2, 0($a2)
0($a2)’s home node is Core 0’s memory; 0($a2) is being modified at Core 1
![Page 52: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/52.jpg)
Cycle 5
[Diagram: two cores, each with an IF REG EX MEM WB pipeline, L1 Inst., L1 Data, and Unified L2; Core 0’s Memory and Core 1’s Memory; 25 ns on-chip network between the cores]
load $t2, 0($a2)
![Page 53: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/53.jpg)
[Diagram: two cores, each with an IF REG EX MEM WB pipeline, L1 Inst., L1 Data, and Unified L2; Core 0’s Memory and Core 1’s Memory; 25 ns on-chip network between the cores]
load $t2, 0($a2)
Cycle 6
![Page 54: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/54.jpg)
[Diagram: two cores, each with an IF REG EX MEM WB pipeline, L1 Inst., L1 Data, and Unified L2; Core 0’s Memory and Core 1’s Memory; 25 ns on-chip network between the cores]
0($a2) is owned by Core 0 but ... Core 1 is modifying it
load $t2, 0($a2)
Cycle 25
![Page 55: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/55.jpg)
[Diagram: two cores, each with an IF REG EX MEM WB pipeline, L1 Inst., L1 Data, and Unified L2; Core 0’s Memory and Core 1’s Memory; 25 ns on-chip network between the cores]
0($a2) is owned by Core 0 but ... Core 1 is modifying it
load $t2, 0($a2)
Cycle 25 (continued)
Send Invalidate Request
![Page 56: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/56.jpg)
[Diagram: two cores, each with an IF REG EX MEM WB pipeline, L1 Inst., L1 Data, and Unified L2; Core 0’s Memory and Core 1’s Memory; 25 ns on-chip network between the cores]
0($a2) is owned by Core 0 but ... Core 1 is modifying it
load $t2, 0($a2)
Cycle 75
20 clock cycle L2 access
![Page 57: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/57.jpg)
[Diagram: two cores, each with an IF REG EX MEM WB pipeline, L1 Inst., L1 Data, and Unified L2; Core 0’s Memory and Core 1’s Memory; 25 ns on-chip network between the cores]
0($a2) is owned by Core 0 but ... Core 1 is modifying it
load $t2, 0($a2)
Cycle 95
Flush the modified line back to the owner
![Page 58: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/58.jpg)
[Diagram: two cores, each with an IF REG EX MEM WB pipeline, L1 Inst., L1 Data, and Unified L2; Core 0’s Memory and Core 1’s Memory; 25 ns on-chip network between the cores]
0($a2) is owned by Core 0 but ... Core 1 is modifying it
load $t2, 0($a2)
Cycle 145
Flush the modified line back to the owner
![Page 59: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/59.jpg)
[Diagram: two cores, each with an IF REG EX MEM WB pipeline, L1 Inst., L1 Data, and Unified L2; Core 0’s Memory and Core 1’s Memory; 25 ns on-chip network between the cores]
load $t2, 0($a2)
Cycle 146
![Page 60: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/60.jpg)
[Diagram: two cores, each with an IF REG EX MEM WB pipeline, L1 Inst., L1 Data, and Unified L2; Core 0’s Memory and Core 1’s Memory; 25 ns on-chip network between the cores]
load $t2, 0($a2)
Cycle 147
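The cycle numbers in this walkthrough (5, 25, 75, 95, 145) can be reproduced with a small sketch. It assumes the 2 GHz clock from the single-core example, so the 25 ns network hop costs 50 cycles, and uses the 20-cycle L2 access shown on the slides. Reading the first 20 cycles as the requesting core's own L2 lookup is an interpretation, not something the slides state.

```python
CLOCK_GHZ = 2.0         # assumed: same 2 GHz clock as the single-core example
NETWORK_NS = 25         # one-way on-chip network latency from the diagram
L2_ACCESS_CYCLES = 20   # "20 clock cycle L2 access"

network_cycles = round(NETWORK_NS * CLOCK_GHZ)  # 25 ns at 2 GHz

# (description, cost in cycles); the first description is an assumption.
steps = [
    ("requesting core looks up its own L2", L2_ACCESS_CYCLES),
    ("invalidate request crosses the network", network_cycles),
    ("other core accesses its L2", L2_ACCESS_CYCLES),
    ("modified line is flushed back", network_cycles),
]


def build_timeline(start_cycle, steps):
    """Return the cycle number at the start and after each step."""
    cycle = start_cycle
    marks = [cycle]
    for _description, cost in steps:
        cycle += cost
        marks.append(cycle)
    return marks


timeline = build_timeline(5, steps)
coherence_latency = timeline[-1] - timeline[0]
```

With these assumptions the marks land on the same cycles as the slides, and the span from cycle 5 to cycle 145 is 140 cycles.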
![Page 61: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/61.jpg)
Observations
• The “memory” latency in this example is 140 clock cycles
– “Closer” than DRAM
– Optimistic because it was serviced by L2 cache
• Latency is “additive”
– Each level of hierarchy serves “hits” faster but misses are slower
– If I do a “read” of a line owned by another node:
  • 10 ns L2 cache miss
  • 25 ns request over the on-chip network
  • 10 ns L2 cache miss by the owner
  • 100 ns memory read by the owner
  • 25 ns response over the on-chip network
  • TOTAL: 170 ns!
• On-chip network latency will increase as the number of cores increases (not linearly, but...)
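The "additive" point can be made concrete by summing the stages of the remote read listed above. This sketch uses only the per-stage figures from the slide; the conversion to cycles assumes the 2 GHz clock from the earlier example, and the stage labels are paraphrased.

```python
# Stage latencies in nanoseconds, as listed on the slide.
remote_read_ns = [
    ("L2 cache miss at the requester", 10),
    ("request over the on-chip network", 25),
    ("L2 cache miss by the owner", 10),
    ("memory read by the owner", 100),
    ("response over the on-chip network", 25),
]

total_ns = sum(cost for _stage, cost in remote_read_ns)

CLOCK_GHZ = 2.0  # assumed clock rate, carried over from the earlier example
total_cycles = total_ns * CLOCK_GHZ

# Share of the total spent on the two network crossings.
network_ns = sum(cost for stage, cost in remote_read_ns if "network" in stage)
network_share = network_ns / total_ns
```

The network crossings alone contribute 50 of the 170 ns, so any growth in on-chip network latency feeds directly into every remote miss.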
![Page 62: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/62.jpg)
What do I do about it?
• Concurrency:
– More cores may make the memory bus busier
– Vendors are putting fewer channels/memory controllers per core in each generation
– Decreasing on a per-core basis!
• Latency
– Relatively increasing or flat
• How do you fill those cycles?
– Traditionally: Out-of-Order Execution
– Today: More Threads
Little’s Law: Concurrency / Latency = Throughput
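Little's Law can be turned around to ask how many memory requests must be in flight to sustain a target bandwidth. The sketch below is illustrative only: the 100 ns latency echoes the memory read figure used earlier, while the 64-byte line size, the 10 GB/s target, and the limit of 8 outstanding loads are assumptions chosen for the example.

```python
LINE_BYTES = 64              # assumed cache-line size
LATENCY_NS = 100.0           # memory latency, echoing the earlier 100 ns read
TARGET_BYTES_PER_NS = 10.0   # assumed target: 10 GB/s is 10 bytes per ns


def required_concurrency(throughput_per_ns, latency_ns):
    """Little's Law rearranged: concurrency = throughput * latency."""
    return throughput_per_ns * latency_ns


def achievable_throughput(concurrency, latency_ns):
    """Little's Law as on the slide: throughput = concurrency / latency."""
    return concurrency / latency_ns


lines_per_ns = TARGET_BYTES_PER_NS / LINE_BYTES
outstanding_lines = required_concurrency(lines_per_ns, LATENCY_NS)

# With a hypothetical limit of 8 outstanding loads, bandwidth is capped at:
capped_bytes_per_ns = achievable_throughput(8, LATENCY_NS) * LINE_BYTES
```

Under these assumptions roughly 16 line requests must be outstanding at all times to reach the target, and a core limited to 8 outstanding loads tops out near 5 bytes per ns regardless of how fast the memory bus is.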
![Page 63: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/63.jpg)
VLSI Designer’s View of the World
![Page 64: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/64.jpg)
VLSI Designer’s View of the World
Cores
![Page 65: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/65.jpg)
VLSI Designer’s View of the World
Cores
Cache
![Page 66: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/66.jpg)
VLSI Designer’s View of the World
Cores
Cache
Core-to-Core Communication
![Page 67: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/67.jpg)
VLSI Designer’s View of the World
Cores
Cache
Core-to-Core Communication
Off-chip I/O
![Page 68: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/68.jpg)
Application Impact
[Figure: two plots of IPC (axis 0 to 1.5) against Relative Latency and Relative Bandwidth, each swept over .25, .5, 1.0, 2.0, 4.0. Physics Applications: “Average Sandia FP Latency and Bandwidth vs. Performance”. Informatics Applications: “Average Sandia Int Latency and Bandwidth vs. Performance”.]
Informatics Apps ~3X more sensitive to latency than Physics
![Page 69: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/69.jpg)
Application Observations
• The commodity path has produced power-inefficient architectures
– We observe IPCs of ~0.3 at 2.5 GHz
– With an IPC of 1, the same system could be clocked at 750 MHz
  • Doing so would decrease power requirements superlinearly
• Address generation is a bigger problem than FLOPS
• We can decrease latency or increase effective bandwidth
– Many apps throw away 7/8ths of a cache line
– Increase effective bandwidth by: scatter/gather to improve spatial locality
– Decrease latency by: tighter integration, advanced packaging, etc.
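Two of these figures follow from simple arithmetic, sketched below. The IPC and clock values come from the slide; the 64-byte line, the 8-byte useful word, and the peak bandwidth number are assumptions picked to be consistent with "7/8ths of a cache line" being discarded.

```python
# Instruction throughput: IPC times clock rate.
observed_ipc = 0.3
observed_clock_ghz = 2.5
instructions_per_ns = observed_ipc * observed_clock_ghz

# Clock needed to deliver the same throughput at an IPC of 1.
equivalent_clock_mhz = instructions_per_ns / 1.0 * 1000

# Effective bandwidth when only part of each fetched line is used.
LINE_BYTES = 64             # assumed cache-line size
USED_BYTES = 8              # assumed: one 8-byte word used per line
PEAK_BYTES_PER_NS = 10.0    # hypothetical peak memory bandwidth

wasted_fraction = 1 - USED_BYTES / LINE_BYTES
effective_bytes_per_ns = PEAK_BYTES_PER_NS * USED_BYTES / LINE_BYTES
```

With these inputs the equivalent clock comes out at 750 MHz, and an application that uses one word per line sees one eighth of the peak bandwidth, which is the gap that scatter/gather aims to close.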
![Page 70: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/70.jpg)
What I didn’t talk about
• Memory capacity/core
– Currently declining, probably not a good thing
– Fewer pins/core devoted to memory
– Slow bus interface (point-to-point serial solves the problem)
– Inefficient protocols (e.g., FBDIMM)
– No virtualization in the memory system
– Horrible materials (FR-4) for boards
![Page 71: Can We Continue to Build Supercomputers Out of ...Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of](https://reader030.vdocuments.site/reader030/viewer/2022040401/5e77286be2e59b1141039b05/html5/thumbnails/71.jpg)
Conclusions
| What we talk about in a procurement | What we should talk about |
| --- | --- |
| FLOPS | Anything but FLOPS: Memory Latency Dominates; Integer Operations More Frequent; Generally, “data movement” |
| Memory Bandwidth | Effective Bandwidth; Latency; Concurrency in the memory system |
| Power | Data Movement Efficiency |
| “Peaks” | Sustained |