load value approximation - stacs lab · 2019-10-25
TRANSCRIPT
Load Value Approximation
Joshua San Miguel
Mario Badr
Natalie Enright Jerger
Accessing Memory
processor core
L1 cache
shared caches, directory, network-on-chip
main memory
An L1 cache miss sends the request beyond the core, to the shared caches and main memory.
Accessing memory costs 10x to 100x more latency and energy than accessing the L1 cache!
Higher efficiency via Approximate Computing…
Approximate Computing
Not all computations need to be precise.
http://www.zentut.com/
http://www.businessweek.com/
http://www.cc.gatech.edu/~cnieto6/
http://www.analyticbridge.com/
http://themusicparlour.blogspot.ca/
http://www.scientific-computing.com/
• Data mining
• Computer vision
• Audio and video processing
• Gaming
• Machine learning
• Dynamical simulation
[Figure: execution time, energy, error]
Many applications can tolerate approximate data. 40% to nearly 100% of the data footprint is approximate [Sampson, MICRO 2013].
Approximate value locality: Many data values are similar to, or can be approximated from, previously seen values.
Outline
• Load Value Approximation
• Non-Speculative Operation
• Approximator Design
• Relaxed Confidence Windows
• Approximation Degree
• Methodology
• Evaluation
Load Value Approximation
processor core
L1 cache
approximator
shared caches, directory, network-on-chip
main memory
• Load A misses in the L1 cache.
• The approximator generates A_approx.
• The processor core continues with A_approx. No speculation, no rollbacks.
• A_actual is fetched from main memory.
• The approximator trains with A_actual.
Learns past values. Estimates future values. Improves performance and saves energy.
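The miss-handling sequence on these slides can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design: `LastValueApproximator`, `load`, and the dictionary-based `l1` and `memory` are hypothetical names, and the stand-in approximator simply returns the last value it was trained with. The fetch of A_actual is modeled as an ordinary dictionary lookup, although in hardware it would complete later, off the critical path.

```python
class LastValueApproximator:
    """Hypothetical stand-in: approximates a load with the last trained value."""

    def __init__(self):
        self.last = None

    def approximate(self):
        # Returns None until at least one actual value has been seen.
        return self.last

    def train(self, actual):
        self.last = actual


def load(addr, l1, memory, approximator, log):
    """Return the value the core works with for a load of addr."""
    if addr in l1:
        return l1[addr]                 # L1 hit: precise value, no approximation

    approx = approximator.approximate()

    # Fetch A_actual to fill the cache and train the approximator.
    # (Modeled synchronously here; the core does not wait for it in hardware.)
    actual = memory[addr]
    l1[addr] = actual
    approximator.train(actual)

    if approx is None:
        log.append("stall")             # nothing to approximate with yet
        return actual

    # The core keeps A_approx. It is never compared against A_actual to
    # trigger a rollback, which is what makes the scheme non-speculative.
    log.append("approx")
    return approx
```

Calling `load` twice on different missing addresses returns the precise value the first time (the approximator is untrained) and the previously trained value the second time, while the cache still receives the actual data.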
Approximator Design
• global history buffer: 1.0 2.2 3.1
• hash ℎ of the global history buffer and the instruction address, which indexes the approximator table
• approximator table entry: tag, conf, degree, LHB
• local history buffer (LHB): 4.1 3.9 4.0
• function 𝑓 over the local history buffer, which produces the approximation
Timeline:
• load miss A
• table index: PC ⊕ 1.0 ⊕ 2.2 ⊕ 3.1
• (4.1 + 3.9 + 4.0) / 3, so A_approx = 4.0
• do_work(A_approx) proceeds while request(A_actual) is outstanding
• A_actual = 4.2 arrives
• global history buffer becomes 2.2 3.1 4.2
• local history buffer becomes 3.9 4.0 4.2
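The worked example above (XOR the instruction address with the global history to pick a table entry, then average that entry's local history) can be reproduced with a small Python sketch. Several details are assumptions made for illustration and do not come from the slides: the buffer depth of 3 (as drawn), the table size, hashing the raw 32-bit float patterns, and the names `Approximator`, `float_bits`, `HISTORY` and `TABLE_SIZE`. The tag, conf and degree fields are omitted.

```python
import struct
from collections import deque

HISTORY = 3        # assumed: each buffer holds 3 values, as drawn on the slides
TABLE_SIZE = 256   # assumed table size, chosen only for illustration


def float_bits(value):
    """Return the 32-bit IEEE-754 pattern of value as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


class Approximator:
    def __init__(self):
        self.ghb = deque(maxlen=HISTORY)   # global history buffer
        self.table = {}                    # index -> local history buffer

    def _index(self, pc):
        # h: XOR the instruction address with every value in the GHB.
        h = pc
        for value in self.ghb:
            h ^= float_bits(value)
        return h % TABLE_SIZE

    def approximate(self, pc):
        """Return (index, A_approx); A_approx is None if there is no history."""
        index = self._index(pc)
        lhb = self.table.get(index)
        # f: average of the values in the local history buffer.
        approx = sum(lhb) / len(lhb) if lhb else None
        return index, approx

    def train(self, index, actual):
        """Push A_actual into the entry chosen at approximation time and the GHB."""
        self.table.setdefault(index, deque(maxlen=HISTORY)).append(actual)
        self.ghb.append(actual)
```

Seeding the global history with 1.0, 2.2, 3.1 and the selected entry with 4.1, 3.9, 4.0 gives an approximation of about 4.0. Training with 4.2 then shifts both buffers the way the last slide of the sequence shows. The index is remembered from approximation time because the global history has changed by the time A_actual arrives.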
Approximator Design – Other Considerations
• Floating-point precision
• History buffer sizes
• Stale values
More details in paper.
Approximator Design
• Relaxed Confidence Windows: How do we avoid making bad approximations? Trade off performance against error.
• Approximation Degree: Do we need to fetch the actual value from memory every time? Trade off energy against error.
Relaxed Confidence Windows
• global history buffer: 1.0 2.2 3.1
• local history buffer: 4.1 3.9 4.0
• load miss A: A_approx = 4.0, do_work(A_approx) proceeds, request(A_actual) is sent
• A_actual = 9.0!
Relaxed Confidence Windows
Approximator table entry: tag, conf, degree, LHB
When approximating:
    if conf >= 0: use A_approx
    else: don’t use A_approx
When updating:
    if A_approx, A_actual differ by <= CONF_WINDOW%: conf++
    else: conf--
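The confidence rule above can be written as a runnable Python sketch. The slide does not say how the percentage difference is measured or how wide the counter is, so two choices here are assumptions: the difference is taken relative to A_actual, and conf is a signed counter saturating at `CONF_MIN` and `CONF_MAX`. The function names are hypothetical.

```python
import math

CONF_MIN, CONF_MAX = -8, 7   # assumed saturation bounds (4-bit signed counter)


def should_approximate(conf):
    """When approximating: use A_approx only if conf >= 0."""
    return conf >= 0


def within_window(approx, actual, window_pct):
    """True if approx and actual differ by at most window_pct percent.

    The difference is measured relative to actual (an assumption).
    """
    if math.isinf(window_pct):
        return True                      # "infinite" window: always confident
    return abs(approx - actual) <= abs(actual) * window_pct / 100.0


def update_conf(conf, approx, actual, window_pct):
    """When updating: conf++ inside the window, conf-- outside, saturating."""
    if within_window(approx, actual, window_pct):
        return min(conf + 1, CONF_MAX)
    return max(conf - 1, CONF_MIN)
```

With the values from the earlier slides, an approximation of 4.0 against an actual value of 9.0 falls outside a 10% window and lowers conf, while 4.0 against 4.2 falls inside it and raises conf. A window of 0% only rewards exact matches, which corresponds to a traditional value predictor's notion of a correct prediction.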
Relaxed Confidence Windows – Output Error
[Figure: output error (0% to 100%) when varying CONF_WINDOW% over 0%, 5%, 10%, 20% and infinite]
Relaxed Confidence Windows – L1-D MPKI
[Figure: normalized L1-D MPKI (0.0 to 1.0) when varying CONF_WINDOW% over 0%, 5%, 10%, 20% and infinite]
Approximation Degree
• global history buffer: 1.0 2.2 3.1
• local history buffer: 4.1 3.9 4.0
• load miss A: A_approx = 4.0, do_work(A_approx) proceeds, request(A_actual) is sent
• A_actual = 4.0, the same value as A_approx
Approximation Degree
Approximator table entry: tag, conf, degree, LHB
When approximating:
    if degree == APPROX_DEGREE: fetch A_actual
    else: don’t fetch A_actual
When updating:
    if degree == APPROX_DEGREE: degree = 0
    else: degree++
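A small Python sketch of the degree counter follows. It assumes the "approximating" check and the "updating" step both happen once per approximated load miss, which the slide does not state explicitly. The helper names `on_approximated_miss` and `count_fetches` are hypothetical.

```python
def on_approximated_miss(degree, approx_degree):
    """Return (fetch_actual, new_degree) for one approximated load miss."""
    # When approximating: fetch A_actual only once the counter hits the limit.
    fetch_actual = degree == approx_degree
    # When updating: reset after a fetch, otherwise count one more skipped fetch.
    new_degree = 0 if fetch_actual else degree + 1
    return fetch_actual, new_degree


def count_fetches(num_misses, approx_degree):
    """Count how many of num_misses approximated misses fetch A_actual."""
    degree = 0
    fetches = 0
    for _ in range(num_misses):
        fetch_actual, degree = on_approximated_miss(degree, approx_degree)
        fetches += fetch_actual
    return fetches
```

Under this reading, an APPROX_DEGREE of 0 fetches A_actual on every miss, and an APPROX_DEGREE of 4 fetches on one miss out of every five, which is where the energy savings would come from. Each skipped fetch also means the approximator is not trained on that miss, which is the error side of the trade-off.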
Approximation Degree – Output Error
[Figure: output error (0% to 100%) when varying APPROX_DEGREE over 0, 1, 2, 4, 8 and 16]
Approximation Degree – L1-D Fetches
[Figure: normalized L1-D fetches (0 to 1) when varying APPROX_DEGREE over 0, 1, 2, 4, 8 and 16]
Methodology
• Multi-threaded approximate applications: PARSEC benchmark suite [Bienia, Princeton 2011]; programmer annotations and ISA extensions [Esmaeilzadeh, ASPLOS 2012]
• Approximator design space exploration: Pin dynamic binary instrumentation tool [Luk, PLDI 2005]
• Full-system simulation: FeS2 cycle-level x86 simulator [Neelakantam, ASPLOS 2008]
• Approximator, cache and memory energy consumption: CACTI modeling tool [Thoziyoor, HP 2008]
Evaluation
[Figure: application speedup and energy savings (0% to 16%) for APPROX_DEGREE of 0, 4 and 16]
• Up to 28% speedup
• Up to 44% energy savings
• Reduces L1-D MPKI by 30% over traditional value predictor and prefetcher.
Conclusion
Load Value Approximation
• Approximate Value Locality
• Non-Speculative
• Relaxed Confidence Windows
• Approximation Degree
↑ performance
↓ energy consumption
low output error
[Figure: baseline (precise) compared with load value approximation]