load value approximation - stacs lab · 2019-10-25 · relaxed confidence windows 34 ℎ i...

Load Value Approximation

Joshua San Miguel

Mario Badr

Natalie Enright Jerger

Accessing Memory

2

shared caches, directory, network-on-chip

main memory

processor core

L1 cache

Accessing Memory

3


main memory

miss

processor core

L1 cache

Accessing Memory

4


main memory

Accessing memory is 10x – 100x greater latency and energy than accessing L1 cache!

processor core

L1 cache miss

Accessing Memory

5


main memory

Accessing memory is 10x – 100x greater latency and energy than accessing L1 cache!

Higher efficiency via Approximate Computing…

processor core

L1 cache miss

Approximate Computing

Not all computations need to be precise.

6

http://www.zentut.com/

http://www.businessweek.com/

http://www.cc.gatech.edu/~cnieto6/

http://www.analyticbridge.com/

http://themusicparlour.blogspot.ca/

http://www.scientific-computing.com/

Data mining

Computer vision Audio and video processing

Gaming Machine learning Dynamical simulation


7

execution time

energy


8

execution time

energy error


9

execution time

energy error


Many applications can tolerate approximate data. 40% to nearly 100% of data footprint is approximate

[Sampson, MICRO 2013].

10


Many applications can tolerate approximate data. 40% to nearly 100% of data footprint is approximate

[Sampson, MICRO 2013].

11

Approximate value locality: Many data values are similar to or can be approximated from

previously seen values.

Outline

• Load Value Approximation

• Non-Speculative Operation

• Approximator Design

• Relaxed Confidence Windows

• Approximation Degree

• Methodology

• Evaluation

12


13

processor core

L1 cache


main memory


14

processor core

L1 cache


main memory

approximator


15

processor core

L1 cache


main memory

approximator load miss A

A?


16

processor core

L1 cache


main memory

approximator generate A_approx

A?


17

processor core

L1 cache


main memory

approximator

A_approx


18

processor core

L1 cache


main memory

approximator

A_approx

No speculation, no rollbacks.


19

processor core

L1 cache


main memory

approximator

A_approx


20

processor core

L1 cache


A_approx

main memory

fetch A_actual

approximator


21

processor core

L1 cache


main memory

A_approx

approximator train with A_actual


22

processor core

L1 cache


main memory

approximator

Learns past values. Estimates future values. Improves performance and saves energy.

A_approx

Approximator Design

23

ℎ , instruction

address

global history buffer

tag conf degree LHB

approximator table

𝑓

local history buffer

Approximator Design

24

ℎ , 1.0 2.2 3.1 instruction

address


tag conf degree LHB

approximator table

𝑓


4.1 3.9 4.0

time

Approximator Design

25


address


tag conf degree LHB

approximator table

𝑓


4.1 3.9 4.0

load miss A time

Approximator Design

26


address


tag conf degree LHB

approximator table

𝑓


4.1 3.9 4.0

load miss A time

PC ⊕ 1.0 ⊕ 2.2 ⊕ 3.1

Approximator Design

27


address


tag conf degree LHB

approximator table

𝑓


4.1 3.9 4.0

load miss A time

PC ⊕ 1.0 ⊕ 2.2 ⊕ 3.1

(4.1 + 3.9 + 4.0) / 3

A_approx = 4.0

Approximator Design

28


address


tag conf degree LHB

approximator table

𝑓


4.1 3.9 4.0

load miss A do_work(A_approx) time

Approximator Design

29


address


tag conf degree LHB

approximator table

𝑓


4.1 3.9 4.0

load miss A do_work(A_approx)

request(A_actual)

time

Approximator Design

30


address


tag conf degree LHB

approximator table

𝑓


4.1 3.9 4.0


request(A_actual) A_actual = 4.2

time

Approximator Design

31

ℎ , instruction

address


tag conf degree LHB

approximator table

𝑓



request(A_actual) A_actual = 4.2

time

2.2 3.1 4.2

3.9 4.0 4.2

Approximator Design – Other Considerations

32

• Floating-point precision

• History buffer sizes

• Stale values

More details in paper.

Approximator Design

33

Relaxed Confidence Windows How do we avoid making bad approximations?

Trade-off performance and error.

Approximation Degree Do we need to fetch the actual value from memory every time?

Trade-off energy and error.

Relaxed Confidence Windows

34


address


tag conf degree LHB

approximator table

𝑓


4.1 3.9 4.0


request(A_actual)

time

A_approx = 4.0


35


address


tag conf degree LHB

approximator table

𝑓


4.1 3.9 4.0


request(A_actual)

time

A_approx = 4.0

A_actual = 9.0!


36

tag conf degree LHB

When approximating:

if conf >= 0: use A_approx

else: don’t use A_approx

When updating:

if A_approx, A_actual differ by <= CONF_WINDOW%: conf++

else: conf--

Relaxed Confidence Windows – Output Error

37

0%

20%

40%

60%

80%

100%

ou

tpu

t er

ror

0% 5% 10% 20% infinite

Varying CONF_WINDOW%:

Relaxed Confidence Windows – L1-D MPKI

38

Varying CONF_WINDOW%:

0.0

0.2

0.4

0.6

0.8

1.0

0% 5% 10% 20% infinite

no

rmal

ized

L1

-D M

PK

I

CONF_WINDOW%

Approximator Design

39

Relaxed Confidence Windows How do we avoid making bad approximations?

Trade-off performance and error.

Approximation Degree Do we need to fetch the actual value from memory every time?

Trade-off energy and error.

Approximation Degree

40


address


tag conf degree LHB

approximator table

𝑓


4.1 3.9 4.0


request(A_actual)

time

A_approx = 4.0


41


address


tag conf degree LHB

approximator table

𝑓


4.1 3.9 4.0


request(A_actual)

time

A_approx = 4.0

A_actual = 4.0


42


address


tag conf degree LHB

approximator table

𝑓


4.1 3.9 4.0


request(A_actual)

time

A_approx = 4.0

A_actual = 4.0


43

tag conf degree LHB

When approximating:

if degree == APPROX_DEGREE: fetch A_actual

else: don’t fetch A_actual

When updating:

if degree == APPROX_DEGREE: degree = 0

else: degree++

Approximation Degree – Output Error

44

Varying APPROX_DEGREE:

0%

20%

40%

60%

80%

100%

ou

tpu

t er

ror

0 1 2 4 8 16

Approximation Degree – L1-D Fetches

45

Varying APPROX_DEGREE:

0

0.2

0.4

0.6

0.8

1

0 1 2 4 8 16

no

rmal

ized

L1

-D f

etch

es

APPROX_DEGREE

Methodology

46

Multi-threaded approximate applications PARSEC benchmark suite [Bienia, Princeton 2011]

Programmer annotations and ISA extensions [Esmaeilzadeh, ASPLOS 2012]

Approximator design space exploration Pin dynamic binary instrumentation tool [Luk, PLDI 2005]

Full-system simulation FeS2 cycle-level x86 simulator [Neelakantam, ASPLOS 2008]

Approximator, cache and memory energy consumption CACTI modeling tool [Thoziyoor, HP 2008]

Evaluation

47

0%

2%

4%

6%

8%

10%

12%

14%

16%

0 4 16

APPROX_DEGREE

application speedup energy savings

Evaluation

48

0%

2%

4%

6%

8%

10%

12%

14%

16%

0 4 16

APPROX_DEGREE


Up to 28% speedup

Evaluation

49

0%

2%

4%

6%

8%

10%

12%

14%

16%

0 4 16

APPROX_DEGREE


Up to 28% speedup

Up to 44% energy savings

Evaluation

50

0%

2%

4%

6%

8%

10%

12%

14%

16%

0 4 16

APPROX_DEGREE


Reduces L1-D MPKI by 30% over traditional value predictor and prefetcher.

Up to 28% speedup

Up to 44% energy savings

Conclusion

51


• Approximate Value Locality

• Non-Speculative

• Relaxed Confidence Windows

• Approximation Degree

↑ performance

↓ energy consumption

low output error

Conclusion

52

baseline (precise) load value approximation

load value approximation - stacs lab · 2019-10-25 · relaxed confidence windows 34 ℎ i...

Documents