
Page 1: Self-Learning, Adaptive Computer Systems

Self-Learning, Adaptive Computer Systems

Intel Collaborative Research Institute, Computational Intelligence

Yoav Etsion, Technion CS & EE

Dan Tsafrir, Technion CS

Shie Mannor, Technion EE

Assaf Schuster, Technion CS

Page 2: Self-Learning, Adaptive Computer Systems

Adaptive Computer Systems

• The complexity of computer systems keeps growing
• We are moving towards heterogeneous hardware
• Workloads are getting more diverse
• Process variability affects the performance/power of different parts of the system
• Human programmers and administrators cannot handle this complexity
• The goal: adapt to workload and hardware variability


Page 3: Self-Learning, Adaptive Computer Systems

Predicting System Behavior

• When a human observes the workload, she can typically identify cause and effect
• The workload carries inherent semantics; the problem is extracting them automatically
• Key issues with machine learning:
  • Huge datasets (performance counters; execution traces)
  • Extremely fast response times are needed (in most cases)
  • Rigid space constraints on the ML algorithms


Page 4: Self-Learning, Adaptive Computer Systems

Memory + Machine Learning: Current State of the Art

• Architectures are tuned for structured data
• Memory is managed using simple heuristics:
  • Spatial and temporal locality
  • Frequency and recency (ARC)
  • Block and stride prefetchers
• Real data is not well structured
  • The programmer must transform the data
  • This is unrealistic for program-agnostic management (swapping, prefetching)


Page 5: Self-Learning, Adaptive Computer Systems

Memory + Machine Learning: Multiple Learning Opportunities

• Identify patterns using machine learning
• Bring data to the right place at the right time
• The memory hierarchy forms a pyramid: caches / DRAM, PCM / SSD, HDD
• Different levels require different learning strategies:
  • Top: smaller, faster, costlier [prefetching into caches]
  • Bottom: bigger, slower, cheaper [fetching from disk]
• Both hardware and software support are needed


Page 6: Self-Learning, Adaptive Computer Systems

Research track:

Predicting Latent Faults in Data Centers


Moshe Gabel, Assaf Schuster

Page 7: Self-Learning, Adaptive Computer Systems

Latent Fault Detection

• Failures and misconfiguration happen in large datacenters, causing performance anomalies
• A sound statistical framework to detect latent faults
• Practical: non-intrusive, unsupervised, no domain knowledge required
• Adaptive: no parameter tuning, robust to system/workload changes
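The DSN 2012 paper cited on the next slide supplies the actual statistical machinery; as a rough illustration of the peer-comparison intuition behind unsupervised detection (machines serving the same workload should behave alike, so a machine that persistently deviates from its peers is suspect), here is a minimal C++ sketch. The scoring formula and all names are illustrative assumptions, not the paper's test.

#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative peer-comparison score (an assumption, not the paper's test):
// how far machine m's counter vector lies from the rest of the fleet, in
// standard deviations. counters[i][c] = counter c on machine i, sampled over
// the same time window and pre-scaled per counter.
double outlierScore(const std::vector<std::vector<double>>& counters,
                    std::size_t m) {
    const std::size_t M = counters.size();     // machines
    const std::size_t C = counters[0].size();  // counters per machine
    double score = 0.0;
    for (std::size_t c = 0; c < C; ++c) {
        double mean = 0.0;
        for (std::size_t i = 0; i < M; ++i) mean += counters[i][c];
        mean /= double(M);
        double var = 0.0;
        for (std::size_t i = 0; i < M; ++i) {
            const double d = counters[i][c] - mean;
            var += d * d;
        }
        var /= double(M);
        const double dev = counters[m][c] - mean;
        if (var > 0.0) score += dev * dev / var;  // squared z-score vs. peers
    }
    return std::sqrt(score / double(C));          // RMS z-score across counters
}

A machine whose score stays high across many windows while service-level health checks still pass is a latent-fault suspect; requiring persistence rather than flagging a single window is what makes such a scheme robust to transient workload spikes.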

Page 8: Self-Learning, Adaptive Computer Systems

Latent Fault Detection

• Applied to a real-world production service of 4,500 machines
• Over 20% of machine/software failures were preceded by latent faults
  • Slow response times; network errors; disk access times
• Predicted failures 14 days in advance, with 70% precision and a 2% false-positive rate
• "Latent Fault Detection in Large Scale Services", DSN 2012

Page 9: Self-Learning, Adaptive Computer Systems

Research track:

Task Differentials: Dynamic, inter-thread predictions using memory access footsteps


Adi Fuchs, Yoav Etsion, Shie Mannor, Uri Weiser

Page 10: Self-Learning, Adaptive Computer Systems

Motivation

• We are in the age of parallel computing
• Programming paradigms are shifting towards task-level parallelism
• Tasks are supported by libraries such as TBB and OpenMP
• Implicit forms of task-level parallelism include GPU kernels and parallel loops
• Task behavior tends to be highly regular, making tasks a target for learning and adaptation

...
GridLauncher<InitDensitiesAndForcesMTWorker> &id =
    *new (tbb::task::allocate_root())
        GridLauncher<InitDensitiesAndForcesMTWorker>(NUM_TBB_GRIDS);
tbb::task::spawn_root_and_wait(id);

GridLauncher<ComputeDensitiesMTWorker> &cd =
    *new (tbb::task::allocate_root())
        GridLauncher<ComputeDensitiesMTWorker>(NUM_TBB_GRIDS);
tbb::task::spawn_root_and_wait(cd);
...

Taken from the PARSEC fluidanimate TBB implementation.

[Figure: the code above alternates parallel sections (groups of tasks) with synchronization points]

Page 11: Self-Learning, Adaptive Computer Systems

How do things currently work?

• The programmer codes a parallel loop
• Software maps multiple tasks onto one thread
• Hardware sees a single sequence of instructions
• Hardware prefetchers try to identify patterns between consecutive memory accesses
• There is no notion of program semantics, i.e., that the execution consists of a sequence of tasks, not instructions

[Figure: individual tasks (A, B, C, ...) are serialized by software into one instruction stream as seen by hardware]

Page 12: Self-Learning, Adaptive Computer Systems

Task Address Set

Given the memory trace of task instance A, the task address set T_A is the set of unique addresses, ordered by first access time:


Trace:
START TASK INSTANCE(A)
R 0x7f27bd6df8
R 0x61e630
R 0x6949cc
R 0x7f77b02010
R 0x6949cc
R 0x61e6d0
R 0x61e6e0
W 0x7f77b02010
STOP TASK INSTANCE(A)

T_A:
0x7f27bd6df8
0x61e630
0x6949cc
0x7f77b02010
0x61e6d0
0x61e6e0

$T_A = \langle a_1, a_2, \ldots, a_n \rangle$

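Computing a task address set from such a trace is straightforward; below is a minimal C++ sketch (the Access record is a hypothetical stand-in for the START/R/W/STOP trace events above): keep each address the first time it appears, preserving first-access order.

#include <cstdint>
#include <unordered_set>
#include <vector>

// Build the task address set T_A from one task instance's memory trace:
// each address appears once, ordered by its first access.
struct Access { char op; std::uint64_t addr; };  // op: 'R' or 'W'

std::vector<std::uint64_t> taskAddressSet(const std::vector<Access>& trace) {
    std::vector<std::uint64_t> result;
    std::unordered_set<std::uint64_t> seen;
    for (const Access& a : trace)
        if (seen.insert(a.addr).second)  // true only on the first occurrence
            result.push_back(a.addr);
    return result;
}

Run on the trace above, this yields exactly the six-entry T_A shown: the repeated accesses to 0x6949cc and 0x7f77b02010 collapse into their first occurrences.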

Page 13: Self-Learning, Adaptive Computer Systems

Address Differentials

Motivation: the address sets of individual task instances are usually meaningless on their own


The same differential vector maps T_A to T_B, and T_B to T_C (addresses in hex, differentials in decimal):

T_A          T_B          T_C          differential
7F27BD6DF8   7F27BD6DF8   7F27BD6DF8   +0
61E630       DBFA10       1560DF0      +8000480
6949CC       6A1D0C       6AF04C       +54080
7F77B02010   7F7835F23A   7F78BBC464   +8770090
61E6D0       61E898       61EA60       +456
61E6E0       61DFD0       61D8C0       -1808

Differences tend to be compact and regular, and can thus represent state transitions.

Page 14: Self-Learning, Adaptive Computer Systems

Address Differentials

Given instances A and B, the differential vector is defined as follows:

$D_{AB} = \langle d_1, \ldots, d_n \rangle, \quad d_i = b_i - a_i \ \text{for each}\ 1 \le i \le |T_A|$

Example (addresses in hex, differentials in decimal):

T_A: 10000  60000  8000000  7F00000  FE000
T_B: 10020  60060  8000008  7F00040  FE060

D_AB = <32, 96, 8, 64, 96>
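Continuing the sketch from page 12, the differential vector is one element-wise subtraction; this assumes, as in the slide's examples, that the two instances have address sets of equal length.

#include <cstddef>
#include <cstdint>
#include <vector>

// Differential vector D_AB between two equally long task address sets:
// d_i = b_i - a_i. The result is signed, since differentials can be
// negative (e.g. the -1808 entry on page 13).
std::vector<std::int64_t> differential(const std::vector<std::uint64_t>& ta,
                                       const std::vector<std::uint64_t>& tb) {
    std::vector<std::int64_t> d(ta.size());
    for (std::size_t i = 0; i < ta.size(); ++i)
        d[i] = static_cast<std::int64_t>(tb[i]) -
               static_cast<std::int64_t>(ta[i]);
    return d;
}

Applied to the two address sets above, this returns exactly <32, 96, 8, 64, 96>.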

Page 15: Self-Learning, Adaptive Computer Systems

Differentials Behavior: Mathematical Intuition

• Differentials are beneficial in cases of high redundancy
• An application's distribution functions provide intuition about vector repetition
• A non-uniform CDF implies highly regular patterns
• A uniform CDF implies noisy patterns (differential behavior cannot be exploited)

[Figure: example CDFs of differential vectors, one non-uniform and one uniform]

Page 16: Self-Learning, Adaptive Computer Systems

Differentials Behavior: Mathematical Intuition

Given N distinct vectors, a straightforward dictionary encoding requires R = log2(N) bits per vector. The entropy H is a theoretical lower bound on the representation, based on the distribution:

$H = -\sum_{k=1}^{N} p(k) \log_2 p(k)$

Example: assume 1000 vector instances over 4 possible values, so R = log2(4) = 2:

Differential value         #instances   p
(20, 8000, 720, 100050)    700          0.70
(16, 8040, -96, 50)        150          0.15
(0, 0, 14420, 100)         50           0.05
(0, 0, 720, 100050)        100          0.10

$H = -(0.7 \log_2 0.7 + 0.15 \log_2 0.15 + 0.05 \log_2 0.05 + 0.1 \log_2 0.1) \approx 1.31$

The Differential Entropy Compression Ratio (DECR) is used as the repetition criterion:

Benchmark                Suite     Implementation   Differential representation (bits)   Differential entropy (bits)   DECR (%)
FFT.128M                 BOTS      OpenMP           19.4                                  14.4                          25.5
NQUEENS.N=12             BOTS      OpenMP           11.8                                  8.4                           28.7
SORT.8M                  BOTS      OpenMP           16.4                                  16.3                          0.1
SGEFA.500x500            LINPACK   OpenMP           14.1                                  0.9                           93.6
FLUIDANIMATE.SIMSMALL    PARSEC    TBB              16.4                                  8.0                           51.3
SWAPTIONS.SIMSMALL       PARSEC    TBB              17.9                                  13.1                          26.6
STREAMCLUSTER.SIMSMALL   PARSEC    TBB              19.6                                  8.9                           54.4

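The DECR column matches DECR = (1 - H/R) x 100% on every row (e.g. SGEFA: 1 - 0.9/14.1 = 93.6%), so that is presumably its definition. A small C++ check of the worked example above, under that assumption:

#include <cmath>
#include <cstdio>
#include <map>
#include <vector>

// Entropy of a differential-vector histogram, and the compression ratio
// DECR = 1 - H/R implied by the table above (an inferred definition).
double entropyBits(const std::map<std::vector<long>, long>& hist, long total) {
    double h = 0.0;
    for (const auto& kv : hist) {
        const double p = double(kv.second) / double(total);
        h -= p * std::log2(p);
    }
    return h;
}

int main() {
    // The slide's example: 1000 instances over 4 distinct differential vectors.
    const std::map<std::vector<long>, long> hist = {
        {{20, 8000, 720, 100050}, 700},
        {{16, 8040, -96, 50},     150},
        {{0, 0, 14420, 100},       50},
        {{0, 0, 720, 100050},     100},
    };
    const double H = entropyBits(hist, 1000);         // ~1.31 bits
    const double R = std::log2(double(hist.size()));  // log2(4) = 2 bits
    std::printf("H=%.2f R=%.2f DECR=%.1f%%\n", H, R, 100.0 * (1.0 - H / R));
}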

Page 17: Self-Learning, Adaptive Computer Systems

Possible differential application: cache-line prefetching

First attempt: a prefix-based predictor. Given a differential prefix, predict the suffix.
Example: A and B have finished running (D_AB is stored), and now C is running...

Stored differential: D_AB = <0, 8000480, 54080, 8770090, 456, -1808>

T_A: 7F27BD6DF8  61E630   6949CC   7F77B02010   61E6D0  61E6E0
T_B: 7F27BD6DF8  DBFA10   6A1D0C   7F7835F23A   61E898  61DFD0
T_C: 7F27BD6DF8  1560DF0  6AF04C?  7F78BBC464?  61EA60?  61D8C0?

C's first two differentials (0, 8000480) match the stored prefix, so the remaining suffix <54080, 8770090, 456, -1808> is applied to T_B's entries to predict the rest of T_C (predicted values marked with "?").
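A minimal C++ sketch of this first attempt, using the types from the earlier sketches (a linear scan stands in for the prefix tree mentioned on page 19; all names are illustrative): once the differentials observed so far match exactly one stored vector, the rest of that vector is the predicted suffix.

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative prefix-based predictor: store the differential vectors of
// completed task instances; when a running task's observed differentials
// match exactly one stored vector, predict that vector's remaining suffix.
struct PrefixPredictor {
    std::vector<std::vector<std::int64_t>> db;  // completed differentials

    void record(std::vector<std::int64_t> d) { db.push_back(std::move(d)); }

    // Returns the unique matching stored vector, or nullptr if the prefix is
    // ambiguous or unseen. Entries [prefix.size()..) are the predicted suffix.
    const std::vector<std::int64_t>* predict(
            const std::vector<std::int64_t>& prefix) const {
        const std::vector<std::int64_t>* match = nullptr;
        for (const auto& d : db) {
            if (d.size() >= prefix.size() &&
                std::equal(prefix.begin(), prefix.end(), d.begin())) {
                if (match) return nullptr;  // more than one candidate
                match = &d;
            }
        }
        return match;
    }
};

Adding each predicted differential to the corresponding address of the previous instance then yields the prefetch addresses, exactly as T_C's remainder was derived from T_B above.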

Page 18: Self-Learning, Adaptive Computer Systems

Possible differential application: cache-line prefetching

Second attempt: a PHT predictor. Based on the last X differentials, predict the next differential.
Example: the stream of differential vectors below consists of V1 = <32, 96, 8, 64, 96> and V2 = <10, 16, 0, 16, 32> in the repeating pattern V1 V1 V2:

32 96 8 64 96 | 32 96 8 64 96 | 10 16 0 16 32 | 32 96 8 64 96 | 32 96 8 64 96 | 10 16 0 16 32 | 32 96 8 64 96 | 32 96 8 64 96

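A matching C++ sketch of the second attempt, keying a history table by the last X complete differential vectors (X = 2 is an arbitrary choice for illustration):

#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <vector>

// Illustrative PHT-style predictor: learn which differential follows each
// pattern of the last X differentials, and consult the table at task start.
struct PhtPredictor {
    static constexpr std::size_t X = 2;  // history depth (arbitrary here)
    std::deque<std::vector<std::int64_t>> history;  // last X differentials
    std::map<std::deque<std::vector<std::int64_t>>,
             std::vector<std::int64_t>> table;

    // Called when a task instance completes and its differential is known.
    void record(const std::vector<std::int64_t>& d) {
        if (history.size() == X) table[history] = d;  // learn: history -> next
        history.push_back(d);
        if (history.size() > X) history.pop_front();
    }

    // Called at task start: predict the next differential, if the current
    // history pattern has been seen before.
    const std::vector<std::int64_t>* predict() const {
        auto it = table.find(history);
        return it == table.end() ? nullptr : &it->second;
    }
};

On the repeating V1 V1 V2 stream above, the table learns (V1,V1)->V2, (V1,V2)->V1, and (V2,V1)->V1 within the first few tasks, after which every subsequent task's differential is predicted at task start.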

Page 19: Self-Learning, Adaptive Computer Systems

Possible differential application: cache-line prefetching

Prefix policy: the differential DB is a prefix tree; a prediction is made once the observed differential prefix becomes unique.
PHT policy: the differential DB holds the history table; a prediction is made at task start, based on the history pattern.

[Figure: block diagram. Executing CPUs send start-task/stop-task events and new memory requests to the differential logic, which tracks past and current task addresses, stores new differentials in the differential DB, and issues prefetch addresses to the caching hierarchy.]

Page 20: Self-Learning, Adaptive Computer Systems

Possible differential application: cache-line prefetching

The predictors are compared against two models: Base (no prefetching) and Ideal (a theoretical predictor that accurately predicts every repeating differential).

[Figure: misses per 1K instructions for Base, Prefix, PHT, and Ideal; left panel: NQUEENS.N=12, SWAPTIONS, FLUIDANIMATE, SGEFA.500 (0-6 MPKI scale); right panel: STREAMCLUSTER, FFT.128M, SORT.8M (0-70 MPKI scale)]

Cache miss elimination (%):

Benchmark       Prefix   PHT    Ideal
NQUEENS.N=12    19.4     11.4   62.1
SWAPTIONS       18.3     0.1    49.2
FLUIDANIMATE    14.9     26.0   46.0
SGEFA.500       0.0      97.6   99.9
STREAMCLUSTER   21.7     36.5   82.3
FFT.128M        45.0     -1.0   87.9
SORT.8M         3.3      0.0    0.1

Page 21: Self-Learning, Adaptive Computer Systems

Future Work

• Hybrid policies: which policy should be used when? (PHT is better for complete vector repetitions; prefix is better for partial vector repetitions, i.e., repeated suffixes)
• A regular-expression-based policy (pattern matching beyond the "ideal" model)
• Predicting other functional features using differentials (e.g., branch prediction, PTE prefetching)

Page 22: Self-Learning, Adaptive Computer Systems

Conclusions (so far...)

• When we look at the data, patterns emerge
• There is a large headroom for optimizing computer systems
• Existing predictions are based on heuristics:
  • A machine that does not respond within 1s is considered dead
  • Memory prefetchers look for block and stride accesses
• The goal: use machine learning, not heuristics, to uncover behavioral semantics