
Page 1: Self-Learning, Adaptive Computer Systems

Self-Learning, Adaptive Computer Systems

Intel Collaborative Research Institute, Computational Intelligence

Yoav Etsion, Technion CS & EE

Dan Tsafrir, Technion CS

Shie Mannor, Technion EE

Assaf Schuster, Technion CS

Page 2: Self-Learning, Adaptive Computer Systems

Adaptive Computer Systems

• The complexity of computer systems keeps growing
• We are moving towards heterogeneous hardware
• Workloads are getting more diverse
• Process variability affects the performance/power of different parts of the system
• Human programmers and administrators cannot handle this complexity
• The goal: adapt to workload and hardware variability


Page 3: Self-Learning, Adaptive Computer Systems

Predicting System Behavior

• When a human observes the workload, she can typically identify cause and effect
• The workload carries inherent semantics; the problem is extracting them automatically
• Key issues with machine learning:
  • Huge datasets (performance counters; execution traces)
  • Extremely fast response times are needed (in most cases)
  • Rigid space constraints on the ML algorithms


Page 4: Self-Learning, Adaptive Computer Systems

Memory + Machine Learning: Current State of the Art

• Architectures are tuned for structured data
• Memory is managed using simple heuristics:
  • Spatial and temporal locality
  • Frequency and recency (ARC)
  • Block and stride prefetchers
• Real data is not well structured
  • The programmer must transform the data
  • This is unrealistic for program-agnostic management (swapping, prefetching)


Page 5: Self-Learning, Adaptive Computer Systems

Memory + Machine Learning: Multiple Learning Opportunities

• Identify patterns using machine learning
• Bring data to the right place at the right time
• The memory hierarchy forms a pyramid: caches / DRAM, PCM / SSD, HDD
• Different levels require different learning strategies:
  • Top: smaller, faster, costlier [prefetching into caches]
  • Bottom: bigger, slower, cheaper [fetching from disk]
• Both hardware and software support are needed


Page 6: Self-Learning, Adaptive Computer Systems

Research track:

Predicting Latent Faults in Data Centers


Moshe Gabel, Assaf Schuster

Page 7: Self-Learning, Adaptive Computer Systems

Latent Fault Detection

• Failures and misconfiguration happen in large datacenters, causing performance anomalies
• A sound statistical framework to detect latent faults
• Practical: non-intrusive, unsupervised, no domain knowledge required
• Adaptive: no parameter tuning, robust to system/workload changes
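The DSN 2012 paper cited on the next slide supplies the actual statistical machinery; as a rough illustration of the peer-comparison intuition behind unsupervised detection (machines serving the same workload should behave alike, so a machine that persistently deviates from its peers is suspect), here is a minimal C++ sketch. The scoring formula and all names are illustrative assumptions, not the paper's test.

#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative peer-comparison score (an assumption, not the paper's test):
// how far machine m's counter vector lies from the rest of the fleet, in
// standard deviations. counters[i][c] = counter c on machine i, sampled over
// the same time window and pre-scaled per counter.
double outlierScore(const std::vector<std::vector<double>>& counters,
                    std::size_t m) {
    const std::size_t M = counters.size();     // machines
    const std::size_t C = counters[0].size();  // counters per machine
    double score = 0.0;
    for (std::size_t c = 0; c < C; ++c) {
        double mean = 0.0;
        for (std::size_t i = 0; i < M; ++i) mean += counters[i][c];
        mean /= double(M);
        double var = 0.0;
        for (std::size_t i = 0; i < M; ++i) {
            const double d = counters[i][c] - mean;
            var += d * d;
        }
        var /= double(M);
        const double dev = counters[m][c] - mean;
        if (var > 0.0) score += dev * dev / var;  // squared z-score vs. peers
    }
    return std::sqrt(score / double(C));          // RMS z-score across counters
}

A machine whose score stays high across many windows while service-level health checks still pass is a latent-fault suspect; requiring persistence rather than flagging a single window is what makes such a scheme robust to transient workload spikes.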

Page 8: Self-Learning, Adaptive Computer Systems

Latent Fault Detection

• Applied to a real-world production service of 4,500 machines
• Over 20% of machine/software failures were preceded by latent faults
  • Slow response times; network errors; disk access times
• Predicted failures 14 days in advance, with 70% precision and a 2% false-positive rate
• "Latent Fault Detection in Large Scale Services", DSN 2012

Page 9: Self-Learning, Adaptive Computer Systems

Research track:

Task Differentials: Dynamic, inter-thread predictions using memory access footsteps


Adi Fuchs, Yoav Etsion, Shie Mannor, Uri Weiser

Page 10: Self-Learning, Adaptive Computer Systems

Motivation

• We are in the age of parallel computing
• Programming paradigms are shifting towards task-level parallelism
• Tasks are supported by libraries such as TBB and OpenMP
• Implicit forms of task-level parallelism include GPU kernels and parallel loops
• Task behavior tends to be highly regular, making tasks a target for learning and adaptation

...
GridLauncher<InitDensitiesAndForcesMTWorker> &id =
    *new (tbb::task::allocate_root())
        GridLauncher<InitDensitiesAndForcesMTWorker>(NUM_TBB_GRIDS);
tbb::task::spawn_root_and_wait(id);

GridLauncher<ComputeDensitiesMTWorker> &cd =
    *new (tbb::task::allocate_root())
        GridLauncher<ComputeDensitiesMTWorker>(NUM_TBB_GRIDS);
tbb::task::spawn_root_and_wait(cd);
...

Taken from the PARSEC fluidanimate TBB implementation.

[Figure: the code above alternates parallel sections (groups of tasks) with synchronization points]

Page 11: Self-Learning, Adaptive Computer Systems

How do things currently work?

• The programmer codes a parallel loop
• Software maps multiple tasks onto one thread
• Hardware sees a single sequence of instructions
• Hardware prefetchers try to identify patterns between consecutive memory accesses
• There is no notion of program semantics, i.e., that the execution consists of a sequence of tasks, not instructions

[Figure: individual tasks (A, B, C, ...) are serialized by software into one instruction stream as seen by hardware]

Page 12: Self-Learning, Adaptive Computer Systems

Task Address Set

Given the memory trace of task instance A, the task address set T_A is the set of unique addresses, ordered by first access time:


Trace:
START TASK INSTANCE(A)
R 0x7f27bd6df8
R 0x61e630
R 0x6949cc
R 0x7f77b02010
R 0x6949cc
R 0x61e6d0
R 0x61e6e0
W 0x7f77b02010
STOP TASK INSTANCE(A)

T_A:
0x7f27bd6df8
0x61e630
0x6949cc
0x7f77b02010
0x61e6d0
0x61e6e0

$T_A = \langle a_1, a_2, \ldots, a_n \rangle$

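Computing a task address set from such a trace is straightforward; below is a minimal C++ sketch (the Access record is a hypothetical stand-in for the START/R/W/STOP trace events above): keep each address the first time it appears, preserving first-access order.

#include <cstdint>
#include <unordered_set>
#include <vector>

// Build the task address set T_A from one task instance's memory trace:
// each address appears once, ordered by its first access.
struct Access { char op; std::uint64_t addr; };  // op: 'R' or 'W'

std::vector<std::uint64_t> taskAddressSet(const std::vector<Access>& trace) {
    std::vector<std::uint64_t> result;
    std::unordered_set<std::uint64_t> seen;
    for (const Access& a : trace)
        if (seen.insert(a.addr).second)  // true only on the first occurrence
            result.push_back(a.addr);
    return result;
}

Run on the trace above, this yields exactly the six-entry T_A shown: the repeated accesses to 0x6949cc and 0x7f77b02010 collapse into their first occurrences.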

Page 13: Self-Learning, Adaptive Computer Systems

Address Differentials

Motivation: the address sets of individual task instances are usually meaningless on their own


The same differential vector maps T_A to T_B, and T_B to T_C (addresses in hex, differentials in decimal):

T_A          T_B          T_C          differential
7F27BD6DF8   7F27BD6DF8   7F27BD6DF8   +0
61E630       DBFA10       1560DF0      +8000480
6949CC       6A1D0C       6AF04C       +54080
7F77B02010   7F7835F23A   7F78BBC464   +8770090
61E6D0       61E898       61EA60       +456
61E6E0       61DFD0       61D8C0       -1808

Differences tend to be compact and regular, and can thus represent state transitions.

Page 14: Self-Learning, Adaptive Computer Systems

Address Differentials

Given instances A and B, the differential vector is defined as follows:

$D_{AB} = \langle d_1, \ldots, d_n \rangle, \quad d_i = b_i - a_i \ \text{for each}\ 1 \le i \le |T_A|$

Example (addresses in hex, differentials in decimal):

T_A: 10000  60000  8000000  7F00000  FE000
T_B: 10020  60060  8000008  7F00040  FE060

D_AB = <32, 96, 8, 64, 96>
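Continuing the sketch from page 12, the differential vector is one element-wise subtraction; this assumes, as in the slide's examples, that the two instances have address sets of equal length.

#include <cstddef>
#include <cstdint>
#include <vector>

// Differential vector D_AB between two equally long task address sets:
// d_i = b_i - a_i. The result is signed, since differentials can be
// negative (e.g. the -1808 entry on page 13).
std::vector<std::int64_t> differential(const std::vector<std::uint64_t>& ta,
                                       const std::vector<std::uint64_t>& tb) {
    std::vector<std::int64_t> d(ta.size());
    for (std::size_t i = 0; i < ta.size(); ++i)
        d[i] = static_cast<std::int64_t>(tb[i]) -
               static_cast<std::int64_t>(ta[i]);
    return d;
}

Applied to the two address sets above, this returns exactly <32, 96, 8, 64, 96>.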

Page 15: Self-Learning, Adaptive Computer Systems

Differentials Behavior: Mathematical Intuition

• Differentials are beneficial in cases of high redundancy
• An application's distribution functions provide intuition about vector repetition
• A non-uniform CDF implies highly regular patterns
• A uniform CDF implies noisy patterns (differential behavior cannot be exploited)

[Figure: example CDFs of differential vectors, one non-uniform and one uniform]

Page 16: Self-Learning, Adaptive Computer Systems

Differentials Behavior: Mathematical Intuition

Given N distinct vectors, a straightforward dictionary encoding requires R = log2(N) bits per vector. The entropy H is a theoretical lower bound on the representation, based on the distribution:

$H = -\sum_{k=1}^{N} p(k) \log_2 p(k)$

Example: assume 1000 vector instances over 4 possible values, so R = log2(4) = 2:

Differential value         #instances   p
(20, 8000, 720, 100050)    700          0.70
(16, 8040, -96, 50)        150          0.15
(0, 0, 14420, 100)         50           0.05
(0, 0, 720, 100050)        100          0.10

$H = -(0.7 \log_2 0.7 + 0.15 \log_2 0.15 + 0.05 \log_2 0.05 + 0.1 \log_2 0.1) \approx 1.31$

The Differential Entropy Compression Ratio (DECR) is used as the repetition criterion:

Benchmark                Suite     Implementation   Differential representation (bits)   Differential entropy (bits)   DECR (%)
FFT.128M                 BOTS      OpenMP           19.4                                  14.4                          25.5
NQUEENS.N=12             BOTS      OpenMP           11.8                                  8.4                           28.7
SORT.8M                  BOTS      OpenMP           16.4                                  16.3                          0.1
SGEFA.500x500            LINPACK   OpenMP           14.1                                  0.9                           93.6
FLUIDANIMATE.SIMSMALL    PARSEC    TBB              16.4                                  8.0                           51.3
SWAPTIONS.SIMSMALL       PARSEC    TBB              17.9                                  13.1                          26.6
STREAMCLUSTER.SIMSMALL   PARSEC    TBB              19.6                                  8.9                           54.4

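The DECR column matches DECR = (1 - H/R) x 100% on every row (e.g. SGEFA: 1 - 0.9/14.1 = 93.6%), so that is presumably its definition. A small C++ check of the worked example above, under that assumption:

#include <cmath>
#include <cstdio>
#include <map>
#include <vector>

// Entropy of a differential-vector histogram, and the compression ratio
// DECR = 1 - H/R implied by the table above (an inferred definition).
double entropyBits(const std::map<std::vector<long>, long>& hist, long total) {
    double h = 0.0;
    for (const auto& kv : hist) {
        const double p = double(kv.second) / double(total);
        h -= p * std::log2(p);
    }
    return h;
}

int main() {
    // The slide's example: 1000 instances over 4 distinct differential vectors.
    const std::map<std::vector<long>, long> hist = {
        {{20, 8000, 720, 100050}, 700},
        {{16, 8040, -96, 50},     150},
        {{0, 0, 14420, 100},       50},
        {{0, 0, 720, 100050},     100},
    };
    const double H = entropyBits(hist, 1000);         // ~1.31 bits
    const double R = std::log2(double(hist.size()));  // log2(4) = 2 bits
    std::printf("H=%.2f R=%.2f DECR=%.1f%%\n", H, R, 100.0 * (1.0 - H / R));
}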

Page 17: Self-Learning, Adaptive Computer Systems

Possible differential application: cache-line prefetching

First attempt: a prefix-based predictor. Given a differential prefix, predict the suffix.
Example: A and B have finished running (D_AB is stored), and now C is running...

Stored differential: D_AB = <0, 8000480, 54080, 8770090, 456, -1808>

T_A: 7F27BD6DF8  61E630   6949CC   7F77B02010   61E6D0  61E6E0
T_B: 7F27BD6DF8  DBFA10   6A1D0C   7F7835F23A   61E898  61DFD0
T_C: 7F27BD6DF8  1560DF0  6AF04C?  7F78BBC464?  61EA60?  61D8C0?

C's first two differentials (0, 8000480) match the stored prefix, so the remaining suffix <54080, 8770090, 456, -1808> is applied to T_B's entries to predict the rest of T_C (predicted values marked with "?").
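A minimal C++ sketch of this first attempt, using the types from the earlier sketches (a linear scan stands in for the prefix tree mentioned on page 19; all names are illustrative): once the differentials observed so far match exactly one stored vector, the rest of that vector is the predicted suffix.

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative prefix-based predictor: store the differential vectors of
// completed task instances; when a running task's observed differentials
// match exactly one stored vector, predict that vector's remaining suffix.
struct PrefixPredictor {
    std::vector<std::vector<std::int64_t>> db;  // completed differentials

    void record(std::vector<std::int64_t> d) { db.push_back(std::move(d)); }

    // Returns the unique matching stored vector, or nullptr if the prefix is
    // ambiguous or unseen. Entries [prefix.size()..) are the predicted suffix.
    const std::vector<std::int64_t>* predict(
            const std::vector<std::int64_t>& prefix) const {
        const std::vector<std::int64_t>* match = nullptr;
        for (const auto& d : db) {
            if (d.size() >= prefix.size() &&
                std::equal(prefix.begin(), prefix.end(), d.begin())) {
                if (match) return nullptr;  // more than one candidate
                match = &d;
            }
        }
        return match;
    }
};

Adding each predicted differential to the corresponding address of the previous instance then yields the prefetch addresses, exactly as T_C's remainder was derived from T_B above.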

Page 18: Self-Learning, Adaptive Computer Systems

Possible differential application: cache-line prefetching

Second attempt: a PHT predictor. Based on the last X differentials, predict the next differential.
Example: the stream of differential vectors below consists of V1 = <32, 96, 8, 64, 96> and V2 = <10, 16, 0, 16, 32> in the repeating pattern V1 V1 V2:

32 96 8 64 96 | 32 96 8 64 96 | 10 16 0 16 32 | 32 96 8 64 96 | 32 96 8 64 96 | 10 16 0 16 32 | 32 96 8 64 96 | 32 96 8 64 96

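A matching C++ sketch of the second attempt, keying a history table by the last X complete differential vectors (X = 2 is an arbitrary choice for illustration):

#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <vector>

// Illustrative PHT-style predictor: learn which differential follows each
// pattern of the last X differentials, and consult the table at task start.
struct PhtPredictor {
    static constexpr std::size_t X = 2;  // history depth (arbitrary here)
    std::deque<std::vector<std::int64_t>> history;  // last X differentials
    std::map<std::deque<std::vector<std::int64_t>>,
             std::vector<std::int64_t>> table;

    // Called when a task instance completes and its differential is known.
    void record(const std::vector<std::int64_t>& d) {
        if (history.size() == X) table[history] = d;  // learn: history -> next
        history.push_back(d);
        if (history.size() > X) history.pop_front();
    }

    // Called at task start: predict the next differential, if the current
    // history pattern has been seen before.
    const std::vector<std::int64_t>* predict() const {
        auto it = table.find(history);
        return it == table.end() ? nullptr : &it->second;
    }
};

On the repeating V1 V1 V2 stream above, the table learns (V1,V1)->V2, (V1,V2)->V1, and (V2,V1)->V1 within the first few tasks, after which every subsequent task's differential is predicted at task start.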

Page 19: Self-Learning, Adaptive Computer Systems

Possible differential application: cache-line prefetching

Prefix policy: the differential DB is a prefix tree; a prediction is made once the observed differential prefix becomes unique.
PHT policy: the differential DB holds the history table; a prediction is made at task start, based on the history pattern.

[Figure: block diagram. Executing CPUs send start-task/stop-task events and new memory requests to the differential logic, which tracks past and current task addresses, stores new differentials in the differential DB, and issues prefetch addresses to the caching hierarchy.]

Page 20: Self-Learning, Adaptive Computer Systems

Possible differential application: cache-line prefetching

The predictors are compared against two models: Base (no prefetching) and Ideal (a theoretical predictor that accurately predicts every repeating differential).

[Figure: misses per 1K instructions for Base, Prefix, PHT, and Ideal; left panel: NQUEENS.N=12, SWAPTIONS, FLUIDANIMATE, SGEFA.500 (0-6 MPKI scale); right panel: STREAMCLUSTER, FFT.128M, SORT.8M (0-70 MPKI scale)]

Cache miss elimination (%):

Benchmark       Prefix   PHT    Ideal
NQUEENS.N=12    19.4     11.4   62.1
SWAPTIONS       18.3     0.1    49.2
FLUIDANIMATE    14.9     26.0   46.0
SGEFA.500       0.0      97.6   99.9
STREAMCLUSTER   21.7     36.5   82.3
FFT.128M        45.0     -1.0   87.9
SORT.8M         3.3      0.0    0.1

Page 21: Self-Learning, Adaptive Computer Systems

Future Work

• Hybrid policies: which policy should be used when? (PHT is better for complete vector repetitions; prefix is better for partial vector repetitions, i.e., repeated suffixes)
• A regular-expression-based policy (pattern matching beyond the "ideal" model)
• Predicting other functional features using differentials (e.g., branch prediction, PTE prefetching)

Page 22: Self-Learning, Adaptive Computer Systems

Conclusions (so far...)

• When we look at the data, patterns emerge
• There is a large headroom for optimizing computer systems
• Existing predictions are based on heuristics:
  • A machine that does not respond within 1s is considered dead
  • Memory prefetchers look for block and stride accesses
• The goal: use machine learning, not heuristics, to uncover behavioral semantics