CX: A Scalable, Robust Network for Parallel Computing Peter Cappello & Dimitrios Mourloukos Computer Science UCSB

Post on 22-Dec-2015


CX: A Scalable, Robust Network for Parallel Computing

Peter Cappello & Dimitrios Mourloukos

Computer Science

UCSB


Outline

1. Introduction

2. Related work

3. API

4. Architecture

5. Experimental results

6. Current & future work


Introduction

• “Listen to the technology!” Carver Mead

• What is the technology telling us?

– Internet’s idle cycles/sec growing rapidly

– Bandwidth increasing & getting cheaper

– Communication latency is not decreasing

– Human technology is getting neither cheaper nor faster.


Introduction

Project Goals

1. Minimize job completion time despite large communication latency

2. Jobs complete with high probability despite faulty components

3. Application program is oblivious to:

• Number of processors

• Inter-process communication

• Fault tolerance


Introduction

Fundamental Issue: Heterogeneity

[Figure: heterogeneous machine/OS pairs (M1/OS1, M2/OS2, M3/OS3, M4/OS4, M5/OS5, …) made functionally homogeneous by the JVM.]


Related work

• Cilk, Cilk-NOW, Atlas
  – DAG computational model
  – Work-stealing

• Linda, Piranha, JavaSpaces
  – Space-based coordination
  – Decoupled communication

• Charlotte (Milan project / Calypso prototype)
  – High performance: fault tolerance not achieved via transactions
  – Fault tolerance via eager scheduling

• SuperWeb, Javelin, Javelin++
  – Architecture: client, broker, host


API

DAG Computational model

int f( int n )
{
    if ( n < 2 )
        return n;
    else
        return f( n-1 ) + f( n-2 );
}


DAG Computational Model

int f( int n ) {
    if ( n < 2 ) return n;
    else return f( n-1 ) + f( n-2 );
}

Method invocation tree

f(4)
    f(3)
        f(2)
            f(1)
            f(0)
        f(1)
    f(2)
        f(1)
        f(0)

DAG Computational Model / API

Decomposition task f(n):

execute( ) {
    if ( n < 2 )
        setArg( ArgAddr, n );
    else {
        spawn ( + );
        spawn ( f(n-1) );
        spawn ( f(n-2) );
    }
}

Composition task +:

execute( ) {
    setArg( ArgAddr, in[0] + in[1] );
}

[Figure: the task DAG for f(4), built up level by level. Each f(n) with n ≥ 2 spawns f(n-1), f(n-2), and a + task that receives their two results.]
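The decompose/compose pattern above can be sketched as a tiny single-threaded runtime. This is only an illustration of the DAG model, not the CX implementation: the class names `Task`, `F`, `Sum`, and `MiniRuntime`, and the way a successor and input slot stand in for `ArgAddr`, are assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a CX task; the real CX class names are not given in the slides. */
abstract class Task {
    Task successor;   // task that receives this task's result (null for the root)
    int slot;         // which input of the successor this task fills
    final int[] in;   // inputs, filled by setArg calls from predecessor tasks
    int missing;      // number of inputs not yet received

    Task(int inputs) { in = new int[inputs]; missing = inputs; }

    abstract void execute(MiniRuntime rt);
}

/** Decomposition task f(n). */
class F extends Task {
    final int n;

    F(int n, Task successor, int slot) {
        super(0);
        this.n = n;
        this.successor = successor;
        this.slot = slot;
    }

    void execute(MiniRuntime rt) {
        if (n < 2) {
            rt.setArg(successor, slot, n);        // base case: send n to the successor
        } else {
            Sum plus = new Sum(successor, slot);  // the "+" task inherits f(n)'s successor
            rt.spawn(plus);
            rt.spawn(new F(n - 1, plus, 0));
            rt.spawn(new F(n - 2, plus, 1));
        }
    }
}

/** Composition task "+": adds its two inputs. */
class Sum extends Task {
    Sum(Task successor, int slot) {
        super(2);
        this.successor = successor;
        this.slot = slot;
    }

    void execute(MiniRuntime rt) { rt.setArg(successor, slot, in[0] + in[1]); }
}

/** Minimal runtime: a READY queue; tasks with missing inputs simply wait. */
class MiniRuntime {
    private final Deque<Task> ready = new ArrayDeque<>();
    private int result;

    void spawn(Task t) {
        if (t.missing == 0) ready.add(t);   // a task with missing inputs waits for setArg
    }

    void setArg(Task target, int slot, int value) {
        if (target == null) { result = value; return; }   // the root's result
        target.in[slot] = value;
        if (--target.missing == 0) ready.add(target);     // all inputs present: now ready
    }

    int run(Task root) {
        spawn(root);
        while (!ready.isEmpty()) ready.poll().execute(this);
        return result;
    }
}
```

For example, `new MiniRuntime().run(new F(4, null, 0))` evaluates the f(4) DAG shown on the slides.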


Architecture: Basic Entities

[Figure: CONSUMER, PRODUCTION NETWORK, CLUSTER NETWORK]

Consumer call sequence: register ( spawn | getResult )* unregister
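The call sequence above is a regular expression over call names: one register, any number of spawn or getResult calls, then one unregister. A minimal sketch of a checker for that pattern follows; the class `ConsumerProtocol` and the representation of calls as strings are assumptions for illustration, not part of CX.

```java
import java.util.List;

/** Hypothetical checker for the pattern: register ( spawn | getResult )* unregister */
class ConsumerProtocol {
    static boolean isValidSession(List<String> calls) {
        int n = calls.size();
        if (n < 2) return false;                              // need register and unregister
        if (!calls.get(0).equals("register")) return false;
        if (!calls.get(n - 1).equals("unregister")) return false;
        for (String call : calls.subList(1, n - 1)) {
            // Only spawn and getResult may appear between the two end calls.
            if (!call.equals("spawn") && !call.equals("getResult")) return false;
        }
        return true;
    }
}
```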

Architecture: Cluster

[Figure: one TASK SERVER with four PRODUCERs.]

A Cluster at Work

[Figure: animation frames of the f(4) task DAG on a cluster with one TASK SERVER, holding WAITING and READY tasks, and two PRODUCERs. The root task f(4) starts in READY and is passed to a producer.]

Decompose

execute( )
{
    if ( n < 2 )
        setArg( ArgAddr, n );
    else
    {
        spawn ( + );
        spawn ( f(n-1) );
        spawn ( f(n-2) );
    }
}

A Cluster at Work

[Figure: animation frames. Executing f(4) spawns +, f(3), and f(2): the + task is held in WAITING while f(3) and f(2) are READY and go to the producers, which decompose them in turn into further + tasks and smaller f tasks.]

Compute Base Case

execute( )
{
    if ( n < 2 )
        setArg( ArgAddr, n );
    else
    {
        spawn ( + );
        spawn ( f(n-1) );
        spawn ( f(n-2) );
    }
}

A Cluster at Work

[Figure: animation frames. The base-case tasks f(1) and f(0) move from READY to the producers and are executed, while the + tasks remain in WAITING.]

Compose

execute( )
{
    setArg( ArgAddr, in[0] + in[1] );
}

A Cluster at Work

[Figure: animation frames. The + tasks move from WAITING to READY and are executed by the producers, level by level, until the last + task yields the result R.]

1. Result object is sent to Production Network

2. Production Network returns it to Consumer

Task Server Proxy: Overlap Communication with Computation

[Figure: a PRODUCER with COMM and COMP components and a Task Server Proxy holding an INBOX and an OUTBOX; the TASK SERVER holds READY and WAITING tasks and a PRIORITY Q.]
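One way to read the figure is that a communication thread keeps the INBOX filled and drains the OUTBOX while a compute thread works. The following is a minimal sketch under that assumption; `ProxySketch`, the `STOP` marker, and squaring as the "computation" are all invented for the example and say nothing about how CX implements its proxy.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: a COMM thread feeds an inbox while a COMP thread computes. */
class ProxySketch {
    static final int STOP = -1;   // assumed end-of-work marker; task values must be >= 0

    /** Squares each task value on the compute thread and returns the sum of the results. */
    static long run(int[] tasks) {
        BlockingQueue<Integer> inbox = new LinkedBlockingQueue<>();
        BlockingQueue<Long> outbox = new LinkedBlockingQueue<>();

        // COMM: stands in for fetching tasks from the task server ahead of need.
        Thread comm = new Thread(() -> {
            for (int t : tasks) inbox.add(t);
            inbox.add(STOP);
        });

        // COMP: takes whatever is already in the inbox, so it rarely waits on the network.
        Thread comp = new Thread(() -> {
            try {
                for (int t = inbox.take(); t != STOP; t = inbox.take()) {
                    outbox.add((long) t * t);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        comm.start();
        comp.start();
        try {
            comm.join();
            comp.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }

        long sum = 0;
        for (long r : outbox) sum += r;   // stands in for sending results back to the server
        return sum;
    }
}
```

The queues decouple the two threads: the compute thread blocks only when the inbox is empty.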

Architecture: Work stealing & eager scheduling

• A task is removed from server only after a complete signal is received.

• A task may be assigned to multiple producers

– Balance task load among producers of varying processor speeds

– Tasks on failed/retreating producers are re-assigned.
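The two rules above can be sketched as a small bookkeeping class: a task stays on the server until a completion signal arrives, and it may be handed out more than once in the meantime. `EagerServer` and its policy of picking the least-assigned task are assumptions for illustration; the slides do not specify CX's selection policy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of eager scheduling on a single task server. */
class EagerServer {
    // taskId -> number of times the task has been handed to a producer
    private final Map<Integer, Integer> assignments = new LinkedHashMap<>();

    void add(int taskId) { assignments.put(taskId, 0); }   // assumes taskId >= 0

    /** Returns the uncompleted task assigned the fewest times, or -1 if none remain. */
    int assign() {
        int best = -1;
        for (int id : assignments.keySet()) {
            if (best == -1 || assignments.get(id) < assignments.get(best)) best = id;
        }
        if (best != -1) assignments.merge(best, 1, Integer::sum);
        return best;   // the task is NOT removed here, so it can be re-assigned
    }

    /** Completion signal: only now is the task removed. Returns false for a duplicate signal. */
    boolean complete(int taskId) { return assignments.remove(taskId) != null; }

    int remaining() { return assignments.size(); }
}
```

Because `assign` never removes a task, work held by a slow, failed, or retreating producer is eventually handed to another producer, and the first completion signal wins.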

Architecture: Scalability

• A cluster tolerates producer:

– Retreat

– Failure

• A single task server, however, is a:

– Bottleneck

– Single point of failure.

• We introduce a network of task servers.


Scalability: Class loading

1. CX class loader loads classes (Consumer JAR) in each server’s class cache

2. Producer loads classes from its server

Scalability: Fault-tolerance

Replicate a server’s tasks on its sibling

When a server fails, its sibling restores state to a replacement server

Architecture

Production network of clusters

• Network tolerates single server failure.

• Restores ability to tolerate a single failure, hence the ability to tolerate a sequence of failures.


Preliminary experiments

• Experiments run on Linux cluster

– 100 port Lucent P550 Cajun Gigabit Switch

• Machine

– 2 Intel EtherExpress Pro 100 Mb/s Ethernet cards

– Red Hat Linux 6.0

– JDK 1.2.2_RC3

– Heterogeneous

• processor speeds

• processors/machine

Fibonacci Tasks with Synthetic Load

Decomposition task f(n):

execute( ) {
    if ( n < 2 ) {
        syntheticWorkload();
        setArg( ArgAddr, n );
    } else {
        syntheticWorkload();
        spawn ( + );
        spawn ( f(n-1) );
        spawn ( f(n-2) );
    }
}

Composition task +:

execute( ) {
    syntheticWorkload();
    setArg( ArgAddr, in[0] + in[1] );
}

TSEQ vs. T1 (seconds), Computing F(8)

Workload   TSEQ      T1        Efficiency
4.522      497.420   518.816   0.96
3.740      415.140   436.897   0.95
2.504      280.448   297.474   0.94
1.576      179.664   199.423   0.90
0.914      106.024   120.807   0.88
0.468       56.160    65.767   0.85
0.198       24.750    29.553   0.84
0.058        8.120    11.386   0.71
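The Efficiency column is consistent with the ratio TSEQ / T1 rounded to two decimals. A short sketch that recomputes it from the table's figures follows; the class and method names are chosen for this example only.

```java
/** Recomputes the Efficiency column as TSEQ / T1 (names are illustrative). */
class EfficiencyTable {
    static double efficiency(double tSeq, double t1) {
        return tSeq / t1;
    }

    /** Efficiency as a whole-number percentage, for comparison with the two-decimal column. */
    static long percent(double tSeq, double t1) {
        return Math.round(efficiency(tSeq, t1) * 100);
    }
}
```

For instance, `EfficiencyTable.percent(497.420, 518.816)` gives 96 and `EfficiencyTable.percent(8.120, 11.386)` gives 71, matching the first and last rows.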

Parallel Efficiency over 60 nodes

[Figure: bar chart of parallel efficiency (y-axis, 0 to 1.2) for F(13) through F(18) (x-axis), with one series each for Workload 1 and Workload 2.]

Parallel efficiency for F(13) = 0.87

Parallel efficiency for F(18) = 0.99

Average task time: Workload 1 = 1.8 sec., Workload 2 = 3.7 sec.


Current work

• Implement CX market maker (broker)

Solves discovery problem between Consumers & Production networks

• Enhance Producer with Lea’s Fork/Join Framework

– See gee.cs.oswego.edu

[Figure: several CONSUMERs and several PRODUCTION NETWORKs connected through a MARKET MAKER, labeled as a JINI Service.]

Current work

• Enhance computational model: branch & bound.

– Propagate new bounds thru production network: 3 steps

[Figure: a PRODUCTION NETWORK and a SEARCH TREE, with a BRANCH marked TERMINATE!]

Current work

• Investigate computations that appear ill-suited to adaptive parallelism

– SOR

– N-body.

Introduction

Fundamental Issues

• Communication latency

Long latency ⇒ overlap computation with communication.

• Robustness

Massive parallelism ⇒ faults

• Scalability

Massive parallelism ⇒ login privileges cannot be required.

• Ease of use

Jini ⇒ easy upgrade of system components

Related work

• Market mechanisms

– Huberman, Waldspurger, Malone, Miller & Drexler, Newhouse & Darlington


Related work

• CX integrates

– DAG computational model

– Work-stealing scheduler

– Space-based, decoupled communication

– Fault-tolerance via eager scheduling

– Market mechanisms (incentive to participate)

Architecture: Task identifier

• DAG has spawn tree

• TaskID = path id

• Root.TaskID = 0

• TaskID used to detect duplicate:

– Tasks

– Results.

[Figure: the spawn tree for F(4), with its F and + tasks annotated with the labels 0, 1, and 2 from which path ids are formed.]
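A path id can be sketched as the parent's id extended by the child's index in the spawn tree, with a set of seen ids used to drop duplicates. The dotted string encoding and the names `TaskIds` and `DuplicateFilter` are assumptions for this example; the slides state only that the id is a path and that the root's id is 0.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical path-id encoding: parent id, a dot, then the child's index. */
class TaskIds {
    static final String ROOT = "0";

    static String childId(String parentId, int childIndex) {
        return parentId + "." + childIndex;
    }
}

/** Remembers ids already seen so that duplicate tasks or results can be discarded. */
class DuplicateFilter {
    private final Set<String> seen = new HashSet<>();

    /** Returns true the first time an id is offered and false for every repeat. */
    boolean firstTime(String taskId) {
        return seen.add(taskId);
    }
}
```

Because a task's id depends only on its position in the spawn tree, the same task executed by two producers under eager scheduling yields the same id, which is what makes duplicates detectable.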


Architecture: Basic Entities

• Consumer

Seeks computing resources.

• Producer

Offers computing resources.

• Task Server

Coordinates task distribution among its producers.

• Production Network

A network of task servers & their associated producers.


Defining Parallel Efficiency

• Scalar: Homogeneous set of $P$ machines:

$\text{Parallel efficiency} = (T_1 / P) / T_P$

• Vector: Heterogeneous set of $P$ machines:

$P = [P_1, P_2, \dots, P_d]$, where there are $P_1$ machines of type 1, $P_2$ machines of type 2, …, $P_d$ machines of type $d$:

$\text{Parallel efficiency} = (P_1 / T_1 + P_2 / T_2 + \dots + P_d / T_d)^{-1} / T_P$
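Both definitions can be written as short functions. In this sketch, the vector form's T_i is read as the completion time on a single machine of type i, which is an interpretation of the slide rather than something it states; the class and parameter names are illustrative.

```java
/** Sketch of the two parallel-efficiency definitions (names are illustrative). */
class ParallelEfficiency {
    /** Scalar form: (T1 / P) / TP for P identical machines. */
    static double homogeneous(double t1, int p, double tP) {
        return (t1 / p) / tP;
    }

    /**
     * Vector form: (P1/T1 + ... + Pd/Td)^-1 / TP, where counts[i] is the number of
     * machines of type i and t[i] is assumed to be the one-machine time for that type.
     */
    static double heterogeneous(int[] counts, double[] t, double tP) {
        double rate = 0.0;                 // aggregate jobs per unit time
        for (int i = 0; i < counts.length; i++) {
            rate += counts[i] / t[i];
        }
        return (1.0 / rate) / tP;          // ideal completion time divided by observed time
    }
}
```

With a single machine type the vector form reduces to the scalar one: `heterogeneous(new int[] {4}, new double[] {100}, 25)` and `homogeneous(100, 4, 25)` both evaluate to 1, up to floating-point rounding.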

Future work

• Support special hardware / data: inter-server task movement.

– Diffusion model: tasks are homogeneous gas atoms diffusing through network.

– N-body model: each kind of atom (task) has its own:

• Mass (resistance to movement: code size, input size, …)

• Attraction/repulsion to different servers, or other “massive” entities, such as:

» special processors

» large database.


Future Work

• CX preprocessor to simplify API.