An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer

Yoshihiro Oyama, Kenjiro Taura, Toshio Endo, Akinori Yonezawa

Department of Information Science, Faculty of Science, University of Tokyo

Background

“Irregular” parallel applications
• Tasks are not identified until runtime
• Synchronization structure is complicated

Languages with fine-grain threads
• Promising approach to handle the complexity

Motivation

Many sophisticated designs and implementation techniques have been proposed so far, but case studies that answer the following question are few.

Q: Are fine-grain threads really effective?
• Easy to describe irregular parallelism?
• Scalable?
• Fast?

Goal

Case study to better understand the effectiveness of fine-grain threads

C + Solaris threads (approach without fine-grain threads)
VS.
our language Schematic (approach with fine-grain threads)

in terms of
• program description cost
• speed on 1 PE
• scalability on a 64-PE SMP


Overview

Applications (RNA & CKY)

Solutions without fine-grain threads

Solutions with fine-grain threads

Performance evaluation

Case Study 1: RNA
- protein secondary structure prediction -

Algorithm: simple node traversal + pruning

Finding a path
• satisfying a certain condition
• with the largest weight

[Figure: unbalanced tree]
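The "node traversal + pruning" idea can be sketched as a small branch-and-bound search. This is only an illustration under stated assumptions, not the RNA program from the talk: the `Node` fields, the `ok` flag standing in for the slide's "certain condition", and the per-node `bound` (an upper limit on the weight obtainable below the node) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    weight: int                  # weight contributed by this node
    bound: int = 0               # assumed upper bound on the weight obtainable below
    ok: bool = True              # hypothetical stand-in for the path condition
    children: List["Node"] = field(default_factory=list)


def search(node, acc, best):
    """Return the largest root-to-leaf path weight, given the best found so far."""
    if not node.ok:
        return best              # condition violated: prune this subtree
    acc += node.weight
    if acc + node.bound <= best:
        return best              # cannot beat the best known path: prune
    if not node.children:
        return max(best, acc)    # leaf: a complete path
    for child in node.children:
        best = search(child, acc, best)
    return best


# A small unbalanced tree (made-up weights).
tree = Node(1, 10, True, [
    Node(5, 3, True, [Node(3), Node(1)]),
    Node(2, 2, True, [Node(2)]),
    Node(4, 100, False, [Node(100)]),
])
```

Calling `search(tree, 0, 0)` explores the first subtree fully and then cuts the other two, one by the bound and one by the condition. Because subtrees are cut at unpredictable points, the amount of work per node is not known until run time.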

Case Study 2: CKY
- context-free grammar parser -

Calculation of matrix elements
• depends on all s
• calculation time varies significantly from element to element

Example sentence: She is a girl whose mother is a teacher.

actual size ≒ 100
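For readers unfamiliar with CKY, the following is a minimal sketch of the textbook recognizer, assuming a toy grammar in Chomsky normal form. The grammar, lexicon, and function names are invented for illustration and are not the parser evaluated in the talk. In this formulation each matrix element for a span is built from pairs of elements for shorter spans, and the work per element depends on how many nonterminals those elements hold.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form.
BINARY = {
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
    ("Det", "N"): {"NP"},
}
LEXICON = {
    "she": {"NP"},
    "is": {"V"},
    "a": {"Det"},
    "girl": {"N"},
    "teacher": {"N"},
}


def cky(words):
    """Fill the CKY matrix; table[i][j] holds nonterminals deriving words[i:j]."""
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        table[i][i + 1] = set(LEXICON.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            # Element (i, j) reads elements (i, k) and (k, j) for every split k.
            for k in range(i + 1, j):
                for left, right in product(table[i][k], table[k][j]):
                    table[i][j] |= BINARY.get((left, right), set())
    return table


def accepts(words):
    """True if the start symbol S derives the whole word sequence."""
    return "S" in cky(words)[0][len(words)]
```

For example, `accepts("she is a girl".split())` succeeds under this toy grammar, while the reordered `"is she a girl"` does not. In a parallel version, a worker computing one element would have to wait for every element it reads, which is where the synchronization cost comes from.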

Solution without Fine-grain Threads (RNA)

To create a thread for each node → large overhead

Task pool
• shared by the processors (P P P)
• communication with memory
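A task pool of this kind can be sketched as follows. This is a minimal illustration assuming a fixed set of worker threads and a shared queue; `run_task_pool`, `children_of`, and `TREE` are hypothetical names, and the talk's C + Solaris threads implementation is not shown in the slides.

```python
import queue
import threading


def run_task_pool(root, children_of, num_workers=3):
    """Visit every node of a tree using worker threads and a shared task pool."""
    pool = queue.Queue()          # the shared task pool
    visited = []
    lock = threading.Lock()       # protects `visited`

    def worker():
        while True:
            node = pool.get()
            if node is None:      # sentinel: no more work
                pool.task_done()
                return
            with lock:
                visited.append(node)
            for child in children_of(node):
                pool.put(child)   # tasks are discovered only at run time
            pool.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    pool.put(root)
    pool.join()                   # returns once every queued node is processed
    for _ in threads:
        pool.put(None)            # one sentinel per worker
    for t in threads:
        t.join()
    return visited


# Hypothetical unbalanced tree as an adjacency table.
TREE = {0: [1, 2], 1: [3, 4, 5], 2: [], 3: [6], 4: [], 5: [], 6: []}
```

`run_task_pool(0, lambda n: TREE[n])` returns all seven nodes in some order. Even in this small sketch the programmer writes the pool, the locking, and the termination protocol by hand, which is the description cost the comparison is about.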

Solution without Fine-grain Threads (CKY)

Calculating 1 element → 0 ~ 200 synchronizations

P P P

How to implement?
• small delay → simple spin
• large delay → block wait

Decision strategy?
• trial & error
• prediction
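One common way to combine the two waiting styles is to spin for a bounded number of polls and then fall back to blocking. The sketch below assumes that hybrid policy; the `Cell` class and the `spin_limit` parameter are hypothetical, and the slides do not say which strategy the C version finally used.

```python
import threading


class Cell:
    """One matrix element that a consumer may have to wait for (hypothetical)."""

    def __init__(self):
        self._ready = threading.Event()
        self.value = None

    def put(self, value):
        self.value = value        # publish the value first,
        self._ready.set()         # then signal any waiting consumer

    def get(self, spin_limit=1000):
        # Small delay expected: poll the flag a bounded number of times.
        for _ in range(spin_limit):
            if self._ready.is_set():
                return self.value
        # Large delay: stop spinning and block until the producer signals.
        self._ready.wait()
        return self.value
```

A consumer simply calls `cell.get()`; choosing `spin_limit` is exactly the "decision strategy" question, since a value that is too small blocks needlessly and one that is too large wastes processor time.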

Language with Fine-grain Threads

Schematic [Taura et al. 96] = Scheme + future + touch [Halstead 85]

    (define (fib x)
      (if (< x 2)
          1
          (let ((r1 (future (fib (- x 1))))
                (r2 (future (fib (- x 2)))))
            (+ (touch r1) (touch r2)))))

• future: thread creation
• touch: synchronization
• r1, r2: channel
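The meaning of `future` and `touch` can be imitated with ordinary threads. The sketch below is a naive model, assuming one OS thread per future, which is far heavier than what Schematic does. The `Future` class and the helper functions are hypothetical, and exceptions raised inside a future are not propagated.

```python
import threading


class Future:
    """Placeholder for a value being computed by another thread."""

    def __init__(self, fn, *args):
        self._value = None
        self._thread = threading.Thread(target=self._run, args=(fn, args))
        self._thread.start()          # thread creation

    def _run(self, fn, args):
        self._value = fn(*args)


def future(fn, *args):
    """Start computing fn(*args) concurrently and return a placeholder."""
    return Future(fn, *args)


def touch(f):
    """Wait until the future has a value, then return it."""
    f._thread.join()                  # synchronization
    return f._value


def fib(x):
    if x < 2:
        return 1
    r1 = future(fib, x - 1)
    r2 = future(fib, x - 2)
    return touch(r1) + touch(r2)
```

`fib(10)` already starts well over a hundred threads here, which shows why a language that encourages this style needs much cheaper thread creation than the operating system provides.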

Thread Management in Schematic
• Lazy Task Creation [Mohr et al. 91]

[Figure: futures on the stacks of PE A and PE B]
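The stack discipline usually associated with lazy task creation can be sketched with a double-ended queue: the owning PE works at the newest end, and an idle PE steals from the oldest end. This is a single-threaded illustration under that assumption; `TaskStack` is a hypothetical structure with no locking, whereas a real implementation manipulates the actual call stack.

```python
from collections import deque


class TaskStack:
    """Per-PE stack of pending work (hypothetical, not thread-safe)."""

    def __init__(self):
        self._tasks = deque()

    def push(self, task):
        self._tasks.append(task)      # owner pushes the newest entry on top

    def pop(self):
        # The owner resumes the newest entry, like an ordinary procedure return.
        return self._tasks.pop() if self._tasks else None

    def steal(self):
        # An idle PE takes the oldest entry, which tends to carry the most work.
        return self._tasks.popleft() if self._tasks else None
```

If PE A pushes `"cont-0"`, `"cont-1"`, `"cont-2"` in that order, a `steal()` by PE B yields `"cont-0"` while PE A's own `pop()` yields `"cont-2"`. As long as nobody steals, a future costs little more than a push and a pop.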

Synchronization on Register
• StackThreads [Taura 97]

[Figure: PE A and PE B, with values held in registers or in memory]

Synchronization by Code Duplication

Source:

    work A
    (touch r)
    work B

Generated code:

    work A
    if (r has value) {
        work B ver. 1;
    } else {
        c = closure(cont, fv1, ...);
        put_closure(r, c);
        /* switch to another work */
        ...
    }

    cont(c, v) {
        work B ver. 2;
    }

+ heuristics to decide which to duplicate

[Figure labels: simple spin, block wait]
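The two-version scheme can be imitated in a single thread with closures. The sketch below is an assumption-laden model rather than the compiler's output: `Channel`, its `put` method, and `work_a_then_b` are hypothetical names, and only `put_closure` and `cont` echo identifiers from the slide.

```python
class Channel:
    """Single-assignment cell read by `touch` (hypothetical API)."""

    def __init__(self):
        self.has_value = False
        self.value = None
        self.waiting = []                 # closures suspended on this channel

    def put_closure(self, closure):
        self.waiting.append(closure)

    def put(self, value):
        self.value, self.has_value = value, True
        waiting, self.waiting = self.waiting, []
        for cont in waiting:
            cont(value)                   # resume each suspended continuation


def work_a_then_b(r, log):
    log.append("work A")
    if r.has_value:
        # Version 1 of work B: runs inline, the value is already there.
        log.append(f"work B ver. 1 with {r.value}")
    else:
        # Version 2 of work B: packaged as a closure and run when r is filled.
        def cont(v):
            log.append(f"work B ver. 2 with {v}")

        r.put_closure(cont)
        log.append("switch to another work")
```

When the channel is already filled, only the inline version runs and no closure is allocated. When it is empty, the caller moves on and a later `put` triggers the second version. Duplicating `work B` is what lets the fast path avoid the closure entirely.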

What description can be omitted in Schematic?

• Management of fine-grain tasks
• Synchronization details

Schematic    C + thread
future   ⇔   manipulation of task pool + load balance
touch    ⇔   manipulation of comm. medium + aggressive optimizations

Codes for Parallel Execution (RNA)

C

    int search_node(...)
    {
        if (condition) {
        } else {
            child = ...;
            ...
            search_node(...);
            ...
            ...
            ...
        }
    }

Schematic

    (define (search_node)
      (if condition
          'done
          (let ((child ..))
            ... ...
            (search_node)
            ... ... ...)))

                          C                  Schematic
whole                     1566 lines         453 lines
for parallel execution    537 lines (34 %)   29 lines (6.4 %)

Performance Evaluation (Condition)

• Sun Ultra Enterprise 10000 (UltraSparc 250MHz × 64)
• Solaris 2.5.1
• Solaris thread (user-level thread)
• GC time not included
• Runtime type check omitted

Performance Evaluation (Sequential)

[Figure: normalized elapsed time (axis 0 to 3) of C and Schematic for RNA and CKY]

Performance Evaluation (Parallel)

[Figure: speedup (axis 0 to 50) against # of PEs (axis 0 to 60) for C (RNA), Schematic (RNA), C (CKY), Schematic (CKY)]

Related Work

ICC++ [Chien et al. 97]
• Similar study using 7 apps
• Experiments on distributed memory machines
• Focus on
  • namespace management
  • data locality
  • object-consistency model

Conclusion

We demonstrated the usefulness of fine-grain multithread languages
• Task pool-like execution with simple description
• Aggressive optimizations for synchronization

We showed the experimental results
• A factor of 2.8 slower than C
• Scalability comparable to C

Performance Evaluation (Other Applications 1/2)

[Figure: normalized elapsed time (axis 0 to 4, one bar labeled 14.7) of C and Schematic for Fib, Tak, Qsort, Knapsack, Grobner, SPLASH2]

Performance Evaluation (Other Applications 2/2)

[Figure: speedup (axis 0 to 50) against # of PEs (axis 0 to 60) for Fib, Tak, Nqueen, Qsort, Knapsack, Puzzle, QAP, SPLASH2]

Identifying Overheads

[Figure: normalized elapsed time (axis 0 to 1000) for normal, no poll, no GC check, stolen tag opt., flag check, use small tag, global var opt., and C]