
An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation

on Shared Memory Parallel Computer

Yoshihiro Oyama, Kenjiro Taura,

Toshio Endo, Akinori Yonezawa

Department of Information Science, Faculty of Science,

University of Tokyo

Background

“Irregular” parallel applications
• Tasks are not identified until runtime
• Synchronization structure is complicated

Languages with fine-grain threads
• a promising approach to handle the complexity

Motivation

Q: Are fine-grain threads really effective?

• Easy to describe irregular parallelism?
• Scalable?
• Fast?

Many sophisticated designs and implementation techniques have been proposed so far, but case studies to answer the Q are few.

Goal

Case study to better understand the effectiveness of fine-grain threads

C + Solaris threads (approach w/o fine-grain threads)

vs.

our language Schematic (approach with fine-grain threads)

in terms of
• program description cost
• speed on 1 PE
• scalability on a 64-PE SMP

Overview

• Applications (RNA & CKY)
• Solutions without fine-grain threads
• Solutions with fine-grain threads
• Performance evaluation

Case Study 1: RNA (protein secondary structure prediction)

Algorithm: simple node traversal + pruning over an unbalanced tree

finding a path
• satisfying a certain condition
• with the largest weight

Case Study 2: CKY (context-free grammar parser)

Calculation of triangular-matrix elements: each element depends on all the elements covering its sub-spans.

Example input: “She is a girl whose mother is a teacher.”

Calculation time significantly varies from element to element; actual matrix size ≈ 100.

Solution without Fine-grain Threads (RNA)

Creating a thread for each node incurs large overhead (communication with memory). Instead, worker processors (P) repeatedly fetch nodes from a shared task pool.

Solution without Fine-grain Threads (CKY)

Calculating one element involves 0 ~ 200 synchronizations.

How to implement the waits?
• small delay → simple spin
• large delay → blocking wait

Decision strategy?
• trial & error
• prediction

Schematic [Taura et al. 96] = Scheme + future + touch [Halstead 85]

(define (fib x)
  (if (< x 2)
      1
      (let ((r1 (future (fib (- x 1))))
            (r2 (future (fib (- x 2)))))
        (+ (touch r1) (touch r2)))))

future: thread creation
touch: synchronization
r1, r2: channels

Language with Fine-grain Threads

Thread Management in Schematic
• Lazy Task Creation [Mohr et al. 91]

(Figure: each PE keeps its deferred future calls on its own stack; an idle PE steals a deferred task from another PE's stack.)

Synchronization on Register

• StackThreads [Taura 97]

(Figure: threads on PE A and PE B; most synchronizations complete through registers, falling back to memory only in a few cases.)

Synchronization by Code Duplication

+ heuristics to decide which code to duplicate

Source:

  work A; (touch r); work B;

Compiled code:

  work A;
  if (r has value) {
      work B ver. 1;
  } else {
      c = closure(cont, fv1, ...);
      put_closure(r, c);
      /* switch to another work */
  }

  cont(c, v) {
      work B ver. 2;
  }

The fast path costs no more than a simple spin check; the slow path behaves like a blocking wait.

What description can be omitted in Schematic?
• Management of fine-grain tasks: future ⇔ manipulation of the task pool + load balancing
• Synchronization details: touch ⇔ manipulation of the communication medium + aggressive optimizations


Codes for Parallel Execution

C:

  int search_node(...)
  {
      if (condition) {
          ...
      } else {
          child = ...;
          ...
          search_node(...);
          ...
      }
      ...
  }

Schematic:

  (define (search_node)
    (if condition
        'done
        (let ((child ...))
          ...
          (search_node)
          ...)))

Code size (RNA):

                            C                   Schematic
  whole                     1566 lines          453 lines
  for parallel execution    537 lines (34 %)    29 lines (6.4 %)

Performance Evaluation (Conditions)
• Sun Ultra Enterprise 10000 (UltraSPARC 250 MHz × 64)
• Solaris 2.5.1
• Solaris threads (user-level threads)
• GC time not included
• Runtime type check omitted

Performance Evaluation (Sequential)

(Figure: normalized elapsed time (0–3) of C vs. Schematic on RNA and CKY.)

Performance Evaluation (Parallel)

(Figure: speedup (0–50) vs. # of PEs (up to 60) for C and Schematic on RNA and CKY.)

Related Work

ICC++ [Chien et al. 97]
• Similar study using 7 apps
• Experiments on distributed memory machines
• Focus on
  • namespace management
  • data locality
  • object-consistency model

Conclusion

We demonstrated the usefulness of languages with fine-grain threads
• Task pool-like execution with a simple description
• Aggressive optimizations for synchronization

We showed experimental results
• A factor of 2.8 slower than C
• Scalability comparable to C

Performance Evaluation (Other Applications 1/2)

(Figure: normalized elapsed time (0–4) of C vs. Schematic on Fib, Tak, Qsort, Knapsack, Grobner, SPLASH2; one bar reaches 14.7.)

Performance Evaluation (Other Applications 2/2)

(Figure: speedup (0–50) vs. # of PEs (up to 60) for Fib, Tak, Nqueen, Qsort, Knapsack, Puzzle, QAP, SPLASH2.)

Identifying Overheads

(Figure: normalized elapsed time (0–1000) under successive optimizations: normal, no poll, no GC check, stolen tag opt., flag check, use small tag, global var opt., and C.)
