
An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation

on Shared Memory Parallel Computer

Yoshihiro Oyama, Kenjiro Taura,

Toshio Endo, Akinori Yonezawa

Department of Information Science, Faculty of Science,

University of Tokyo

Background

“Irregular” parallel applications
• Tasks are not identified until runtime
• Synchronization structure is complicated

Languages with fine-grain threads
• a promising approach to handle the complexity

Motivation

Q: Are fine-grain threads really effective?

• Easy to describe irregular parallelism?
• Scalable?
• Fast?

Many sophisticated designs and implementation techniques have been proposed so far, but case studies to answer the Q are few.

Goal

Case study to better understand the effectiveness of fine-grain threads

C + Solaris threads (approach w/o fine-grain threads)

vs.

our language Schematic (approach with fine-grain threads)

in terms of
• program description cost
• speed on 1 PE
• scalability on a 64-PE SMP

Overview

• Applications (RNA & CKY)
• Solutions without fine-grain threads
• Solutions with fine-grain threads
• Performance evaluation

Case Study 1: RNA (protein secondary structure prediction)

Algorithm: simple node traversal + pruning over an unbalanced tree

finding a path
• satisfying a certain condition
• with the largest weight

Case Study 2: CKY (context-free grammar parser)

Calculation of triangular-matrix elements: each element depends on all the elements covering its sub-spans.

Example input: “She is a girl whose mother is a teacher.”

Calculation time significantly varies from element to element; actual matrix size ≈ 100.

Solution without Fine-grain Threads (RNA)

Creating a thread for each node incurs large overhead (communication with memory). Instead, worker processors (P) repeatedly fetch nodes from a shared task pool.

Solution without Fine-grain Threads (CKY)

Calculating one element involves 0 ~ 200 synchronizations.

How to implement the waits?
• small delay → simple spin
• large delay → blocking wait

Decision strategy?
• trial & error
• prediction

Schematic [Taura et al. 96] = Scheme + future + touch [Halstead 85]

(define (fib x)
  (if (< x 2)
      1
      (let ((r1 (future (fib (- x 1))))
            (r2 (future (fib (- x 2)))))
        (+ (touch r1) (touch r2)))))

future: thread creation
touch: synchronization
r1, r2: channels

Language with Fine-grain Threads

Thread Management in Schematic
• Lazy Task Creation [Mohr et al. 91]

(Figure: each PE keeps its deferred future calls on its own stack; an idle PE steals a deferred task from another PE's stack.)

Synchronization on Register

• StackThreads [Taura 97]

(Figure: threads on PE A and PE B; most synchronizations complete through registers, falling back to memory only in a few cases.)

Synchronization by Code Duplication

+ heuristics to decide which code to duplicate

Source:

  work A; (touch r); work B;

Compiled code:

  work A;
  if (r has value) {
      work B ver. 1;
  } else {
      c = closure(cont, fv1, ...);
      put_closure(r, c);
      /* switch to another work */
  }

  cont(c, v) {
      work B ver. 2;
  }

The fast path costs no more than a simple spin check; the slow path behaves like a blocking wait.

What description can be omitted in Schematic?
• Management of fine-grain tasks: future ⇔ manipulation of the task pool + load balancing
• Synchronization details: touch ⇔ manipulation of the communication medium + aggressive optimizations


Codes for Parallel Execution

C:

  int search_node(...)
  {
      if (condition) {
          ...
      } else {
          child = ...;
          ...
          search_node(...);
          ...
      }
      ...
  }

Schematic:

  (define (search_node)
    (if condition
        'done
        (let ((child ...))
          ...
          (search_node)
          ...)))

Code size (RNA):

                            C                   Schematic
  whole                     1566 lines          453 lines
  for parallel execution    537 lines (34 %)    29 lines (6.4 %)

Performance Evaluation (Conditions)
• Sun Ultra Enterprise 10000 (UltraSPARC 250 MHz × 64)
• Solaris 2.5.1
• Solaris threads (user-level threads)
• GC time not included
• Runtime type check omitted

Performance Evaluation (Sequential)

(Figure: normalized elapsed time (0–3) of C vs. Schematic on RNA and CKY.)

Performance Evaluation (Parallel)

(Figure: speedup (0–50) vs. # of PEs (up to 60) for C and Schematic on RNA and CKY.)

Related Work

ICC++ [Chien et al. 97]
• Similar study using 7 apps
• Experiments on distributed memory machines
• Focus on
  • namespace management
  • data locality
  • object-consistency model

Conclusion

We demonstrated the usefulness of languages with fine-grain threads
• Task pool-like execution with a simple description
• Aggressive optimizations for synchronization

We showed experimental results
• A factor of 2.8 slower than C
• Scalability comparable to C

Performance Evaluation (Other Applications 1/2)

(Figure: normalized elapsed time (0–4) of C vs. Schematic on Fib, Tak, Qsort, Knapsack, Grobner, SPLASH2; one bar reaches 14.7.)

Performance Evaluation (Other Applications 2/2)

(Figure: speedup (0–50) vs. # of PEs (up to 60) for Fib, Tak, Nqueen, Qsort, Knapsack, Puzzle, QAP, SPLASH2.)

Identifying Overheads

(Figure: normalized elapsed time (0–1000) under successive optimizations: normal, no poll, no GC check, stolen tag opt., flag check, use small tag, global var opt., and C.)
