Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments

Post on 09-Feb-2016


Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments

Junghee Lee, Chrysostomos Nicopoulos, Yongjae Lee, Hyung Gyu Lee and Jongman Kim

Presented by Junghee Lee

2

Introduction

• Manycore systems
  – Number of cores is increasing
• Challenges in scalability
  – Memory
  – Power consumption
  – Cache coherence protocol
  – Load balancing

3

Contents

• Introduction
• Background
  – Programming models
  – Motivation
• IsoNet
• Fault-tolerance
• Evaluation
• Conclusion

4

Programming Models

• Parallel programming models
  – MPI
  – OpenMP
• Fine-grained parallelism
  – Emerging applications: Recognition, Mining and Synthesis
  – Execution time of each computation kernel is very short, but it has abundant parallelism
  – Excessive overhead in multithreading

5

Job Queuing

• Creates jobs instead of threads
  – One thread per core is created
  – Thread: a set of instructions and states of execution
  – Job: a set of data that is processed by a thread
• Job queue
  – Manages the list of jobs
  – Maintains load balance

[Figure: two CPUs, each running one persistent thread, pulling jobs from a shared job queue]
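To make the thread-vs-job distinction concrete, here is a minimal single-threaded Python sketch (the function name and round-robin timing are illustrative, not from the paper): workers exist once per core, and jobs are plain data items drained from a shared queue, so there is no per-job thread creation.

```python
from collections import deque

def run_with_job_queue(jobs, num_cores):
    """One persistent worker per core pulls jobs (plain data items) from a
    shared queue; no thread is created or destroyed per job."""
    queue = deque(jobs)
    processed = {core: [] for core in range(num_cores)}
    core = 0
    while queue:
        processed[core].append(queue.popleft())  # a "thread" takes the next job
        core = (core + 1) % num_cores            # round-robin stand-in for timing
    return processed
```

In real hardware or an OpenMP runtime the cores would dequeue concurrently; the round-robin loop only stands in for that interleaving.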

6

Conflicts in Job Queue

• Chance of conflicts increases as:
  – The number of cores increases
  – The time taken to update the job queue increases
  – The job queue is accessed more frequently (jobs are short)
• Previous approaches
  – Distributed queues
    • Load balance is maintained by job stealing
    • The chance of conflicts in any one local queue is decreased
  – Hardware implementation
    • Time spent updating the queue is reduced
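The distributed-queue approach above can be sketched deterministically: each core works from its own local queue, and an idle core steals from the first non-empty victim it finds. The scan-order stealing policy is an assumption for illustration; real runtimes typically pick victims randomly.

```python
def schedule_with_stealing(queues, steps):
    """Each core pops from its own local queue; an idle core steals one job
    from the first non-empty victim, avoiding a single contended queue."""
    done = [0] * len(queues)
    for _ in range(steps):
        for core, q in enumerate(queues):
            if not q:
                # Idle core scans the other cores and steals one job.
                for victim in range(len(queues)):
                    if victim != core and queues[victim]:
                        q.append(queues[victim].pop())
                        break
            if q:
                q.pop()          # process one job from the local queue
                done[core] += 1
    return done
```

Starting from one overloaded core, the work spreads out: with `[[0,1,2,3,4,5], [], []]` and 3 steps, each core ends up processing two jobs.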

7

Profile of SMVM

[Figure: ratio of execution time spent on conflicts, stealing jobs, and processing jobs in SMVM, for 4–256 cores]

8

Objectives

• Requirements of a load balancer
  – Scalability: conflict-free
  – Fault-tolerance
    • The probability of faults increases exponentially as technology scales
• Contributions of this paper
  – Lightweight micro-network for load balancing
  – Scalable even with more than a thousand cores
  – Comprehensive fault-tolerance support

9

Contents

• Introduction
• Background
• IsoNet
  – Architecture
  – Implementation
• Fault-tolerance
• Evaluation
• Conclusion

10

System View

[Figure: system view — a mesh of CPUs, each attached to a network router (R) and an IsoNet node (I)]

11

Microarchitecture of IsoNet Node

[Figure: microarchitecture of an IsoNet node — a dual-clock stack holding jobs, job-count registers, comparators, max and min selectors, a switch, and the multiplexers/demultiplexers connecting them]

12

How It Works

[Figure: step-by-step example of job-count values (0, 1, 2) propagating through the IsoNet tree]

• Tree-based routing: for fault-tolerance
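As the slides describe it, IsoNet's selectors find the most- and least-loaded nodes and its switch moves one job between them each balancing step. A software sketch of that max/min transfer loop (the hardware does this with combinational selectors, not a program):

```python
def balance(job_counts, cycles):
    """Per 'cycle', move one job from the most-loaded node to the
    least-loaded node, mimicking the max/min selectors and switch."""
    counts = list(job_counts)
    for _ in range(cycles):
        hi = counts.index(max(counts))   # max selector: most-loaded node
        lo = counts.index(min(counts))   # min selector: least-loaded node
        if counts[hi] - counts[lo] <= 1:
            break                        # already balanced, stop transferring
        counts[hi] -= 1                  # switch routes one job hi -> lo
        counts[lo] += 1
    return counts
```

For example, `balance([8, 0, 0, 0], 10)` converges to `[2, 2, 2, 2]` in six transfers.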

13

Single Cycle Implementation

• Estimated critical path delay
  – 11.38 ns (87.8 MHz)
  – By the Elmore delay model
• Single-cycle implementation offers low hardware cost

[Figure: critical path from a source node up through leaf, intermediate, and root nodes, then back down through switches to a destination node]
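The Elmore model cited above estimates the delay of an RC chain by charging each resistance with all of the capacitance downstream of it. A generic sketch of that computation (not the paper's actual netlist or values):

```python
def elmore_delay(resistances, capacitances):
    """First-order Elmore delay of an RC ladder: sum over each segment of
    its resistance times the total capacitance downstream of it."""
    assert len(resistances) == len(capacitances)
    downstream = sum(capacitances)   # capacitance seen past the first resistor
    delay = 0.0
    for r, c in zip(resistances, capacitances):
        delay += r * downstream
        downstream -= c              # this node's C is upstream of the next R
    return delay
```

For a two-stage ladder this reduces to the textbook form R1·(C1 + C2) + R2·C2.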

14

Hardware Cost Estimation

Component               Gate count   Instance count
DCStack                 204          1024
Selector (leaf)         0            64
Selector (1 child)      110          928
Selector (2 children)   256          2
Selector (3 children)   480          29
Selector (4 children)   682          1
Switch                  356          1024
Root                    59           1
Average per node        674.50

674.50 * 240 * 4 = 647.52 K = 0.046% of 1.4 B (NVIDIA GTX285)
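The 674.50 figure can be reproduced as the instance-weighted average of the table over 1024 nodes; the ×240 matches one IsoNet node per GTX285 core, and the ×4 is presumably transistors per gate — that factor is my assumption, used only to recover the slide's 647.52 K number.

```python
# Gate count and instance count per component, as read from the table above.
COMPONENTS = {
    "DCStack":              (204, 1024),
    "Selector, leaf":       (0,   64),
    "Selector, 1 child":    (110, 928),
    "Selector, 2 children": (256, 2),
    "Selector, 3 children": (480, 29),
    "Selector, 4 children": (682, 1),
    "Switch":               (356, 1024),
    "Root":                 (59,  1),
}

total_gates = sum(gates * count for gates, count in COMPONENTS.values())
avg_per_node = total_gates / 1024        # average gates per IsoNet node
transistors = avg_per_node * 240 * 4     # 240 nodes, assumed 4 transistors/gate
```

The selector instance counts (64 + 928 + 2 + 29 + 1) sum to 1024, one selector per node, which is what makes the per-node average well defined.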

15

Contents

• Introduction
• Background
• IsoNet
• Fault-tolerance
  – Transparent mode
  – Reconfiguration mode
• Evaluation
• Conclusion

16

Supporting Fault-Tolerance

• Transparent mode
  – For faulty CPUs
  – Bypass the corresponding IsoNet node
• Reconfiguration mode
  – For faulty IsoNet nodes
  – Operation
    • When a fault is detected, all IsoNet nodes go into reconfiguration mode
    • Reconfigure the topology of IsoNet so that the faulty node is excluded
    • Assign a new root node if the root node fails
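One way to picture the reconfiguration steps above is as rebuilding a spanning tree that never visits the faulty node, electing the first surviving root candidate. This is a sketch under that assumption — the slides do not spell out the hardware protocol, and the names here are illustrative.

```python
from collections import deque

def reconfigure(links, faulty, root_candidates):
    """Rebuild the routing tree around a fault: elect the first surviving
    root candidate, then BFS over the links, skipping faulty nodes."""
    root = next(r for r in root_candidates if r not in faulty)
    parent = {root: None}
    frontier = deque([root])
    while frontier:
        node = frontier.popleft()
        for neighbor in links.get(node, []):
            if neighbor not in faulty and neighbor not in parent:
                parent[neighbor] = node     # tree edge that avoids the fault
                frontier.append(neighbor)
    return root, parent
```

On a 4-node ring with node 0 faulty, node 1 becomes root and the tree threads around the failure through nodes 2 and 3.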

17

Reconfiguration

[Figure: reconfiguration example — hop counts radiating outward from a root node candidate as the tree is rebuilt around the excluded node]

18

Contents

• Introduction
• Background
• IsoNet
• Fault-tolerance
• Evaluation
  – Experimental setup
  – Results
• Conclusion

19

Experimental Setup

• Simulation framework
  – Wind River’s Simics full-system simulator
  – CMP with 4–64 x86-compatible cores
  – Fedora 12 with kernel 2.6.33
• Benchmarks from recognition, mining and synthesis applications
  – GS: Gauss-Seidel
  – MMM: Dense Matrix-Matrix Multiply
  – SVA: Scaled Vector Addition
  – MVM: Dense Matrix-Vector Multiply
  – SMVM: Sparse Matrix-Vector Multiply
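SMVM is a good example of why these kernels generate abundant, very short jobs: in compressed sparse row (CSR) form, each output row is one tiny independent computation. A reference sketch of the kernel (the CSR layout is standard; this is not the paper's code):

```python
def smvm(values, col_idx, row_ptr, x):
    """Sparse matrix-vector multiply, CSR format: row i uses the nonzeros
    values[row_ptr[i]:row_ptr[i+1]].  Each row is one short independent job."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

With only a handful of nonzeros per row, each "job" is a few thousand instructions at most — which is why queue-update conflicts, not computation, dominate SMVM at high core counts in the profile shown earlier.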

20

Results

[Figure: execution time (10⁷ cycles) and speedup vs. number of cores (4–64) for MMM (6,473 instructions) and SMVM (2,872 instructions), comparing job stealing, Carbon, and IsoNet]

21

Beyond Hundred Cores

• MMM (6,473 instructions)

[Figure: relative execution time of Carbon and IsoNet for MMM across 4–1024 cores]

22

Profile of IsoNet

[Figure: ratio of execution time spent on conflicts, stealing jobs, and processing jobs with IsoNet, for 4–64 cores]

23

Conclusion

• Scalability is one of the key challenges in the manycore domain
• Scalability in load balancing is critical to utilizing a large number of processing elements
• This paper proposes a novel hardware-based dynamic load distributor and balancer, called IsoNet
• IsoNet also provides comprehensive fault-tolerance support
• Experimental results from full-system simulation with real applications demonstrate that IsoNet scales better than alternative techniques

24

Questions?

Contact info

Junghee Lee
junghee.lee@gatech.edu
Electrical and Computer Engineering
Georgia Institute of Technology

25

Thank you!
