

Copyright 2013, Toshiba Corporation.

DAC2013 Designer/User Track

Scalability Achievement by Low-Overhead, Transparent Threads on an Embedded Many-Core Processor

Takeshi Kodaka, Akira Takeda, Shunsuke Sasaki, Akira Yokosawa, Toshiki Kizu, Takahiro Tokuyoshi, Hui Xu, Toru Sano, Hiroyuki Usui, Jun Tanabe, Takashi Miyamori and Nobu Matsumoto

Center for Semiconductor Research and Development
Toshiba Corporation


Background

• Requirements for embedded processors
  – Various types of processing
    • Video codecs (HEVC, H.264, MPEG-2, WMV, ...)
    • Face detection/recognition, audio/video playback, mobile TV
  – Wide range of required processing performance
    • Should deal with various types of products, from mobile phones to tablets or more
      – Example: video decoding from QVGA 15fps to 1080p 60fps or more
  – Low-cost, short-time development that meets market requirements
    • Reuse existing software to reduce development cost


Challenges

• What kind of hardware architecture to employ?
  – The number of cores should be easily increased/decreased
• How can we realize scalable performance?
  – Parallelized application program that utilizes multiple cores efficiently
• How can we realize transparency?
  – Hiding the number of cores from the application program

Multiple Core Architecture[xu2012low]

Our Proposed Scheduler

[xu2012low] A low power many-core SoC with two 32-core clusters connected by tree based NoC for multimedia applications, H. Xu, et al. VLSI Symposium 2012


Our approach

A simple multiple-core architecture
+ An application program independent of # of cores
+ An efficient parallel processing scheme
→ Achieving scalable performance


Strategy to realize our approach

• Strategy
  – Developing an application independent of # of cores → transparency
  – Running the developed application on a multiple-core processor and achieving performance proportional to # of cores → scalable performance
• Scheme
  – Designed an efficient thread scheduler
    • Efficient management of threads may achieve scalable performance
    • The number of cores may be hidden if a thread scheduler abstracts the cores
• Challenges
  – Minimizing overheads for execution
  – Hiding the number of cores from the application program


How to minimize overheads

• Defined unique properties for threads
  – A thread never suspends to wait for data
    • Eliminates the overhead of thread switching
  – A thread becomes ready to run when all necessary data are available
• Managed thread status using simple counters
  – Simplify the dependency into "the number of dependencies"
    • This can be realized by simple operations
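The counter idea above can be sketched as follows. This is a minimal illustration, not the scheduler's actual code: the class name `DepCounter`, its methods, and the use of C++ `std::atomic` are assumptions made for the example. Each thread holds the number of dependencies still outstanding, and the one decrement that brings it to zero is the event that makes the thread ready.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical per-thread dependency counter (names are illustrative,
// not taken from the slides).
class DepCounter {
public:
    explicit DepCounter(int dependencies) : pending_(dependencies) {
        assert(dependencies >= 0);
    }

    // Called once each time one piece of necessary data becomes available.
    // Returns true only for the call that satisfies the last dependency,
    // so exactly one caller is told to mark the thread ready.
    bool satisfy() {
        // fetch_sub returns the value held before the decrement.
        return pending_.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }

    // True once every dependency has been satisfied.
    bool ready() const {
        return pending_.load(std::memory_order_acquire) == 0;
    }

private:
    std::atomic<int> pending_;
};
```

Because a thread only starts after its counter reaches zero, it never has to block on missing data once running, which is what removes the thread-switching overhead.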


How to hide the number of cores

• Designed a distributed scheduler with a shared queue
  – ONLY ready threads are placed in a shared queue
  – A thread dispatcher runs on each core
  – The dispatcher fetches a thread from the shared queue and executes it
• To reduce access conflicts on the shared queue
  • We use the CAS (Compare And Swap) instruction
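One way a dispatcher could claim a ready thread with CAS is sketched below. The slides do not describe the queue layout, so the fixed-size slot array, the state names, and `try_fetch` are all assumptions for illustration. The point shown is only that a single compare-and-swap from READY to RUNNING lets many dispatchers search the same structure without a lock, and that each thread is claimed by exactly one of them.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical thread states and pool size (assumed for this sketch).
enum State : int { WAITING = 0, READY = 1, RUNNING = 2, DONE = 3 };
constexpr std::size_t kPoolSize = 8;

struct Pool {
    std::atomic<int> state[kPoolSize];
    Pool() {
        for (auto& s : state) s.store(WAITING);
    }
};

// Run by the dispatcher on each core: search for a ready thread and
// claim it. Returns the claimed slot index, or -1 if nothing is ready.
inline int try_fetch(Pool& pool) {
    for (std::size_t i = 0; i < kPoolSize; ++i) {
        int expected = READY;
        // The CAS succeeds for exactly one dispatcher per ready slot;
        // every other dispatcher sees RUNNING and moves on.
        if (pool.state[i].compare_exchange_strong(
                expected, RUNNING, std::memory_order_acq_rel)) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

A dispatcher loop would call `try_fetch`, run the claimed thread to completion, store `DONE`, and search again. Nothing in that loop depends on how many other dispatchers exist.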

[Figure: three cores, each running a Thread Dispatcher that searches the shared queue of threads, then fetches and executes one.]


Implemented thread scheduler

• Our Thread Scheduler consists of three components
  – Dependency Controller, Thread Pool, and Thread Dispatcher
• Our Thread Scheduler ...
  – is low overhead, for Scalable Performance
  – hides the number of cores from the application, for Transparency
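How the three components fit together can be illustrated with a small single-threaded simulation. This is a sketch under stated assumptions, not the real scheduler: `ThreadSpec`, `run`, `example_app`, and the time-stepped model (every simulated core fetches at most one ready thread per step) are hypothetical. The application only registers threads and their dependency counts, and the core count appears solely as an argument to `run`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical application-side description of one thread.
struct ThreadSpec {
    int necessary;                // number of inputs the thread needs
    std::vector<int> successors;  // threads that consume this thread's output
};

// Simulates the scheduler on `cores` cores and returns the number of
// time steps needed to finish every thread.
inline int run(const std::vector<ThreadSpec>& app, int cores) {
    assert(cores > 0);
    std::vector<int> available(app.size(), 0);  // Dependency Controller state
    std::vector<int> ready;                     // Thread Pool: ready threads only
    for (std::size_t i = 0; i < app.size(); ++i)
        if (app[i].necessary == 0) ready.push_back(static_cast<int>(i));

    int steps = 0;
    std::size_t finished = 0;
    while (!ready.empty()) {
        // Each core's Thread Dispatcher fetches one ready thread, if any.
        std::vector<int> running;
        for (int c = 0; c < cores && !ready.empty(); ++c) {
            running.push_back(ready.back());
            ready.pop_back();
        }
        // Threads run to completion; the Dependency Controller then counts
        // the newly available inputs of their successors.
        for (int t : running) {
            ++finished;
            for (int s : app[t].successors)
                if (++available[s] == app[s].necessary) ready.push_back(s);
        }
        ++steps;
    }
    assert(finished == app.size());  // fails if the graph contains a cycle
    return steps;
}

// Example graph (an assumption): one source, six independent workers, one sink.
inline std::vector<ThreadSpec> example_app() {
    std::vector<ThreadSpec> app(8);
    app[0] = {0, {1, 2, 3, 4, 5, 6}};
    for (int i = 1; i <= 6; ++i) app[i] = {1, {7}};
    app[7] = {6, {}};
    return app;
}
```

Passing the same `example_app()` value to `run` with 1, 2, 4, and 8 cores gives 8, 5, 4, and 3 steps. With 32 cores it still takes 3 steps, because this graph never has more than six threads ready at once.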

[Figure: Thread Scheduler. The application (Appl.) registers threads with the Dependency Controller, which holds "necessary" and "available" counts per thread; ready threads are placed in the Thread Pool; a Thread Dispatcher on each Core fetches and executes them.]

Designing a many-core processor

• Design goals for a many-core processor
  – Achieve scalable performance
  – Reuse existing software for a multi-core processor
    • A many-core processor has to execute existing software efficiently
    • Knowledge of the software is absolutely necessary

Software engineers and hardware engineers collaborated closely to design a many-core processor

• Design cycles
  – Use a "Plan – Evaluate – Analyze – Improve" cycle
  – Existing software is used throughout evaluation
  – In the 1st cycle: detect issues of the existing architecture
  – In the 2nd cycle: improve and optimize
• Main design features from our development cycle
  – CAS instruction, multi-bank L2 cache, tree-based network on chip,

[Figure: design cycle: Plan, Evaluate using Simulation, Analyze, Improve]

Evaluation results

• Used the SAME application binary even when the number of cores is changed

[Figure: evaluation plots for "H.264 Decoding 1080p" and "Super resolution (full HD to 4K2K)", each labeled "Scalable Performance". An annotation reads "Lack of READY threads: # of ready threads < # of MPEs".]

These results confirm that the proposed thread scheduler achieves scalable performance with transparency!

Conclusions

• Proposed a low-overhead thread scheduler
  – It achieves scalable performance and transparency
  – Reduces thread execution overheads
    • Defined unique properties for a thread
      – A thread never suspends
      – A thread becomes ready when all necessary data are available
    • Managed thread status by the number of dependencies
  – Hides the number of cores
    • Designed a distributed scheduler with a shared queue
• Confirmed performance scalability and transparency
  – Evaluated on a real 32-core many-core processor
  – Scalable performance is achieved without modification of the application program

Our scheduler contributes to the reduction of software development cost