adam: run-time agent-based distributed application mapping for on-chip communication

ADAM: Run-time Agent-based Distributed Application Mapping for On-chip Communication

林鼎原

Department of Electrical Engineering

National Cheng Kung University

Tainan, Taiwan, R.O.C.

112/04/20


Abstract (1/5)

Design-time decisions can often cover only certain scenarios and lose efficiency when hard-to-predict system scenarios occur.

This drives the development of run-time adaptive systems.

To the best of our knowledge, we present the first scheme for run-time application mapping in a distributed manner using agents, targeting adaptive NoC-based heterogeneous multi-processor systems.


Abstract (2/5)

Some events that may require a re-mapping at run time in an adaptive system, where design-time mapping algorithms fail, are given below:

On-line detection of hardware faults.

Minimizing run-time system costs (e.g. saving energy because of a low battery status).

Changing user requirements, e.g. the user wants to switch video playback to a higher resolution.

The system is analyzed during run time and self-adapts in terms of when and how a mapping algorithm should be invoked.


Abstract (3/5)

Our novel contributions are as follows:

(1) We provide a run-time agent-based distributed mapping algorithm for next-generation self-adaptive heterogeneous MPSoCs. Our mapping algorithm is composed of two main parts: (a) virtual cluster selection and cluster reorganization at run time, and (b) a mapping algorithm inside a cluster at run time.

(2) We propose a run-time cluster negotiation algorithm that generates virtual clusters to solve the problems of a centralized mapping algorithm (e.g. a single point of failure).

(3) We present a heuristic-based mapping algorithm that is low-cost in terms of execution cycles on any instruction-set processor and that minimizes the communication-related energy consumption.


Abstract (4/5)

Small system with few tiles: low traffic, low computational effort.

However, some problems arise when the system is scaled up to hundreds or thousands of cores.


Abstract (5/5)

With hundreds or thousands of cores:

Scalability issues.

Single point of failure for the whole chip.

High computation complexity.

Therefore, the approach shown in the figure on the right is proposed: a hierarchical approach.


Some Definitions (1/4)

In the following we introduce our run-time Agent-based Distributed Application Mapping (ADAM) for a heterogeneous MPSoC with a NoC.

Definition 1: An application communication task graph (CTG) is a directed graph $G_k = (T, F)$.

$T$ is the set of all tasks of an application.

$F$ is the set of all flows; each flow $f_{i,j} \in F$ connects the tasks $t_i$ and $t_j$ and is annotated with the inter-task bandwidth requirement.

Definition 2: A heterogeneous MPSoC architecture in a NoC platform, HMPSoCNoC, is a directed graph $P = (N, V)$.

The vertex set $N$ is the set of tiles $n_i$.

Each edge $v_{i,j} \in V$ represents the physical channel between two tiles $n_i$ and $n_j$.

A tile $n_i \in N$ is composed of a heterogeneous PE, a network interface, a router, local memory and a cache.
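Definitions 1 and 2 can be pictured as two small data structures. The sketch below is only an illustration: the class names, the field names and the 2D-mesh neighbourhood used to build the channels are assumptions, since the slides only state that both objects are directed graphs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    src: str          # source task t_i
    dst: str          # destination task t_j
    bandwidth: float  # inter-task bandwidth requirement


@dataclass
class CTG:
    tasks: set    # T: all tasks of the application
    flows: list   # F: all flows between connected tasks


@dataclass(frozen=True)
class Tile:
    x: int
    y: int
    pe_type: str  # type of the heterogeneous PE on this tile


def mesh_channels(tiles):
    """Directed channels (n_i, n_j) between neighbouring tiles.

    Assumption: the NoC is a 2D mesh, so channels exist only between
    tiles that differ by one step in x or y.
    """
    by_coord = {(t.x, t.y): t for t in tiles}
    channels = []
    for (x, y), tile in by_coord.items():
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            neighbour = by_coord.get((x + dx, y + dy))
            if neighbour is not None:
                channels.append((tile, neighbour))
    return channels
```

For a 2x2 mesh this yields eight directed channels, two per physical link.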


Some Definitions (2/4)

Definition 3: A cluster is a subset $C_i \subseteq N$, where $N$ is the set of tiles $n_j$ that belong to the HMPSoCNoC.

A virtual cluster $C_{vi}$ is a cluster with no fixed boundaries that decide which tiles are included and which are not. It can be created, resized and destroyed at run time.

Definition 4: An agent $Ag$ is a computational entity that acts on behalf of others.

The properties of an agent in our scheme are:

An agent is a smaller task closer to the system.

It must do resource management.

It may need memory to store state information for the resources.

It must be executable on any processing element.

It must be migratable.

It must be recoverable.

It may be destroyed if the cluster no longer exists.


Some Definitions (3/4)

Definition 5: A cluster agent $CA \in Ag$ is an agent that is responsible for mapping operations within the cluster $C_i$. The cluster agent is located in a processing element $p_j$, where the index $j$ of $p_j$ denotes that the cluster agent can be mapped to any PE of the cluster.

Definition 6: A global agent $GA$ is an agent that stores the information for performing the mapping operations to a selected cluster.

It stores information regarding the current usage of communication and computation resources for each cluster. This information is used for the selection and re-organization of the clusters.

The GA is movable, and the information it stores is light-weight and easy to move.


Some Definitions (4/4)

Definition 7: The application mapping function is given by $m : t_i \in T \rightarrow n_j \in N$.

Definition 8: A binding is a function $b : t_i \in T \rightarrow tp_j \in T_{ps}$, where $T$ is the set of all tasks of an application and $T_{ps}$ is the set of the PE types that are used on the HMPSoCNoC.

The function assigns each task $t_i$ of the CTG to a favorable type of PE. After the binding operation is completed, the tasks are allowed to be mapped only to PEs of the type given by the binding function $b$.
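The relation between the mapping function m and the binding function b can be checked mechanically. The following is a minimal sketch, assuming that both functions and the tile-to-PE-type table are held as plain dictionaries; the names `respects_binding`, `mapping`, `binding` and `tile_type` are hypothetical and not taken from the slides.

```python
def respects_binding(mapping, binding, tile_type):
    """Return True if every task is mapped to a tile whose PE type
    equals the type chosen for that task by the binding function.

    mapping:   task -> tile      (the mapping function m)
    binding:   task -> PE type   (the binding function b)
    tile_type: tile -> PE type of that tile
    """
    return all(
        tile_type[tile] == binding[task]
        for task, tile in mapping.items()
    )


# Illustrative values only.
binding = {"t1": "tp1", "t2": "tp2"}
tile_type = {"n1": "tp1", "n2": "tp2", "n3": "tp2"}

valid = respects_binding({"t1": "n1", "t2": "n3"}, binding, tile_type)
invalid = respects_binding({"t1": "n2", "t2": "n3"}, binding, tile_type)
```

Here `valid` is True, while `invalid` is False because t1 is bound to tp1 but placed on a tile of type tp2.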


The ADAM Flow (1/3)

An overview of our ADAM system is presented in Fig. 1. The run-time mapping in our scheme is achieved by a negotiation policy among the Cluster Agents (CAs) and Global Agents (GAs) that are, at a given instant of time, distributed over the whole chip.

In Fig. 1 an application mapping request is sent to the CA of the requesting cluster, which receives all mapping requests and negotiates with the GAs.

The GAs have global information about all the clusters of the NoC in order to decide onto which cluster the application should be mapped.


The ADAM Flow (2/3)

Possible replies to this mapping request are:

1. When a suitable cluster for the application exists, the GAs inform the requesting source CA, and the source CA asks the suitable destination CA to perform the actual mapping of the application.

2. When no suitable cluster is found, the GAs report the next most promising cluster, onto which the application can be mapped after task migration. The migration is negotiated between the GA and the CA to make this cluster suitable for the mapping.

3. When neither a suitable cluster nor a candidate cluster for task migration is found, the re-clustering concept is used.

If none of the above options leads to a successful mapping (the application and system constraints are not met), the mapping request is refused and the refusal is reported to the requester.
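The order in which these replies are tried can be sketched as a simple fall-through. This is an illustration of the control flow only: the three callables stand in for the negotiation steps between the CAs and GAs and are hypothetical placeholders, each returning a cluster or None.

```python
from enum import Enum


class Reply(Enum):
    MAP_TO_SUITABLE_CLUSTER = 1
    MAP_AFTER_TASK_MIGRATION = 2
    MAP_AFTER_RECLUSTERING = 3
    REFUSED = 4


def negotiate(find_suitable, find_migration_candidate, try_reclustering):
    """Try the three options in order and refuse if none succeeds.

    Each argument is a zero-argument callable (assumed interface) that
    returns a cluster identifier, or None when that option fails.
    """
    cluster = find_suitable()
    if cluster is not None:
        return Reply.MAP_TO_SUITABLE_CLUSTER, cluster

    cluster = find_migration_candidate()
    if cluster is not None:
        return Reply.MAP_AFTER_TASK_MIGRATION, cluster

    cluster = try_reclustering()
    if cluster is not None:
        return Reply.MAP_AFTER_RECLUSTERING, cluster

    return Reply.REFUSED, None
```

For example, `negotiate(lambda: None, lambda: "C2", lambda: None)` falls through to the second option and returns the task-migration reply together with the cluster "C2".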



The ADAM Flow (3/3)


Cluster Negotiation Algorithm (1/5)

The algorithms have the following important input and output data objects:

The application CTG, G, with the required computational resource profiles for each task. G is given by a set of entries, one per flow: entry = (idsrc, iddst, bwreq, lat, RRtp).

idsrc and iddst are the ids of the source and destination tasks of the flow.

bwreq is the required bandwidth of the flow.

lat is the communication latency.

RRtp is the resource requirement on each PE type that a task needs to ensure a successful execution.

The state information about all clusters is stored in a summarized format by the GAs (Table 1 and data object nhistc).


Cluster Negotiation Algorithm (2/5)

Energy model: To make a binding decision, the energy consumption of the different PE types at different resource requirement levels is needed.

As an example, in Fig. 2(b) the energy consumption of PE type tp2 is specified by two values, tp2 : (4X, 12X). This means that, in a fixed time, each PE of type tp2 consumes:

4 units of energy (static energy consumption) when it uses no processing resources;

12 units of energy when it uses the complete PE resources;

otherwise, for a resource requirement level $u$, $E = u \cdot (E[100\%] - E[0\%]) + E[0\%]$.

Fig. 2.
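The linear energy model is easy to check numerically. The sketch below evaluates E = u · (E[100%] − E[0%]) + E[0%] for the tp2 values (4X, 12X) quoted on the slide; the function name `pe_energy` and the choice to express u as a fraction between 0 and 1 are assumptions made for illustration.

```python
def pe_energy(u, e_idle, e_full):
    """Energy of a PE in a fixed time at utilisation u.

    u:      fraction of the PE resources in use, 0.0 to 1.0 (assumed scale)
    e_idle: E[0%], the static energy when no resources are used
    e_full: E[100%], the energy when the complete PE is used
    """
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must lie between 0 and 1")
    return u * (e_full - e_idle) + e_idle


# PE type tp2 from the slide: (4X, 12X).
idle = pe_energy(0.0, 4, 12)   # 4 units
half = pe_energy(0.5, 4, 12)   # 0.5 * (12 - 4) + 4 = 8 units
full = pe_energy(1.0, 4, 12)   # 12 units
```

At half utilisation a tp2 PE therefore consumes 8 units, midway between its static and full-load values.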


Cluster Negotiation Algorithm (3/5)

thist[] and nhistc[] are two data objects that store the resource requirement histograms within the local memory of the CAs and GAs:

thist stores the resources required by the tasks.

nhistc stores the actual PE resource usage status of the cluster c (see Fig. 2(e), (f)).

Tasks are classified by their computation resource requirements, using the following notation:

t: task

b(t): binding

tp: PE type

k: class

u(tp, t): resource requirement

ncla: number of classes

The matching of the two data objects nhistc and thist is given by equation (1).
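To make the histogram idea concrete, the sketch below builds a thist-like object by counting, per PE type, how many tasks fall into each resource requirement class. The equal-width classification rule and the function names are assumptions; the slides do not state how class boundaries are chosen, and the matching of equation (1) is not reproduced here.

```python
def resource_class(u, ncla):
    """Class index k for a requirement u in [0, 1].

    Assumption: ncla equal-width classes; u = 1.0 falls into the last one.
    """
    return min(int(u * ncla), ncla - 1)


def build_thist(tasks, binding, req, pe_types, ncla):
    """Histogram of task requirements per PE type.

    binding: task -> PE type          (b(t))
    req:     (PE type, task) -> u     (u(tp, t), as a fraction)
    Returns a dict: PE type -> list of ncla task counts.
    """
    thist = {tp: [0] * ncla for tp in pe_types}
    for t in tasks:
        tp = binding[t]
        thist[tp][resource_class(req[(tp, t)], ncla)] += 1
    return thist


# Illustrative values only.
binding = {"t1": "tp1", "t2": "tp1", "t3": "tp2"}
req = {("tp1", "t1"): 0.1, ("tp1", "t2"): 0.9, ("tp2", "t3"): 0.5}
thist = build_thist(["t1", "t2", "t3"], binding, req, ["tp1", "tp2"], ncla=4)
```

With four classes, t1 and t2 land in the lowest and highest class of tp1, and t3 lands in the third class of tp2.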


Cluster Negotiation Algorithm (4/5)

In Fig. 2 we present an example of the cluster searching procedure.

Fig. 2(a) shows the task graph of an application that is requested to be mapped.

Fig. 2(b) gives the energy consumed by the various PE types at different resource requirement levels.

Fig. 2(c) gives the resource requirements of the tasks. They are used to calculate the actual required energy consumption for every task on the different types of PEs (Fig. 2(d)).

Fig. 2(e) shows the resource requirement profile used to create the histogram corresponding to the data object thist[].

Fig. 2(f) presents the histogram nhistc[] for a cluster.

Fig. 2(g) presents the new binding and the selection of the cluster.


Cluster Negotiation Algorithm (5/5)


The Mapping Algorithm (1/5)

To decide to which tile of a particular PE type a task should be mapped, a heuristic is used. It is described by the cost function c(t, n) for the selection of a tile nj for a given task ti, which uses the following terms:

D(n) is the average distance of a tile to all other tiles of the cluster.

d(k) is the Manhattan distance between the mapped tasks.

vol(k) is the communication volume between the connected tasks.

RR(nj) is the resource requirement of the PE that will be assigned to the task.

bwt(nj) is the total bandwidth requirement of the tasks on the tile.
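Two of these terms, the Manhattan distance and the average distance D(n), can be computed directly from tile coordinates. The sketch below assumes that tiles are identified by (x, y) positions on a 2D mesh and that D(n) also uses the Manhattan metric; it does not reproduce the cost function c(t, n) itself, which is not given in the text.

```python
def manhattan(a, b):
    """Manhattan distance between two tiles given as (x, y) tuples."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def average_distance(tile, cluster):
    """D(n): mean distance from `tile` to all other tiles of the cluster.

    Assumption: distance is measured as Manhattan distance on a 2D mesh.
    """
    others = [n for n in cluster if n != tile]
    if not others:
        return 0.0
    return sum(manhattan(tile, n) for n in others) / len(others)


# A 3x3 cluster: the centre tile is closer on average than a corner tile.
cluster = [(x, y) for x in range(3) for y in range(3)]
d_centre = average_distance((1, 1), cluster)   # 12 / 8 = 1.5
d_corner = average_distance((0, 0), cluster)   # 18 / 8 = 2.25
```

The centre tile has the smaller average distance, which is why a distance term of this kind favours central tiles for tasks that communicate heavily.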


The Mapping Algorithm (2/5)

In the following, Alg. 2 is explained using an example (see Fig. 5).


The Mapping Algorithm (3/5)

Fig. 5(a) presents a task graph whose tasks were grouped by the binding function (shown in different colors) in the earlier negotiation stage.

Fig. 5(b) presents a part of the tiles of the current cluster.

Fig. 5(f) presents the computational resource requirements for each task of the task graph.

Fig. 5(g) shows the resources currently in use on some of these tiles. The availability of the resources is presented by the ordered column in a table (Fig. 5(d)).

Fig. 5(e) shows the first set of flows, ftp2, that connect PEs of PE type 2: {f12, f13, f34}. The flows are sorted in decreasing order of their bandwidth requirements.

The result of a successful mapping is illustrated in Fig. 5(c).
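The ordering step for a flow set such as {f12, f13, f34} can be sketched in a few lines. The bandwidth values below are invented for illustration, since the figure data is not available in the text, and the tuple layout (source, destination, bandwidth) is an assumption.

```python
def sort_flows_by_bandwidth(flows):
    """Return the flows in decreasing order of bandwidth requirement.

    Each flow is a (source task, destination task, bandwidth) tuple.
    """
    return sorted(flows, key=lambda flow: flow[2], reverse=True)


# Hypothetical bandwidth requirements for f12, f13 and f34.
flows_tp2 = [
    ("t1", "t2", 20),
    ("t1", "t3", 50),
    ("t3", "t4", 30),
]
ordered = sort_flows_by_bandwidth(flows_tp2)
# ordered: f13 (50), then f34 (30), then f12 (20)
```

Handling the most demanding flows first gives them the widest choice of tiles while the cluster is still lightly loaded.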


The Mapping Algorithm (4/5)

The pseudo code of the run-time mapping algorithm inside each cluster is presented in Algorithm 2. The input data are:

The CTG of the application, which contains the communication cost of each flow fij between the tasks ti and tj.

The model tileLUT,clu of the HMPSoCNoC, which stores the current state of the used computation and communication resources of that particular cluster. The tile-LUT contains each tile's current computation resource usage, the type of the PE of this tile (tpPE), and the current bandwidth usage of each link.

The output (mpng) is the mapping of tasks to tiles of the network, which is used to allocate the tiles physically on the network.
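One way to picture the tile-LUT is as a table keyed by tile id. The sketch below is an assumption-laden illustration, not Algorithm 2: the `TileEntry` fields, the use of fractions for resource usage and the feasibility rule (PE type matches and enough free capacity remains) are all hypothetical, and link bandwidth is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class TileEntry:
    pe_type: str   # type of the PE on this tile
    usage: float   # fraction of computation resources currently in use


def candidate_tiles(tile_lut, pe_type, requirement):
    """Tiles of the requested PE type that can still host a task.

    Assumed rule: a tile qualifies when its current usage plus the
    task's requirement does not exceed the full PE (1.0).
    """
    return [
        tile_id
        for tile_id, entry in tile_lut.items()
        if entry.pe_type == pe_type and entry.usage + requirement <= 1.0
    ]


# Illustrative cluster state.
tile_lut = {
    "n1": TileEntry("tp2", 0.3),
    "n2": TileEntry("tp2", 0.8),
    "n3": TileEntry("tp1", 0.0),
}
free_for_task = candidate_tiles(tile_lut, "tp2", 0.5)
```

Only n1 qualifies here: n2 is too heavily used and n3 has the wrong PE type.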


The Mapping Algorithm (5/5)


Result (1/2)

Result comparison on a system with 2048 tiles: 7 times lower computational effort compared to Nearest Neighbor.

98304 / 13863 ≈ 7.09


Result (2/2)

64x64 NoC: 10.7 times lower traffic in ADAM compared to a centralized scheme (2551 / 238.9 ≈ 10.7).


Conclusion

We have introduced the first scheme for run-time application mapping in a distributed manner using an agent-based approach. We target adaptive NoC-based heterogeneous multi-processor systems.

It provides 7 times lower computational effort compared to the Nearest Neighbor (NN) heuristic.

It produces 10.7 times lower traffic for the mapping functionality compared to a centralized scheme.
