

Analysis and Modeling of Advanced PIM Architecture

Design Tradeoffs

Ed Upchurch

Thomas Sterling

California Institute of Technology

Jay B. Brockman

University of Notre Dame

Abstract

Processor in Memory or PIM architecture offers dramatic improvements in performance for computations

that exhibit poor locality. PIM provides high memory bandwidth and low access latency on-chip. Future PIMs may incorporate more nodes, multithreading for local latency hiding, and lightweight message-

driven computing to tolerate system-wide latencies. This paper describes a series of queuing simulation experiments and analytical studies using statistical steady-state parametric models to evaluate the design

tradeoff space of these advanced concepts in PIM. The results show a range of improvements as a function

of structural and operational parameters.

1. Introduction

Processor in Memory or PIM architecture incorporates arithmetic units and control logic directly on the

semiconductor memory die to provide direct access to the data in the wide row buffer of the memory. PIM

offers the promise of superior performance for certain classes of data intensive computing through a

significant reduction in access latency, a dramatic increase in available memory bandwidth, and expansion

of the hardware parallelism for flow control. Advances in PIM architecture under development incorporate

innovative concepts to deliver high performance and efficiency in the presence of low data locality. These

include the use of PIM to augment and complement conventional microprocessor architectures, the use of a

large number of on-chip PIM nodes to expose a high degree of memory bandwidth, and the use of message-

driven computation with a transaction-oriented producer-consumer execution model for system-wide

latency tolerance. All of these have benefited from previous work and this study extends those experiences

to the domain of PIM. This paper explores the design space of several innovations being considered for

PIM through a set of statistical steady-state parametric models that are investigated by means of queuing

simulation and analyses.

While the advanced PIM concept is encouraging, it is not proven. In order to both prove the effectiveness

of this new class of architecture and to quantitatively characterize the design tradeoff space to enable

informed choices of resource allocation, a set of simulation experiments and analytical studies were

conducted. These include 1) the modeling of the interrelationship between the PIM components and their

host microprocessor, 2) an investigation of the optimal number of nodes that should be implemented on a

chip, and 3) an exploration of the global latency hiding properties of parcels between PIM devices and local

latency hiding properties on the PIM device. This paper describes these experiments, presents the results

and findings, and discusses their implications for the future design and operation of advanced PIM

architecture and the systems that incorporate them. Section 2 describes the basic concepts and identifies


important relevant prior work in the field. Section 3 describes the simulation and analysis experiments and

presents their results. Finally, section 4 discusses the implications of these findings for future PIM design

and briefly suggests work necessary to broaden and confirm these initial conclusions.

2. Background

Processing-in-memory encompasses a range of techniques for driving computation into a memory system.

This involves not only the design of processor architectures and microarchitectures appropriate to the

properties of on-chip memory, but also execution models and communication protocols for initiating and

sustaining memory-based program execution.

2.1 Reclaiming the Hidden Bandwidth

The key architectural feature of on-chip memory is the extremely high bandwidth that it provides. A single

DRAM macro is typically organized in rows with 2048 bits each. During a read operation, an entire row is

latched in a digital row buffer just after the analog sense amplifiers. Once latched, data can be paged out of

the row buffer to the processing logic in wide words of typically 256 bits. Assuming a very conservative

row access time of 20 ns and a page access time of 2 ns, a single on-chip DRAM macro could sustain a

bandwidth of over 50 Gbit/s. Much of PIM research has focused upon reclaiming this hidden bandwidth,

either through new organization for conventional architectures or through custom ISAs.
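
As a quick check on this figure, the sustained bandwidth of one macro follows directly from the quoted timings; the short sketch below assumes the 2048-bit row is drained as back-to-back 256-bit page accesses after the initial row access (an illustrative simplification, not a claim about any particular DRAM macro).

```python
# Back-of-the-envelope check of the >50 Gbit/s claim for a single on-chip DRAM macro,
# assuming back-to-back page accesses once the row is latched.
row_bits = 2048      # bits per row
page_bits = 256      # bits per page access out of the row buffer
t_row_ns = 20.0      # conservative row access time (ns)
t_page_ns = 2.0      # page access time (ns)

pages = row_bits // page_bits                    # 8 page transfers per row
t_total_ns = t_row_ns + pages * t_page_ns        # 36 ns to stream out one row
print(f"{row_bits / t_total_ns:.1f} Gbit/s")     # ~56.9 Gbit/s, i.e. over 50 Gbit/s
```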

Several studies have demonstrated that simple caches designed for on-chip DRAM can yield performance

comparable to classical memory hierarchies, but with much less silicon area. In [SPN96], researchers at

Sun investigated the performance of very wide, but shallow caches that transfer an entire cache line in a

single cycle. Using a Petri-net model, they showed that as a result of the lower miss penalty, a PIM with a

simple 5-stage RISC pipeline running at 200 MHz would have comparable performance to a DEC Alpha

21164 running at 300 MHz, with less than one-tenth the silicon area. Work at Notre Dame showed similar

performance results for a sector cache implemented by adding tag bits directly to the row buffers in DRAM

[BZKJ98]. Early simulation results from the Berkeley IRAM project showed that in addition to improved

performance-per-area, PIM could also have much lower energy consumption than conventional

organizations [FPC+97].

Even greater performance gains are possible through architectures that perform operations on multiple data

words accessed from memory simultaneously. Many such designs have been implemented or proposed

[BKF+99, FPC+97, HKK+99, KGM+00, Kir02, LY99]. The Berkeley VIRAM has 13 Mbytes of DRAM,

a 64-bit MIPS scalar core, and a vector coprocessor with 2 pipelined arithmetic units, each organized

into 4 parallel vector lanes. VIRAM has a peak floating-point performance of 1.6 Gflop/s, and shows

significant performance improvements in multimedia applications over contemporary superscalar, VLIW,

and DSP processors [KGM+00]. The DIVA PIM employs a wide-word coprocessor unit supporting SIMD

operations similar to the Intel MMX or PowerPC Altivec extensions. Using a memory system enhanced

with DIVA PIMs produced average speedups of 3.3 over host-only execution for a suite of data-intensive

benchmark programs [HKK+99]. Memory manufacturer Micron’s Yukon chip is a 16 Mbyte DRAM with

a SIMD array of 256 8-bit integer ALUs that can sustain an internal memory bandwidth of 25.6 Gbytes/s

[Kir02].

2.2 Computation and Communication in Massively-Parallel PIM Systems

The benefits of PIM technology can be further exploited by building massively-parallel systems with large

numbers of independent PIM nodes (an integrated memory/processor/networking device). Many examples

of fine-grain MPPs have been proposed, designed and implemented in the past—for example [Hil81] and

others. All have faced stiff challenges in sustaining significant percentages of peak performance related to

the interaction of computation and communication, and there is no reason to assume that networked PIM

devices would be immune to the same problems. PIM does, however, provide a technology for building

massive, scalable systems at lower cost, and for implementing highly efficient mechanisms for coordinating

computation and communication.

One of the key potential cost advantages of PIM is the ability to reduce the overhead related to memory

hierarchies. In [MSS95], it was first suggested that a petaflops scale computer could be implemented with a

far lower chip count using PIM technology than through a network of traditional shared memory machines


or through a cluster of conventional workstations. The J-Machine was one of the computers envisioned as

using DRAM based PIM components for an MPP, although for engineering considerations the system was

eventually implemented in SRAM technology [DCC+]. Execube [Kog94] was the first true MIMD PIM,

with 8 independent processors connected in a binary hypercube on a single chip together with DRAM.

More recently, IBM’s original Blue Gene [Den00] and current BG/L designs [Adi+02] both use embedded

DRAM technology in components for highly scalable systems.

Although related, the semantics of requests made of a PIM system differ somewhat from messages in

classic parallel architectures. HTMT and related projects introduced the concept of parcels (parallel

communication elements) for memory-borne messages, which range from simple memory reads and writes,

through atomic arithmetic memory operations, to remote method invocations on objects in memory [SB99,

BKF+99].

There are various ways that one could characterize and set performance objectives for PIM networks

communicating through parcels. A useful approach is to view the PIM network as a transaction-processing

system, where two important, related figures of merit are the latency in servicing a single transaction and

the throughput, or number of transactions serviced per unit time. As with fine-grain MPPs, the keys to

performance for PIM systems are minimizing the overhead of context switching and communication, and

overlapping communication and useful computation wherever possible. Several projects have addressed

hardware support for low-overhead communication, notably the MDP [DCF+92] and J-Machine [NWD93].

Active messages [vECGS92] is a software solution that minimizes communication overhead by including

the address of a user-level routine with the message that efficiently unpacks the message and integrates it

into ongoing computation. A variety of architectures and execution models have also addressed support for

overlapping computation and communication, all of which entail mechanisms for scheduling operations out of a

pool of available work. These include dataflow machines [AN90, PC90], multithreading [Smi78, Smi91,

CGSV93], and hybrids [Ian88, NA89]. PIM Lite is a recent PIM architecture and prototype

implementation that efficiently uses wide words out of memory to integrate multithreading and fast parcel

response with SIMD arithmetic operations [BKF+99, BKKK02].

2.3 The Need for Design Space Exploration

As the previous sections show, many architectural and implementation options currently exist for

exploiting PIM technology. What is lacking, however, is a framework for evaluating tradeoffs between

options in designing balanced, cost-effective systems: what follows is the beginning of such a framework.

Specifically, we have developed a set of analytic and simulation models that help provide insight into some

of the key questions affecting the configuration of PIM systems.

The first set of analyses addresses tradeoffs in partitioning a computation into heavyweight/high

temporal locality threads running on a conventional host processor and lightweight/low temporal

locality threads running in PIM. Parameters of the model include the number of PIM nodes, the

percentage of the application with low temporal locality, and the system configuration.

The second set of analyses addresses tradeoffs involving the ability to utilize the high on-chip

memory bandwidth and the balance between memory and processor area on a chip. Parameters of

this model include the probability that a given memory access in an application hits in a row of

memory, which is related to how well concurrent operations such as SIMD or vectors could be

used.

The third set of analyses investigates how effectively parcels can hide latency (or overlap

communication and computation). This work is related to prior work in studying the effectiveness

of multithreading [SBCvE90], but is set in a PIM context.

The following sections provide the details of these models and their results.


3. Experiments and Results

HyPerformix Workbench (formerly called SES/workbench) [Hyp] was used for the queuing modeling.

MATLAB [Mat] and Excel [Mic] were used for the analytical modeling. Workbench is a hierarchical

transaction-oriented discrete event simulation modeling tool. Workbench’s set of high-level simulation and synchronization operations and its extensive set of statistical and queuing functions, coupled with the ability to extend the language with embedded C code, support modeling of large massively parallel systems. A graphical user interface with model animation and tracing makes Workbench a suitable rapid prototyping

tool for this work.

3.1 A Queuing Model of a Basic PIM-based System

The Workbench queuing model comprises a master or heavyweight processor (HWP) and a set of PIM or

lightweight processors (LWP) in the main memory as in the block diagram of Figure 1.

Although similar in form, the two classes of processor are distinguished by their operational parameter

values as shown in Table 1.

Also, the HWP includes a cache but experiences a relatively long access time to main memory on a cache

miss. The LWP has no cache but is physically adjacent to the memory row buffer and so exhibits much

shorter memory access times. Figure 2 presents the simple queue model for the HWP and Figure 3

provides the corresponding queue model for the array of LWP and memories. Note that for simplicity, the

model treats the main memory accessed by the HWP and LWP as separate devices but this is simply an

artifact of convenience and does not impact the simulation results. Bank conflicts are not modeled, but the nature of the workload modeled for these experiments precludes this kind of resource contention, so no

inaccuracies are introduced in the final results.

The experimental workload divides the operations between the HWP and the array of LWP. For those

threads of activity that exhibit high temporal locality such that good cache hit rates should be expected, the

HWP is scheduled to perform them. For those threads of activity that exhibit low or no temporal locality

that would result in very poor cache performance, the set of LWP/memory components are scheduled to

perform them. At any one time, either the HWP or LWP array is executing but not both. We also assume

that the LWP workload is partitionable into a number of concurrent threads of uniform length, one per LWP. This execution flow is depicted in Figure 4.

While somewhat constraining, the experimental workload permits simple statistical characterization and is

representative of many important classes of real-world algorithmic behavior, though by no means all. The

parameters used to specify the workload are also given in Table 1.

3.1.1 Experimental Results from the Queuing Simulation

Two experiments were performed: 1) a control run in which the HWP performed all of the work, and 2) the

test runs in which the low locality threads were performed on a set of LWP nodes. For both cases, the

amount of low locality work measured as the percentage of operations was varied across a parameter range

of between 0% and 100%. For the test runs, the number of LWP nodes was varied as well in a range typical

of a modest scale system. The performance gains of the test runs with respect to the control run were

calculated as a function of the fraction of LWP workload for different number of LWP nodes as shown in

Figure 5.

It is seen that even for a small amount of LWP work, including PIMs in the system may double the

performance. If the application is data intensive, a significant portion of the total work is scheduled on the

array of LWP nodes and as much as an order of magnitude performance gain may be achieved. In the

extreme case where essentially all work resides on the LWP array, at least for some configurations, a factor


of 100X gain is observed. These results, if substantiated through further studies, imply important

advantages of PIM-based systems with respect to their conventional counterparts.

3.1.2 Analytical Model of PIM-based Operation

To better understand the simulated results, an analytical model was developed incorporating the same

operational parameters. The results derived from the simulation were reproduced with this analytical model

to an accuracy of between 5% and 18%. This encouraging result motivated a second analytical study to

expose the basic time to solution normalized to that of the HWP alone performing only high temporal

locality work; i.e. 0% LWP workload. The equations are given below:

Equation 1. Analytical Expression for Relative Execution Time

This formulation exposes a remarkable and entirely unanticipated property: in addition to the two independent parameters of number of nodes (N) and percentage of LWP workload (%WL), a third orthogonal parameter, here referred to as NB, emerges from the combined properties of the system configuration and application workload. This theoretical model is plotted in Figure 6.

From this diagram, it is evident that a point of coincidence occurs at a specific value of N, independent of

%WL. The derived equation for NB also shows that it is orthogonal to N. For N > NB, time to solution with

PIM support will always be as good as or better than that of the control system without PIM elements. If the form of

this relationship is sustained as the underlying model grows in fidelity, the finding will provide a strong

condition for superiority of PIM-based system architecture.
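
Since Equation 1 itself is not reproduced here, the sketch below gives a minimal Amdahl-style stand-in (our own assumed cost model, not the paper's derivation) that exhibits the same qualitative behavior: a low-locality fraction w costs r_hwp cycles per operation on the cache-missing HWP or rho_lwp cycles on a LWP, with the LWP portion spread over n nodes. A break-even node count then plays the role of NB and is independent of w.

```python
# Illustrative stand-in for the behavior of Equation 1 (assumed cost model, not the
# authors' derivation). Time is normalized to the HWP executing a 0% LWP workload.
def t_control(w: float, r_hwp: float) -> float:
    """HWP-only system: the low-locality fraction w costs r_hwp times a cached op."""
    return (1.0 - w) + w * r_hwp

def t_pim(w: float, n: int, rho_lwp: float) -> float:
    """HWP plus n LWP nodes: low-locality ops cost rho_lwp each on a LWP, and the
    LWP phase runs n-wide, alternating with the HWP phase (cf. Figure 4)."""
    return (1.0 - w) + w * rho_lwp / n

r_hwp, rho_lwp = 50.0, 200.0   # assumed relative per-operation costs
n_break = rho_lwp / r_hwp      # plays the role of NB: PIM wins for n > n_break
for n in (1, 2, 4, 8, 16, 32, 64):
    gain = t_control(0.5, r_hwp) / t_pim(0.5, n, rho_lwp)
    print(f"n = {n:3d}   gain at 50% LWP workload = {gain:6.2f}")
```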

3.2 PIM Technology and Memory Bandwidth

A principal motivating factor for the exploitation of PIM technology and architecture is the opportunity to

greatly increase memory bandwidth with respect to conventional system structures. Partitioning the on-chip memory into multiple memory/processor nodes increases the available on-chip memory bandwidth, with the

total number of nodes being the product of the number of chips and the number of nodes per chip.

But increasing the number of nodes comes at a cost and may not deliver significantly improved performance

to cost. As the memory block is subdivided, each new node requires additional logic for registers, data

paths, controls, interfaces, and for part of the memory stack itself that must hold data related to the

presence, management, and operation of the memory/processor node. Thus the cost of a PIM chip increases

with an increased number of nodes, while the total memory capacity per fixed-size chip is decreased, requiring

more PIM chips to provide the same user memory. The effective memory throughput is also limited by the

concurrency of memory accesses as determined by the user application program as well as the distribution

of those accesses. If there is little program parallelism, then having too many nodes will waste PIM


resources. A critical question for future PIM architecture is: how many nodes should be implemented in a

PIM-based memory system of a given user memory capacity?

An analysis was conducted to model the dominant parameters and their quantitative interrelationships for

this important design trade-off issue. A generalized performance to cost parameter was devised such that

performance is equated to sustained bandwidth, b, and cost is the die area, a. Efficiency, ξ, is the ratio of the sustained bandwidth per unit area to the maximum bandwidth per unit area that can be achieved. An abstract

measure of memory access concurrency is used to differentiate points in the design space. The total user

memory capacity, M, (which is measured in number of rows) remains constant and the total area increases

as the number of node partitions, n, is increased.

The area, a, is the sum of the areas for the user memory and the node overhead logic and overhead

memory. The area for a row of memory is given by A_m, and the amount of area required for all of the overhead logic, registers, and control for a single node is given by A_P. M_P represents this additional

overhead memory per processor, also given in terms of rows of memory.

a = M A_m + n (A_P + M_P A_m)

The overhead area per processor, V, measured in units equivalent to the area of a row of memory is given

by:

V = M_P + \frac{A_P}{A_m}

and after a change of variables, the total area is given by:

a = n A_m \left( \frac{M}{n} + V \right)

To model sustained bandwidth, it is necessary to consider some estimate measure of application user

demand in terms of concurrency of access requests. As the user demand or request parallelism goes up and

the access pattern is uniform over all nodes (clearly there are exceptions to this), the probability of access to

any given node increases. At the finest-grain level, p is the probability that a given memory row is not

accessed in a given machine memory cycle. The number of rows of memory in a single node is given by

M/n. Employing a Bernoulli process to represent the probability that all rows in a node are not accessed,

meaning that in a given cycle that node did not perform a useful memory access, the probability of a failed

node cycle is:

prob(no access for a node) = p^{M/n}

and the probability for a memory access at a node in a given cycle is:

prob(access for a node) = 1 - p^{M/n}

B_P gives the peak bandwidth per node, with the total system bandwidth, b, given as:

b = n B_P \left( 1 - p^{M/n} \right)

Combining the relationships for total area, a, and sustained bandwidth, b, yields an expression for the sought-after performance to cost metric:

\frac{b}{a} = \frac{B_P}{A_m} \cdot \frac{1 - p^{M/n}}{\frac{M}{n} + V}

To begin to reduce the number of parameters, we apply a change of variables such that:

s \equiv \frac{M}{n V}

and,

r \equiv p^V

to provide a new formulation:

\frac{b}{a} = \frac{B_P}{A_m V} \cdot \frac{1 - r^s}{s + 1}

Finally, it can be shown that the maximum value of b/a, (b/a)_{max}, in the limit as n becomes very large, converges to B_P / (A_m V). A measure of efficiency, ξ, is defined as the ratio of the delivered bandwidth per unit area to this maximum possible value, such that:

\xi = \frac{b/a}{B_P / (A_m V)}

and,

\xi = \frac{1 - r^s}{s + 1}

Figure 7 presents the total efficiency with respect to a measure of user memory per node. The unit of

memory for the independent variable, s, is the amount of memory that would fit into the equivalent area of

all the overhead space required to implement a node. Thus, the value s = 1.0 means that the total space of a

node is equally divided between its user memory and overhead resources. As s increases, the amount of user

memory per node increases and the number of nodes decreases. This is the memory intensive regime of PIM

design. Conversely, as s decreases, the amount of user memory per node decreases and the number of

nodes in the system increases. This is the node intensive regime of PIM design. The variable r represents

the probability that a node will not experience a memory access (likelihood of a miss), if that node has the

amount of memory equivalent in area to the area of the overhead resources of the node. As s increases, the

probability of a miss, r^s, decreases (because r < 1.0) and the likelihood of a memory access for the node

increases, as would be expected. This reflects program concurrency and data access spatial locality. The

diagram demonstrates a broad range of optimality implying a preferred balance point of number of nodes

and concurrency of memory access requests.
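
The closed form for ξ can be evaluated directly; the sketch below sweeps s for a few illustrative values of r to reproduce the non-monotonic shape of Figure 7 (the grid and the r values are our own choices, not the figure's exact parameters).

```python
import numpy as np

def efficiency(s: np.ndarray, r: float) -> np.ndarray:
    """xi(s, r) = (1 - r**s) / (s + 1), where s = M / (n V) is user memory per node in
    units of node overhead area, and r = p**V is the miss probability of a node whose
    user memory area equals its overhead area."""
    return (1.0 - r ** s) / (s + 1.0)

s = np.logspace(-2, 2, 81)                   # memory per node, 0.01 .. 100 (cf. Figure 7 x-axis)
for r in (0.0001, 0.005, 0.05, 0.5, 0.9):    # illustrative balanced-node miss probabilities
    xi = efficiency(s, r)
    print(f"r = {r:<6}  peak efficiency {xi.max():.2f} at s ~ {s[np.argmax(xi)]:.2f}")
```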


3.3 Parcels Hide Latency

The latency of access to remote data can have a dramatic effect on the efficiency of execution. Message

driven processing enables decoupling of the computation to provide a producer-consumer or transaction

processing model. A class of active messages, parcels, has been developed to provide a lightweight

capability of moving work to its data. The format of a parcel for the MIND architecture [SB99, SZ02] is

shown in Figure 8 as an example of a typical parcel. It is proposed that the parcel model exploits

application parallelism to automatically overlap communications with computation and will minimize the

impact of latency even across large systems, assuming sufficient concurrency. In order to investigate this

opportunity and explore the design tradeoff space of a parcel based PIM architecture, a series of controlled

simulation experiments was conducted. A control experiment was carefully constructed consisting of a set

of identical process driven nodes. A test experiment or target was constructed as a set of parcel driven

nodes. A flat network, as shown in Figure 9, with a fixed network latency between each pair of nodes connects each set of nodes. The process and transaction node hardware (local cache, local memory, and ALU) are identical, with each node running the same instruction mix, ensuring that only the effects of parcels are observed rather than artifacts of different node hardware speeds.

In the process driven node, a single process runs. This process executes an instruction mix of

ALU operations and memory reference operations (loads/stores). For the load/store operations, the data, with a certain probability, may be in the local cache or local memory. If the data is in local cache or memory, access

is made and a cache or memory access latency is incurred. If the data is remote, the process sends a remote

memory request over the network, incurring a network latency delay, and suspends processing until the

data is returned. The process driven model acts as the experiment control with only one operation allowed

at a time. A process node can be in only one of the following states: idle; executing an ALU operation;

waiting for a cache access; waiting for a local memory access; waiting for a remote memory access;

performing a memory access for a remote request.

In the parcel driven node (see Figure 9), input parcels can queue up at each node but one and only one

thread is created from this parcel FIFO and allowed to run at a time. Only a single thread is active on a

transaction node, with the exception of the specific experiments, shown in Figure 10, designed to

investigate multithreading. This thread executes an instruction mix in the same manner as the control

process model. The difference is that when a remote memory reference is made, a remote thread create

(parcel) is sent over the network, incurring a network latency delay, and the thread terminates. As in the

process model, one and only one operation is allowed at a time. A transaction node can be in only one of

the following states: idle; executing an ALU operation; waiting for a cache access; waiting for a local

memory access.

The primary metric used to evaluate latency hiding by parcels is the ratio: total number of useful operations

completed in a fixed amount of time by the parcel driven nodes divided by total number of useful

operations completed in that time by the process driven nodes. The key design parameters selected for these

experiments were: degree of parallelism in an application workload; number of processing nodes in the

system; network latency; amount of actual remote memory accesses.

3.3.1 Model Description

The basic process model, Figure 11, represents a set of identical processing nodes connected by an

interconnection network. Each processing node consists of an ALU, cache and local memory. Each node

executes a single process consisting of statistically identical instruction mixes determined by modeling

parameters for the probability of: load/store; cache hit; remote memory reference. In the event of a remote

memory access, the request is sent over the network for the cache line and the requested data is returned

over the network. The requesting process stalls until the data is available.

The basic transaction model, Figure 12, represents a set of identical processing nodes connected by an

interconnection network. Each processing node consists of an ALU, cache and local memory. Each node

executes threads consisting of statistically identical instruction mixes determined by modeling parameters

for the probability of: load/store; cache hit; remote memory reference. In the event of a remote memory

access, the requesting thread initiates a remote thread create by sending a parcel over the network to the

Page 9: Analysis and Modeling of Advanced PIM Architecture Design

9

node containing the remote data. The requesting thread then terminates and a new pending parcel

immediately instantiates the next thread to be processed. There is no waiting for a “returned” value.
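
To make the contrast concrete, the sketch below is a much-simplified stand-in for the two Workbench models: one process-driven node versus one parcel-driven node in the ample-parallelism limit where its parcel queue is never empty. The instruction-mix values follow Table 3, but how p_local composes with the mix, the symmetric-traffic assumption, and the 1-cycle ALU/cache/parcel-issue costs are assumptions of this sketch, not taken from the paper.

```python
import random

# Simplified stand-in for the Section 3.3 models (assumed parameter composition).
# Traffic is assumed symmetric, so each outgoing parcel is matched by an incoming
# one that is serviced from local memory at this node.
P_MEM, P_HIT, P_LOCAL = 0.3, 0.9, 0.99
T_ALU = T_CACHE = T_ISSUE = 1
T_MEM = 45               # mem_req: memory access time in cycles (Table 3)
NET_LATENCY = 1024       # one-way network latency in cycles (Table 4)

def useful_ops(cycles: int, parcel_driven: bool, rng: random.Random) -> int:
    """Count operations completed by one node within a fixed cycle budget."""
    t = ops = 0
    while t < cycles:
        u = rng.random()
        if u < 1.0 - P_MEM:                      # ALU operation
            t += T_ALU
        elif u < 1.0 - P_MEM * (1.0 - P_HIT):    # memory reference that hits in cache
            t += T_CACHE
        elif rng.random() < P_LOCAL:             # cache miss serviced by local memory
            t += T_MEM
        elif parcel_driven:                      # remote: issue a parcel, next thread runs;
            t += T_ISSUE + T_MEM                 # matched incoming parcel costs T_MEM here
        else:                                    # remote: stall for round trip plus service
            t += 2 * NET_LATENCY + T_MEM
        ops += 1
    return ops

rng = random.Random(1)
gain = useful_ops(1_000_000, True, rng) / useful_ops(1_000_000, False, rng)
print(f"parcel/process gain ~ {gain:.2f}")       # grows with latency and remote fraction
```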

3.3.2 Model Parameterization

Model parameters allow a variety of latency hiding experiments and sensitivity analyses. Table 3 lists key modeling parameters held fixed to ensure experiment control. These include: the instruction mix, determined by “p_mem” and “p_hit”; ALU speed; and memory access time (“mem_req”). Table 4 lists key

modeling parameters varied for this experiment. Varying the number of processing nodes scales system

size. The amount of parallelism extracted from an application is represented by the number of parcels

queued at each transaction node at the start of the simulation. There is no parallelism in the process driven

nodes. The number of local memory references is determined by the conditional probability of a memory

reference and the probability of a cache miss. The number of remote memory references is determined by

the conditional probability of a memory reference and the probability of a remote reference. The “p_local”

parameter range corresponds to ¼%, ½%, 1%, 2%, and 4% of total memory references (cache misses) being

remote. Remote memory latency is determined by network plus memory latency parameters and remote

thread create (in the case of the transaction model).
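
For reference, the p_local settings in Table 4 map to the remote-access percentages quoted above as follows:

```python
# Mapping Table 4's p_local values to the remote fractions quoted above
# (share of cache-missing references that must go to a remote node).
for p_local in (0.9975, 0.995, 0.99, 0.98, 0.96):
    print(f"p_local = {p_local}:  {100 * (1 - p_local):.2f}% remote")
```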

3.3.3 Experiments

Three sets of experiments were conducted that compared the performance of the transaction model with the

process control model for a fixed time interval: 1) runs which investigated relative performance as a

function of system size (“nbr_nodes”) and amount of extracted parallelism (“deg_parallelism”); 2) runs

which investigated relative performance sensitivity to amount of extracted parallelism (“deg_parallelism”),

network latency (“net_latency”) and remote access frequency (“p_local”); 3) runs which investigated

relative performance sensitivity to amount of extracted parallelism (“deg_parallelism”) for multi-threading

and multi-threading with dual ported memory in the transaction model only. For each case, the models

were run for a fixed time period of one million cycles and relative performance was determined by total

amount of useful work done by the transaction model divided by the total amount of useful work completed

by the process model. For the sensitivity runs, system parameters were varied in a range typical of a

modest scale system.

3.3.4 Experimental Results

Results for the first experiment, in which the network latency and percent of remote accesses were fixed at 1024 machine cycles and 1% respectively, are shown in Figure 13. As the level of parallelism is increased, the transactional model achieves increasingly better performance over the process model until the gain

starts leveling off beyond degree 8, finally reaching around 3.7 for the parameter set selected. As expected

due to the uniform distributions used, the results are independent of the number of nodes, within experimental error. It is worth noting that, as a control, the case of one transaction node gives a gain of 1

independent of the degree of parallelism as expected. It is further noted that the transaction model shows

small gain even at a level of parallelism = 1. This is due to the fact that the process model has a round trip

network latency for remote memory references while the transaction model has only a one way latency due

to creation of a remote thread by the parcel rather than returning the data to the originating node. The

expected gain is a function of the degree of parallelism, percentage of remote accesses and network latency.

The next experiment explored the sensitivity of these parameters for a fixed number of nodes.
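
The small gain at a parallelism level of 1 can be sanity-checked with expected per-operation costs; the arithmetic below assumes the Table 3 mix, p_local applied to cache misses, 1-cycle ALU/cache costs, and a 1-cycle parcel issue (all assumptions of this sketch).

```python
# Expected cycles per operation at parallelism level 1 (assumed composition of the
# Table 3/4 parameters): the process node pays a round trip per remote reference,
# while the parcel node pays only a one-way hop before the migrated thread resumes.
p_mem, p_hit, p_local = 0.3, 0.9, 0.99
t_mem, net_latency = 45, 1024

base = (1 - p_mem) + p_mem * (p_hit + (1 - p_hit) * p_local * t_mem)  # non-remote work
f_remote = p_mem * (1 - p_hit) * (1 - p_local)                        # remote refs per op
e_process = base + f_remote * (2 * net_latency + t_mem)               # round trip + service
e_parcel = base + f_remote * (net_latency + t_mem + 1)                # one-way + service + issue
print(f"expected gain at parallelism 1 ~ {e_process / e_parcel:.2f}")
```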

Figure 14 summarizes the results of a sensitivity analysis on network latency (shown as repeated values of

64, 256, 1024, 4096, and 16384 cycles along the x-axis) and percentage of remote memory accesses as a

function of degree of parallelism (generating a family of curves for each degree of parallelism). For regions

of high degree of parallelism, the transaction model shows significant gain, increasing with increases in network latency and percent of remote accesses. As the degree of parallelism decreases, the transaction

model gain decreases and begins to level off for high network latency and remote access percent. This

leveling off is due to not having enough parallelism to offset the parcels in transit in the network with

useful work at the nodes. The nodes are becoming starved or, in performance engineering terms, the network

is becoming a system bottleneck. As one would expect, the effect is greater for a higher percent of remote accesses but, as shown in Figure 14, for degrees of parallelism below 4 parcels initially queued per transaction node, all remote access percentages simulated show the effect.

For low network latency and remote memory accesses, process and transaction models show little

performance difference given sufficient parallelism; however, when parallelism drops lower (in this case, degree of parallelism below 16), the process model begins to win slightly. The effect is less pronounced for higher remote access percentages (above 2% in this case).

The experiment described in Figure 10 allows multithreading in the parcel driven nodes while the process

driven model is unchanged. Figure 15 summarizes the results of the experiment. The degree of parallelism

again determines the number of parcels in each node's input parcel queue at time zero, with the difference

that processing overlap between ALU and local memory is allowed. A thread can be waiting for a local

memory access while another thread is running on the ALU or waiting for an access to the local cache.

Threads are not, however, interleaved on the ALU. A given thread will use the ALU and cache exclusively

until it makes a memory reference at which time it will immediately release the ALU and cache resources

and another thread can be immediately created by the next parcel in the parcel input queue, if the queue is

not empty.

This experiment is designed to determine the effects of allowing concurrent use of memory and ALU in the

parcel driven nodes. Memory is modeled as single and dual ported memory. Results are compared with the

process driven control model. The effects of allowing multi-threading and dual port memory in the

transaction model for a fixed system size (16 nodes), network latency (1024 cycles), and remote access percent (1%) are compared with the control model. Multi-threading is a natural consequence of the parcel model,

and for the case shown a 2x increase in performance is achieved by multi-threading for sufficient degree of

parallelism. Combining dual ported memory on each transaction node with multi-threading further

increases performance for sufficient parallelism to increase average local memory utilization.

4. Discussion and Conclusions

In this paper, we’ve developed a set of analytic and simulation models for exploring tradeoffs in the PIM design space.

We may summarize the findings of these experiments as follows:

4.1 Interrelationships between PIM Components and a Host Processor

Augmenting the memory system of a host processor with PIM components can yield performance gains

ranging from moderate (a factor of 2 or less) to dramatic (an order of magnitude or more) for applications

that can be separated into regions of high or low temporal locality. The models show that adding even

small amounts of processing capability to the memory system can have significant impact. For data-

intensive applications where there is little or no data reuse, and where caches are of little value, PIM may

help enormously. The model that we developed for this study provides a strong foundation that

characterizes the region of operation in terms of three independent variables: the number of PIM nodes, the

fraction of work that can be assigned to PIM, and a third parameter that is both machine and application

dependent. While it may be difficult to calibrate these parameters for specific design points, by sweeping

them across a range, we are able to get a broad view of the design space and to recognize emerging trends.

In terms of ongoing research, this first study supports the direction taken by projects exploring PIM-

enabled memory for conventional hosts, such as Diva [HKK+99] and Cascade.

4.2 Number of Processing Nodes per PIM Chip

This study investigated the relationship between the number of PIM processing nodes per unit memory

density and processing efficiency, as measured by the ability to effectively use the supplied memory

bandwidth. One important finding from the model is that the relationship is non-monotonic, and that there

is indeed a region of optimality. For applications with high spatial locality or regular access patterns, the

model suggests that it is cost-effective to devote significant area to processing logic. For applications that


don’t have these characteristics, additional processing logic would provide little value and result in a waste

of area.

Chips including VIRAM [KGM+00], Diva [HKK+99], and Yukon [Kir02] demonstrate substantial

performance gains, leveraging memory bandwidth with SIMD or vector operations. At present, however,

there are few PIMs with multiple nodes per chip, such as Execube [Kog94] or the original Blue Gene

[Den00]. Our models suggest substantial opportunity lies in this direction.

4.3 Global Latency Hiding Using Parcels

This third and final study shows that parcels, based on and inspired by early work such as message-driven computation [DCF+92, NWD93] and active messages [vECGS92, CGSV93], can have a dramatic effect in tolerating system-wide latency, but that significant medium- to fine-grain parallelism must be exposed in the

applications to take advantage of this, and that efficient parcel handling mechanisms are required to realize

performance gains. Further, as prior work has shown for MPPs [SBCvE90], our model demonstrates that

multithreading at the node can have tremendous benefit in PIM systems. Finally, execution models

developed over a decade ago in the context of dataflow architectures such as Monsoon [PC90] and P-RISC

[NA89] may have new relevance in PIM technology.

References

[Adi+02] N. R. Adiga et al. An overview of the BlueGene/L supercomputer. In Proceedings of

Supercomputing (SC2002), Baltimore, MD, November 2002.

[AN90] Arvind and Rishiyur S. Nikhil. Executing a program on the MIT tagged-token dataflow

architecture. IEEE Transactions on Computers, 39(3):300–318, March 1990. Also appears in Proceedings

of PARLE87. Parallel Architectures and Languages Europe.pp.1–29, vol.2.

[BKF+99] Jay B. Brockman, Peter M. Kogge, Vincent W. Freeh, Shannon K. Kuntz, and Thomas L.

Sterling. Microservers: A new memory semantics for massively parallel computing. In Conference

Proceedings of the 1999 International Conference on Supercomputing, pages 454–463, Rhodes, Greece,

June 20–25, 1999. ACM SIGARCH.

[BKKK02] J.B. Brockman, E. Kang, S. Kuntz, and P. Kogge. The architecture and implementation of a

microserver-on-a-chip. Technical Report CSE TR02-05, University of Notre Dame CSE Dept., 2002.

[BZKJ98] J. B. Brockman, J. Zawodny, P. Kogge, E. Johnson, “Cache-in-Memory: A Lower Power

Alternative,” presented at Workshop on Power-Driven Microarchitecture, held in conjunction with the

International Symposium on Computer Architecture, Barcelona, Spain, June 1998

[CGSV93] David E. Culler, Seth Copen Goldstein, Klaus Erik Schauser, and Thorsten Von Eicken. TAM –

A compiler controlled Threaded Abstract Machine. Journal of Parallel and Distributed Computing,

18(3):347–370, July 1993.

[DCC+] William Dally, Andrew Chang, Andrew Chien, Stuart Fiske, Waldemar Horwat, John Keen,

Richard Lethin, Michael Noakes, Peter Nuth, Ellen Spertus, Deborah Wallach, and D. Scott Wills. The J-Machine: A retrospective.

[DCF+92] W. J. Dally, A. Chien, J. A. S. Fiske, G. Fyler, W. Horwat, J. S. Keen, R. A. Lethin, M. Noakes,

P. R. Nuth, and D. S. Wills. The message driven processor: An integrated multicomputer processing

element. In International Conference on Computer Design, VLSI in Computers and Processors, pages 416–

419, Los Alamitos, Ca., USA, October 1992. IEEE Computer Society Press.

[Den00] Monty Denneau. Blue Gene. In SC2000: High Performance Networking and Computing, pages 35–35, Dallas, TX, November 2000. ACM.

[FPC+97] Richard Fromm, Stylianos Perissakis, Neal Cardwell, Christoforos Kozyrakis, Bruce McGaughy,

David Patterson, Tom Anderson, and Katherine Yelick. The energy efficiency of IRAM architectures. In

Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA-97), volume

25,2 of Computer Architecture News, pages 327–337, New York, June 2–4 1997. ACM Press.

[Hew77] C. Hewitt. Viewing control structures as patterns of passing messages. Journal of Artificial Intelligence, 8(3):323–363, June 1977.

[Hil81] W. Daniel Hillis. The connection machine. Technical Report AIM-646, Massachusetts Institute of

Technology, September 1981.

[HKK+99] Mary Hall, Peter Kogge, Jeff Koller, Pedro Diniz, Jacqueline Chame, Jeff Draper, Jeff LaCoss,

John Granacki, Apoorv Srivastava, William Athas, Jay Brockman, Vincent Freeh, Joonseok Park,

and Jaewook Shin. Mapping irregular applications to DIVA, A PIM-based data-intensive architecture. In

Supercomputing (SC’99), pages ??–??, Portland, Oregon, November 1999. ACM Press and IEEE Computer

Society Press.

[Hyp] www.hyperformix.com, HyPerformix, Inc., 4301 Westbank Drive, Bldg. A, Austin, TX 78746.

[Ian88] Robert A. Iannucci. Toward a dataflow/von Neumann hybrid architecture. In Proc. 15th Annual Symposium on Computer Architecture, pages 131–140. ACM, May 1988. Published as Computer Architecture News, volume 16, number 2.

[KGM+00] Christoforos Kozyrakis, Joseph Gebis, David Martin, Samuel Williams, Ioannis Mavroidis,

Steven Pope, Darren Jones, and David Patterson. Vector IRAM: A media-enhanced vector processor with

embedded DRAM. In IEEE, editor, Hot Chips 12: Stanford University, Stanford, California, August 13–15,

2000, pages ??–??, 1109 Spring Street, Suite 300, Silver Spring, MD 20910, USA, 2000. IEEE Computer

Society Press.

[Kir02] Graham Kirsch. Active memory device delivers massive parallelism. In Microprocessor Forum,

San Jose, CA, October 2002.

[Kog94] P. M. Kogge. EXECUBE - A new architecture for scalable MPPs. In Dharma P. Agrawal, editor,

Proceedings of the 23rd International Conference on Parallel Processing. Volume 1: Architecture, pages

77–84, Boca Raton, FL, USA, August 1994. CRC Press.

[LY99] G. Lipovski and C. Yu. The dynamic associative access memory chip and its application to SIMD

processing and full-text database retrieval. In IEEE International Workshop on Memory Technology, Design

and Testing, pages 24–33, San Jose, CA, August 1999. IEEE, IEEE Computer Society.

[Mat] www.mathworks.com, The MathWorks, Inc., 3 Apple Hill Drive, Natick, MA 01760

[Mic] www.microsoft.com, Microsoft Corporation, One Microsoft Way, Redmond, WA 98052

[MSS95] P. C. Messina, T. A. Sterling, and P. H. Smith. Enabling Technologies for PetaFlops Computing.

MIT Press, 1995.

[NA89] Rishiyur S. Nikhil and Arvind. Can dataflow subsume von Neumann computing? In Proceedings

of the 16th Annual International Symposium on Computer Architecture, pages 262–272, June 1989.


[NWD93] M. D. Noakes, D. A. Wallach, and W. J. Dally. The J-machine multicomputer: An architectural

evaluation. In Lubomir Bic, editor, Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 224–236, San Diego, CA, May 1993. IEEE Computer Society Press.

[PC90] Gregory M. Papadopoulos and David E. Culler. Monsoon: An Explicit Token-Store Architecture.

In 17th International Symposium on Computer Architecture, number 18(2) in ACM SIGARCH Computer

Architecture News, pages 82–91, Seattle, Washington, May 28–31, June 1990.

[SBCvE90] R. Saavedra-Barrera, D. Culler, and T. von Eicken. Analysis of multithreaded architectures for

parallel computing. In Proceedings of the second annual ACM symposium on Parallel algorithms and

architectures, pages 169–178. ACM Press, 1990.

[Smi78] Burton J. Smith. A pipelined, shared resource MIMD computer. In Proceedings of the 1978 International Conference on Parallel Processing, pages 6–8, 1978.

[Smi91] Burton Smith. A massively parallel shared memory computer. In ACM-SIGACT; ACM-

SIGARCH, editor, Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and

Architectures, pages 123–124, Hilton Head, SC, July 1991. ACM Press.

[SPN96] Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk. Missing the memory wall: The case for

processor/memory integration. In 23rd Annual International Symposium on Computer Architecture (23rd

ISCA’96), Computer Architecture News, pages 90–101. ACM SIGARCH, May 1996.

[SB99] Thomas Sterling and Larry Bergman. A design analysis of a hybrid technology multithreaded

architecture for petaflops scale computation. In Conference Proceedings of the 1999 International

Conference on Supercomputing, pages 286–293, Rhodes, Greece, June 20–25, 1999. ACM SIGARCH.

[SZ02] T. Sterling and H. Zima. Gilgamesh: A multithreaded processor-in-memory architecture for

petaflops computing. In Supercomputing: High-Performance Networking and Computing, November 2002.

[vECGS92] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active

messages: a mechanism for integrated communication and computation. In Proceedings of the 19th Annual

International Symposium on Computer Architecture, pages 256–266, Gold Coast, Australia, May 1992.

[WM95] William A. Wulf and Sally A. McKee. Hitting the memory wall: Implications of the obvious.

Computer Architecture News, 23(1):20–24, March 1995.


Figure 1. Microprocessor with PIMs

Table 1. Parametric Assumptions


Figure 2. SES Queuing Model of Heavyweight Processor

Figure 3. SES Queuing Model of Lightweight PIM Nodes


Figure 4. Threads Timeline

Figure 5. Simulation of Performance Gain


Figure 6. Effect of PIM on Execution Time with Normalized Runtime

Table 2. Metrics


[Figure 7 plot: “Tradeoff of Bandwidth Efficiency versus Number of Nodes”; x-axis: memory per node, s (log scale, 10^-2 to 10^2); y-axis: efficiency (0 to 0.8); one curve per probability of miss for a balanced node, from 0.0001 to 0.9.]

Figure 7. Total Efficiency vs. a Measure of User Memory per Node

Parameter Description Value

mem_req memory access time (cycles) for cache line 45

p_mem probability of load/store operation 0.3

p_hit probability memory request is in local cache 0.9

Table 3. Key Fixed Parameters

Parameter Description Value

nbr_nodes number of nodes for given simulation run 1,2,4,8,16,32,64,128,256

deg_parallelism number processes or threads per node at time = 0 1,2,4,8,16,32,64,128, 256

net_latency end to end network latency (cycles) 64,256,1024,4096

p_local probability memory request is local 0.9975,0.995,0.99,0.98,0.96

Table 4. Key Sensitivity Analysis Parameters


Figure 8. Example Parcel Format

Figure 9. Parcel Simulation Latency Hiding Experiment

[Figure 8 diagram: parcel fields comprise Destination Address (40 bits), Type (8 bits), Action (8 bits), Operand (64 bits), Continuation, and CRC (32 bits), with subfields for domain, port, local address, action code, and operand length. Figure 9 diagram: a control experiment of process driven nodes and a test experiment of parcel driven nodes, each node containing an ALU and local memory, exchanging remote memory requests and input/output parcels over a flat network.]


Figure 10. Parcel Simulation Latency Hiding Experiment with Multithreading

Figure 11. Basic Process Model

[Figures 10 and 11 diagrams: same process driven / parcel driven node structure and flat network arrangement as Figure 9.]


Figure 12. Basic Transaction Model

[Figure 13 chart: “Latency Hiding, Process vs Transaction Models”, curves for different numbers of nodes (1 to 256); x-axis: parallelism level (parcels/node at time = 0), 1 to 256; y-axis: ratio of instructions executed, transaction/process, 0 to 4.]

Figure 13. Latency Hiding and Degree of Parallelism


[Figure 14 chart: “Sensitivity to Remote Latency and Remote Access Fraction”, 16 nodes; x-axis: remote memory latency (cycles), sweeps of 64, 256, 1024, 4096, and 16384 repeated for deg_parallelism = 256, 64, 16, 4, 2, 1; y-axis (log scale, 0.1 to 1000): total transactional work done / total process work done; one curve per remote access fraction (1/4%, 1/2%, 1%, 2%, 4%).]

Figure 14. Sensitivity to Remote Latency and Remote Access Frequency


[Figure 15 chart: “Latency Hiding, Process vs Transaction Model”, 16 nodes, network latency = 1024 cycles, 1% of actual accesses to memory are remote; x-axis: parallelism level (parcels/node at time = 0), 1 to 256; y-axis: total transactional work / total process work, 0 to 10; curves: no multi-threading, multi-threading, multi-threading and dual port memory.]

Figure 15. Effects of Transactional Multi-threading and Dual Port Memory