ROMAIN DOLBEAU ANDRÉ SEZNEC CASH: REVISITING HARDWARE SHARING IN SINGLE-CHIP PARALLEL PROCESSOR Journal of Instruction Level Parallelism (2006)


TRANSCRIPT

Page 1:

ROMAIN DOLBEAU, ANDRÉ SEZNEC

CASH: REVISITING HARDWARE SHARING IN SINGLE-CHIP PARALLEL PROCESSOR

Journal of Instruction Level Parallelism (2006)

Page 2:

Outline

What is CASH.
Introduction: CMP, SMT.
Sharing resources in CASH.
Simulations and Results.
Conclusion.

Page 3:

What is CASH

CMP And SMT Hybrid: CASH introduces an intermediate design for on-chip thread parallelism, in terms of design complexity and hardware sharing.

CASH retains the resource sharing of SMT whenever such sharing can be made non-critical for the implementation.

Page 4:

Introduction

Two approaches for implementing thread-level parallelism on a chip have emerged and are now implemented by industry:

CMP: Chip Multi-Processor. SMT: Simultaneous Multi-Threading.

Page 5:

CMP: Chip Multi-Processor

The design essentially reproduces, at the chip level, the shared-memory multiprocessor design that was used with previous-generation machines.

Two or four relatively simple execution cores can be implemented on a single chip.

E.g., the IBM Power processor.

Most of the benefits (of a more complex core) would be counterbalanced by a deeper pipeline.

CPU-intensive applications will have to be parallelized or multithreaded.

Page 6:

SMT: Simultaneous Multi-Threading

Designed to achieve high single-process performance, with multithreaded performance as a bonus.

An SMT processor supports concurrent threads at a very fine granularity.

Instructions from the parallel threads progress concurrently in the execution core and share the hardware resources of the processor: functional units, caches, predictors.

The main difficulty in implementing SMT: on a wide-issue superscalar processor, the hardware complexity of
• the renaming logic,
• the register file, and
• the bypass network
increases super-linearly.

Page 7:

CMP and SMT represent two extreme design points

CMP: no execution resource is shared apart from the L2 cache. The memory interfaces are shared. A process cannot benefit from any resource of the other CPUs.

• After a context switch, a process may migrate from processor Pi to processor Pj, losing the contents of its instruction cache, data cache, and branch prediction structures.

SMT: single-process performance is privileged.

• Total resource sharing allows each thread to benefit from the prefetch effect and from cache warming by the other threads in a parallel workload.

Page 8:

CASH

This paper proposes a median course: the CASH (CMP And SMT Hybrid) parallel processor. Instead of an all-or-nothing sharing policy, one can share some of the resources.

• Branch predictors, instruction caches, and data caches can be shared among several processor cores on a single-chip parallel processor.

• On the other hand, CASH keeps the major parts of the execution cores separate: rename logic, wake-up and issue logic, bypass network, and register file.

Page 9:

Revisiting resource sharing for parallel processors

[Figure: resource sharing in the three designs. Sharing in CMP: Cores 1-4, each with a private L1 cache, in front of a shared L2 cache. Sharing in SMT: a single 4-way threaded core with a shared L1 cache and a shared L2 cache. Sharing in CASH: Cores 1-4 in front of shared L1 caches and a shared L2 cache.]

Page 10:

For the sake of simplicity, we assume the following pipeline organization:

Instruction fetch. Instruction decode. Register renaming. Instruction dispatch. Execution / memory access. Write back. Retire.

Some of these functional stages may span several pipeline cycles. The hardware complexity of some of these functional stages increases super-linearly, and even near-quadratically, with the issue width. This is due to the dependencies that may exist among the groups of instructions that are treated in parallel.
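As a rough illustration of why these stages scale badly, the usual complexity arguments for superscalar cores give the following back-of-the-envelope scaling with issue width W (this is an assumption of this summary, not a formula from the paper):

```latex
% Illustrative scaling with issue width W (assumed, not from the paper):
%  - renaming must cross-check each instruction's sources against the
%    destinations of the earlier instructions in the same rename group
%  - the bypass network must forward every result to every consuming input
\begin{align*}
  \text{rename dependence checks} &\approx W(W-1) = O(W^{2}),\\
  \text{bypass paths}             &\approx 2\,S\,W^{2} \quad (S = \text{pipeline stages covered by forwarding}),\\
  \text{register file ports}      &\approx 3W \quad (\text{2 reads + 1 write per instruction}).
\end{align*}
```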

[Figure: complexity of a wide-issue superscalar core across the pipeline stages: instruction fetch, instruction decode, register renaming & instruction dispatch, execution / memory access, instruction retire.]

Page 11:

Sharing resources in CASH

Some hardware resources that are needed by every processor can be shared among the CASH cores (the round-robin and bank-interleaving mechanisms are sketched right after this list):
Long-latency units (dividers).
Instruction caches (processors are granted access to the instruction cache in round-robin mode).
Data caches (using a bank-interleaved structure).
Branch predictor.
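To make these two sharing mechanisms concrete, here is a minimal C sketch of round-robin arbitration for the shared instruction cache and of bank selection for a bank-interleaved shared data cache. The constants, names, and the main() driver are illustrative assumptions of this summary, not parameters or code from the paper.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters (assumptions of this sketch, not from the paper). */
#define NUM_CORES         4   /* CASH-4 style: four cores share the caches */
#define NUM_DCACHE_BANKS  8   /* banked shared data cache                  */
#define LINE_SIZE        64   /* cache line size in bytes (assumed)        */

/* Round-robin arbitration for the shared instruction cache:
 * each cycle the fetch port is granted to the next core in circular order. */
static unsigned icache_next = 0;

static unsigned icache_grant(void)
{
    unsigned winner = icache_next;
    icache_next = (icache_next + 1) % NUM_CORES;
    return winner;                      /* id of the core fetching this cycle */
}

/* Bank selection for the bank-interleaved shared data cache:
 * consecutive cache lines map to consecutive banks, so accesses from
 * different cores usually target different banks and proceed in parallel. */
static unsigned dcache_bank(uint64_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_DCACHE_BANKS);
}

int main(void)
{
    for (int cycle = 0; cycle < 6; cycle++)
        printf("cycle %d: I-cache port granted to core %u\n", cycle, icache_grant());

    uint64_t addrs[] = { 0x1000, 0x1040, 0x1080, 0x2000 };
    for (int i = 0; i < 4; i++)
        printf("address 0x%llx -> D-cache bank %u\n",
               (unsigned long long)addrs[i], dcache_bank(addrs[i]));
    return 0;
}
```

The point of both mechanisms is that simultaneous requests from different cores rarely conflict (only when they map to the same bank, or compete for the fetch port in the same cycle), which is what keeps these shared structures non-critical for the implementation.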

Page 12:

Benefits of cache sharing

When a multi-programmed workload is encountered, the cache capacity is dynamically shared among the different processes.

When the workload is a single process, sharing is clearly beneficial, since the process can exploit the whole resource.

When a parallel workload is encountered, sharing is also beneficial.

Instructions are shared, i.e., less total capacity is needed, and data can be prefetched by one thread for another.

Page 13:

Experimental Work

We simulated two basic configurations for both CASH and CMP:
Two four-way execution core processors: CASH-2, CMP-2.
Four two-way execution core processors: CASH-4, CMP-4.

CASH and CMP differ by:

The first-level instruction and data caches, and the buses to the second-level cache.

The non-pipelined integer unit used for complex instructions such as divide.

The non-pipelined floating-point unit used for complex instructions (divide, square root).

The branch predictor.

Page 14:

Simulation Results

Memory Hierarchy (Latencies, Sizes)

Component            | Latency (cycles)    | Size                                              | Associativity
L1 data cache        | 3 (CASH), 2 (CMP)   | 128 KB (CASH-4), 64 KB (CASH-2), 32 KB/core (CMP) | 4-way (LRU), 8-way banked
L1 instruction cache | 1                   | 64 KB (CASH-4), 32 KB (CASH-2), 16 KB/core (CMP)  | 2-way (LRU)
L2 cache             | 15 (CASH), 17 (CMP) | 2 MB                                              | 4-way (LRU)
Main memory          | 150                 | Infinite                                          | Infinite

Page 15:

Benchmark sets for CASH and CMP

A subset of the SPEC CPU benchmark suite: 10 from the integer subset, 10 from the floating-point subset, and mixed subsets of integer and floating-point benchmarks.

All benchmarks were compiled using the Sun C and Fortran compilers, with the -xO3 and -fast optimizations for C and Fortran respectively.

Page 16:

Results 1

Average execution time of integer benchmarks, run in groups, on the core configurations.

Page 17:

Results 2

Context Switching

Page 18:

Parallel workload benchmarks

A subset of the SPLASH benchmark suite, parallelized using POSIX threads (a minimal sketch of this thread structure follows at the end of this slide).

Traces were extracted from a thread execution after all threads were started.

All simulated benchmarks exhibited some level of inter-thread sharing and prefetching.

The positive effect of this sharing became more visible as the ratio of working-set size to cache size became higher.

By enlarging the working set or diminishing the cache size, we observed that CASH is the best performer whenever this ratio is high, thanks to a higher hit ratio in the memory hierarchy.
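For concreteness, the fork-join POSIX-threads structure of such SPLASH-style parallel workloads looks roughly like the sketch below; the names, sizes, and the toy computation are assumptions of this summary, not code from SPLASH or from the paper. The threads operate on shared data, which is what produces the inter-thread cache sharing and prefetching effects described above.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4            /* e.g. one thread per CASH-4 core (assumed) */
#define N           (1 << 20)

static double data[N];           /* shared working set */

static void *worker(void *arg)
{
    long id = (long)arg;
    /* Each thread processes a contiguous slice of the shared array;
     * accesses to shared lines can be prefetched or warmed by another
     * thread running on a core that shares the caches. */
    for (long i = id * (N / NUM_THREADS); i < (id + 1) * (N / NUM_THREADS); i++)
        data[i] = data[i] * 0.5 + 1.0;
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];

    /* Fork: start all workers (traces were extracted only after this point). */
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);

    /* Join: wait for all workers to finish. */
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);

    printf("done: data[0] = %f\n", data[0]);
    return 0;
}
```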

Page 19:

Conclusion

Instead of an all-or-nothing sharing policy, CASH implements separate execution cores, as on a CMP.

CASH shares the memory structures (caches, predictors) and the rarely used functional units, as on an SMT.

The simulation results illustrate that CASH competes favorably with a CMP solution on most workloads.

Page 20:

Thank You