ROMAIN DOLBEAU ANDRÉ SEZNEC
CASH: REVISITING HARDWARE SHARING IN
SINGLE-CHIP PARALLEL PROCESSOR
Journal of Instruction Level Parallelism
(2006)
Outline
What is CASH.
Introduction: CMP, SMT.
Sharing resources in CASH.
Simulations and Results.
Conclusion.
What is CASH
CMP And SMT Hybrid: an intermediate design point for on-chip thread parallelism, in terms of both design complexity and hardware sharing.
CASH retains SMT-style resource sharing wherever that sharing can be made non-critical for the implementation.
Introduction
Two approaches for implementing thread level parallelism on a chip have emerged and are now implemented by the industry:
CMP (Chip Multi-Processor) and SMT (Simultaneous Multi-Threading).
CMP: Chip Multi-Processor
The design essentially reproduces, at chip level, the shared-memory multiprocessor design used in previous-generation machines.
Two or four relatively simple execution cores can be implemented on a single chip.
E.g. IBM Power processor.
Most of the benefits will be counterbalanced by a deeper pipeline.
CPU intensive applications will have to be parallelized or multithreaded.
SMT: Simultaneous Multi-Threading
Designed to achieve high single-process performance, with multithreaded performance as a bonus.
An SMT processor supports concurrent threads at very fine granularity.
Instructions from the parallel threads progress concurrently in the execution core and share the hardware resources of the processor: functional units, caches, predictors.
The main difficulty in SMT's implementation: in a wide-issue superscalar processor, the hardware complexity of
• the renaming logic,
• the register file, and
• the bypass network
increases super-linearly with issue width.
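The super-linear growth of the bypass network can be seen with a simple count: each of the W results produced per cycle may need a forwarding path to each source operand of each of the W instructions issuing. A minimal sketch (the operand and stage counts here are illustrative assumptions, not figures from the paper):

```python
# Rough count of result-forwarding (bypass) paths in a W-wide core:
# every one of W results may feed every source operand of every one
# of W issuing instructions, per forwarding stage.
def bypass_paths(width, operands=2, stages=1):
    """Approximate number of bypass paths for a given issue width."""
    return width * width * operands * stages

# Doubling the issue width quadruples the bypass network:
assert bypass_paths(2) == 8
assert bypass_paths(4) == 32
assert bypass_paths(8) == 128
```

This quadratic blow-up is why CASH keeps the bypass networks private to each narrow core instead of building one wide shared core.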
CMP and SMT represent two extreme design points
CMP: no execution resource is shared apart from the L2 cache; only the memory interfaces are shared. A process cannot benefit from any resource of the other CPUs.
• After a context switch, a process may migrate from processor Pi to processor Pj, losing the contents of its instruction cache, data cache, and branch prediction structures.
SMT: single-process performance is privileged.
• Total resource sharing allows a thread to benefit from the prefetch effect and cache warming caused by the other threads in a parallel workload.
CASH
This paper proposes a median course: the CASH (CMP And SMT Hybrid) parallel processor. Instead of an all-or-nothing sharing policy, one can share some of the resources.
• Branch predictors, instruction caches, and data caches can be shared among several processor cores on a single-chip parallel processor.
• On the other hand, CASH keeps the major parts of the execution cores separate: rename logic, wake-up and issue logic, bypass network, register file.
Revisiting resource sharing for parallel processors
Sharing in CMP (diagram): four cores, each with a private L1 cache, all sharing an L2 cache.
Sharing in SMT (diagram): a single 4-way threaded core with a shared L1 cache and a shared L2 cache.
Sharing in CASH (diagram): four separate execution cores sharing the L1 caches and the L2 cache.
For the sake of simplicity, we assume the following pipeline organization:
Instruction fetch. Instruction decode. Register renaming. Instruction dispatch. Execution/memory access. Write back. Retire.
Some of these functional stages may span several pipeline cycles. The hardware complexity of some of these functional stages increases super-linearly, and even near quadratically. This is due to the dependencies that may exist among groups of instructions that are treated in parallel.
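As a rough illustration of the stage list above, here is a toy timing model. It assumes an idealized one-instruction-per-cycle flow with no stalls, which is a simplification for illustration only:

```python
# The functional pipeline stages assumed in the paper's simple model.
STAGES = ["fetch", "decode", "rename", "dispatch",
          "execute/mem", "writeback", "retire"]

def timeline(n_instructions):
    """Cycle at which instruction i reaches each stage, assuming one
    instruction enters the pipeline per cycle and nothing stalls."""
    return [{stage: i + s for s, stage in enumerate(STAGES)}
            for i in range(n_instructions)]

t = timeline(3)
assert t[0]["fetch"] == 0       # first instruction fetched at cycle 0
assert t[0]["retire"] == 6      # retires after the 7 stages
assert t[2]["retire"] == 8      # later instructions retire one cycle apart
```

In a real wide-issue core, several instructions occupy each stage per cycle, and it is exactly the cross-checking among those co-resident instructions that drives the super-linear complexity.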
Complexity of a wide-issue superscalar core (pipeline diagram): instruction fetch, instruction decode, register rename & instruction dispatch, execution/memory access, instruction retire.
Sharing resources in CASH
Some hardware resources that every processor needs can be shared among the CASH cores:
Long-latency units (dividers).
Instruction caches (processors are granted access to the shared instruction cache in round-robin mode).
Data caches (a bank-interleaved structure is used).
Branch predictor.
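The two cache-sharing mechanisms mentioned above can be sketched as follows; the grant ordering, line size, and bank count are illustrative assumptions, not parameters from the paper:

```python
from itertools import cycle

# Round-robin grant of the shared instruction cache among the cores.
def round_robin_grants(n_cores, n_cycles):
    """Which core owns the I-cache port on each cycle."""
    order = cycle(range(n_cores))
    return [next(order) for _ in range(n_cycles)]

# Bank selection for a bank-interleaved shared data cache: low-order
# line-address bits pick the bank, so consecutive lines (and accesses
# from different cores) spread across the banks.
def bank_of(address, line_size=64, n_banks=8):
    return (address // line_size) % n_banks

assert round_robin_grants(4, 6) == [0, 1, 2, 3, 0, 1]
assert bank_of(0) == 0
assert bank_of(64) == 1          # next cache line, next bank
assert bank_of(64 * 8) == 0      # wraps around after n_banks lines
```

Both mechanisms avoid any critical cross-core arbitration logic: round robin needs only a counter, and bank interleaving lets independent cores hit different banks in the same cycle.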
Benefits of cache sharing
When a multi-programmed workload is encountered, the cache capacity is dynamically shared among the different processes.
When the workload is a single process, this situation is clearly beneficial, since the process can exploit the whole resource.
When a parallel workload is encountered, sharing is also beneficial:
instructions are shared, i.e., less total capacity is needed, and data can be prefetched by one thread on behalf of another.
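The capacity argument can be illustrated with a toy fully associative LRU cache (the sizes and access trace below are made up for illustration): a single process on a shared cache sees the whole capacity, while on a CMP-style partition it only sees its private slice.

```python
from collections import OrderedDict

# Toy fully-associative LRU cache.
class LRUCache:
    def __init__(self, n_lines):
        self.n_lines, self.lines = n_lines, OrderedDict()

    def access(self, line):
        hit = line in self.lines
        if hit:
            self.lines.move_to_end(line)      # refresh LRU position
        else:
            self.lines[line] = True
            if len(self.lines) > self.n_lines:
                self.lines.popitem(last=False)  # evict least recent
        return hit

def hits(cache, trace):
    return sum(cache.access(a) for a in trace)

trace = list(range(8)) * 2           # working set of 8 lines, reused once
shared = hits(LRUCache(8), trace)    # whole shared cache: working set fits
private = hits(LRUCache(4), trace)   # private slice: sequential thrashing
assert shared == 8 and private == 0
```

The same effect, scaled up, is what the paper's working-set-to-cache-size-ratio results measure.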
Experimental Work
We simulated two basic configurations for both CASH and CMP:
Two four-way execution core processors: CASH-2, CMP-2.
Four two-way execution core processors: CASH-4, CMP-4.
CASH and CMP differ by:
The first-level instruction and data caches, and the buses to the second-level cache.
The non-pipelined integer unit used for complex instructions such as divide.
The non-pipelined floating-point unit used for complex instructions (divide, square root).
The branch predictor.
Simulation Results

Component            | Latency             | Size                                              | Associativity
L1 data cache        | 3 (CASH), 2 (CMP)   | 128 KB (CASH-4), 64 KB (CASH-2), 32 KB/core (CMP) | 4-way (LRU), 8 banks
L1 instruction cache | 1                   | 64 KB (CASH-4), 32 KB (CASH-2), 16 KB/core (CMP)  | 2-way (LRU)
L2 cache             | 15 (CASH), 17 (CMP) | 2 MB                                              | 4-way (LRU)
Main memory          | 150                 | Infinite                                          | Infinite

Memory Hierarchy (Latencies, Sizes)
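Using the latencies from the table above, one can sketch the resulting average memory access time (AMAT). The hit rates below are illustrative assumptions, not figures from the paper; the point is that a larger shared L1 can offset its extra access cycle:

```python
# Average memory access time from the table's latencies.
# Hit rates are assumed for illustration only.
def amat(l1_lat, l2_lat, mem_lat, l1_hit, l2_hit):
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    return l1_lat + l1_miss * (l2_lat + l2_miss * mem_lat)

# Assumed: the bigger shared CASH L1 earns a higher hit rate than a
# private CMP slice, despite its extra cycle of latency.
cash = amat(l1_lat=3, l2_lat=15, mem_lat=150, l1_hit=0.95, l2_hit=0.90)
cmp_ = amat(l1_lat=2, l2_lat=17, mem_lat=150, l1_hit=0.90, l2_hit=0.90)
```

With these assumed hit rates, `cash` comes out lower than `cmp_`, mirroring the paper's argument that capacity sharing can win whenever the working set pressures the private caches.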
Benchmark sets for CASH and CMP
A subset of the SPEC CPU benchmark suite: 10 from the integer subset, 10 from the floating-point subset, plus mixed subsets of integer and floating-point benchmarks.
All benchmarks were compiled using the Sun C and Fortran compilers,
with -xO3 and -fast optimizations for C and Fortran respectively.
Results 1
Average execution time of the integer benchmark groups on the core configurations.
Results 2
Context Switching
Parallel workload benchmark
A subset of the SPLASH benchmark suite, parallelized using POSIX threads.
Traces were extracted from a thread execution after all threads were started.
All simulated benchmarks exhibited some level of inter-thread sharing and prefetching.
The positive effect of this sharing became more visible as the working-set-to-cache-size ratio became higher.
Whether by enlarging the working set or diminishing the cache size, CASH is the best performer whenever this ratio is high, thanks to a higher hit ratio in the memory hierarchy.
Conclusion
Instead of an all-or-nothing sharing policy, CASH implements separate execution cores, as in a CMP, while sharing the memory structures, caches, predictors, and rarely used functional units, as in an SMT.
The simulation results illustrate that CASH competes favorably with a CMP solution on most workloads.
Thank You