Design Issues: SMT and CMP Architectures


  • DESIGN ISSUES: SMT and CMP Architectures

  • Why are design issues important? They determine the performance measures of each processor in a precise manner. Limitations on issue-slot usage also determine performance.

  • Why multithreading today? ILP is exhausted; TLP is in. There is a large performance gap between memory and processor. There are too many transistors on a chip, more existing multithreaded applications today, multiprocessors on a single chip, and long network latencies, too.

  • DESIGN CHALLENGES OF SMT: What is the impact of fine-grained scheduling on single-thread performance? Does a preferred-thread approach sacrifice neither throughput nor single-thread performance? Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput.

  • Reason for the loss of throughput? The pipeline is less likely to have a mix of instructions from several threads, resulting in a greater probability that either empty issue slots or a stall will occur.

  • Design Challenges: A larger register file is needed to hold multiple contexts. The clock cycle time must not be affected, especially in instruction issue (more candidate instructions need to be considered) and instruction completion (choosing which instructions to commit may be challenging). Cache and TLB conflicts generated by SMT must not degrade performance.

  • Observations: There are mainly two observations. The potential performance overhead due to multithreading is small. The efficiency of current superscalars is low, with room for significant improvement.

  • An SMT processor works well if the number of compute-intensive threads does not exceed the number of threads supported by the SMT, and the threads have highly different characteristics, e.g., one thread doing mostly integer operations and another doing mostly floating-point operations.

  • It does not work well if threads try to utilize the same functional units, or when there are assignment problems. E.g., in a dual-core system where each core runs 2 threads simultaneously, 2 compute-intensive application processes might end up on the same core instead of on different cores.

  • The problem here is that the operating system does not see the difference between SMT logical processors and real processors! (A thread-pinning workaround is sketched below.)
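One common workaround, when the scheduler treats every logical (SMT) processor as equivalent, is to pin compute-intensive threads onto different physical cores explicitly. The following is a minimal Linux/pthreads sketch; the assumption that logical CPUs 0 and 2 belong to different physical cores is machine-specific and must be verified with `lscpu -e` or /proc/cpuinfo.

```c
/* Minimal sketch: pin two compute-intensive threads onto different
 * physical cores so they do not compete for the same SMT resources.
 * ASSUMPTION: logical CPUs 0 and 2 map to different physical cores;
 * check the real mapping with `lscpu -e` or /proc/cpuinfo. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void *busy_work(void *arg)
{
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)
        x += i * 1e-9;            /* stand-in for a compute-bound kernel */
    return NULL;
}

static void pin_to_cpu(pthread_t t, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t, sizeof(set), &set);
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, busy_work, NULL);
    pthread_create(&t2, NULL, busy_work, NULL);
    pin_to_cpu(t1, 0);            /* assumed: first physical core  */
    pin_to_cpu(t2, 2);            /* assumed: second physical core */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```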

  • Transient Faults: faults that persist for only a short duration. Cause: cosmic rays (e.g., neutrons). Effect: they knock off electrons and discharge capacitors. Solution: there is no practical absorbent for cosmic rays. Estimated fault rate: 1 fault per 1000 computers per year. The future is worse: smaller feature sizes, reduced voltages, higher transistor counts, and reduced noise margins.

  • Processor Utilization vs. Latency: R = the run length to a long-latency event; L = the amount of latency.
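The transcript only names the two parameters; a common back-of-the-envelope model built from them (an assumption, not spelled out in the slides) expresses utilization as the fraction of time spent executing rather than waiting, and shows how additional threads hide the latency:

```latex
% Assumed utilization model from R (run length) and L (latency):
% one thread executes for R cycles, then waits L cycles.
U_1 = \frac{R}{R+L}, \qquad
U_N = \min\!\left(1,\ \frac{N\,R}{R+L}\right) \quad \text{(N interleaved threads, ideal switching)}
```

For example, with R = 40 cycles and L = 200 cycles, a single thread keeps the processor busy only 40/240 = 1/6 of the time, and roughly six threads are needed to approach full utilization under this idealized model.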

  • Fault Detection via SMT: two redundant threads (R1 and R2) receive replicated inputs, and their outputs are compared. Outside the sphere of replication, memory is covered by ECC, the RAID array by parity, and ServerNet by CRC.
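A minimal software analogue of input replication plus output comparison is sketched below, purely as an illustration of the idea; real SRT hardware replicates and compares at instruction/store granularity, not per function call.

```c
/* Illustration only: software dual-modular redundancy.
 * The same input is handed to two redundant computations (R1 and R2);
 * their outputs are compared before any result is allowed to leave
 * the sphere of replication. */
#include <stdio.h>
#include <stdint.h>

static uint64_t compute(uint64_t input)      /* the protected computation */
{
    return input * 2654435761u + 12345;      /* arbitrary example work */
}

int main(void)
{
    uint64_t input = 42;                     /* input replication  */
    uint64_t r1 = compute(input);            /* redundant copy R1  */
    uint64_t r2 = compute(input);            /* redundant copy R2  */

    if (r1 != r2) {                          /* output comparison  */
        fprintf(stderr, "transient fault detected\n");
        return 1;
    }
    printf("checked result: %llu\n", (unsigned long long)r1);
    return 0;
}
```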

  • Simultaneous Multithreading (SMT)

    (Diagram: thread1 and thread2 share the functional units through a common instruction scheduler.)

  • Simultaneous and Redundantly Threaded Processor (SRT): SRT = SMT + fault detection, with less hardware than replicated microprocessors. SMT needs ~5% more hardware than a uniprocessor, and SRT adds very little hardware overhead to an existing SMT. + Better performance than complete replication (better use of resources). + Lower cost (avoids complete replication).

  • SRT Design Challenges: Lockstepping doesn't work, because SMT may issue the same instruction from the redundant threads in different cycles. Instructions from the redundant threads must be carefully fetched/scheduled around branch mispredictions and cache misses.

  • Transient Fault Detection in CMPs: CRT borrows the detection scheme from the SMT-based Simultaneously and Redundantly Threaded (SRT) processors and applies it to CMPs. It replicates execution in two communicating threads (the leading and trailing threads) and compares the results of the two.

    CRT executes the leading and trailing threads on different processors to achieve load balancing and to reduce the probability of a fault corrupting both threads

  • Transient Fault Detection in CMPs: detection is based on replication, but to what extent? CRT replicates register values (in the register file of each core) but not memory values.

    CRT's leading thread commits stores only after checking, so that memory is guaranteed to be correct.

    CRT compares only stores and uncached loads, but not register values, of the two threads.

    Because an incorrect value caused by a fault propagates through computations and is eventually consumed by a store, checking only stores suffices for detection; other instructions commit without checking.

    CRT uses a store buffer (StB) in which the leading thread places its committed store values and addresses. The store values and addresses of the trailing thread are compared against the StB entries to determine whether a fault has occurred. (Only the single checked store reaches the cache hierarchy.)
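The following is a minimal sketch of the StB check just described, under simplifying assumptions: a bounded FIFO and sequential leading/trailing phases stand in for the two threads running on different cores; names and sizes are illustrative, not CRT's actual hardware structures.

```c
/* Minimal sketch of CRT's store check (assumptions noted above):
 * the leading thread enqueues its committed stores (address, value)
 * into a store buffer (StB); the trailing thread compares each of its
 * stores against the oldest StB entry; only a checked store is
 * released to the cache hierarchy. */
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

#define STB_ENTRIES 64

typedef struct { uint64_t addr, value; } StbEntry;

static StbEntry stb[STB_ENTRIES];
static int stb_head = 0, stb_tail = 0;

/* Leading thread: commit a store into the StB without waiting. */
static void leading_commit_store(uint64_t addr, uint64_t value)
{
    stb[stb_tail % STB_ENTRIES] = (StbEntry){ addr, value };
    stb_tail++;
}

/* Trailing thread: compare its store against the leading thread's;
 * only on a match does the single checked store proceed to memory. */
static bool trailing_check_store(uint64_t addr, uint64_t value)
{
    StbEntry lead = stb[stb_head % STB_ENTRIES];
    stb_head++;
    if (lead.addr != addr || lead.value != value) {
        fprintf(stderr, "fault detected at addr %#llx\n",
                (unsigned long long)addr);
        return false;                 /* mismatch => transient fault */
    }
    /* release the checked store to the cache hierarchy here */
    return true;
}

int main(void)
{
    leading_commit_store(0x1000, 7);  /* leading thread runs ahead   */
    if (trailing_check_store(0x1000, 7))
        puts("store checked and released");
    return 0;
}
```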

  • Transient Fault Recovery for CMPs: Unlike CRT, CRTR must not allow any trailing instruction to commit before it is checked for faults, so that the register state of the trailing thread may be used for recovery.

    However, the leading thread in CRTR may commit register state before checking, as in CRT.

  • This asymmetric commit strategy allows CRTR to employ a long slack to absorb inter-processor latencies.

    As in CRT, CRTR commits stores only after checking.

    In addition to communicating branch outcomes, load addresses, load values, store addresses, and store values like CRT, CRTR also communicates register values.
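A sketch of CRTR's asymmetric commit follows, under the assumption (consistent with the description above) that the trailing thread's checked register state serves as the recovery point: the leading thread commits and forwards register values without waiting, while the trailing thread commits a register only after the comparison succeeds. All structure names here are illustrative.

```c
/* Sketch of CRTR's asymmetric commit, assuming a simple forwarded
 * register-value queue (the "slack"): the leading thread commits and
 * forwards values; the trailing thread checks each value before
 * committing it, so its architectural state is always a safe
 * recovery point. */
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

#define NREGS 32
#define SLACK 256                      /* long slack absorbs latency  */

static uint64_t fwd_val[SLACK];        /* values forwarded by leader  */
static int      fwd_reg[SLACK];
static int fwd_head = 0, fwd_tail = 0;

static uint64_t lead_regs[NREGS];      /* may hold unchecked values   */
static uint64_t trail_regs[NREGS];     /* checked: recovery point     */

/* Leading thread: commit immediately, then forward the value. */
static void leading_commit(int reg, uint64_t value)
{
    lead_regs[reg] = value;
    fwd_reg[fwd_tail % SLACK] = reg;
    fwd_val[fwd_tail % SLACK] = value;
    fwd_tail++;
}

/* Trailing thread: commit only after the check succeeds. */
static bool trailing_commit(int reg, uint64_t value)
{
    int i = fwd_head % SLACK;
    fwd_head++;
    if (fwd_reg[i] != reg || fwd_val[i] != value) {
        /* Fault detected: recover by restoring the leading thread
         * from the trailing thread's checked register state. */
        for (int r = 0; r < NREGS; r++)
            lead_regs[r] = trail_regs[r];
        fprintf(stderr, "fault: rolled back to checked state\n");
        return false;
    }
    trail_regs[reg] = value;
    return true;
}

int main(void)
{
    leading_commit(3, 123);            /* leader runs ahead           */
    if (trailing_commit(3, 123))
        puts("register commit checked");
    return 0;
}
```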

  • Performance Evaluation: workloads are Forwarding (IP forward), Authentication (MD5), and Encryption (3DES), run on SS, FGMT, CMP, and SMT configurations. These workloads have little ILP; they need to exploit packet-level parallelism, and CMP and SMT do just that.

  • Systems must support some form of concurrent packet-level parallelism. SMT and CMP are nearly equivalent, with SMT always coming out ahead. We can see that SS and FGMT have similar performance, and CMP and SMT have similar performance; the latter two are scalable with increased parallelism.

  • Challenges with this approach

    I-Cache: instruction bandwidth. I-cache misses: since instructions are being fetched from many different contexts, instruction locality is degraded and the I-cache miss rate rises.

    Register file access time: register file access time increases because the register file has to grow significantly to accommodate many separate contexts. In fact, the HEP and Tera use SRAM to implement the register file, which means longer access times.

    Single-thread performance: single-thread performance is significantly degraded, since the context is forced to switch to a new thread even if none are available.

    Also needed: a very high bandwidth network, which is fast and wide, and retries on load-empty or store-full.

  • To maximize SMT performance: issue slots, functional units, and renaming registers.