chapter 2-1: cpus

31
Chapter 2-1: CPUs Soo-Ik Chae © 2007 Elsevier 1 Topics CPU metrics. Categories of CPUs. CPU mechanisms. High Performance Embedded Computing © 2007 Elsevier 2

Upload: others

Post on 10-Apr-2022

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Chapter 2-1: CPUs

Chapter 2-1: CPUs

Soo-Ik Chae

© 2007 Elsevier 1

Topicsp

CPU metrics.

Categories of CPUs.g

CPU mechanisms.

High Performance Embedded Computing© 2007 Elsevier 2

Page 2: Chapter 2-1: CPUs

Performance as a design metricg

Performance = speed:Latency.

Throughput.

Average vs. peak fperformance.

Worst-case and best-fcase performance.

High Performance Embedded Computing© 2007 Elsevier 3

Other metrics

Cost (area).

Energy and power.gy p

Predictability: important for embedded systemssystems

Pipelining: branch penalty.

Memory system (Cache) : cache miss penalty

Security: difficult to measure because of the yfact that we do not know of a successful attack.

High Performance Embedded Computing© 2007 Elsevier 4

Page 3: Chapter 2-1: CPUs

Flynn’s taxonomy of processorsy y p

Single-instruction single-data (SISD): RISC, etc.

Single-instruction multiple-data (SIMD): all processors perform the same operationsprocessors perform the same operations.

Multiple-instruction multiple-data (MIMD): homogeneo s or heterogeneo shomogeneous or heterogeneous multiprocessor.

Multiple-instruction multiple data (MISD).

High Performance Embedded Computing© 2007 Elsevier 5

Other axes of comparisonRISC.

Emphasis on software Si l l i l i t tiSingle-cycle, simple instructions Register to register: LOAD" and "STORE“ are independent instructions Low cycles per second,Large code sizes Spends more transistors on memory registersSpends more transistors on memory registers

CISC.Emphasis on hardware multi-cycle, complex instructions Memory-to-memory: LOAD" and "STORE“ incorporated in instructions High cycles per secondSmall code sizesTransistors used for storing complex instructions

High Performance Embedded Computing© 2007 Elsevier 6

Transistors used for storing complex instructions

Page 4: Chapter 2-1: CPUs

RISC CISC

1. 1-cycle simple instructions 1. multi-cycle complex instructions

2. only LD/ST can access memory 2. any instruction may access memory

3 designed around pipeline 3 designed around instn set3. designed around pipeline 3. designed around instn. set

4. instns. executed by h/w 4. instns interpreted by micro-program

5. Fixed format instns 5. variable format instns

6 Few instns and modes 6 Many instns and modes6. Few instns and modes 6. Many instns and modes

7. Complexity in the compiler 7. Complexity in the micro-program

8. Multiple register sets 8. Single register set

High Performance Embedded Computing© 2007 Elsevier 7

Other axes of comparisonp

Instruction issue widthInstruction issue width.Single issue

Multiple issue: higher performance high cost increasedMultiple issue: higher performance, high cost, increased power consumption

Scheduling for multiple-issue machines.Scheduling for multiple issue machines.Static scheduling: VLIW

Dynamic schedule: superscalary p

Vector processing: instruction for 1D or 2D arrays

Multithreading: a fine-grained concurrencyMultithreading: a fine grained concurrency mechanism that allows the processor to quickly switch between several threads of execution

High Performance Embedded Computing© 2007 Elsevier 8

Page 5: Chapter 2-1: CPUs

Embedded vs. general-purpose processorsg p p p

E b dd d b ti i d fEmbedded processors may be optimized for a category of applications.

Must be flexibleMust be flexibleCustomization may be narrow or broad.Billions of 8-bit processors sold each yearp y100s millions of 32-bit processors for embedded systems

We may judge embedded processors using different metrics:

Code size.M t fMemory system performance.Preditability.

High Performance Embedded Computing© 2007 Elsevier 9

ARM Processor Familyy

Processorfamily

# of pipeline stages

Memory organization

Clock Rate

MIPS/MHz

ARM6 3 Von Neumann 25 MHARM6 3 Von Neumann 25 MHz

ARM7 3 Von Neumann 66 MHz 0.9

ARM8 5 Von Neumann 72 MHz 1 2ARM8 5 Von Neumann 72 MHz 1.2

ARM9 5 Harvard 200 MHz 1.1

ARM10 6 Harvard 400 MHz 1.25ARM10 6 400 MHz 1.25

StrongARM 5 Harvard 233 MHz 1.15

ARM11 8 Von Neumann/ 550 MHz 1.2Harvard

High Performance Embedded Computing© 2007 Elsevier 10

Page 6: Chapter 2-1: CPUs

ARM Architecture Version Summary yCore Version Feature

ARM1020T v5T Improved ARM/ThumbARM1020T v5T Improved ARM/Thumb

Interworking

CLZ instruction for

improved divisionimproved division

ARM9E-S, ARM10TDMI, ARM1020E v5TE Extended multiplication

and saturated maths for

DSP lik f ti litDSP-like functionality

ARM7EJ-S, ARM926EJ-S, ARM1026EJ-S v5TEJ Jazelle Technology for

Java acceleration

ARM11, ARM1136J-S, v6 Low power needed

SIMD (Single Instruction

Multiple Data) media

processing extensions

J: Jazelle

S: Synthesizable F: integral vector floating point unit

E: Enhanced DSP instruction

High Performance Embedded Computing© 2007 Elsevier 11

S: Synthesizable F: integral vector floating point unit

ARM7 3-stage pipeline organizationg p p gOrganizations

Address generating block address register

A[31:0] control

g gAddress registerIncrementer

Register bank

incrementer

register

PC

PCg31-GPRs, 6-PSRs2 read, 1 write portsAdditional 1 read, multiply

instruction

decode

&

registerbank

AL

1 write port for PCBarrel shifterALU

control

barrelshifter

LU bus

A bus

B bus

register

ALUData register

Control logic

ALU

Control logicExternal interfaceInstruction decoderDatapath control

data out register

D[31:0]

data in register

High Performance Embedded Computing© 2007 Elsevier 12

Datapath control D[31:0]

Page 7: Chapter 2-1: CPUs

ARM7 3-stage Pipelineg p

fARM7 family has 3 stage pipeline

3 stage pipelineFetchFetch

Instruction fetch from memory

Decode

PC F D E

Instruction decoding

Datapath control signals

for the next cycle

PC+i F D Efor the next cycle

ExecuteReading registers

PC+2i F D EShift and ALU operations

Writing back to the register bank

PC+2i F D E

High Performance Embedded Computing© 2007 Elsevier 13

ARM7 multi-cycle instructionsy

fetch ADD decode execute1

fetch STR decode calc. addr.

fetchADD decode execute

2

3

data xfer

fetch ADD decode execute3

fetch ADD decode execute4

5 fetch ADD decode execute

i t titime

instruction

High Performance Embedded Computing© 2007 Elsevier 14

Page 8: Chapter 2-1: CPUs

ARM7 multi-cycle instructions

Branch LDR

y

Branch LDR

LDR F D E1calc

E2xfer

E3move

B F D E1calc

E2link

E3adjust calc xfer move

F D E

j

PC+i F

F D E

discarded

PC+2i F

F D E

discarded

T F D E

T+i F D E

High Performance Embedded Computing© 2007 Elsevier 15

ARM9TDMI core

LDR BranchF D E M WADD F D E M W

B F D E1 E2 E3 M WLDR F D E M W

FF D E M W

F

F D E M W

F D E M W

F D E M WSeparated cacheInstruction and data cacheare accessible at the same time

High Performance Embedded Computing© 2007 Elsevier 16

Page 9: Chapter 2-1: CPUs

ARM11 8 stage pipeline

Branch Prediction and Return StackBranch Prediction and Return Stack

Separate processing units for the ALU, MAC, and Load-Store (LS) instructions

lth h th i li i i l ialthough the pipeline is single issue

High Performance Embedded Computing© 2007 Elsevier 17

Feature Comparisonp

Feature ARM9E ARM10E XScale ARM11Feature ARM9E ARM10E XScale ARM11

Architecture ARMv5TE(J) ARMv5TE(J) ARMv5TE ARMv6

pipeline length 5 6 7~8 8

Java decode (ARM926EJ) (ARM1026EJ) No Yes

V6 SIMD instructions

No No No Yes

MIA instructions No No YesAvailable as coprocessor

Branch prediction No Static Dynamic Dynamic

Independent Load-store unit

No Yes Yes Yes

Instruction issue Scalar, in-order Scalar, in-order Scalar, in-order Scalar, in-order

Concurrency None ALU/MAC, LSU ALU, MAC, LSU ALU, MAC, LSU

Out-of-order completion

No Yes Yes Yes

Target Synthesizable Synthesizable Custom chip

Synthesizable and gimplementation

Synthesizable Synthesizable Custom chipy

Hardmacro

Performance range

Up to 250MHz Up to 325MHz200MHz ~

> 1GHz350MHz ~

> 1GHz

High Performance Embedded Computing© 2007 Elsevier 18

Page 10: Chapter 2-1: CPUs

MIPS32 processor familyp y

MIPS: MIPS32 4K has 5-stage pipeline; 4KE g p p ;family has DSP extension; 4KS is designed for securityfor security.

High Performance Embedded Computing© 2007 Elsevier 19

MIPS32 processor familyp y

High Performance Embedded Computing© 2007 Elsevier 20

Page 11: Chapter 2-1: CPUs

PowerPC processor familyp y

PowerPC: 400 series includes several embedded processors; MPD7410 is two-issue machine; 970FX has 16-stage pipelineissue machine; 970FX has 16 stage pipeline.

High Performance Embedded Computing© 2007 Elsevier 21

PowerPC processor familyp y

High Performance Embedded Computing© 2007 Elsevier 22

Page 12: Chapter 2-1: CPUs

PowerPC processor familyp y

High Performance Embedded Computing© 2007 Elsevier 23

What is DSP?

DSP = Digital Signal Processingg g gOR

DSP = Digital Signal Processor?

DSP used to denote both i b d d d f th t t i hi h thmeaning can be deduced from the context in which the

term DSP is used.What is a Digital Signal Processor (DSP)?g g ( )

Microprocessor specifically designed to perform fast DSP operations (e.g., Fast Fourier Transforms, inner products, Multiply & Accumulate)p y )

High Performance Embedded Computing© 2007 Elsevier 24

Page 13: Chapter 2-1: CPUs

DSP performancep

Wireless Systems requires more and more high y q gperformance and higher bandwidth

P fDSP performance

3GPerformance

~100,000MIPS384 2000 Kb

pmight not be enough for future applications

2.5G~10,000MIPS64-384 Kbps

384-2000 Kbps applications

2GBit Rate~100MIPS

8-13 Kbps

High Performance Embedded Computing© 2007 Elsevier 25

Digital signal processorsg g p

First DSP was AT&T DSP16:

Hardware multiply-accumulate unit.

Harvard architectureHarvard architecture.

Today, DSP is often used as a marketingused as a marketing term.

Modern DSPs areModern DSPs are heavily pipelined.

High Performance Embedded Computing© 2007 Elsevier 26

Page 14: Chapter 2-1: CPUs

TMS320C55x ™ DSP Generation, 16-bit Fi d P i t M t P Effi i t DSPFixed Point – Most Power Efficient DSP

Features ApplicationsSpecifications• C55x™ DSP core delivers 300 MHz for up to 600-MIPS performance

• Feature-rich, miniaturized per-

sonal and portable products

• Advanced automatic power management

• 1.6-volt core and 3.3-volt peripherals

sonal and portable products

• 2G, 2.5G and 3G cell phones

and basestations

• Digital audio players

• Configurable idle domains to extend your battery life

• Shortened debug for faster time-to-market

144 MH /200 MH l k t • Digital still cameras

• Electronic books

• Voice recognition

• GPS receivers

• 144-MHz/200-MHz clock rate

• 256-KB RAM, 64-KB ROM

• Three McBSPs, I 2 C, watchdog

timer, general-purpose timers

• Fingerprint/Pattern recognition

• Wireless modems

• Headsets

• Biometrics

• USB 2.0 full-speed (12 Mbps)

•10-bit ADC

•real-time clock (RTC)

High Performance Embedded Computing© 2007 Elsevier 27

TMS320C55x ™ DSP + RISC,16 bit Fi d P i t OMAP P16-bit Fixed Point – OMAP Processor

Features ApplicationsSpecifications150 MHz TI enhanced ARM925• Dual CPU processor integrating a

TMS320C55x™ DSP core and an ARM925TDMI™ RISC @150 MHz

• 1.8-volt core and 1.8-volt peripherals

• Internet appliances

• Applications processing

• Enhanced gaming

• Webpad

150-MHz TI-enhanced ARM925

• 16 KB instruction cache and 8 KB data cache

• Data and instruction MMUs

• 32-bit and 16-bit instruction sets • Webpad

• Point-of-sale

• Medical devices

• Industry-specific PDAs

• 32-bit and 16-bit instruction sets

150-MHz TMS320C55x™ DSP

• 12 KW (24 KB) instruction cache

• 80 KW (160 KB) SRAM

• Telematics

• Digital media processing

• Military and government cellular

• 16 KW (32 KB) ROM

• Two 16-bit memory interfaces

for SDRAM and flash

• Nine-channel system DMA

controller

• LCD controller

• USB 1.1 host and client

• MMC/SD card interface

• Seven serial ports plus three

UARTs, Nine timers, Keyboard interface

• Less than 250 mW at 1.6 V

High Performance Embedded Computing© 2007 Elsevier 28

Page 15: Chapter 2-1: CPUs

TMS320C62x ™ DSP Generation, 16-bit Fi d P i t Hi h P f DSPFixed Point – High Performance DSP

Features ApplicationsSpecifications• 16-bit fixed-point DSPs

• Up to 2400 MIPS

• Pooled modems

• Digital Subscriber Line (xDSL)

• C6000™ DSP Platform VelociTI™ advanced architecture

Up to 2400 MIPS

•Running at 300 Mhz

• Digital Subscriber Line (xDSL)

• Wireless basestations

• Central office switches

• Private Branch Exchange (PBX)

• Up to eight 32-bit instructions executed each cycle

• Eight independent, multi-purpose functional units thirty-two 32-bit registers

• Digital imaging

• Call processing

• 3D graphics

• Speech recognition

registers

• Industry’s most advanced C compiler and Assembly Optimizer maximize efficiency and performance

• Voice over packet

High Performance Embedded Computing© 2007 Elsevier 29

TMS320C67x ™ DSP Generation, 32-bit Fl ti P i t Hi h P f DSPFloating Point – High Performance DSP

Features ApplicationsSpecifications• 32-bit floating point DSPs

• Up to 1350 MFLOPS

• Pooled modems

• Digital Subscriber Line (xDSL)

• C6000™ DSP Platform VelociTI™ advanced architecture

Up to 1350 MFLOPS

•Running at 225 Mhz

• Digital Subscriber Line (xDSL)

• Wireless basestations

• Central office switches

• Private Branch Exchange (PBX)

• Up to eight 32-bit instructions executed each cycle

• Eight independent, multi-purpose functional units thirty-two 32-bit registers

• Digital imaging

• Call processing

• 3D graphics

• Speech recognition

registers

• Industry’s most advanced C compiler and Assembly Optimizer maximize efficiency and performance

• IEEE floating-point format

• Voice over packet• Up to 1350 MFLOPS at 225

• Two new multi-channel serial ports (McASP) (C6713 DSP) can support up to stereo channels of I2S (Inter IC Sound) and compatible with S/PDIF transmit protocol. Note I2S is a protocol for transmitting 2 channels of digital audio over a single serial connection

High Performance Embedded Computing© 2007 Elsevier 30

Page 16: Chapter 2-1: CPUs

TMS320C64x ™ DSP Generation, 16-bit Fi d P i t Hi h P f DSPFixed Point – High Performance DSP

Features ApplicationsSpecifications•16-bit fixed point processor

TMS320C64x DSP high per-

•DSL and pooled modems

•Basestation transceivers

• C6000™ DSP Platform VelociTI™ advanced architecture

TMS320C64x DSP high per

formance core provides scalable

performance of up to 1.1 GHz

• The industry’s fastest DSPs with

t 600 MH (4800 MIPS)

•Basestation transceivers

•Wireless LAN

•Enterprise PBX

•Multimedia gateway

• Up to eight 32-bit instructions executed each cycle

• Eight independent, multi-purpose functional units thirty-two 32-bit registers

up to 600 MHz (4800 MIPS) performance

• C64x DSPs are software compatible with TI’s C62x™ DSPs

•Broadband video transcoders

•Streaming video servers and clients

•Highspeed raster image processing (RIP)

registers

• Industry’s most advanced C compiler and Assembly Optimizer maximize efficiency and performance

p g ( )

High Performance Embedded Computing© 2007 Elsevier 31

Example: TI C5x DSPp

High Performance Embedded Computing© 2007 Elsevier 32

Page 17: Chapter 2-1: CPUs

Example: TI C5x DSPp

40-bit arithmetic unit 32-bit values with 8 guard bitsg

Barrel shifter.

17 x 17 multiplier17 x 17 multiplier.

Comparison unit for Viterbi encoding/decoding.

Single-cycle exponent encoder for wide-dynamic-range arithmetic.dy a c a ge a t et c

Two address generators.

High Performance Embedded Computing© 2007 Elsevier 33

TI C55x microarchitecture

High Performance Embedded Computing© 2007 Elsevier 34

Page 18: Chapter 2-1: CPUs

TI C55x co-processorsp

Designed to support Pixel interpolation

A U BMotion estimation

DCT/IDCT computation

I t l t U M R

A U B

Interpolates U, M, R values given A, B, C, D pixels

M R

pixels. C D

High Performance Embedded Computing© 2007 Elsevier 35

Fixed Point Vs Floating Pointg

Floating Point Fixed PointFloating Point Fixed Point

Applications

•Modems

Applications

•Portable Products

•Digital Subscriber Line (DSL)

•Wireless Basestations

•2G, 2.5G and 3G Cell Phones

•Digital Audio Players

Di it l Still C•Central Office Switches

•Private Branch Exchange (PBX)

•Digital Imaging

•Digital Still Cameras

•Electronic Books

•Voice Recognition•Digital Imaging

•3D Graphics

•Speech Recognition

Voice Recognition

•GPS Receivers

•Headsetsp g

•Voice over IP •Biometrics

•Fingerprint Recognition

High Performance Embedded Computing© 2007 Elsevier 36

Page 19: Chapter 2-1: CPUs

Simple VLIW architecturep

Powerful compilerA packet of instructionLarge register file with multiple ports feeds multiple function unitsfunction units.

E boxAdd 1 2 3 S b 4 5 6 Ld 7 f St 8 b NOP

Register file

Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP

Register file

ALU ALU Load/store Load/store FU

High Performance Embedded Computing© 2007 Elsevier 37

Clustered VLIW architecture

Register file, function units divided into clusters.

Cluster bus

Execution Execution

Register file Register file

High Performance Embedded Computing© 2007 Elsevier 38

Page 20: Chapter 2-1: CPUs

Parallelism extraction in VLIW

Static:Use compiler to analyze program

Dynamic:Use hardware to identify opportunitiesanalyze program.

Simpler CPU.

Can make use of high

identify opportunities.

More complex CPU.

Can make use of dataCan make use of high-level language constructs.

Can make use of data values.

Can’t depend on data values.

High Performance Embedded Computing© 2007 Elsevier 39

Motorola Starcore SC140

DALU i l d 4 ALU 1 i t filDALU includes 4 ALUs, 1 register file.AGU includes 2 address arithmetic units (AAU) 1 address register file(AAU), 1 address register file.Program sequencer and control unit (PSEQ).P fPerformance:

4 MACs per cycle.10 RISC MIPS MH l k10 RISC MIPS per MHz clock.

High Performance Embedded Computing© 2007 Elsevier 40

Page 21: Chapter 2-1: CPUs

SC140 Core

ProgramProgramsequencer Address

Register fileData

Register file

Powermgt

Clock/PLL2 AAUs BMU 4 ALUs

AGU DALU

High Performance Embedded Computing© 2007 Elsevier 41

Typical SC140 configurationyp g

Level 1 memory expansionRAM, ROM

DMA,CacheProgram sequencer Cache,

Interrupts,Level 2 mem,

Program sequencer

ALUALU AAUEtc.

ALUALUALU

AAU

peripherals

High Performance Embedded Computing© 2007 Elsevier 42

Page 22: Chapter 2-1: CPUs

Instruction format

16-bit instructions.

Up to 6 instructions per cycle.p p y

Instructions are grouped to define allowable simultaneous operationssimultaneous operations.

MACR –D0,D1,D7 AND D4,D5 MOVE.L (R0),+N0,R6 ADDA R2,R3

DALU AGUHigh Performance Embedded Computing

© 2007 Elsevier 43

DALU AGU

AGU addressingg

Allowable addressing modes:Linear: useful for general purpose addressing.g g

Modulo: useful for FIFO queues.

Reverse-carry: useful for FFT.Reverse carry: useful for FFT.

Automatic updating during register indirect.

StackStack.

Array addressing: base, offset, modifier i tregisters.

High Performance Embedded Computing© 2007 Elsevier 44

Page 23: Chapter 2-1: CPUs

TM-1 characteristics

27 function units

Characteristics5 RISC operations/sec5 RISC operations/sec

Floating point support

Sub-word parallelismSub-word parallelism support

Guarded operation (If Conversion)

Additional custom tioperations

High Performance Embedded Computing© 2007 Elsevier 45

TM 1 VLIW CPUTM-1 VLIW CPU

i filregister file

read/write crossbar

FU1 FU27...

slot 1 slot 2 slot 3 slot 4 slot 5

High Performance Embedded Computing© 2007 Elsevier 46

Page 24: Chapter 2-1: CPUs

Trimedia TM 1Trimedia TM-1

memory interface

video in video outvideo in

audio in

video out

audio out

I2C serial

timers

image co-p

VLD co-p

ge co p

PCIVLIW CPU

High Performance Embedded Computing© 2007 Elsevier 47

Superscalar processorsp p

Instructions are dynamically scheduled.Dependencies are checked at run time in hardware.

Used to some extent in embeddedUsed to some extent in embedded processors.

Embedded Pentium is two-issue in-orderEmbedded Pentium is two-issue in-order.

High Performance Embedded Computing© 2007 Elsevier 48

Page 25: Chapter 2-1: CPUs

SIMD and subword parallelismp

Many special-purpose SIMD machines.

Subword parallelism is widely used for video.p yALU is divided into subwords for independent operations on small operands.operations on small operands.

Vector processing is widely used for integer valuesvalues.

High Performance Embedded Computing© 2007 Elsevier 49

SIMD Extensions

High Performance Embedded Computing© 2007 Elsevier 50

Page 26: Chapter 2-1: CPUs

SIMD Extensions

High Performance Embedded Computing© 2007 Elsevier 51

Multithreadingg

Low-level parallelism mechanism.

Hardware multithreading alternately fetches g yinstructions from separate threads.

Simultaneous multithreading (SMT) fetchesSimultaneous multithreading (SMT) fetches instructions from several threads on each c clecycle.

High Performance Embedded Computing© 2007 Elsevier 52

Page 27: Chapter 2-1: CPUs

Processor Resource Utilization

Processor choice depends on program characteristics.

Leverage our knowledge of the core algorithms

Many researchers assume that multimediaMany researchers assume that multimedia algorithms exhibit high levels of parallelism.

Experiments with SimpleScalar shows that this isExperiments with SimpleScalar shows that this is not the case.

Most applications exhibit fewer than 4 IPCMost applications exhibit fewer than 4 IPC.

High Performance Embedded Computing© 2007 Elsevier 53

Available parallelism in multimedia li i (T ll l )applications (Talla et al.)

High Performance Embedded Computing© 2007 Elsevier 54

Page 28: Chapter 2-1: CPUs

Dynamic behavior of loops in y pMediaBench (Fritts)

Path ratio (instructions executed per iteration) / (total number of loop instructions)loop instructions).

M di B h h ll th tiMediaBench shows small path ratio -> considerable conditional behavior in loops.

High Performance Embedded Computing© 2007 Elsevier 55

Operand characteristics in MediaBench

More than 10

78%78%

High Performance Embedded Computing© 2007 Elsevier 56

Page 29: Chapter 2-1: CPUs

Operand characteristics in MediaBench

High Performance Embedded Computing© 2007 Elsevier 57

Operand characteristics in Video Codecs

High Performance Embedded Computing© 2007 Elsevier 58

Page 30: Chapter 2-1: CPUs

Dynamic voltage scaling (DVS)y g g ( )

P l ith V2Power scales with V2

while performance scales roughly as Vscales roughly as V.Reduce operating voltage, add parallelvoltage, add parallel operating units to make up for lower clock

dspeed.DVS doesn’t work in high leakagehigh-leakage processors.

High Performance Embedded Computing© 2007 Elsevier 59

Dynamic voltage and frequency scaling y g q y g(DVFS)

Scale both voltage and clock frequency.

Can use control algorithms to match

f tperformance to application, reduce powerpower.

High Performance Embedded Computing© 2007 Elsevier 60

Page 31: Chapter 2-1: CPUs

Razor architecture

Used specialized latch to detect errors.

Recovers only on errors, gains average-case

fperformance.

High Performance Embedded Computing© 2007 Elsevier 61

Razor architecture

Used specialized latch to detect errors.

Recovers only on errors, gains average-case

fperformance.

High Performance Embedded Computing© 2007 Elsevier 62