chapter 2-1: cpus

Chapter 2-1: CPUs

Soo-Ik Chae

© 2007 Elsevier 1

Topicsp

CPU metrics.

Categories of CPUs.g

CPU mechanisms.

High Performance Embedded Computing© 2007 Elsevier 2

Performance as a design metricg

Performance = speed:Latency.

Throughput.

Average vs. peak fperformance.

Worst-case and best-fcase performance.


Other metrics

Cost (area).

Energy and power.gy p

Predictability: important for embedded systemssystems

Pipelining: branch penalty.

Memory system (Cache) : cache miss penalty

Security: difficult to measure because of the yfact that we do not know of a successful attack.


Flynn’s taxonomy of processorsy y p

Single-instruction single-data (SISD): RISC, etc.

Single-instruction multiple-data (SIMD): all processors perform the same operationsprocessors perform the same operations.

Multiple-instruction multiple-data (MIMD): homogeneo s or heterogeneo shomogeneous or heterogeneous multiprocessor.

Multiple-instruction multiple data (MISD).


Other axes of comparisonRISC.

Emphasis on software Si l l i l i t tiSingle-cycle, simple instructions Register to register: LOAD" and "STORE“ are independent instructions Low cycles per second,Large code sizes Spends more transistors on memory registersSpends more transistors on memory registers

CISC.Emphasis on hardware multi-cycle, complex instructions Memory-to-memory: LOAD" and "STORE“ incorporated in instructions High cycles per secondSmall code sizesTransistors used for storing complex instructions


Transistors used for storing complex instructions

RISC CISC

1. 1-cycle simple instructions 1. multi-cycle complex instructions

2. only LD/ST can access memory 2. any instruction may access memory

3 designed around pipeline 3 designed around instn set3. designed around pipeline 3. designed around instn. set

4. instns. executed by h/w 4. instns interpreted by micro-program

5. Fixed format instns 5. variable format instns

6 Few instns and modes 6 Many instns and modes6. Few instns and modes 6. Many instns and modes

7. Complexity in the compiler 7. Complexity in the micro-program

8. Multiple register sets 8. Single register set


Other axes of comparisonp

Instruction issue widthInstruction issue width.Single issue

Multiple issue: higher performance high cost increasedMultiple issue: higher performance, high cost, increased power consumption

Scheduling for multiple-issue machines.Scheduling for multiple issue machines.Static scheduling: VLIW

Dynamic schedule: superscalary p

Vector processing: instruction for 1D or 2D arrays

Multithreading: a fine-grained concurrencyMultithreading: a fine grained concurrency mechanism that allows the processor to quickly switch between several threads of execution


Embedded vs. general-purpose processorsg p p p

E b dd d b ti i d fEmbedded processors may be optimized for a category of applications.

Must be flexibleMust be flexibleCustomization may be narrow or broad.Billions of 8-bit processors sold each yearp y100s millions of 32-bit processors for embedded systems

We may judge embedded processors using different metrics:

Code size.M t fMemory system performance.Preditability.


ARM Processor Familyy

Processorfamily

# of pipeline stages

Memory organization

Clock Rate

MIPS/MHz

ARM6 3 Von Neumann 25 MHARM6 3 Von Neumann 25 MHz

ARM7 3 Von Neumann 66 MHz 0.9

ARM8 5 Von Neumann 72 MHz 1 2ARM8 5 Von Neumann 72 MHz 1.2

ARM9 5 Harvard 200 MHz 1.1

ARM10 6 Harvard 400 MHz 1.25ARM10 6 400 MHz 1.25

StrongARM 5 Harvard 233 MHz 1.15

ARM11 8 Von Neumann/ 550 MHz 1.2Harvard


ARM Architecture Version Summary yCore Version Feature

ARM1020T v5T Improved ARM/ThumbARM1020T v5T Improved ARM/Thumb

Interworking

CLZ instruction for

improved divisionimproved division

ARM9E-S, ARM10TDMI, ARM1020E v5TE Extended multiplication

and saturated maths for

DSP lik f ti litDSP-like functionality

ARM7EJ-S, ARM926EJ-S, ARM1026EJ-S v5TEJ Jazelle Technology for

Java acceleration

ARM11, ARM1136J-S, v6 Low power needed

SIMD (Single Instruction

Multiple Data) media

processing extensions

J: Jazelle

S: Synthesizable F: integral vector floating point unit

E: Enhanced DSP instruction


S: Synthesizable F: integral vector floating point unit

ARM7 3-stage pipeline organizationg p p gOrganizations

Address generating block address register

A[31:0] control

g gAddress registerIncrementer

Register bank

incrementer

register

PC

PCg31-GPRs, 6-PSRs2 read, 1 write portsAdditional 1 read, multiply

instruction

decode

&

registerbank

AL

1 write port for PCBarrel shifterALU

control

barrelshifter

LU bus

A bus

B bus

register

ALUData register

Control logic

ALU

Control logicExternal interfaceInstruction decoderDatapath control

data out register

D[31:0]

data in register


Datapath control D[31:0]

ARM7 3-stage Pipelineg p

fARM7 family has 3 stage pipeline

3 stage pipelineFetchFetch

Instruction fetch from memory

Decode

PC F D E

Instruction decoding

Datapath control signals

for the next cycle

PC+i F D Efor the next cycle

ExecuteReading registers

PC+2i F D EShift and ALU operations

Writing back to the register bank

PC+2i F D E


ARM7 multi-cycle instructionsy

fetch ADD decode execute1

fetch STR decode calc. addr.

fetchADD decode execute

2

3

data xfer



5 fetch ADD decode execute

i t titime

instruction


ARM7 multi-cycle instructions

Branch LDR

y

Branch LDR

LDR F D E1calc

E2xfer

E3move

B F D E1calc

E2link

E3adjust calc xfer move

F D E

j

PC+i F

F D E

discarded

PC+2i F

F D E

discarded

T F D E

T+i F D E


ARM9TDMI core

LDR BranchF D E M WADD F D E M W

B F D E1 E2 E3 M WLDR F D E M W

FF D E M W

F

F D E M W

F D E M W

F D E M WSeparated cacheInstruction and data cacheare accessible at the same time


ARM11 8 stage pipeline

Branch Prediction and Return StackBranch Prediction and Return Stack

Separate processing units for the ALU, MAC, and Load-Store (LS) instructions

lth h th i li i i l ialthough the pipeline is single issue


Feature Comparisonp

Feature ARM9E ARM10E XScale ARM11Feature ARM9E ARM10E XScale ARM11

Architecture ARMv5TE(J) ARMv5TE(J) ARMv5TE ARMv6

pipeline length 5 6 7~8 8

Java decode (ARM926EJ) (ARM1026EJ) No Yes

V6 SIMD instructions

No No No Yes

MIA instructions No No YesAvailable as coprocessor

Branch prediction No Static Dynamic Dynamic

Independent Load-store unit

No Yes Yes Yes

Instruction issue Scalar, in-order Scalar, in-order Scalar, in-order Scalar, in-order

Concurrency None ALU/MAC, LSU ALU, MAC, LSU ALU, MAC, LSU

Out-of-order completion

No Yes Yes Yes

Target Synthesizable Synthesizable Custom chip

Synthesizable and gimplementation

Synthesizable Synthesizable Custom chipy

Hardmacro

Performance range

Up to 250MHz Up to 325MHz200MHz ~

> 1GHz350MHz ~

> 1GHz


MIPS32 processor familyp y

MIPS: MIPS32 4K has 5-stage pipeline; 4KE g p p ;family has DSP extension; 4KS is designed for securityfor security.


MIPS32 processor familyp y


PowerPC processor familyp y

PowerPC: 400 series includes several embedded processors; MPD7410 is two-issue machine; 970FX has 16-stage pipelineissue machine; 970FX has 16 stage pipeline.






What is DSP?

DSP = Digital Signal Processingg g gOR

DSP = Digital Signal Processor?

DSP used to denote both i b d d d f th t t i hi h thmeaning can be deduced from the context in which the

term DSP is used.What is a Digital Signal Processor (DSP)?g g ( )

Microprocessor specifically designed to perform fast DSP operations (e.g., Fast Fourier Transforms, inner products, Multiply & Accumulate)p y )


DSP performancep

Wireless Systems requires more and more high y q gperformance and higher bandwidth

P fDSP performance

3GPerformance

~100,000MIPS384 2000 Kb

pmight not be enough for future applications

2.5G~10,000MIPS64-384 Kbps

384-2000 Kbps applications

2GBit Rate~100MIPS

8-13 Kbps


Digital signal processorsg g p

First DSP was AT&T DSP16:

Hardware multiply-accumulate unit.

Harvard architectureHarvard architecture.

Today, DSP is often used as a marketingused as a marketing term.

Modern DSPs areModern DSPs are heavily pipelined.


TMS320C55x ™ DSP Generation, 16-bit Fi d P i t M t P Effi i t DSPFixed Point – Most Power Efficient DSP

Features ApplicationsSpecifications• C55x™ DSP core delivers 300 MHz for up to 600-MIPS performance

• Feature-rich, miniaturized per-

sonal and portable products

• Advanced automatic power management

• 1.6-volt core and 3.3-volt peripherals

sonal and portable products

• 2G, 2.5G and 3G cell phones

and basestations

• Digital audio players

• Configurable idle domains to extend your battery life

• Shortened debug for faster time-to-market

144 MH /200 MH l k t • Digital still cameras

• Electronic books

• Voice recognition

• GPS receivers

• 144-MHz/200-MHz clock rate

• 256-KB RAM, 64-KB ROM

• Three McBSPs, I 2 C, watchdog

timer, general-purpose timers

• Fingerprint/Pattern recognition

• Wireless modems

• Headsets

• Biometrics

• USB 2.0 full-speed (12 Mbps)

•10-bit ADC

•real-time clock (RTC)


TMS320C55x ™ DSP + RISC,16 bit Fi d P i t OMAP P16-bit Fixed Point – OMAP Processor

Features ApplicationsSpecifications150 MHz TI enhanced ARM925• Dual CPU processor integrating a

TMS320C55x™ DSP core and an ARM925TDMI™ RISC @150 MHz

• 1.8-volt core and 1.8-volt peripherals

• Internet appliances

• Applications processing

• Enhanced gaming

• Webpad

150-MHz TI-enhanced ARM925

• 16 KB instruction cache and 8 KB data cache

• Data and instruction MMUs

• 32-bit and 16-bit instruction sets • Webpad

• Point-of-sale

• Medical devices

• Industry-specific PDAs

• 32-bit and 16-bit instruction sets

150-MHz TMS320C55x™ DSP

• 12 KW (24 KB) instruction cache

• 80 KW (160 KB) SRAM

• Telematics

• Digital media processing

• Military and government cellular

• 16 KW (32 KB) ROM

• Two 16-bit memory interfaces

for SDRAM and flash

• Nine-channel system DMA

controller

• LCD controller

• USB 1.1 host and client

• MMC/SD card interface

• Seven serial ports plus three

UARTs, Nine timers, Keyboard interface

• Less than 250 mW at 1.6 V


TMS320C62x ™ DSP Generation, 16-bit Fi d P i t Hi h P f DSPFixed Point – High Performance DSP

Features ApplicationsSpecifications• 16-bit fixed-point DSPs

• Up to 2400 MIPS

• Pooled modems

• Digital Subscriber Line (xDSL)

• C6000™ DSP Platform VelociTI™ advanced architecture

Up to 2400 MIPS

•Running at 300 Mhz


• Wireless basestations

• Central office switches

• Private Branch Exchange (PBX)

• Up to eight 32-bit instructions executed each cycle

• Eight independent, multi-purpose functional units thirty-two 32-bit registers

• Digital imaging

• Call processing

• 3D graphics

• Speech recognition

registers

• Industry’s most advanced C compiler and Assembly Optimizer maximize efficiency and performance

• Voice over packet


TMS320C67x ™ DSP Generation, 32-bit Fl ti P i t Hi h P f DSPFloating Point – High Performance DSP

Features ApplicationsSpecifications• 32-bit floating point DSPs

• Up to 1350 MFLOPS

• Pooled modems



Up to 1350 MFLOPS

•Running at 225 Mhz


• Wireless basestations

• Central office switches

• Private Branch Exchange (PBX)



• Digital imaging

• Call processing

• 3D graphics

• Speech recognition

registers


• IEEE floating-point format

• Voice over packet• Up to 1350 MFLOPS at 225

• Two new multi-channel serial ports (McASP) (C6713 DSP) can support up to stereo channels of I2S (Inter IC Sound) and compatible with S/PDIF transmit protocol. Note I2S is a protocol for transmitting 2 channels of digital audio over a single serial connection


TMS320C64x ™ DSP Generation, 16-bit Fi d P i t Hi h P f DSPFixed Point – High Performance DSP

Features ApplicationsSpecifications•16-bit fixed point processor

TMS320C64x DSP high per-

•DSL and pooled modems

•Basestation transceivers


TMS320C64x DSP high per

formance core provides scalable

performance of up to 1.1 GHz

• The industry’s fastest DSPs with

t 600 MH (4800 MIPS)

•Basestation transceivers

•Wireless LAN

•Enterprise PBX

•Multimedia gateway



up to 600 MHz (4800 MIPS) performance

• C64x DSPs are software compatible with TI’s C62x™ DSPs

•Broadband video transcoders

•Streaming video servers and clients

•Highspeed raster image processing (RIP)

registers


p g ( )


Example: TI C5x DSPp


Example: TI C5x DSPp

40-bit arithmetic unit 32-bit values with 8 guard bitsg

Barrel shifter.

17 x 17 multiplier17 x 17 multiplier.

Comparison unit for Viterbi encoding/decoding.

Single-cycle exponent encoder for wide-dynamic-range arithmetic.dy a c a ge a t et c

Two address generators.


TI C55x microarchitecture


TI C55x co-processorsp

Designed to support Pixel interpolation

A U BMotion estimation

DCT/IDCT computation

I t l t U M R

A U B

Interpolates U, M, R values given A, B, C, D pixels

M R

pixels. C D


Fixed Point Vs Floating Pointg

Floating Point Fixed PointFloating Point Fixed Point

Applications

•Modems

Applications

•Portable Products

•Digital Subscriber Line (DSL)

•Wireless Basestations

•2G, 2.5G and 3G Cell Phones

•Digital Audio Players

Di it l Still C•Central Office Switches

•Private Branch Exchange (PBX)

•Digital Imaging

•Digital Still Cameras

•Electronic Books

•Voice Recognition•Digital Imaging

•3D Graphics

•Speech Recognition

Voice Recognition

•GPS Receivers

•Headsetsp g

•Voice over IP •Biometrics

•Fingerprint Recognition


Simple VLIW architecturep

Powerful compilerA packet of instructionLarge register file with multiple ports feeds multiple function unitsfunction units.

E boxAdd 1 2 3 S b 4 5 6 Ld 7 f St 8 b NOP

Register file

Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP

Register file

ALU ALU Load/store Load/store FU


Clustered VLIW architecture

Register file, function units divided into clusters.

Cluster bus

Execution Execution

Register file Register file


Parallelism extraction in VLIW

Static:Use compiler to analyze program

Dynamic:Use hardware to identify opportunitiesanalyze program.

Simpler CPU.

Can make use of high

identify opportunities.

More complex CPU.

Can make use of dataCan make use of high-level language constructs.

Can make use of data values.

Can’t depend on data values.


Motorola Starcore SC140

DALU i l d 4 ALU 1 i t filDALU includes 4 ALUs, 1 register file.AGU includes 2 address arithmetic units (AAU) 1 address register file(AAU), 1 address register file.Program sequencer and control unit (PSEQ).P fPerformance:

4 MACs per cycle.10 RISC MIPS MH l k10 RISC MIPS per MHz clock.


SC140 Core

ProgramProgramsequencer Address

Register fileData

Register file

Powermgt

Clock/PLL2 AAUs BMU 4 ALUs

AGU DALU


Typical SC140 configurationyp g

Level 1 memory expansionRAM, ROM

DMA,CacheProgram sequencer Cache,

Interrupts,Level 2 mem,

Program sequencer

ALUALU AAUEtc.

ALUALUALU

AAU

peripherals


Instruction format

16-bit instructions.

Up to 6 instructions per cycle.p p y

Instructions are grouped to define allowable simultaneous operationssimultaneous operations.

MACR –D0,D1,D7 AND D4,D5 MOVE.L (R0),+N0,R6 ADDA R2,R3

DALU AGUHigh Performance Embedded Computing

© 2007 Elsevier 43

DALU AGU

AGU addressingg

Allowable addressing modes:Linear: useful for general purpose addressing.g g

Modulo: useful for FIFO queues.

Reverse-carry: useful for FFT.Reverse carry: useful for FFT.

Automatic updating during register indirect.

StackStack.

Array addressing: base, offset, modifier i tregisters.


TM-1 characteristics

27 function units

Characteristics5 RISC operations/sec5 RISC operations/sec

Floating point support

Sub-word parallelismSub-word parallelism support

Guarded operation (If Conversion)

Additional custom tioperations


TM 1 VLIW CPUTM-1 VLIW CPU

i filregister file

read/write crossbar

FU1 FU27...

slot 1 slot 2 slot 3 slot 4 slot 5


Trimedia TM 1Trimedia TM-1

memory interface

video in video outvideo in

audio in

video out

audio out

I2C serial

timers

image co-p

VLD co-p

ge co p

PCIVLIW CPU


Superscalar processorsp p

Instructions are dynamically scheduled.Dependencies are checked at run time in hardware.

Used to some extent in embeddedUsed to some extent in embedded processors.

Embedded Pentium is two-issue in-orderEmbedded Pentium is two-issue in-order.


SIMD and subword parallelismp

Many special-purpose SIMD machines.

Subword parallelism is widely used for video.p yALU is divided into subwords for independent operations on small operands.operations on small operands.

Vector processing is widely used for integer valuesvalues.


SIMD Extensions


SIMD Extensions


Multithreadingg

Low-level parallelism mechanism.

Hardware multithreading alternately fetches g yinstructions from separate threads.

Simultaneous multithreading (SMT) fetchesSimultaneous multithreading (SMT) fetches instructions from several threads on each c clecycle.


Processor Resource Utilization

Processor choice depends on program characteristics.

Leverage our knowledge of the core algorithms

Many researchers assume that multimediaMany researchers assume that multimedia algorithms exhibit high levels of parallelism.

Experiments with SimpleScalar shows that this isExperiments with SimpleScalar shows that this is not the case.

Most applications exhibit fewer than 4 IPCMost applications exhibit fewer than 4 IPC.


Available parallelism in multimedia li i (T ll l )applications (Talla et al.)


Dynamic behavior of loops in y pMediaBench (Fritts)

Path ratio (instructions executed per iteration) / (total number of loop instructions)loop instructions).

M di B h h ll th tiMediaBench shows small path ratio -> considerable conditional behavior in loops.


Operand characteristics in MediaBench

More than 10

78%78%


Operand characteristics in MediaBench


Operand characteristics in Video Codecs


Dynamic voltage scaling (DVS)y g g ( )

P l ith V2Power scales with V2

while performance scales roughly as Vscales roughly as V.Reduce operating voltage, add parallelvoltage, add parallel operating units to make up for lower clock

dspeed.DVS doesn’t work in high leakagehigh-leakage processors.


Dynamic voltage and frequency scaling y g q y g(DVFS)

Scale both voltage and clock frequency.

Can use control algorithms to match

f tperformance to application, reduce powerpower.


Razor architecture

Used specialized latch to detect errors.

Recovers only on errors, gains average-case

fperformance.


Razor architecture

Used specialized latch to detect errors.

Recovers only on errors, gains average-case

fperformance.


chapter 2-1: cpus

Documents