Posted on 21-Dec-2015
EECS 470 Computer Architecture
Lecture 2
Coverage: Chapters 1-2
rev 2 2
A Quantitative Approach
• Hardware system performance is generally easy to quantify
  – Machine A is 10% faster than Machine B
  – Of course, Machine B's advertising will draw the opposite conclusion
  – Example: Pentium 4 vs. AMD Hammer
• Many software systems tend to have much more subjective performance evaluations.
Measuring Performance
• Use total execution time:
  – A is 3 times faster than B for programs P1, P2
  – Issue: emphasizes long-running programs

  AM = (1/n) × Σ_{i=1..n} Time_i
Measuring Performance
• Weighted execution time:
  – What if P1 is executed far more frequently?

  Weighted arithmetic mean (AM) = Σ_{i=1..n} Weight_i × Time_i,  where Σ_{i=1..n} Weight_i = 1
Measuring Performance
• Normalized execution time:
  – Compare machine performance to a reference machine and report a ratio.
    • SPEC ratings measure performance relative to a reference machine.
Example using execution times
         CompA   CompB
Prog1        1      10
Prog2     1000     100
Total     1001     110

Conclusion: B is faster than A.
It is 1001/110 = 9.1 times faster.
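The totals can be checked with a quick sketch (Python, using the table's numbers):

```python
# Execution times (seconds) from the table above
times_a = {"Prog1": 1, "Prog2": 1000}
times_b = {"Prog1": 10, "Prog2": 100}

total_a = sum(times_a.values())  # 1001
total_b = sum(times_b.values())  # 110
speedup = total_a / total_b      # B is 9.1x faster on total time
```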
Averaging Performance Over Benchmarks
• Arithmetic mean (AM) = (1/n) × Σ_{i=1..n} Time_i
• Geometric mean (GM) = ( Π_{i=1..n} Time_i )^(1/n)
• Harmonic mean (HM) = n / Σ_{i=1..n} (1/Rate_i)
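The three means can be written directly from the formulas above; a minimal Python sketch:

```python
from math import prod

def arithmetic_mean(times):
    # AM = (1/n) * sum of the times
    return sum(times) / len(times)

def geometric_mean(values):
    # GM = n-th root of the product
    return prod(values) ** (1 / len(values))

def harmonic_mean(rates):
    # HM = n / sum of the reciprocal rates
    return len(rates) / sum(1 / r for r in rates)
```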
Which is the right Mean?
• Arithmetic mean when dealing with execution times
• Harmonic mean when dealing with rates
  – flops
  – MIPS
  – Hertz
• Geometric mean gives an "equi-weighted" average
Use Harmonic Mean with Rates
Execution times (s); each program performs 100 million flops:

            Mflops   CompA   CompB   CompC
Prog1          100       1      10      20
Prog2          100    1000     100      20
Total time            1001     110      40

Rates (Mflops/s) from the table above:

           CompA   CompB   CompC
Prog1        100      10       5
Prog2        0.1       1       5
AM         50.05     5.5       5
GM           3.2     3.2       5
HM           0.2     1.8       5

Notice that the total-time ordering is preserved in the HM of the rates.
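That claim can be verified with the table's numbers (a Python sketch):

```python
# Each program performs 100 Mflops; times in seconds from the table
times = {"CompA": [1, 1000], "CompB": [10, 100], "CompC": [20, 20]}
rates = {m: [100 / t for t in ts] for m, ts in times.items()}  # Mflops/s

total_time = {m: sum(ts) for m, ts in times.items()}
hm = {m: len(rs) / sum(1 / r for r in rs) for m, rs in rates.items()}

# Smallest total time corresponds to largest harmonic mean of rates
by_time = sorted(times, key=lambda m: total_time[m])
by_hm = sorted(rates, key=lambda m: hm[m], reverse=True)
```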
Normalized Times
• Don't take the AM of normalized execution times

                        Normalized to A    Normalized to B
       CompA   CompB      A       B          A       B
Prog1      1      10      1      10        0.1       1
Prog2   1000     100      1     0.1         10       1
AM     500.5    55.0      1    5.05       5.05       1
GM      31.6    31.6      1       1          1       1

• Which one? The AM of the normalized times makes each machine look 5.05 times slower than the other.
• The GM doesn't track total execution time – see the last line.
Notes & Benchmarks
• AM ≥ GM
• GM(Xi) / GM(Yi) = GM(Xi / Yi)
• The GM is unaffected by normalizing – it just doesn't track execution time
  – Why does SPEC use it?
• SPEC – System Performance Evaluation Cooperative
  – http://www.specbench.org/
• EEMBC – Embedded Microprocessor Benchmark Consortium; benchmarks for embedded applications
  – http://www.eembc.org/
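The GM identity, and why the AM of normalized times misleads, can be checked numerically (Python, using the Prog1/Prog2 times from the earlier slides):

```python
from math import prod

def am(xs):
    return sum(xs) / len(xs)

def gm(xs):
    return prod(xs) ** (1 / len(xs))

a = [1, 1000]   # times on CompA
b = [10, 100]   # times on CompB

# GM(Xi) / GM(Yi) == GM(Xi / Yi): normalizing doesn't change GM ratios
gm_ratio = gm(a) / gm(b)
gm_of_ratios = gm([x / y for x, y in zip(a, b)])

# AM of normalized times is contradictory: each machine looks 5.05x slower
am_a_over_b = am([x / y for x, y in zip(a, b)])  # 5.05
am_b_over_a = am([y / x for x, y in zip(a, b)])  # 5.05
```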
rev 2 12
Amdahl’s Law
• Rule of Thumb: Make the common case faster
Execution time_new = Execution time_old × [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
(Attack the longest-running part until it is no longer the longest; repeat)
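A sketch of the law as a function (Python); overall speedup is old time divided by new time:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of execution time is sped up."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Speeding up 90% of the time by 10x gives only ~5.3x overall, not 10x
overall = amdahl_speedup(0.9, 10)
# Even an enormous enhancement is capped near 1 / (1 - fraction) = 10x here
cap = amdahl_speedup(0.9, 1e12)
```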
Instruction Set Design
• Software Systems: named variables; complex semantics.
• Hardware systems: tight timing requirements; small storage structures; simple semantics
• Instruction set: the interface between very different software and hardware systems
Design decisions
• How much "state" is in the microarchitecture?
  – Registers; flags; IP/PC
• How is that state accessed/manipulated?
  – Operand encoding
• What commands are supported?
  – Opcode; opcode encoding
Design Challenges: or why is architecture still relevant?
• Clock frequency is increasing
  – This changes the number of levels of gates that can be completed each cycle, so old designs don't work.
  – It also tends to increase the ratio of time spent on wires (fixed speed of light).
• Power
  – Faster chips are hotter; bigger chips are hotter.
Design Challenges (cont)
• Design complexity
  – More complex designs to fix frequency/power issues lead to increased development/testing costs
  – Failures (design or transient) can be difficult to understand (and fix)
• We seem far less willing to live with hardware errors (e.g. FDIV) than software errors – which are often dealt with through upgrades that we pay for!
Techniques for Encoding Operands
• Explicit operands:
  – Include a field to specify which state data is referenced
  – Example: register specifier
• Implicit operands:
  – All state data can be inferred from the opcode
  – Example: function return (CISC-style)
Accumulator
• Architectures with one implicit register
  – Acts as source and/or destination
  – One other source is explicit
• Example: C = A + B
  – Load A     // Acc ← A
  – Add B      // Acc ← Acc + B
  – Store C    // C ← Acc

Ref: "Instruction Level Distributed Processing: Adapting to Shifting Technology"
Stack
• Architectures with an implicit "stack"
  – Acts as source(s) and/or destination
  – Push and Pop operations have one explicit operand
• Example: C = A + B
  – Push A    // Stack = {A}
  – Push B    // Stack = {A, B}
  – Add       // Stack = {A+B}
  – Pop C     // C ← A+B ; Stack = {}
• Compact encoding, though it may require more instructions
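The stack example can be simulated with a tiny interpreter (a Python sketch; the op names and tuple encoding are illustrative assumptions, not a real ISA):

```python
def run_stack(program, memory):
    """Execute a list of (op, operand...) tuples on a stack machine."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(memory[args[0]])    # push a memory value
        elif op == "add":
            b, a = stack.pop(), stack.pop()  # both operands are implicit
            stack.append(a + b)
        elif op == "pop":
            memory[args[0]] = stack.pop()    # store result to memory
    return memory

# C = A + B, exactly as on the slide
mem = {"A": 2, "B": 3}
run_stack([("push", "A"), ("push", "B"), ("add",), ("pop", "C")], mem)
```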
Registers
• Most general (and common) approach
  – Small array of storage
  – Explicit operands (register file index)
• Example: C = A + B

  Register-memory      Load/store
  Load R1, A           Load R1, A
                       Load R2, B
  Add R3, R1, B        Add R3, R1, R2
  Store R3, C          Store R3, C
Memory
• Big array of storage
  – More complex ways of indexing than registers
• Build addressing modes to support efficient translation of software abstractions
• Uses less space in the instruction than a 32-bit immediate field

  A[i]:   use base (A) + displacement (i) (scaled?)
  a.ptr:  use base (ptr) + displacement (a)
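The effective-address arithmetic for A[i] looks like the following sketch (Python; the base address and word size are assumptions for illustration):

```python
# Base + scaled displacement for A[i], assuming 4-byte words
base_of_A = 0x1000      # hypothetical address of A[0]
word_size = 4           # scale factor: bytes per array element
i = 3
effective_addr = base_of_A + i * word_size  # address of A[3]
```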
rev 2 22
Addressing modes
Register            Add R4, R3
Immediate           Add R4, #3
Base/Displacement   Add R4, 100(R1)
Register Indirect   Add R4, (R1)
Indexed             Add R4, (R1+R2)
Direct              Add R4, (1001)
Memory Indirect     Add R4, @(R3)
Autoincrement       Add R4, (R2)+
Other Memory Issues
What is the size of each element in memory?

  Byte      (addressed from 0x000): range 0–255
  Half word (addressed from 0x000): range 0–65535
  Word      (addressed from 0x000): range 0–~4 billion
Other Memory Issues
Big-endian or little-endian? Store 0x114488FF

  Big-endian (address points to the most significant byte):
    0x000: 11   0x001: 44   0x002: 88   0x003: FF

  Little-endian (address points to the least significant byte):
    0x000: FF   0x001: 88   0x002: 44   0x003: 11
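Python's struct module shows both byte orders for the value on the slide:

```python
import struct

value = 0x114488FF
big = struct.pack(">I", value)     # most significant byte at the lowest address
little = struct.pack("<I", value)  # least significant byte at the lowest address
```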
Other Memory Issues
Non-word loads? ldb R3, (000)

  Memory: 0x000: 11, 0x001: 44, 0x002: 88, 0x003: FF
  R3 after the load: 00 00 00 11
Other Memory Issues
Non-word loads? ldb R3, (003)

  Memory: 0x000: 11, 0x001: 44, 0x002: 88, 0x003: FF
  R3 after the load: FF FF FF FF   (sign extended)
Other Memory Issues
Non-word loads? ldbu R3, (003)

  Memory: 0x000: 11, 0x001: 44, 0x002: 88, 0x003: FF
  R3 after the load: 00 00 00 FF   (zero filled)
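Sign extension vs. zero fill can be sketched on 32-bit register images (Python; the function names simply mirror the slide's opcodes):

```python
def ldb(byte):
    """Sign-extend an 8-bit value into a 32-bit register image."""
    # If bit 7 is set, fill the upper 24 bits with ones
    return (0xFFFFFF00 | byte) if byte & 0x80 else byte

def ldbu(byte):
    """Zero-extend an 8-bit value into a 32-bit register image."""
    return byte & 0xFF
```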
rev 2 28
Other Memory Issues
Alignment?
  – Word accesses: only addresses ending in 00 (binary)
  – Half-word accesses: only addresses ending in 0
  – Byte accesses: any address

  Memory: 0x000: 11, 0x001: 44, 0x002: 88, 0x003: FF
  ldw R3, (002) is illegal!

Why is it important to be aligned? How can it be enforced?
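Natural alignment is just a modulus check (a Python sketch):

```python
def is_aligned(addr, size):
    """A size-byte access is naturally aligned if addr is a multiple of size."""
    return addr % size == 0

# ldw R3, (002): a 4-byte word access at 0x002 is misaligned
word_ok = is_aligned(0x002, 4)   # False
half_ok = is_aligned(0x002, 2)   # True
byte_ok = is_aligned(0x003, 1)   # True
```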
rev 2 29
Techniques for Encoding Operators
• Opcode is translated to control signals that
  – direct data (MUX control)
  – select the operation for the ALU
  – set read/write selects for register/memory/PC
• Tradeoff between how flexible the control is and how compact the opcode encoding is
  – Microcode – direct control of signals (Improv)
  – Opcode – compact representation of a set of control signals
• You can make decode easier with careful opcode selection (as done in HW1)
Handling Control Flow
• Conditional branches (short range)
• Unconditional branches (jumps)
• Function calls
• Returns
• Traps (OS calls and exceptions)
• Predicates (conditional retirement)
Encoding branch targets
• PC-relative addressing
  – Makes linking code easier
• Indirect addressing
  – Jumps into shared libraries, virtual functions, case/switch statements
• Some unusual modes to simplify target address calculation – (segment offset) or (trap number)
Condition codes
• Flags
  – Implicit: flag(s) specified in the opcode (bgt)
  – Flag(s) set by earlier instructions (compare, add, etc.)
• Register
  – Uses a register; requires an explicit specifier
• Comparison operation
  – Two registers, with the compare operation specified in the opcode
Higher Level Semantics: Functions
• Function call semantics:
  1. Save PC + 1 instruction for the return
  2. Manage parameters
  3. Allocate space on the stack
  4. Jump to the function
• Simple approach:
  – Use a jump instruction + other instructions
• Complex approach:
  – Build implicit operations into a new "call" instruction
Role of the Compiler
• Compilers make the complexity of the ISA (from the programmer's point of view) less relevant.
  – Non-orthogonal ISAs are more challenging.
  – State allocation (register allocation) is better left to compiler heuristics.
  – Complex semantics lead to more global optimization – easier for a machine to do.

People are good at optimizing 10 lines of code. Compilers are good at optimizing 10M lines.
Next time
• Compiler optimizations
• Interaction between compilers and architectures
• Higher level machine codes (Java VM)
• Starting Pipelining: Appendix A