Posted on 21-Dec-2015
EECS 470 Computer Architecture
Lecture 2
Coverage: Chapters 1-2
rev 2 2
A Quantitative Approach
• Hardware system performance is generally easy to quantify
  – Machine A is 10% faster than Machine B
  – Of course, Machine B's advertising will draw the opposite conclusion
  – Example: Pentium 4 vs. AMD Hammer
• Many software systems tend to have much more subjective performance evaluations.
Measuring Performance
• Use total execution time:
  – A is 3 times faster than B for programs P1, P2
  – Issue: emphasizes long-running programs

  AM = (1/n) × Σ_{i=1..n} Time_i
Measuring Performance
• Weighted execution time:
  – What if P1 is executed far more frequently?

  Weighted arithmetic mean (AM) = Σ_{i=1..n} Weight_i × Time_i,  where Σ_{i=1..n} Weight_i = 1
Measuring Performance
• Normalized execution time:
  – Compare machine performance to a reference machine and report a ratio.
    • SPEC ratings measure performance relative to a reference machine.
Example using execution times
         CompA   CompB
Prog1        1      10
Prog2     1000     100
Total     1001     110

Conclusion: B is faster than A.
It is 1001/110 = 9.1 times faster.
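The totals can be checked with a quick sketch (Python, using the table's numbers):

```python
# Execution times (seconds) from the table above
times_a = {"Prog1": 1, "Prog2": 1000}
times_b = {"Prog1": 10, "Prog2": 100}

total_a = sum(times_a.values())  # 1001
total_b = sum(times_b.values())  # 110
speedup = total_a / total_b      # B is 9.1x faster on total time
```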
Averaging Performance Over Benchmarks
• Arithmetic mean (AM) = (1/n) × Σ_{i=1..n} Time_i
• Geometric mean (GM) = ( Π_{i=1..n} Time_i )^(1/n)
• Harmonic mean (HM) = n / Σ_{i=1..n} (1/Rate_i)
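The three means can be written directly from the formulas above; a minimal Python sketch:

```python
from math import prod

def arithmetic_mean(times):
    # AM = (1/n) * sum of the times
    return sum(times) / len(times)

def geometric_mean(values):
    # GM = n-th root of the product
    return prod(values) ** (1 / len(values))

def harmonic_mean(rates):
    # HM = n / sum of the reciprocal rates
    return len(rates) / sum(1 / r for r in rates)
```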
Which is the right Mean?
• Arithmetic mean when dealing with execution times
• Harmonic mean when dealing with rates
  – flops
  – MIPS
  – Hertz
• Geometric mean gives an "equi-weighted" average
Use Harmonic Mean with Rates
Execution times (s); each program performs 100 million flops:

            Mflops   CompA   CompB   CompC
Prog1          100       1      10      20
Prog2          100    1000     100      20
Total time            1001     110      40

Rates (Mflops/s) from the table above:

           CompA   CompB   CompC
Prog1        100      10       5
Prog2        0.1       1       5
AM         50.05     5.5       5
GM           3.2     3.2       5
HM           0.2     1.8       5

Notice that the total-time ordering is preserved in the HM of the rates.
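That claim can be verified with the table's numbers (a Python sketch):

```python
# Each program performs 100 Mflops; times in seconds from the table
times = {"CompA": [1, 1000], "CompB": [10, 100], "CompC": [20, 20]}
rates = {m: [100 / t for t in ts] for m, ts in times.items()}  # Mflops/s

total_time = {m: sum(ts) for m, ts in times.items()}
hm = {m: len(rs) / sum(1 / r for r in rs) for m, rs in rates.items()}

# Smallest total time corresponds to largest harmonic mean of rates
by_time = sorted(times, key=lambda m: total_time[m])
by_hm = sorted(rates, key=lambda m: hm[m], reverse=True)
```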
Normalized Times
• Don't take the AM of normalized execution times

                        Normalized to A    Normalized to B
       CompA   CompB      A       B          A       B
Prog1      1      10      1      10        0.1       1
Prog2   1000     100      1     0.1         10       1
AM     500.5    55.0      1    5.05       5.05       1
GM      31.6    31.6      1       1          1       1

• Which one? The AM of the normalized times makes each machine look 5.05 times slower than the other.
• The GM doesn't track total execution time – see the last line.
Notes & Benchmarks
• AM ≥ GM
• GM(Xi) / GM(Yi) = GM(Xi / Yi)
• The GM is unaffected by normalizing – it just doesn't track execution time
  – Why does SPEC use it?
• SPEC – System Performance Evaluation Cooperative
  – http://www.specbench.org/
• EEMBC – Embedded Microprocessor Benchmark Consortium; benchmarks for embedded applications
  – http://www.eembc.org/
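The GM identity, and why the AM of normalized times misleads, can be checked numerically (Python, using the Prog1/Prog2 times from the earlier slides):

```python
from math import prod

def am(xs):
    return sum(xs) / len(xs)

def gm(xs):
    return prod(xs) ** (1 / len(xs))

a = [1, 1000]   # times on CompA
b = [10, 100]   # times on CompB

# GM(Xi) / GM(Yi) == GM(Xi / Yi): normalizing doesn't change GM ratios
gm_ratio = gm(a) / gm(b)
gm_of_ratios = gm([x / y for x, y in zip(a, b)])

# AM of normalized times is contradictory: each machine looks 5.05x slower
am_a_over_b = am([x / y for x, y in zip(a, b)])  # 5.05
am_b_over_a = am([y / x for x, y in zip(a, b)])  # 5.05
```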
rev 2 12
Amdahl’s Law
• Rule of Thumb: Make the common case faster
Execution time_new = Execution time_old × [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
(Attack the longest-running part until it is no longer the longest; repeat)
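A sketch of the law as a function (Python); overall speedup is old time divided by new time:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of execution time is sped up."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Speeding up 90% of the time by 10x gives only ~5.3x overall, not 10x
overall = amdahl_speedup(0.9, 10)
# Even an enormous enhancement is capped near 1 / (1 - fraction) = 10x here
cap = amdahl_speedup(0.9, 1e12)
```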
Instruction Set Design
• Software Systems: named variables; complex semantics.
• Hardware systems: tight timing requirements; small storage structures; simple semantics
• Instruction set: the interface between very different software and hardware systems
Design decisions
• How much "state" is in the microarchitecture?
  – Registers; flags; IP/PC
• How is that state accessed/manipulated?
  – Operand encoding
• What commands are supported?
  – Opcode; opcode encoding
Design Challenges: or why is architecture still relevant?
• Clock frequency is increasing
  – This changes the number of levels of gates that can be completed each cycle, so old designs don't work.
  – It also tends to increase the ratio of time spent on wires (fixed speed of light).
• Power
  – Faster chips are hotter; bigger chips are hotter.
Design Challenges (cont)
• Design complexity
  – More complex designs to fix frequency/power issues lead to increased development/testing costs
  – Failures (design or transient) can be difficult to understand (and fix)
• We seem far less willing to live with hardware errors (e.g. FDIV) than software errors – which are often dealt with through upgrades that we pay for!
Techniques for Encoding Operands
• Explicit operands:
  – Include a field to specify which state data is referenced
  – Example: register specifier
• Implicit operands:
  – All state data can be inferred from the opcode
  – Example: function return (CISC-style)
Accumulator
• Architectures with one implicit register
  – Acts as source and/or destination
  – One other source is explicit
• Example: C = A + B
  – Load A     // Acc ← A
  – Add B      // Acc ← Acc + B
  – Store C    // C ← Acc

Ref: "Instruction Level Distributed Processing: Adapting to Shifting Technology"
Stack
• Architectures with an implicit "stack"
  – Acts as source(s) and/or destination
  – Push and Pop operations have one explicit operand
• Example: C = A + B
  – Push A    // Stack = {A}
  – Push B    // Stack = {A, B}
  – Add       // Stack = {A+B}
  – Pop C     // C ← A+B ; Stack = {}
• Compact encoding, though it may require more instructions
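The stack example can be simulated with a tiny interpreter (a Python sketch; the op names and tuple encoding are illustrative assumptions, not a real ISA):

```python
def run_stack(program, memory):
    """Execute a list of (op, operand...) tuples on a stack machine."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(memory[args[0]])    # push a memory value
        elif op == "add":
            b, a = stack.pop(), stack.pop()  # both operands are implicit
            stack.append(a + b)
        elif op == "pop":
            memory[args[0]] = stack.pop()    # store result to memory
    return memory

# C = A + B, exactly as on the slide
mem = {"A": 2, "B": 3}
run_stack([("push", "A"), ("push", "B"), ("add",), ("pop", "C")], mem)
```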
Registers
• Most general (and common) approach
  – Small array of storage
  – Explicit operands (register file index)
• Example: C = A + B

  Register-memory      Load/store
  Load R1, A           Load R1, A
                       Load R2, B
  Add R3, R1, B        Add R3, R1, R2
  Store R3, C          Store R3, C
Memory
• Big array of storage
  – More complex ways of indexing than registers
• Build addressing modes to support efficient translation of software abstractions
• Uses less space in the instruction than a 32-bit immediate field

  A[i]:   use base (A) + displacement (i) (scaled?)
  a.ptr:  use base (ptr) + displacement (a)
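The effective-address arithmetic for A[i] looks like the following sketch (Python; the base address and word size are assumptions for illustration):

```python
# Base + scaled displacement for A[i], assuming 4-byte words
base_of_A = 0x1000      # hypothetical address of A[0]
word_size = 4           # scale factor: bytes per array element
i = 3
effective_addr = base_of_A + i * word_size  # address of A[3]
```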
rev 2 22
Addressing modes
Register            Add R4, R3
Immediate           Add R4, #3
Base/Displacement   Add R4, 100(R1)
Register Indirect   Add R4, (R1)
Indexed             Add R4, (R1+R2)
Direct              Add R4, (1001)
Memory Indirect     Add R4, @(R3)
Autoincrement       Add R4, (R2)+
Other Memory Issues
What is the size of each element in memory?

  Byte      (addressed from 0x000): range 0–255
  Half word (addressed from 0x000): range 0–65535
  Word      (addressed from 0x000): range 0–~4 billion
Other Memory Issues
Big-endian or little-endian? Store 0x114488FF

  Big-endian (address points to the most significant byte):
    0x000: 11   0x001: 44   0x002: 88   0x003: FF

  Little-endian (address points to the least significant byte):
    0x000: FF   0x001: 88   0x002: 44   0x003: 11
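Python's struct module shows both byte orders for the value on the slide:

```python
import struct

value = 0x114488FF
big = struct.pack(">I", value)     # most significant byte at the lowest address
little = struct.pack("<I", value)  # least significant byte at the lowest address
```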
Other Memory Issues
Non-word loads? ldb R3, (000)

  Memory: 0x000: 11, 0x001: 44, 0x002: 88, 0x003: FF
  R3 after the load: 00 00 00 11
Other Memory Issues
Non-word loads? ldb R3, (003)

  Memory: 0x000: 11, 0x001: 44, 0x002: 88, 0x003: FF
  R3 after the load: FF FF FF FF   (sign extended)
Other Memory Issues
Non-word loads? ldbu R3, (003)

  Memory: 0x000: 11, 0x001: 44, 0x002: 88, 0x003: FF
  R3 after the load: 00 00 00 FF   (zero filled)
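Sign extension vs. zero fill can be sketched on 32-bit register images (Python; the function names simply mirror the slide's opcodes):

```python
def ldb(byte):
    """Sign-extend an 8-bit value into a 32-bit register image."""
    # If bit 7 is set, fill the upper 24 bits with ones
    return (0xFFFFFF00 | byte) if byte & 0x80 else byte

def ldbu(byte):
    """Zero-extend an 8-bit value into a 32-bit register image."""
    return byte & 0xFF
```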
rev 2 28
Other Memory Issues
Alignment?
  – Word accesses: only addresses ending in 00 (binary)
  – Half-word accesses: only addresses ending in 0
  – Byte accesses: any address

  Memory: 0x000: 11, 0x001: 44, 0x002: 88, 0x003: FF
  ldw R3, (002) is illegal!

Why is it important to be aligned? How can it be enforced?
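Natural alignment is just a modulus check (a Python sketch):

```python
def is_aligned(addr, size):
    """A size-byte access is naturally aligned if addr is a multiple of size."""
    return addr % size == 0

# ldw R3, (002): a 4-byte word access at 0x002 is misaligned
word_ok = is_aligned(0x002, 4)   # False
half_ok = is_aligned(0x002, 2)   # True
byte_ok = is_aligned(0x003, 1)   # True
```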
rev 2 29
Techniques for Encoding Operators
• Opcode is translated to control signals that
  – direct data (MUX control)
  – select the operation for the ALU
  – set read/write selects for register/memory/PC
• Tradeoff between how flexible the control is and how compact the opcode encoding is
  – Microcode – direct control of signals (Improv)
  – Opcode – compact representation of a set of control signals
• You can make decode easier with careful opcode selection (as done in HW1)
Handling Control Flow
• Conditional branches (short range)
• Unconditional branches (jumps)
• Function calls
• Returns
• Traps (OS calls and exceptions)
• Predicates (conditional retirement)
Encoding branch targets
• PC-relative addressing
  – Makes linking code easier
• Indirect addressing
  – Jumps into shared libraries, virtual functions, case/switch statements
• Some unusual modes to simplify target address calculation – (segment offset) or (trap number)
Condition codes
• Flags
  – Implicit: flag(s) specified in the opcode (bgt)
  – Flag(s) set by earlier instructions (compare, add, etc.)
• Register
  – Uses a register; requires an explicit specifier
• Comparison operation
  – Two registers, with the compare operation specified in the opcode
Higher Level Semantics: Functions
• Function call semantics:
  1. Save PC + 1 instruction for the return
  2. Manage parameters
  3. Allocate space on the stack
  4. Jump to the function
• Simple approach:
  – Use a jump instruction + other instructions
• Complex approach:
  – Build implicit operations into a new "call" instruction
Role of the Compiler
• Compilers make the complexity of the ISA (from the programmer's point of view) less relevant.
  – Non-orthogonal ISAs are more challenging.
  – State allocation (register allocation) is better left to compiler heuristics.
  – Complex semantics lead to more global optimization – easier for a machine to do.

People are good at optimizing 10 lines of code. Compilers are good at optimizing 10M lines.
Next time
• Compiler optimizations
• Interaction between compilers and architectures
• Higher level machine codes (Java VM)
• Starting Pipelining: Appendix A