
EE382A – Spring 2009 Christos Kozyrakis Lecture 1 - 1

Department of Electrical Engineering

Stanford University

http://eeclass.stanford.edu/ee382a

EE382A

Advanced Processor Architecture

Christos Kozyrakis & John Shen


A Few Words About Christos

• Associate professor of EE & CS

– Ph.D. from U.C. Berkeley

– B.Sc. from University of Crete

• Current research

– Parallel systems (scheduling, TM)

– Energy efficient data-centers

– Security systems

– More info at http://csl.stanford.edu/~christos

• Systems I have worked on

– Networking chips: ATLAS & Telegraphos switches

– Processor chips: VIRAM media-processor

• 125 million transistors, 9.6 billion ops/sec

– FPGA prototypes: Raksha & Atlas

– Server prototypes: CoolSort

[Figure: photos of the VIRAM media-processor, IRAM test chip, Telegraphos DSM switch, ATLAS ATM switch, Raksha security system, and ATLAS TM system]


A Few Words About John

• Head of Nokia Research Center in Palo Alto

– Ph.D. from USC

– B.Sc. from University of Michigan

• Prior to Nokia

– Director of the Microarchitecture Research Lab (MRL) at Intel

• Superscalar architecture, speculative multithreading and memory prefetching,

3D die-stacking technology, and heterogeneous multi-sequencer architectures

– Professor of Computer Engineering at CMU

• Author of the main textbook for EE382a


EE382a Team

• Instructors: Christos Kozyrakis & John Shen

• Teaching assistant: David Signiorelli

• Guest lectures: Ben Lee + one more

• Administrative support: Teresa Lynn

• Contact info & office hours: up-to-date info on class webpage

– http://eeclass.stanford.edu/ee382a

– Check frequently


You…

• Class participation is EXTREMELY important in EE382a

• Your goals

– Ask questions

– Offer answers

– Suggest discussion topics

– Make us learn your name

• Will take and post photos of everyone next week


Class Basics

• Lectures: Mo & We, 11am-12.15pm, Hewlett 101

– There will also be some discussion sessions on Fridays

• Friday 2-3pm, Gates Hall 498

• Discussion sessions will be explicitly announced

– The class is not available on SCPD this quarter

• Web page: http://eeclass.stanford.edu/ee382a

– Announcements, handouts, office hours, latest schedule, bulletin board

– Check frequently

– Sign up on the webpage for on-line access to grades

• We will let you know when registration is open…


The Bulletin Board

• The preferred way to ask class-related questions

– We promise to check & answer often, especially close to deadlines

– We encourage you to contribute to answers & have on-line discussions on

class material

• The bulletin rules

– Before posting a new question

• Check if question has already been asked or even answered

– Use the search capabilities of your web browser

• Check the FAQ page for the assignment

– Choose an appropriate subject for your question

• E.g. “HW2, problem 3, definition of memory latency”

• For questions not appropriate for the public: send us an email


EE382a Topics

• Pipelining overview and analysis

• Architectures for instruction level parallelism

– Superscalar: instruction fetch, branch prediction, dynamic scheduling &

register renaming, memory disambiguation

– VLIW and dynamic binary translation

• Architecture for task and data level parallelism

– Multithreading, multi-core architectures, vector processing, GPUs, tradeoffs

in designing multi-core chips, memory hierarchy for multi-core

• Cross-cutting issues

– Checkpointed processors, phase-change memory, …


Textbooks and Papers

• Textbooks

– Required: "Modern Processor Design: Fundamentals of Superscalar

Processors", J.P. Shen and M. Lipasti, 1st edition, McGraw-Hill

• Do not use/buy the beta edition!

– Reference: “Computer Architecture: A Quantitative Approach”, J. Hennessy

& D. Patterson, 4th edition, Morgan Kaufmann

– Reference: “Computer Organization and Design: The Hardware/Software

Interface”, D. Patterson & J. Hennessy, 4th edition, Morgan Kaufmann

• Papers (check handouts link on the webpage)

– A few required papers

• These papers are included in the exam materials

• Have to submit a 1-page paper summary by the next lecture

– Several optional papers

• Further in-depth information, references for projects, …


Assignments, Exams, and Class Load

• Single exam and 1+2 homework assignments

• Large research project

– On an open question in computer architecture

– Work in groups of up to 3 students

– See topic suggestions on-line or suggest your own project

– Milestones: proposal, halfway review/status, presentation, paper…

• Grade breakdown (tentative)

– Exam 40%, Project 40%, HW + summaries + participation 20%

– All deadlines are final, no extensions, no exceptions

– Remember the honor code (more info on web page)

• Warnings

– This will be a loaded class!!

– This class will be as good as your participation…


Prerequisites and Registration

• Prerequisites: EE108B or equivalent

– Expected to know: simple pipelines, basic caching, virtual memory, main memory

• EE282 is not a required prerequisite

• Class registration:

– Limited to 30 students; all students must receive instructor’s approval

• Homework 1: prerequisite assessment

– Due in class on Monday

– Work on it on your own

– Will send you email about your registration by Wednesday


Should I Take EE382A?

• Good reason to take EE382A

– Prepare for research in computer architecture

– Broaden your Ph.D. research perspective

– Become a digital systems architect in industry

– Honest curiosity (how do Intel/AMD/… processors work?)

– Want to take a class with a research project

• Not a good reason to take EE382A

– Prepare for quals, comps, etc…

– Need another course for your degree program

• “EE382A is supposed to be an easy A, right?”

– Learn about digital circuits and CAD tools


On Reading & Summarizing Papers

• Look for the following

– The issue or problem addressed by the paper

– The original contributions (real or claimed, you have to check)

– Critique: what are the major strengths and weaknesses of the paper?

• Look at the claims and assumptions, the methodology, the analysis of data, and the presentation style

– Future work: what are the natural extensions or improvements to this work?

• Or, can we apply a similar methodology to other problems of interest

• Do not submit the paper abstract as your summary :)

• Helpful tips

– Read the abstract, introduction, and conclusions sections first.

– Read the rest of the paper twice

• First a quick pass to get rough idea of details, then a detailed reading

– Underline/highlight the important parts of the paper

– Keep notes on the paper margins about comments or questions

• Important insights, questionable claims, relevance to other topics, ways to improve some technique etc.

– Look up references that seem to be important or missing

• In some cases, you may also want to check who cites this paper and how


EE382A Lecture 1:

Introduction to Advanced Processor Architecture


Historical Perspectives on Processors

• The Decade of the 1970’s: “Birth of Microprocessors”

– Programmable Controller

– Single-Chip Microprocessors

– Personal Computers (PC)

• The Decade of the 1980’s: “Quantitative Architecture”

– Instruction Pipelining

– Fast Cache Memories

– Compiler Considerations

– Workstations

• The Decade of the 1990’s: “Instruction-Level Parallelism”

– Superscalar, Speculative Microarchitectures

– Aggressive Compiler Optimizations

– Low-Cost Desktop Supercomputing


Performance Growth

• Doubling every 18 months (1982-2000):

– total of 3,200X

– Cars travel at 176,000 MPH; get 64,000 miles/gal.

– Air travel: L.A. to N.Y. in 5.5 seconds (MACH 3200)

– Wheat yield: 320,000 bushels per acre

• Doubling every 24 months (1971-2001):

– total of 36,000X

– Cars travel at 2,400,000 MPH; get 600,000 miles/gal.

– Air travel: L.A. to N.Y. in 0.5 seconds (MACH 36,000)

– Wheat yield: 3,600,000 bushels per acre

Unmatched by any other industry!!

[John Crawford, Intel, 1993]


Convergence of Key Enabling Technologies

• CMOS VLSI:

– Submicron feature sizes: 0.3 µm → 0.25 µm → 0.18 µm → 0.13 µm → 90 nm → 65 nm → 45 nm …

– Metal layers: 3 → 4 → 5 → 6 → 7 (copper) → 12 …

– Power supply voltage: 5 V → 3.3 V → 2.4 V → 1.8 V → 1.3 V → 1.1 V …

• CAD Tools:

– Interconnect simulation and critical path analysis

– Clock signal propagation analysis

– Process simulation and yield analysis/learning

• Architecture & Microarchitecture:

– Superpipelined and superscalar machines

– Speculative and dynamic microarchitectures

– Simulation tools and emulation systems

• Compilers:

– Extraction of instruction-level parallelism

– Aggressive and speculative code scheduling

– Object code translation and optimization


Evolution of Single-Chip Processors

                     1970’s      1980’s      1990’s      2010
Transistor Count     10K-100K    100K-1M     1M-100M     0.5-1B
Clock Frequency      0.2-2MHz    2-20MHz     20M-1GHz    1-5GHz
Instructions/Cycle   < 0.1       0.1-0.9     0.9-2.0     10
MIPS or MFLOPS       < 0.2       0.2-20      20-2,000    100,000
Watts                < 2         < 10        < 40        1-100+ (?)
CPUs/chip            1           1           1           4-10


Aspects of Computer Architecture

• ARCHITECTURE (instruction set architecture)

– programmer/compiler view - “Functional appearance to its immediate user/

system programmer”

• IMPLEMENTATION (microarchitecture)

– processor designer view - “Logical structure or organization that

implements the instruction set”

• DESIGN (chip realization)

– chip/system designer view - “Physical structure that embodies the

implementation”


Our Objective for this Quarter

• The “What’s-How’s-Why’s” of Processor Design

1. Knowledge (“what’s”)

- Technology

- Techniques

2. Design Skills (“how’s”)

- Critical Issues

- Trade-off Intuitions

3. Understanding (“why’s”)

- Deeper Insights

- Fundamental Principles


Basic Tools and Principles for Architects



Amdahl’s Law

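• $\text{Speedup} = \text{time}_{\text{without enhancement}} / \text{time}_{\text{with enhancement}}$

• Suppose an enhancement speeds up a fraction $f$ of a task by a factor of $S$:

$\text{time}_{\text{new}} = \text{time}_{\text{old}} \cdot \left( (1-f) + f/S \right)$

$S_{\text{overall}} = 1 / \left( (1-f) + f/S \right)$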

[Figure: bar for the old time split into segments (1 - f) and f; bar for the new time split into segments (1 - f) and f/S]
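The speedup formula on this slide can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function names `amdahl_speedup` and `amdahl_limit` are hypothetical, and the sample values of f and S are chosen only for illustration.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the task is sped up by a factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    # time_new / time_old = (1 - f) + f / s, so speedup is the reciprocal.
    return 1.0 / ((1.0 - f) + f / s)


def amdahl_limit(f: float) -> float:
    """Upper bound on the speedup as s grows without bound: 1 / (1 - f)."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


if __name__ == "__main__":
    # Illustrative values only: 90% of the task is enhanced.
    for s in (2, 10, 100):
        print(f"f=0.9, S={s}: overall speedup = {amdahl_speedup(0.9, s):.2f}")
    print(f"limit for f=0.9: {amdahl_limit(0.9):.2f}")
```

Even with a very large S, the unenhanced fraction (1 - f) bounds the overall speedup, which is the point of the traffic-jam analogy on the next slide.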


Amdahl’s Law (continued)

• Real life analogy: After driving through 60 minutes of traffic jam, how

much time can you make up by speeding in the final mile?

• Applications in Computer Architecture

– RISC - Reduced Instruction Set Computer

– Optimized to execute frequently used instructions quickly

– Infrequently used instructions take longer, or are even emulated in software

We should concentrate efforts on improving frequently occurring events or

frequently used mechanisms


Pipelining

• Latency : Elapsed time from start to completion of a particular task

• Throughput : How many tasks can be completed per unit of time

• A pipeline is like an assembly line!

• Pipelining only improves throughput

– Latency: each job still takes 5 cycles to complete

– Throughput: 1 job per cycle if pipelined vs. 1 job per 5 cycles if not pipelined

[Figure: a job enters at start, passes through stage1, stage2, stage3, stage4, and stage5, and leaves at finish]


Pipelining (continued)

• Real life analogy: Henry Ford’s automobile assembly line.

• Example in computer architecture:

– 5-stage Instruction Execution Pipeline

– Fetch-Decode-Execute-Memory-Writeback

time →

Stages     t0   t1   t2   t3   t4   t5   t6   t7   ...
Fetch      I1   I2   I3   I4   I5
Decode          I1   I2   I3   I4   I5
Execute              I1   I2   I3   I4   I5
Memory                    I1   I2   I3   I4   I5
Writeback                      I1   I2   I3   I4   I5
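The timing in the diagram above can be reproduced with a few lines of Python. This is a sketch under the assumption of an ideal pipeline (one cycle per stage, no stalls or hazards); the function names are hypothetical and do not come from the slides.

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]


def unpipelined_cycles(n_instructions: int, n_stages: int = len(STAGES)) -> int:
    """Each instruction occupies the whole datapath for n_stages cycles."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int = len(STAGES)) -> int:
    """Ideal pipeline: fill for n_stages cycles, then one completion per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def stage_at(instruction: int, cycle: int):
    """Stage occupied by a 0-based instruction at a 0-based cycle, or None."""
    k = cycle - instruction  # instruction i enters Fetch at cycle i
    return STAGES[k] if 0 <= k < len(STAGES) else None


if __name__ == "__main__":
    for n in (5, 1000):
        print(n, "instructions:",
              unpipelined_cycles(n), "cycles unpipelined,",
              pipelined_cycles(n), "cycles pipelined")
```

For a long instruction stream the pipelined cycle count approaches one instruction per cycle, while the latency of each individual instruction stays at five cycles.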


Parallel Processing

• Parallelism - the amount of independent sub-tasks available

• If sub-tasks are independent, the order in which they are carried out does not matter

• Thus by executing the independent subtasks concurrently, we can

finish the entire task faster

Improve Speedup!!!


Parallel Processing

• Real life analogy: collaboration on problem sets

(although not always encouraged)

• Examples in computer architecture:

– Parallel computers

– Superscalar processors

– Multi-core processors


Out-of-order Execution

• Specification (or Program) Order vs Dataflow Order

• Dataflow: Data-driven scheduling of events

– The start of an event should be enabled by the availability of its required

input (data dependency)

– The completion of an event will produce an output that will enable the start

of other events

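x = a + b; y = b * 2; z = (x - y) * (x + y)

[Figure: dataflow graph. Inputs a and b feed + (producing x) and *2 (producing y); x and y feed both - and +, whose results feed the final *]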
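Data-driven scheduling of the example above can be sketched in Python. This is an illustration only: the temporaries `d` and `s` are hypothetical names for the unnamed intermediate results (x - y) and (x + y), and the scheduler assumes every operation takes one step and that unlimited execution units are available.

```python
# Each operation lists the values it reads. a and b are available at the start.
OPS = {
    "x": ("a", "b"),  # x = a + b
    "y": ("b",),      # y = b * 2
    "d": ("x", "y"),  # d = x - y  (hypothetical name for the temporary)
    "s": ("x", "y"),  # s = x + y  (hypothetical name for the temporary)
    "z": ("d", "s"),  # z = d * s
}


def dataflow_schedule(ops, inputs):
    """Return a list of steps; each step holds the operations that fire together."""
    ready = set(inputs)   # values that have already been produced
    pending = dict(ops)   # operations that have not fired yet
    steps = []
    while pending:
        # An operation is enabled as soon as all of its inputs are available.
        fire = sorted(name for name, srcs in pending.items()
                      if all(src in ready for src in srcs))
        if not fire:
            raise ValueError("cyclic or unsatisfiable dependencies")
        steps.append(fire)
        ready.update(fire)  # completed outputs enable later operations
        for name in fire:
            del pending[name]
    return steps


if __name__ == "__main__":
    for step, names in enumerate(dataflow_schedule(OPS, ["a", "b"])):
        print("step", step, ":", ", ".join(names))
```

Program order would execute the five operations one after another; dataflow order lets x and y start together, then the subtraction and addition, then the final multiply.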


Out-of-order Execution

• Real life analogy:

– A tip on taking tests: work on the questions you know first

• Examples in computer architecture

– Most modern microprocessors (Intel P4, Opteron, etc.) schedule

instruction execution in dataflow order


Work and Critical Path

• Work

$T_1$ - time to complete a computation on a sequential system

• Critical Path

$T_\infty$ - time to complete the same computation on an infinitely-parallel system

• Average Parallelism

$P_{avg} = T_1 / T_\infty$

• For a $p$-wide system

$T_p \ge \max\{ T_1/p,\; T_\infty \}$

$P_{avg} \gg p \;\Rightarrow\; T_p \approx T_1/p$

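[Figure: dataflow graph for the code below. Inputs a and b feed + (producing x) and *2 (producing y); x and y feed both - and +, whose results feed the final *]

x = a + b; y = b * 2; z = (x - y) * (x + y)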
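Work, critical path, and the bound for a p-wide system can be computed for the small example on this slide. The sketch below assumes that every operation takes one time unit; the names `d` and `s` for the intermediate results and the greedy scheduler are illustrative choices, not something the slides prescribe.

```python
# Dependencies between the five operations of the example (unit latency assumed).
DEPS = {
    "x": [],          # x = a + b
    "y": [],          # y = b * 2
    "d": ["x", "y"],  # d = x - y  (hypothetical name)
    "s": ["x", "y"],  # s = x + y  (hypothetical name)
    "z": ["d", "s"],  # z = d * s
}


def work(deps):
    """T1: total number of unit-time operations."""
    return len(deps)


def critical_path(deps):
    """T_inf: length of the longest dependency chain."""
    depth = {}

    def visit(op):
        if op not in depth:
            depth[op] = 1 + max((visit(d) for d in deps[op]), default=0)
        return depth[op]

    return max(visit(op) for op in deps)


def greedy_time(deps, p):
    """Tp for a simple greedy schedule that issues up to p ready operations per step."""
    done, steps = set(), 0
    while len(done) < len(deps):
        ready = [op for op in deps
                 if op not in done and all(d in done for d in deps[op])]
        if not ready:
            raise ValueError("dependency cycle")
        done.update(ready[:p])
        steps += 1
    return steps


if __name__ == "__main__":
    t1, t_inf = work(DEPS), critical_path(DEPS)
    print("T1 =", t1, " T_inf =", t_inf, " P_avg =", round(t1 / t_inf, 2))
    for p in (1, 2, 4):
        print("p =", p, " Tp =", greedy_time(DEPS, p),
              " bound =", max(t1 / p, t_inf))
```

Here the average parallelism is below 2, so widening the system beyond two units cannot shorten the schedule: the critical path dominates.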


Work and Critical Path

• Real life analogy: undergraduate degree requirements

– Work = unit requirement

– Critical Path to graduation is determined by course sequences and their

prerequisites

• Added constraints: classes are only available in specific quarters…

• Applications to computer architecture

– Parallel job scheduling

– Given a collection of inter-dependent tasks:

• How many resources should be allocated?

• Which sequence of tasks should be given priority?


Speculation

Is it possible to parallelize the critical path?

i.e. violate data dependence?

• Guess the outcome of an operation from its inputs without performing

the operation

• Even better, guess the outcome of an operation before the inputs to

the operation are even known

• Speculation techniques must also include mechanisms for

1. Checking if the guesses are correct

2. Undoing “speculative execution” after wrong guesses


Speculation (continued)


• Real life analogy:

– Another tip on taking tests: You can often guess what is going to be on an

exam by looking at lectures and HWs.


• Examples in computer architecture:

– Circuit-level speculations: Carry Select Adder

– Architectural-level speculations

• Branch target predictions

• Load value predictions

• Speculative loop execution
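The carry-select adder mentioned above is a compact example of speculation: the upper half of the sum is computed for both possible carry-in values before the real carry is known, and the correct result is selected afterwards. The following Python sketch models an 8-bit adder split into two 4-bit halves; the widths and the function name are assumptions made for illustration. In hardware the three partial additions run in parallel, which software can only imitate sequentially.

```python
def carry_select_add8(a: int, b: int):
    """Add two 8-bit values; return (8-bit sum, carry-out)."""
    a_lo, a_hi = a & 0xF, (a >> 4) & 0xF
    b_lo, b_hi = b & 0xF, (b >> 4) & 0xF

    # Speculate: compute the upper half for both possible carry-in values.
    hi_if_carry0 = a_hi + b_hi
    hi_if_carry1 = a_hi + b_hi + 1

    # Meanwhile, the lower half produces the real carry.
    lo = a_lo + b_lo
    carry = lo >> 4

    # Check and select: keep the correct guess, discard the other one.
    hi = hi_if_carry1 if carry else hi_if_carry0
    return ((hi & 0xF) << 4) | (lo & 0xF), hi >> 4


if __name__ == "__main__":
    print(carry_select_add8(0x3C, 0x47))  # sum and carry-out for one sample pair
```

The wasted work on the discarded half is the price paid for removing the lower-half carry from the critical path.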


Locality Principle

• One’s recent past is a very good indication of his near future

– Temporal Locality: If you just did something, it is very likely that you will do

the same thing again soon

– Spatial Locality: If you just did something, it is very likely you will do something related or similar next

• Locality == Patterns == Predictability

– Converse:

• Anti-locality : If you haven’t done something for a very long time, it is very likely

you won’t do it in the near future either


Locality Principle (continued)


• Real life analogy:

– spatial locality - where you choose to sit in a room

– temporal locality - will you be here again next week?


• Examples in computer architecture:

– Execution of program loops

• Spatial locality - after you execute an instruction, with very good probability, you

will execute the next instruction

• Temporal locality - you are very likely to repeat the same instructions many

times
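The effect of both kinds of locality can be seen with a toy cache model. The sketch below simulates a tiny direct-mapped cache; the line size, number of lines, and access traces are hypothetical values picked for illustration, and the addresses can be read as either instruction or data addresses.

```python
def hit_rate(addresses, line_words=4, n_lines=8):
    """Fraction of accesses that hit in a small direct-mapped cache."""
    tags = [None] * n_lines  # which memory line each cache slot currently holds
    hits = 0
    for addr in addresses:
        line = addr // line_words   # neighbouring words share a line
        index = line % n_lines      # direct-mapped placement
        if tags[index] == line:
            hits += 1
        else:
            tags[index] = line      # miss: fetch the whole line
    return hits / len(addresses)


if __name__ == "__main__":
    sequential = list(range(32))          # one pass over 32 consecutive words
    looped = list(range(32)) * 10         # the same 32 words, repeated 10 times
    strided = list(range(0, 128, 4))      # one word from each of 32 different lines
    print("sequential:", hit_rate(sequential))
    print("looped:    ", hit_rate(looped))
    print("strided:   ", hit_rate(strided))
```

The sequential trace benefits only from spatial locality (one miss per line), the looped trace adds temporal locality (later passes hit every time), and the strided trace has neither.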


Memoization

• If something is expensive to compute, you might want to remember the

answer for a while, just in case you will need the same answer again

Why does memoization work??


• Real life analogy:

– Keeping a list of frequently used phone numbers by your telephone


• Examples in computer architecture:

– ?
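In software, memoization is usually a small table placed in front of an expensive function. The following Python sketch is one way to write it; `memoize`, `expensive_square`, and the `CALLS` counter are hypothetical names used only to make the saved work visible.

```python
def memoize(fn):
    """Wrap fn so that each distinct argument is computed only once."""
    cache = {}

    def wrapper(arg):
        if arg not in cache:
            cache[arg] = fn(arg)  # expensive path, taken once per argument
        return cache[arg]         # cheap path on every repeat

    return wrapper


CALLS = {"count": 0}  # hypothetical counter that records real computations


@memoize
def expensive_square(n):
    """Stand-in for a costly computation."""
    CALLS["count"] += 1
    return n * n


if __name__ == "__main__":
    for n in (7, 7, 7, 9):
        expensive_square(n)
    print("requests: 4, real computations:", CALLS["count"])
```

The table only pays off when the same arguments recur, which connects the question on this slide back to the locality principle.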


Amortization

• Overhead cost : one-time cost to set something up

• Per-unit cost : cost per unit of operation

$\text{total cost} = \text{overhead} + \text{per-unit cost} \times N$

• It is often okay to have a high overhead cost if the cost can be distributed over a large number of units

→ lower the average cost

$\text{average cost} = \text{total cost} / N = (\text{overhead} / N) + \text{per-unit cost}$


Amortization (continued)

• Real life analogy: economy of scale

– Why is pasta sauce cheaper when bought by the gallon?


Cache Access Latency

$T_{miss} = 50$ cycles

$T_{hit} = 1$ cycle

If on the average a cache line is reused $n$ times before being ejected

$T_{avg} = ( T_{miss} + (n-1)\,T_{hit} ) / n \approx T_{miss}/n + T_{hit}$

$n = 50 \Rightarrow T_{avg} \approx 2$

$n = 2 \Rightarrow T_{avg} \approx 25$
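The two data points above follow directly from the formula. A short Python sketch, with a hypothetical function name and the slide's values as defaults, gives the exact figures that the slide rounds (1.98 and 25.5 cycles):

```python
def average_access_time(n: int, t_miss: float = 50.0, t_hit: float = 1.0) -> float:
    """Average latency when one miss is followed by n - 1 hits to the same line."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The miss is the one-time overhead; each hit is the per-unit cost.
    return (t_miss + (n - 1) * t_hit) / n


if __name__ == "__main__":
    for n in (1, 2, 10, 50):
        print("n =", n, " T_avg =", average_access_time(n), "cycles")
```

This is the amortization formula with the miss latency as the overhead and the hit latency as the per-unit cost.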


Basic Equations and Metrics

• Performance

– CPUtime = Instruction Count * CPI * Clock Cycle Time

– AMAT = Hit Time + Miss Rate * Miss Penalty

– Amdahl’s law, amortization

• Cost

– Processor cost = $f(\text{die area}^4)$

• Power Consumption

– $\text{Power} = C \cdot V_{dd}^2 \cdot F + V_{dd} \cdot I_{shortcircuit} \cdot F + V_{dd} \cdot I_{leakage}$

– Energy = Power * Time

– $E \cdot D$, $E \cdot D^2$, $E \cdot D^3$, …

• Fault tolerance: MTTF, MTTR, …

• Design complexity: ?
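The two performance equations in this list are easy to evaluate directly. The Python sketch below uses hypothetical function names and made-up input values (instruction count, CPI, cycle time, miss rate, miss penalty) purely to show how the terms combine.

```python
def cpu_time(instruction_count: float, cpi: float, clock_cycle_time: float) -> float:
    """CPU time = Instruction Count * CPI * Clock Cycle Time (seconds)."""
    return instruction_count * cpi * clock_cycle_time


def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time = Hit Time + Miss Rate * Miss Penalty."""
    return hit_time + miss_rate * miss_penalty


if __name__ == "__main__":
    # Illustrative values only: 1e9 instructions, CPI 1.5, 0.5 ns cycle (2 GHz).
    print("CPU time:", cpu_time(1e9, 1.5, 0.5e-9), "s")
    # Illustrative values only: 1-cycle hit, 5% miss rate, 50-cycle miss penalty.
    print("AMAT:", amat(1, 0.05, 50), "cycles")
```

Halving any one factor of the CPU time product halves the total, which is why architects weigh instruction count, CPI, and cycle time against each other.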


Ready to Learn More?
