Computer Architecture
Paul Mellies

Lecture 2: Princeton, Harvard and data-flow machines
Why have computers become more complex?

Why have computers become more complex? We can think of several reasons.
Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began with the transition from the 701 to the 709. The 701 CPU was about ten times as fast as the core main memory; this made any primitives that were implemented as subroutines much slower than primitives that were instructions. Thus the floating point subroutines became part of the 709 architecture with dramatic gains. Making the 709 more complex resulted in an advance that made it more cost-effective than the 701. Since then, many « higher-level » instructions have been added to machines in an attempt to improve performance.
David Patterson and David Ditzel, The Case for the Reduced Instruction Set Computer, ACM Computer Architecture News, 1980
Diverging processor and memory performance

[Figure: processor vs. memory performance from 1980 to 2010, plotted on a log scale from 1 to 100,000, with the processor curve pulling far ahead of the memory curve.]
The memory hierarchy

Where the data is located     Time to fetch data
register                      1 cycle
L1 cache                      ~4 cycles
L2 cache                      ~10 cycles
L3 cache                      40 - 75 cycles
Local DRAM memory             ~60 ns
Remote DRAM memory            ~100 ns

For more information, please have a look at Intel's performance analysis guide for Core i7 and Xeon 5500:
https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
A good programmer should be aware of these memory latencies and do their best to maximize the amount of data available in the cache.

A good idea is to keep the manipulated data as local as possible (e.g. use arrays instead of linked lists), as the sketch below illustrates.
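As a minimal illustration in C (not from the slides): summing n integers stored contiguously in an array streams through consecutive cache lines, while summing a linked list chases pointers that may land anywhere in memory.

/* Sketch: cache-friendly versus cache-hostile traversal. */
#include <stddef.h>

long sum_array(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];               /* sequential accesses: cache friendly */
    return s;
}

struct node { int value; struct node *next; };

long sum_list(const struct node *p) {
    long s = 0;
    for (; p != NULL; p = p->next)
        s += p->value;           /* pointer chasing: a potential cache miss per node */
    return s;
}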
Exercise

Compute the number of cycles performed in 60 ns by an Intel Core i7 processor working at 4 GHz.
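Solution sketch, by simple unit arithmetic: at 4 GHz the processor performs 4 cycles every nanosecond, so

number of cycles = 60 ns × 4 GHz = (60 × 10⁻⁹ s) × (4 × 10⁹ cycles/s) = 240 cycles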
Upward Compatibility
Upward compatibility means that the primary way to improve a design is to add new, and usually more complex, features. Seldom are instructions or addressing modes removed from an architecture, resulting in a gradual increase in both the number and complexity of instructions over a series of computers. New architectures tend to have a habit of including all instructions found in the machines of successful competitors, perhaps because architects and customers have no real grasp over what defines a « good » instruction set.
David Patterson and David Ditzel, The Case for the Reduced Instruction Set Computer, ACM Computer Architecture News, 1980
2.20 Concluding Remarks
The two principles of the stored-program computer are the use of instructions that are indistinguishable from numbers and the use of alterable memory for programs. These principles allow a single machine to aid environmental scientists, financial advisers, and novelists in their specialties. The selection of a set of instructions that the machine can understand demands a delicate balance among the number of instructions needed to execute a program, the number of clock cycles needed by an instruction, and the speed of the clock. As illustrated in this chapter, three design principles guide the authors of instruction sets in making that delicate balance:
1. Simplicity favors regularity. Regularity motivates many features of the MIPS instruction set: keeping all instructions a single size, always requiring three register operands in arithmetic instructions, and keeping the register fields in the same place in each instruction format.
2. Smaller is faster. The desire for speed is the reason that MIPS has 32 registers rather than many more.
3. Good design demands good compromises. One MIPS example was the compromise between providing for larger addresses and constants in instructions and keeping all instructions the same length.
Less is more.
Robert Browning, Andrea del Sarto, 1855
[Figure 2.43: Growth of the x86 instruction set over time, with the number of instructions (0 to 1000) plotted against the year (1978 to 2012). While there is clear technical value to some of these extensions, this rapid change also increases the difficulty for other companies to try to build compatible processors.]
Inflation of the x86 instruction set over time

The price to pay (among other things) for backward compatibility...
Illustration: Intel Haswell i7 core (2013)

die size ≈ 177 mm²
clock rate ≈ 3 GHz
22 nm FinFET technology
number of transistors per die ≈ 1,400,000,000

All Haswell models are designed to support MMX, SSE, SSE2, SSSE3, SSE4.1, SSE4.2, F16C, BMI1 + BMI2, EIST, Intel 64, XD bit, Intel VT-x and Smart Cache.
How much of a CISC is actually used?
One of the interesting results of rising software costs is the increasing reliance on high-level languages. One consequence is that the compiler writer is replacing the assembly language programmer in deciding which instructions the machine will execute. Compilers are often unable to utilize complex instructions, nor do they use the insidious tricks in which assembly language programmers delight. [...]

For example, measurements of a particular IBM 360 compiler found that 10 instructions accounted for 80% of all instructions executed, 16 for 90%, 21 for 95%, and 30 for 99%.
David Patterson and David Ditzel, The Case for the Reduced Instruction Set Computer, ACM Computer Architecture News, 1980
But are you really convinced by the argument?
Growth in clock rates
The Power Wall
Figure 1.16 in Patterson & Hennessy
Clock rate and Power for Intel x86 microprocessors over eight generations and 25 years
A simple formula for computing the dynamic energy and the dynamic power of a CMOS transistor:

Dynamic energy ≈ Capacitive load of the transistor × Voltage²
(for a full logic transition 0 → 1 → 0)

Dynamic energy ≈ 1/2 × Capacitive load of the transistor × Voltage²
(for a single transition 0 → 1 or 1 → 0)

Dynamic power ≈ 1/2 × Capacitive load of the transistor × Voltage² × Frequency

Note that there is also static energy consumption in CMOS technology, because of leakage current that flows even when the transistor is off.
Exercise [from Patterson & Hennessy, 1.8]

The Pentium 4 Prescott processor, released in 2004, had a clock rate of 3.6 GHz and a voltage of 1.25 V. Assume that, on average, it consumed 10 W of static power and 90 W of dynamic power. The Core i5 Ivy Bridge, released in 2012, had a clock rate of 3.4 GHz and a voltage of 0.9 V. Assume that, on average, it consumed 30 W of static power and 40 W of dynamic power.

a. For each processor, find the average capacitive loads.

b. Find the percentage of the total dissipated power comprised by static power, and the ratio of static power to dynamic power for each technology.

c. If the total dissipated power is to be reduced by 10%, how much should the voltage be reduced to maintain the same leakage current? Note: power is defined as the product of voltage and current.
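As a hedged aid for part (a), not from the slides: rearranging the dynamic power formula P ≈ 1/2 × C × V² × f gives C ≈ 2 × P / (V² × f), which a few lines of C can evaluate with the numbers stated in the exercise.

/* Solution sketch for part (a): capacitive load from dynamic power. */
#include <stdio.h>

int main(void) {
    /* Pentium 4 Prescott: 90 W dynamic power, 1.25 V, 3.6 GHz */
    double c_prescott = 2.0 * 90.0 / (1.25 * 1.25 * 3.6e9);
    /* Core i5 Ivy Bridge: 40 W dynamic power, 0.9 V, 3.4 GHz */
    double c_ivybridge = 2.0 * 40.0 / (0.9 * 0.9 * 3.4e9);
    printf("Prescott:   C = %.2e F\n", c_prescott);   /* about 3.2e-8 F */
    printf("Ivy Bridge: C = %.2e F\n", c_ivybridge);  /* about 2.9e-8 F */
    return 0;
}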
The industry turning to multicore architectures
The computer industry is undergoing, if not another revolution, certainly a vigorous shaking-up. The major chip manufacturers have, for the time being at least, given up trying to make processors run faster. Moore's law has not been repealed: each year, more and more transistors fit into the same space, but their clock speed cannot be increased without overheating. Instead, manufacturers are turning to « multicore » architectures, in which multiple processors (cores) communicate directly through shared hardware caches. Multiprocessor chips make computing more effective by exploiting parallelism: harnessing multiple processors to work on a single task.
Maurice Herlihy and Nir Shavit, The Art of Multiprocessor Programming, Morgan Kaufmann Publishers, 2008
The Princeton architecture
also known as the Von Neumann architecture

Program and data are stored in just the same place

• Clever unification of the notion of « memory »
• A price to pay: the interpretation of a value stored in memory now depends on a control signal
• A serious risk of mistaking data for code... but at the same time, great for self-modifying code!

Purely sequential instruction processing

• Exactly one instruction is processed at a time
• To that purpose, a special register called the program counter pc identifies the position in memory of the current instruction
• The program counter pc is advanced sequentially by every instruction, except in the case of control transfer instructions like goto's or beq's
The Princeton architecture
also known as the Von Neumann architecture

The program counter pc is a special register which contains the address in memory of the current instruction.

[Figure, shown in four successive animation steps: a column of 32-bit memory words in binary, with a bracket marking the current instruction; at each step the program counter advances by one instruction, its low byte going 00101000 → 00101100 → 00110000 → 00110100 (i.e. by 4 bytes), while the memory contents stay fixed.]
In the case of the MIPS instruction set...

Add instruction: Register[rd] = Register[rs] + Register[rt]

opcode 000000 | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt 00000 | funct 100000
In the case of the MIPS instruction set...

Slt instruction (set on less than, signed): if $rs is strictly less than $rt, then $rd is set to one; $rd is set to zero otherwise.

opcode 000000 | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt 00000 | funct 101010
In the case of the MIPS instruction set...

Add immediate instruction: Register[rt] = Register[rs] + Immediate

opcode 001000 | rs (5 bits) | rt (5 bits) | immediate (16 bits)
In the case of the MIPS instruction set...

BNE instruction (branch on not equal): branches if the two registers are not equal and carries on otherwise.

opcode 000101 | rs (5 bits) | rt (5 bits) | immediate (16 bits)
In the case of the MIPS instruction set...

Jump instruction: jump to the address of memory obtained by concatenating the top four bits of the program counter pc, the 26-bit immediate, and two zero bits.

opcode 000010 | immediate (26 bits)
target address = pc (4 bits) · immediate (26 bits) · 00
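To make the field layout concrete, here is a small hedged C sketch (assuming the standard MIPS field positions above) that encodes an R-type add instruction with shifts and masks, then decodes it back:

/* Sketch: encoding and decoding "add $rd, $rs, $rt".
   Fields: opcode[31:26] rs[25:21] rt[20:16] rd[15:11] shamt[10:6] funct[5:0] */
#include <stdint.h>
#include <stdio.h>

uint32_t encode_add(uint32_t rd, uint32_t rs, uint32_t rt) {
    return (0u << 26)      /* opcode 000000 */
         | (rs << 21)
         | (rt << 16)
         | (rd << 11)
         | (0u << 6)       /* shamt 00000 */
         | 0x20u;          /* funct 100000 */
}

int main(void) {
    uint32_t insn = encode_add(3, 1, 2);      /* add $3, $1, $2 */
    uint32_t rs = (insn >> 21) & 0x1fu;
    uint32_t rt = (insn >> 16) & 0x1fu;
    uint32_t rd = (insn >> 11) & 0x1fu;
    printf("insn = 0x%08x : add $%u, $%u, $%u\n",
           (unsigned)insn, (unsigned)rd, (unsigned)rs, (unsigned)rt);
    return 0;
}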
The instruction cycle of the Princeton architecture

Key principle: each step of the execution cycle starts only after the previous step has been completed, in a purely sequential order:

fetch instruction → instruction decode → fetch operands → execute → write back

In particular, one needs to decode the instruction before getting its operands.
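A minimal sketch in C (a hypothetical toy ISA, not MIPS) makes the five sequential steps explicit as one loop over a unified program-and-data memory:

/* Sketch: a toy Princeton-style fetch-decode-execute loop. */
#include <stdint.h>

enum { HALT = 0, ADD = 1 };    /* hypothetical opcodes */

uint32_t memory[256];          /* unified program + data memory */
uint32_t reg[8];

void run(void) {
    uint32_t pc = 0;                              /* program counter */
    for (;;) {
        uint32_t insn = memory[pc];               /* 1. fetch instruction */
        uint32_t op = insn >> 24;                 /* 2. decode            */
        uint32_t a  = (insn >> 16) & 0x7u;        /*    3-bit register fields */
        uint32_t b  = (insn >> 8) & 0x7u;
        uint32_t d  = insn & 0x7u;
        if (op == HALT)
            return;
        uint32_t x = reg[a], y = reg[b];          /* 3. fetch operands    */
        uint32_t r = x + y;                       /* 4. execute (ADD)     */
        reg[d] = r;                               /* 5. write back        */
        pc = pc + 1;                              /* advance sequentially */
    }
}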
The Princeton architecture

[Diagram: a single bus connects the Central Processing Unit (CPU), made of the control unit, the ALU and the registers, to the main memory, the secondary memory (storage), the input devices (keyboard, mouse) and the output devices (display, printer).]
The Harvard architecture

Separation of memory into « code memory » and « data memory »

[Diagram: the same CPU (control unit, ALU, registers), secondary memory (storage), input devices (keyboard, mouse) and output devices (display, printer), but with two separate buses: a data bus connecting the CPU to the data memory, and a code bus connecting it to the code memory (typically ROM).]
The instruction cycle of the Harvard architecture

Key idea: fetch instruction → instruction decode → fetch operands → execute → write back

Thanks to the separation between the code bus and the data bus, the instruction and its operands may be fetched from memory at the same step of the instruction cycle!

Disturbing fact: there is no definite « state » of the machine.
One step further in parallelism: the data-flow machines

Simple data-flow program extracted from the 1974 paper by Dennis and Misunas, « A preliminary architecture for a basic data-flow processor », together with the corresponding sequential program:

L1 ← a
L2 ← b
L3 ← L1 + L2
L4 ← L3 * L1
L5 ← L3 / L6
L6 ← L4 + L2
x ← L6
y ← L5

[Figure: the same program drawn as a data-flow graph.]
Data-Flow machines

• A program is not defined as a sequence of instructions but as a graph of instructions, also called data-flow nodes. The execution is data-driven rather than control-driven.
• Each instruction or data-flow node « waits » for its operands and « fires » as soon as all of them are available. Intrinsically parallel instruction processing.
• In particular, there is no need for a code pointer pc. There is no precise execution state.
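A hedged C rendering of the Dennis-Misunas example above (assuming inputs a = 2 and b = 3, chosen arbitrarily): in a data-flow machine the independent assignments may fire concurrently as soon as their operand tokens arrive; the sequential order below only has to respect the data dependencies.

/* Sketch: the data-flow example evaluated in dependency order. */
#include <stdio.h>

int main(void) {
    double a = 2.0, b = 3.0;              /* input tokens */
    double L1 = a;
    double L2 = b;
    double L3 = L1 + L2;                  /* fires once L1 and L2 are available */
    double L4 = L3 * L1;                  /* independent of L5 and L6 */
    double L6 = L4 + L2;                  /* depends on L4 and L2 */
    double L5 = L3 / L6;                  /* depends on L3 and L6 */
    printf("x = %g, y = %g\n", L6, L5);   /* outputs: x <- L6, y <- L5 */
    return 0;
}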
Data Flow Nodes

COPY: duplicates its input token on each of its output arcs (shown both for int and for bool tokens).
BRANCH: consumes an int token and a bool token, and routes the int token to its True or its False output arc depending on the boolean.
BARRIER SYNCH: waits until a token has arrived on each of its input arcs before letting the computation proceed.

[Figure: the COPY, BRANCH and BARRIER SYNCH nodes drawn with their typed input and output arcs.]
[Figure, shown in successive animation steps: the same COPY, BRANCH and BARRIER SYNCH nodes with concrete tokens (integers such as 5, 7, 4 and 3, and the boolean True) placed on their input arcs, then consumed and reproduced on the output arcs as each node fires.]
Data Flow Nodes

RELATION: a comparison node such as < consumes two int tokens and produces a bool token.

[Figure, shown in successive animation steps: the < node receives the input tokens 3 and 5, fires, and produces the output token True.]
Exercise: find out what function this data-flow program computes!

[Figure: a data-flow program with an int input IN and an int output OUT, built from copy, branch and dec nodes, a comparison > 0, a multiplication *, the constant 1, and True/False branch outputs.]
int
Open discussionShould there be a code pointer in any reasonable Instruction Set Architecture ?
Is it possible/intuitive/safe to program and/or compile in a data-�ow architecture ?
This is a very serious and interesting debate !!!
Today, all the major ISAs are based on the Princeton architecture:
In contrast, their optimized microarchitectures take full advantage of parallelism:
• x86 • ARM • MIPS • SPARC • POWER
And what about debugging a data-�ow program ?
Current trade-o� between data-driven and control-driven execution:
Electricity is parallel but the Programmer’s Mind is sequential
But do you believe yourselfin that accepted view ?
• pipelined instruction execution • multiple instructions at a time • out-of-order execution
This can be summarized in a slogan:
The underworld of microarchitecture

Architecture: the sequential instruction set, i.e. what the User/Programmer can see.
Microarchitecture: the parallel implementation, generally not exposed to the User/Programmer.
Recall the origins of the word « architecture »

The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logical design, and the physical implementation.

Amdahl, Blaauw, Brooks, Architecture of the IBM System/360, IBM Journal of Research and Development, April 1964

In today's vocabulary: organization of the data flow = microarchitecture; logical design = digital logic; physical implementation = circuit.
One apparent difficulty: the notion of internal state of the system is lost

First of all, it may be difficult to guess in what state the microarchitecture is at a given point of the execution of the machine code.

More conceptually, there is the temptation to reason about the result of the execution of machine code independently of the microarchitecture.

This is the direction taken by the so-called « memory models » like
• the Java memory model, developed in 1995
• more recently, the C11 memory model.

These memory models are typically defined using partial orders expressing that an instruction « happened before » another one during the execution.
Out-of-order execution in a multiprocessor scenario

Consider these two threads and run them in parallel on an x86 or a Power multiprocessor:

Thread 1:    x ← 1
             r1 ← y

Thread 2:    y ← 1
             r2 ← x

Suppose also that x and y have value 0 before execution.

Question: how many results of the executions are possible?

Somewhat surprisingly, the correct answer is 4. In particular, the outcome r1 = 0 and r2 = 0 is also possible.

Can you explain what happened?
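As a hedged illustration, not from the slides: the same experiment written as a C11 litmus-test harness with relaxed atomics and POSIX threads. Run often enough on an x86 machine, the weak outcome r1 = 0 and r2 = 0 eventually shows up, because each core's store may still sit in its store buffer while the following load executes.

/* Store-buffering litmus test. Compile with: cc -std=c11 -pthread sb.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x, y;
int r1, r2;

void *thread1(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);   /* x <- 1  */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);  /* r1 <- y */
    return NULL;
}

void *thread2(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);   /* y <- 1  */
    r2 = atomic_load_explicit(&x, memory_order_relaxed);  /* r2 <- x */
    return NULL;
}

int main(void) {
    for (int i = 1; i <= 1000000; i++) {
        atomic_store(&x, 0);                 /* reset the shared variables */
        atomic_store(&y, 0);
        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        if (r1 == 0 && r2 == 0) {            /* the « impossible » outcome */
            printf("weak outcome r1 = 0, r2 = 0 observed on run %d\n", i);
            return 0;
        }
    }
    printf("weak outcome not observed in 1000000 runs\n");
    return 0;
}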
Any questions?

Thank you!