Lecture5 en CA Principals Performance 2014

Computer Architecture Principles – developments on the basis of technology and software. Performance Measurement: influence of new technologies, microprocessor economics, trends in technology, importance of measuring performance, quantitative performance measurement, performance metrics, Amdahl's Law


Page 1: Lecture5 en CA Principals Performance 2014

Computer Architecture Principles – developments on the basis of technology and software. Performance Measurement

Influence of new technologies, microprocessor economics, trends in technology, importance of measuring performance, quantitative performance measurement, performance metrics, Amdahl's Law

Page 2: Lecture5 en CA Principals Performance 2014

Computer Architecture

• Computer Architecture (CA) = Instruction Set Architecture (ISA) + Machine organization

• ISA = programmer’s view of a machine

• Factors that influence CA:

• Technology

• Software

• Application

– IBM coined the term computer architecture in the early 1960s. Amdahl, Blaauw, and Brooks [1964] used the term to refer to the programmer-visible portion of the IBM 360 instruction set

Page 3: Lecture5 en CA Principals Performance 2014

Influence of new technologies

• New technology provides:

• - greater speed

• - smaller size

• - higher reliability

• - lower cost

• - and allows designers and engineers to consider new opportunities

Page 4: Lecture5 en CA Principals Performance 2014

CA - Technology

• Technology is the dominating factor in the design of computers and, accordingly, in the organization of CA

• The development of technology (transistors, IC, VLSI, Flash memory, Laser disk, CDs) influenced the development of computers

• The development of computers (core memories, magnetic tapes, disks) influenced the development of technology

• VLSI (Very-Large-Scale Integration) is the process of creating integrated circuits by combining thousands of transistors into a single chip. VLSI began in the 1970s when complex semiconductor and communication technologies were being developed. Nowadays there are millions, even billions of transistors on a chip

Page 5: Lecture5 en CA Principals Performance 2014

CA - Technology

• The development of both computers and technology influenced each other (ROMs, RAMs, VLSI, packaging, low power, etc.)

• Fast new processors required faster/new peripheral chips, new memory controllers and new I/O; in turn, fast and powerful computers facilitate, improve and speed up design, simulation and manufacturing.

• Transistors use power, which means that they generate heat that must be removed. The heat makes the design less reliable. In 20 years supply voltages have gone from 5 V to 1.5 V, significantly reducing power

Page 6: Lecture5 en CA Principals Performance 2014

Relative performance per unit cost of technologies used in computers over time (source: H&P)

Year | Technology used in computers | Relative performance/unit cost
1951 | Vacuum tube                  | 1
1965 | Transistor                   | 35
1975 | Integrated circuit (IC)      | 900
1995 | Very large scale IC          | 2 400 000
2005 | Ultra large scale IC         | 6 200 000 000

Page 7: Lecture5 en CA Principals Performance 2014

Manufacturing chips

• Silicon (a natural element and a semiconductor) crystal ingot – a rod composed of a silicon crystal (6-12" in diameter and 12-24" long) is sliced into wafers less than 0.1" thick

• Wafers are processed (patterns of chemicals are placed on each wafer, creating transistors (1 layer), conductors (2 to 8 levels) and insulators separating the conductors)

• Processed wafers are tested for defects and then chopped up (diced) into dies (chips). Bad dies are discarded (yield = % of good dies out of the total number of dies)

• Good dies are connected to the I/O pins of a package by bonding. Packaged parts are tested, as mistakes can occur in packaging. Then chips are shipped

Page 8: Lecture5 en CA Principals Performance 2014

The chip manufacturing process

(Figure: silicon ingot → slicer → blank wafers → 20 to 40 processing steps → patterned wafers → wafer tester → tested wafers → dicer → tested dies → bond die to package → packaged dies → part tester → tested packaged dies → ship to customers)

Page 9: Lecture5 en CA Principals Performance 2014

Microprocessor economics

• Designing a state-of-the-art processor requires a large team:

– Pentium – 300 engineers

– Pentium Pro – 500 engineers

• Huge investments in fabrication lines:

– The manufacturer needs to sell in the range of 2 to 4 million units to be profitable

– The design cost of a high-end CPU is on the order of US $100 million (http://www.wordiq.com/definition/CPU_design)

– A microprocessor plant might cost $1.6 billion (http://www.gamespot.com/news/6025378.html)

Page 10: Lecture5 en CA Principals Performance 2014

Microprocessor economics

• To stay competitive a company has to fund at least two large design teams to release products at the rate of 2.5 years per product generation. Continuous improvements are needed to improve yields and clock speed.

• Price drops one tenth in 2-3 years.

• Only computer mass market (production rates in the hundreds of millions and billions of dollars in revenue) can support such economics (personal computers, car computers, cell phones, etc)

Page 11: Lecture5 en CA Principals Performance 2014

A task

1. Let us assume that you are in a company marketing a certain IC chip. Fixed costs, including R&D, fabrication, equipment, etc., add up to $500,000. The cost per wafer is $6,000. Each wafer can be diced into 1,500 dies. The die yield is 50%. The dies are packaged and tested (at the end) at a cost of $10 per chip. The test yield is 90%. Only chips that pass the test will be sold to customers. If the retail price is 40% more than the cost, at least how many chips have to be sold to break even?
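Below is a minimal Python sketch of one way to compute the break-even point. The variable names and the reading of "40% more than the cost" as a 40% margin per sold chip are assumptions for illustration, not part of the task statement.

```python
import math

# Break-even sketch for the IC chip task (one possible interpretation of the numbers).
fixed_costs = 500_000        # R&D, fabrication, equipment, etc. ($)
wafer_cost = 6_000           # cost per wafer ($)
dies_per_wafer = 1_500
die_yield = 0.50             # fraction of good dies per wafer
package_and_test_cost = 10   # packaging and final test cost per chip ($)
test_yield = 0.90            # fraction of packaged chips that pass the final test

# Cost of one good (untested) die.
cost_per_good_die = wafer_cost / (dies_per_wafer * die_yield)                   # $8.00

# Every good die is packaged and tested, but only 90% can be sold, so the cost
# of one sellable chip absorbs the chips discarded at the final test.
cost_per_sold_chip = (cost_per_good_die + package_and_test_cost) / test_yield   # $20.00

# Retail price is 40% above cost, so the margin on each sold chip is 40% of its cost.
profit_per_chip = 0.40 * cost_per_sold_chip                                     # $8.00

break_even_units = math.ceil(fixed_costs / profit_per_chip)
print(f"cost per sold chip: ${cost_per_sold_chip:.2f}")
print(f"break-even volume:  {break_even_units} chips")                          # 62500
```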

Page 12: Lecture5 en CA Principals Performance 2014

Technology driven views of CA

• Implement (old) ISA, using new technology

• An iterative process: – Select new features

– Design datapaths and control

– Estimate cost

– Measure performance with simulators

• Required expertise: – Hardware and circuit design

– Simulation, verification and testing

– Back-end compilers and performance evaluation

Page 13: Lecture5 en CA Principals Performance 2014

Trends in Technology

• Today’s designers must be aware of rapid changes in implementation technology – some of them are critical to modern implementations:

• - IC logic technology – transistor density increases by about 35% per year

• - Semiconductor DRAM – capacity increases by about 40% per year

• - Magnetic disk technology – density increased about 30% per year before the 1990s, 60% per year thereafter, reaching 100% per year in 1996; since 2004 it is back to about 30% per year. Disk storage is still 50-100 times cheaper per bit than DRAM

Page 14: Lecture5 en CA Principals Performance 2014

Trends in Technology

• For magnetic disks, a new recording technique has evolved – vertical (perpendicular) recording

• Its areal density ranges somewhere between 100 and 150 gigabits per square inch

• The magnetization of the bits stands them on end, perpendicular to the plane of the disk, giving the name "vertical" recording

• Future seek-time reductions are expected to be minimal; most of the performance improvement for disk drives will most likely come from faster rotation speeds (rpm)

• Disk technology roadmaps indicate that disk drive capacity will approach 5 terabytes by 2015

Page 15: Lecture5 en CA Principals Performance 2014

Trends in Technology

• Network technology depends mainly on the performance of switches and of transmission systems

• Networking technology trends favor flexibility and remote access to network resources

• WAN optimization reduces traffic to remote locations on the network by consolidating data and caching large, frequently used files – this improves application performance for branch locations and remote workers and also reduces bandwidth costs

• Another modern trend is the "cloud", which increases collaboration capability

Page 16: Lecture5 en CA Principals Performance 2014

Scaling transistor performance and wires

• IC processes are characterized by the feature size – min size of a transistor. It was 10 microns in 1971 and nowadays it is 45 nanometers (0.045 microns) !!!

• Reduced size allowed moving from 4 bit to 8, 16, 32 and recently – to 64 bit microprocessors

• In general, transistors improve in performance with decreased feature size; however, wires in an IC do not. The signal delay for a wire increases in proportion to the product of its resistance and capacitance, and shrinking feature size makes resistance and capacitance worse!

• Wire delay scales poorly compared to transistor performance – major design limitation in recent years

Page 17: Lecture5 en CA Principals Performance 2014

Trends in power in IC

• Power provides challenges as devices grow in size and number of transistors

• Chips have hundreds of pins and multiple interconnect layers, so power and ground must be provided for all parts of the chip.

• For CMOS chips, energy consumption is dominated by switching transistors (i.e. dynamic power). Dynamic power is proportional to the capacitive load, the square of the voltage and the frequency of switching: Power_dynamic ≈ 1/2 × Capacitive load × Voltage² × Frequency of switching

• Lowering the voltage reduces dynamic power – supply voltage has already dropped from 5 V to just over 1 V
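As an illustration of the relation above, a small Python sketch; the capacitance and frequency values are illustrative assumptions, not figures from the lecture.

```python
def dynamic_power(cap_load_farads, voltage_volts, switch_freq_hz):
    """Dynamic power of CMOS switching: 1/2 x capacitive load x voltage^2 x frequency."""
    return 0.5 * cap_load_farads * voltage_volts ** 2 * switch_freq_hz

# Same (assumed) capacitive load and switching frequency at 5 V and at 1.5 V.
cap = 2e-9    # 2 nF of effective switched capacitance (assumed)
freq = 1e9    # 1 GHz switching frequency (assumed)

p_5v = dynamic_power(cap, 5.0, freq)
p_1v5 = dynamic_power(cap, 1.5, freq)
print(f"power at 5.0 V: {p_5v:.2f} W")     # 25.00 W
print(f"power at 1.5 V: {p_1v5:.2f} W")    # 2.25 W
print(f"reduction: {p_5v / p_1v5:.1f}x")   # ~11x, i.e. (5/1.5)^2
```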

Page 18: Lecture5 en CA Principals Performance 2014

Trends in power in IC

• Power is now the major limitation to using transistors – in the past it was raw silicon area

• Most microprocessors today turn off the clock of inactive modules to save energy and dynamic power (if no FP instr are executing, the clock of the FPU is disabled)

• Static power is also becoming an issue because of leakage current, which flows even when a transistor is off. Leakage current increases with smaller transistor sizes.

• In 2006, the goal for leakage was 25% of the total power consumption !!!

• One way to overcome this was placing multiple processors on chip running at lower voltages and clock rates

Page 19: Lecture5 en CA Principals Performance 2014

Software

• Software is another important factor influencing CA

• Before the mid fifties, software played almost no role in defining architecture

• As people write programs and use computers, our understanding of programming and program behavior improves.

• This has profound though slower impact on computer architecture

• Modern architects cannot avoid paying attention to software and compilation issues

Page 20: Lecture5 en CA Principals Performance 2014

Von Neumann Machines

• Stored program concept – key architectural

concept – paved the way for modern processors

• Program and data are stored in the same

memory

• Program counter (PC) points to the current

instruction in memory, updated on every

instruction, addresses next instruction

• Program words are fetched from sequential

memory locations

Page 21: Lecture5 en CA Principals Performance 2014

Situation in mid 50’s

• Expensive hardware

• Small memory size (1000 words) – No resident system-software!

• Memory access time - 10 to 50 times slower than the processor cycle – instruction execution time was totally dominated by the memory reference time.

• The ability to design complex control circuits to execute an instruction was the central design concern as opposed to the speed of decoding or an ALU operation

• Programmer’s view of the machine was inseparable from the actual hardware implementation

Page 22: Lecture5 en CA Principals Performance 2014

Compatibility Problem at IBM

• By the early 60's, IBM had 4 incompatible lines of computers: 701, 650, 702, 1401

• Each system had its own:

• Instruction set

• I/O system and secondary storage: magnetic tapes, drums and disks

• Assemblers, compilers, libraries, ...

• Market niche: business, scientific, real time, ...

• This caused problems and led to the creation of the IBM 360

Page 23: Lecture5 en CA Principals Performance 2014

Programmer's view of the machine: IBM 650

• A drum machine with 44 instructions

– Instruction: 60 1234 1009

• "Load the contents of location 1234 into the distributor; put it also into the upper accumulator; set the lower accumulator to zero; and then go to location 1009 for the next instruction."

• Good programmers optimized the placement of instructions on the drum to reduce latency!

• What is an Instruction set?

Page 24: Lecture5 en CA Principals Performance 2014

Instruction set

• An instruction set is the sum of basic operations that a processor can accomplish. A processor’s instruction set is a determining factor in its architecture, even though the same architecture can lead to different implementations by different manufacturers.

• The processor works efficiently thanks to a limited number of instructions, hardwired into the electronic circuits. Most operations can be performed using these basic functions. Some architectures include advanced processor functions.

Page 25: Lecture5 en CA Principals Performance 2014

Instruction set

• Independent of machine organization – machine "families"

– Machines with very different organizations and capabilities

– Running the same software

• The IBM 360 instruction set architecture completely hid the underlying technological differences between the various models

• From Model 30 up to Model 70, the same instruction set was running

Page 26: Lecture5 en CA Principals Performance 2014

Machine organization

• Machine Organization

– Number of functional blocks

– Interconnect pattern

– Transparent to software (affects performance though!)

• Microcode (a layer of hardware-level instructions and/or data structures involved in the implementation of higher-level machine code instructions; usually does not reside in the main memory, but in a special high-speed memory)

Page 27: Lecture5 en CA Principals Performance 2014

The Earliest Instruction Sets

• Single Accumulator - a carry-over from the calculators

• LOAD, STORE, ADD, SUB, MUL, DIV, SHIFT LEFT, SHIFT RIGHT, JUMP, JGE, HLT, ...

– Typically fewer than two dozen instructions!

• Processor-Memory Bottleneck: early solutions

– Fast local storage in the processor: 8-16 registers as opposed to one accumulator

– Indexing capability: to reduce bookkeeping instructions (modification for the next iteration – otherwise one has to remember a lot of variables/traces)

– Complex instructions: to reduce instruction fetches

– Compact instructions: implicit address bits for operands, to reduce instruction fetches

Page 28: Lecture5 en CA Principals Performance 2014

Processor State

• The information held in the processor at the end of an instruction to provide the processing context for the next instruction

• Programmer visible state of the processor (and memory) plays a central role in computer organization for both hardware and software: – Software must make efficient use of it

– If the processing of an instruction can be interrupted then the hardware must save and restore the state in a transparent manner

• Programmer’s machine model is a contract between the hardware and software

Page 29: Lecture5 en CA Principals Performance 2014

Classifying Instruction Set

Architectures

• Two classes of register computers

• - can access memory as part of any instruction (register-memory)

• - can access memory only with load and store instructions (load-store)

• Most early computers used stack or accumulator architectures

• Since 1980, almost all use load-store architecture

Page 30: Lecture5 en CA Principals Performance 2014

Operand locations for instruction set architecture classes

(Figure: operand locations in memory and in the processor for the four ISA classes – stack, accumulator, register-memory and register-register/load-store – showing how operands reach the ALU.)

Page 31: Lecture5 en CA Principals Performance 2014

C = A + B

• Stack:
– Push A
– Push B
– Add
– Pop C

• Accumulator:
– Load A
– Add B
– Store C

• Register (reg-memory):
– Load R1, A
– Add R3, R1, B
– Store R3, C

• Register (load-store):
– Load R1, A
– Load R2, B
– Add R3, R1, R2
– Store R3, C

Page 32: Lecture5 en CA Principals Performance 2014

Classifying Instruction Set

Architectures

• Reasons for emergence of general purpose register computers (GPR)

• - registers are faster than memory

• - registers are more efficient for a compiler to use than other forms of internal storage

• (Example: a*b – b*c – a*d can be evaluated by doing the multiplications in any order, which is more efficient with pipelining; on a stack, evaluation can be done in only one order, operands are hidden in the stack and may have to be loaded multiple times)

Page 33: Lecture5 en CA Principals Performance 2014

Performance

• What is performance?

• Is it response time or execution time?

• Response time is the time a system or

functional unit takes to react to a given input

• Execution time is the time to execute the

program - between its start and finish

• Here we treat them as identical - the time to complete one task

• Measured in sec., millisec., microsec., nanosec.,

picosec…

Page 34: Lecture5 en CA Principals Performance 2014

Performance, latency and bandwidth

• Another term used is latency

• Latency (response time) is typically measured in nanoseconds for processors and RAM, microseconds for LANs and milliseconds for hard disk access

• Performance is the primary differentiator for microprocessors and networks - they had improved 1000–2000 times in bandwidth and only 20–40 times in latency

• Bandwidth is used for the amount of information that can flow through a network/bus at a given period of time - a data transmission rate; the maximum amount of information (bits/second) that can be transmitted along a channel

Page 35: Lecture5 en CA Principals Performance 2014

Capacity, bandwidth, latency

• Capacity is generally more important than performance for memory and disks, so capacity has improved most - their bandwidth advances were 120–140 times, while their gains in latency were 4–8 times. Bandwidth has outpaced latency across these technologies and will likely continue to do so

• Figure 1.9 from H&P (page 16 – 4th edition) gives the performance milestones over 20 to 25 years for microprocessors, memory, networks, and disks (1978-2003)

• A task – try to give some figures for 2014!

Page 36: Lecture5 en CA Principals Performance 2014

Importance of measuring

performance

• Real systems:

• Within the free market/during procurement - to decide which system to purchase

• For system maintenance and capacity planning – to predict and plan when an upgrade is needed – either for parts of a system, or for the entire system

• For the applications (i.e. tuning) – to be able to find bottlenecks/hotspots in the application and record them/take actions

Page 37: Lecture5 en CA Principals Performance 2014

Importance of measuring

performance

• During dynamic compilation - to perform

heavy optimizations on application

hotspots

• As a feedback for architects – to find out

what are the performance bottlenecks in a

particular design

• Paper design:

• To be able to compare design alternatives

Page 38: Lecture5 en CA Principals Performance 2014

Development of Quantitative

Performance Measures

• Initially designers set performance goals—ENIAC was to be 1000 times faster than the Harvard Mark-I, and the IBM Stretch (7030) was to be 100 times faster than the fastest machine in existence

• It wasn’t clear how this performance was measured

• The original measure of performance was time to perform an operation (an addition for example as most instructions had the same exec time)

Page 39: Lecture5 en CA Principals Performance 2014

Development of Quantitative

Performance Measures

• Execution times of instructions in a machine

became more diverse - hence the time for one

operation was no good for comparisons

• An instruction mix – according to the relative

frequency of instructions across many programs

(early popular example – the Gibson mix –

1970)

• Average instruction execution time = sum over the instruction classes of (instruction execution time × weight in the mix)
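A short Python sketch of this weighted-average calculation; the instruction mix and cycle counts below are made-up illustrative values, not the Gibson mix.

```python
# Weighted-average CPI (and average instruction time) from an instruction mix.
# The mix and per-class cycle counts below are illustrative assumptions.
mix = {"ALU": 0.50, "load": 0.20, "store": 0.10, "branch": 0.20}   # relative frequencies
cycles = {"ALU": 1, "load": 2, "store": 2, "branch": 3}            # clock cycles per class

average_cpi = sum(mix[cls] * cycles[cls] for cls in mix)
print(f"average CPI = {average_cpi:.2f}")          # 0.5*1 + 0.2*2 + 0.1*2 + 0.2*3 = 1.70

clock_cycle_time_ns = 0.5                          # assumed 2 GHz clock
print(f"average instruction time = {average_cpi * clock_cycle_time_ns:.2f} ns")
```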

Page 40: Lecture5 en CA Principals Performance 2014

Development of Quantitative

Performance Measures

• Measured in clock cycles, the average

instruction execution time is the same as

average CPI (clock cycles per instruction)

CPI = CPU clock cycles for a program / Instruction count

MIPS = Instruction count / (Execution time × 10^6)

• A logical and easy-to-understand rate measure is MIPS – millions of instructions per second (for a given clock rate, MIPS is inversely proportional to CPI)

CPU time = Instruction count × CPI × Clock cycle time = Instruction count × CPI / Clock rate
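The same relations written as a minimal Python sketch; the helper names and the example numbers are assumptions for illustration.

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

def mips(instruction_count, exec_time_seconds):
    """MIPS = instruction count / (execution time x 10^6)."""
    return instruction_count / (exec_time_seconds * 1e6)

# Assumed example: 10^9 instructions, CPI = 1.5, 2 GHz clock.
n, cpi, clock = 1e9, 1.5, 2e9
t = cpu_time_seconds(n, cpi, clock)
print(f"CPU time = {t:.2f} s, MIPS = {mips(n, t):.0f}")   # 0.75 s, ~1333 MIPS
```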

Page 41: Lecture5 en CA Principals Performance 2014

Example

• Comp A – clock cycle time = 250 ps, CPI for ProgramX = 2.0

• Comp B – clock cycle time = 500 ps, CPI for ProgramX = 1.2; number of instructions = I

• Which is faster (same ISA)? By how much?

• CPU clock cycles_A = I × 2.0; CPU time_A = cycles × cycle time = I × 2.0 × 250 ps = 500 × I ps

• CPU clock cycles_B = I × 1.2; CPU time_B = cycles × cycle time = I × 1.2 × 500 ps = 600 × I ps

• …performance … execution time…
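A sketch of the comparison using the CPU-time relation above; the instruction count I cancels out, so only per-instruction times need to be compared.

```python
# Computers A and B running ProgramX (same ISA, same instruction count I).
cycle_time_a_ps, cpi_a = 250, 2.0
cycle_time_b_ps, cpi_b = 500, 1.2

time_a = cpi_a * cycle_time_a_ps   # CPU time per instruction on A: 500 ps (x I)
time_b = cpi_b * cycle_time_b_ps   # CPU time per instruction on B: 600 ps (x I)

print(f"CPU time A = {time_a:.0f} x I ps")
print(f"CPU time B = {time_b:.0f} x I ps")
print(f"A is faster by a factor of {time_b / time_a:.1f}")   # 600/500 = 1.2
```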

Page 42: Lecture5 en CA Principals Performance 2014

Example

• ProgramX runs on computer A for 15 seconds

• A new compiler requires 0.6 times as many instructions

• CPI is increased by a factor of 1.1

• How fast will ProgramX run with the new compiler?

• Ex time_old = Instruction count × CPI × Clock cycle time

• Ex time_new = (0.6 × Instruction count) × (1.1 × CPI) × Clock cycle time

• Ex time_new = 0.6 × 1.1 × Instruction count × CPI × Clock cycle time

• Ex time_new = 0.6 × 1.1 × 15 = 9.9 seconds

Page 43: Lecture5 en CA Principals Performance 2014

Development of Quantitative

Performance Measures

• CPUs became more complex, sophisticated; relied on pipelining and memory hierarchies

• As a CISC machine requires fewer instructions, one with lower MIPS rating might be equivalent to a RISC one with higher rating

• No longer a single execution time per instruction, hence MIPS could not be calculated from the instruction mix

• This was how benchmarking emerged - using kernels and synthetic programs for measuring performance

Page 44: Lecture5 en CA Principals Performance 2014

Development of Quantitative

Performance Measures

• Relative MIPS for a machine M was defined based on some reference machine as:

MIPS_M = (Performance_M / Performance_reference) × MIPS_reference

• The popularity of the VAX-11/780 made it a

popular reference machine for relative MIPS – it

was easy to calculate, so during the early 1980s,

the term MIPS was almost universally used to

mean relative MIPS

Page 45: Lecture5 en CA Principals Performance 2014

Development of Quantitative

Performance Measures

• In the 70s and 80s, the growth of the supercomputer industry and the use of floating-point-intensive programs led to the introduction of MFLOPS (millions of floating-point operations per second). For a given benchmark, MFLOPS is inversely proportional to execution time, so marketing people started quoting peak MFLOPS

Page 46: Lecture5 en CA Principals Performance 2014

Development of Quantitative

Performance Measures

• During the late 1980s, SPEC (System Performance Evaluation Cooperative) was founded to improve the state of benchmarking and to provide a better basis for comparisons

• Initially focused on workstations and servers in the UNIX marketplace

• The first release of SPEC benchmarks (called SPEC89) - a substantial improvement in the use of more realistic benchmarks. SPEC2006 still dominates processor benchmarks almost two decades later

Page 47: Lecture5 en CA Principals Performance 2014

How to measure performance

• Real systems:

• - wall-clock time, response time, or elapsed time

• - operating system timer functions

• - interrupt-driven profiling (gprof)

• - compiler or executable editing to insert software counters

• - external hardware (logic analyzers)

• - integrated performance monitoring hardware (event counters)

• - benchmarks

Page 48: Lecture5 en CA Principals Performance 2014

How to measure performance

• Paper Designs:

• - analytical techniques (queuing theory,

performance models)

• - hand simulation (pencil and paper)

• - software simulation (write program to

model machine)

• - hardware emulation (program FPGAs to

mimic machine)

Page 49: Lecture5 en CA Principals Performance 2014

Performance metrics

• MIPS (millions of instructions per second) – it is only meaningful when running the same executable code on the same inputs

• MFLOPS (millions of floating-point operations per second) – problems - how many FLOPS in a divide? Sqrt? Sine? (1 flop for add, sub, mul; 4 – for div, sqrt, 8 – exp, sine. For example - a kernel with one add, one divide, and one sin would be credited with 13 normalized floating-point operations)

• Some inefficient algorithms achieve high MFLOPS, hence the metric is only meaningful when comparing the same algorithm
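A small sketch of the normalized-flop counting described above; the operation weights are the ones quoted on the slide, while the kernel timing is an assumed value.

```python
# Normalized flop counting: weight the "hard" FP operations more heavily.
# Weights quoted on the slide: add/sub/mul = 1, div/sqrt = 4, exp/sin = 8.
weights = {"add": 1, "sub": 1, "mul": 1, "div": 4, "sqrt": 4, "exp": 8, "sin": 8}

def normalized_flops(op_counts):
    """Sum of operation counts, each scaled by its normalization weight."""
    return sum(weights[op] * n for op, n in op_counts.items())

# The slide's kernel: one add, one divide and one sine -> 1 + 4 + 8 = 13.
kernel = {"add": 1, "div": 1, "sin": 1}
nflops = normalized_flops(kernel)

exec_time_s = 2e-6   # assumed execution time of the kernel
print(f"normalized flops:  {nflops}")                            # 13
print(f"normalized MFLOPS: {nflops / (exec_time_s * 1e6):.1f}")   # 6.5
```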

Page 50: Lecture5 en CA Principals Performance 2014

Performance metrics

• TPS (transactions per second) – uses TP (transaction-processing) benchmarks (TPS also stands for Transaction Processing System)

• Many other similar measures

• - graphics – millions of triangles per second

• - neural networks – millions of connections per second

• All rate measures are 1/time

• Execution time is primary performance metric.

• Also: availability, channel capacity, scalability, performance per watt (the cost of powering a computer can outweigh the cost of the computer itself), compression ratio!!!

Page 51: Lecture5 en CA Principals Performance 2014

Benchmarks

• The best way of measuring performance is to use real applications as benchmarks (e.g. a compiler)

• benchmark suites – collections of benchmark applications - popular measure of performance of processors

• The goal of a benchmark suite - to characterize the relative performance of two computers

• One of the most successful attempts for standardized benchmark application - the SPEC (Standard Performance Evaluation Corporation)

• The evolution of the computer industry led to the need for different benchmark suites, so nowadays there are SPEC benchmarks to cover different application classes (only 3 integer programs and 3 floating-point programs survived three or more generations)

Page 52: Lecture5 en CA Principals Performance 2014

Amdahl’s Law

• Describes the performance gain that can

be obtained by improving some portion of

a computer – defines speedup

• It states that the performance

improvement to be gained from using

some faster mode of execution is limited

by the fraction of the time the faster mode

can be used

Page 53: Lecture5 en CA Principals Performance 2014

Speedup

Speedup = Performance for entire task using the enhancement when possible / Performance for entire task without using the enhancement

Speedup = Execution time for entire task without the enhancement / Execution time for entire task using the enhancement when possible

Speedup from an enhancement depends on two factors:

- The fraction of the computation time in the original computer that can be converted to take advantage of the enhancement – Fraction_enhanced (always less than or equal to 1)

- The improvement gained by the enhanced execution mode – how much faster the task would run – Speedup_enhanced (always greater than 1)

Page 54: Lecture5 en CA Principals Performance 2014

Amdahl’s Law

Execution time_new = Execution time_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Speedup_overall = Execution time_old / Execution time_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

We introduce a 10 times faster processor for Web serving.

The processor is busy with computation 40% of the time and waiting for I/O 60% of the time.

What is the overall speedup?

Fraction_enhanced = 0.4; Speedup_enhanced = 10

Speedup_overall = 1 / (0.6 + (0.4/10)) = 1/0.64 = 1.56
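A minimal sketch of Amdahl's Law and the Web-serving example above; the function name is mine, not from the lecture.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Web-serving example from the slide: enhancement usable 40% of the time, 10x faster.
print(f"{amdahl_speedup(0.4, 10):.2f}")     # 1.56

# Even with an arbitrarily fast enhancement the speedup is bounded by 1 / (1 - F).
print(f"{amdahl_speedup(0.4, 1e9):.2f}")    # -> 1.67
```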

Page 55: Lecture5 en CA Principals Performance 2014

Amdahl’s Law - discussions

• The incremental improvement in speedup gained by an improvement of just a portion of the computation diminishes as improvements are added

• If an enhancement is used for only a fraction of a task, the task cannot be sped up by more than the reciprocal of (1 – fraction)

• The law is a guide to how much an enhancement will improve performance and how to distribute resources to improve cost-performance

Page 56: Lecture5 en CA Principals Performance 2014

Amdahl’s Law

• The goal is to spend resources

proportional to where time is spent

• Useful for comparing the overall system

performance of two alternatives and for

comparing two processor design

alternatives

Page 57: Lecture5 en CA Principals Performance 2014

Task

• A common transformation required in graphics processors is the square root. Implementations of floating-point square root vary significantly in performance. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FPSQR hardware and speed this operation up by a factor of 10. The alternative is just to make all FP instructions run faster; FP instructions are responsible for half of the execution time. By how much do the FP instructions have to be accelerated to achieve the same performance as achieved by inserting the specialized hardware?
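A sketch of how the task could be checked numerically with Amdahl's Law; the algebraic rearrangement and the helper function are my own working, not part of the slide.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Option 1: speed up FPSQR (20% of execution time) by a factor of 10.
target = amdahl_speedup(0.20, 10)          # 1 / (0.8 + 0.02) ~= 1.2195

# Option 2: speed up all FP instructions (50% of execution time) by a factor x.
# Solve 1 / (0.5 + 0.5/x) = target  =>  x = 0.5 / (1/target - 0.5)
x = 0.5 / (1.0 / target - 0.5)

print(f"target overall speedup: {target:.4f}")    # 1.2195
print(f"required FP speedup:    {x:.4f}")         # 1.5625
print(f"check: {amdahl_speedup(0.5, x):.4f}")     # 1.2195
```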