Computer Evolution and Performance

Upload: gerald-manog

Post on 04-Nov-2015

51 views

Category:

Documents


2 download

DESCRIPTION

Computer Organization and Architecture by William Stallings

TRANSCRIPT

  • William Stallings
    Computer Organization and Architecture, 8th Edition
    Chapter 2: Computer Evolution and Performance

  • ENIAC - background
    Electronic Numerical Integrator And Computer
    Eckert and Mauchly, University of Pennsylvania
    Trajectory tables for weapons
    Started 1943, finished 1946
    Too late for the war effort
    Used until 1955

  • ENIAC - details
    Decimal (not binary)
    20 accumulators of 10 digits
    Programmed manually by switches
    18,000 vacuum tubes
    30 tons
    15,000 square feet
    140 kW power consumption
    5,000 additions per second

  • von Neumann/Turing
    Stored program concept
    Main memory storing programs and data
    ALU operating on binary data
    Control unit interpreting instructions from memory and executing them
    Input and output equipment operated by control unit
    Princeton Institute for Advanced Studies (IAS) machine
    Completed 1952

  • Structure of von Neumann machine

  • IAS - details
    1,000 x 40-bit words
    Binary numbers
    2 x 20-bit instructions per word
    Set of registers (storage in CPU):
    Memory Buffer Register
    Memory Address Register
    Instruction Register
    Instruction Buffer Register
    Program Counter
    Accumulator
    Multiplier Quotient

  • Structure of IAS detail

  • Commercial Computers
    1947 - Eckert-Mauchly Computer Corporation
    UNIVAC I (Universal Automatic Computer)
    US Bureau of Census 1950 calculations
    Became part of Sperry-Rand Corporation
    Late 1950s - UNIVAC II
    Faster, more memory

  • IBM
    Punched-card processing equipment
    1953 - the 701, IBM's first stored-program computer
    Scientific calculations
    1955 - the 702
    Business applications
    Led to the 700/7000 series

  • Transistors
    Replaced vacuum tubes
    Smaller, cheaper, less heat dissipation
    Solid-state device
    Made from silicon (sand)
    Invented 1947 at Bell Labs
    William Shockley et al.

  • Transistor-Based Computers
    Second generation machines
    NCR & RCA produced small transistor machines
    IBM 7000 series
    DEC (founded 1957) produced the PDP-1

  • Microelectronics
    Literally - "small electronics"
    A computer is made up of gates, memory cells and interconnections
    These can be manufactured on a semiconductor, e.g. a silicon wafer

  • Generations of Computers
    Vacuum tube - 1946-1957
    Transistor - 1958-1964
    Small scale integration - 1965 on
    Up to 100 devices on a chip
    Medium scale integration - to 1971
    100-3,000 devices on a chip
    Large scale integration - 1971-1977
    3,000-100,000 devices on a chip
    Very large scale integration - 1978-1991
    100,000-100,000,000 devices on a chip
    Ultra large scale integration - 1991 on
    Over 100,000,000 devices on a chip

  • Moore's Law
    Increased density of components on chip
    Gordon Moore - co-founder of Intel
    Number of transistors on a chip will double every year
    Since the 1970s development has slowed a little
    Number of transistors doubles every 18 months
    Cost of a chip has remained almost unchanged
    Higher packing density means shorter electrical paths, giving higher performance
    Smaller size gives increased flexibility
    Reduced power and cooling requirements
    Fewer interconnections increase reliability
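
The 18-month doubling claim above compounds quickly; a minimal sketch (the Intel 4004 starting figures are used only as an illustration, not taken from the slides):

```python
# Project transistor counts under Moore's law (doubling every 18 months).

def transistors(start_count, start_year, year, doubling_months=18):
    """Estimated transistor count after (year - start_year) years."""
    months = (year - start_year) * 12
    return start_count * 2 ** (months / doubling_months)

# Roughly 24 doublings between the 4004 (1971, ~2,300 transistors) and 2007:
estimate = transistors(2_300, 1971, 2007)
print(f"{estimate:.3g}")  # on the order of 10**10
```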

  • Growth in CPU Transistor Count

  • IBM 360 Series
    1964
    Replaced (and not compatible with) the 7000 series
    First planned family of computers
    Similar or identical instruction sets
    Similar or identical O/S
    Increasing speed
    Increasing number of I/O ports (i.e. more terminals)
    Increased memory size
    Increased cost
    Multiplexed switch structure

  • DEC PDP-8
    1964
    First minicomputer (after miniskirt!)
    Did not need an air-conditioned room
    Small enough to sit on a lab bench
    $16,000 vs. $100k+ for an IBM 360
    Embedded applications & OEM
    Bus structure

  • DEC - PDP-8 Bus Structure

  • Semiconductor Memory
    1970 - Fairchild
    Size of a single core (i.e. 1 bit of magnetic core storage)
    Holds 256 bits
    Non-destructive read
    Much faster than core
    Capacity approximately doubles each year

  • Intel
    1971 - 4004
    First microprocessor
    All CPU components on a single chip
    4 bit
    Followed in 1972 by the 8008
    8 bit
    Both designed for specific applications
    1974 - 8080
    Intel's first general purpose microprocessor

  • Speeding It Up
    Pipelining
    On-board cache
    On-board L1 & L2 cache
    Branch prediction
    Data flow analysis
    Speculative execution

  • Performance Balance
    Processor speed increased
    Memory capacity increased
    Memory speed lags behind processor speed

  • Logic and Memory Performance Gap

  • Solutions
    Increase number of bits retrieved at one time
    Make DRAM "wider" rather than "deeper"
    Change DRAM interface
    Cache
    Reduce frequency of memory access
    More complex cache and cache on chip
    Increase interconnection bandwidth
    High speed buses
    Hierarchy of buses

  • I/O Devices
    Peripherals with intensive I/O demands
    Large data throughput demands
    Processors can handle this
    Problem is moving the data
    Solutions:
    Caching
    Buffering
    Higher-speed interconnection buses
    More elaborate bus structures
    Multiple-processor configurations

  • Typical I/O Device Data Rates

  • Key is Balance
    Processor components
    Main memory
    I/O devices
    Interconnection structures

  • Improvements in Chip Organization and Architecture
    Increase hardware speed of processor
    Fundamentally due to shrinking logic gate size
    More gates, packed more tightly, increasing clock rate
    Propagation time for signals reduced
    Increase size and speed of caches
    Dedicating part of processor chip to cache
    Cache access times drop significantly
    Change processor organization and architecture
    Increase effective speed of execution
    Parallelism

  • Problems with Clock Speed and Logic Density
    Power
    Power density increases with density of logic and clock speed
    Dissipating heat
    RC delay
    Speed at which electrons flow is limited by the resistance and capacitance of the metal wires connecting them
    Delay increases as the RC product increases
    Wire interconnects thinner, increasing resistance
    Wires closer together, increasing capacitance
    Memory latency
    Memory speeds lag processor speeds
    Solution:
    More emphasis on organizational and architectural approaches
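
The RC-delay argument above can be sketched with a toy wire model; the model and its constants are deliberate simplifications for illustration, not real process parameters:

```python
# Toy model of RC delay: resistance grows as wires get thinner,
# capacitance grows as wires get closer together.

def rc_delay(width, spacing, length=1.0):
    r = length / width    # thinner wire -> higher resistance
    c = length / spacing  # closer wires -> higher capacitance
    return r * c

# Scale wire width and spacing down by 2x (wire length held fixed):
print(rc_delay(0.5, 0.5) / rc_delay(1.0, 1.0))  # -> 4.0
```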

  • Intel Microprocessor Performance

  • Increased Cache Capacity
    Typically two or three levels of cache between processor and main memory
    Chip density increased
    More cache memory on chip
    Faster cache access
    Pentium chip devoted about 10% of chip area to cache
    Pentium 4 devotes about 50%

  • More Complex Execution Logic
    Enable parallel execution of instructions
    Pipeline works like an assembly line
    Different stages of execution of different instructions at the same time along the pipeline
    Superscalar allows multiple pipelines within a single processor
    Instructions that do not depend on one another can be executed in parallel
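
The assembly-line analogy above can be made quantitative: with a k-stage pipeline, n instructions finish in (k + n - 1) cycles rather than n * k, assuming one stage per cycle and no stalls (an idealization; hazards reduce this in practice):

```python
# Ideal pipeline timing, one stage per clock cycle, no stalls.

def unpipelined_cycles(n, k):
    return n * k            # each instruction takes all k stages serially

def pipelined_cycles(n, k):
    return k + n - 1        # fill the pipe once, then one result per cycle

n, k = 1000, 5
print(unpipelined_cycles(n, k) / pipelined_cycles(n, k))  # ~4.98, close to k
```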

  • Diminishing Returns
    Internal organization of processors is complex
    Can get a great deal of parallelism
    Further significant increases likely to be relatively modest
    Benefits from cache are reaching a limit
    Increasing clock rate runs into the power dissipation problem
    Some fundamental physical limits are being reached

  • New Approach - Multiple Cores
    Multiple processors on a single chip
    Large shared cache
    Within a processor, increase in performance is proportional to square root of increase in complexity
    If software can use multiple processors, doubling the number of processors almost doubles performance
    So, use two simpler processors on the chip rather than one more complex processor
    With two processors, larger caches are justified
    Power consumption of memory logic is less than that of processing logic
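
The square-root relationship above (often called Pollack's rule) can be checked numerically; the ideal-scaling assumption for two cores is the hypothetical best case, not a measured result:

```python
import math

def single_core_perf(complexity):
    # Performance grows only as the square root of core complexity.
    return math.sqrt(complexity)

one_big = single_core_perf(2.0)        # one core at double complexity: ~1.41x
two_small = 2 * single_core_perf(1.0)  # two baseline cores, ideal parallel scaling: 2x
print(one_big, two_small)
```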

  • x86 Evolution (1)
    8080
    First general purpose microprocessor
    8-bit data path
    Used in first personal computer - Altair
    8086 - 5 MHz - 29,000 transistors
    Much more powerful
    16 bit
    Instruction cache, prefetches a few instructions
    8088 (8-bit external bus) used in first IBM PC
    80286
    16 MByte memory addressable
    Up from 1 MByte
    80386
    32 bit
    Support for multitasking
    80486
    Sophisticated powerful cache and instruction pipelining
    Built-in maths co-processor

  • x86 Evolution (2)
    Pentium
    Superscalar
    Multiple instructions executed in parallel
    Pentium Pro
    Increased superscalar organization
    Aggressive register renaming
    Branch prediction
    Data flow analysis
    Speculative execution
    Pentium II
    MMX technology
    Graphics, video & audio processing
    Pentium III
    Additional floating point instructions for 3D graphics

  • x86 Evolution (3)
    Pentium 4
    Note Arabic rather than Roman numerals
    Further floating point and multimedia enhancements
    Core
    First x86 with dual core
    Core 2
    64-bit architecture
    Core 2 Quad - 3 GHz - 820 million transistors
    Four processors on chip

    x86 architecture dominant outside embedded systems
    Organization and technology changed dramatically
    Instruction set architecture evolved with backwards compatibility
    ~1 instruction per month added
    500 instructions available
    See Intel web pages for detailed information on processors

  • Embedded Systems - ARM
    ARM evolved from RISC design
    Used mainly in embedded systems
    Used within a product
    Not a general purpose computer
    Dedicated function
    E.g. anti-lock brakes in a car

  • Embedded Systems Requirements
    Different sizes
    Different constraints, optimization, reuse
    Different requirements
    Safety, reliability, real-time, flexibility, legislation
    Lifespan
    Environmental conditions
    Static v dynamic loads
    Slow to fast speeds
    Computation v I/O intensive
    Discrete event v continuous dynamics

  • Possible Organization of an Embedded System

  • ARM Evolution
    Designed by ARM Inc., Cambridge, England
    Licensed to manufacturers
    High speed, small die, low power consumption
    PDAs, hand-held games, phones
    E.g. iPod, iPhone
    Acorn produced ARM1 & ARM2 in 1985 and ARM3 in 1989
    Acorn, VLSI and Apple Computer founded ARM Ltd.

  • ARM Systems Categories
    Embedded real time
    Application platform
    Linux, Palm OS, Symbian OS, Windows Mobile
    Secure applications

  • Performance Assessment - Clock Speed
    Key parameters:
    Performance, cost, size, security, reliability, power consumption
    System clock speed
    In Hz or multiples of
    Clock rate, clock cycle, clock tick, cycle time
    Signals in the CPU take time to settle down to 1 or 0
    Signals may change at different speeds
    Operations need to be synchronised
    Instruction execution in discrete steps
    Fetch, decode, load and store, arithmetic or logical
    Usually require multiple clock cycles per instruction
    Pipelining gives simultaneous execution of instructions
    So, clock speed is not the whole story

  • System Clock

  • Instruction Execution Rate
    Millions of instructions per second (MIPS)
    Millions of floating point instructions per second (MFLOPS)
    Heavily dependent on instruction set, compiler design, processor implementation, cache & memory hierarchy
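
The MIPS rate follows from the clock rate f and the average cycles per instruction (CPI): MIPS = f / (CPI x 10^6). A minimal sketch with hypothetical numbers:

```python
# MIPS rate from clock frequency and average cycles per instruction.

def mips(clock_hz, cpi):
    return clock_hz / (cpi * 1e6)

# Hypothetical 500 MHz processor averaging 2 cycles per instruction:
print(mips(500e6, 2.0))  # -> 250.0 MIPS
```

The CPI figure itself depends on the instruction mix, which is why the slide stresses that MIPS depends on the instruction set, compiler, and memory hierarchy.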

  • Benchmarks
    Programs designed to test performance
    Written in a high level language
    Portable
    Represents a style of task
    Systems, numerical, commercial
    Easily measured
    Widely distributed
    E.g. Standard Performance Evaluation Corporation (SPEC)
    CPU2006 for computation-bound tasks
    17 floating point programs in C, C++, Fortran
    12 integer programs in C, C++
    3 million lines of code
    Speed and rate metrics
    Single task and throughput

  • SPEC Speed Metric
    Single task
    Base runtime defined for each benchmark using a reference machine
    Results reported as the ratio of reference time to system run time:
    ri = Trefi / Tsuti
    Trefi - execution time for benchmark i on the reference machine
    Tsuti - execution time for benchmark i on the system under test

    Overall performance calculated by averaging the ratios for all 12 integer benchmarks
    Use the geometric mean
    Appropriate for normalized numbers such as ratios
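
The speed metric above can be sketched directly: per-benchmark ratios ri = Trefi / Tsuti, combined with a geometric mean. The times below are made up for illustration, not real SPEC results:

```python
import math

def spec_speed(ref_times, sut_times):
    # Per-benchmark ratios, then the geometric mean of the ratios.
    ratios = [tref / tsut for tref, tsut in zip(ref_times, sut_times)]
    return math.prod(ratios) ** (1 / len(ratios))

ref = [9650.0, 9120.0, 10490.0]  # hypothetical reference-machine seconds
sut = [500.0, 424.0, 501.0]      # hypothetical system-under-test seconds
print(round(spec_speed(ref, sut), 2))
```

The geometric mean is used rather than the arithmetic mean because it treats the normalized ratios symmetrically regardless of which machine is the reference.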

  • SPEC Rate Metric
    Measures throughput or rate of a machine carrying out a number of tasks
    Multiple copies of benchmarks run simultaneously
    Typically, same as number of processors
    Rate for benchmark i: ratei = N x Trefi / Tsuti
    Trefi - reference execution time for benchmark i
    N - number of copies run simultaneously
    Tsuti - elapsed time from start of execution of the program on all N processors until completion of all copies of the program
    Again, a geometric mean is calculated
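
The rate (throughput) metric can be sketched the same way, with N copies running at once; again all numbers are illustrative:

```python
import math

def spec_rate(n_copies, ref_times, elapsed_times):
    # Per-benchmark rates N * Trefi / Tsuti, then their geometric mean.
    rates = [n_copies * tref / tsut
             for tref, tsut in zip(ref_times, elapsed_times)]
    return math.prod(rates) ** (1 / len(rates))

# Hypothetical 4-processor system running 4 copies of two benchmarks:
print(round(spec_rate(4, [9650.0, 9120.0], [800.0, 760.0]), 2))
```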

  • Amdahl's Law
    Gene Amdahl [AMDA67]
    Potential speed up of a program using multiple processors
    Concluded that:
    Code needs to be parallelizable
    Speed up is bounded, giving diminishing returns for more processors
    Task dependent
    Servers gain by maintaining multiple connections on multiple processors
    Databases can be split into parallel tasks

  • Amdahl's Law Formula
    For a program running on a single processor:
    Fraction f of code is infinitely parallelizable with no scheduling overhead
    Fraction (1 - f) of code is inherently serial
    T is total execution time for the program on a single processor
    N is the number of processors that fully exploit the parallel portions of code
    Speedup = 1 / ((1 - f) + f/N)
    Conclusions:
    f small - parallel processors have little effect
    N -> infinity - speedup bounded by 1/(1 - f)
    Diminishing returns for using more processors
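
The bound and the diminishing returns are easy to see numerically; f = 0.9 below is an arbitrary example value:

```python
# Amdahl's law: speedup = 1 / ((1 - f) + f / N).

def amdahl_speedup(f, n):
    """f: parallelizable fraction of the code, n: number of processors."""
    return 1.0 / ((1.0 - f) + f / n)

# With f = 0.9, even infinitely many processors cannot beat 1/(1 - f) = 10x.
for n in (2, 8, 64, 1_000_000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```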

  • Internet Resources
    http://www.intel.com/ - search for the Intel Museum
    http://www.ibm.com
    http://www.dec.com
    Charles Babbage Institute
    PowerPC
    Intel Developer Home

  • References
    [AMDA67] Amdahl, G. "Validity of the Single-Processor Approach to Achieving Large-Scale Computing Capability." Proceedings of the AFIPS Conference, 1967.