Computer Evolution and Performance

Upload: gerald-manog

Post on 04-Nov-2015

51 views

Category:

Documents


2 download

DESCRIPTION

Computer Organization and Architecture by William Stallings

TRANSCRIPT

  • William Stallings
    Computer Organization and Architecture, 8th Edition
    Chapter 2: Computer Evolution and Performance

  • ENIAC - background
    Electronic Numerical Integrator And Computer
    Eckert and Mauchly, University of Pennsylvania
    Trajectory tables for weapons
    Started 1943, finished 1946
    Too late for the war effort
    Used until 1955

  • ENIAC - details
    Decimal (not binary)
    20 accumulators of 10 digits
    Programmed manually by switches
    18,000 vacuum tubes
    30 tons
    15,000 square feet
    140 kW power consumption
    5,000 additions per second

  • von Neumann/Turing
    Stored program concept
    Main memory storing programs and data
    ALU operating on binary data
    Control unit interpreting instructions from memory and executing them
    Input and output equipment operated by control unit
    Princeton Institute for Advanced Studies (IAS) machine
    Completed 1952

  • Structure of von Neumann machine

  • IAS - details
    1,000 x 40-bit words
    Binary numbers
    2 x 20-bit instructions per word
    Set of registers (storage in CPU):
    Memory Buffer Register
    Memory Address Register
    Instruction Register
    Instruction Buffer Register
    Program Counter
    Accumulator
    Multiplier Quotient

  • Structure of IAS detail

  • Commercial Computers
    1947 - Eckert-Mauchly Computer Corporation
    UNIVAC I (Universal Automatic Computer)
    US Bureau of Census 1950 calculations
    Became part of Sperry-Rand Corporation
    Late 1950s - UNIVAC II
    Faster, more memory

  • IBM
    Punched-card processing equipment
    1953 - the 701, IBM's first stored-program computer
    Scientific calculations
    1955 - the 702
    Business applications
    Led to the 700/7000 series

  • Transistors
    Replaced vacuum tubes
    Smaller, cheaper, less heat dissipation
    Solid-state device
    Made from silicon (sand)
    Invented 1947 at Bell Labs
    William Shockley et al.

  • Transistor-Based Computers
    Second generation machines
    NCR & RCA produced small transistor machines
    IBM 7000 series
    DEC (founded 1957) produced the PDP-1

  • Microelectronics
    Literally - "small electronics"
    A computer is made up of gates, memory cells and interconnections
    These can be manufactured on a semiconductor, e.g. a silicon wafer

  • Generations of Computers
    Vacuum tube - 1946-1957
    Transistor - 1958-1964
    Small scale integration - 1965 on
    Up to 100 devices on a chip
    Medium scale integration - to 1971
    100-3,000 devices on a chip
    Large scale integration - 1971-1977
    3,000-100,000 devices on a chip
    Very large scale integration - 1978-1991
    100,000-100,000,000 devices on a chip
    Ultra large scale integration - 1991 on
    Over 100,000,000 devices on a chip

  • Moore's Law
    Increased density of components on chip
    Gordon Moore - co-founder of Intel
    Number of transistors on a chip will double every year
    Since the 1970s development has slowed a little
    Number of transistors doubles every 18 months
    Cost of a chip has remained almost unchanged
    Higher packing density means shorter electrical paths, giving higher performance
    Smaller size gives increased flexibility
    Reduced power and cooling requirements
    Fewer interconnections increase reliability
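
The 18-month doubling claim above compounds quickly; a minimal sketch (the Intel 4004 starting figures are used only as an illustration, not taken from the slides):

```python
# Project transistor counts under Moore's law (doubling every 18 months).

def transistors(start_count, start_year, year, doubling_months=18):
    """Estimated transistor count after (year - start_year) years."""
    months = (year - start_year) * 12
    return start_count * 2 ** (months / doubling_months)

# Roughly 24 doublings between the 4004 (1971, ~2,300 transistors) and 2007:
estimate = transistors(2_300, 1971, 2007)
print(f"{estimate:.3g}")  # on the order of 10**10
```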

  • Growth in CPU Transistor Count

  • IBM 360 Series
    1964
    Replaced (and not compatible with) the 7000 series
    First planned family of computers
    Similar or identical instruction sets
    Similar or identical O/S
    Increasing speed
    Increasing number of I/O ports (i.e. more terminals)
    Increased memory size
    Increased cost
    Multiplexed switch structure

  • DEC PDP-8
    1964
    First minicomputer (after miniskirt!)
    Did not need an air-conditioned room
    Small enough to sit on a lab bench
    $16,000 vs. $100k+ for an IBM 360
    Embedded applications & OEM
    Bus structure

  • DEC - PDP-8 Bus Structure

  • Semiconductor Memory
    1970 - Fairchild
    Size of a single core (i.e. 1 bit of magnetic core storage)
    Holds 256 bits
    Non-destructive read
    Much faster than core
    Capacity approximately doubles each year

  • Intel
    1971 - 4004
    First microprocessor
    All CPU components on a single chip
    4 bit
    Followed in 1972 by the 8008
    8 bit
    Both designed for specific applications
    1974 - 8080
    Intel's first general purpose microprocessor

  • Speeding It Up
    Pipelining
    On-board cache
    On-board L1 & L2 cache
    Branch prediction
    Data flow analysis
    Speculative execution

  • Performance Balance
    Processor speed increased
    Memory capacity increased
    Memory speed lags behind processor speed

  • Logic and Memory Performance Gap

  • Solutions
    Increase number of bits retrieved at one time
    Make DRAM "wider" rather than "deeper"
    Change DRAM interface
    Cache
    Reduce frequency of memory access
    More complex cache and cache on chip
    Increase interconnection bandwidth
    High speed buses
    Hierarchy of buses

  • I/O Devices
    Peripherals with intensive I/O demands
    Large data throughput demands
    Processors can handle this
    Problem is moving the data
    Solutions:
    Caching
    Buffering
    Higher-speed interconnection buses
    More elaborate bus structures
    Multiple-processor configurations

  • Typical I/O Device Data Rates

  • Key is Balance
    Processor components
    Main memory
    I/O devices
    Interconnection structures

  • Improvements in Chip Organization and Architecture
    Increase hardware speed of processor
    Fundamentally due to shrinking logic gate size
    More gates, packed more tightly, increasing clock rate
    Propagation time for signals reduced
    Increase size and speed of caches
    Dedicating part of processor chip to cache
    Cache access times drop significantly
    Change processor organization and architecture
    Increase effective speed of execution
    Parallelism

  • Problems with Clock Speed and Logic Density
    Power
    Power density increases with density of logic and clock speed
    Dissipating heat
    RC delay
    Speed at which electrons flow is limited by the resistance and capacitance of the metal wires connecting them
    Delay increases as the RC product increases
    Wire interconnects thinner, increasing resistance
    Wires closer together, increasing capacitance
    Memory latency
    Memory speeds lag processor speeds
    Solution:
    More emphasis on organizational and architectural approaches
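
The RC-delay argument above can be sketched with a toy wire model; the model and its constants are deliberate simplifications for illustration, not real process parameters:

```python
# Toy model of RC delay: resistance grows as wires get thinner,
# capacitance grows as wires get closer together.

def rc_delay(width, spacing, length=1.0):
    r = length / width    # thinner wire -> higher resistance
    c = length / spacing  # closer wires -> higher capacitance
    return r * c

# Scale wire width and spacing down by 2x (wire length held fixed):
print(rc_delay(0.5, 0.5) / rc_delay(1.0, 1.0))  # -> 4.0
```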

  • Intel Microprocessor Performance

  • Increased Cache Capacity
    Typically two or three levels of cache between processor and main memory
    Chip density increased
    More cache memory on chip
    Faster cache access
    Pentium chip devoted about 10% of chip area to cache
    Pentium 4 devotes about 50%

  • More Complex Execution Logic
    Enable parallel execution of instructions
    Pipeline works like an assembly line
    Different stages of execution of different instructions at the same time along the pipeline
    Superscalar allows multiple pipelines within a single processor
    Instructions that do not depend on one another can be executed in parallel
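
The assembly-line analogy above can be made quantitative: with a k-stage pipeline, n instructions finish in (k + n - 1) cycles rather than n * k, assuming one stage per cycle and no stalls (an idealization; hazards reduce this in practice):

```python
# Ideal pipeline timing, one stage per clock cycle, no stalls.

def unpipelined_cycles(n, k):
    return n * k            # each instruction takes all k stages serially

def pipelined_cycles(n, k):
    return k + n - 1        # fill the pipe once, then one result per cycle

n, k = 1000, 5
print(unpipelined_cycles(n, k) / pipelined_cycles(n, k))  # ~4.98, close to k
```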

  • Diminishing Returns
    Internal organization of processors is complex
    Can get a great deal of parallelism
    Further significant increases likely to be relatively modest
    Benefits from cache are reaching a limit
    Increasing clock rate runs into the power dissipation problem
    Some fundamental physical limits are being reached

  • New Approach - Multiple Cores
    Multiple processors on a single chip
    Large shared cache
    Within a processor, increase in performance is proportional to square root of increase in complexity
    If software can use multiple processors, doubling the number of processors almost doubles performance
    So, use two simpler processors on the chip rather than one more complex processor
    With two processors, larger caches are justified
    Power consumption of memory logic is less than that of processing logic
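
The square-root relationship above (often called Pollack's rule) can be checked numerically; the ideal-scaling assumption for two cores is the hypothetical best case, not a measured result:

```python
import math

def single_core_perf(complexity):
    # Performance grows only as the square root of core complexity.
    return math.sqrt(complexity)

one_big = single_core_perf(2.0)        # one core at double complexity: ~1.41x
two_small = 2 * single_core_perf(1.0)  # two baseline cores, ideal parallel scaling: 2x
print(one_big, two_small)
```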

  • x86 Evolution (1)
    8080
    First general purpose microprocessor
    8-bit data path
    Used in first personal computer - Altair
    8086 - 5 MHz - 29,000 transistors
    Much more powerful
    16 bit
    Instruction cache, prefetches a few instructions
    8088 (8-bit external bus) used in first IBM PC
    80286
    16 MByte memory addressable
    Up from 1 MByte
    80386
    32 bit
    Support for multitasking
    80486
    Sophisticated powerful cache and instruction pipelining
    Built-in maths co-processor

  • x86 Evolution (2)
    Pentium
    Superscalar
    Multiple instructions executed in parallel
    Pentium Pro
    Increased superscalar organization
    Aggressive register renaming
    Branch prediction
    Data flow analysis
    Speculative execution
    Pentium II
    MMX technology
    Graphics, video & audio processing
    Pentium III
    Additional floating point instructions for 3D graphics

  • x86 Evolution (3)
    Pentium 4
    Note Arabic rather than Roman numerals
    Further floating point and multimedia enhancements
    Core
    First x86 with dual core
    Core 2
    64-bit architecture
    Core 2 Quad - 3 GHz - 820 million transistors
    Four processors on chip

    x86 architecture dominant outside embedded systems
    Organization and technology changed dramatically
    Instruction set architecture evolved with backwards compatibility
    ~1 instruction per month added
    500 instructions available
    See Intel web pages for detailed information on processors

  • Embedded Systems - ARM
    ARM evolved from RISC design
    Used mainly in embedded systems
    Used within a product
    Not a general purpose computer
    Dedicated function
    E.g. anti-lock brakes in a car

  • Embedded Systems Requirements
    Different sizes
    Different constraints, optimization, reuse
    Different requirements
    Safety, reliability, real-time, flexibility, legislation
    Lifespan
    Environmental conditions
    Static v dynamic loads
    Slow to fast speeds
    Computation v I/O intensive
    Discrete event v continuous dynamics

  • Possible Organization of an Embedded System

  • ARM Evolution
    Designed by ARM Inc., Cambridge, England
    Licensed to manufacturers
    High speed, small die, low power consumption
    PDAs, hand-held games, phones
    E.g. iPod, iPhone
    Acorn produced ARM1 & ARM2 in 1985 and ARM3 in 1989
    Acorn, VLSI and Apple Computer founded ARM Ltd.

  • ARM Systems Categories
    Embedded real time
    Application platform
    Linux, Palm OS, Symbian OS, Windows Mobile
    Secure applications

  • Performance Assessment - Clock Speed
    Key parameters:
    Performance, cost, size, security, reliability, power consumption
    System clock speed
    In Hz or multiples of
    Clock rate, clock cycle, clock tick, cycle time
    Signals in the CPU take time to settle down to 1 or 0
    Signals may change at different speeds
    Operations need to be synchronised
    Instruction execution in discrete steps
    Fetch, decode, load and store, arithmetic or logical
    Usually require multiple clock cycles per instruction
    Pipelining gives simultaneous execution of instructions
    So, clock speed is not the whole story

  • System Clock

  • Instruction Execution Rate
    Millions of instructions per second (MIPS)
    Millions of floating point instructions per second (MFLOPS)
    Heavily dependent on instruction set, compiler design, processor implementation, cache & memory hierarchy
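
The MIPS rate follows from the clock rate f and the average cycles per instruction (CPI): MIPS = f / (CPI x 10^6). A minimal sketch with hypothetical numbers:

```python
# MIPS rate from clock frequency and average cycles per instruction.

def mips(clock_hz, cpi):
    return clock_hz / (cpi * 1e6)

# Hypothetical 500 MHz processor averaging 2 cycles per instruction:
print(mips(500e6, 2.0))  # -> 250.0 MIPS
```

The CPI figure itself depends on the instruction mix, which is why the slide stresses that MIPS depends on the instruction set, compiler, and memory hierarchy.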

  • Benchmarks
    Programs designed to test performance
    Written in a high level language
    Portable
    Represents a style of task
    Systems, numerical, commercial
    Easily measured
    Widely distributed
    E.g. Standard Performance Evaluation Corporation (SPEC)
    CPU2006 for computation-bound tasks
    17 floating point programs in C, C++, Fortran
    12 integer programs in C, C++
    3 million lines of code
    Speed and rate metrics
    Single task and throughput

  • SPEC Speed Metric
    Single task
    Base runtime defined for each benchmark using a reference machine
    Results reported as the ratio of reference time to system run time:
    ri = Trefi / Tsuti
    Trefi - execution time for benchmark i on the reference machine
    Tsuti - execution time for benchmark i on the system under test

    Overall performance calculated by averaging the ratios for all 12 integer benchmarks
    Use the geometric mean
    Appropriate for normalized numbers such as ratios
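
The speed metric above can be sketched directly: per-benchmark ratios ri = Trefi / Tsuti, combined with a geometric mean. The times below are made up for illustration, not real SPEC results:

```python
import math

def spec_speed(ref_times, sut_times):
    # Per-benchmark ratios, then the geometric mean of the ratios.
    ratios = [tref / tsut for tref, tsut in zip(ref_times, sut_times)]
    return math.prod(ratios) ** (1 / len(ratios))

ref = [9650.0, 9120.0, 10490.0]  # hypothetical reference-machine seconds
sut = [500.0, 424.0, 501.0]      # hypothetical system-under-test seconds
print(round(spec_speed(ref, sut), 2))
```

The geometric mean is used rather than the arithmetic mean because it treats the normalized ratios symmetrically regardless of which machine is the reference.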

  • SPEC Rate Metric
    Measures throughput or rate of a machine carrying out a number of tasks
    Multiple copies of benchmarks run simultaneously
    Typically, same as number of processors
    Rate for benchmark i: ratei = N x Trefi / Tsuti
    Trefi - reference execution time for benchmark i
    N - number of copies run simultaneously
    Tsuti - elapsed time from start of execution of the program on all N processors until completion of all copies of the program
    Again, a geometric mean is calculated
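
The rate (throughput) metric can be sketched the same way, with N copies running at once; again all numbers are illustrative:

```python
import math

def spec_rate(n_copies, ref_times, elapsed_times):
    # Per-benchmark rates N * Trefi / Tsuti, then their geometric mean.
    rates = [n_copies * tref / tsut
             for tref, tsut in zip(ref_times, elapsed_times)]
    return math.prod(rates) ** (1 / len(rates))

# Hypothetical 4-processor system running 4 copies of two benchmarks:
print(round(spec_rate(4, [9650.0, 9120.0], [800.0, 760.0]), 2))
```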

  • Amdahl's Law
    Gene Amdahl [AMDA67]
    Potential speed up of a program using multiple processors
    Concluded that:
    Code needs to be parallelizable
    Speed up is bounded, giving diminishing returns for more processors
    Task dependent
    Servers gain by maintaining multiple connections on multiple processors
    Databases can be split into parallel tasks

  • Amdahl's Law Formula
    For a program running on a single processor:
    Fraction f of code is infinitely parallelizable with no scheduling overhead
    Fraction (1 - f) of code is inherently serial
    T is total execution time for the program on a single processor
    N is the number of processors that fully exploit the parallel portions of code
    Speedup = 1 / ((1 - f) + f/N)
    Conclusions:
    f small - parallel processors have little effect
    N -> infinity - speedup bounded by 1/(1 - f)
    Diminishing returns for using more processors
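
The bound and the diminishing returns are easy to see numerically; f = 0.9 below is an arbitrary example value:

```python
# Amdahl's law: speedup = 1 / ((1 - f) + f / N).

def amdahl_speedup(f, n):
    """f: parallelizable fraction of the code, n: number of processors."""
    return 1.0 / ((1.0 - f) + f / n)

# With f = 0.9, even infinitely many processors cannot beat 1/(1 - f) = 10x.
for n in (2, 8, 64, 1_000_000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```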

  • Internet Resources
    http://www.intel.com/ - search for the Intel Museum
    http://www.ibm.com
    http://www.dec.com
    Charles Babbage Institute
    PowerPC
    Intel Developer Home

  • References
    [AMDA67] Amdahl, G. "Validity of the Single-Processor Approach to Achieving Large-Scale Computing Capability." Proceedings of the AFIPS Conference, 1967.