lec jan12 2009


Introduction
12th January, 2009

CSL718 : Architecture of High Performance Systems

Anshul Kumar, CSE IITD slide 2

Some basic questions

• What is high performance?
– Rate of computation
– Time to compute

• Who needs high performance systems?
– Weather prediction, complex design, scientific computation etc.
– Everyone needs it.

• How do you achieve high performance?
– Technology
– Circuit / logic design
– Architecture

• How to analyse or evaluate performance?
– Theoretical models
– Simulation
– Experimentation

Execution Time and Clock Period

Program exec time = Tprog = N * Tinst = N * CPI * Δt

N : Number of instructions
CPI : Cycles per instruction (Av)
Δt : Clock cycle time

[Figure: instruction stages IF, D, RF, EX/AG, M, WB, with the clock period Δt marked]

Instruction execution time = Tinst = CPI * Δt
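The relation Tprog = N * CPI * Δt can be checked numerically. The sketch below uses hypothetical values (one million instructions, average CPI of 1.5, 1 ns clock); the function names are illustrative and do not come from the lecture.

```python
def instruction_time(cpi, clock_period_s):
    """Average time per instruction: Tinst = CPI * Δt."""
    return cpi * clock_period_s


def program_time(n_instructions, cpi, clock_period_s):
    """Program execution time: Tprog = N * Tinst = N * CPI * Δt."""
    return n_instructions * instruction_time(cpi, clock_period_s)


# Hypothetical workload, chosen only for illustration.
N = 1_000_000      # number of instructions
CPI = 1.5          # average cycles per instruction
DT = 1e-9          # clock cycle time in seconds (1 ns)

t_inst = instruction_time(CPI, DT)
t_prog = program_time(N, CPI, DT)
print(f"Tinst = {t_inst:.2e} s, Tprog = {t_prog:.2e} s")
```

With these assumed numbers the program takes about 1.5 ms; halving any one of N, CPI or Δt halves Tprog.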

What influences clock period?

Tprog = N * CPI * Δt

Technology - Δt ⇓

Software - N ⇓

Architecture - N * CPI * Δt ⇓
– Instruction set architecture (ISA): trade-off N vs CPI * Δt
– Micro architecture (μA): trade-off CPI vs Δt
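The micro-architecture trade-off between CPI and Δt can be illustrated with two made-up designs of the same ISA (so N is identical). The numbers below are assumptions chosen for the example: a deeper pipeline shortens Δt but raises the average CPI.

```python
from dataclasses import dataclass


@dataclass
class Design:
    """A hypothetical processor design running a fixed program."""
    name: str
    n_instructions: int
    cpi: float
    clock_period_ns: float

    def exec_time_ns(self):
        # Tprog = N * CPI * Δt
        return self.n_instructions * self.cpi * self.clock_period_ns


# Same ISA and program, so N is the same; only CPI and Δt differ.
shallow = Design("shallow pipeline", 1_000_000, cpi=1.2, clock_period_ns=2.0)
deep = Design("deep pipeline", 1_000_000, cpi=1.6, clock_period_ns=1.25)

speedup = shallow.exec_time_ns() / deep.exec_time_ns()
for d in (shallow, deep):
    print(f"{d.name}: {d.exec_time_ns() / 1e6:.2f} ms")
print(f"speedup of deep over shallow: {speedup:.2f}")
```

Here the shorter clock period outweighs the higher CPI; with a CPI of 2.0 for the deep design the two would break even.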

Relative performance per unit cost

Year   Technology           Perf/cost
1951   Vacuum tube          1
1965   Transistor           35
1975   Integrated circuit   900
1995   VLSI                 2,400,000

Increase in workstation performance

[Figure: performance (0 to 1200) plotted against year (1987 to 1997) for SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500 and DEC Alpha 21264/600]

Growth in DRAM Capacity

[Figure: Kbit capacity (10 to 100,000) plotted against year of introduction (1976 to 1996) for DRAM generations 16K, 64K, 256K, 1M, 4M, 16M and 64M]

CPU-Memory Performance Gap

• Semiconductor (CPU speed, random access)
– Registers
– SRAM
– DRAM
– FLASH

• Magnetic (slow)
– FDD
– HDD

• Optical (random + sequential, very slow)
– CD
– DVD

Memory Hierarchy Principle

[Figure: the CPU accesses a chain of memory levels; a memory access at a level is either a hit or a miss]

Level             Speed     Size       Cost / bit
Nearest to CPU    Fastest   Smallest   Highest
Farthest from CPU Slowest   Biggest    Lowest

Temporal Locality
– References repeated in time

Spatial Locality
– References repeated in space
– Special case: Sequential Locality
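A small simulation shows why locality makes a memory hierarchy pay off. This is a minimal sketch assuming a direct-mapped cache with 8 lines of 4 words each; the cache organisation and the three access patterns are illustrative choices, not taken from the slides.

```python
def hit_ratio(addresses, num_lines=8, block_size=4):
    """Simulate a direct-mapped cache and return the fraction of hits."""
    lines = [None] * num_lines           # block number held by each cache line
    hits = 0
    for addr in addresses:
        block = addr // block_size       # which memory block the word is in
        index = block % num_lines        # which cache line that block maps to
        if lines[index] == block:
            hits += 1                    # hit: block already cached
        else:
            lines[index] = block         # miss: fetch block, replace old one
    return hits / len(addresses)


sequential = list(range(64))                 # sequential (spatial) locality
loop = [0, 1, 2, 3] * 16                     # temporal locality
scattered = [i * 32 for i in range(64)]      # no reuse, all map to line 0

print("sequential:", hit_ratio(sequential))
print("loop:      ", hit_ratio(loop))
print("scattered: ", hit_ratio(scattered))
```

The sequential pattern misses once per block and hits on the remaining words, the loop misses only on its first access, and the scattered pattern never hits.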

Parallelism : Flynn’s Classification

Architecture Categories

SISD SIMD MISD MIMD

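SISD

[Figure: one control unit (C) sends a single instruction stream (IS) to one processor (P), which exchanges a single data stream (DS) with memory (M)]

SIMD

[Figure: one control unit (C) sends a single instruction stream (IS) to several processors (P), each exchanging its own data stream (DS) with memory (M)]

MISD

[Figure: several control units (C) send separate instruction streams (IS) to several processors (P), with data streams (DS) to and from memory (M)]

MIMD

[Figure: several control units (C) send separate instruction streams (IS) to several processors (P), each exchanging its own data stream (DS) with memory (M)]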
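Flynn's four categories follow directly from whether there is one or more than one instruction stream and data stream. The helper below is a hypothetical illustration of that rule; the function name and the example stream counts are assumptions.

```python
def flynn_category(instruction_streams, data_streams):
    """Return SISD, SIMD, MISD or MIMD from the stream counts."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    i = "S" if instruction_streams == 1 else "M"   # Single / Multiple instruction
    d = "S" if data_streams == 1 else "M"          # Single / Multiple data
    return f"{i}I{d}D"


# Illustrative stream counts only.
for i_streams, d_streams in [(1, 1), (1, 64), (4, 1), (4, 4)]:
    print(i_streams, d_streams, flynn_category(i_streams, d_streams))
```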

Feng’s Classification

[Figure: machines plotted by word length (1 to 64) against bit slice length (1 to 16K): MPP, STARAN, C.mmP, PDP11, PEPE, IBM370, IlliacIV, CRAY-1]

Händler’s Classification

< K x K’ , D x D’ , W x W’ >
K : control, D : data, W : word
dash (’) → degree of pipelining

TI - ASC   <1, 4, 64 x 8>
CDC 6600   <1, 1 x 10, 60> x <10, 1, 12> (I/O)
C.mmP      <16,1,16> + <1x16,1,16> + <1,16,16>
PEPE       <1 x 3, 288, 32>
Cray-1     <1, 12 x 8, 64 x (1 ~ 14)>

Modern Classification

Parallel architectures
– Data-parallel architectures
– Function-parallel architectures

Data Parallel Architectures

• SIMD Processors
– Multiple processing elements driven by a single instruction stream

• Vector Processors
– Uni-processors with vector instructions

• Associative Processors
– SIMD like processors with associative memory

• Systolic Arrays
– Application specific VLSI structures

Function Parallel Architectures

Function-parallel architectures
• Instr level Parallel Arch (ILPs)
– Pipelined processors
– VLIWs
– Superscalar processors
• Thread level Parallel Arch
• Process level Parallel Arch (MIMDs)
– Distributed Memory MIMD
– Shared Memory MIMD

Pipelining

[Figure: instruction stages IF, D, RF, EX/AG, M, WB]

• Higher throughput with pipelining

Simple multicycle design:
• Resource sharing across cycles
• Not all instructions take the same number of cycles
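The throughput gain from pipelining can be estimated with the standard ideal-pipeline cycle count. The sketch below assumes every instruction passes through all six stages shown, one cycle per stage, with no stalls; these are simplifying assumptions for illustration.

```python
STAGES = ["IF", "D", "RF", "EX/AG", "M", "WB"]


def multicycle_cycles(n_instructions, n_stages):
    """Unpipelined: each instruction occupies the datapath for all its stages."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """Ideal pipeline: fill time for the first instruction, then one per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


n = 100                     # hypothetical instruction count
k = len(STAGES)
speedup = multicycle_cycles(n, k) / pipelined_cycles(n, k)
print(f"multicycle: {multicycle_cycles(n, k)} cycles")
print(f"pipelined:  {pipelined_cycles(n, k)} cycles")
print(f"speedup:    {speedup:.2f}")
```

As the instruction count grows the speedup approaches the number of stages, which is the upper bound for an ideal pipeline.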

Limits of Pipelining

• Structural hazards
– Resource conflicts - two instructions require the same resource in the same cycle

• Data hazards
– Data dependencies - one instruction needs data which is yet to be produced by another instruction

• Control hazards
– Decision about next instruction needs more cycles
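A data hazard of the read-after-write kind can be detected by comparing the destination register of one instruction with the source registers of those that follow closely. This is a minimal sketch: the instruction format, register names and the window of two instructions are assumptions, and it ignores the case where an intervening instruction overwrites the register.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                   # assembly text, for display only
    dest: Optional[str]         # register written, if any
    srcs: Tuple[str, ...]       # registers read


def raw_hazards(program, window=2):
    """List (producer index, consumer index, register) for each read-after-write
    dependence where the consumer follows within `window` instructions."""
    hazards = []
    for j, consumer in enumerate(program):
        for i in range(max(0, j - window), j):
            producer = program[i]
            if producer.dest is not None and producer.dest in consumer.srcs:
                hazards.append((i, j, producer.dest))
    return hazards


# Hypothetical instruction sequence.
program = [
    Instr("add r1, r2, r3", "r1", ("r2", "r3")),
    Instr("sub r4, r1, r5", "r4", ("r1", "r5")),
    Instr("and r6, r7, r8", "r6", ("r7", "r8")),
    Instr("or  r9, r4, r1", "r9", ("r4", "r1")),
]

for i, j, reg in raw_hazards(program):
    print(f"instruction {j} needs {reg} produced by instruction {i}")
```

The dependence of the last instruction on r1 is not reported because the producer is three instructions earlier, outside the assumed window.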

ILP in VLIW processors

[Figure: cache/memory feeds a fetch unit, which delivers a single multi-operation instruction to several functional units (FU) sharing a register file]

ILP in Superscalar processors

[Figure: cache/memory feeds a fetch unit with a sequential stream of instructions; a decode and issue unit sends multiple instructions to several functional units (FU) sharing a register file. Legend: instruction/control paths, data paths, FU = Functional Unit]

Superscalar and VLIW processors

Issues in ILP Architectures

[Figure: several functional units (FU) sharing a register file]

• Scalability with increase in number of register ports
• ILP detection – special compilers / special hardware
• Code compatibility
• Code density, instruction encoding
• Maintaining consistency

ILP and Multithreading

[Figure comparing ILP, Coarse MT, Fine MT and SMT; source: Hennessy and Patterson]

Why Process level Parallel Architectures?

Data-parallel architectures

Function-parallel architectures
• Instruction level PAs
• Thread level PAs
• Process level PAs (MIMDs): built using general purpose processors
– Distributed Memory MIMD
– Shared Memory MIMD

Issues from user’s perspective

• Specification / Program design
– explicit parallelism or
– implicit parallelism + parallelizing compiler

• Partitioning / mapping to processors

• Scheduling / mapping to time instants
– static or dynamic

• Communication and Synchronization
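The difference between static and dynamic scheduling can be seen on a small task set. The sketch below is illustrative only: it assumes independent tasks with known run times, a round-robin static mapping, and a dynamic policy in which each task goes to whichever processor becomes free first.

```python
import heapq


def static_makespan(task_times, n_procs):
    """Static mapping: task i is fixed to processor i mod n_procs in advance."""
    loads = [0] * n_procs
    for i, t in enumerate(task_times):
        loads[i % n_procs] += t
    return max(loads)


def dynamic_makespan(task_times, n_procs):
    """Dynamic mapping: each task, in order, runs on the first free processor."""
    free_at = [0] * n_procs              # min-heap of times processors become free
    heapq.heapify(free_at)
    for t in task_times:
        start = heapq.heappop(free_at)   # earliest available processor
        heapq.heappush(free_at, start + t)
    return max(free_at)


tasks = [8, 1, 7, 1, 6, 1]               # hypothetical task run times
print("static :", static_makespan(tasks, 2))
print("dynamic:", dynamic_makespan(tasks, 2))
```

With this task set the static mapping puts all long tasks on one processor, while the dynamic policy balances the load at run time.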

Parallel programming models

Concurrent control flow

Functional or logic program

Vector/array operations

Concurrent tasks/processes/threads/objects

With shared variables or message passing

Relationship between programming model and architecture ?
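The two communication styles for concurrent tasks, shared variables and message passing, can be contrasted on the same computation. This sketch uses Python's standard `threading` and `queue` modules; the parallel-sum task and the function names are assumptions made for illustration.

```python
import queue
import threading


def sum_shared(chunks):
    """Shared-variable model: workers update one total, guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:                       # synchronise access to the shared variable
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def sum_message_passing(chunks):
    """Message-passing model: workers send partial sums; nothing is shared."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))          # send a message with the partial result

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in chunks)


chunks = [range(0, 25), range(25, 50), range(50, 75), range(75, 100)]
print(sum_shared(chunks), sum_message_passing(chunks))
```

Both versions compute the same result; they differ in whether synchronisation is expressed through a lock on shared state or through the exchange of messages.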

Issues from architect’s perspective

• Coherence problem in shared memory with caches

• Efficient interconnection networks

Shared Memory Multiprocessor

[Figure: processors (P) and memory modules (M) connected through an interconnection network; in a second arrangement, groups of processors and memories with their own interconnection networks are joined through a global interconnection network to further memory modules]

Cache Coherence Problem

Multiple copies of data may exist ⇒ Problem of cache coherence

Options for coherence protocols

• What action is taken?
– Invalidate or Update

• Which processors/caches communicate?
– Snoopy (broadcast) or directory based

• Status of each block?
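One combination of these options, invalidate on write with snoopy broadcast, can be sketched in a few lines. This is a simplified model under stated assumptions: each block is either present or absent in a cache (no further status bits), memory is written through on every write, and the class names are hypothetical.

```python
class Bus:
    """Shared bus that all caches snoop; also holds main memory."""

    def __init__(self):
        self.memory = {}
        self.caches = []

    def broadcast_invalidate(self, addr, source):
        # Every cache except the writer drops its copy of the block.
        for cache in self.caches:
            if cache is not source:
                cache.lines.pop(addr, None)


class Cache:
    def __init__(self, bus):
        self.lines = {}                  # addr -> cached value
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:       # miss: fetch from memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(addr, self)
        self.lines[addr] = value
        self.bus.memory[addr] = value    # write-through (simplifying assumption)


bus = Bus()
bus.memory[0x10] = 1
c0, c1 = Cache(bus), Cache(bus)
c0.read(0x10)
c1.read(0x10)                            # both caches now hold a copy
c0.write(0x10, 2)                        # c1's copy is invalidated
print(0x10 in c1.lines, c1.read(0x10))
```

After the write, the second cache no longer holds the stale value and its next read fetches the updated one. An update protocol would instead push the new value to the other caches.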

Interconnection Networks

• Architectural Variations:
– Topology
– Direct or Indirect (through switches)
– Static (fixed connections) or Dynamic (connections established as required)
– Routing type (store and forward / wormhole)

• Efficiency:
– Delay
– Bandwidth
– Cost

Quest for Performance

1946 ENIAC ($0.5 M, 18K VTs, 150 kW)
– add/sub 5000 per sec
– mult 385 per sec
– div 40 per sec
– sqrt 3 per sec

1962 Atlas (Pipelined, Int + FPU)
– 200K FLOPs

1962 Burroughs D825 (4 CPUs 16 Mem)

1964 CDC 6600 (first supercomputer)
– multiple FUs, dynamic scheduling

1972 ILLIAC-IV (64 PEs, 4 MFLOPs each)

Fastest Supercomputer (ref www.top500.org)

• IBM’s Blue Gene/L at Lawrence Livermore Lab topped the list in June 2006 with 280.6 teraflops

• Japan’s Earth Simulator, introduced in 2002, was the fastest with 35.8 teraflops until Blue Gene took over in 2004.

• Japan’s proposal (2005) to build a supercomputer 73 times faster than the then current best. Target: 10 petaflops, budget $800 - $900 million, date 2011.

• Tata Sons’ EKA entered at the 4th spot in 2007 with 132.8 teraflops

• Energy efficiency (max 488 Mflops/watt) also listed in June 2008

June 2008 list

Rank 1
Computer: Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband, IBM (1026 teraflops)
Site: DOE/NNSA/LANL, United States

Rank 2
Computer: BlueGene/L - eServer Blue Gene Solution, IBM
Site: DOE/NNSA/LLNL, United States

Rank 3
Computer: Blue Gene/P Solution, IBM
Site: Argonne National Laboratory, United States

Rank 4
Computer: Ranger - SunBlade x6420, Opteron Quad 2 GHz, Infiniband, Sun Microsystems
Site: Texas Advanced Computing Center/Univ. of Texas, United States

Rank 5
Computer: Jaguar - Cray XT4 QuadCore 2.1 GHz, Cray Inc.
Site: DOE/Oak Ridge National Laboratory, United States

Rank 6
Computer: JUGENE - Blue Gene/P Solution, IBM
Site: Forschungszentrum Juelich (FZJ)

Rank 7
Computer: Encanto - SGI Altix ICE 8200, Xeon quad core 3.0 GHz, SGI
Site: New Mexico Computing Applications Center (NMCAC), United States

Rank 8
Computer: EKA - Cluster Platform 3000 BL460c, Xeon 53xx 3 GHz, Infiniband, HP (133 teraflops)
Site: Computational Research Laboratories, TATA SONS, India

Blue Gene Supercomputer

• 32 x 32 x 64 3D torus (65,536 nodes)
• Global reduction tree - max/sum in a few μs
• Fast synch across entire machine within a few μs
• 1,024 Gbps links to a global parallel file system
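The 32 x 32 x 64 torus figures can be checked with a short calculation. The sketch below assumes each node links to its two neighbours along every dimension with wrap-around, and that a message takes the shorter way round each ring; the function names are illustrative.

```python
from math import prod

DIMS = (32, 32, 64)          # torus dimensions quoted for Blue Gene


def node_count(dims):
    return prod(dims)


def neighbours(node, dims):
    """The nodes one hop away along each axis, with wrap-around links."""
    result = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % size
            result.append(tuple(coord))
    return result


def hop_distance(a, b, dims):
    """Minimum hops between two nodes, going the shorter way round each ring."""
    return sum(min(abs(x - y), size - abs(x - y))
               for x, y, size in zip(a, b, dims))


print("nodes:", node_count(DIMS))
print("neighbours of (0, 0, 0):", neighbours((0, 0, 0), DIMS))
print("farthest node distance:", hop_distance((0, 0, 0), (16, 16, 32), DIMS))
```

The wrap-around links halve the worst-case distance along each dimension compared with a mesh of the same size.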

Blue Gene Supercomputer contd.

Embedded vs GP Computing

• Fixed functionality
• Part of a larger system
• Interact with environment
• Real-time requirements
• Power constraints
• Environmental constraints

• Performance cannot be increased simply by increasing clock frequency

Cradle CT 3616 Architecture

IBM Cell Architecture

• Clock speed: > 4 GHz
• Peak performance (single precision): > 256 GFlops
• Peak performance (double precision): > 26 GFlops
• SPU registers: 128 x 128b
• Local storage size per SPU: 256KB
• Area: 221 mm²
• Technology: 90nm SOI
• Total number of transistors: 234M

Books

1. D.A. Patterson, J.L. Hennessy, "Computer Architecture : A Quantitative Approach", Morgan Kaufmann Publishers, 2006.

2. D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures : A Design Space Approach", Addison Wesley, 1997.

3. M.J. Flynn, "Computer Architecture : Pipelined and Parallel Processor Design", Narosa Publishing House/ Jones and Bartlett, 1996.

4. K. Hwang, "Advanced Computer Architecture : Parallelism, Scalability, Programmability", McGraw Hill, 1993.

5. H.G. Cragon, "Memory Systems and Pipelined Processors", Narosa Publishing House/ Jones and Bartlett, 1998.

6. D.E. Culler, J.P Singh and Anoop Gupta, "Parallel Computer Architecture, A Hardware/Software Approach", Harcourt Asia / Morgan Kaufmann Publishers, 2000.
