
Page 1: Lec Jan12 2009

Introduction
12th January, 2009

CSL718: Architecture of High Performance Systems

Page 2: Lec Jan12 2009

Anshul Kumar, CSE IITD slide 2

Some basic questions

• What is high performance?
  – Rate of computation
  – Time to compute
• Who needs high performance systems?
  – Weather prediction, complex design, scientific computation etc.
  – Everyone needs it.
• How do you achieve high performance?
  – Technology
  – Circuit / logic design
  – Architecture
• How to analyse or evaluate performance?
  – Theoretical models
  – Simulation
  – Experimentation

Page 3: Lec Jan12 2009

Execution Time and Clock Period

Program execution time = Tprog = N * Tinst = N * CPI * Δt

  N   : Number of instructions
  CPI : Cycles per instruction (average)
  Δt  : Clock cycle time

[Figure: instruction execution stages IF, D, RF, EX/AG, M, WB, each taking one clock cycle Δt]

Instruction execution time = Tinst = CPI * Δt
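As a quick numeric check of the relation Tprog = N * CPI * Δt, here is a minimal Python sketch. The instruction count, CPI and clock period are made-up values chosen for illustration; they do not come from the lecture.

```python
def program_time(n_instructions, cpi, clock_period_s):
    """Tprog = N * CPI * Δt, in seconds."""
    return n_instructions * cpi * clock_period_s


def instruction_time(cpi, clock_period_s):
    """Tinst = CPI * Δt, in seconds."""
    return cpi * clock_period_s


# Hypothetical workload: 1e9 instructions, average CPI of 1.5, 1 ns clock (1 GHz).
N, CPI, DT = 1_000_000_000, 1.5, 1e-9
t_prog = program_time(N, CPI, DT)
t_inst = instruction_time(CPI, DT)
print(f"Tprog = {t_prog:.2f} s, Tinst = {t_inst * 1e9:.1f} ns")
```

Halving any one of the three factors halves Tprog, which is why technology, software and architecture can each be treated as a separate lever.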

Page 4: Lec Jan12 2009

What influences clock period?

Tprog = N * CPI * Δt

Technology   - Δt ⇓
Software     - N ⇓
Architecture - N * CPI * Δt ⇓
  Instruction set architecture (ISA): trade-off N vs CPI * Δt
  Micro architecture (μA): trade-off CPI vs Δt
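The two trade-offs can be made concrete with a small comparison. The sketch below uses three hypothetical designs running the same program; every number in it is invented for illustration and none is taken from the lecture.

```python
def program_time(n, cpi, dt):
    """Tprog = N * CPI * Δt, in seconds."""
    return n * cpi * dt


# All values below are illustrative assumptions.
designs = {
    # ISA trade-off: fewer, more complex instructions (low N, high CPI) ...
    "complex ISA": dict(n=0.8e9, cpi=3.0, dt=1.0e-9),
    # ... versus more, simpler instructions (high N, low CPI).
    "simple ISA": dict(n=1.2e9, cpi=1.2, dt=1.0e-9),
    # Microarchitecture trade-off: a deeper pipeline shortens Δt
    # but raises CPI because of extra stall cycles.
    "simple ISA, deeper pipeline": dict(n=1.2e9, cpi=1.5, dt=0.7e-9),
}

times = {name: program_time(**params) for name, params in designs.items()}
for name, t in times.items():
    print(f"{name:30s} Tprog = {t:.2f} s")
```

With these particular numbers the deeper pipeline wins, but only because the drop in Δt outweighs the rise in CPI; a different set of assumptions could reverse the ranking.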

Page 5: Lec Jan12 2009

Relative performance per unit cost

Year   Technology           Perf/cost
1951   Vacuum tube          1
1965   Transistor           35
1975   Integrated circuit   900
1995   VLSI                 2,400,000
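To put the table in perspective, the sketch below converts the perf/cost figures into compound annual growth rates. It assumes steady exponential growth between the listed years, which is a simplification made only for this calculation.

```python
# (year, relative performance per unit cost), copied from the table above.
table = [(1951, 1), (1965, 35), (1975, 900), (1995, 2_400_000)]


def annual_growth(year0, value0, year1, value1):
    """Compound annual growth rate between two data points."""
    return (value1 / value0) ** (1 / (year1 - year0)) - 1


# Growth rate for each interval between consecutive rows.
for (y0, v0), (y1, v1) in zip(table, table[1:]):
    print(f"{y0}-{y1}: {annual_growth(y0, v0, y1, v1):.1%} per year")

# Growth rate over the whole period.
overall = annual_growth(*table[0], *table[-1])
print(f"{table[0][0]}-{table[-1][0]}: {overall:.1%} per year")
```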

Page 6: Lec Jan12 2009

Increase in workstation performance

[Figure: performance (y axis, 0 to 1200) against year (x axis, 1987 to 1997). Machines plotted: SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600]

Page 7: Lec Jan12 2009

Growth in DRAM Capacity

[Figure: DRAM capacity in Kbit (y axis, log scale, 10 to 100,000) against year of introduction (x axis, 1976 to 1996). Generations plotted: 16K, 64K, 256K, 1M, 4M, 16M, 64M]

Page 8: Lec Jan12 2009

CPU-Memory Performance Gap

• Semiconductor
  – Registers: CPU speed
  – SRAM: random access
  – DRAM
  – FLASH
• Magnetic: slow
  – FDD
  – HDD
• Optical: random + sequential
  – CD: very slow
  – DVD

Page 9: Lec Jan12 2009

Memory Hierarchy Principle

[Figure: CPU backed by a hierarchy of memories. The level nearest the CPU is the smallest, has the highest cost per bit and is the fastest; the level farthest from the CPU is the biggest, has the lowest cost per bit and is the slowest. A memory access at a level is either a hit or a miss.]

Temporal Locality
  – References repeated in time
Spatial Locality
  – References repeated in space
  – Special case: Sequential Locality
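The effect of locality on hits and misses can be seen with a toy cache model. The following sketch simulates a direct-mapped cache; the cache geometry (64 lines of 8 addresses each) and the three access patterns are assumptions picked for illustration, not parameters from the lecture.

```python
def hit_rate(addresses, num_lines=64, block_size=8):
    """Fraction of accesses that hit in a direct-mapped cache."""
    tags = [None] * num_lines          # block currently held by each cache line
    hits = 0
    for addr in addresses:
        block = addr // block_size     # which memory block the address lies in
        index = block % num_lines      # the one cache line that block may use
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block        # miss: fetch the block, evicting the old one
    return hits / len(addresses)


# Sequential (spatial) locality: walk through consecutive addresses.
sequential = list(range(4096))
# Temporal locality: keep re-using the same four addresses.
repeated = [0, 8, 16, 24] * 1024
# No useful locality for this cache: two addresses that map to the same line.
conflicting = [0, 512] * 2048

for name, trace in [("sequential", sequential),
                    ("repeated", repeated),
                    ("conflicting", conflicting)]:
    print(f"{name:12s} hit rate = {hit_rate(trace):.3f}")
```

In the sequential trace only the first access to each block misses, so the hit rate depends on the block size alone; in the conflicting trace the two addresses keep evicting each other and every access misses.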

Page 10: Lec Jan12 2009

Parallelism: Flynn’s Classification

Architecture Categories

SISD SIMD MISD MIMD
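The four category names encode two counts: the number of instruction streams and the number of data streams. A minimal sketch of that naming rule follows; the function name `flynn_class` is hypothetical and used only for this illustration.

```python
def flynn_class(instruction_streams, data_streams):
    """Return the Flynn category for the given numbers of streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    i = "S" if instruction_streams == 1 else "M"   # Single / Multiple instruction
    d = "S" if data_streams == 1 else "M"          # Single / Multiple data
    return f"{i}I{d}D"


print(flynn_class(1, 1))    # one instruction stream, one data stream
print(flynn_class(1, 64))   # one instruction stream driving 64 data streams
print(flynn_class(4, 1))    # several instruction streams on one data stream
print(flynn_class(4, 4))    # several independent instruction and data streams
```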

Page 11: Lec Jan12 2009

SISD

[Figure: block diagram with one control unit (C), one processor (P) and memory (M), connected by an instruction stream (IS) and a data stream (DS)]

Page 12: Lec Jan12 2009

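SIMD

[Figure: block diagram with one control unit (C), multiple processors (P) and memory (M), connected by a single instruction stream (IS) and multiple data streams (DS)]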

Page 13: Lec Jan12 2009

MISD

[Figure: block diagram with multiple control units (C), multiple processors (P) and memory (M), connected by instruction streams (IS) and data streams (DS)]

Page 14: Lec Jan12 2009

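MIMD

[Figure: block diagram with multiple control units (C), multiple processors (P) and memory (M), connected by multiple instruction streams (IS) and multiple data streams (DS)]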

Page 15: Lec Jan12 2009

Feng’s Classification

[Figure: machines plotted by word length (x axis: 1, 16, 32, 64) against bit slice length (y axis: 1, 16, 64, 256, 16K). Machines shown: MPP, STARAN, C.mmP, PDP11, PEPE, IBM370, IlliacIV, CRAY-1]
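Feng's scheme is usually summarised by a maximum degree of parallelism, the product of the two axes (word length times bit slice length), together with four processing modes. The sketch below encodes that textbook reading; the mode abbreviations are standard terminology rather than text from the slide, and the sample points reuse axis tick values without claiming to be the coordinates of any particular machine.

```python
def max_parallelism(word_length, bit_slice_length):
    """Maximum number of bits processed together: word length x bit slice length."""
    return word_length * bit_slice_length


def processing_mode(word_length, bit_slice_length):
    """Feng's four modes: Word Serial/Parallel combined with Bit Serial/Parallel."""
    words = "WP" if bit_slice_length > 1 else "WS"   # several words handled at once?
    bits = "BP" if word_length > 1 else "BS"         # several bits of a word at once?
    return words + bits


# Illustrative points built from the axis tick values of the figure.
for n, m in [(1, 1), (16, 1), (1, 16 * 1024), (64, 64)]:
    print(f"word length {n:2d}, bit slice length {m:5d}: "
          f"{processing_mode(n, m)}, P = {max_parallelism(n, m)}")
```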

Page 16: Lec Jan12 2009

Händler’s Classification

< K x K’ , D x D’ , W x W’ >
  K: control, D: data, W: word
  dash → degree of pipelining

TI - ASC   <1, 4, 64 x 8>
CDC 6600   <1, 1 x 10, 60> x <10, 1, 12> (I/O)
C.mmP      <16, 1, 16> + <1 x 16, 1, 16> + <1, 16, 16>
PEPE       <1 x 3, 288, 32>
Cray-1     <1, 12 x 8, 64 x (1 ~ 14)>
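One way to get a feel for the notation is to model a single triple as a small record. The sketch below does this for the two simplest entries listed above; the class name `HandlerTriple` and its field names are hypothetical, and composite descriptions joined with `x` or `+` (CDC 6600, C.mmP) and the ranged Cray-1 entry are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandlerTriple:
    k: int            # K: control
    d: int            # D: data
    w: int            # W: word
    k_pipe: int = 1   # K': degree of pipelining at the control level
    d_pipe: int = 1   # D': degree of pipelining at the data level
    w_pipe: int = 1   # W': degree of pipelining at the word level

    @staticmethod
    def _part(count, pipe):
        # A pipelining degree of 1 is left implicit, as in the notation.
        return f"{count} x {pipe}" if pipe != 1 else str(count)

    def __str__(self):
        parts = (self._part(self.k, self.k_pipe),
                 self._part(self.d, self.d_pipe),
                 self._part(self.w, self.w_pipe))
        return "<" + ", ".join(parts) + ">"


ti_asc = HandlerTriple(k=1, d=4, w=64, w_pipe=8)
pepe = HandlerTriple(k=1, k_pipe=3, d=288, w=32)
print("TI-ASC", ti_asc)
print("PEPE  ", pepe)
```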

Page 17: Lec Jan12 2009

Modern Classification

Parallel architectures
  – Data-parallel architectures
  – Function-parallel architectures

Page 18: Lec Jan12 2009

Data Parallel Architectures

• SIMD Processors
  – Multiple processing elements driven by a single instruction stream
• Vector Processors
  – Uni-processors with vector instructions
• Associative Processors
  – SIMD-like processors with associative memory
• Systolic Arrays
  – Application-specific VLSI structures

Page 19: Lec Jan12 2009

Function Parallel Architectures

Function-parallel architectures
  – Instruction level parallel architectures (ILPs)
      – Pipelined processors
      – VLIWs
      – Superscalar processors
  – Thread level parallel architectures
  – Process level parallel architectures (MIMDs)
      – Distributed Memory MIMD
      – Shared Memory MIMD

Page 20: Lec Jan12 2009

Pipelining

[Figure: stages IF, D, RF, EX/AG, M, WB]

Simple multicycle design:
• resources are shared across cycles
• not all instructions take the same number of cycles

With pipelining:
• faster throughput
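The throughput gain can be estimated with the usual idealised cycle counts. The sketch below assumes the six stages named on the slide, one cycle per stage, every instruction passing through every stage, and no stalls; real designs fall short of this ideal.

```python
STAGES = ["IF", "D", "RF", "EX/AG", "M", "WB"]


def multicycle_cycles(n_instructions, n_stages):
    """Each instruction occupies the datapath for all its stages before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """The first instruction fills the pipeline; after that one completes per cycle."""
    return n_stages + (n_instructions - 1)


n = 1000   # illustrative instruction count
k = len(STAGES)
speedup = multicycle_cycles(n, k) / pipelined_cycles(n, k)
print(f"multicycle: {multicycle_cycles(n, k)} cycles")
print(f"pipelined : {pipelined_cycles(n, k)} cycles")
print(f"speedup   : {speedup:.2f} (approaches {k} for long instruction sequences)")
```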

Page 21: Lec Jan12 2009

Limits of Pipelining

• Structural hazards
  – Resource conflicts: two instructions require the same resource in the same cycle
• Data hazards
  – Data dependencies: one instruction needs data which is yet to be produced by another instruction
• Control hazards
  – Decision about the next instruction needs more cycles
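A data hazard of the read-after-write kind can be spotted by comparing the destination register of one instruction with the sources of the instructions that follow it. The sketch below does this for a short, made-up instruction sequence; the `window` parameter (how many following instructions are still too close to the producer) is an assumed pipeline property, not a value from the lecture.

```python
from collections import namedtuple

# Hypothetical three-operand instruction: operation, destination, source registers.
Instr = namedtuple("Instr", "op dest srcs")


def raw_hazards(program, window=2):
    """List (producer index, consumer index, register) read-after-write hazards.

    A consumer is flagged if it reads a register written by an instruction
    at most `window` positions earlier.
    """
    hazards = []
    for i, producer in enumerate(program):
        for j in range(i + 1, min(i + 1 + window, len(program))):
            for reg in program[j].srcs:
                if reg == producer.dest:
                    hazards.append((i, j, reg))
    return hazards


program = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("sub", "r4", ("r1", "r5")),   # reads r1 right after it is written
    Instr("and", "r6", ("r7", "r8")),   # independent of the others
    Instr("or",  "r9", ("r4", "r1")),   # reads r4 two instructions after it is written
]

for producer, consumer, reg in raw_hazards(program):
    print(f"instruction {consumer} needs {reg} from instruction {producer}")
```

The last instruction also reads r1, but three positions after the `add`, which is outside the assumed window, so it is not reported.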

Page 22: Lec Jan12 2009

ILP in VLIW processors

[Figure: a fetch unit reads a single multi-operation instruction from cache/memory and passes the multi-operation instruction to several functional units (FU), which share a register file]

Page 23: Lec Jan12 2009

ILP in Superscalar processors

[Figure: a fetch unit reads multiple instructions from a sequential stream of instructions in cache/memory; a decode and issue unit passes them to several functional units (FU), which share a register file. Legend: instruction/control paths and data paths; FU = Functional Unit]

Page 24: Lec Jan12 2009

Superscalar and VLIW processors

Page 25: Lec Jan12 2009

Issues in ILP Architectures

[Figure: functional units (FU) sharing a register file]

• Scalability with increase in number of register ports
• ILP detection: special compilers / special hardware
• Code compatibility
• Code density, instruction encoding
• Maintaining consistency

Page 26: Lec Jan12 2009

ILP and Multithreading

[Figure comparing ILP, coarse MT, fine MT and SMT. Source: Hennessy and Patterson]

Page 27: Lec Jan12 2009

Why Process level Parallel Architectures?

Data-parallel architectures
Function-parallel architectures
  – Instruction level PAs
  – Thread level PAs
  – Process level PAs (MIMDs)
      – Distributed Memory MIMD
      – Shared Memory MIMD

Built using general purpose processors

Page 28: Lec Jan12 2009

Issues from user’s perspective

• Specification / Program design
  – explicit parallelism, or
  – implicit parallelism + parallelizing compiler
• Partitioning / mapping to processors
• Scheduling / mapping to time instants
  – static or dynamic
• Communication and Synchronization
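The difference between static and dynamic scheduling shows up when task costs are uneven. The sketch below compares a static block partition with a simple dynamic policy that hands each task to the currently least-loaded processor; the task costs, the processor count and both policies are assumptions picked for illustration.

```python
import heapq
import math


def static_block_makespan(costs, n_procs):
    """Static: split the task list into equal-sized contiguous blocks up front."""
    chunk = math.ceil(len(costs) / n_procs)
    return max(sum(costs[i * chunk:(i + 1) * chunk]) for i in range(n_procs))


def dynamic_makespan(costs, n_procs):
    """Dynamic: give each task, in order, to the processor that is free first."""
    loads = [0] * n_procs
    heapq.heapify(loads)
    for cost in costs:
        lightest = heapq.heappop(loads)       # processor with the least work so far
        heapq.heappush(loads, lightest + cost)
    return max(loads)


task_costs = [8, 7, 6, 5, 4, 3, 2, 1]   # hypothetical, deliberately unbalanced
print("static block:", static_block_makespan(task_costs, 2))
print("dynamic     :", dynamic_makespan(task_costs, 2))
```

The static split puts all the expensive tasks on one processor, while the dynamic policy balances the load at the price of a run-time scheduling decision per task.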

Page 29: Lec Jan12 2009

Parallel programming models

Concurrent control flow

Functional or logic program

Vector/array operations

Concurrent tasks/processes/threads/objects

With shared variables or message passing

Relationship between programming model and architecture ?
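The contrast between shared variables and message passing can be sketched with threads from Python's standard library. Both functions below compute the same sum with concurrent workers; the function names and the way the data is split are illustrative choices, and Python threads are used here only to show the programming style, not to obtain a speedup.

```python
import queue
import threading


def sum_shared_variable(data, n_workers):
    """Workers update one shared total, protected by a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:                 # synchronisation: one writer at a time
            total += partial

    threads = [threading.Thread(target=worker, args=(data[i::n_workers],))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def sum_message_passing(data, n_workers):
    """Workers share nothing; each sends its partial result as a message."""
    results = queue.Queue()

    def worker(chunk):
        results.put(sum(chunk))    # communication: send the partial sum

    threads = [threading.Thread(target=worker, args=(data[i::n_workers],))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(n_workers))


numbers = list(range(1000))
print(sum_shared_variable(numbers, 4), sum_message_passing(numbers, 4))
```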

Page 30: Lec Jan12 2009

Issues from architect’s perspective

• Coherence problem in shared memory with caches

• Efficient interconnection networks

Page 31: Lec Jan12 2009

Shared Memory Multiprocessor

[Figure: processors (P) and memory modules (M) connected through an interconnection network; two such groups are also shown joined by a global interconnection network to further memory modules]

Page 32: Lec Jan12 2009

Cache Coherence Problem

Multiple copies of data may exist ⇒ problem of cache coherence

Options for coherence protocols:
• What action is taken?
  – Invalidate or Update
• Which processors/caches communicate?
  – Snoopy (broadcast) or directory based
• Status of each block?
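The invalidate and update options can be compared on a toy model. The sketch below is a deliberately simplified snoopy scheme: caches are plain dictionaries, memory is written through on every store, and no per-block state is kept, so it illustrates the two actions rather than any real protocol.

```python
class ToySnoopySystem:
    """Write-through caches kept coherent by broadcasting every write."""

    def __init__(self, n_caches, policy):
        if policy not in ("invalidate", "update"):
            raise ValueError("policy must be 'invalidate' or 'update'")
        self.policy = policy
        self.memory = {}
        self.caches = [dict() for _ in range(n_caches)]

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:                  # miss: fetch the value from memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu, addr, value):
        self.memory[addr] = value              # write-through, for simplicity
        self.caches[cpu][addr] = value
        for other, cache in enumerate(self.caches):   # every other cache snoops
            if other != cpu and addr in cache:
                if self.policy == "invalidate":
                    del cache[addr]            # drop the stale copy
                else:
                    cache[addr] = value        # refresh the copy in place


for policy in ("invalidate", "update"):
    system = ToySnoopySystem(n_caches=2, policy=policy)
    system.read(0, "x")
    system.read(1, "x")                        # both caches now hold a copy of x
    system.write(0, "x", 5)
    still_cached = "x" in system.caches[1]
    print(f"{policy:10s}: copy kept in cache 1 = {still_cached}, "
          f"cpu 1 reads {system.read(1, 'x')}")
```

Under both policies processor 1 ends up reading the new value; the difference is whether its copy is refreshed immediately or discarded and re-fetched on the next read.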

Page 33: Lec Jan12 2009

Interconnection Networks

• Architectural variations:
  – Topology
  – Direct or indirect (through switches)
  – Static (fixed connections) or dynamic (connections established as required)
  – Routing type (store and forward / wormhole)
• Efficiency:
  – Delay
  – Bandwidth
  – Cost
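The delay difference between the two routing types can be approximated with first-order textbook formulas. In the sketch below, store-and-forward receives the whole packet at every hop before passing it on, while wormhole routing pipelines small flits so that only the header pays the per-hop cost. Switch delays and contention are ignored, and the packet size, flit size and hop count are assumed values.

```python
def store_and_forward_cycles(hops, packet_bytes, bytes_per_cycle=1):
    """The whole packet is received at each hop before it is forwarded."""
    return hops * (packet_bytes // bytes_per_cycle)


def wormhole_cycles(hops, packet_bytes, flit_bytes, bytes_per_cycle=1):
    """Only the header flit pays the per-hop delay; the rest follows in a pipeline."""
    header_delay = hops * (flit_bytes // bytes_per_cycle)
    return header_delay + packet_bytes // bytes_per_cycle


# Illustrative parameters: 1 KB packet, 8-byte flits, 10 hops, 1 byte per cycle.
HOPS, PACKET, FLIT = 10, 1024, 8
print("store and forward:", store_and_forward_cycles(HOPS, PACKET), "cycles")
print("wormhole         :", wormhole_cycles(HOPS, PACKET, FLIT), "cycles")
```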

Page 34: Lec Jan12 2009

Quest for Performance

1946 ENIAC ($0.5 M, 18K VTs, 150 kW)
     add/sub 5000 per sec
     mult 385 per sec
     div 40 per sec
     sqrt 3 per sec

1962 Atlas (Pipelined, Int + FPU)
     200K FLOPs

1962 Burroughs D825 (4 CPUs, 16 Mem)

1964 CDC 6600 (first supercomputer)
     multiple FUs, dynamic scheduling

1972 ILLIAC-IV (64 PEs, 4 MFLOPs each)

Page 35: Lec Jan12 2009

Fastest Supercomputer (ref www.top500.org)

• IBM’s Blue Gene/L at Lawrence Livermore Lab topped the list in June 2006 with 280.6 teraflops.

• Japan’s Earth Simulator, introduced in 2002, was the fastest with 35.8 teraflops until Blue Gene took over in 2004.

• Japan’s proposal (2005) to build a supercomputer 73 times faster than the then-current best. Target: 10 petaflops; budget: $800 - $900 million; date: 2011.

• Tata Sons’ EKA entered at the 4th spot in 2007 with 132.8 teraflops.

• Energy efficiency (max 488 Mflops/watt) is also listed in the June 2008 list.

Page 36: Lec Jan12 2009

June 2008 list

Rank | Site | Computer
1 | DOE/NNSA/LANL, United States | Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband, IBM (1026 teraflops)
2 | DOE/NNSA/LLNL, United States | BlueGene/L - eServer Blue Gene Solution, IBM
3 | Argonne National Laboratory, United States | Blue Gene/P Solution, IBM
4 | Texas Advanced Computing Center/Univ. of Texas, United States | Ranger - SunBlade x6420, Opteron Quad 2 GHz, Infiniband, Sun Microsystems
5 | DOE/Oak Ridge National Laboratory, United States | Jaguar - Cray XT4 QuadCore 2.1 GHz, Cray Inc.
6 | Forschungszentrum Juelich (FZJ) | JUGENE - Blue Gene/P Solution, IBM
7 | New Mexico Computing Applications Center (NMCAC), United States | Encanto - SGI Altix ICE 8200, Xeon quad core 3.0 GHz, SGI
8 | Computational Research Laboratories, TATA SONS, India | EKA - Cluster Platform 3000 BL460c, Xeon 53xx 3 GHz, Infiniband, HP (133 teraflops)

Page 37: Lec Jan12 2009

Blue Gene Supercomputer

• 32 x 32 x 64 3D torus (65,536 nodes)
• Global reduction tree: max/sum in a few μs
• Fast synch across entire machine within a few μs
• 1,024 Gbps links to a global parallel file system
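The torus figures can be checked with a few lines of arithmetic. The sketch below uses the 32 x 32 x 64 dimensions from the slide to count nodes, list the neighbours of a node and compute hop distances with wraparound; the coordinate scheme and the function names are assumptions made for illustration.

```python
import math

DIMS = (32, 32, 64)   # torus dimensions from the slide


def node_count(dims):
    return math.prod(dims)


def neighbours(coord, dims):
    """The six nodes one hop away; edges wrap around in every dimension."""
    result = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size
            result.append(tuple(n))
    return result


def hop_distance(a, b, dims):
    """Shortest number of hops, going either way round each ring."""
    return sum(min(abs(x - y), size - abs(x - y))
               for x, y, size in zip(a, b, dims))


print("nodes:", node_count(DIMS))
print("neighbours of (0, 0, 0):", neighbours((0, 0, 0), DIMS))
print("hops to the opposite corner of the grid:",
      hop_distance((0, 0, 0), (31, 31, 63), DIMS))
print("largest possible hop distance:",
      hop_distance((0, 0, 0), (16, 16, 32), DIMS))
```

Because of the wraparound links, the node at (31, 31, 63) is only three hops from (0, 0, 0), and no pair of nodes is more than half of each ring apart.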

Page 38: Lec Jan12 2009

Blue Gene Supercomputer contd.

Page 39: Lec Jan12 2009

Embedded vs GP Computing

• Fixed functionality
• Part of a larger system
• Interact with environment
• Real-time requirements
• Power constraints
• Environmental constraints

• Performance cannot be increased simply by increasing clock frequency

Page 40: Lec Jan12 2009

Cradle CT 3616 Architecture

Page 41: Lec Jan12 2009

IBM Cell Architecture

• Clock speed: > 4 GHz
• Peak performance (single precision): > 256 GFlops
• Peak performance (double precision): > 26 GFlops
• SPU registers: 128 x 128b
• Local storage size per SPU: 256KB
• Area: 221 mm²
• Technology: 90nm SOI
• Total number of transistors: 234M
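The single-precision peak figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes 8 SPUs, four 32-bit lanes per 128-bit register, and one fused multiply-add (counted as 2 flops) per lane per cycle; the SPU count and the multiply-add assumption are not stated on the slide and are labelled as assumptions in the code.

```python
def peak_gflops(n_units, lanes, flops_per_lane_per_cycle, clock_ghz):
    """Peak rate = units x SIMD lanes x flops per lane per cycle x clock (GHz)."""
    return n_units * lanes * flops_per_lane_per_cycle * clock_ghz


REGISTER_BITS = 128      # from the slide: SPU registers are 128 bits wide
SINGLE_BITS = 32         # width of a single-precision value
N_SPUS = 8               # assumption: not stated on the slide
FLOPS_PER_LANE = 2       # assumption: one multiply-add per lane per cycle
CLOCK_GHZ = 4.0          # from the slide: clock speed above 4 GHz

lanes = REGISTER_BITS // SINGLE_BITS
single_peak = peak_gflops(N_SPUS, lanes, FLOPS_PER_LANE, CLOCK_GHZ)
print(f"{lanes} lanes per SPU, estimated single-precision peak = {single_peak:.0f} GFlops")
```

Under these assumptions the estimate matches the quoted single-precision figure at 4 GHz; the much lower double-precision figure is not derived here.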

Page 42: Lec Jan12 2009

Books

1. D.A. Patterson, J.L. Hennessy, "Computer Architecture: A Quantitative Approach", Morgan Kaufmann Publishers, 2006.
2. D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures: A Design Space Approach", Addison Wesley, 1997.
3. M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", Narosa Publishing House / Jones and Bartlett, 1996.
4. K. Hwang, "Advanced Computer Architecture: Parallelism, Scalability, Programmability", McGraw Hill, 1993.
5. H.G. Cragon, "Memory Systems and Pipelined Processors", Narosa Publishing House / Jones and Bartlett, 1998.
6. D.E. Culler, J.P. Singh and Anoop Gupta, "Parallel Computer Architecture: A Hardware/Software Approach", Harcourt Asia / Morgan Kaufmann Publishers, 2000.