lec jan12 2009

Introduction

12th January, 2009

CSL718 : Architecture of High Performance Systems

Anshul Kumar, CSE IITD

Some basic questions

• What is high performance?
– Rate of computation
– Time to compute
• Who needs high performance systems?
– Weather prediction, complex design, scientific computation etc.
– Every one needs it.
• How do you achieve high performance?
– Technology
– Circuit / logic design
– Architecture
• How to analyse or evaluate performance?
– Theoretical models
– Simulation
– Experimentation

Execution Time and Clock Period

Program exec time = Tprog = N * Tinst = N * CPI * Δt

N : Number of instructions
CPI : Cycles per instruction (Av)
Δt : Clock cycle time

Pipeline stages, each taking one Δt: IF D RF EX/AG M WB

Instruction execution time = Tinst = CPI * Δt
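The relation Tprog = N * CPI * Δt can be tried out numerically. The sketch below is only an illustration: the function name `program_time` and all the numbers are assumptions chosen for convenience, not values from the lecture.

```python
def program_time(n_instructions, cpi, clock_period_s):
    """Tprog = N * CPI * Δt, in seconds."""
    return n_instructions * cpi * clock_period_s

# Illustrative values only (assumed, not from the lecture)
n = 1_000_000   # N: number of instructions executed
cpi = 2.0       # CPI: average cycles per instruction
dt = 1e-9       # Δt: clock cycle time of 1 ns (a 1 GHz clock)

t_inst = cpi * dt                  # Tinst = CPI * Δt
t_prog = program_time(n, cpi, dt)  # Tprog = N * Tinst

print(f"Tinst = {t_inst * 1e9:.1f} ns")
print(f"Tprog = {t_prog * 1e3:.3f} ms")
```

With these assumed numbers each instruction takes 2 ns on average, so one million instructions take about 2 ms.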

What influences clock period?

Tprog = N * CPI * Δt

Technology - Δt ⇓
Software - N ⇓
Architecture - N * CPI * Δt ⇓
• Instruction set architecture (ISA): trade-off N vs CPI * Δt
• Micro architecture (μA): trade-off CPI vs Δt
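The micro-architecture trade-off between CPI and Δt can be made concrete by comparing two designs that run the same program. This is a minimal sketch: the `Design` class and both sets of numbers are hypothetical, chosen only to show that a shorter clock period can pay off even when CPI gets worse.

```python
from dataclasses import dataclass

@dataclass
class Design:
    name: str
    n: float      # instructions executed (N)
    cpi: float    # average cycles per instruction (CPI)
    dt_ns: float  # clock cycle time Δt in ns

    def t_prog_ns(self):
        # Tprog = N * CPI * Δt
        return self.n * self.cpi * self.dt_ns

# Hypothetical designs: same ISA and program (same N), different pipelines
shallow = Design("shallow pipeline", n=1e6, cpi=1.2, dt_ns=2.0)
deep = Design("deep pipeline", n=1e6, cpi=1.6, dt_ns=1.0)

speedup = shallow.t_prog_ns() / deep.t_prog_ns()
for d in (shallow, deep):
    print(f"{d.name}: CPI * Δt = {d.cpi * d.dt_ns:.2f} ns per instruction")
print(f"deep over shallow speedup = {speedup:.2f}")
```

Here the deeper pipeline loses on CPI (more stall cycles) but halves Δt, so the product CPI * Δt still improves. With other assumed numbers the result can go the other way, which is why it is a trade-off.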

Relative performance per unit cost

Year   Technology           Perf/cost
1951   Vacuum tube          1
1965   Transistor           35
1975   Integrated circuit   900
1995   VLSI                 2,400,000

Increase in workstation performance

[Figure: workstation performance by year, 1987 to 1997. Machines shown: HP 9000/750, SUN-4/260, MIPS M2000, MIPS M/120, IBM RS6000, DEC Alpha 5/500, DEC Alpha 21264/600, DEC Alpha 5/300, DEC Alpha 4/266, DEC AXP/500, IBM POWER 100]

Growth in DRAM Capacity

[Figure: DRAM capacity against year of introduction, 1976 to 1996]

CPU-Memory Performance Gap

• Semiconductor
– Registers (CPU speed)
– SRAM (random access)
– DRAM
– FLASH
• Magnetic (slow)
– FDD
– HDD
• Optical (random + sequential)
– CD (very slow)
– DVD

Memory Hierarchy Principle

[Figure: memory hierarchy. Size runs from smallest to biggest, cost / bit from highest to lowest, and speed from fastest to slowest.]

Memory access

• Temporal Locality
– References repeated in time
• Spatial Locality
– References repeated in space
– Special case: Sequential Locality
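The effect of locality on a small, fast memory can be seen with a toy cache model. The sketch below assumes a direct-mapped cache with 64 lines of 16 words each. These parameters, the function name `hit_rate` and the three access patterns are illustrative assumptions, not part of the lecture.

```python
def hit_rate(addresses, num_lines=64, line_size=16):
    """Fraction of hits in a direct-mapped cache of num_lines lines,
    each line holding line_size consecutive words."""
    tags = [None] * num_lines      # block number currently held by each line
    hits = 0
    for addr in addresses:
        block = addr // line_size  # which memory block the word belongs to
        index = block % num_lines  # the only line that block may occupy
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block    # miss: bring the whole block in
    return hits / len(addresses)

sequential = list(range(4096))           # spatial (sequential) locality
repeated = list(range(8)) * 512          # temporal locality: same 8 words reused
strided = [i * 16 for i in range(4096)]  # one word per block, never reused

for name, trace in [("sequential", sequential),
                    ("repeated", repeated),
                    ("strided", strided)]:
    print(f"{name:10s} hit rate = {hit_rate(trace):.4f}")
```

The sequential trace misses once per 16-word block and then hits 15 times, the repeated trace misses only on its first access, and the strided trace never finds a word it has seen before.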

Parallelism : Flynn’s Classification

Architecture Categories

SISD SIMD MISD MIMD

SISD

[Figure: SISD organisation. C: control unit, P: processor, M: memory, IS: instruction stream, DS: data stream]

SIMD

MISD

MIMD
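Flynn's four categories follow from two counts: how many instruction streams and how many data streams a machine handles at once. A minimal sketch, with a hypothetical helper name `flynn_category`:

```python
def flynn_category(instruction_streams, data_streams):
    """Name the Flynn category from the number of concurrent
    instruction streams and data streams (each must be >= 1)."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"

print(flynn_category(1, 1))    # one instruction stream, one data stream
print(flynn_category(1, 64))   # one instruction stream driving 64 data streams
print(flynn_category(4, 4))    # four independent instruction and data streams
```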

Feng’s Classification

[Figure: machines plotted by word length (1, 16, 32, 64) against bit slice length: MPP, STARAN, C.mmP, PDP11, PEPE, IBM370, IlliacIV, CRAY-1]

Händler’s Classification

< K x K’ , D x D’ , W x W’ >
  control    data     word

dash → degree of pipelining

TI - ASC   <1, 4, 64 x 8>
CDC 6600   <1, 1 x 10, 60> x <10, 1, 12> (I/O)
C.mmP      <16,1,16> + <1x16,1,16> + <1,16,16>
PEPE       <1 x 3, 288, 32>
Cray-1     <1, 12 x 8, 64 x (1 ~ 14)>
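The triple notation can be held in a small record, with the primed (pipelining) terms defaulting to 1 when they are not written. This is only a sketch of the notation: the class name `HandlerTriple` and its field names are hypothetical, and the two examples use the TI-ASC and PEPE values listed above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HandlerTriple:
    k: int           # K: control level
    d: int           # D: data level
    w: int           # W: word level
    k_pipe: int = 1  # K': degree of pipelining at the control level
    d_pipe: int = 1  # D': degree of pipelining at the data level
    w_pipe: int = 1  # W': degree of pipelining at the word level

    def __str__(self):
        def term(n, pipe):
            # a pipelining degree of 1 is left out, as in the notation above
            return f"{n}" if pipe == 1 else f"{n} x {pipe}"
        parts = (term(self.k, self.k_pipe),
                 term(self.d, self.d_pipe),
                 term(self.w, self.w_pipe))
        return "<" + ", ".join(parts) + ">"

ti_asc = HandlerTriple(k=1, d=4, w=64, w_pipe=8)
pepe = HandlerTriple(k=1, k_pipe=3, d=288, w=32)
print("TI - ASC", ti_asc)
print("PEPE    ", pepe)
```

Composite descriptions such as the CDC 6600 and C.mmP entries, which combine several triples with x and +, are not covered by this sketch.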

Modern Classification

Parallel architectures
• Data-parallel architectures
• Function-parallel architectures

Data Parallel Architectures

• SIMD Processors
– Multiple processing elements driven by a single instruction stream
• Vector Processors
– Uni-processors with vector instructions
• Associative Processors
– SIMD like processors with associative memory
• Systolic Arrays
– Application specific VLSI structures

Function Parallel Architectures

Function-parallel architectures
• Instr level Parallel Arch (ILPs)
– Pipelined processors
– VLIWs
– Superscalar processors
• Thread level Parallel Arch
• Process level Parallel Arch (MIMDs)
– Distributed Memory
– Shared Memory

Pipelining

IF D RF EX/AG M WB

• faster throughput with pipelining

Simple multicycle design :
• resource sharing across cycles
• all instructions may not take same cycles
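The throughput gain from pipelining can be estimated by counting cycles. The sketch below makes strong simplifying assumptions that are not from the lecture: every instruction passes through all six stages, there are no stalls, and both designs use the same clock period. The function names are hypothetical.

```python
STAGES = ["IF", "D", "RF", "EX/AG", "M", "WB"]

def multicycle_cycles(n_instr, n_stages=len(STAGES)):
    # without overlap, each instruction occupies the datapath for all its stages
    return n_instr * n_stages

def pipelined_cycles(n_instr, n_stages=len(STAGES)):
    # the first instruction needs n_stages cycles to drain through the pipe,
    # after that one instruction completes every cycle
    return n_stages + (n_instr - 1)

n = 1000  # assumed instruction count
speedup = multicycle_cycles(n) / pipelined_cycles(n)
print(f"multicycle: {multicycle_cycles(n)} cycles")
print(f"pipelined : {pipelined_cycles(n)} cycles")
print(f"speedup   : {speedup:.2f} (upper bound is {len(STAGES)})")
```

The speedup approaches the number of stages only for long instruction sequences, and only if nothing forces the pipeline to stall.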

Limits of Pipelining

• Structural hazards
– Resource conflicts - two instructions require the same resource in the same cycle
• Data hazards
– Data dependencies - one instruction needs data which is yet to be produced by another instruction
• Control Hazards
– Decision about next instruction needs more cycles
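A data hazard of the kind described above (read after write) can be found by scanning an instruction sequence for a source register written by a recent earlier instruction. This is a rough sketch: the tuple encoding, the register names and the `window` of two instructions are assumptions made for illustration, and the check ignores any intervening instruction that overwrites the same register.

```python
# Each instruction is (destination, source1, source2); the encoding is hypothetical.
program = [
    ("r1", "r2", "r3"),  # i0: r1 = r2 op r3
    ("r4", "r1", "r5"),  # i1: reads r1, produced by i0
    ("r6", "r7", "r8"),  # i2: independent of i0 and i1
    ("r9", "r4", "r1"),  # i3: reads r4 (from i1) and r1 (from i0)
]

def raw_hazards(instrs, window=2):
    """Return (producer, consumer, register) triples where the consumer
    follows the producer by at most `window` instructions."""
    found = []
    for j, (_, *sources) in enumerate(instrs):
        for i in range(max(0, j - window), j):
            dest = instrs[i][0]
            if dest in sources:
                found.append((i, j, dest))
    return found

for producer, consumer, reg in raw_hazards(program):
    print(f"i{consumer} needs {reg} from i{producer}")
```

The dependence of i3 on r1 from i0 is not reported, because three instructions separate them and the assumed window treats the value as already available by then.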

ILP in VLIW processors

[Figure: VLIW organisation. Cache/memory feeds a fetch unit, which delivers a single multi-operation instruction to the functional units (FU); the FUs share a register file.]

ILP in Superscalar processors

[Figure: superscalar organisation. Cache/memory feeds a fetch unit with a sequential stream of instructions; a decode and issue unit sends multiple instructions to the functional units (FU), which share a register file. Legend: instruction/control paths, data paths; FU = Functional Unit.]

Superscalar and VLIW processors

[Figure: functional units (FU) sharing a register file]

Issues in ILP Architectures

• Scalability with increase in number of register ports
• ILP detection – special compilers / special hardware
• Code compatibility
• Code density, Instruction encoding
• Maintaining consistency

ILP and Multithreading

[Figure: comparison of ILP, Coarse MT, Fine MT and SMT]

Why Process level Parallel Architectures?

Data-parallel architectures

Function-parallel architectures
• Instruction level PAs
• Thread level PAs
• Process level PAs (MIMDs)
– Distributed Memory
– Shared Memory

Process level PAs: built using general purpose processors

Issues from user’s perspective

• Specification / Program design
– explicit parallelism or
– implicit parallelism + parallelizing compiler
• Partitioning / mapping to processors
• Scheduling / mapping to time instants
– static or dynamic
• Communication and Synchronization

Parallel programming models

Concurrent control flow

Functional or logic program

Vector/array operations

Concurrent tasks/processes/threads/objects

With shared variables or message passing

Relationship between programming model and architecture ?
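The difference between concurrent tasks that communicate through shared variables and tasks that communicate by message passing can be sketched with threads. The example below is an illustration under assumptions: the function names and the summation task are made up, a lock stands in for synchronization on the shared variable, and a queue stands in for a message channel.

```python
import queue
import threading

def run_all(worker, chunks):
    """Start one thread per chunk and wait for all of them."""
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def sum_shared(chunks):
    """Tasks update one shared variable, protected by a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:               # synchronization on the shared variable
            total += partial

    run_all(worker, chunks)
    return total

def sum_messages(chunks):
    """Tasks share nothing; each sends its partial result as a message."""
    channel = queue.Queue()

    def worker(chunk):
        channel.put(sum(chunk))  # communication by message

    run_all(worker, chunks)
    return sum(channel.get() for _ in chunks)

chunks = [range(0, 250), range(250, 500), range(500, 750), range(750, 1000)]
print(sum_shared(chunks), sum_messages(chunks))
```

Both versions partition the data the same way. They differ only in how partial results reach the final sum, which is the point where the choice of programming model meets the memory organisation of the machine.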

Issues from architect’s perspective

• Coherence problem in shared memory with caches

• Efficient interconnection networks

Shared Memory Multiprocessor

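[Figure: shared memory multiprocessor organisations. Groups of processors (P), each group connected by its own interconnection network, with the groups joined by a global interconnection network.]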

Cache Coherence Problem

Multiple copies of data may exist ⇒ Problem of cache coherence

Options for coherence protocols
• What action is taken?
– Invalidate or Update
• Which processors/caches communicate?
– Snoopy (broadcast) or directory based
• Status of each block?
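One combination of the options above, invalidate plus snoopy broadcast, can be sketched in a few lines. This is a deliberately minimal model and not a real protocol: the class name `SnoopyBus` is hypothetical, each block is either present (valid) or absent (invalid) in a cache, and writes go straight through to memory to keep the sketch short.

```python
class SnoopyBus:
    """Toy write-invalidate scheme: every write is broadcast on a bus
    and all other caches drop their copy of the written address."""

    def __init__(self, n_caches, memory):
        self.memory = memory
        # one dict per cache: address -> value; only valid copies are kept
        self.caches = [dict() for _ in range(n_caches)]

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:             # miss: fetch the value from memory
            cache[addr] = self.memory[addr]
        return cache[addr]

    def write(self, cpu, addr, value):
        for other, cache in enumerate(self.caches):
            if other != cpu:
                cache.pop(addr, None)     # snooped invalidate
        self.caches[cpu][addr] = value
        self.memory[addr] = value         # write-through (assumed for simplicity)

memory = {0x10: 1}
bus = SnoopyBus(n_caches=2, memory=memory)
first_0 = bus.read(0, 0x10)               # both caches now hold a copy
first_1 = bus.read(1, 0x10)
bus.write(0, 0x10, 2)                     # cache 1 loses its stale copy
stale_copy_present = 0x10 in bus.caches[1]
second_1 = bus.read(1, 0x10)              # misses and sees the new value
print(first_0, first_1, stale_copy_present, second_1)
```

An update scheme would instead push the new value into the other caches, and a directory based scheme would replace the broadcast with messages sent only to the caches recorded as holding the block.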

Interconnection Networks

• Architectural Variations:
– Topology
– Direct or Indirect (through switches)
– Static (fixed connections) or Dynamic (connections established as required)
– Routing type (store and forward / worm hole)
• Efficiency:
– Delay
– Bandwidth
– Cost
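The delay difference between the two routing types can be approximated with a simple model. The formulas below are common textbook approximations rather than something given in the lecture: they assume no contention, ignore switch delay, and treat every link as having the same bandwidth. All names and numbers are illustrative.

```python
def store_and_forward_latency(hops, msg_bits, bandwidth_bps):
    # the whole message is received at each node before being sent on,
    # so the full transfer time is paid on every hop
    return hops * (msg_bits / bandwidth_bps)

def wormhole_latency(hops, msg_bits, flit_bits, bandwidth_bps):
    # only the header flit pays the per-hop cost; the rest of the
    # message follows behind it in a pipelined fashion
    return hops * (flit_bits / bandwidth_bps) + msg_bits / bandwidth_bps

# Assumed example: 8 hops, 1 KB message, 64-bit flits, 1 Gbit/s links
hops, msg_bits, flit_bits, bandwidth = 8, 8192, 64, 1e9

saf = store_and_forward_latency(hops, msg_bits, bandwidth)
wh = wormhole_latency(hops, msg_bits, flit_bits, bandwidth)
print(f"store and forward: {saf * 1e6:.3f} us")
print(f"worm hole        : {wh * 1e6:.3f} us")
```

Under this model store and forward delay grows with the product of distance and message length, while worm hole delay grows with their sum, so the gap widens for long messages over many hops.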

Quest for Performance

1946 ENIAC ($0.5 M, 18K VTs, 150 kW)
– add/sub 5000 per sec
– mult 385 per sec
– div 40 per sec
– sqrt 3 per sec

1962 Atlas (Pipelined, Int + FPU)
– 200K FLOPs

1962 Burroughs D825 (4 CPUs 16 Mem)

1964 CDC 6600 (first supercomputer)
– multiple FUs, dynamic scheduling

1972 ILLIAC-IV (64 PEs, 4 MFLOPs each)

Fastest Supercomputer (ref www.top500.org)


• IBM’s Blue Gene/L at Lawrence Livermore Lab topped in June 2006 with 280.6 teraflops

• Japan’s Earth Simulator, introduced in 2002, was the fastest with 35.8 teraflops till Blue Gene took over in 2004.

• Japan’s proposal (2005) to build a supercomputer 73 times faster than the current best. Target: 10 petaflops, budget $800 - $900 million, date 2011.

• Tata Sons’ EKA entered at the 4th spot in 2007 with 132.8 teraflops

• Energy efficiency (max 488 Mflops/watt) also listed in June 2008

June 2008 list

Rank, Computer, Site

1. Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband, IBM (1026 teraflops)
   Site: DOE/NNSA/LANL, United States
2. BlueGene/L - eServer Blue Gene Solution, IBM
   Site: DOE/NNSA/LLNL, United States
3. Blue Gene/P Solution, IBM
   Site: Argonne National Laboratory, United States
4. Ranger - SunBlade x6420, Opteron Quad 2 GHz, Infiniband, Sun Microsystems
   Site: Texas Advanced Computing Center/Univ. of Texas, United States
5. Jaguar - Cray XT4 QuadCore 2.1 GHz, Cray Inc.
   Site: DOE/Oak Ridge National Laboratory, United States
6. JUGENE - Blue Gene/P Solution, IBM
   Site: Forschungszentrum Juelich (FZJ)
7. Encanto - SGI Altix ICE 8200, Xeon quad core 3.0 GHz, SGI
   Site: New Mexico Computing Applications Center (NMCAC), United States
8. EKA - Cluster Platform 3000 BL460c, Xeon 53xx 3 GHz, Infiniband, HP (133 teraflops)
   Site: Computational Research Laboratories, TATA SONS, India

Blue Gene Supercomputer

• 32 x 32 x 64 3D torus (65,536 nodes)
• Global reduction tree - max/sum in a few μs
• Fast synch across entire machine within a few μs
• 1,024 gbps links to a global parallel file system
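The 32 x 32 x 64 torus figures can be checked with a few lines of arithmetic. The sketch below assumes plain shortest-path routing along each dimension with wrap-around links. It illustrates the topology only, not how the actual Blue Gene network routes messages, and the function names are hypothetical.

```python
DIMS = (32, 32, 64)  # torus dimensions from the slide

def neighbours(node, dims=DIMS):
    """The six nodes one link away, using wrap-around in each dimension."""
    result = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % size
            result.append(tuple(coord))
    return result

def hops(a, b, dims=DIMS):
    """Shortest hop count between two nodes; in each dimension the
    message can go either way round the ring."""
    total = 0
    for x, y, size in zip(a, b, dims):
        d = abs(x - y)
        total += min(d, size - d)
    return total

n_nodes = DIMS[0] * DIMS[1] * DIMS[2]
diameter = sum(size // 2 for size in DIMS)  # farthest pair of nodes

print("nodes     :", n_nodes)
print("neighbours:", neighbours((0, 0, 0)))
print("diameter  :", diameter, "hops")
```

The wrap-around links mean that node (0, 0, 0) is a direct neighbour of (31, 0, 0), and the farthest node is reached in 16 + 16 + 32 hops rather than 31 + 31 + 63 as in a mesh without wrap-around.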

Blue Gene Supercomputer contd.

Embedded vs GP Computing

• Fixed functionality
• Part of a larger system
• Interact with environment
• Real-time requirements
• Power constraints
• Environmental constraints
• Performance can not be increased simply by increasing clock frequency

Cradle CT 3616 Architecture

IBM Cell Architecture


• Clock speed: > 4 GHz
• Peak performance (single precision): > 256 GFlops
• Peak performance (double precision): > 26 GFlops
• SPU registers: 128 x 128b
• Local storage size per SPU: 256KB
• Area: 221 mm²
• Technology: 90nm SOI
• Total number of transistors: 234M

Books

1. J.L. Hennessy, D.A. Patterson, "Computer Architecture : A Quantitative Approach", Morgan Kaufmann Publishers, 2006.

2. D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures : A Design Space Approach", Addison Wesley, 1997.

3. M.J. Flynn, "Computer Architecture : Pipelined and Parallel Processor Design", Narosa Publishing House/ Jones and Bartlett, 1996.

4. K. Hwang, "Advanced Computer Architecture : Parallelism, Scalability, Programmability", McGraw Hill, 1993.

5. H.G. Cragon, "Memory Systems and Pipelined Processors", Narosa Publishing House/ Jones and Bartlett, 1998.

6. D.E. Culler, J.P. Singh, A. Gupta, "Parallel Computer Architecture : A Hardware/Software Approach", Harcourt Asia / Morgan Kaufmann Publishers, 2000.
