
Center for Domain-Specific Computing

Supported by NSF “Expedition in Computing” Program

www.cdsc.ucla.edu

Director: Prof. Jason Cong, [email protected]

Participating Universities: UCLA (lead), Rice, Ohio-State, and UC Santa Barbara

(Complete list of faculty members available inside)


Focus: New Transformative Approach to Power/Energy Efficient Computing

Current Solution: Parallelization

[Figure label: Parallelization]

Source: Shekhar Borkar, Intel

Cost and Energy are Still a Big Issue …

Cost of computing

• HW acquisition

• Energy bill

• Heat removal

• Space

• …

Focus: New Transformative Approach to Power/Energy Efficient Computing

Parallelization

Customization: adapt the architecture to the application domain

Source: Shekhar Borkar, Intel

Motivation

A few facts

• We have sufficient computing power for most applications
• Each user/enterprise needs high computing power for only selected tasks in its domain
• Application-specific integrated circuits (ASIC) can lead to 10,000X+ better power performance efficiency, but are too expensive to design and manufacture

Our proposal

A general, customizable platform for the given domain(s)

• Can be customized to a wide range of applications in the domain
• Can be massively produced with cost efficiency
• Can be programmed efficiently with novel compilation and runtime systems

Goal: A “supercomputer-in-a-box” with 100X performance/power improvement via customization for the intended domain(s)

Justification 1 – Potential of Customization

AES, 128-bit key, 128-bit data

Platform               Throughput      Power     Figure of Merit (Gb/s/W)
0.18μm CMOS            3.84 Gbit/sec   350 mW    11 (1/1)
FPGA [1]               1.32 Gbit/sec   490 mW    2.7 (1/4)
ASM StrongARM [2]      31 Mbit/sec     240 mW    0.13 (1/85)
Asm Pentium III [3]    648 Mbit/sec    41.4 W    0.015 (1/800)
C Emb. Sparc [4]       133 Kbit/sec    120 mW    0.0011 (1/10,000)
Java [5] Emb. Sparc    450 bit/sec     120 mW    0.0000037 (1/3,000,000)

[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator

[2] Dag Arne Osvik: 544 cycles AES-ECB on StrongArm SA-1110

[3] Helger Lipmaa PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet

[4] gcc, 1 mW/MHz @ 120 MHz Sparc; assumes 0.25 μm CMOS

[5] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc; assumes 0.25 μm CMOS

Source: P. Schaumont and I. Verbauwhede, "Domain specific codesign for embedded security," IEEE Computer 36(4), 2003.

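The figure-of-merit column in the AES comparison is throughput divided by power. The short sketch below recomputes it from the throughput and power values quoted on the slide, as a way to check how the 1/4, 1/85, and 1/3,000,000 gaps arise. The platform labels are shortened, and the function names are illustrative, not part of any CDSC tool.

```python
# Figure of merit = throughput (Gb/s) / power (W).
# Throughput and power values are copied from the slide; labels are shortened.
PLATFORMS = [
    ("0.18um CMOS", 3.84e9, 0.350),
    ("FPGA", 1.32e9, 0.490),
    ("ASM StrongARM", 31e6, 0.240),
    ("Asm Pentium III", 648e6, 41.4),
    ("C Emb. Sparc", 133e3, 0.120),
    ("Java Emb. Sparc", 450.0, 0.120),
]


def figure_of_merit(bits_per_sec, watts):
    """Return throughput per watt in Gb/s/W."""
    return bits_per_sec / 1e9 / watts


def efficiency_table(platforms):
    """Return (name, Gb/s/W, gap relative to the first entry) per platform."""
    baseline = figure_of_merit(platforms[0][1], platforms[0][2])
    rows = []
    for name, bps, watts in platforms:
        fom = figure_of_merit(bps, watts)
        rows.append((name, fom, baseline / fom))
    return rows


if __name__ == "__main__":
    for name, fom, gap in efficiency_table(PLATFORMS):
        print(f"{name:18s} {fom:12.3g} Gb/s/W   1/{gap:,.0f}")
```

The recomputed gaps are close to, but not identical with, the rounded ratios printed on the slide.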
Justification 2 -- Advance of Civilization

For the human brain, Moore’s Law scaling has long stopped

The number of neurons and their firing speed did not change significantly

Remarkable advancement of civilization via specialization

More advanced societies have a higher degree of specialization

Application Domains: Medical Image Processing & Hemodynamic Simulation

Medical imaging has transformed healthcare

An in vivo method for understanding disease development and patient condition

Estimated to be $100 billion/year

More powerful & efficient computation can help

• Fewer exposures using compressive sensing
• Better clinical assessment (e.g., for cancer) using improved registration and segmentation algorithms

Hemodynamic simulation

Very useful for surgical procedures involving blood flow and vasculature

Both may take hours to days to construct

Clinical requirement: 1-2 min

[Figure: Magnetic resonance (MR) angiograph of an aneurysm]

[Figure: Intracranial aneurysm reconstruction with hemodynamics]

Medical Image Processing Pipeline

[Figure: pipeline diagram; recoverable labels: compressive sensing, reconstruction]

Need of Customization for Medical Image Processing Pipeline

• These algorithms have diverse computation & communication patterns
• A single, homogeneous system cannot perform very well on all of these algorithms
• Need architecture customization and hardware-software co-optimization
• Include many common computation kernels (“motifs”)
• Applicable to other domains

Pipeline stages, algorithms, and their characteristics:

reconstruction: compressive sensing
  iterative, local or global communication
  dense and sparse linear algebra, optimization methods

denoising: total variational algorithm
  non-iterative, highly parallel, local & global communication
  sparse linear algebra, structured grid, optimization methods

registration: fluid registration
  parallel, global communication
  dense linear algebra, optimization methods

segmentation: level set methods
  local communication
  dense linear algebra, spectral methods, MapReduce

analysis: Navier-Stokes equations
  local communication
  sparse linear algebra, n-body methods, graphical models
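The claim that the pipeline stages "include many common computation kernels" can be checked with a simple tally. The sketch below records the motifs listed for each stage, as best they can be read from the slide, and reports which motifs appear in more than one stage. The stage-to-motif pairing is an assumption where the extracted slide text is ambiguous, and the function names are illustrative.

```python
from collections import Counter

# Motifs per pipeline stage, transcribed from the slide (pairing is a best reading).
STAGE_MOTIFS = {
    "reconstruction": {"dense linear algebra", "sparse linear algebra",
                       "optimization methods"},
    "denoising": {"sparse linear algebra", "structured grid",
                  "optimization methods"},
    "registration": {"dense linear algebra", "optimization methods"},
    "segmentation": {"dense linear algebra", "spectral methods", "MapReduce"},
    "analysis": {"sparse linear algebra", "n-body methods", "graphical models"},
}


def motif_usage(stage_motifs):
    """Count in how many stages each motif appears."""
    counts = Counter()
    for motifs in stage_motifs.values():
        counts.update(motifs)
    return counts


def shared_motifs(stage_motifs, min_stages=2):
    """Return motifs used by at least min_stages stages, sorted by name."""
    counts = motif_usage(stage_motifs)
    return sorted(m for m, n in counts.items() if n >= min_stages)


if __name__ == "__main__":
    print(shared_motifs(STAGE_MOTIFS))
```

Under this reading, dense linear algebra, sparse linear algebra, and optimization methods each appear in three of the five stages, which is the kind of overlap a shared accelerator could exploit.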

Center for Domain-Specific Computing (CDSC) Organization

• A diversified & highly accomplished faculty team: 8 in CS&E; 1 in EE; 2 in medical school; 1 in applied math

• 15-20 postdocs and graduate students in four universities – UCLA, Rice, Ohio-State, and UC Santa Barbara

Aberle (UCLA)

Baraniuk (Rice)

Bui (UCLA)

Cong (Director) (UCLA)

Cheng (UCSB)

Chang (UCLA)

Reinman (UCLA)

Palsberg (UCLA)

Sadayappan (Ohio-State)

Sarkar (Associate Director) (Rice)

Vese (UCLA)

Potkonjak (UCLA)

Overview of the Proposed Research

Customizable Heterogeneous Platform (CHP)

[Figure: CHP block diagram. Recoverable labels: caches ($), fixed cores, custom cores, programmable fabric, DRAM, I/O, neighboring CHPs, reconfigurable RF-I bus, reconfigurable optical bus, transceiver/receiver, optical interface]

Domain-specific modeling (healthcare applications)

Architecture modeling

CHP creation
• Customizable computing engines
• Customizable interconnects

CHP mapping
• Source-to-source CHP mapper
• Reconfiguring & optimizing backend
• Adaptive runtime

Customization setting

Design once, invoke many times

CHP Creation – Design Space Exploration

Core parameters
• Frequency & voltage
• Datapath bit width
• Instruction window size
• Issue width
• Cache size & configuration
• Register file organization
• # of thread contexts
• …

NoC parameters
• Interconnect topology
• # of virtual channels
• Routing policy
• Link bandwidth
• Router pipeline depth
• Number of RF-I enabled routers
• RF-I channel and bandwidth allocation
• …

Custom instructions & accelerators
• Amount of programmable fabric
• Shared vs. private accelerators
• Custom instruction selection
• Choice of accelerators
• …

[Figure: Customizable Heterogeneous Platform (CHP) block diagram. Recoverable labels: caches ($), fixed cores, custom cores, programmable fabric, reconfigurable RF-I bus, reconfigurable optical bus, transceiver/receiver, optical interface]

Key questions: Optimal trade-off between efficiency & customizability. Which options to fix at CHP creation? Which to be set by CHP mapper?
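To make the design-space question concrete, the sketch below enumerates a tiny design space and splits its parameters into those fixed at CHP creation and those left to the CHP mapper. The parameter names echo the slide, but every option value and the choice of which parameters are fixed are hypothetical; the slide gives none.

```python
from itertools import product

# Hypothetical option values; the slide names parameters but not their ranges.
DESIGN_SPACE = {
    "issue_width": [1, 2, 4],
    "cache_kb": [16, 32, 64],
    "virtual_channels": [2, 4],
    "shared_accelerators": [False, True],
}


def enumerate_designs(space):
    """Yield every combination of option values as a dict."""
    names = list(space)
    for values in product(*(space[n] for n in names)):
        yield dict(zip(names, values))


def count_designs(space):
    """Number of points in the space (product of option counts)."""
    n = 1
    for values in space.values():
        n *= len(values)
    return n


def split_by_binding(space, fixed_at_creation):
    """Partition parameters: fixed at CHP creation vs. set later by the mapper."""
    creation = {k: v for k, v in space.items() if k in fixed_at_creation}
    mapper = {k: v for k, v in space.items() if k not in fixed_at_creation}
    return creation, mapper


if __name__ == "__main__":
    creation, mapper = split_by_binding(DESIGN_SPACE, {"issue_width", "cache_kb"})
    print(count_designs(DESIGN_SPACE), count_designs(creation), count_designs(mapper))
```

Even this toy space has 36 points; fixing two parameters at creation leaves the mapper a 4-point space per application, which illustrates the efficiency-versus-customizability trade-off the slide asks about.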

CDSC Simulation Framework

CHP Mapping – Compilation and Runtime Software Systems for Customization

Goals: Efficient mapping of domain-specific specification to customizable hardware
– Adapt the CHP to a given application for drastic performance/power efficiency improvement

[Figure: mapping flow. Recoverable labels, roughly top to bottom:]

Domain-specific applications; Programmer; Abstract execution

Domain-specific programming model (domain-specific coordination graph and domain-specific language extensions)

Source-to-source CHP mapper; Application characteristics; CHP architecture models

C/C++ code; Analysis annotations; C/C++ front-end

C/SystemC behavioral spec; RTL synthesizer (xPilot)

Reconfiguring and optimizing back-end; Performance feedback

Binary code for fixed & customized cores; Customized target code; RTL for programmable fabric

Adaptive runtime: lightweight threads and adaptive configuration

CHP architectural prototypes (CHP hardware testbeds, CHP simulation testbed, full CHP)

ASIP Compilation Flow [FPGA’04]

[Figure: compilation flow. Recoverable labels: C code, μArch constraint, Front-end compilation, CDFG, 1. Pattern generation, 2. Pattern selection, Pattern library, 3. Application mapping & graph covering, Optimized CDFG, Backend compilation, Optimized assembly]

Pattern Generation
Satisfying input/output constraints

Pattern Selection
Select a subset to maximize the potential speedup while satisfying the resource constraint

Application Mapping
Graph covering to minimize the total execution time
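Pattern selection, as described above, picks a subset of candidate patterns that maximizes potential speedup under a resource constraint. One textbook way to model that is a 0/1 knapsack, sketched below. This is an illustrative formulation only, not necessarily the algorithm used in the FPGA’04 work; the candidate names, costs, and gains are hypothetical.

```python
def select_patterns(patterns, budget):
    """0/1 knapsack: maximize total gain with total cost <= budget.

    patterns: list of (name, cost, gain) tuples with integer cost >= 1.
    Returns (best_gain, sorted list of chosen pattern names).
    """
    # best[c] holds (gain, chosen names) achievable with at most c resource units.
    best = [(0, ())] * (budget + 1)
    for name, cost, gain in patterns:
        # Walk capacities downward so each pattern is selected at most once.
        for c in range(budget, cost - 1, -1):
            candidate = best[c - cost][0] + gain
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - cost][1] + (name,))
    total, chosen = best[budget]
    return total, sorted(chosen)


# Hypothetical candidates: (name, area cost, estimated cycles saved).
CANDIDATES = [("p1", 4, 30), ("p2", 3, 20), ("p3", 2, 15), ("p4", 5, 32)]

if __name__ == "__main__":
    print(select_patterns(CANDIDATES, budget=7))
```

With a budget of 7 units, the sketch selects p1 and p2 for a total gain of 50, beating the p3 plus p4 combination (47) that uses the same area.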

Custom Instruction Selection Result

Example Application: Rician Denoising

Selected pattern example

Pattern                          #inst   freq   in-loop
(x1-x2)^2 + (x3-x4)^2 + x5       7       3      yes
((x1-x2)*x3+x4)*x5               4       3      yes

AutoPilot Compilation Tool (based on UCLA xPilot system)

Platform-based C to FPGA synthesis

Synthesize pure ANSI-C and C++, GCC-compatible compilation flow

Full support of IEEE-754 floating point data types & operations

Efficiently handle bit-accurate fixed-point arithmetic

More than 10X design productivity gain

High quality-of-results

[Figure: AutoPilot(TM) flow. Recoverable labels: Design Specification (C/C++/SystemC), User Constraints, Compilation & Elaboration, Presynthesis Optimizations, Behavioral & Communication Synthesis and Optimizations, Platform Characterization Library, Timing/Power/Layout Constraints, RTL HDLs & RTL SystemC, FPGA Co-Processor, Common Testbench, Simulation, Verification, and Prototyping, ESL Synthesis]

Acceleration of Lithographic Simulation [FPGA’08]

Lithography simulation
• Simulate the optical imaging process
• Computationally intensive; very slow for full-chip simulation

$$ I(x,y) = \sum_{k} \lambda_k \left| \sum \tau \left[ \psi_k(x-x_1, y-y_1) - \psi_k(x-x_2, y-y_1) + \psi_k(x-x_2, y-y_2) - \psi_k(x-x_1, y-y_2) \right] \right|^2 $$

[Figure: Algorithm in C, AutoPilot(TM) Synthesis Tool, XtremeData X1000 development system (AMD Opteron + Altera StratixII EP2S180)]

15X+ performance improvement vs. AMD Opteron 2.2GHz processor

Close to 100X improvement on energy efficiency
• 15W in FPGA compared with 86W in Opteron
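The energy-efficiency figure follows from the speedup and the two power numbers: energy per task is power times run time, so the gain is the speedup multiplied by the power ratio. The sketch below works through that arithmetic with the slide's values, assuming the quoted 15W and 86W are average power over the run; the function name is illustrative.

```python
def energy_efficiency_gain(speedup, baseline_watts, accelerator_watts):
    """Energy per task = power * time, so gain = speedup * (power ratio)."""
    return speedup * (baseline_watts / accelerator_watts)


# Slide values: 15X+ speedup, 86W on the Opteron, 15W on the FPGA.
gain = energy_efficiency_gain(speedup=15.0, baseline_watts=86.0, accelerator_watts=15.0)

if __name__ == "__main__":
    print(f"energy efficiency gain: about {gain:.0f}x")
```

A speedup of exactly 15X gives roughly 86X; since the slide reports "15X+", the stated "close to 100X" is consistent with this estimate.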

Why an Expedition

Address a fundamental problem – energy efficient computing
• What’s beyond parallelization?
• Our proposal – a transformative approach using customization

Many challenging research topics, e.g.
• Domain-specific modeling/specification
• Novel architecture & microarchitecture for customization
• Compilation and runtime software to support intelligent customization
• New research in testing, verification, reliability in customizable computing

Highly integrated effort – coordinated cross-layer customization in modeling, HW, SW, & application development

Demonstration in a critical application domain
• Healthcare has a significant impact on economy and society
• Can greatly benefit from customizable domain-specific computing

Excerpts from NSF Press Release on 10/6/2009
