Center for Domain-Specific Computing

  • Center for Domain-Specific Computing

    Supported by the NSF "Expedition in Computing" Program

    www.cdsc.ucla.edu

    Director: Prof. Jason Cong, [email protected]

    Participating Universities: UCLA (lead), Rice, Ohio State, and UC Santa Barbara

    (Complete list of faculty members available inside)

  • Focus: New Transformative Approach to Power/Energy Efficient Computing

    Current Solution: Parallelization

    Source: Shekhar Borkar, Intel

  • Cost and Energy are Still a Big Issue …

    Cost of computing

    • HW acquisition

    • Energy bill

    • Heat removal

    • Space

    • …

  • Focus: New Transformative Approach to Power/Energy Efficient Computing

    Parallelization

    Customization: adapt the architecture to the application domain

    Source: Shekhar Borkar, Intel

  • Motivation

    A few facts

    • We have sufficient computing power for most applications
    • Each user/enterprise needs high computing power for only selected tasks in its domain
    • Application-specific integrated circuits (ASICs) can lead to 10,000X+ better power-performance efficiency, but are too expensive to design and manufacture

    Our proposal

    A general, customizable platform for the given domain(s)

    • Can be customized to a wide range of applications in the domain
    • Can be mass-produced cost-efficiently
    • Can be programmed efficiently with novel compilation and runtime systems

    Goal: A "supercomputer-in-a-box" with 100X performance/power improvement via customization for the intended domain(s)

  • Justification 1: Potential of Customization

    AES, 128-bit key, 128-bit data

    Platform               Throughput       Power     Figure of Merit (Gb/s/W)
    0.18μm CMOS            3.84 Gbits/sec   350 mW    11 (1/1)
    FPGA [1]               1.32 Gbit/sec    490 mW    2.7 (1/4)
    ASM StrongARM [2]      31 Mbit/sec      240 mW    0.13 (1/85)
    Asm Pentium III [3]    648 Mbits/sec    41.4 W    0.015 (1/800)
    C Emb. Sparc [4]       133 Kbits/sec    120 mW    0.0011 (1/10,000)
    Java [5] Emb. Sparc    450 bits/sec     120 mW    0.0000037 (1/3,000,000)

    [1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator

    [2] Dag Arne Osvik: 544 cycles AES – ECB on StrongArm SA-1110

    [3] Helger Lipmaa PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet

    [4] gcc, 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 u CMOS

    [5] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 u CMOS

    Source: P. Schaumont and I. Verbauwhede, "Domain specific codesign for embedded security," IEEE Computer 36(4), 2003.

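The figure-of-merit column in the AES table can be reproduced from the throughput and power columns. The sketch below is only an illustration: the function names are hypothetical, and the inputs are the rounded values quoted on the slide, so the results match the slide only to the precision shown there.

```python
# Figure of merit = throughput (Gb/s) / power (W), using the slide's rounded values.
# Each row: (platform label from the slide, throughput in bits/sec, power in watts).
ROWS = [
    ("0.18um CMOS", 3.84e9, 0.350),
    ("FPGA", 1.32e9, 0.490),
    ("ASM StrongARM", 31e6, 0.240),
    ("Asm Pentium III", 648e6, 41.4),
    ("C Emb. Sparc", 133e3, 0.120),
    ("Java Emb. Sparc", 450.0, 0.120),
]


def figure_of_merit(bits_per_sec, watts):
    """Return throughput per watt in Gb/s/W."""
    return bits_per_sec / 1e9 / watts


def relative_to_baseline(rows):
    """Return how many times worse each row is than the first (baseline) row."""
    base = figure_of_merit(rows[0][1], rows[0][2])
    return {name: base / figure_of_merit(bps, w) for name, bps, w in rows}


if __name__ == "__main__":
    ratios = relative_to_baseline(ROWS)
    for name, bps, w in ROWS:
        print(f"{name:18s} {figure_of_merit(bps, w):.3g} Gb/s/W  (1/{ratios[name]:,.0f})")
```

The ratios land close to the slide's parenthesized fractions (for example roughly 1/4 for the FPGA row), with small differences due to rounding.
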
  • Justification 2: Advance of Civilization

    For the human brain, Moore's Law scaling has long stopped

    • The number of neurons and their firing speed did not change significantly

    Remarkable advancement of civilization via specialization

    • More advanced societies have a higher degree of specialization

  • Application Domains: Medical Image Processing & Hemodynamic Simulation

    Medical imaging has transformed healthcare

    • An in vivo method for understanding disease development and patient condition
    • Estimated to be $100 billion/year
    • More powerful & efficient computation can help
        • Fewer exposures using compressive sensing
        • Better clinical assessment (e.g., for cancer) using improved registration and segmentation algorithms

    Hemodynamic simulation

    • Very useful for surgical procedures involving blood flow and vasculature

    Both may take hours to days to construct

    Clinical requirement: 1-2 min

    [Figure: Magnetic resonance (MR) angiograph of an aneurysm]

    [Figure: Intracranial aneurysm reconstruction with hemodynamics]

  • Medical Image Processing Pipeline

    [Figure: medical image processing pipeline; recoverable stage label: compressive sensing reconstruction]

  • Need of Customization for Medical Image Processing Pipeline

    Pipeline stages (stage: algorithm; characteristics)

    • reconstruction: compressive sensing; iterative, local or global communication; dense and sparse linear algebra, optimization methods
    • denoising: total variational algorithm; non-iterative, highly parallel, local & global communication; sparse linear algebra, structured grid, optimization methods
    • registration: fluid registration; parallel, global communication; dense linear algebra, optimization methods
    • segmentation: level set methods; local communication; dense linear algebra, spectral methods, MapReduce
    • analysis: Navier-Stokes equations; local communication; sparse linear algebra, n-body methods, graphical models

    • These algorithms have diverse computation & communication patterns

    • A single, homogeneous system cannot perform very well on all of these algorithms

    • Need architecture customization and hardware-software co-optimization

    • Include many common computation kernels ("motifs")

    • Applicable to other domains

  • Center for Domain-Specific Computing (CDSC) Organization

    • A diversified & highly accomplished faculty team: 8 in CS&E; 1 in EE; 2 in medical school; 1 in applied math

    • 15-20 postdocs and graduate students in four universities – UCLA, Rice, Ohio-State, and UC Santa Barbara

    Aberle (UCLA)

    Baraniuk (Rice)

    Bui (UCLA)

    Cong (Director) (UCLA)

    Cheng (UCSB)

    Chang (UCLA)

    Reinman (UCLA)

    Palsberg (UCLA)

    Sadayappan (Ohio-State)

    Sarkar (Associate Dir) (Rice)

    Vese (UCLA)

    Potkonjak (UCLA)

  • Overview of the Proposed Research

    Customizable Heterogeneous Platform (CHP)

    [Figure: CHP block diagram with caches ($), fixed cores, custom cores, and programmable fabric, plus DRAM, I/O, and other CHPs, connected by a reconfigurable RF-I bus and a reconfigurable optical bus (transceiver/receiver, optical interface)]

    Domain-specific modeling (healthcare applications)

    CHP creation

    • Customizable computing engines
    • Customizable interconnects

    Architecture modeling

    CHP mapping

    • Source-to-source CHP mapper
    • Reconfiguring & optimizing backend
    • Adaptive runtime

    Customization setting

    Design once, invoke many times

  • CHP Creation: Design Space Exploration

    [Figure: Customizable Heterogeneous Platform (CHP) with caches ($), fixed cores, custom cores, and programmable fabric, connected by a reconfigurable RF-I bus and a reconfigurable optical bus (transceiver/receiver, optical interface)]

    Core parameters

    • Frequency & voltage
    • Datapath bit width
    • Instruction window size
    • Issue width
    • Cache size & configuration
    • Register file organization
    • # of thread contexts
    • …

    NoC parameters

    • Interconnect topology
    • # of virtual channels
    • Routing policy
    • Link bandwidth
    • Router pipeline depth
    • Number of RF-I enabled routers
    • RF-I channel and bandwidth allocation

    Custom instructions & accelerators

    • Amount of programmable fabric
    • Shared vs. private accelerators
    • Custom instruction selection
    • Choice of accelerators
    • …

    Key questions

    • Optimal trade-off between efficiency & customizability
    • Which options to fix at CHP creation? Which to be set by CHP mapper?
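One way to picture the design space above is as a set of named parameters whose combinations are enumerated and then split into options fixed at CHP creation and options left to the CHP mapper. The sketch below is purely illustrative: the parameter names, candidate values, and the `fixed_at_creation` split are assumptions made for the example and do not come from the slides.

```python
from itertools import product

# Hypothetical candidate values for a handful of the parameters listed on the slide.
DESIGN_SPACE = {
    "issue_width": [1, 2, 4],
    "l1_cache_kb": [16, 32, 64],
    "noc_topology": ["mesh", "ring"],
    "virtual_channels": [2, 4],
    "rfi_enabled_routers": [0, 4, 8],
}


def enumerate_configs(space):
    """Yield every combination of parameter values as a dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def split_by_stage(config, fixed_at_creation):
    """Split one configuration into creation-time and mapper-time settings."""
    fixed = {k: v for k, v in config.items() if k in fixed_at_creation}
    tunable = {k: v for k, v in config.items() if k not in fixed_at_creation}
    return fixed, tunable


if __name__ == "__main__":
    configs = list(enumerate_configs(DESIGN_SPACE))
    print(len(configs), "candidate configurations")
    # Assumed split: topology and cache size are frozen when the CHP is created.
    fixed, tunable = split_by_stage(configs[0], {"noc_topology", "l1_cache_kb"})
    print("fixed at creation:", fixed)
    print("set by mapper:   ", tunable)
```

Even this toy space has 108 points, which hints at why the choice of what to fix at creation time matters.
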

  • CDSC Simulation Framework

  • CHP Mapping: Compilation and Runtime Software Systems for Customization

    Goals: Efficient mapping of domain-specific specification to customizable hardware

    • Adapt the CHP to a given application for drastic performance/power efficiency improvement

    [Figure: CHP mapping flow, with these labeled elements]

    • Domain-specific applications; programmer; abstract execution
    • Domain-specific programming model (domain-specific coordination graph and domain-specific language extensions)
    • Source-to-source CHP mapper, informed by application characteristics and CHP architecture models
    • C/C++ code; analysis annotations; C/SystemC behavioral spec
    • C/C++ front-end; RTL synthesizer (xPilot)
    • Reconfiguring and optimizing back-end; performance feedback
    • Binary code for fixed & customized cores; customized target code; RTL for programmable fabric
    • Adaptive runtime: lightweight threads and adaptive configuration
    • CHP architectural prototypes (CHP hardware testbeds, CHP simulation testbed, full CHP)

  • ASIP Compilation Flow [FPGA'04]

    Flow: C code + μArch constraint → Front-end compilation → CDFG → (1. Pattern generation, 2. Pattern selection, 3. Application mapping & graph covering, using a pattern library) → Optimized CDFG → Backend compilation → Optimized assembly

    Pattern Generation

    • Satisfying input/output constraints

    Pattern Selection

    • Select a subset to maximize the potential speedup while satisfying the resource constraint

    Application Mapping

    • Graph covering to minimize the total execution time
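The pattern selection step (pick a subset of candidate patterns that maximizes potential speedup under a resource constraint) has the shape of a 0/1 knapsack problem. The sketch below illustrates that framing only; it is not the algorithm from the cited FPGA'04 work, and the pattern names, gains, and costs are made up for the example.

```python
def select_patterns(patterns, budget):
    """0/1 knapsack over candidate patterns.

    patterns: list of (name, gain, cost) tuples, cost a non-negative integer.
    budget:   integer resource limit (e.g. area units; hypothetical).
    Returns (best_total_gain, list_of_chosen_names).
    """
    # best[b] holds the best (gain, chosen names) achievable with total cost <= b.
    best = [(0, [])] * (budget + 1)
    for name, gain, cost in patterns:
        # Walk budgets downward so each pattern is used at most once.
        for b in range(budget, cost - 1, -1):
            candidate_gain = best[b - cost][0] + gain
            if candidate_gain > best[b][0]:
                best[b] = (candidate_gain, best[b - cost][1] + [name])
    return best[budget]


if __name__ == "__main__":
    # Hypothetical candidates: (name, estimated cycles saved, resource cost).
    candidates = [("p1", 18, 5), ("p2", 9, 3), ("p3", 10, 4)]
    print(select_patterns(candidates, budget=7))
```

With these made-up numbers, two smaller patterns together beat the single largest one, which is the kind of trade-off a greedy pick by gain alone would miss.
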

  • Custom Instruction Selection Result

    Example application: Rician Denoising

    Selected pattern example

    Pattern                          #inst   freq   in-loop
    (x1-x2)^2 + (x3-x4)^2 + x5       7       3      yes
    ((x1-x2)*x3+x4)*x5               4       3      yes

  • AutoPilot Compilation Tool (based on UCLA xPilot system)

    • Platform-based C to FPGA synthesis
    • Synthesize pure ANSI-C and C++, GCC-compatible compilation flow
    • Full support of IEEE-754 floating point data types & operations
    • Efficiently handle bit-accurate fixed-point arithmetic
    • More than 10X design productivity gain
    • High quality-of-results

    [Figure: AutoPilot™ flow. Inputs: design specification (C/C++/SystemC), user constraints, and a platform characterization library. Stages: compilation & elaboration; presynthesis optimizations; behavioral & communication synthesis and optimizations. Outputs: timing/power/layout constraints, RTL HDLs & RTL SystemC, targeting an FPGA co-processor. Side labels: common testbench; simulation, verification, and prototyping; ESL synthesis]

  • Acceleration of Lithographic Simulation [FPGA'08]

    Lithography simulation

    • Simulate the optical imaging process
    • Computationally intensive; very slow for full-chip simulation

    $I(x,y) = \sum_k \lambda_k \ast \left| \sum \tau \left[ \psi_k(x-x_1, y-y_1) - \psi_k(x-x_2, y-y_1) + \psi_k(x-x_2, y-y_2) - \psi_k(x-x_1, y-y_2) \right] \right|^2$

    Flow: Algorithm in C → AutoPilot™ Synthesis Tool

    • 15X+ performance improvement vs. AMD Opteron 2.2GHz processor
    • Close to 100X improvement in energy efficiency
    • 15W in FPGA compared with 86W in Opteron

    XtremeData X1000 development system (AMD Opteron + Altera StratixII EP2S180)
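The "close to 100X" energy-efficiency figure follows from the speedup and the two power numbers, since energy is power multiplied by time. The sketch below shows that arithmetic under two assumptions that the slide does not state: that 15 W and 86 W are average power draws during the run, and that the 15X speedup is for the same workload. The function name is hypothetical.

```python
def energy_efficiency_gain(speedup, baseline_watts, accelerated_watts):
    """Ratio of baseline energy to accelerated energy for the same work.

    Energy = power * time, and the accelerated run takes 1/speedup of the
    baseline time, so the gain is speedup * (baseline power / accelerated power).
    """
    return speedup * baseline_watts / accelerated_watts


if __name__ == "__main__":
    # Slide values: 15X+ speedup, 86 W (Opteron) vs. 15 W (FPGA).
    gain = energy_efficiency_gain(15, 86, 15)
    print(f"energy efficiency gain: about {gain:.0f}X")
```

With exactly 15X speedup this gives 86X; since the slide quotes "15X+", a somewhat higher speedup would push the result toward 100X.
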

  • Why an Expedition

    Address a fundamental problem: energy efficient computing

    • What's beyond parallelization?
    • Our proposal: a transformative approach using customization

    Many challenging research topics, e.g.

    • Domain-specific modeling/specification
    • Novel architecture & microarchitecture for customization
    • Compilation and runtime software to support intelligent customization
    • New research in testing, verification, reliability in customizable computing

    Highly integrated effort: coordinated cross-layer customization in modeling, HW, SW, & application development

    Demonstration in a critical application domain

    • Healthcare has a significant impact on economy and society
    • Can greatly benefit from customizable domain-specific computing

  • Excerpts from NSF Press Release on 10/6/2009