Center for Domain-Specific Computing...
TRANSCRIPT
-
Center for Domain-Specific Computing
Supported by NSF “Expedition in Computing” Program
www.cdsc.ucla.edu
Director: Prof. Jason Cong, [email protected]
Participating Universities: UCLA (lead), Rice, Ohio-State, and UC Santa Barbara
(Complete list of faculty members available inside)
1
-
Focus: New Transformative Approach to Power/Energy Efficient Computing
Current Solution: Parallelization
[Figure: Parallelization. Source: Shekhar Borkar, Intel]
2
-
Cost and Energy are Still a Big Issue …
Cost of computing
•HW acquisition
•Energy bill
•Heat removal
•Space
•…
3
-
Focus: New Transformative Approach to Power/Energy Efficient Computing
Parallelization
Customization
Adapt the architecture to the application domain
Source: Shekhar Borkar, Intel
4
-
Motivation
A few facts
We have sufficient computing power for most applications
Each user/enterprise needs high computing power for only selected tasks in its domain
Application-specific integrated circuits (ASICs) can lead to 10,000X+ better power performance efficiency, but are too expensive to design and manufacture
Our proposal
A general, customizable platform for the given domain(s)
• Can be customized to a wide range of applications in the domain
• Can be mass-produced with cost efficiency
• Can be programmed efficiently with novel compilation and runtime systems
Goal: A “supercomputer-in-a-box” with 100X performance/power improvement via customization for the intended domain(s)
5
-
Justification 1 – Potential of Customization
AES, 128-bit key, 128-bit data

Platform               Throughput       Power     Figure of Merit (Gb/s/W)
0.18μm CMOS            3.84 Gbits/sec   350 mW    11 (1/1)
FPGA [1]               1.32 Gbit/sec    490 mW    2.7 (1/4)
ASM StrongARM [2]      31 Mbit/sec      240 mW    0.13 (1/85)
ASM Pentium III [3]    648 Mbits/sec    41.4 W    0.015 (1/800)
C Emb. Sparc [4]       133 Kbits/sec    120 mW    0.0011 (1/10,000)
Java [5] Emb. Sparc    450 bits/sec     120 mW    0.0000037 (1/3,000,000)

[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
[2] Dag Arne Osvik: 544 cycles AES – ECB on StrongArm SA-1110
[3] Helger Lipmaa PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet
[4] gcc, 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 u CMOS
[5] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 u CMOS
Source: P. Schaumont and I. Verbauwhede, "Domain specific codesign for embedded security," IEEE Computer 36(4), 2003.
6
-
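The Figure of Merit column in the AES comparison on slide 6 is throughput divided by power, in Gb/s/W. The following is a minimal sketch of that arithmetic using the throughput and power figures listed on the slide. The function and variable names are illustrative assumptions, and the ratios it prints can differ slightly from the rounded "1/N" values shown on the slide.

```python
# Throughput (bits/s) and power (W), copied from the slide's AES table.
PLATFORMS = [
    ("0.18um CMOS", 3.84e9, 0.350),
    ("FPGA", 1.32e9, 0.490),
    ("ASM StrongARM", 31e6, 0.240),
    ("ASM Pentium III", 648e6, 41.4),
    ("C Emb. Sparc", 133e3, 0.120),
    ("Java Emb. Sparc", 450.0, 0.120),
]


def figure_of_merit(bits_per_sec, watts):
    """Return throughput per watt in Gb/s/W."""
    return bits_per_sec / 1e9 / watts


# The first row (custom CMOS) is the reference point for the relative gap.
best = figure_of_merit(PLATFORMS[0][1], PLATFORMS[0][2])

for name, bps, watts in PLATFORMS:
    fom = figure_of_merit(bps, watts)
    print(f"{name:16s} {fom:.7f} Gb/s/W  (1/{best / fom:,.0f})")
```

Dividing the reference figure of merit by each row's value gives the efficiency gap relative to the custom CMOS implementation, which is the quantity the slide uses to argue for customization.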
Justification 2 -- Advance of Civilization
For human brain, Moore’s Law scaling has long stopped
The number of neurons and their firing speed did not change significantly
Remarkable advancement of civilization via specialization
More advanced societies have higher degree of specialization
7
-
Application Domains: Medical Image Processing & Hemodynamic Simulation
Medical imaging has transformed healthcare
An in vivo method for understanding disease development and patient condition
Estimated to be $100 billion/year
More powerful & efficient computation can help
• Fewer exposures using compressive sensing
• Better clinical assessment (e.g., for cancer) using improved registration and segmentation algorithms
Hemodynamic simulation
Very useful for surgical procedures involving blood flow and vasculature
Both may take hours to days to construct
Clinical requirement: 1-2 min
[Figure: Magnetic resonance (MR) angiograph of an aneurysm]
[Figure: Intracranial aneurysm reconstruction with hemodynamics]
8
-
Medical Image Processing Pipeline
[Figure: medical image processing pipeline, beginning with compressive sensing and reconstruction]
-
Need of Customization for Medical Image Processing Pipeline
reconstruction: compressive sensing
  iterative, local or global communication
  dense and sparse linear algebra, optimization methods
denoising: total variational algorithm
  Non-iterative, highly parallel, local & global communication
  sparse linear algebra, structured grid, optimization methods
registration: fluid registration
  parallel, global communication
  dense linear algebra, optimization methods
segmentation: level set methods
  local communication
  dense linear algebra, spectral methods, MapReduce
analysis: Navier-Stokes equations
  local communication
  sparse linear algebra, n-body methods, graphical models
• These algorithms have diverse computation & communication patterns
• A single, homogeneous system cannot perform very well on all of these algorithms
• Need architecture customization and hardware-software co-optimization
• Include many common computation kernels (“motifs”)
• Applicable to other domains
10
-
11
-
Center for Domain-Specific Computing (CDSC) Organization
• A diversified & highly accomplished faculty team: 8 in CS&E; 1 in EE; 2 in medical school; 1 in applied math
• 15-20 postdocs and graduate students in four universities – UCLA, Rice, Ohio-State, and UC Santa Barbara
Aberle (UCLA)
Baraniuk (Rice)
Bui (UCLA)
Cong (Director) (UCLA)
Cheng (UCSB)
Chang (UCLA)
Reinman (UCLA)
Palsberg (UCLA)
Sadayappan (Ohio-State)
Sarkar (Associate Dir) (Rice)
Vese (UCLA)
Potkonjak (UCLA)
12
-
Overview of the Proposed Research
Customizable Heterogeneous Platform (CHP)
[Figure: CHP block diagram with caches ($), fixed cores, custom cores, programmable fabric, DRAM, I/O, and links to other CHPs, connected by a reconfigurable RF-I bus and a reconfigurable optical bus (transceiver/receiver, optical interface)]
Domain-specific modeling (healthcare applications)
CHP creation
Customizable computing engines
Customizable interconnects
Architecture modeling
CHP mapping
Source-to-source CHP mapper
Reconfiguring & optimizing backend
Adaptive runtime
Customization setting
Design once; invoke many times
13
-
CHP Creation – Design Space Exploration
Core parameters
Frequency & voltage
Datapath bit width
Instruction window size
Issue width
Cache size & configuration
Register file organization
# of thread contexts
…
NoC parameters
Interconnect topology
# of virtual channels
Routing policy
Link bandwidth
Router pipeline depth
Number of RF-I enabled routers
RF-I channel and bandwidth allocation
…
Custom instructions & accelerators
Amount of programmable fabric
Shared vs. private accelerators
Custom instruction selection
Choice of accelerators
…
[Figure: Customizable Heterogeneous Platform (CHP) block diagram with caches ($), fixed cores, custom cores, programmable fabric, reconfigurable RF-I bus, reconfigurable optical bus, transceiver/receiver, optical interface]
Key questions: Optimal trade-off between efficiency & customizability. Which options to fix at CHP creation? Which to be set by CHP mapper?
14
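The key questions on this slide (which options to fix at CHP creation, and which to leave to the CHP mapper) can be illustrated with a toy exhaustive search. This is only a sketch under loud assumptions: the parameter values, the split between creation-time and mapping-time parameters, and the `toy_efficiency` model are all invented for illustration and do not come from the CDSC project.

```python
from itertools import product

# Hypothetical values: the slide names these parameters but gives no ranges.
CREATION_TIME = {"issue_width": [2, 4], "cache_kb": [32, 64]}
MAPPING_TIME = {"freq_ghz": [1.0, 2.0], "rf_i_channels": [1, 2]}


def configs(space):
    """Yield every combination of parameter values as a dict."""
    names = list(space)
    for values in product(*(space[n] for n in names)):
        yield dict(zip(names, values))


def toy_efficiency(cfg, app_parallelism):
    """Made-up performance/power model, for illustration only."""
    perf = (cfg["freq_ghz"]
            * min(cfg["issue_width"], app_parallelism)
            * cfg["rf_i_channels"] ** 0.5)
    power = (cfg["freq_ghz"] ** 2 * cfg["issue_width"]
             + 0.01 * cfg["cache_kb"]
             + 0.2 * cfg["rf_i_channels"])
    return perf / power


def explore(apps):
    """Pick the creation-time design whose per-application tuning scores best."""
    best = None
    for hw in configs(CREATION_TIME):
        # For a fixed hardware design, the mapper tunes its own knobs per app.
        total = sum(
            max(toy_efficiency({**hw, **sw}, par) for sw in configs(MAPPING_TIME))
            for par in apps
        )
        if best is None or total > best[0]:
            best = (total, hw)
    return best


score, hardware = explore(apps=[1, 4])
print(hardware, round(score, 3))
```

The outer loop stands in for "design once" decisions and the inner `max` for "invoke many times" decisions. A real exploration would replace the toy model with simulation results and would not enumerate the space exhaustively.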
-
CDSC Simulation Framework
15
-
CHP Mapping – Compilation and Runtime Software Systems for Customization
Goals: Efficient mapping of domain-specific specification to customizable hardware
– Adapt the CHP to a given application for drastic performance/power efficiency improvement
[Figure: CHP mapping flow, with the following elements]
Domain-specific applications
Programmer
Abstract execution
Domain-specific programming model (Domain-specific coordination graph and domain-specific language extensions)
Application characteristics
CHP architecture models
Source-to-source CHP Mapper
C/C++ code
Analysis annotations
C/SystemC behavioral spec
C/C++ front-end
RTL Synthesizer (xPilot)
Reconfiguring and optimizing back-end
Performance feedback
Binary code for fixed & customized cores
Customized target code
RTL for programmable fabric
Adaptive runtime: Lightweight threads and adaptive configuration
CHP architectural prototypes (CHP hardware testbeds, CHP simulation testbed, full CHP)
16
-
ASIP Compilation Flow [FPGA’04]
[Figure: compilation flow. C code and μArch constraint → Front-end compilation → CDFG → 1. Pattern generation, 2. Pattern selection (with Pattern library) → 3. Application mapping & Graph covering → Optimized CDFG → Backend compilation → Optimized assembly]
Pattern Generation
Satisfying input/output constraints
Pattern Selection
Select a subset to maximize the potential speedup while satisfying the resource constraint
Application Mapping
Graph covering to minimize the total execution time
17
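The pattern selection step on this slide (choose a subset of candidate patterns that maximizes potential speedup under a resource constraint) has the shape of a 0/1 knapsack problem when each pattern's gain is treated as independent. The sketch below uses that simplification. It is not the algorithm from the FPGA'04 work, and the pattern names, area costs, and cycle savings are hypothetical.

```python
def select_patterns(patterns, area_budget):
    """0/1 knapsack over candidate custom-instruction patterns.

    patterns: list of (name, area, saved_cycles) with integer area.
    Returns (total saved cycles, list of chosen pattern names).
    """
    # best[a] holds the best (savings, chosen names) using at most area a.
    best = [(0, [])] * (area_budget + 1)
    for name, area, gain in patterns:
        # Walk the budget downward so each pattern is used at most once.
        for a in range(area_budget, area - 1, -1):
            candidate = best[a - area][0] + gain
            if candidate > best[a][0]:
                best[a] = (candidate, best[a - area][1] + [name])
    return best[area_budget]


# Hypothetical candidates: (name, area units, cycles saved).
candidates = [
    ("sq_diff_sum", 5, 21),
    ("mul_add_chain", 3, 12),
    ("abs_diff", 2, 4),
]
print(select_patterns(candidates, area_budget=8))
```

In practice the gains of overlapping patterns interact, because two patterns may cover the same operations in the CDFG, which is why the flow on the slide follows selection with a separate graph-covering step.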
-
Custom Instruction Selection Result
Example Application: Rician Denoising
Selected pattern example

Pattern                       #inst   freq   in-loop
(x1-x2)^2 + (x3-x4)^2 + x5    7       3      yes
((x1-x2)*x3+x4)*x5            4       3      yes
18
-
AutoPilot Compilation Tool (based on UCLA xPilot system)
Platform-based C to FPGA synthesis
Synthesize pure ANSI-C and C++, GCC-compatible compilation flow
Full support of IEEE-754 floating point data types & operations
Efficiently handle bit-accurate fixed-point arithmetic
More than 10X design productivity gain
High quality-of-results
[Figure: AutoPilot(TM) flow. Design Specification (C/C++/SystemC) and User Constraints → Compilation & Elaboration → Presynthesis Optimizations → Behavioral & Communication Synthesis and Optimizations (with Platform Characterization Library) → RTL HDLs & RTL SystemC and Timing/Power/Layout Constraints → FPGA Co-Processor. Side labels: Common Testbench; Simulation, Verification, and Prototyping; ESL Synthesis]
19
-
Acceleration of Lithographic Simulation [FPGA’08]
Lithography simulation
Simulate the optical imaging process
Computationally intensive; very slow for full-chip simulation
$$I(x,y) = \sum_{k} \lambda_k \left| \sum_{\tau} \left[ \psi_k(x-x_1, y-y_1) - \psi_k(x-x_2, y-y_1) + \psi_k(x-x_2, y-y_2) - \psi_k(x-x_1, y-y_2) \right] \right|^2$$
Algorithm in C → AutoPilot(TM) Synthesis Tool
15X+ Performance Improvement vs. AMD Opteron 2.2GHz Processor
Close to 100X improvement on energy efficiency
15W in FPGA compared with 86W in Opteron
XtremeData X1000 development system (AMD Opteron + Altera StratixII EP2S180)
20
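The intensity formula on this slide sums, over kernels k, a weight λk times the squared magnitude of a four-corner combination of ψk, accumulated over a set of terms τ. The sketch below reads each τ as a mask rectangle with corners (x1, y1) and (x2, y2); that reading, the function name, and the stand-in kernel are assumptions for illustration, not the accelerated implementation.

```python
def intensity(x, y, rectangles, kernels):
    """Evaluate the image intensity at point (x, y).

    rectangles: list of (x1, y1, x2, y2) corner coordinates.
    kernels: list of (lambda_k, psi_k), where psi_k(u, v) returns a
             real or complex number.
    """
    total = 0.0
    for lam, psi in kernels:
        acc = 0.0
        for x1, y1, x2, y2 in rectangles:
            # Four-corner combination with the same signs as the slide.
            acc += (psi(x - x1, y - y1) - psi(x - x2, y - y1)
                    + psi(x - x2, y - y2) - psi(x - x1, y - y2))
        total += lam * abs(acc) ** 2
    return total


# Stand-in kernel psi(u, v) = u * v: the four-corner sum then equals the
# rectangle's area, which makes the result easy to check by hand.
print(intensity(1.0, 1.0, [(0, 0, 2, 3)], [(0.5, lambda u, v: u * v)]))
```

The inner loop performs four kernel evaluations per rectangle per kernel for every image point, which gives a sense of why the slide calls full-chip simulation computationally intensive.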
-
Why an Expedition
Address a fundamental problem – energy efficient computing
What’s beyond parallelization?
Our proposal – a transformative approach using customization
Many challenging research topics, e.g.
Domain-specific modeling/specification
Novel architecture & microarchitecture for customization
Compilation and runtime software to support intelligent customization
New research in testing, verification, reliability in customizable computing
Highly integrated effort – coordinated cross-layer customization in modeling, HW, SW, & application development
Demonstration in a critical application domain
Healthcare has a significant impact on economy and society
Can greatly benefit from customizable domain-specific computing
21
-
22
Excerpts from NSF Press Release on 10/6/2009