Era of Customization and Specialization - uclacadlab.cs.ucla.edu/~cong/slides/carl12_2.pdf
TRANSCRIPT
Era of Customization and Specialization
Jason Cong
Chancellor's Professor, UCLA Computer Science Department
[email protected]
Director, Center for Domain-Specific Computing
www.cdsc.ucla.edu
Power Barrier and Current Solution
• 10s to 100s of cores in a processor
• 1000s to 10,000s of servers in a data center
Parallelization
Utilization Wall [G. Venkatesh et al., ASPLOS'10]
♦ Assuming an 80W power budget, at a 45 nm TSMC process, less than 7% of a 300mm² die can be switched.
♦ ITRS roadmap and CMOS scaling theory: less than 3.5% at 32 nm
Almost halves with each process generation
Even further with 3-D integration.
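The halving-per-generation trend can be sketched as a back-of-the-envelope calculation. The 7%-at-45nm starting point comes from the slide; the node sequence below is only an illustrative assumption:

```python
# Back-of-the-envelope: fraction of the die that can be switched on,
# assuming it roughly halves with each process generation
# (starting from ~7% at 45 nm, per the utilization-wall estimate).
nodes = [45, 32, 22, 16, 11]  # nm; illustrative generation sequence
fraction = 0.07               # ~7% switchable at 45 nm
for node in nodes:
    print(f"{node:>2} nm: ~{fraction * 100:.2f}% of the die can be switched")
    fraction /= 2             # utilization-wall trend: ~halves per generation
```

At 32 nm this gives ~3.5%, matching the slide's estimate.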
Dark Silicon and the End of Multicore Scaling [H. Esmaeilzadeh et al., ISCA'11]
♦ Power wall: at 22 nm, 31% of a fixed-size chip must be powered off
At 8 nm, more than 50%.
♦ A significant gap between what is achievable and what is expected by Moore's Law (figure: percent dark silicon, geomean)
Due to power and parallelism limitations
Speedup gap of at least 22x at 8 nm technology
Next Big Opportunity – Customization and Specialization
Parallelization
Customization
Adapt the architecture to the application domain
Potential of Customization/Specialization

AES, 128-bit key, 128-bit data:

                        Throughput        Power     Figure of Merit (Gb/s/W)
0.18µm CMOS             3.84 Gbits/sec    350 mW    11 (1/1)
FPGA [1]                1.32 Gbit/sec     490 mW    2.7 (1/4)
ASM StrongARM [2]       31 Mbit/sec       240 mW    0.13 (1/85)
ASM Pentium III [3]     648 Mbits/sec     41.4 W    0.015 (1/800)
C, Emb. Sparc [4]       133 Kbits/sec     120 mW    0.0011 (1/10,000)
Java [5], Emb. Sparc    450 bits/sec      120 mW    0.0000037 (1/3,000,000)

[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
[2] Dag Arne Osvik: 544-cycle AES – ECB on StrongArm SA-1110
[3] Helger Lipmaa, PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet
[4] gcc, 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 u CMOS
[5] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc – assumes 0.25 u CMOS

Source: P. Schaumont and I. Verbauwhede, "Domain-specific codesign for embedded security," IEEE Computer 36(4), 2003
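The figure of merit in the table is simply throughput divided by power; a quick sketch reproduces a couple of the table's ratios (all numbers are taken from the table above):

```python
# Figure of merit = throughput / power, in Gb/s per watt.
def fom(throughput_gbps, power_w):
    return throughput_gbps / power_w

cmos = fom(3.84, 0.350)    # 0.18um CMOS ASIC: ~11 Gb/s/W
fpga = fom(1.32, 0.490)    # FPGA:             ~2.7 Gb/s/W
arm  = fom(0.031, 0.240)   # ASM on StrongARM: ~0.13 Gb/s/W
print(f"CMOS {cmos:.1f}, FPGA {fpga:.2f}, StrongARM {arm:.3f} Gb/s/W")
print(f"CMOS is ~{cmos / fpga:.0f}x the FPGA and ~{cmos / arm:.0f}x the StrongARM")
```

The computed ratios match the (1/4) and (1/85) columns of the table.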
Another Example of Specialization – Advance of Civilization
♦ For the human brain, Moore's Law scaling has long stopped: the number of neurons and their firing speed did not change significantly
♦ Remarkable advancement of civilization via specialization
♦ More advanced societies have a higher degree of specialization
♦ Achieved on a common platform!
Example of Customizable Platforms: FPGAs
Configurable logic blocks
Island-style configurable mesh routing
Dedicated components – Memory/Multiplier, I/O, Processor, anything that the FPGA architect wants to put in!
Specialization allows optimization
Source: I. Kuon, R. Tessier, J. Rose. FPGA Architecture: Survey and Challenges. 2008.
More Opportunities for Customization to be Explored
Our Proposal: Customizable Heterogeneous Platform (CHP)

Core parameters: frequency & voltage, datapath bit width, instruction window size, issue width, register file organization, # of thread contexts, …

Cache parameters: cache size & configuration, cache vs. SPM, …

NoC parameters: interconnect topology, # of virtual channels, routing policy, link bandwidth, router pipeline depth, number of RF-I enabled routers, RF-I channel and bandwidth allocation, …

Custom instructions & accelerators: shared vs. private accelerators, choice of accelerators, custom instruction selection, amount of programmable fabric, …

(Diagram: fixed cores, custom cores, programmable fabric, accelerators, and caches ($), connected by a reconfigurable RF-I bus and a reconfigurable optical bus with transceiver/receiver and optical interface)

Key questions: the optimal trade-off between efficiency & customizability. Which options to fix at CHP creation? Which to be set by the CHP mapper?
Research Scope in CDSC (Center for Domain-Specific Computing)

Customizable Heterogeneous Platform (CHP): fixed cores, custom cores, programmable fabric, accelerators, caches ($), DRAM, I/O; reconfigurable RF-I bus, reconfigurable optical bus with transceiver/receiver and optical interface; multiple CHPs

Domain-specific modeling (healthcare applications)

CHP mapping: source-to-source CHP mapper, reconfiguring & optimizing backend, adaptive runtime, customization setting

CHP creation: customizable computing engines, customizable interconnects, architecture modeling

Design once – invoke many times
Examples of Energy-Efficient Customization
♦ Customization of processor cores
♦ Customization of on-chip memory
♦ Customization of on-chip interconnects
Extensive Use of Accelerators
♦ Accelerators provide high power-efficiency over general-purpose processors (e.g., the IBM wire-speed processor, Intel Larrabee)
♦ ITRS 2007 system drivers prediction: accelerator count close to 1500 by 2022
♦ Two kinds of accelerators
Tightly coupled – part of the datapath
Loosely coupled – shared via NoC
♦ Challenges
Accelerator extraction and synthesis
Efficient accelerator management
• Scheduling
• Sharing
• Virtualization …
Friendly programming models
Architecture Support for Accelerator-Rich CMPs (ARC) [DAC'2012]

Motivation
(Diagram: App → OS → Accelerator Manager → Accelerator, all mediated by the CPU)

Operation latency (# cycles):

Operation   1 core    2 cores   4 cores   8 cores   16 cores
Invoke      214413    256401    266133    308434    316161
RD/WR       703       725       781       837       885

Invoke "open"s the driver and returns the handle to the driver; called once.
RD/WR is called multiple times.

♦ Managing accelerators through the OS is expensive
♦ In an accelerator-rich CMP, management should be cheaper in both time and energy
Overall Architecture of ARC
♦ Architecture of ARC
Multiple cores and accelerators
Global Accelerator Manager (GAM)
Shared L2 cache banks and NoC routers between multiple accelerators
(Diagram: each accelerator shares a DMA + SPM block and a NoC router; cores with L2 $ banks, a shared memory controller, and the GAM)
Overall Communication Scheme in ARC

New ISA:
lcacc-req t
lcacc-rsrv t, e
lcacc-cmd id, f, addr
lcacc-free id

(Diagram: CPU ↔ GAM ↔ LCA, with the task description passed through memory)

1. The core requests a given type of accelerator (lcacc-req).
2. The GAM responds with a "list + waiting time" or a NACK.
3. The core reserves an accelerator (lcacc-rsrv) and waits.
4. The GAM ACKs the reservation and sends the core ID to the accelerator.
5. The core shares a task description with the accelerator through memory and starts it (lcacc-cmd).
• The task description consists of:
  o Function ID and input parameters
  o Input/output addresses and strides
6. The accelerator reads the task description and begins working.
• Overlapped read/write from/to memory and compute
• Interrupts the core on a TLB miss
7. When the accelerator finishes its current task, it notifies the core.
8. The core then sends a message to the GAM freeing the accelerator (lcacc-free).
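The lcacc-* handshake above can be sketched as a toy message exchange. The class and method names below are illustrative stand-ins for the hardware protocol, not part of ARC:

```python
# Toy model of the ARC core/GAM/accelerator handshake (steps 1-8).
# All names here are illustrative; the real mechanism is the lcacc-* ISA.
class GAM:
    def __init__(self, accelerators):
        self.free = dict(accelerators)      # type -> list of free accelerator ids

    def req(self, acc_type):                # steps 1/2: request -> list or NACK
        ids = self.free.get(acc_type, [])
        return ("LIST", ids, 0) if ids else ("NACK",)

    def rsrv(self, acc_type):               # steps 3/4: reserve -> ACK + id
        return ("ACK", self.free[acc_type].pop(0))

    def freeacc(self, acc_type, acc_id):    # step 8: release the accelerator
        self.free[acc_type].append(acc_id)

def run_task(gam, acc_type, task_description):
    reply = gam.req(acc_type)               # lcacc-req
    if reply[0] == "NACK":
        return None                         # no accelerator of this type
    _, acc_id = gam.rsrv(acc_type)          # lcacc-rsrv
    # steps 5-7: share the task description via memory, start (lcacc-cmd),
    # the accelerator works, then notifies the core -- modeled as a direct call
    result = ("done", acc_id, task_description)
    gam.freeacc(acc_type, acc_id)           # lcacc-free
    return result

gam = GAM({"fft": [0, 1]})
print(run_task(gam, "fft", {"f": "fft-1d", "addr": 0x1000}))
```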
Accelerator Chaining and Composition
♦ Chaining
Efficient accelerator-to-accelerator communication (Accelerator1 scratchpad → Accelerator2 scratchpad, via their DMA controllers)
♦ Composition
Constructing virtual accelerators
(Diagram: M-point 1D FFT units virtualized into an N-point 2D FFT, and further into a 3D FFT)
Accelerator Virtualization
♦ Application programmer or compilation framework selects high-level functionality
♦ Implementation via
Monolithic accelerator
Distributed accelerators composed to a virtual accelerator
Software decomposition libraries
♦ Example: implementing a 4x4 2-D FFT using 2 4-point 1-D FFTs
Step 1: 1-D FFT on Row 1 and Row 2
Step 2: 1-D FFT on Row 3 and Row 4
Step 3: 1-D FFT on Col 1 and Col 2
Step 4: 1-D FFT on Col 3 and Col 4
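The four steps above are the classic row-column decomposition of a 2-D FFT. A minimal sketch, using a naive 4-point DFT as a stand-in for one 1-D FFT accelerator:

```python
import cmath

# Row-column decomposition of a 4x4 2-D FFT from 4-point 1-D FFTs,
# mirroring the four steps above (rows first, then columns).
def fft4(x):
    # Naive 4-point 1-D DFT; stands in for one 1-D FFT accelerator.
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft2d_4x4(a):
    # Steps 1-2: 1-D FFT on every row (the slides do two rows per pass)
    rows = [fft4(row) for row in a]
    # Steps 3-4: 1-D FFT on every column of the intermediate result
    cols = [fft4([rows[i][j] for i in range(4)]) for j in range(4)]
    return [[cols[j][i] for j in range(4)] for i in range(4)]  # transpose back

a = [[complex(i + j) for j in range(4)] for i in range(4)]
out = fft2d_4x4(a)
print(out[0][0])   # DC term = sum of all 16 inputs
```

Composing the same two 4-point units across more passes extends this to larger 2-D or 3-D FFTs, as in the composition slide.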
Light-Weight Interrupt Support

(Diagram: CPU with a light-weight interrupt (LW) unit, GAM, and LCA)

♦ Sent by the GAM: request/reserve confirmation and NACK
♦ Sent by the LCA: TLB miss and task done

♦ The core sends logical addresses to the LCA
The LCA keeps a small TLB for the addresses that it is working on
Why logical addresses?
1 - Accelerators can work on irregular addresses (e.g. indirect addressing)
2 - Using a large page size could be a solution, but would affect other applications

♦ It is expensive to handle the interrupts via the OS:

Operation    Latency to switch to ISR and back (# cycles)
             1 core   2 cores   4 cores   8 cores   16 cores
Interrupt    16 K     20 K      24 K      27 K      29 K

♦ Solution: extend the core with light-weight interrupt support
Two main components added:
• A table to store ISR info
• An interrupt controller to queue and prioritize incoming interrupt packets
Each thread registers: the address of the ISR, its arguments, and the lw-int source
Limitations:
• Can only be used when running the same thread that the LW interrupt belongs to
• OS-handled interrupt otherwise
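The two added components (ISR table plus prioritizing interrupt controller) can be sketched as a toy model; the field names and priority scheme below are illustrative, not the hardware design:

```python
import heapq

# Sketch of the light-weight interrupt machinery described above:
# an ISR table plus a controller that queues and prioritizes
# incoming interrupt packets. Names/priorities are illustrative.
class LWInterruptController:
    def __init__(self):
        self.isr_table = {}   # (thread, source) -> (isr, args)
        self.pending = []     # priority queue of interrupt packets

    def register(self, thread, source, isr, args):
        # each thread registers its ISR address/arguments per lw-int source
        self.isr_table[(thread, source)] = (isr, args)

    def raise_interrupt(self, priority, thread, source):
        heapq.heappush(self.pending, (priority, thread, source))

    def dispatch(self, running_thread):
        # The LW path only applies to the currently running thread;
        # anything else falls back to an OS-handled interrupt.
        while self.pending:
            prio, thread, source = heapq.heappop(self.pending)
            if thread != running_thread:
                return ("os-handled", thread, source)
            isr, args = self.isr_table[(thread, source)]
            isr(*args)
        return ("idle",)

log = []
ctrl = LWInterruptController()
ctrl.register(thread=0, source="task-done", isr=log.append, args=("lca done",))
ctrl.raise_interrupt(priority=1, thread=0, source="task-done")
print(ctrl.dispatch(running_thread=0), log)
```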
Programming Interface to ARC
Platform creation
Application mapping & development
Evaluation Methodology
♦ Benchmarks
Medical imaging
Vision & navigation
Application Domain: Medical Image Processing

(Pipeline: reconstruction → denoising → registration → segmentation → analysis)

♦ Reconstruction – compressive sensing: medical images exhibit sparsity, and can be sampled at a rate below the classical Nyquist–Shannon rate; total variational algorithm:
  min_u Σ_{sampled points} |A R(u) − S|² + λ Σ_{voxels} |grad u|
♦ Denoising – per-voxel weighted average over a neighborhood S:
  ∀ voxel i: u(i) = (1/Z_i) Σ_{j∈S} w_ij f(j), with weights w_ij = exp(−‖·‖² / σ²)
♦ Registration – fluid registration:
  μ Δv + (μ+η) ∇(∇·v) = −[T(x−u) − R(x)] ∇T(x−u), with ∂u/∂t + v·∇u = v
  and level set registration
♦ Segmentation – level set methods:
  ∂φ/∂t = F(φ, data) |∇φ| + λ div(∇φ/|∇φ|), surface = {voxels x : φ(x, t) = 0}
♦ Analysis – Navier–Stokes equations:
  ∂v_i/∂t + Σ_j v_j ∂v_i/∂x_j = −∂p/∂x_i + υ Σ_j ∂²v_i/∂x_j² + f_i(x, t)
Area Overhead

                           Core   NoC   L2      Deblur   Denoise   Segmentation   Registration
Number of instances/size   1      1     8 MB    4        4         4              4
Area (mm²)                 3.1    0.3   38.8    10.2     4.2       4.9            17.6
Percentage (%)             3.9    0.4   49.0    12.9     5.3       6.2            22.3

♦ AutoESL (from Xilinx) for C-to-RTL synthesis
♦ Synopsys for ASIC synthesis (32 nm Synopsys Educational library)
♦ CACTI for L2
♦ Orion for NoC
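The percentage row follows directly from the area row; a quick sketch reproduces it (component names and areas as in the table above):

```python
# Recompute the "Percentage (%)" row of the area table from the mm^2 row.
areas = {"Core": 3.1, "NoC": 0.3, "L2": 38.8, "Deblur": 10.2,
         "Denoise": 4.2, "Segmentation": 4.9, "Registration": 17.6}
total = sum(areas.values())   # ~79.1 mm^2 for the whole chip
for name, mm2 in areas.items():
    print(f"{name:<13} {mm2:>5.1f} mm^2  {100 * mm2 / total:5.1f} %")
```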
Experimental Results – Performance (N cores, N threads, N accelerators)

(Figure: speedup over SW-only; gain (X) vs. configuration (1, 2, 4, 8, 16) for Registration, Deblur, Denoise, Segmentation)
♦ Performance improvement over SW-only approaches: on average 168x, up to 380x

(Figure: speedup over OS-based; same configurations and benchmarks)
♦ Performance improvement over OS-based approaches: on average 51x, up to 292x
Experimental Results – Energy (N cores, N threads, N accelerators)

(Figure: energy gain over SW-only version; gain (X) vs. configuration (1, 2, 4, 8, 16) for Registration, Deblur, Denoise, Segmentation)
♦ Energy improvement over SW-only approaches: on average 241x, up to 641x

(Figure: energy gain over OS-based version; same configurations and benchmarks)
♦ Energy improvement over OS-based approaches: on average 17x, up to 63x
What are the Problems with ARC?
♦ Dedicated accelerators are inflexible
An LCA may be useless for new algorithms or new domains
Often under-utilized
LCAs contain many replicated structures
• Things like fp-ALUs, DMA engines, SPM
• Unused when the accelerator is unused
♦ We want flexibility and better resource utilization
Solution: CHARM
♦ Private SPM is wasteful
Solution: BiN
A Composable Heterogeneous Accelerator-Rich Microprocessor (CHARM) [ISLPED'12]
♦ Motivation
Tasks performed by accelerators tend to have a great deal of data parallelism
Variety of LCAs with possible overlap, the utilization of any particular LCA being somewhat sporadic
It is expensive to have both:
• Sufficient diversity of LCAs to handle the various applications
• Sufficient quantity of a particular LCA to handle the parallelism
Overlap in functionality
• LCAs can be built using a limited number of smaller, more general LCAs: accelerator building blocks (ABBs)
♦ Idea
Flexible accelerator building blocks (ABBs) that can be composed into accelerators
♦ Leverage economy of scale
Micro-Architecture of CHARM
♦ ABB
Accelerator building blocks (ABBs): primitive components that can be composed into accelerators
♦ ABB islands
Multiple ABBs
Shared DMA controller, SPM and NoC interface
♦ ABC
Accelerator Block Composer (ABC)
• Orchestrates the data flow between ABBs to create a virtual accelerator
• Arbitrates requests from cores
♦ Other components
Cores, L2 banks, memory controllers
An Example of ABB Library (for Medical Imaging)
(Figure: ABB library entries, including the internals of a Poly ABB)
Example of ABB Flow-Graph (Denoise)
(Figure: dataflow graph of the Denoise kernel, built from subtract, multiply, add, square-root, divide and reciprocal (1/x) operators)
Example of ABB Flow-Graph (Denoise)
(Figure: the same flow-graph covered by four ABBs; ABB1: Poly, ABB2: Poly, ABB3: Sqrt, ABB4: Inv)
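The flow-graph decomposition above can be sketched in code. This is a minimal illustration, not the authors' tooling: the node names and the simplified Denoise structure are assumptions, and the cover routine simply picks the first ABB type that supports each operation.

```python
# Each node: (operation, list of input node ids). The graph is a small,
# illustrative stand-in for the Denoise kernel's dataflow graph.
denoise_dfg = {
    "d0": ("sub", []), "d1": ("sub", []),  # differences of neighboring pixels
    "m0": ("mul", ["d0", "d0"]),           # squares
    "m1": ("mul", ["d1", "d1"]),
    "a0": ("add", ["m0", "m1"]),           # sum of squares
    "s0": ("sqrt", ["a0"]),                # gradient magnitude
    "r0": ("inv", ["s0"]),                 # 1/x
}

# ABB types and the operations each can implement
abb_types = {"Poly": {"sub", "mul", "add"}, "Sqrt": {"sqrt"}, "Inv": {"inv"}}

def cover(dfg):
    """Assign each node to the first ABB type that supports its op."""
    assignment = {}
    for node, (op, _) in dfg.items():
        for abb, ops in abb_types.items():
            if op in ops:
                assignment[node] = abb
                break
    return assignment

print(cover(denoise_dfg))
```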
LCA Composition Process
(Figure: four ABB islands, each containing two ABBs; island 1 holds x and y, island 2 holds x and w, island 3 holds z and w, island 4 holds y and z)
LCA Composition Process
1. Core initiation
- The core sends the task description, i.e., the task flow-graph of the desired LCA together with the polyhedral space for input and output, to the ABC
- (Example: a 10x10 input and output)
LCA Composition Process
2. Task-flow parsing and task-list creation
- The ABC parses the task-flow graph, breaks the request into a set of tasks with smaller data sizes, and fills the task list internally
- (Example: the needed ABBs are "x", "y" and "z"; with a task size of 5x5 blocks, the ABC generates 4 tasks)
LCA Composition Process
3. Dynamic ABB mapping
- The ABC uses a pattern-matching algorithm to assign ABBs to islands
- It fills the composed LCA table and the resource-allocation table

Resource-allocation table (before mapping):

  Island ID | ABB type | ABB ID | Status
  1         | x        | 1      | Free
  1         | y        | 1      | Free
  2         | x        | 1      | Free
  2         | w        | 1      | Free
  3         | z        | 1      | Free
  3         | w        | 1      | Free
  4         | y        | 1      | Free
  4         | z        | 1      | Free
LCA Composition Process
3. Dynamic ABB mapping (after assignment)
- After the first LCA is mapped, ABBs x and y in island 1 and ABB z in island 3 are marked Busy; all other ABBs remain Free
LCA Composition Process
4. LCA cloning
- Repeat to generate more LCAs if ABBs are available
- A second LCA is composed from ABB x in island 2 and ABBs y and z in island 4, which are now also marked Busy
LCA Composition Process
5. ABBs finishing tasks
- When an ABB finishes, it signals DONE to the ABC
- If the ABC has another task, it sends it; otherwise it frees the ABB
LCA Composition Process
5. ABBs being freed
- Once the cloned LCA's tasks complete, ABB x in island 2 and ABBs y and z in island 4 return to Free, while the first LCA's ABBs are still Busy
LCA Composition Process
6. Core notified of end of task
- When the LCA finishes, the ABC signals DONE to the core, and all ABBs return to Free
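The six composition steps above amount to bookkeeping over the resource-allocation table. The sketch below is a simplified software model: the table layout follows the slides, but the first-free matching policy is a stand-in for the ABC's actual pattern-matching algorithm.

```python
# Resource-allocation table: (island, abb_type) -> status
resource_table = {
    (1, "x"): "Free", (1, "y"): "Free",
    (2, "x"): "Free", (2, "w"): "Free",
    (3, "z"): "Free", (3, "w"): "Free",
    (4, "y"): "Free", (4, "z"): "Free",
}

def compose_lca(needed_types):
    """Step 3: map each needed ABB type onto a free ABB, or fail."""
    chosen = []
    for t in needed_types:
        slot = next((k for k, s in resource_table.items()
                     if k[1] == t and s == "Free"), None)
        if slot is None:            # not enough free ABBs: roll back
            for k in chosen:
                resource_table[k] = "Free"
            return None
        resource_table[slot] = "Busy"
        chosen.append(slot)
    return chosen

def free_lca(slots):
    """Steps 5-6: ABBs signal DONE and are freed."""
    for k in slots:
        resource_table[k] = "Free"

lca1 = compose_lca(["x", "y", "z"])   # first virtual accelerator
lca2 = compose_lca(["x", "y", "z"])   # step 4: LCA cloning
print(lca1, lca2)
free_lca(lca1); free_lca(lca2)
```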
ABC Internal Design
♦ ABC sub-components
- Resource Table (RT): keeps track of available/used ABBs
- Composed LCA Table (CLT): eliminates the need to re-compose LCAs
- Task List (TL): queues the broken-down LCA requests (at the smaller data sizes)
- TLB: services and shares the translation requests issued by the ABBs
- Task Flow-Graph Interpreter (TFGI): breaks the LCA DFG into ABBs
- LCA Composer (LC): composes the LCA using the available ABBs
♦ Implementation
- RT, CLT, TL and TLB are implemented using RAM
- The TFGI has a table of ABB types and an FSM that reads the task-flow graph and performs the comparisons
- The LC has an FSM that walks the CLT and RT and marks the available ABBs
(Figure: ABC block diagram, with the DFG Interpreter and LCA Composer between the cores and the ABBs, backed by the Composed LCA Table, Task List, Resource Table and TLB; DONE signals arrive from the ABBs, and TLB service responses go back to them)
CHARM Software Infrastructure
♦ ABB type extraction
- Input: compute-intensive kernels from different applications
- Output: ABB super-patterns
- Currently semi-automatic
♦ ABB template mapping
- Input: kernels + ABB types
- Output: covered kernels as an ABB flow-graph
♦ CHARM uProgram generation
- Input: ABB flow-graph
- Output: the CHARM uProgram
Evaluation Methodology
♦ Simics+GEMS-based simulation
♦ AutoPilot/Xilinx + Synopsys for ABB/ABC/DMA-C synthesis
♦ CACTI for memory (SPM) synthesis
♦ Automatic flow to generate the CHARM software and simulation modules
♦ Case studies
- Physical LCA sharing with a Global Accelerator Manager (LCA+GAM)
- Physical LCA sharing with the ABC (LCA+ABC)
- ABB composition and sharing with the ABC (ABB+ABC)
♦ Medical imaging benchmarks
- Denoise, Deblur, Segmentation and Registration
Area Overhead Analysis
♦ Area-equivalent comparison
- The total area consumed by the ABBs equals the total area of all LCAs required to run a single instance of each benchmark
♦ The total CHARM area is 14% of the 1cm x 1cm chip
- A bit less than the LCA-based design
Results: Improvement Over LCA-based Design
♦ The N'p' configuration has N cores, N threads and N times the area-equivalent accelerators
♦ Energy
- 2.4X vs. LCA+GAM (max 4.7X)
- 1.6X vs. LCA+ABC (max 3.1X)
♦ Performance
- 2.2X vs. LCA+GAM (max 3.8X)
- 1.6X vs. LCA+ABC (max 2.7X)
♦ ABB+ABC achieves better energy and performance
- The ABC starts composing ABBs to create new LCAs
- This creates more parallelism
(Figures: normalized performance and normalized energy of LCA+GAM, LCA+TD and ABB+TD for Seg, Deb, Reg and Den at 1p, 2p, 4p and 8p)
Results: Platform Flexibility
♦ Two applications from two domains unrelated to medical imaging
- Computer vision: Log-Polar Coordinate Image Patches (LPCIP)
- Navigation: Extended Kalman Filter-based Simultaneous Localization and Mapping (EKF-SLAM)
♦ Only one ABB is added
- Indexed vector load

  MAX benefit over LCA+GAM: 3.64X
  AVG benefit over LCA+GAM: 2.46X
  MAX benefit over LCA+ABC: 3.04X
  AVG benefit over LCA+ABC: 2.05X
Examples of Energy-Efficient Customization
♦ Customization of processor cores
♦ Customization of on-chip memory
♦ Customization of on-chip interconnects
Memory Management for Accelerator-Rich Architectures [ISLPED'12]
♦ Providing a private buffer for each accelerator is very inefficient
- Large private buffers occupy a considerable amount of chip area
- Small private buffers are less effective at reducing off-chip bandwidth
♦ Not all accelerators are powered on at the same time
- Shared buffer [Lyons et al., TACO'12]
- Allocate the buffers in the cache on demand [Fajardo et al., DAC'11][Cong et al., ISLPED'11]
♦ Our solution
- BiN: a Buffer-in-NUCA scheme for accelerator-rich CMPs
Buffer Size vs. Off-chip Memory Access Bandwidth
♦ Buffer size ↑, off-chip memory bandwidth ↓: a larger buffer covers a longer reuse distance [Cong et al., ICCAD'11]
♦ The buffer size vs. bandwidth curve: the BB-Curve
♦ Buffer utilization efficiency
- Differs across accelerators
- Differs across inputs for the same accelerator
♦ Prior work: no consideration of global allocation at runtime
- Accepts fixed-size buffer allocation requests
- Relies on the compiler to select a single, "best" point on the BB-Curve
(Figure: off-chip memory accesses vs. buffer size (9, 27, 119 and 693 KB) for Denoise with input images cube(28), cube(52) and cube(76); the steep region of the curve has high buffer utilization efficiency, the flat region low)
Resource Fragmentation
♦ Prior work allocates a contiguous space to each buffer to simplify buffer access
♦ Requested buffers have unpredictable space demands and arrive dynamically: resource fragmentation
♦ NUCA complicates buffer allocation in the cache
- The distance from the cache bank to the accelerator also matters
♦ To support fragmented resources: paged allocation
- Analogous to typical OS-managed virtual memory
♦ Challenges
- Large private page tables have high energy and area overhead
- Indirect access to a shared page table has high latency overhead
(Example: a 15KB shared buffer space; Buffer 1: 5KB, duration 1K cycles; Buffer 2: 5KB, duration 2K cycles; Buffer 3: 10KB, duration 2K cycles)
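The example above can be replayed in a few lines. This sketch assumes a first-fit contiguous allocator, which is enough to show the fragmentation: after Buffer 1 is freed (its 1K-cycle duration ends first), 10KB is free in total but no 10KB contiguous hole exists for Buffer 3.

```python
SPACE = 15  # KB of shared buffer space

def first_fit(free_list, size):
    """free_list: list of (start, length) holes. Returns a start or None."""
    for i, (start, length) in enumerate(free_list):
        if length >= size:
            free_list[i] = (start + size, length - size)
            return start
    return None

free = [(0, SPACE)]
b1 = first_fit(free, 5)        # Buffer 1 at 0..5
b2 = first_fit(free, 5)        # Buffer 2 at 5..10
# Buffer 1 finishes first and its 5KB hole is returned:
free.append((0, 5))
b3 = first_fit(free, 10)       # Buffer 3 needs 10KB *contiguous*
print(b1, b2, b3)              # b3 is None: 10KB free, but fragmented
```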
BiN: Buffer-in-NUCA
♦ Goals of Buffer-in-NUCA (BiN)
- Move toward optimal on-chip storage utilization
- Dynamically allocate buffer space in the NUCA among a large number of competing accelerators
♦ Contributions of BiN
- Dynamic interval-based global (DIG) buffer allocation: addresses buffer resource contention
- Flexible paged buffer allocation: addresses buffer resource fragmentation
Accelerator-Rich CMP with BiN
♦ Overall architecture of ARC [Cong et al., DAC 2011] with BiN
- Cores (with private L1 caches)
- Accelerators
  • Accelerator logic
  • DMA controller
  • A small storage for the control structure
- The accelerator and BiN manager (ABM)
  • Arbitration over accelerator resources
  • Allocates buffers in the shared cache (BiN management)
- NUCA (shared L2 cache) banks

(1) The core sends the accelerator and buffer-allocation request, with the BB-Curve, to the ABM.
(2) The ABM performs accelerator allocation and buffer allocation in NUCA, and acknowledges the core.
(3) The core sends the control structure to the accelerator.
(4) The accelerator starts working with its allocated buffer.
(5) The accelerator signals the core when it finishes.
(6) The core sends the free-resource message to the ABM.
(7) The ABM frees the accelerator and the buffer in NUCA.
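The seven steps can be traced as messages between the core, the ABM and the accelerator. This is a minimal sketch with illustrative names; the ABM's actual allocation decision in step (2) is stubbed out to picking the smallest BB-Curve point.

```python
trace = []

class ABM:
    """Toy accelerator-and-BiN manager: records the request lifecycle."""
    def request(self, core, bb_curve):
        trace.append("1: core -> ABM: accel + buffer request (BB-Curve)")
        buf = min(s for s, _ in bb_curve)   # stub for the real allocation
        trace.append("2: ABM: allocate accel + %dKB in NUCA, ack core" % buf)
        return buf

    def free(self):
        trace.append("6: core -> ABM: free-resource message")
        trace.append("7: ABM: free accel and buffer in NUCA")

abm = ABM()
buf = abm.request("core0", [(9, 100), (27, 40)])  # (size KB, bandwidth)
trace.append("3: core -> accel: control structure")
trace.append("4: accel: runs with its %dKB buffer" % buf)
trace.append("5: accel -> core: done")
abm.free()
print("\n".join(trace))
```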
Dynamic Interval-based Global (DIG) Allocation
♦ Perform global allocation for the buffer allocation requests gathered in an interval
- Keep the interval short (10K cycles): minimize the waiting-in-interval time
- If 8 or more buffer requests arrive, the DIG allocation starts immediately
♦ An example: 2 buffer allocation requests
- Each BB-Curve point is a pair (b_ij, s_ij)
  • s_ij: the j-th candidate buffer size of request i
  • b_ij: the corresponding off-chip bandwidth requirement at size s_ij
- Buffer utilization efficiency at each point: (b_i(j-1) - b_ij) / (s_ij - s_i(j-1)), i.e., the bandwidth reduction per unit of additional buffer
- The points of each curve are in non-decreasing order of buffer size
(Figure: two BB-Curves, with points (b_00, s_00) through (b_04, s_04) and (b_10, s_10) through (b_12, s_12); the annotations compare marginal efficiencies, e.g. (b_00 - b_01)/(s_01 - s_00) > (b_10 - b_11)/(s_11 - s_10), which orders the candidate points for allocation)
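One plausible reading of the DIG policy can be sketched as a greedy loop: every request starts at its smallest BB-Curve point, and remaining capacity goes to whichever request's next point offers the highest marginal efficiency (bandwidth reduction per KB). The curves and the capacity bound below are made-up numbers, not from the paper.

```python
# BB-Curves: lists of (size_KB, bandwidth), bandwidth decreasing in size
curves = [
    [(9, 100), (27, 40), (119, 30)],    # request 0
    [(8, 80), (16, 60), (64, 20)],      # request 1
]

def dig_allocate(curves, capacity):
    """Greedy allocation by marginal efficiency under a capacity bound."""
    granted = [0] * len(curves)          # index of granted point per request
    used = sum(c[0][0] for c in curves)  # everyone gets its smallest point
    while True:
        best, best_eff = None, 0.0
        for i, c in enumerate(curves):
            j = granted[i]
            if j + 1 < len(c):
                extra = c[j + 1][0] - c[j][0]          # extra KB needed
                eff = (c[j][1] - c[j + 1][1]) / extra  # bandwidth saved/KB
                if eff > best_eff and used + extra <= capacity:
                    best, best_eff, best_extra = i, eff, extra
        if best is None:   # no upgrade fits or helps: return granted sizes
            return [curves[i][granted[i]][0] for i in range(len(curves))]
        granted[best] += 1
        used += best_extra

print(dig_allocate(curves, 60))
```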
Flexible Paged Allocation
♦ Set the page size according to the buffer size: a fixed total number of pages for each buffer
♦ The BiN manager locally keeps the information about the current contiguous buffer space in each L2 bank
- Possible because all buffer allocation and free operations are performed by the BiN manager
♦ Allocation starts from the L2 bank nearest to the accelerator and proceeds to the farthest
♦ The last page of a buffer (the source of page fragments) is allowed to be smaller than its other pages
- No impact on the page-table lookup
- The maximum page fragment is smaller than the min-page
- The page fragments do not waste capacity, since they can be used by the cache
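The paging rule above can be sketched as follows. The 4KB min-page comes from the slides; the fixed maximum page count and the power-of-two growth policy are assumptions made for illustration.

```python
MIN_PAGE_KB = 4     # min-page from the slides
MAX_PAGES = 16      # fixed page-table size per buffer (assumed)

def page_layout(buffer_kb):
    """Return the list of page sizes (KB) covering the buffer."""
    page = MIN_PAGE_KB
    # grow the page size until MAX_PAGES pages suffice
    while page * MAX_PAGES < buffer_kb:
        page *= 2
    full, rem = divmod(buffer_kb, page)
    sizes = [page] * full
    if rem:
        sizes.append(rem)   # smaller last page: the only fragment source
    return sizes

print(page_layout(70))
```

With a 70KB buffer the page size grows to 8KB and the layout is eight 8KB pages plus a 6KB last page, so the unused fragment (8KB - 6KB = 2KB) stays below the 4KB min-page.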
Buffer Allocation in NUCA
♦ Total buffer size
- Buffers are allocated on demand
- Set an upper bound on the total buffer size to limit the impact on the cache
- State-of-the-art cache partitioning can be used to dynamically tune the upper bound
  • E.g., [Qureshi & Patt, MICRO'06]
♦ Buffer allocation among cache banks
- Distribute the imposed upper bound across the cache banks
  • Avoids creating high contention in any particular cache bank
- State-of-the-art NUCA management schemes can be used to further mitigate the contention introduced by buffer allocation
  • E.g., the page re-coloring scheme [Cho & Jin, MICRO'06]
(Figure: percent of L2 capacity used by cache vs. the BiN upper bound for dealII, gcc, gobmk, hmmer, milc, namd, omnetpp, perl, povray, sphinx and xalancbmk; the four bars per benchmark are 2p-28, 2p-52, 2p-76 and 2p-100)
Hardware Overhead of BiN Management
♦ Storage
- 32 SRAMs: contiguous-space information for the cache banks
  • 7 entries: at most 7 contiguous spaces in a 64KB cache bank with a min-page of 4KB
  • 14 bits wide (10 bits: the starting block ID; 4 bits: the space length in units of the min-page)
- 8 SRAMs: the BB-Curves of the buffer requests
  • 8 entries: at most 8 BB-Curve points
  • 5B wide: 2B for the buffer size and 3B for the buffer usage efficiency
- Total storage overhead: 768B; area: 3,282 um^2 (HP CACTI @ 32nm)
♦ Logic
- 9,725 um^2 @ 2GHz (Synopsys DC, SAED library @ 32nm)
- An average latency of 0.6us (1.2K cycles @ 2GHz) to perform the buffer allocations
♦ The total area of the buffer-allocation module is less than 0.01% of a medium-size 1cm^2 chip
Simulation Infrastructure & Benchmarks
♦ Extend the full-system, cycle-accurate Simics+GEMS simulation platform to support ARC+BiN

  CPU: 4 UltraSPARC III-i cores @ 2GHz
  L1 data/instruction cache: 32KB per core, 4-way set-associative, 64B cache blocks, 3-cycle access latency, pseudo-LRU, MESI directory coherence maintained by the L2 cache
  L2 cache (NUCA): 2MB in 32 banks of 64KB each, 8-way set-associative, 64B cache blocks, 6-cycle access latency, pseudo-LRU
  Network-on-chip: 4x8 mesh, XY routing, wormhole switching, 3-cycle router latency, 1-cycle link latency
  Main memory: 4GB, 1000-cycle access latency

♦ Benchmarks: 4 medical imaging applications in a medical imaging pipeline
- Uses the accelerator-extraction method of [Cong et al., DAC'12]
- Accelerators are synthesized by AutoESL from Xilinx
♦ Experimental benchmark naming convention
- mP-n: m copies of the pipeline, the input to each being a unique n^3-pixel image
  • No fragmentation occurs: used to show the gain of DIG allocation only
- mP-mix: m copies of the pipeline, with randomly selected inputs
  • Fragmentation occurs: used to show the gain of both DIG and paged allocation
Reference Design Schemes
♦ Accelerator Store (AS) [Lyons et al., TACO'12]
- Separate cache and shared buffer module
- The buffer is set 32% larger than the maximum buffer size in BiN, to account for the overhead of buffer-in-cache
- The shared buffer is partitioned into 32 banks distributed over the 32 NoC nodes
♦ BiC [Fajardo et al., DAC'11]
- BiC dynamically allocates contiguous cache space to a buffer
- Upper bound: buffer allocation is limited to at most half of each cache bank
- Buffers can span multiple cache banks
♦ BiN-Paged
- Only the proposed paged allocation scheme
♦ BiN-Dyn
- Based on BiN-Paged, but performs dynamic allocation without considering near-future buffer requests
- It responds to a request immediately, greedily satisfying it with the currently available resources
♦ BiN-Full
- The entire proposed BiN scheme
Impact of Dynamic Interval-based Global Allocation
♦ BiN-Full consistently outperforms the other schemes
- The only exception: 4P-mix3, where the 1.32X larger capacity of the AS can accommodate all buffer requests
♦ Overall, compared to the Accelerator Store and BiC, BiN-Full reduces the runtime by 32% and 35%, respectively
(Figures: normalized runtime and normalized off-chip memory access counts of BiC, BiN-Paged, BiN-Dyn and BiN-Full for 1P/2P/4P-{28, 52, 76, 100} and 4P-mix1 through 4P-mix6)
Impact on Energy
♦ The AS consumes the least per-cache/buffer-access energy and the least unit leakage
- Because in the Accelerator Store the buffer and the cache are two separate units
♦ BiN-Dyn
- Saves energy in cases where it can reduce the off-chip memory accesses and the runtime
- Incurs a large energy overhead in cases where it significantly increases the runtime
♦ Compared with the AS, BiN-Full reduces the energy by 12% on average
- Exception: 4P-mix{2,3}, where the 1.32X capacity of the AS can better satisfy the buffer requests
♦ Compared with BiC, BiN-Full reduces the energy by 29% on average
(Figure: normalized memory-subsystem energy of BiC, BiN-Paged, BiN-Dyn and BiN-Full across the same benchmarks)
Examples of Energy-Efficient Customization
♦ Customization of processor cores
♦ Customization of on-chip memory
♦ Customization of on-chip interconnects
Terahertz VCO in 65nm CMOS
♦ Demonstrated an ultra-high-frequency, low-power oscillator structure in CMOS by adding a negative-resistance parallel tank, with the fundamental frequency at 217GHz and 16.8mW DC power consumption
♦ The measured 4th and 6th harmonics are about 870GHz and 1.3THz, respectively
- The higher harmonics (4th and 6th) may be substantially underestimated due to excessive water and oxygen absorption and setup losses at these frequencies
(Figure: measured signal spectrum with uncalibrated power)

"Generating Terahertz Signals in 65nm CMOS with Negative-Resistance Resonator Boosting and Selective Harmonic Suppression", Symposium on VLSI Technology and Circuits, June 2010
Use of Multiband RF-Interconnect for Customization
[Figure: TX/RX block diagram with per-channel signal power spectra]

• In TX, each mixer up-converts an individual baseband stream into a specific frequency band (or channel)
• N different data streams (N=6 in the exemplary figure above) may transmit simultaneously on the shared transmission medium to achieve higher aggregate data rates
• In RX, individual signals are down-converted by a mixer and recovered after a low-pass filter
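The up-conversion/down-conversion flow above can be sketched as a small numerical simulation. This is a toy illustration only, not the actual RF-I circuit: the sample rate, carrier frequencies, symbol rate, and filter are all made up for the example.

```python
import numpy as np

fs = 10_000                      # sample rate (Hz), arbitrary for the sketch
t = np.arange(0, 1.0, 1 / fs)    # 1 second of samples
carriers = [1000, 2000, 3000]    # one hypothetical carrier per channel
symbol_len = fs // 10            # 10 symbols per second per stream

# Random bit streams, one per channel, held constant over each symbol period
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(len(carriers), 10))
baseband = np.repeat(bits * 2.0 - 1.0, symbol_len, axis=1)  # NRZ: 0 -> -1, 1 -> +1

# TX: each mixer up-converts its baseband stream onto its own band,
# then all channels share one transmission medium (modeled as summation)
medium = sum(baseband[i] * np.cos(2 * np.pi * f * t) for i, f in enumerate(carriers))

def lowpass(x, taps=201, cutoff=200):
    """Windowed-sinc low-pass FIR (toy filter for the sketch)."""
    n = np.arange(taps) - (taps - 1) / 2
    h = np.sinc(2 * cutoff / fs * n) * np.hamming(taps)
    return np.convolve(x, h / h.sum(), mode="same")

# RX: down-convert each band with its carrier, low-pass, slice per symbol
recovered = []
for f in carriers:
    demod = lowpass(medium * np.cos(2 * np.pi * f * t))
    symbols = demod.reshape(-1, symbol_len).mean(axis=1)
    recovered.append((symbols > 0).astype(int))
```

With the channels spaced well apart relative to the filter cutoff, each down-converted stream recovers its original bits despite sharing one medium — the frequency-division idea behind the aggregate data-rate claim.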
82
Mesh Overlaid with RF-I [HPCA’08]
♦ 10x10 mesh of pipelined routers
  NoC runs at 2GHz
  XY routing
♦ 64 4GHz 3-wide processor cores
  Labeled aqua
  8KB L1 Data Cache
  8KB L1 Instruction Cache
♦ 32 L2 Cache Banks
  Labeled pink
  256KB each
  Organized as shared NUCA cache
♦ 4 Main Memory Interfaces
  Labeled green
♦ RF-I transmission line bundle
  Black thick line spanning mesh
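XY routing on such a mesh is simple to state: travel fully along the X dimension first, then along Y. A minimal sketch of the route computation (coordinates and hop counts only; no contention or pipelining modeled):

```python
def xy_route(src, dst):
    """Dimension-ordered (XY) routing on a 2D mesh:
    move along X to the destination column, then along Y to the row."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:                 # X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                 # then Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

# On a 10x10 mesh the hop count is the Manhattan distance:
corner_to_corner = xy_route((0, 0), (9, 9))
assert len(corner_to_corner) - 1 == 18
```

Dimension-ordered routing is attractive here because it is deadlock-free on a mesh without extra virtual channels; the RF-I shortcuts on the next slide then cut the worst-case Manhattan distances.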
83
RF-I Logical Organization
• Logically:
  - RF-I behaves as a set of N express channels
  - Each channel is assigned to a src, dest router pair (s, d)
• Reconfigured by remapping shortcuts to match the needs of different applications

[Figure: two shortcut mappings, LOGICAL A and LOGICAL B]
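The remapping idea can be sketched as a tiny model: given a per-application traffic profile, assign the N express channels to the (src, dest) pairs that save the most baseline mesh hops. This is a greedy illustration with a made-up traffic profile, not the allocation algorithm of the HPCA'08 paper.

```python
from collections import Counter

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def pick_shortcuts(traffic, n_channels):
    """Greedy sketch: give an RF-I express channel to the (src, dst) pairs
    with the largest total saved hops, assuming a shortcut turns the
    pair's mesh route into a single hop."""
    saved = {pair: cnt * (manhattan(*pair) - 1) for pair, cnt in traffic.items()}
    ranked = sorted(saved, key=saved.get, reverse=True)
    return ranked[:n_channels]

# Hypothetical traffic profile: (src, dst) -> message count
traffic = Counter({((0, 0), (9, 9)): 50,    # long, moderately hot
                   ((1, 1), (2, 1)): 500,   # hot but already adjacent
                   ((0, 9), (9, 0)): 40})   # long diagonal
shortcuts = pick_shortcuts(traffic, n_channels=2)
```

Note how the hottest pair loses out: it is only one hop away, so a shortcut saves nothing, while the two long-distance flows each save 17 hops per message. This is why the mapping should change per application rather than being fixed at design time.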
84
Latest Progress: Die Photo of STL-DBI Transceiver

[Die photo: Controller Side / Memory Side. Active Area: 0.12mm² (15% smaller than Ref. [4])]

85

[4] G.-S. Byun, et al., ISSCC 2011
Comparison with State-of-the-art

                          This Work      JSSC 2009[1]    ISSCC 2009[2]   JSSC 2010[3]   ISSCC 2011[4]
  Technology              65nm           180nm           130nm           40nm           65nm
  Signaling               Single-Ended   Single-Ended    Single-Ended    Differential   Differential
  Link Type               SBD            Bidirectional   Bidirectional   Bidirectional  SBD
  Aggregate data rate/pin 8Gb/s/pin      5Gb/s/pin       6Gb/s/pin       2.15Gb/s/pin   4.2Gb/s/pin
  Total Power             14.4mW (BB)    87mW            95mW            14.4mW         12mW (BB)
                          17.6mW (RF)                                                   11mW (RF)
  Energy/bit/pin          4pJ/bit/pin    17.4pJ/bit/pin  15.8pJ/bit/pin  6.6pJ/bit/pin  5pJ/bit/pin
  Chip Area               0.12mm²        0.52mm²         0.30mm²         0.9mm²         0.14mm²
  FoM                     2.08           0.11            0.21            0.17           1.30
FoM = (Data rate per pin) / (Power × Area),  in (Gb/pin)/(mJ·mm²)  — i.e., (Gb/s/pin)/(mW·mm²)
[1] K.-I. Oh, et al., JSSC 2009 (Samsung)
[2] K.-S. Ha, et al., ISSCC 2009 (Samsung)
[3] B. Leibowitz, et al., JSSC 2010 (Rambus)
[4] G.-S. Byun, et al., ISSCC 2011 (UCLA)

86
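The FoM column follows directly from the other table rows, so the arithmetic can be checked in a few lines (values transcribed from the comparison table; BB and RF power contributions are summed where both are reported):

```python
# (per-pin data rate [Gb/s], total power [mW], chip area [mm^2]) from the table
designs = {
    "This Work":     (8.0,  14.4 + 17.6, 0.12),
    "JSSC 2009[1]":  (5.0,  87.0,        0.52),
    "ISSCC 2009[2]": (6.0,  95.0,        0.30),
    "JSSC 2010[3]":  (2.15, 14.4,        0.90),
    "ISSCC 2011[4]": (4.2,  12.0 + 11.0, 0.14),
}

def fom(rate, power, area):
    """FoM = (data rate per pin) / (power x area), in (Gb/pin)/(mJ*mm^2)."""
    return rate / (power * area)

for name, (rate, power, area) in designs.items():
    print(f"{name}: FoM = {fom(rate, power, area):.2f}")
```

Running this reproduces the published column (2.08, 0.11, 0.21, 0.17, 1.30), confirming that the roughly 10X advantage over the Samsung designs comes jointly from the lower energy/bit and the much smaller active area.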
Research Scope in CDSC (Center for Domain-Specific Computing)

[Figure: Customizable Heterogeneous Platform (CHP) — fixed cores, custom cores, and programmable fabric with accelerators, plus caches ($), DRAM, and I/O, connected by a reconfigurable RF-I bus / reconfigurable optical bus (transceiver/receiver, optical interface); multiple CHPs compose the system]

♦ Domain-specific modeling (healthcare applications)
♦ CHP mapping: source-to-source CHP mapper; reconfiguring & optimizing backend; adaptive runtime
♦ CHP creation: customizable computing engines; customizable interconnects
♦ Architecture modeling
♦ Customization setting: design once, invoke many times

87
CHP Mapping Overview
Goal: Efficient mapping of domain-specific applications to customizable hardware
— Adapt the CHP to a given application so as to optimize performance/power efficiency

[Flow diagram:]
Domain-specific applications (programmer, abstract execution)
→ Domain-specific programming model (domain-specific coordination graph and domain-specific language extensions)
→ Source-to-source CHP Mapper (ROSE), guided by application characteristics and CHP architecture models:
  C/C++ front-end → ROSE SAGE IR; emits C/C++ code and accelerator kernels
  Accelerator kernels → accelerator compiler/library
  ROSE → LLVM translator
→ Reconfiguring and optimizing back-end (LLVM): binary code for fixed & customized cores; accelerator code; RTL for programmable fabric via RTL synthesizer (AutoPilot/xPilot)
→ Unified adaptive runtime system (maps tasks across CPUs, GPUs, accelerators, FPGA processors), with performance feedback
→ CHP architectural prototypes (CHP hardware testbeds, CHP simulation testbed, full CHP)

88
xPilot: Behavioral-to-RTL Synthesis Flow [SOCC’2006]

Behavioral spec. in C/C++/SystemC
→ Frontend compiler (with platform description) → SSDM
→ Advanced transformations/optimizations:
  Loop unrolling/shifting/pipelining
  Strength reduction / tree height reduction
  Bitwidth analysis
  Memory analysis …
→ Core behavior synthesis optimizations:
  Scheduling
  Resource binding, e.g., functional unit binding, register/port binding
→ μArch generation & RTL/constraints generation
→ RTL + constraints in Verilog/VHDL/SystemC, targeting FPGAs (Altera, Xilinx) and ASICs (Magma, Synopsys, …)

89
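Scheduling, the first of the core synthesis optimizations above, decides which clock cycle each operation executes in. A minimal ASAP (as-soon-as-possible) scheduler conveys the idea; this is a textbook sketch with unit latencies and unlimited resources, not xPilot's actual SSDM-based scheduler, which adds resource constraints, operation chaining, and pipelining:

```python
def asap_schedule(ops, deps):
    """ASAP scheduling sketch: each operation starts at the earliest cycle
    after all of its predecessors finish (unit latency assumed)."""
    cycle = {}

    def start(op):
        if op not in cycle:
            preds = deps.get(op, [])
            # No predecessors: cycle 0; else one cycle after the latest pred.
            cycle[op] = 0 if not preds else max(start(p) for p in preds) + 1
        return cycle[op]

    for op in ops:
        start(op)
    return cycle

# y = (a*b) + (c*d): the two independent multiplies share cycle 0,
# the dependent add must wait for both.
ops = ["mul1", "mul2", "add"]
deps = {"add": ["mul1", "mul2"]}
schedule = asap_schedule(ops, deps)
assert schedule == {"mul1": 0, "mul2": 0, "add": 1}
```

Binding then decides which functional unit and register each scheduled operation uses — e.g., whether the two cycle-0 multiplies get one multiplier each or are serialized onto a shared one.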
AutoPilot Compilation Tool (based on the UCLA xPilot system)

[Flow: Design specification (C/C++/SystemC + user constraints) → AutoPilot™ compilation & elaboration → presynthesis optimizations → behavioral & communication synthesis and optimizations → RTL HDLs & RTL SystemC with timing/power/layout constraints, backed by a platform characterization library; a common testbench supports simulation, verification, and ESL synthesis prototyping of the FPGA co-processor]

♦ Platform-based C to FPGA synthesis
♦ Synthesizes pure ANSI-C and C++, GCC-compatible compilation flow
♦ Full support of IEEE-754 floating-point data types & operations
♦ Efficiently handles bit-accurate fixed-point arithmetic
♦ More than 10X design productivity gain
♦ High quality-of-results

Developed by AutoESL, acquired by Xilinx in Jan. 2011

90
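"Bit-accurate fixed-point arithmetic" means the tool tracks exact integer and fraction widths rather than using floats. The mechanics can be mimicked in software by scaling to integers; below is a toy Q4.4 example (a hypothetical 8-bit format chosen for illustration, not AutoPilot's own fixed-point type):

```python
FRAC_BITS = 4            # Q4.4: 4 integer bits, 4 fractional bits (toy format)
SCALE = 1 << FRAC_BITS   # one LSB represents 1/16

def to_fixed(x):
    """Quantize a real number to the nearest representable Q4.4 value."""
    return round(x * SCALE)

def to_float(f):
    return f / SCALE

def fixed_mul(a, b):
    """Multiply two Q4.4 values: the raw integer product is Q8.8,
    so shift right by FRAC_BITS to renormalize back to Q4.4."""
    return (a * b) >> FRAC_BITS

a, b = to_fixed(1.5), to_fixed(2.25)   # stored as 24 and 36
prod = fixed_mul(a, b)
assert to_float(prod) == 3.375         # exactly representable in Q4.4
```

In hardware, a Q4.4 multiplier is an 8-bit integer multiplier plus wiring — far cheaper than an IEEE-754 unit — which is why narrowing data via bitwidth analysis is one of the biggest levers in C-to-FPGA synthesis.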
Programming Model and Runtime Support [LCTES12]

♦ Concurrent Collections (CnC) programming model
  Clear separation between application description and implementation
  Fits domain expert needs
♦ CnC-HC: software flow CnC => Habanero-C (HC)
♦ Cross-device work-stealing in Habanero-C
  Task affinity with heterogeneous components
♦ Data-driven runtime in CnC-HC

91
CnC Building Blocks

♦ Steps
  Computational units
  Functional with respect to their inputs
♦ Data Items
  Means of communication between steps
  Dynamic single assignment
♦ Control Items
  Used to create (prescribe) instances of a computation step

92
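These three building blocks can be sketched in a few lines — a toy interpreter to make the semantics concrete, not CnC-HC itself: steps are pure functions of their inputs, data items obey dynamic single assignment, and control items (tags) prescribe which step instances run.

```python
class ItemCollection:
    """Data items: dynamic single assignment -- each tag is written at most once."""
    def __init__(self):
        self._items = {}

    def put(self, tag, value):
        if tag in self._items:
            raise RuntimeError(f"single-assignment violated for tag {tag!r}")
        self._items[tag] = value

    def get(self, tag):
        return self._items[tag]

def square_step(tag, data):
    """A step: purely functional with respect to its inputs,
    reading ('in', tag) and producing ('sq', tag)."""
    data.put(("sq", tag), data.get(("in", tag)) ** 2)

data = ItemCollection()
for i in range(4):
    data.put(("in", i), i)

# Control items: the tag collection prescribes the step instances to execute.
control_tags = [0, 1, 2, 3]
for tag in control_tags:
    square_step(tag, data)

assert data.get(("sq", 3)) == 9
```

Because steps are side-effect-free and items are written once, the runtime is free to execute prescribed instances in any order — or on any device — which is exactly the property the cross-device work-stealing scheduler exploits.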
HC-1ex Architecture

“Commodity” Intel server:
  Standard Intel® x86-64 server running x86-64 Linux
  Intel® Xeon® Quad-Core LV5408 processor, 40W TDP
  Intel® Memory Controller Hub (MCH), Intel® I/O subsystem, memory

Convey FPGA-based coprocessor:
  Application Engine Hub (AEH) and Application Engines (AEs), with direct data port
  XC6vlx760 FPGAs
  80GB/s off-chip bandwidth
  94W design power
  Shared cache-coherent memory with the host

Tesla C1060 GPU:
  100GB/s off-chip bandwidth
  200W TDP

93
Runtime Support: Experimental Results

♦ Performance for medical imaging kernels

                   Denoise          Registration     Segmentation
  Num iterations   3                100              50
  CPU (1 core)     3.3s             457.8s           36.76s
  GPU              0.085s (38.3×)   20.26s (22.6×)   1.263s (29.1×)
  FPGA             0.190s (17.2×)   17.52s (26.1×)   4.173s (8.8×)

94
Experimental Results (Cont’d)

• Execution times and active energy with dynamic work-stealing

95
Static vs. Dynamic Binding

♦ Static binding
♦ Dynamic binding

96
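The distinction can be sketched in miniature: static binding fixes each task's device up front, while dynamic binding lets whichever device frees up first take the next queued task. The per-device costs below are hypothetical, and this greedy model ignores data-transfer overheads and the task affinity the CnC-HC runtime actually tracks.

```python
import heapq

def dynamic_binding(task_costs, devices):
    """Greedy sketch of dynamic binding: each task goes to whichever
    device becomes free first, paying that device's cost for the task.
    Returns the makespan (time when the last task finishes)."""
    heap = [(0.0, d) for d in devices]   # (time device becomes free, device)
    heapq.heapify(heap)
    finish = 0.0
    for costs in task_costs:
        free_at, dev = heapq.heappop(heap)
        done = free_at + costs[dev]
        finish = max(finish, done)
        heapq.heappush(heap, (done, dev))
    return finish

# Four identical tasks; hypothetical cost (seconds) on each device.
tasks = [{"cpu": 3.0, "gpu": 2.0} for _ in range(4)]

static_all_gpu = sum(t["gpu"] for t in tasks)        # static: serialize on the GPU
dynamic = dynamic_binding(tasks, ["cpu", "gpu"])     # dynamic: both devices busy
assert dynamic < static_all_gpu
```

Even though the GPU is faster per task, binding everything to it statically leaves the CPU idle; dynamic binding finishes sooner by overlapping the devices — the effect the work-stealing energy/time results on the previous slide quantify.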
Concluding Remarks

♦ Despite the end of scaling, there is plenty of opportunity with customization and specialization for energy-efficient computing
♦ Many opportunities and challenges for architecture support
  Cores
  Accelerators
  Memory
  Network-on-chips
♦ Software support is also critical

97
Acknowledgements: CDSC Faculty

Aberle (UCLA), Baraniuk (Rice), Bui (UCLA), Cong (Director, UCLA), Cheng (UCSB), Chang (UCLA), Reinman (UCLA), Palsberg (UCLA), Sadayappan (Ohio State), Sarkar (Associate Director, Rice), Vese (UCLA), Potkonjak (UCLA)

98
More Acknowledgements

Mohammad Ali Ghodrat, Yi Zou, Chunyue Liu, Hui Huang, Michael Gill, Beayna Grigorian

♦ This research is partially supported by the Center for Domain-Specific Computing (CDSC), funded by the NSF Expeditions in Computing Award CCF-0926127, and by GSRC under contract 2009-TJ-1984.

99

100