Synthesis of Custom Processors based on Extensible Platforms
Fei Sun+, Srivaths Ravi++, Anand Raghunathan++ and Niraj K. Jha+
+: Dept. of Electrical Engineering, Princeton University
++: NEC Laboratories America, Inc.
-
Outline
SoC design constraints
Background
Previous work in ASIP design
Xtensa platform
Manual custom instruction generation procedure
Automatic custom instruction generation flow
Experimental results
Conclusions
-
SoC Design Constraints
Time to market
Cost
Performance
Power
Cost-performance trade-off
Flexibility
-
Comparison of Different Approaches
                   ASIC   ASIP   GPP
Time to market      --     +     ++
Cost                ++     +     --
Performance         ++     +     --
Power               ++     +     --
Cost-performance    ++     +     --
Flexibility         --     +     ++
Legend: ++ Very good, + Good, -- Very bad
-
Flexibility vs. Energy Efficiency
-
Previous Work in ASIP Design
ASIP architectures and overall design methodologies: [Huang, 1994], [Adams, 1996], [Fisher, 1999], [Kucukcakar, 1999]
Application-specific instruction set selection: [Choi, 1999], [Gschwind, 1999], [Arnold, 1999]
Low power ASIP design: [Kalambur, 1997], [Dougherty, 1999], [Ishihara, 2000], [Sami, 2001]
Commercial offerings: Xtensa, ARCtangent, Jazz, SP-5flex, Carmel
-
Xtensa Architecture
[Block diagram. Source: www.tensilica.com]
Blocks: Processor Controls; TRACE Port; JTAG Tap Control; On Chip Debug; Align and Decode; Coprocessor Register File; Coprocessor Execution Units; Window Register File; ALU & Address Generation; MAC 16; Designer Defined Instruction Execution Unit; Instruction Memory or Cache & Tags; Branch Logic & Instruction Fetch; Data Memory or Cache & Tags; Processor Interface; Write Buffer; Timers 1 to n; Special Function Register Access; Data Address Watch 0 to n; Instruction Address Watch 0 to n; Exception Support; Interrupt Control; Memory Protection Unit
Signal labels: Instruction; Data; Instruction Address; Data Address
Legend: Base ISA Feature; Configurable Function; Optional Function; Configurable & Optional Function; Extensible
-
Xtensa Processor Design Flow
[Flow diagram. Source: www.tensilica.com]
Labels: Processor Configuration Inputs; Designer-Defined Instruction Descriptions; Configuration File; Configured GNU C/C++ Compiler; Configured GNU Assembler/Disassembler; Configured Instruction Set Simulator/Emulator; Configured Processor HDL; Area, Power and Timing Estimation; Application Source Code; Sample Application Data; Optimized Software; Optimized Hardware
Legend: Generator Output; Internal Database; Design data; Use of Generated Data
-
Manual Custom Instruction Generation Procedure
Identify potential new instructions
Describe custom instructions
Insert custom instructions
Verify functional correctness
Profile, read source code
Understand source code
Rewrite source code
Slow and error-prone
-
Contributions of Our Work
Automatic custom instruction selection
Application program to extensible processors with custom instructions
Features
Efficient design space search
Use accurate information from instruction set simulator and synthesis
Bridge the gap between automatically synthesized and manually designed architectures
-
Automatic Custom Instruction Generation Flow
[Flow chart with steps numbered 1 to 20]
Application program (C)
Profile C program
Generate program dependence graphs
Rank control blocks
Generate templates
Select templates
Generate individual custom instr (steps 6 - 13)
Select custom instr combination
Generate custom instr combination
Build processor
Synthesize custom instr combination
Clock period/area constraints met? (N: Next instr combination)
Profile C with instr combination
Synthesize processor
Tools: Aristotle analysis system; Profiler (xt-gprof)
-
Example Illustration of Template Generation
c = a & 0xff;      // node 1
d = b & 0xff + c;  // node 2
e = d
-
Example Illustration of Template Generation (Contd.)
[Figure: dependence graph with nodes 1 to 4, variables a to g, and weights 0.03, 0.03, 0.03 and 0.06, shown next to the code above. Successive builds add basic templates, dependent templates and independent templates formed from subsets of the nodes.]
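The template categories can be illustrated with a small enumeration. This is a minimal sketch, not the authors' generator. It assumes a hypothetical 4-node graph with edges 1->2, 2->3 and 2->4 (only 1->2 follows from the listed code, where node 2 reads c), and it assumes that a dependent template is a multi-node subset connected by dependence edges while an independent template is a multi-node subset that is not. The names EDGES, is_connected and count_templates are illustrative.

```c
#include <assert.h>

#define NUM_NODES 4

/* Hypothetical dependence edges, 0-based: 1->2, 2->3, 2->4 in slide numbering. */
static const int EDGES[][2] = { {0, 1}, {1, 2}, {1, 3} };
#define NUM_EDGES ((int)(sizeof EDGES / sizeof EDGES[0]))

/* Number of nodes in a subset encoded as a bitmask. */
int template_size(unsigned mask) {
    int n = 0;
    for (; mask; mask >>= 1) n += (int)(mask & 1u);
    return n;
}

/* Returns 1 if the nodes in mask are connected through edges that lie
   entirely inside mask (edge direction is ignored). */
int is_connected(unsigned mask) {
    if (mask == 0) return 0;
    unsigned reached = mask & (~mask + 1u);  /* start from the lowest set bit */
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int e = 0; e < NUM_EDGES; e++) {
            unsigned u = 1u << EDGES[e][0], v = 1u << EDGES[e][1];
            if (!(mask & u) || !(mask & v)) continue;
            if (((reached & u) != 0) != ((reached & v) != 0)) {
                reached |= u | v;
                changed = 1;
            }
        }
    }
    return reached == mask;
}

/* Counts multi-node templates of each kind over all node subsets. */
void count_templates(int *dependent, int *independent) {
    assert(dependent && independent);
    *dependent = *independent = 0;
    for (unsigned mask = 1; mask < (1u << NUM_NODES); mask++) {
        if (template_size(mask) < 2) continue;  /* single nodes are basic templates */
        if (is_connected(mask)) (*dependent)++;
        else (*independent)++;
    }
}
```

Exhaustive enumeration grows exponentially with the number of nodes, which is why the flow prunes templates rather than generating all of them.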
-
Key Observations for Pruning
The higher the weight of the template, the higher the potential for improvement (Amdahl's law)
Scope for optimization determined by computation: no. of cycles needed for executing the template
Scope for optimization determined by read/write port limitations: additional cycles needed for extra reading/writing of input/output variables
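The first observation can be made concrete with Amdahl's law. The sketch below is a plain statement of the law, not code from the presented tool; the function name amdahl_speedup and its parameters are illustrative.

```c
#include <assert.h>

/* Overall program speedup when a fraction `weight` of the original run time
   is accelerated by a factor `local_speedup` (Amdahl's law). */
double amdahl_speedup(double weight, double local_speedup) {
    assert(weight >= 0.0 && weight <= 1.0 && local_speedup > 0.0);
    return 1.0 / ((1.0 - weight) + weight / local_speedup);
}
```

For a hypothetical template with weight 0.06 that runs 3X faster as a custom instruction, the overall speedup is 1 / (0.94 + 0.02), roughly 1.04X, so low-weight templates offer little gain however well they are optimized.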
-
Pruning Algorithm
Ranking criterion: [equation not preserved in the transcript]
OriginalTime: fraction of the total execution time of the original program spent in the template (weight)
In, Out: number of inputs and outputs of the template, respectively
[symbols not preserved]: number of inputs/outputs encoded in the instruction
[symbol not preserved]: no. of cycles needed for executing the template
Higher priority means greater potential for speedup
-
Template Generation with Pruning
Ranked pool of seed templates: 10.51, 7.92, 4.05, 2.13
Threshold: 0.1
Template set
-
Template Generation with Pruning
Priorities shown: 12.73, 1.18, 16.35 (one marked as highest priority)
Threshold: 0.1
Template set
Ranked pool of seed templates
-
Template Generation with Pruning
Priorities shown: 12.73, 16.35 (one marked as highest priority)
Threshold: 0.1
Template set
Ranked pool of seed templates
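The slides do not spell out the pruning rule. One reading, consistent with the priority 1.18 disappearing when the highest priority is 16.35 and the threshold is 0.1, is that a template is dropped when its priority falls below threshold times the highest priority. The sketch below implements that assumed rule; prune_by_threshold is an illustrative name, not part of the presented tool.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed rule: keep templates whose priority is at least
   threshold_ratio * (highest priority in the pool).
   Survivors are compacted to the front of the array, in their original
   order, and their count is returned. */
size_t prune_by_threshold(double *priority, size_t n, double threshold_ratio) {
    assert(priority != NULL || n == 0);
    if (n == 0) return 0;
    double highest = priority[0];
    for (size_t i = 1; i < n; i++)
        if (priority[i] > highest) highest = priority[i];
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (priority[i] >= threshold_ratio * highest)
            priority[kept++] = priority[i];
    return kept;
}
```

With the pool {12.73, 1.18, 16.35} and a ratio of 0.1, the cutoff is 1.635, so 1.18 is dropped and two templates remain. A larger ratio prunes more aggressively and yields fewer templates.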
-
No. of Templates vs. Threshold Ratio
-
Automatic Custom Instruction Generation Flow (Contd.)
[Flow chart detailing steps 6 - 13, Generate individual custom instr, which follows Select templates]
Extract templates
Generate custom instr
Generate RTL Verilog
Synthesize Verilog
Clock period constraint met? (N: Increase number of cycles or increase clock period)
Insert custom instr
Profile C with custom instr
All templates built? (N: Next template)
Tools: TIE compiler; Synopsys Design Compiler
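The timing loop in this part of the flow can be sketched as follows. This is a simplified model under stated assumptions: the delay of a custom instruction is taken as a single number that divides evenly across the cycles assigned to it, whereas the actual flow obtains timing from synthesis. TimingChoice, fit_timing and max_cycles are hypothetical names.

```c
#include <assert.h>

typedef struct {
    int cycles;           /* cycles assigned to the custom instruction */
    double clock_period;  /* resulting processor clock period */
} TimingChoice;

/* Adds cycles until the per-cycle delay fits the target clock period.
   If max_cycles is not enough, a longer clock period is accepted instead. */
TimingChoice fit_timing(double instr_delay, double target_period, int max_cycles) {
    assert(instr_delay > 0.0 && target_period > 0.0 && max_cycles >= 1);
    TimingChoice c = { 1, target_period };
    while (instr_delay / c.cycles > target_period && c.cycles < max_cycles)
        c.cycles++;
    if (instr_delay / c.cycles > target_period)
        c.clock_period = instr_delay / c.cycles;  /* constraint relaxed */
    return c;
}
```

For example, a delay of 7 units against a target period of 5 fits in two cycles, while a delay of 20 with at most two cycles forces the period up to 10. Both outcomes cost performance, which is why each candidate is profiled again afterwards.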
-
Custom Instruction Insertion
Care must be taken to insert custom instructions into appropriate places without affecting the program's functional correctness
If custom instructions need extra inputs (outputs), care must be taken to select appropriate variables to write to (read from) user-defined registers
-
Example Illustration of Custom Instruction Insertion
(a) Original code (dependence graph nodes 1 to 5)
t = s >> 24;    // 1
r = t & 0xff;   // 2
a[5] = t + d;   // 3
m = b[0];       // 4
y = x + m;      // 5
(b) Code after insertion (graph nodes 4; 1,2,5; 3)
m = b[0];               // 4
y = CustomInstr(s, m);  // 1,2,5
t = RUR(0);             // 1,2,5
a[5] = t + d;           // 3
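The equivalence of (a) and (b) can be checked with a software model. This is a sketch under loud assumptions: CustomInstr and RUR here are plain C stand-ins, not Tensilica intrinsics; user_reg is a hypothetical model of a user-defined register; and node 5 is assumed to read r, since CustomInstr(s, m) does not receive the x shown in the listing.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t user_reg[1];  /* stand-in for a user-defined register */

/* Stand-in for the fused nodes 1, 2 and 5. The extra output t is written
   to user_reg[0] because the instruction returns only one value. */
uint32_t CustomInstr(uint32_t s, uint32_t m) {
    uint32_t t = s >> 24;   /* node 1 */
    uint32_t r = t & 0xff;  /* node 2 */
    user_reg[0] = t;        /* extra output */
    return r + m;           /* node 5, assumed to read r */
}

/* Stand-in for reading a user-defined register. */
uint32_t RUR(int index) {
    assert(index == 0);
    return user_reg[index];
}

/* Sequence (a): original statement order. Returns y and stores a[5]. */
uint32_t original(uint32_t s, uint32_t d, const uint32_t *b, uint32_t *a) {
    uint32_t t = s >> 24;
    uint32_t r = t & 0xff;
    a[5] = t + d;
    uint32_t m = b[0];
    return r + m;
}

/* Sequence (b): rewritten order using the custom instruction. */
uint32_t rewritten(uint32_t s, uint32_t d, const uint32_t *b, uint32_t *a) {
    uint32_t m = b[0];
    uint32_t y = CustomInstr(s, m);
    uint32_t t = RUR(0);
    a[5] = t + d;
    return y;
}
```

The rewrite moves the load of b[0] ahead of the store to a[5], so the two sequences agree only when those locations do not alias. This is one instance of the care that insertion requires.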
-
Example Illustration of Custom Instruction Insertion (Contd.)
(a) (b)
.... offset = t + 1; for (i=0; i
-
Custom Instruction Combination Selection: Problem Statement
Given a set of non-overlapping custom instructions, with each instruction having several versions, find a version for each instruction such that performance is maximized while area is under a certain threshold
-
Custom Instruction Combination Selection: Flow Chart
[Flow chart of a recursive search over instruction versions; decision boxes have Y/N branches]
Start
All instrs analyzed?
Add current version of current instr to solution
Performance upper bound is among the best?
Area meets constraint?
All versions considered?
Stop
Performance is among the best?
Update best solutions
Next version
Next instruction (recursive call)
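The problem statement and flow chart describe a recursive search with two pruning tests, an area check and a performance upper bound. A minimal branch-and-bound sketch follows. It is not the authors' implementation: it keeps a single best solution rather than a set of best solutions, models performance as an additive gain per instruction (plausible because the instructions are non-overlapping), and all type and function names are hypothetical.

```c
#include <assert.h>

#define MAX_INSTRS 8
#define MAX_VERSIONS 4

typedef struct {
    int num_versions;
    double gain[MAX_VERSIONS];  /* estimated cycles saved by each version */
    double area[MAX_VERSIONS];  /* area cost of each version */
} Instr;

typedef struct {
    const Instr *instrs;
    int num_instrs;
    double max_area;
    double best_gain;
    int best_choice[MAX_INSTRS];
    int choice[MAX_INSTRS];
} Search;

/* Largest gain any version of one instruction can contribute. */
static double max_gain(const Instr *in) {
    double g = in->gain[0];
    for (int v = 1; v < in->num_versions; v++)
        if (in->gain[v] > g) g = in->gain[v];
    return g;
}

/* Picks a version for instruction i, then recurses to instruction i + 1. */
static void search(Search *s, int i, double gain, double area) {
    if (i == s->num_instrs) {               /* all instructions analyzed */
        if (gain > s->best_gain) {
            s->best_gain = gain;
            for (int k = 0; k < s->num_instrs; k++) s->best_choice[k] = s->choice[k];
        }
        return;
    }
    double rest = 0.0;  /* optimistic gain of the instructions after i */
    for (int k = i + 1; k < s->num_instrs; k++) rest += max_gain(&s->instrs[k]);
    for (int v = 0; v < s->instrs[i].num_versions; v++) {
        double g = gain + s->instrs[i].gain[v];
        double a = area + s->instrs[i].area[v];
        if (a > s->max_area) continue;           /* area constraint violated */
        if (g + rest <= s->best_gain) continue;  /* upper bound cannot beat the best */
        s->choice[i] = v;
        search(s, i + 1, g, a);
    }
}

/* Returns the best total gain, or -1.0 if no combination fits the area limit.
   best_choice receives the selected version index of each instruction. */
double select_combination(const Instr *instrs, int n, double max_area, int *best_choice) {
    assert(n >= 1 && n <= MAX_INSTRS);
    Search s = { instrs, n, max_area, -1.0, {0}, {0} };
    search(&s, 0, 0.0, 0.0);
    for (int k = 0; k < n; k++) best_choice[k] = s.best_choice[k];
    return s.best_gain;
}
```

An instruction that may be left out can be modeled by giving it a version with zero gain and zero area. The bound is loose but cheap to compute; a tighter bound would prune more of the search tree.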
-
Experimental Methodology
[Flow diagram]
Labels: C Program; Automatic Custom Instruction Generation; Aristotle; Xtensa TIE Compiler; Synopsys Design Compiler; Xtensa GNU Profiler; Tensilica Processor Generator; Synopsys Design Compiler; Modified C program; Cross Compiler; ISS; Sente Wattwatcher
Measured: Area; Clock Period; Execution Cycles; Power
-
Experimental Results (Contd.)
Average
Performance improvement: 3.4X
Energy reduction: 3.2X
Energy*delay reduction: 12.6X
Area increase: 1.8%
-
Conclusions
Automatic custom instruction synthesis for ASIPs
Template generation/selection
Custom instruction insertion
Custom instruction combination selection
Experimental results
3.4X average performance improvement
12.6X average energy*delay reduction