design automation methodologies for extensible processors platform
DESCRIPTION
Design Automation Methodologies for Extensible Processors Platform. Newton Cheung School of Computer Science & Engineering The University of New South Wales Sydney, Australia. Outline. System-On-Chips Design Challenges Extensible Processors Platform Background – Xtensa - PowerPoint PPT PresentationTRANSCRIPT
Design Automation Methodologies Design Automation Methodologies for Extensible Processors Platformfor Extensible Processors Platform
Newton CheungNewton Cheung
School of Computer Science & EngineeringSchool of Computer Science & EngineeringThe University of New South WalesThe University of New South Wales
Sydney, AustraliaSydney, Australia
OutlineOutline
• System-On-Chips Design ChallengesSystem-On-Chips Design Challenges• Extensible Processors PlatformExtensible Processors Platform• Background – XtensaBackground – Xtensa• Problems in Extensible Processors PlatformProblems in Extensible Processors Platform• The Goal of the researchThe Goal of the research• Proposed SolutionProposed Solution• INSIDE INSIDE • MINCEMINCE• ConclusionsConclusions
SoC Design ChallengesSoC Design Challenges
Increasing software contentIncreasing software content
Customizing the architecture to Customizing the architecture to the specific application the specific application (application domain)(application domain)
• Power consumptionPower consumption
• Chip areaChip area
• CostCost
• Application functionalityApplication functionality
• Ever changing nature of embedded productEver changing nature of embedded product
FlexibilityFlexibility
EfficiencyEfficiency
Flexibility vs. Efficiency (Energy)Flexibility vs. Efficiency (Energy)
General Purpose Processor
0.1-1 MIPS/mw
1-10 MIPS/mw
50-100 MIPS/mw
500-1000 MIPS/mw
En
erg
y E
ffic
ien
cy
Fle
xib
ilit
y
Application Specific Instruction-set Processor (ASIP)
Fle
xib
ilit
y
Application Specific Integrated Circuit (ASIC)
Domain SpecificProcessor (DSP)
Source: Fei Sun @ ICCAD 2002Source: Fei Sun @ ICCAD 2002
Flexibility vs. Efficiency (Performance)Flexibility vs. Efficiency (Performance)
ASICASIP
FPGAGPP
Pe
rfo
rma
nc
e
Flexibility
General Purpose ProcessorGeneral Purpose ProcessorApplication Specific ProcessorApplication Specific Processor
Optimized Processor Design (ASIP)Optimized Processor Design (ASIP)
• CostCost– Small sizeSmall size
• PerformancePerformance– Application extensions Application extensions
• ProductivityProductivity– Rapid hardware Rapid hardware
and softwareand software
Extensible Processors PlatformExtensible Processors Platform
• Represents the state-of-the-art in Represents the state-of-the-art in application specific application specific instruction-set processor (ASIP)instruction-set processor (ASIP)
• Consists of a Consists of a base processorbase processor containing a containing a base base instruction set, instruction set, plus the capability to plus the capability to customize their customize their architecture architecture to replace computationally intensive code to replace computationally intensive code segmentssegments
• The goal of designing extensible processors is The goal of designing extensible processors is typically to typically to maximize the performancemaximize the performance of an of an application, while application, while meeting design constraintsmeeting design constraints
Extensible Processors PlatformExtensible Processors Platform• Enables to address Enables to address threethree architectural levels on the architectural levels on the
base processor:base processor: – Inclusion/Exclusion of Inclusion/Exclusion of predefined blockspredefined blocks– InstructionsInstructions extension extension– Parameterizations Parameterizations
I:Instruction Fetch
R:Instruction Decode / Register Fetch
E: Execute / Effective Address
M: Memory Access / Branch Complete
W: Write Back
Instruction ROM / RAM / Cache
Write Back / Exception Resolution
Decode / General Registers
ALU / Address Generation
Data ROM / RAM / Cache
Custom / Specific Instruction
Predefined Blocks (e.g. DSP – Digital Signal Processing, Floating-Point Unit)
Predefined Block Register (i.e. 64-bit)
• System-On-Chips Design ChallengesSystem-On-Chips Design Challenges• Extensible Processors PlatformExtensible Processors Platform• Background – XtensaBackground – Xtensa• Problems in Extensible Processors PlatformProblems in Extensible Processors Platform• The Goal of the researchThe Goal of the research• Proposed SolutionProposed Solution• INSIDE INSIDE • MINCEMINCE• ConclusionsConclusions
OutlineOutline
Background (Xtensa)Background (Xtensa)
Source: www.tensilica.com Source: www.tensilica.com
Xtensa ArchitectureXtensa ArchitectureSource: www.tensilica.com Source: www.tensilica.com
Xtensa Custom Instructions (TIE)Xtensa Custom Instructions (TIE)
Source: www.tensilica.com Source: www.tensilica.com
Xtensa’s Design FlowXtensa’s Design Flow
Source: www.tensilica.com Source: www.tensilica.com
Extensible Processor (Why?) Extensible Processor (Why?)
Standard Standard ProcessorProcessor
Extensible Extensible ProcessorProcessor
ASIC ASIC (RTL Logic)(RTL Logic)
Application tuned Application tuned data pathsdata paths
NONO Yes: High-level Yes: High-level TIETIE
Low-level RTL Low-level RTL
Task ControlTask Control C/C++C/C++ C/C++C/C++ NoNo
SimulationSimulation Fast Fast simulation or simulation or
boardboard
Fast simulation Fast simulation or boardor board
RTL simulation: RTL simulation: 100x slower100x slower
Multiple EnginesMultiple Engines LimitedLimited Simple Simple directed MP directed MP
interfaceinterface
Possible, but Possible, but hard to design hard to design
and modeland model
Example: Digital VideoExample: Digital VideoSource: www.tensilica.com Source: www.tensilica.com
Example: Multiple Processors for TOEExample: Multiple Processors for TOE
Source: www.tensilica.com Source: www.tensilica.com
Example: Voice GatewayExample: Voice GatewaySource: www.tensilica.com Source: www.tensilica.com
• System-On-Chips Design ChallengesSystem-On-Chips Design Challenges• Extensible Processors PlatformExtensible Processors Platform• Background – XtensaBackground – Xtensa• Problems in Extensible Processors PlatformProblems in Extensible Processors Platform• The Goal of the researchThe Goal of the research• Proposed SolutionProposed Solution• INSIDE INSIDE • MINCEMINCE• ConclusionsConclusions
OutlineOutline
Problems in Automation??Problems in Automation??
Source: www.tensilica.com Source: www.tensilica.com
Design Expertise
Required!!
Generic Design Flow of Extensible Generic Design Flow of Extensible ProcessorProcessor
Application in C/C++
Compilation
Analysis and Profiling
Identify:1) Predefined blocks2) Extensible instructions3) Parameter settings
Define/Select: 1) Predefined blocks 2) Extensible instructions3) Parameter settings
Evaluate: Compiler, Linker, Assembler, Debugger, Instruction Set Simulator
Explore extensible processor
design space
OK?
Synthesizable RTL of: Base processor Predefined blocks Extensible instructions
Generate extensible processor
Synthesis and tape out
Proto-typing
Instruction Library
No Yes
PreviousPrevious Work Workss
• Profiling/IdentificationProfiling/Identification– [Binh 1998], [Yang 2002], ARC, Xtensa[Binh 1998], [Yang 2002], ARC, Xtensa
• Design methodology for different aspectsDesign methodology for different aspects– [Gupta 2000], [Jain 2001][Gupta 2000], [Jain 2001]
• Instruction generation/selectionInstruction generation/selection– [Brisk 1998], [Kastner 2001], [Sun 2003], [Zhao [Brisk 1998], [Kastner 2001], [Sun 2003], [Zhao
2002]2002]
• Overall design flow for extensible processorsOverall design flow for extensible processors– [Kathail 2002], [Lee 2002], [Sun 2002][Kathail 2002], [Lee 2002], [Sun 2002]
• Vendor and academic Vendor and academic – ARC, Lisatek, Xtensa, ASIP-MeisterARC, Lisatek, Xtensa, ASIP-Meister
• System-On-Chips Design ChallengesSystem-On-Chips Design Challenges• Extensible Processors PlatformExtensible Processors Platform• Background – XtensaBackground – Xtensa• Problems in Extensible Processors PlatformProblems in Extensible Processors Platform• The Goal of the researchThe Goal of the research• Proposed SolutionProposed Solution• INSIDE INSIDE • MINCEMINCE• ConclusionsConclusions
OutlineOutline
Goal of the ResearchGoal of the Research
• To To automateautomate the design flow of extensible the design flow of extensible processors platformprocessors platform..
• Given an application and design constraints, Given an application and design constraints, the system configures an extensible the system configures an extensible processor that processor that maximizes the performancemaximizes the performance of of an application while an application while satisfying the design satisfying the design constraintsconstraints..
Proposed SolutionProposed Solution
Application in C/C++
Compilation
Analysis and Profiling
Identify:1) Predefined blocks2) Extensible instructions3) Parameter settings
Evaluate: Compiler, Linker, Assembler, Debugger, Instruction Set Simulator
Explore extensible processor
design space
OK?
Synthesizable RTL of: Base processor Predefined blocks Extensible instructions
Generate extensible processor
Synthesis and tape out
Proto-typing
Instruction Library
No Yes
Define/Select: 1) Predefined blocks 2) Extensible instructions3) Parameter settings
Application in C/C++
Compilation
Analysis and Profiling
Identify:1) Predefined blocks2) Extensible instructions3) Parameter settings
Evaluate: Compiler, Linker, Assembler, Debugger, Instruction Set Simulator
Explore extensible processor
design space
OK?
Synthesizable RTL of: Base processor Predefined blocks Extensible instructions
Generate extensible processor
Synthesis and tape out
Proto-typing
Instruction LibraryINSIDE
system
Define: 1) P.B.2) E.I.3) P.S.
Select: 1) P.B.2) E.I.3) P.S.
Performance Estimation Model1) Information from profiling2) Synthesized information of instruction
Use libraries of pre-configured processor and
instructions to limit the design space
Fitting Function:- Finds suitable code segment to implement as extensible instruction
MINCE tool- Match code segment to instruction
No Yes
1.1. INSIDEINSIDE system system– Identifies sections of code segments which are Identifies sections of code segments which are
suitable for translation to instructions.suitable for translation to instructions.– A two-level hierarchy approach.A two-level hierarchy approach.– A performance estimatorA performance estimator..ReduceReducess the the design turnaround timedesign turnaround time significantly. significantly.
2.2. MINCEMINCE tool tool– Match code segment with pre-synthesized Match code segment with pre-synthesized
extensible instruction using combinational extensible instruction using combinational equivalence.equivalence.
Enhances Enhances reusabilityreusability of the extensible instructions. of the extensible instructions.
INSIDE / MINCE INSIDE / MINCE
• System-On-Chips Design ChallengesSystem-On-Chips Design Challenges• Extensible Processors PlatformExtensible Processors Platform• Background – XtensaBackground – Xtensa• Problems in Extensible Processors PlatformProblems in Extensible Processors Platform• The Goal of the researchThe Goal of the research• Proposed SolutionProposed Solution• INSIDE INSIDE • MINCEMINCE• ConclusionsConclusions
OutlineOutline
INSIDE system (Overview)INSIDE system (Overview)Processor 3
Base
ProcessorDSP
Processor 2
Base
ProcessorFPU
Processor 1
Base
Processor
Compile, Simulate, andProfile
Execution Time EstimationPhase IV
Heuristic Algorithm(Part 1: for Selecting pre-configured processor)
Phase I
Heuristic Algorithm(Part 2: for Selecting pre-designed
extensible instruction)
Phase IIIPre-designed Extensible
Library
Methodology for Identifying Code Segment, ImplementingExtensible Instruction, and Extending the Instruction Library
Phase II
Processor with coprocessor & extensible instructions Execution Time
Phase I: Heuristic Algorithm (Part I)Phase I: Heuristic Algorithm (Part I)
• For selecting pre-configured processorFor selecting pre-configured processor
• The The area delay productarea delay product, , EPEPii of processor of processor ii for a for a
certain applicationcertain application
AreaPeriodCCEP
#
1
ProcessorProcessor Clock CycleClock Cycle Clock PeriodClock Period AreaArea EPEP
P1P1 1200012000 6ns6ns 50005000 2.782.78
P2P2 80008000 8ns8ns 80008000 1.951.95
Phase II: Identify Code SegmentsPhase II: Identify Code Segments
• Problem: Problem: – Application has Application has millions of lines of C codemillions of lines of C code, how do , how do
we know which code segment is good for we know which code segment is good for converting to extensible instructionconverting to extensible instruction
• Fitting function Fitting function – Identifies code segments which are Identifies code segments which are suitablesuitable for for
translation to extensible instructionstranslation to extensible instructions– Extracts Extracts characteristicscharacteristics of the code segment of the code segment
Fitting FunctionFitting Function
• The four characteristics:The four characteristics:– The The frequency of usefrequency of use of a code segment of a code segment– The The number of operandsnumber of operands in a code segment in a code segment– The percentage of integer (short) The percentage of integer (short) type operandstype operands in all the in all the
operandsoperands– The percentage of The percentage of bit operationsbit operations in all the operands in all the operands
• The fitting function:The fitting function:
α α – the ideal number of operands in the code segment.– the ideal number of operands in the code segment.
......
1.. OBOT
ONUFctionFittingFun
RelationshipRelationship
• Relationship between the fitting function and the Relationship between the fitting function and the speedup/area ratio of the instructionspeedup/area ratio of the instruction
Voice Encoder
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
FittingFunction
Speedup/area
Phase III: Heuristic Algorithm (Part II)Phase III: Heuristic Algorithm (Part II)
• For selecting extensible instructionsFor selecting extensible instructions• The The potential speedup/area ratiopotential speedup/area ratio, , PSARPSAR, of , of
extensible instruction in processor:extensible instruction in processor:
max
_#_%
LatencyArea
SpeedupCCofPSAR
InstructionInstruction % of total % of total clock cycleclock cycle
SpeedupSpeedup AreaArea LatencyLatency PSARPSAR
Inst 1Inst 1 1313 3x3x 15001500 6ns6ns 4333343333
Inst 2Inst 2 1010 6x6x 25002500 6ns6ns 3428634286
Phase IV: Performance EstimationPhase IV: Performance Estimation
• For rapidly estimating the execution timeFor rapidly estimating the execution time• The The execution time estimationexecution time estimation, , ETEETE, for an extensible , for an extensible
processor with a set of selected extensible processor with a set of selected extensible instruction:instruction:
max
__ Latency
Speedup
affectedCCUnaffectedCCETE
INSIDE: INstruction Selection/Identification &Design Exploration for Extensible Processors
Pre-configured ProcessorLibrary
AreaConstraint
Processor with coprocessor &extensible instructions
Pre-designed InstructionLibrary
Execution Time
Application
ExhaustivelySimulation
All the solutions
Verify the solutionand system
Experimental MethodologyExperimental Methodology
ResultsResults
Results (Design Turnaround Time)Results (Design Turnaround Time)
100 160 110 90 90 45 150
3100
9600
3840
1800 1800
3600
24000
0
5000
10000
15000
20000
25000
adpcmencoder
gsm encoder gsm decoder g721encoder
g721decoder
mpeg2decoder
voicerecognition
Application
Tim
e (
min
ute
)
INSIDE
Full Exploration
Execution Time vs Area
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
5.5
0 50,000 100,000 150,000 200,000 250,000 300,000
Area [gates]
Exe
cutio
n T
ime
[se
con
d]
Results (Pareto Points)Results (Pareto Points)
Execution Time vs Area
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
5.5
0 50,000 100,000 150,000 200,000 250,000 300,000
Area [gates]
Exe
cutio
n T
ime
[se
con
d]
Results of Results of INSIDEINSIDE system system
• Mediabench applicationsMediabench applications• Speedup of the applications:Speedup of the applications:
– On average On average 2.03x2.03x (up to 7x) (up to 7x)
• Hardware overheads:Hardware overheads:– On average On average 25%25% (up to 72%) (up to 72%)
• Pareto points:Pareto points:– Obtained on average Obtained on average 83%83% (up to 100%) (up to 100%)– On average within On average within 5%5% of the execution time of the execution time
• System-On-Chips Design ChallengesSystem-On-Chips Design Challenges• Extensible Processors PlatformExtensible Processors Platform• Background – XtensaBackground – Xtensa• Problems in Extensible Processors PlatformProblems in Extensible Processors Platform• The Goal of the researchThe Goal of the research• Proposed SolutionProposed Solution• INSIDE INSIDE • MINCEMINCE• ConclusionsConclusions
OutlineOutline
The Aim of the The Aim of the MINCEMINCE
• MMatching atching ININstructions using structions using CCombinational ombinational EEquivalencequivalence
• Automatically Automatically matching code segmentsmatching code segments to pre- to pre-synthesized synthesized specific instructionsspecific instructions..
// Pre-synthesized specific instructions
state total 32
iclass ei EI {out arr, in art, in ars} {in state}
reference EI {
wire [31;0] tmp
assign tmp = TIEmul(art, ars, 1’b0) >> 4;
assign arr = tmp + state;
}
// Application in C
int main() {
… …
// Code segment
for (x = 0; x < 100000; x++) {
tmp = a[x] * b[x] << 4;
total += tmp;
}
… …
} Functional Equivalent
MotivationMotivationApplication in C/C++
Compilation
Analysis and Profiling
Identifying the “hotspots” of the application through simulation,
profiling, and trace
Designing extensible instructions for the “hotspots”
Testing/Verifying the functionality, speedup, area, power consumption
of extensible instruction
Explore extensible processor
design space
OK?
Synthesizable RTL of base processor and extensible instructions
Generate extensible processor
Synthesis and tape out
Proto-typing
Designed Extensible Instruction Library
No Yes
Selecting the extensible instructions based on the design
constraints
Design Constraints (Performance, Area, Power Consumption)
MINCE tool- Matching designed extensible instruction to the code segment
MINCE ToolMINCE Tool
• An An automated toolautomated tool for matching pre- for matching pre-synthesized extensible instruction to the synthesized extensible instruction to the functional equivalencefunctional equivalence of code segments of code segments using combinational equivalence checking in using combinational equivalence checking in the extensible processors platform.the extensible processors platform.
MINCE ToolMINCE Tool
• MINCE consists: MINCE consists: – A A translatortranslator– A A filtering filtering
algorithmalgorithm– A A combinational combinational
equivalence equivalence checkingchecking tool tool
ApplicationSoftware in
C/C++
Functional equivalence implementation
ExtensibleInstruction Library
(Verilog HDL)
Translator
Combinational Equivalence Checking Tool
MINCEtool
I
III
Filtering AlgorithmII
Related WorksRelated Works
• Simulation techniques:Simulation techniques:[Stadler 1999][Stadler 1999]
• Graph matching techniques:Graph matching techniques:[Corazao 1993] [Kang 1995] [Liem 1994] [Shu 1996][Corazao 1993] [Kang 1995] [Liem 1994] [Shu 1996]
• Equivalence verifications:Equivalence verifications:[Clarke 2003], [Pnueli 1998], [Semeria 2002][Clarke 2003], [Pnueli 1998], [Semeria 2002]
Our ContributionsOur Contributions
• Enhances Enhances reusabilityreusability of the extensible of the extensible instructions.instructions.
• MINCE tool is MINCE tool is superiorsuperior to computation- to computation-intensive and error-prone intensive and error-prone simulation simulation approaches.approaches.
• The usage of functional equivalence checking The usage of functional equivalence checking ensures that the results are largely ensures that the results are largely independentindependent of the programming styleof the programming style of the of the application.application.
Phase I: TranslatorPhase I: Translator
• The goal of the translator is to The goal of the translator is to convertconvert the application the application written in C/C++ to a set of code segment in Verilog written in C/C++ to a set of code segment in Verilog HDL using a HDL using a systematic approachsystematic approach..
• To reduce the To reduce the granularitygranularity of the application written in of the application written in C/C++.C/C++.
• The translator consists of The translator consists of fourfour steps: steps:– Separate the application into code segments;Separate the application into code segments;– Compile code segments;Compile code segments;– Convert to register transfer list;Convert to register transfer list;– Map to Verilog HDL file.Map to Verilog HDL file.
TranslatorTranslator
Compile (Step II)
Convert (Step III)
Map (Step IV)
Assembler Instruction HardwareLibrary (Verilog HDL)
... …
module mult (product, input1, input2);… …
endmodule
module add (sum, in1, in2);… …
endmodule
module sfr (out1, in1, amt);… …
endmodule
module sfl … ...
module cmpl … ...
Separate (Step I)
Application Software written in C/C++
// High-level code segmentint example (int sum, int input1, int input2) {
total = sum + (input1 * input2) >> 4;return total;
}
// Assembly Codemult R6, R1, R2mov R4, R6sar R4, $4add R5, R4, R3
// Register Transfer ListR6 = R1 * R2;R4 = R6;R4_1 = R4 >> 4;R5 = R3 + R4_1;
// Verilog Codemodule example (total, sum, input1, input2);
output [31:0] total;input [31:0] sum, input1, input2;wire [31:0] r1, r2, r3, r4, r4_1, r5;wire [63:0] r6;
r1 = input1;r2 = input2;r3 = sum;mult (r6, r1, r2);r4 = r6;sfr (r4_1, r4, 4);add (r5, r3, r4_1);total = r5;
endmodule
module mult (product, in1, in2);output [63:0] product;input [31:0] in1;input [31:0] in2;
… …endmodule
module add (sum, in1, in2);… …
endmodule
module sfr (out1, in1, amt);… …
endmodule
Phase II: Filtering AlgorithmPhase II: Filtering Algorithm
• The goal of the filtering algorithm is to The goal of the filtering algorithm is to eliminateeliminate the the unnecessary and complex Verilog HDL file into the unnecessary and complex Verilog HDL file into the combinational equivalence checking model.combinational equivalence checking model.
• Verilog HDL files can be pruned as non-match:Verilog HDL files can be pruned as non-match:– Differing Differing number of portsnumber of ports;;– Differing Differing port sizesport sizes;;– Insufficient Insufficient number of base hardware modulesnumber of base hardware modules to to
present complex module. present complex module.
Combinational Equivalence CheckingCombinational Equivalence Checking
• Binary Decision DiagramBinary Decision Diagram (BDD) (BDD) • For example,For example,
\\ High-level Code Segment S = (a + b) * 16;
\\ Extensible Instruction S = (a + b) << 4;
…
ai
bi
cii
cii
bi
cii
cii
10
S4
ai
bi
cii
cii
bi
cii
cii
10
S5
ai
bi
cii
cii
bi
cii
cii
10
ai
bi
cii
cii
bi
cii
cii
10
S7
ai
bi
cii
cii
bi
cii
cii
10
S8
ai
bi
cii
cii
bi
cii
cii
10
S9
ai
bi
cii
cii
bi
cii
cii
10
S10
ai
bi
cii
cii
bi
cii
cii
10
S11 S6 S0…S31……
…
ai
bi
cii
cii
bi
cii
cii
10
S4
ai
bi
cii
cii
bi
cii
cii
10
S5
ai
bi
cii
cii
bi
cii
cii
10
ai
bi
cii
cii
bi
cii
cii
10
S7
ai
bi
cii
cii
bi
cii
cii
10
S8
ai
bi
cii
cii
bi
cii
cii
10
S9
ai
bi
cii
cii
bi
cii
cii
10
S10
ai
bi
cii
cii
bi
cii
cii
10
S11 S6 S0…S31……
a1
b1
ci1 ci1
b1
ci1 ci1
10
S5
a1
b1
ci1 ci1
b1
ci1 ci1
10
S5
• Using Verification Interfacing with Synthesis (VIS) Using Verification Interfacing with Synthesis (VIS) which was jointly developed by the University of which was jointly developed by the University of California, in Berkeley and the University of Colorado, California, in Berkeley and the University of Colorado, Boulder.Boulder.
ResultsResults
3
3
3
3
1
1
1
1
11
EM: Exact matchEM: Exact matchFE: Functional equivalenceFE: Functional equivalenceDM: I/O match onlyDM: I/O match onlyTW: Do not matchTW: Do not match
Simulation vs MINCE (Time on H/W instruction with different kinds of S/W code segment)
0
20
40
60
80
100
120
EM FE
DM
TW EM FE
DM
TW EM FE
DM
TW EM FE
DM
TW EM FE
DM
TW EM FE
DM
TW EM FE
DM
TW EM FE
DM
TW EM FE
DM
TW EM FE
DM
TW
Instruction 1 Instruction 2 Instruction 3 Instruction 4 Instruction 5 Instruction 6 Instruction 7 Instruction 8 Instruction 9 Instruction 10
Time (minute)
SimulationMINCE
EM - Exact MatchFE - Functional EquivalentDM - Do Not Match (I/O only)TW - Totally Wrong (Do Not Match at all)
ResultsResults
Matching Time
80 75 74
10595
115
205
25 20 20
40 3521
40
10 8 9
2515 18
25
0
50
100
150
200
250
Adpcmencoder
G721encoder
G721decoder
Gsm encoder Gsm decoder MPEG2encoder
VoiceRecognition
Application
Tim
e (h
ou
r)
Simulation Technique
MINCE (w/o filtering alg.)
MINCE
Future Plan Future Plan
Application in C/C++
Compilation
Analysis and Profiling
Identify:1) Predefined blocks2) Extensible instructions3) Parameter settings
OK?
Synthesizable RTL of: Base processor Predefined blocks Extensible instructions
Generate extensible processor
Synthesis and tape out
Proto-typing
Instruction Library
INSIDE system
Define: 1) P.B.2) E.I.3) P.S.
Select: 1) P.B.2) E.I.3) P.S.
Performance Estimation Model1) Information from profiling2) Synthesized information of instruction
Use libraries of pre-configured processor and
instructions to limit the design space
Fitting Function:- Finds suitable code segment to implement as extensible instruction
MINCE tool- Match code segment to instruction
No Yes
Generate Extensible Instruction Automatically
Area, Latency, Power Model
ConclusionConclusion
• Extensible processors platform enables to Extensible processors platform enables to address three architectural levels in order to address three architectural levels in order to tunetune for for application specificapplication specific: : – Inclusion/Exclusion of predefined blocksInclusion/Exclusion of predefined blocks– Instructions extensionInstructions extension– ParameterizationsParameterizations
• The goal of the research is to The goal of the research is to automateautomate the the design flow of extensible processor platformdesign flow of extensible processor platform..
• INSIDE system / MINCE toolINSIDE system / MINCE tool