Designing Programmable Platforms: From ASIC to ASIP MPSoC 2005 Heinrich Meyr CoWare Inc., San Jose and Integrated Signal Processing Systems (ISS), Aachen University of Technology, Germany


Page 1: Designing Programmable Platforms: From ASIC to ASIPflavio/ensino/cmp237/aula20.pdf · Current Work Evaluation Results Chip Area, Clock Speed, Power Consumption SystemC, VHDL, Verilog

Designing Programmable Platforms: From ASIC to ASIP

MPSoC 2005
Heinrich Meyr

CoWare Inc., San Jose and
Integrated Signal Processing Systems (ISS),
Aachen University of Technology, Germany


Agenda

Facts & Conclusions

Heterogeneous MPSoC
» Energy Efficiency vs. Flexibility
» How to explore the Design Space?

ASIP Design

Economics of SoC Development

Conclusions



Facts & Conclusion


Core Proposition

ASIP-based Platforms (heterogeneous MPSoC)


Agenda

Facts & Conclusions

Heterogeneous MPSoC
» Energy Efficiency vs. Flexibility
» How to explore the Design Space?

ASIP Design

Economics of SoC Development

Conclusions



Trade-off between Flexibility and Energy-Efficiency

Heterogeneous MPSoC


Architectural Objectives

We need more MOPS/W and MOPS/mm² to minimize the global performance measure for battery-driven devices:

Energy per decoded bit (J/bit)
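The figures of merit above can be sketched as follows, assuming that average power, sustained MOPS, area, and decoded throughput are known for a given implementation. Energy per decoded bit follows from J/bit = (J/s) / (bit/s). The function names are illustrative and do not come from the talk.

```c
#include <assert.h>

/* Energy per decoded bit [J/bit]: average power [W = J/s] divided by
 * decoded throughput [bit/s]. */
double energy_per_bit(double avg_power_w, double decoded_bits_per_s)
{
    assert(decoded_bits_per_s > 0.0);
    return avg_power_w / decoded_bits_per_s;
}

/* Computational efficiency [MOPS/W]: sustained operation rate / average power. */
double mops_per_watt(double mops, double avg_power_w)
{
    assert(avg_power_w > 0.0);
    return mops / avg_power_w;
}

/* Area efficiency [MOPS/mm^2]: sustained operation rate / silicon area. */
double mops_per_mm2(double mops, double area_mm2)
{
    assert(area_mm2 > 0.0);
    return mops / area_mm2;
}
```

For example, a hypothetical decoder that draws 0.5 W while delivering 10 Mbit/s spends energy_per_bit(0.5, 10e6) = 5e-8 J, or 50 nJ per decoded bit.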


Computational Efficiency vs. Flexibility

Source: T. Noll, RWTH Aachen


Enabling MP-SoC Design


System Level Tools I: Application & IP Creation

• System application design and algorithmic exploration (algorithm domain): Matlab, SPW, System Studio

• High-level IP block design, block specification (Architecture Description Language): LISATek Processor Synthesis, ConvergenSC Buscompiler

• Block implementation (micro-architecture domain): RTL Synthesis


System Level Tools II: MP-SoC Platform Design

• System application design and algorithmic exploration (algorithm domain): Matlab, SPW, System Studio

• MP-SoC platform design: abstract architecture (MP-SoC Intermediate Representation) and virtual prototype (SystemC Transaction Level Modeling, ConvergenSC Platform Creator)

• High-level IP block design, block specification (Architecture Description Language): LISATek Processor Synthesis, ConvergenSC Buscompiler

• Block implementation (micro-architecture domain): RTL Synthesis


Agenda

Facts & Conclusions

Heterogeneous MPSoC
» Energy Efficiency vs. Flexibility
» How to explore the Design Space?

ASIP Design

Economics of SoC Development

Conclusions



Processor Design Space

[Figure: processor core (pipeline FE, DC, EX, WB) with cache, MMU, memory, and peripherals; execution units butterfly 0, butterfly 1, and load/store.]

• Instruction-set design: VLIW, SIMD, ...? Which instructions for compiler support? Instruction encoding? How many general-purpose registers?
• Micro-architecture design: Pipeline length? Bypass? Shared resources? Parallel execution units? Exploit regularity/parallelism in data flow/data storage.
• RTL design: Area constraints met? Clock frequency?
• SoC integration: Which cache is required? Is the bus fast enough? Communication?

Design tasks: instruction-set design, compiler design, micro-architecture design, RTL design, RTL/ISS co-verification, system integration, embedded software, simulation.

Optimal design requires powerful tools and automation!

MESCAL 2: Inclusively identify the architectural space


Architecture Description Language based Processor Design

The purpose of an architecture description language (e.g. LISA) is:

» To allow for an iterative design to efficiently explore architecture alternatives
» To jointly design "Architecture-Compiler" and on-chip communication
» To automatically generate hardware (path to implementation)
» To automatically generate tools: assembler, linker, compiler, simulator, co-simulation interfaces

all from a single model at various levels of temporal and spatial abstraction.

MESCAL 3: Efficiently describe the ASIP


LISA 2.0 - Abstraction Levels

Models range from no details to very detailed along two axes, time accuracy and architecture accuracy:

• High-level model: pseudo instructions; pseudo resources (e.g. C variables)
• Instruction-accurate model: processor instructions; functional units, registers, memories
• Cycle-accurate model: cycles; + pipelines
• Phase-accurate model: phases; + IRQ, etc.


FFT Processor

Application → describe/adopt processor model → LISATek Processor Designer → generate tools:

• Starting points: LISATek IP samples (DSP sample, VLIW sample, RISC sample, empty model) or a custom processor model (LISA 2.0 language)
• Generated: software tool chain, executable software platform, RTL, and RTL SoC integration kit (e.g. SystemC)

Function- and instruction-level profiling reveals hot spots → special-purpose instructions.

Rapid modeling and retargetable simulation + code generation allow for joint optimization of application and architecture.

MESCAL 3: Efficiently describe and evaluate the ASIP

MESCAL 5: Successfully deploy the ASIP
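The profiling step can be made concrete with a minimal sketch. It turns an execution trace of program-counter values, assumed to come from a generated instruction-set simulator, into an instruction-level profile and then into a function-level hot spot. The `FuncRange` table, the function names, and the address ranges in the test are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical symbol table entry: function occupies addresses [start, end). */
typedef struct { const char *name; unsigned start, end; } FuncRange;

/* Instruction-level profile: counts[pc] = number of times pc was executed.
 * counts must provide mem_size entries. */
void profile_instructions(const unsigned *trace, size_t n,
                          unsigned *counts, unsigned mem_size)
{
    for (unsigned pc = 0; pc < mem_size; ++pc)
        counts[pc] = 0;
    for (size_t i = 0; i < n; ++i) {
        assert(trace[i] < mem_size);
        counts[trace[i]]++;
    }
}

/* Function-level profile: returns the index of the function whose address
 * range accumulated the most executed instructions (the hot spot). */
size_t hottest_function(const FuncRange *f, size_t nf, const unsigned *counts)
{
    size_t best = 0;
    unsigned long best_total = 0;
    for (size_t k = 0; k < nf; ++k) {
        unsigned long total = 0;
        for (unsigned pc = f[k].start; pc < f[k].end; ++pc)
            total += counts[pc];
        if (total > best_total) {
            best_total = total;
            best = k;
        }
    }
    return best;
}
```

Take functions `main` at [0, 4) and `fir` at [4, 8), with a trace that runs `fir`'s body three times. Here `hottest_function` returns index 1 (`fir`), which makes `fir` the candidate for a special-purpose instruction.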


Current Work

[Figure: ADL-driven design flow]

EXPLORATION: target architecture → LISA description → LISA compiler generates C compiler, assembler, linker, and simulator → model verification & evaluation → evaluation results (profile information, application performance) → optimization of the LISA description

IMPLEMENTATION: LISA description → HDL generator → SystemC, VHDL, Verilog output → gate-level synthesis → evaluation results (chip area, clock speed, power consumption)

Current work:
• Instruction set synthesis
• Memory architecture
• Verification

MESCAL 3: …evaluate the ASIP


June 10, 2004

A Novel Approach for Flexible and Consistent ADL-driven ASIP Design

Gunnar Braun, Achim Nohl
CoWare, Inc., DAC Booth #1844, www.CoWare.com

Weihua Sheng, Jianjiang Ceng, Manuel Hohenauer, Hanno Scharwächter, Rainer Leupers, Heinrich Meyr
Integrated Signal Processing Systems (ISS), Aachen University of Technology, Germany


Introduction

Architecture Description Languages (ADL)

• Automatic generation of software toolkit (compiler, assembler, linker, IS simulator)
• Architecture exploration
• SystemC models, RTL code, verification tools, ...

Challenges:

• Different tools need different information
• Unambiguous, redundancy-free architecture model (rather than tools description)
• Multiple abstraction levels (instruction-accurate and/or cycle-accurate)


Tool Requirements: Compiler

The C compiler needs to know which C operation (expression tree) each instruction covers:

• add rd = rs, rt covers rd ← rs + rt
• mul rd = rs, rt covers rd ← rs * rt
• ld rd = @ covers rd ← LD(@)
• st @ = rs covers ST(@, rs)

Example: C code a = b + c; → C compiler → assembly add a = b, c
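Here is a minimal sketch of how a compiler could apply the add and mul patterns. It covers a C expression tree bottom-up and emits assembly in the slide's `op rd = rs, rt` syntax, with the destination first. The IR types and function names are hypothetical. Operands are assumed to be in registers already, so the ld/st patterns are not needed. Production compilers use cost-driven tree-pattern matchers rather than this hand-written recursion.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical expression IR: leaves are variables assumed to live in
 * registers named after them; inner nodes are + and *. */
typedef enum { N_VAR, N_ADD, N_MUL } Kind;

typedef struct Node {
    Kind kind;
    const char *name;           /* N_VAR only */
    const struct Node *l, *r;   /* N_ADD and N_MUL only */
} Node;

static char temps[16][8];      /* names of compiler temporaries t0..t15 */

/* Appends one line of assembly to out (a buffer of cap bytes). */
static void emit(char *out, size_t cap, const char *line)
{
    strncat(out, line, cap - strlen(out) - 1);
    strncat(out, "\n", cap - strlen(out) - 1);
}

/* Covers the tree bottom-up with the "add rd = rs, rt" and "mul rd = rs, rt"
 * patterns; returns the register holding the result. dst names the result
 * register of the root; inner nodes get fresh temporaries. */
static const char *select_insn(const Node *n, const char *dst,
                               char *out, size_t cap, int *ntmp)
{
    if (n->kind == N_VAR)
        return n->name;
    const char *rs = select_insn(n->l, NULL, out, cap, ntmp);
    const char *rt = select_insn(n->r, NULL, out, cap, ntmp);
    if (dst == NULL) {
        assert(*ntmp < 16);
        snprintf(temps[*ntmp], sizeof temps[0], "t%d", *ntmp);
        dst = temps[(*ntmp)++];
    }
    char line[64];
    snprintf(line, sizeof line, "%s %s = %s, %s",
             n->kind == N_ADD ? "add" : "mul", dst, rs, rt);
    emit(out, cap, line);
    return dst;
}

/* Compiles the C statement "dst = tree;" into assembly text in out. */
void compile_assign(const char *dst, const Node *tree, char *out, size_t cap)
{
    int ntmp = 0;
    assert(cap > 0);
    out[0] = '\0';
    select_insn(tree, dst, out, cap, &ntmp);
}
```

For `a = b + c;` this produces `add a = b, c`. For `a = b + c * d;` it produces `mul t0 = c, d` followed by `add a = b, t0`. A plain copy such as `a = b;` would need a move pattern, which the slide does not list.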


Tool Requirements: Simulator

The simulator needs, for each instruction, the sequence of operations that executes it:

add rd = rs, rt: ALU_read (rs, rt); ALU_add (); Update_flags (); writeback (rd);
mul rd = rs, rt: MUL_read (rs, rt); MUL_add (); Update_flags (); writeback (rd);
ld rd = @: LSU_addrgen(); data_bus.req(); data_bus.read(); writeback (rd);
st @ = rs: LSU_addrgen(); LSU_read(rs); data_bus.req(); data_bus.write(rs);

Example: machine code add r5 = r2, r1 → simulator → simulation code (C):
ALU_read (r2, r1); ALU_add (); Update_flags (); writeback (r5);
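The sketch below mirrors that table as an interpretive simulator: each decoded instruction is dispatched at run time to its sequence of operations. The state layout, the decoded `Insn` form, and the operation bodies are assumptions. The ld and st instructions are omitted. The multiply step is named `MUL_mul` here for readability, where the slide writes `MUL_add`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical processor state: 16 registers, a zero flag, and the
 * operand/result latches through which the operations communicate. */
typedef struct {
    uint32_t r[16];
    int zero_flag;
    uint32_t op_a, op_b, result;
} Cpu;

typedef enum { OP_ADD, OP_MUL } Opcode;
typedef struct { Opcode op; unsigned rd, rs, rt; } Insn;   /* decoded form */

/* Operations named after the slide's simulation code. */
static void ALU_read(Cpu *c, unsigned rs, unsigned rt) { c->op_a = c->r[rs]; c->op_b = c->r[rt]; }
static void ALU_add(Cpu *c)                { c->result = c->op_a + c->op_b; }
static void MUL_read(Cpu *c, unsigned rs, unsigned rt) { ALU_read(c, rs, rt); }
static void MUL_mul(Cpu *c)                { c->result = c->op_a * c->op_b; }
static void Update_flags(Cpu *c)           { c->zero_flag = (c->result == 0); }
static void writeback(Cpu *c, unsigned rd) { c->r[rd] = c->result; }

/* Interpretive simulation: each instruction is dispatched at run time to
 * the sequence of operations that implements it. */
void simulate(Cpu *c, const Insn *prog, int n)
{
    for (int pc = 0; pc < n; ++pc) {
        const Insn *i = &prog[pc];
        assert(i->rd < 16 && i->rs < 16 && i->rt < 16);
        switch (i->op) {
        case OP_ADD:   /* add rd = rs, rt */
            ALU_read(c, i->rs, i->rt); ALU_add(c); Update_flags(c); writeback(c, i->rd);
            break;
        case OP_MUL:   /* mul rd = rs, rt */
            MUL_read(c, i->rs, i->rt); MUL_mul(c); Update_flags(c); writeback(c, i->rd);
            break;
        }
    }
}
```

With r1 = 40 and r2 = 2, simulating `add r5 = r2, r1` leaves 42 in r5.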


ADL Model

A single ADL model feeds both tools: the C compiler (a = b + c; → add a = b, c), which uses the SEMANTICS section, and the simulator (add r5 = r2, r1 → ALU_read (r2, r1); ALU_add (); Update_flags (); writeback (r5);), which uses the BEHAVIOR section.

SYNTAX { "ADD" dst, src1, src2 }
CODING { 0b0010 dst src1 src2 }
BEHAVIOR { ALU_read (src1, src2); ALU_add (); Update_flags (); writeback (dst); }
SEMANTICS { src1 + src2 → dst; }


Problem Statement

• Compiler and simulator need different information:
  • Compiler: C operation to instruction(s). WHAT is the instruction good for? What is its purpose?
  • Simulator: instruction to sequence of operations. HOW is the instruction executed? What actions must be performed?

• Architecture designer's perspective: ???
  src1 + src2 → dst;
  ALU_read (src1, src2); ALU_add (); Update_flags (); writeback (dst);


Examples


ASDSP FPGA Implementation

ASDSP core design:
• Supports a special instruction set for FFT operations and BMU instructions
• Improves performance for OFDM communication

FPGA implementation: iProve, Xilinx xc2v6000

SEC 0.18 µm synthesis:
• Gates: 77,000
• Program memory: 4 Kbyte, data memory: 8 Kbyte
• Frequency: 290 MHz
• Power consumption: 0.87 W (3 mW/MHz)

Myung Sunwoo, Ajou University


The ICORE

A low-power ASIP for the Infineon DVB-T 2nd-generation single-chip receiver:

• ASIP for DVB-T acquisition and tracking algorithms (sampling-clock synchronization, interpolation/decimation, carrier frequency offset estimation)
• Harvard architecture
• 60 mostly RISC-like instructions and special instructions for the CORDIC algorithm
• 8 × 32-bit general-purpose registers, 4 × 9-bit address registers
• 2048 × 20-bit instruction ROM, 512 × 32-bit data memory
• I2C registers and dedicated interfaces for external communication
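The slides do not specify the ICORE's CORDIC instructions. As a generic illustration only, the sketch below implements vectoring-mode CORDIC. It computes a phase angle atan2(y, x) using only additions, subtractions, and scaling by powers of two. Phase computation of this kind appears in carrier frequency offset estimation. Floating point is used for readability, whereas an ASIP would implement the scaling as fixed-point shifts. The function name is hypothetical.

```c
#include <assert.h>

#define CORDIC_ITERS 16
#define CORDIC_PI 3.14159265358979323846

/* atan(2^-i) in radians for i = 0..15 (a small ROM table in hardware). */
static const double atan_tab[CORDIC_ITERS] = {
    0.7853981633974483,     0.4636476090008061,     0.24497866312686414,
    0.12435499454676144,    0.06241880999595735,    0.031239833430268277,
    0.015623728620476831,   0.007812341060101111,   0.0039062301319669718,
    0.0019531225164788188,  0.0009765621895593195,  0.0004882812111948983,
    0.00024414062014936177, 0.00012207031189367021, 6.103515617420877e-05,
    3.0517578115526096e-05
};

/* Vectoring-mode CORDIC: rotates (x, y) onto the positive x-axis with
 * shift-and-add steps, accumulating the rotation angle. Returns an
 * approximation of atan2(y, x) in radians. */
double cordic_atan2(double y, double x)
{
    double z = 0.0;   /* accumulated angle */
    double p = 1.0;   /* 2^-i: a right shift in fixed-point hardware */
    assert(!(x == 0.0 && y == 0.0));
    if (x < 0.0) {    /* pre-rotate by pi into the right half-plane */
        z = (y >= 0.0) ? CORDIC_PI : -CORDIC_PI;
        x = -x;
        y = -y;
    }
    for (int i = 0; i < CORDIC_ITERS; ++i) {
        double xn;
        if (y > 0.0) {            /* rotate clockwise by atan(2^-i) */
            xn = x + y * p;
            y  = y - x * p;
            z += atan_tab[i];
        } else {                  /* rotate counter-clockwise */
            xn = x - y * p;
            y  = y + x * p;
            z -= atan_tab[i];
        }
        x = xn;
        p *= 0.5;
    }
    return z;
}
```

After 16 iterations the residual angle is bounded by atan(2^-15) ≈ 3e-5 rad. For example, `cordic_atan2(1.0, 1.0)` returns a value within that bound of π/4.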


Increasing SW Content, but How? The Motorola M68HC11 Architecture


Architecture Overview

M68HC11 CPU architecture:
» 8-bit microcontroller
» Von Neumann architecture: shared data and program bus
» 7 CPU registers
» 6 different addressing modes
» Instruction width: 8, 16, 24, 32, 40 bits
» 8-bit opcode: 181 instructions
» Clock speed: ~200 MHz
» Area: 15K to 30K (DesignWare® Library)

Hot spots:
» multi-cycle fetch
» stalled data access
» non-pipelined


Architecture Development with LISA

[Figure: modified M68HC11 with pipeline stages FE, DC, EX; registers Accu A and Accu B (ACCU), Index X, Index Y, Stack Pointer, Condition; memory map from 0x0000 to 0x10000 with 512 bytes internal RAM, 64 bytes configuration registers, 3.5K external RAM, and 61K external RAM; bus widths labeled 16 and 32.]

+ pipelined architecture
+ separate program and data bus


Results

• Area < 23k gates

• Clock speed ~200 MHz

• Execution-time speedup of 62% for a spanning tree application

• Mapped onto Xilinx FPGA


Architecture Development with LISA

• Studying the architecture: 4 days
• Basic architecture modifications: 2 days
• Grouping and coding of the instructions: 1 day
• Writing the LISA model
  - basic syntax and coding: 4 days
  - behavior section: 6 days
• Validation: 4 days
• HDL generation: 2 days

Total: 23 days


Institute for Integrated Signal Processing Systems

Design of Application Specific Processor Architectures

Rainer Leupers, RWTH Aachen University
Software for Systems on Silicon


2005 © R. Leupers

Overview

1. Introduction
2. ASIP design methodologies
3. Software tools
4. ASIP architecture design
5. Case study
6. Advanced research topics


1. Introduction


Embedded system design automation

Embedded systems
• Special-purpose electronic devices
• Very different from desktop computers

Strength of European IT market
• Telecom, consumer, automotive, medical, ...
• Siemens, Nokia, Bosch, Infineon, ...

New design requirements
• Low NRE cost, high efficiency requirements
• Real-time operation, dependability
• Keep pace with Moore's Law


What to do with chip area?


Example: wireless multimedia terminals

Multistandard radio: UMTS, GSM/GPRS/EDGE, WLAN, Bluetooth, UWB, …

Multimedia standards: MPEG-4, MP3, AAC, GPS, DVB-H, …

Key issues:

• Time to market (≤ 12 months)
• Flexibility (ongoing standard updates)
• Efficiency (battery operation)


Application specific processors (ASIPs)

"As the performance of conventional microprocessors improves, they first meet and then exceed the requirements of most computing applications. Initially, performance is key. But eventually, other factors, like customization, become more important to the customer..."

[M.J. Bass, C.M. Christensen: The Future of the Microprocessor Business, IEEE Spectrum 2002]

$$\text{design budget} = (\text{semiconductor revenue}) \times (\%\ \text{for R\&D})$$

with semiconductor revenue growth ≈ 15% and an R&D share ≈ 10%;

$$\#\ \text{IC designs} = \frac{\text{design budget}}{\text{design cost per IC}}$$

with design budget growth ≈ 15% and design cost per IC growth ≈ 50-100%.

[Keutzer05]

→ Customizable application-specific processors as reusable, programmable platforms
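A small worked example shows where the argument leads. Assume, for illustration, that the design budget grows 15% and the design cost per IC grows 50% per period; the slide does not state the period. The number of IC designs then shrinks by a factor of 1.15 / 1.5 ≈ 0.77 per period. The function name is hypothetical.

```c
#include <assert.h>

/* Relative number of IC designs after n periods (1.0 at the start), when the
 * design budget grows by budget_growth and the design cost per IC grows by
 * cost_growth per period, both given as fractions (0.15 means 15%). */
double relative_ic_designs(double budget_growth, double cost_growth, int n)
{
    assert(n >= 0 && cost_growth > -1.0);
    double designs = 1.0;
    for (int i = 0; i < n; ++i)
        designs *= (1.0 + budget_growth) / (1.0 + cost_growth);
    return designs;
}
```

At the upper 100% cost growth, `relative_ic_designs(0.15, 1.00, 2)` ≈ 0.33. After two periods, only about a third as many designs can be afforded, which is the pressure toward reusable, programmable platforms.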


Efficiency and flexibility

Source: T. Noll, RWTH Aachen

[Figure: log power dissipation and log performance plotted against log flexibility for general-purpose processors, digital signal processors, application-specific instruction set processors, field-programmable devices, application-specific ICs, and physically optimized ICs; spans marked 10^3 . . . 10^4 and 10^5 . . . 10^6; regions labeled HW design and SW design.]

Why use ASIPs?
• Higher efficiency for a given range of applications
• IP protection
• Cost reduction (no royalties)
• Product differentiation


2. ASIP design methodologies


ASIP architecture exploration

[Figure: exploration loop. The application is run through compiler, assembler, linker, simulator, and profiler, starting from an initial processor architecture; the profiling results drive repeated refinements until an optimized processor architecture is reached.]


Expression (UC Irvine)


Tensilica Xtensa/XPRES

Source: Tensilica Inc.


MIPS CorExtend/CoWare CorXpert

1. Profile and identify custom instructions (hotspot)
2. Replace critical code with a special instruction (user-defined instruction in the CorExtend module)
3. Synthesize HW and profile with MIPSsim and extensions
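Step 2 can be illustrated with a hypothetical example. The hot spot is an element-wise saturating 16-bit addition. The user-defined instruction `udi_sadd16x2` performs two such additions on packed 32-bit operands; it is an invented name, not a MIPS or CoWare API. It is modeled here as a C function, the way a simulator extension might model it before hardware is synthesized. The rewritten loop must produce bit-identical results to the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate to the 16-bit signed range. */
static int16_t sat16(int32_t v)
{
    return (int16_t)(v > INT16_MAX ? INT16_MAX : v < INT16_MIN ? INT16_MIN : v);
}

/* Original hot spot: element-wise saturating 16-bit addition. */
void sadd_sw(const int16_t *a, const int16_t *b, int16_t *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = sat16((int32_t)a[i] + b[i]);
}

/* Hypothetical user-defined instruction: two saturating 16-bit additions on
 * packed 32-bit operands. One instruction in hardware; a C model here. */
static uint32_t udi_sadd16x2(uint32_t x, uint32_t y)
{
    int16_t lo = sat16((int32_t)(int16_t)(x & 0xFFFFu) + (int16_t)(y & 0xFFFFu));
    int16_t hi = sat16((int32_t)(int16_t)(x >> 16) + (int16_t)(y >> 16));
    return (uint32_t)(uint16_t)hi << 16 | (uint16_t)lo;
}

/* Hot spot rewritten to issue one special instruction per sample pair
 * (n is assumed even for brevity). */
void sadd_udi(const int16_t *a, const int16_t *b, int16_t *out, int n)
{
    assert(n % 2 == 0);
    for (int i = 0; i < n; i += 2) {
        uint32_t pa = (uint16_t)a[i] | (uint32_t)(uint16_t)a[i + 1] << 16;
        uint32_t pb = (uint16_t)b[i] | (uint32_t)(uint16_t)b[i + 1] << 16;
        uint32_t r = udi_sadd16x2(pa, pb);
        out[i]     = (int16_t)(r & 0xFFFFu);
        out[i + 1] = (int16_t)(r >> 16);
    }
}
```

Comparing `sadd_sw` and `sadd_udi` on the same inputs, including overflowing ones such as 30000 + 10000, is the kind of check step 3 relies on before profiling the extended processor.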


CoWare LISATek ASIP architecture exploration

• Integrated embedded processor development environment
• Unified processor model in the LISA 2.0 architecture description language (ADL)
• Automatic generation of:
  - SW tools
  - HW models


LISA operation hierarchy

[Figure: operation tree from main through decode, with nodes addr, cond, opcode, opnds; imm, linear, cycl, control, arithm, move, short, long; and leaves add, sub, mul, and, or.]

Reflects hierarchical organization of ISAs


LISA operations structure

A LISA operation consists of these sections:
• DECLARE: references to other operations
• CODING: binary coding
• SYNTAX: assembly syntax
• BEHAVIOR: computation and processor state update
• EXPRESSION: resource access, e.g. registers
• ACTIVATION: initiate "downstream" operations in pipe
• SEMANTICS: C compiler generation


LISA operation example

    OPERATION ADD {
        DECLARE {
            GROUP src1, src2, dest = { Register }
        }
        CODING { 0b1011 src1 src2 dest }
        SYNTAX { "ADD" dest "," src1 "," src2 }
        BEHAVIOR { dest = src1 + src2; }
    }

    OPERATION Register {
        DECLARE {
            LABEL index;
        }
        CODING { index }
        SYNTAX { "R" index }
        EXPRESSION { R[index] }
    }

The BEHAVIOR section holds C/C++ code. In the resulting operation tree, ADD refers to three Register operations: src1, src2, and dest.
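The following hand-written C sketch shows how tools might be derived from these two operations: a disassembler and a simulator step for ADD. It assumes a 4-bit register index, which makes the instruction word 16 bits wide (4 opcode bits plus three 4-bit Register fields). The slide does not give the index width, and real LISATek-generated code looks different. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed layout: [15:12] = 0b1011, [11:8] = src1, [7:4] = src2, [3:0] = dest. */

static uint32_t R[16];   /* EXPRESSION { R[index] } */

/* Register operation, CODING { index }: 4-bit field at the given bit offset. */
static unsigned reg_index(uint16_t word, int shift) { return (word >> shift) & 0xFu; }

/* Returns 1 if word matches ADD's CODING { 0b1011 src1 src2 dest }. */
int is_add(uint16_t word) { return (word >> 12) == 0xBu; }

/* SYNTAX { "ADD" dest "," src1 "," src2 } using Register SYNTAX { "R" index }. */
void disassemble_add(uint16_t word, char *buf, size_t len)
{
    assert(is_add(word));
    snprintf(buf, len, "ADD R%u,R%u,R%u",
             reg_index(word, 0), reg_index(word, 8), reg_index(word, 4));
}

/* BEHAVIOR { dest = src1 + src2; } */
void execute_add(uint16_t word)
{
    assert(is_add(word));
    R[reg_index(word, 0)] = R[reg_index(word, 8)] + R[reg_index(word, 4)];
}
```

The word 0xB123 decodes as src1 = R1, src2 = R2, and dest = R3, so it disassembles to `ADD R3,R1,R2`. The nested Register operation contributes the same field extraction to both the syntax and the behavior.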


Exploration/debugger GUI

• Application simulation
• Debugging
• Profiling
• Resource utilization analysis
• Pipeline analysis
• Processor model debugging
• Memory hierarchy exploration
• Code coverage analysis
• ...


Some available LISA 2.0 models

• DSP: Texas Instruments TMS320C54x, Analog Devices ADSP21xx, Motorola 56000
• RISC: MIPS32 4K, ESA LEON SPARC 8, ARM7100, ARM926
• VLIW: Texas Instruments TMS320C6x, STMicroelectronics ST220
• µC: MHS 80C51
• ASIP: Infineon PP32 NPU, Infineon ICore, MorphICs DSP


3. Software tools


Tools generated from processor ADL model

[Figure: the application is processed by the generated compiler, assembler, linker, simulator, and profiler.]


Instruction set simulation

Interpretive:
• flexible
• slow (~100 KIPS)
At run time, each application instruction is fetched from program memory, decoded, and executed.

Compiled:
• fast (> 10 MIPS)
• inflexible
• high memory consumption
At compile time, a simulation compiler translates the application in program memory into compiled simulation code containing each instruction's behavior; at run time, this code is only executed.

JIT-CCS™:
• "just-in-time" compiled
• SW simulation cache
• fast and flexible
At run time, instructions are decoded on first use and their instruction behavior is stored in a compiled simulation cache, from which it is executed.
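The JIT-CCS principle can be sketched as a direct-mapped cache of decoded instructions, indexed by program address. An instruction is decoded on its first execution (a miss), and later executions reuse the decoded record. The toy ISA, the record layout, and the function names are hypothetical. JIT-CCS itself caches compiled simulation code rather than this simple decoded form.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy ISA, 16-bit words: [15:12] opcode, [11:8] a, [7:4] b, [3:0] c. */
enum { OP_HALT = 0x0, OP_ADD = 0x2, OP_DEC = 0x3, OP_BNZ = 0x4 };

typedef struct { uint32_t r[16]; unsigned pc; int halted; } Cpu;

/* Decoded instruction: behavior function plus pre-extracted operand fields. */
typedef struct Decoded {
    void (*exec)(Cpu *, const struct Decoded *);
    unsigned a, b, c;
} Decoded;

static void ex_halt(Cpu *s, const Decoded *d) { (void)d; s->halted = 1; }
static void ex_add(Cpu *s, const Decoded *d)  { s->r[d->a] = s->r[d->b] + s->r[d->c]; s->pc++; }
static void ex_dec(Cpu *s, const Decoded *d)  { s->r[d->a]--; s->pc++; }
/* BNZ: branch to the 8-bit target (b:c) if r[a] != 0. */
static void ex_bnz(Cpu *s, const Decoded *d)  { s->pc = s->r[d->a] ? (d->b << 4 | d->c) : s->pc + 1; }

/* The work the simulation cache saves: decoding an instruction word. */
static Decoded decode(uint16_t w)
{
    Decoded d = { ex_halt, (w >> 8) & 0xFu, (w >> 4) & 0xFu, w & 0xFu };
    switch (w >> 12) {
    case OP_ADD: d.exec = ex_add; break;
    case OP_DEC: d.exec = ex_dec; break;
    case OP_BNZ: d.exec = ex_bnz; break;
    default:     d.exec = ex_halt; break;   /* HALT and unknown opcodes stop */
    }
    return d;
}

#define MAX_RECORDS 64

typedef struct { int valid; unsigned tag; Decoded d; } Record;

/* Runs prog until HALT with a direct-mapped cache of nrec decoded records,
 * indexed by pc % nrec. Returns the number of misses (decodes) and stores
 * the number of executed instructions in *executed. Assumes program memory
 * is not modified during the run. */
unsigned run_cached(Cpu *s, const uint16_t *prog, unsigned len,
                    unsigned nrec, unsigned long *executed)
{
    Record cache[MAX_RECORDS] = {{0}};
    unsigned misses = 0;
    assert(nrec >= 1 && nrec <= MAX_RECORDS);
    *executed = 0;
    while (!s->halted) {
        assert(s->pc < len);
        Record *rec = &cache[s->pc % nrec];
        if (!rec->valid || rec->tag != s->pc) {   /* miss: decode and fill */
            rec->d = decode(prog[s->pc]);
            rec->tag = s->pc;
            rec->valid = 1;
            misses++;
        }
        rec->d.exec(s, &rec->d);                  /* hit path: execute only */
        (*executed)++;
    }
    return misses;
}
```

Consider a five-iteration loop of four instructions. A 4-record cache decodes each instruction once: 4 misses for 16 executed instructions. A 2-record cache suffers conflict misses between addresses 0 and 2, for 12 misses in total. This mirrors the dependence on simulation cache size reported in the JIT-CC performance measurements.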


JIT-CC simulation performance

[Figure: simulation performance (MIPS, 0 to 9) and cache miss ratio (%, 0 to 100) vs. cache size (8 to 32768 records), compared with compiled and interpretive simulation.]

• Dependent on simulation cache size
• 95% of compiled simulation performance @ 4096 cache blocks (10% of the memory consumption of compiled simulation)
• Example: ST200 VLIW DSP


Why care about C compilers?

• Embedded SW design is becoming the predominant manpower factor in system design
• Cannot develop/maintain millions of code lines in assembly language
• Move to high-level programming languages


Why care about compilers?

• Trend towards heterogeneous multiprocessor systems-on-chip (MPSoC)
• Customized application-specific instruction set processors (ASIPs) are key MPSoC components
• How to achieve efficient compiler support for ASIPs?

[Figure: example MPSoCs built from ASIC, CPU, and ASIP blocks with local memories.]


C compiler in the exploration loop

"Compiler/Architecture Co-Design"

Efficient C compilers cannot be designed for ARBITRARY architectures!

Application software → Compiler → Processor → Results

Compiler and processor form a UNIT that needs to be optimized! "Compiler-friendliness" needs to be taken into account during the architecture exploration!


Retargetable compilers

[Figure: Classical compiler: source code → compiler (processor model built in) → asm code. Retargetable compiler: source code → compiler, with the processor model as a separate input → asm code]
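The difference can be illustrated with a minimal C sketch in which the processor model is plain data handed to an otherwise target-independent emitter. The three-operation IR, the `ProcessorModel` struct and both assembly syntaxes are hypothetical; a real retargetable compiler derives far more from its model (code-selector patterns, register classes, scheduling information), but the principle is the same: changing the target means changing the model, not the compiler.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, heavily simplified IR: one operation kind per enum value. */
typedef enum { IR_ADD, IR_SUB, IR_SHL, IR_COUNT } IrOp;

/* The "processor model": one assembly template per IR operation.
   A retargetable compiler reads this as input instead of hard-coding it. */
typedef struct {
    const char *name;
    const char *templ[IR_COUNT]; /* printf-style: dst, src1, src2 register numbers */
} ProcessorModel;

/* Two hypothetical targets with different assembly syntax. */
static const ProcessorModel model_a = {
    "three-address, uppercase syntax",
    { "ADD R%d, R%d, R%d", "SUB R%d, R%d, R%d", "SHL R%d, R%d, R%d" }
};

static const ProcessorModel model_b = {
    "three-address, lowercase syntax",
    { "add r%d,r%d,r%d", "sub r%d,r%d,r%d", "sll r%d,r%d,r%d" }
};

/* Target-independent emitter: the same code serves every model. */
static int emit(const ProcessorModel *m, IrOp op, int dst, int s1, int s2,
                char *buf, size_t len) {
    return snprintf(buf, len, m->templ[op], dst, s1, s2);
}
```

For example, `emit(&model_a, IR_ADD, 1, 2, 3, buf, sizeof buf)` yields `ADD R1, R2, R3`, while the same call with `&model_b` and `IR_SHL` yields `sll r1,r2,r3`.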


GNU C compiler (gcc)

• Probably the most widespread retargetable compiler

• Mostly used as a native Unix/Linux compiler, but may operate as a cross-compiler, too

• Support for C/C++, Java, and other languages

• Comes with comprehensive support software, e.g. runtime and standard libraries, debug support

• Portable to new architectures by means of a machine description file and C support routines

“The main goal of GCC was to make a good, fast compiler for machines in the class that the GNU system aims to run on: 32-bit machines that address 8-bit bytes and have several general registers. Elegance, theoretical power and simplicity are only secondary.”



CoSy compiler system (ACE)

© ACE - Associated Compiler Experts

• Universal retargetable C/C++ compiler

• Extensible intermediate representation (IR)

• Modular compiler organization

• Generator (BEG) for code selector, register allocator, scheduler


LISATek C compiler generation

[Figure: LISA processor model → automatic analyses and manual refinement (GUI) → CoSy system → C compiler]

SYNTAX { "ADD" dst, src1, src2 }

CODING { 0b0010 dst src1 src2 }

BEHAVIOR {
  ALU_read (src1, src2);
  ALU_add ();
  Update_flags ();
  writeback (dst);
}

SEMANTICS { src1 + src2 -> dst; }

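The CODING and BEHAVIOR sections above can be read as a C model of the instruction. The sketch below assumes a 16-bit instruction word with 4-bit fields ([15:12] opcode, [11:8] dst, [7:4] src1, [3:0] src2), 16 registers and only zero/carry flags; none of these widths come from the slide. `execute` mirrors the four BEHAVIOR steps, and its result matches the SEMANTICS line: src1 + src2 is written to dst.

```c
#include <stdint.h>

/* Assumed instruction format (not specified on the slide):
   [15:12] opcode 0b0010 | [11:8] dst | [7:4] src1 | [3:0] src2 */
#define OPC_ADD 0x2u

typedef struct {
    uint32_t R[16];  /* assumed 16 general-purpose registers */
    int zero, carry; /* flags modified by Update_flags() */
} CpuState;

/* CODING section: pack the fields into an instruction word. */
static uint16_t encode_add(unsigned dst, unsigned src1, unsigned src2) {
    return (uint16_t)((OPC_ADD << 12) | ((dst & 0xFu) << 8) |
                      ((src1 & 0xFu) << 4) | (src2 & 0xFu));
}

/* BEHAVIOR section: read operands, add, update flags, write back.
   Returns 0 if the word is not an ADD, 1 after executing it. */
static int execute(CpuState *s, uint16_t insn) {
    if ((insn >> 12) != OPC_ADD) return 0;
    unsigned dst = (insn >> 8) & 0xFu;
    unsigned src1 = (insn >> 4) & 0xFu;
    unsigned src2 = insn & 0xFu;
    uint32_t a = s->R[src1], b = s->R[src2]; /* ALU_read */
    uint32_t sum = a + b;                     /* ALU_add (wraps mod 2^32) */
    s->carry = sum < a;                       /* Update_flags */
    s->zero = (sum == 0);
    s->R[dst] = sum;                          /* writeback */
    return 1;
}
```

Encoding `ADD R1, R2, R3` under these assumptions gives the word `0x2123`; executing it with R2 = 5 and R3 = 7 leaves 12 in R1 with both flags cleared.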


LISATek compiler generation

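Frontend → Opt → Backend (code selector, register allocator, scheduler)

C code:
int a,b,c;
a = b+1;
c = a<<3;
…

ASM code:
LD R1, [R2]
ADD R1, #1
SHL R1, #3
…

[Figure: the backend is derived from the LISA processor model (instruction fetch, decoder, ALU, memory, FE/DE/EX/WB pipeline with pipeline control, registers R[0..31], jump unit, data RAM, program RAM), with annotations linking the model to the backend: instruction patterns such as "ADD … R[i] … #1", the register file R[0..31], and a table for the instructions JMP, ADD, SUB, MUL with entries such as "JMP 2 1" and "ADD 2 3"]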
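A minimal C sketch of the code-selector step for the example above, treating `a = b+1; c = a<<3;` as the single expression tree `(b + 1) << 3`. It matches three hand-written tree patterns and keeps every intermediate value in R1, so no register allocation or scheduling is needed; the `[b]` memory operand is a symbolic stand-in for the slide's `[R2]`. In a generated compiler these patterns would come from the processor model rather than being hard-coded as here.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical expression-tree IR for the example. */
typedef enum { N_VAR, N_CONST, N_ADD, N_SHL } Kind;
typedef struct Node {
    Kind kind;
    const char *var;            /* N_VAR: variable name */
    int value;                  /* N_CONST: immediate value */
    const struct Node *l, *r;   /* N_ADD / N_SHL operands */
} Node;

/* Append one assembly line to out (newline-separated, truncating if full). */
static void emit_line(char *out, size_t len, const char *line) {
    strncat(out, line, len - strlen(out) - 1);
    strncat(out, "\n", len - strlen(out) - 1);
}

/* Greedy tree-pattern code selection into a single register R1.
   Patterns: VAR -> LD; ADD(x, CONST) -> ADD #imm; SHL(x, CONST) -> SHL #imm.
   out must start as an empty string. Returns 0 if no pattern matches. */
static int select_code(const Node *n, char *out, size_t len) {
    char line[64];
    switch (n->kind) {
    case N_VAR:
        snprintf(line, sizeof line, "LD R1, [%s]", n->var);
        emit_line(out, len, line);
        return 1;
    case N_ADD:
    case N_SHL:
        /* Only register-op-immediate forms are covered by this sketch. */
        if (n->r->kind != N_CONST || !select_code(n->l, out, len)) return 0;
        snprintf(line, sizeof line, "%s R1, #%d",
                 n->kind == N_ADD ? "ADD" : "SHL", n->r->value);
        emit_line(out, len, line);
        return 1;
    default:
        return 0; /* a bare constant or unknown node has no pattern here */
    }
}
```

For the tree `SHL(ADD(b, 1), 3)` the selector emits `LD R1, [b]`, `ADD R1, #1`, `SHL R1, #3`, the same instruction sequence as the slide's ASM code.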


Compiled code quality: MIPS example

LISATek-generated C compiler:
• Out-of-the-box C compiler
• No manual optimizations
• Development time of model approx. 2 weeks

gcc C compiler:
• gcc with MIPS32 4kc backend
• Used by most MIPS users
• Large group of developers, several man-years of optimization

[Figure: bar charts comparing gcc -O4, gcc -O2, cosy -O4 and cosy -O2 in cycle count and in code size]

Overhead of 10% in cycle count and 17% in code density
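These percentages are most naturally read as relative overheads of the generated compiler's output with respect to gcc. Writing C for cycle count and S for code size (the slide's "code density"), the assumed definitions are sketched below; the slide does not state which optimization levels are paired, so that pairing is left open.

```latex
\text{overhead}_{\text{cycles}} = \frac{C_{\text{cosy}} - C_{\text{gcc}}}{C_{\text{gcc}}} \approx 10\,\%,
\qquad
\text{overhead}_{\text{size}} = \frac{S_{\text{cosy}} - S_{\text{gcc}}}{S_{\text{gcc}}} \approx 17\,\%
```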


Demands on code quality

Compilers for embedded processors have to generate extremely efficient code:

• Code size:
  » system-on-chip
  » on-chip RAM/ROM

• Performance:
  » real-time constraints

• Power/energy consumption:
  » heat dissipation
  » battery lifetime


Compiler flexibility/code quality trade-off

[Figure: the variety of embedded processors (DSP, NPU, VLIW) pulls towards specialization, i.e. dedicated optimization techniques, and towards unification, i.e. retargetable compilation]


Adding processor-specific code optimizations

• High-level (compiler IR): enabled by CoSy's engine concept
• Low-level (ASM):

[Figure: .c → LISA C compiler → unscheduled .asm → optimization 1 → optimization 2 → optimization 3 (via an assembly API) → scheduled & optimized .asm → binary code generation (assembler → linker → .out)]
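The low-level path can be pictured as passes that rewrite an in-memory list of assembly instructions. The C sketch below uses a hypothetical three-opcode representation (not the actual LISATek assembly API) and implements one simple peephole optimization: it folds consecutive immediate additions to the same register and drops additions or shifts by zero. It assumes that no later instruction reads the flags the removed instructions would have set.

```c
#include <stddef.h>

/* Hypothetical in-memory form of one assembly line, as an assembly-level
   API might expose it (an assumption, not LISATek's interface). */
typedef enum { OP_LD, OP_ADD_IMM, OP_SHL_IMM } AsmOp;
typedef struct { AsmOp op; int reg; int imm; } AsmInsn;

/* Peephole pass: folds "ADD Rx,#a; ADD Rx,#b" into "ADD Rx,#(a+b)" and drops
   "ADD Rx,#0" / "SHL Rx,#0". Assumes the flags these instructions set are
   never read afterwards. Rewrites in place and returns the new length. */
static size_t peephole(AsmInsn *code, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        AsmInsn cur = code[i]; /* read before any write to code[out], out <= i */
        if (out > 0 && cur.op == OP_ADD_IMM && code[out - 1].op == OP_ADD_IMM &&
            code[out - 1].reg == cur.reg) {
            code[out - 1].imm += cur.imm;      /* fold into previous ADD */
            if (code[out - 1].imm == 0) out--; /* folded to a no-op: drop it */
            continue;
        }
        if ((cur.op == OP_ADD_IMM || cur.op == OP_SHL_IMM) && cur.imm == 0)
            continue;                          /* no-op instruction: drop it */
        code[out++] = cur;
    }
    return out;
}
```

Several such passes can be chained before scheduling, each taking the instruction list produced by the previous one.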


4. ASIP architecture design


ASIP implementation after exploration


Unified Description Layer

[Figure: LISA → HDL generation → Register-Transfer Level → gate-level synthesis (e.g. Synopsys Design Compiler) → Gate Level]


Challenges in Automated ASIP Implementation

[Figure: instruction tree with arithmetic instructions (MUL, MAC) and control instructions (JMP, BRC); a 1:1 mapping from the ADL description to HDL yields one multiplier for MUL and a separate multiplier for MAC]

• Independent description of instruction behavior (ADL):
  + efficient design space exploration
• Independent mapping to hardware blocks (HDL):
  - insufficient architectural efficiency due to the 1:1 mapping


Unified Description Layer

[Figure: LISA → Unified Description Layer (structure & mapping incl. JTAG/debug, optimizations, backend for VHDL, Verilog, SystemC) → Register-Transfer Level → gate-level synthesis (e.g. Synopsys Design Compiler) → Gate Level]


Optimization strategies

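• LISA: separate descriptions for separate instructions
• Goal: share hardware between separate instructions

[Figure: instruction A maps to LISA operation A (x = a + b), instruction B to LISA operation B (y = c + d); since the two operations are mutually exclusive, a single adder with multiplexed inputs (a or c, b or d) can produce x or y]

Possible optimizations:
• ALU sharing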
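The ALU-sharing idea can be checked with a behavioral C model (not HDL): two dedicated adders versus one adder behind two input multiplexers. The `sel_a` signal stands for a decoder output and is an assumption of this sketch. Because operations A and B are mutually exclusive, only one of x and y is needed in any cycle, so the shared version produces the same result with one adder instead of two, at the cost of two multiplexers.

```c
#include <stdint.h>

/* Unshared datapath: one adder per LISA operation (1:1 mapping). */
static uint32_t op_a_adder(uint32_t a, uint32_t b) { return a + b; } /* x */
static uint32_t op_b_adder(uint32_t c, uint32_t d) { return c + d; } /* y */

/* Shared datapath: one adder behind two input multiplexers. The select
   signal (assumed to come from the decoder) is only valid because
   operations A and B are never active in the same cycle. */
static uint32_t shared_adder(int sel_a, uint32_t a, uint32_t b,
                             uint32_t c, uint32_t d) {
    uint32_t lhs = sel_a ? a : c; /* mux 1: a or c */
    uint32_t rhs = sel_a ? b : d; /* mux 2: b or d */
    return lhs + rhs;             /* single adder produces x or y */
}
```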


Optimization strategies

• LISA: separate descriptions for separate instructions
• Goal: same hardware for separate instructions

[Figure: instruction A (LISA operation A) accesses the register array via path PA (AddressA, DataA), instruction B (LISA operation B) via path PB (AddressB, DataB); since the operations are mutually exclusive, resource sharing merges both into one register-array access path (AddressA/AddressB multiplexed, DataA/DataB shared)]

Possible optimizations:
• ALU sharing
• Path sharing
• ...
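Sharing is only legal between operations that are mutually exclusive. A minimal C sketch of that test, assuming a hypothetical activation table in which bit k of `act[i]` is set when instruction i activates LISA operation k: two operations may share a resource if no single instruction activates both. For simplicity the sketch ignores pipelining; in a pipelined processor, operations in different stages can be active in the same cycle on behalf of different instructions, which would require a per-stage analysis.

```c
#include <stdint.h>

/* act[i]: bitmask of the LISA operations activated by instruction i
   (hypothetical table; operation indices must be below 32).
   Returns 1 if no instruction activates both op1 and op2. */
static int mutually_exclusive(const uint32_t *act, int n_insn, int op1, int op2) {
    uint32_t both = (1u << op1) | (1u << op2);
    for (int i = 0; i < n_insn; i++)
        if ((act[i] & both) == both) return 0; /* one instruction needs both */
    return 1;
}
```

Pairs that pass the test are candidates for ALU or path sharing; pairs that fail must keep separate hardware.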


5. Case study


Motorola 6811

Project Goals:

• Performance (MIPS) must be increased

• Compatibility on the assembly level for reuse of legacy code (integration into the existing tool flow)

• Royalty-free design

→ Compatible architecture developed with LISA, using RTL processor synthesis


Motorola 6811

[Figure: 6811 / 6812; legacy code produced via compiler → assembly → assembler → binary; "?"; goal: increase performance (MIPS)!]


Motorola 6811

[Figure: Bluetooth application → 6811 compiler → assembly → LISA assembler → binary running on the synthesized, assembly-level compatible architecture]


Architecture Development

Original 6811 processor:
• 8, 16, 24, 32 and 40 bit instructions
• Instructions are fetched in 8-bit blocks: up to 5 cycles for fetching!

LISA 6811 processor:
• 16 and 32 bit instructions
• 16 bits are fetched simultaneously: max. 2 cycles for fetching!
+ pipelined architecture
+ possibility for special instructions
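The fetch-cycle figures follow from dividing the instruction length by the fetch width and rounding up. A small C sketch, assuming one fetch per cycle and no prefetch buffer:

```c
/* Cycles needed to fetch one instruction of insn_bits bits when the fetch
   unit delivers fetch_width bits per cycle: ceil(insn_bits / fetch_width).
   Simplification: one fetch per cycle, no prefetch buffer. */
static unsigned fetch_cycles(unsigned insn_bits, unsigned fetch_width) {
    return (insn_bits + fetch_width - 1u) / fetch_width;
}
```

`fetch_cycles(40, 8)` gives the 5 cycles of the original 6811's longest instruction, and `fetch_cycles(32, 16)` the 2 cycles of the LISA version.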


Tools Flow and RTL Processor Synthesis

[Figure: C application → 6811 compiler → assembly → LISA assembler → executable; the LISA model drives the LISA tools]

6811-compatible architecture generated completely in VHDL:
1) VLSI implementation: area < 17k gates, clock speed ~154 MHz
2) Mapped onto a Xilinx FPGA
