QEmu TCG Enhancements for Speeding-up the Emulation of SIMD instructions
Luc Michel, Nicolas Fournel and Frederic Petrot
TIMA Laboratory
System Level Synthesis Group
DATE’11 W8, 18/03/2011
Introduction / QEmu operation / Improving Neon instructions translation / Tests and results
Outline
1 Introduction
  About QEmu
  About SIMD instructions
2 QEmu operation
  The intermediate representation
  The helpers
3 Improving Neon instructions translation
  A solution to improve the translation
  Intermediate representation extension choices
4 Tests and results
  Tests protocol
  Performance measurement
Luc Michel, Nicolas Fournel and Frederic Petrot QEmu TCG Enhancements for SIMD support 2 / 25
QEmu: a fast and portable dynamic translator
Simulation with QEmu
Open-source simulation and virtualization software,
Dynamic binary translation of the code of a target architecture,
To be executed on a host architecture.
Precise goal of the present work
Accelerate the cross-execution of the Neon instructions.
What are SIMD instructions?
SIMD instructions: Single Instruction, Multiple Data
Same operation applied to multiple data elements in parallel,
Very efficient for optimizing some algorithms: parts of media codecs, radio processing, ...,
64-bit or 128-bit data vectors,
8-, 16-, 32- or 64-bit elements, depending on the instruction.
Example: vadd.i16
Taken from the ARM Neon instruction set
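The lane-wise behavior of vadd.i16 can be written down as a scalar reference model. The following is a minimal sketch in C, assuming a 64-bit vector viewed as four 16-bit lanes; the type vec64_i16 and the function vadd_i16_ref are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit Neon vector seen as four 16-bit lanes. */
typedef struct {
    uint16_t lane[4];
} vec64_i16;

/* Lane-wise addition in the spirit of vadd.i16: each 16-bit lane is added
 * independently and wraps around modulo 2^16, with no carry between lanes. */
static vec64_i16 vadd_i16_ref(vec64_i16 a, vec64_i16 b)
{
    vec64_i16 r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = (uint16_t)(a.lane[i] + b.lane[i]);
    }
    return r;
}
```

A model of this kind is also convenient as an oracle when checking a faster translation of the same instruction.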
The intermediate representation of QEmu
The intermediate representation is independent of both target and host, and consists of micro-operations such as:
add_i32
mov_i32
or_i32
Two-step translation
1 Target architecture code → micro-operations,
2 Micro-operations → host architecture code.
Intermediate representation benefits
Independence between target and host architectures.
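The two translation steps can be illustrated with a toy translator. This is only a sketch under stated assumptions: the uop structure, the made-up target instruction "dst = (dst + a) | b" and the textual x86-like output are all hypothetical and far simpler than what QEmu actually does.

```c
#include <stdio.h>
#include <stddef.h>

/* Toy micro-operations, named after the add_i32 / mov_i32 / or_i32 examples. */
typedef enum { UOP_MOV_I32, UOP_ADD_I32, UOP_OR_I32 } uop_kind;

typedef struct {
    uop_kind kind;
    int dst, src;               /* virtual register numbers */
} uop;

/* Step 1 (front end): a made-up target instruction "dst = (dst + a) | b"
 * is decomposed into target-independent micro-operations. */
static size_t translate_target_insn(int dst, int a, int b, uop *out)
{
    out[0] = (uop){ UOP_ADD_I32, dst, a };
    out[1] = (uop){ UOP_OR_I32,  dst, b };
    return 2;
}

/* Step 2 (back end): each micro-operation is turned into host code.
 * Here "host code" is just x86-like assembly text, for readability.
 * Returns the number of characters written, or -1 if buf is too small. */
static int emit_host(const uop *ops, size_t n, char *buf, size_t len)
{
    static const char *const mnemonic[] = { "mov", "add", "or" };
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        int w = snprintf(buf + used, len - used, "%s r%d, r%d\n",
                         mnemonic[ops[i].kind], ops[i].dst, ops[i].src);
        if (w < 0 || (size_t)w >= len - used) {
            return -1;
        }
        used += (size_t)w;
    }
    return (int)used;
}
```

Only emit_host knows about the host, and only translate_target_insn knows about the target, which is the independence the intermediate representation is meant to provide.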
Binary translation example
Neon instructions translation method: the helpers
The helpers
C functions that each simulate an instruction,
Compiled as a part of QEmu,
A call to the helper is emitted when the corresponding Neon instruction is translated.
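To give an idea of what such a helper looks like, here is a minimal sketch of a C function that adds two pairs of 16-bit lanes packed into 32-bit words. The name helper_add_u16x2 is hypothetical and the code is not taken from the QEmu sources; it only illustrates the kind of bit manipulation a helper has to perform in plain C.

```c
#include <stdint.h>

/* Hypothetical helper: adds two pairs of 16-bit lanes packed in 32-bit words.
 * The top bit of each lane is handled separately so that a carry out of the
 * low lane never leaks into the high lane. */
static uint32_t helper_add_u16x2(uint32_t a, uint32_t b)
{
    const uint32_t msb = 0x80008000u;
    uint32_t top = (a ^ b) & msb;            /* top bits added without carry */
    uint32_t low = (a & ~msb) + (b & ~msb);  /* 15-bit adds cannot cross lanes */
    return low ^ top;                        /* fold the carry into the top bit */
}
```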
Example with a helper
Helpers overhead
Function call,
Adapting the arguments,
Passing the arguments,
Getting the result,
Multiple calls, because each 64-bit/128-bit vector is split into 32-bit parts.
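The cost of splitting a vector into 32-bit parts can be made visible with a small sketch. It assumes a 128-bit vector stored as four 32-bit words and a hypothetical helper, helper_add_2x16, that counts its own invocations; none of these names come from QEmu.

```c
#include <stdint.h>

typedef uint32_t (*helper_fn)(uint32_t, uint32_t);

static unsigned helper_calls;   /* call counter, for illustration only */

/* Hypothetical helper working on one 32-bit part (two 16-bit lanes). */
static uint32_t helper_add_2x16(uint32_t a, uint32_t b)
{
    helper_calls++;
    uint32_t lo = (a + b) & 0x0000FFFFu;            /* low lane, carry dropped */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;    /* high lane, carry dropped */
    return hi | lo;
}

/* A 128-bit vector operation costs one helper call per 32-bit part. */
static void vec128_apply(helper_fn fn, const uint32_t a[4],
                         const uint32_t b[4], uint32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        out[i] = fn(a[i], b[i]);
    }
}
```

One 128-bit addition therefore pays the call, argument and result overhead four times, where a host SIMD instruction would process the whole vector at once.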
A solution to improve the translation
The idea
Be able to take advantage of the host SIMD capabilities,
Add some SIMD micro-operations to the QEmu IR,
Translate these micro-operations to host SIMD instructions.
The practical example of this work
ARM Neon instruction set → Intel x86 MMX/SSE instruction set.
How to extend the IR
Two ways to extend the QEmu IR
Add a micro-operation for each target instruction,
Keep the IR small and add only elementary micro-operations.
Our choice
Try to keep the IR as simple as possible.
Examples of mapping between Neon and MMX/SSE
Direct mapping between two instructions
The most favorable case,
A single micro-operation with the semantics of these two instructions.
Mapping between vadd.i16 (Neon) and paddw (MMX/SSE)
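On the host side, the direct mapping can be sketched with the SSE2 intrinsic _mm_add_epi16, which compilers typically lower to paddw. The function name add_8x16 is hypothetical, and the portable fallback is an addition of this sketch so that it also builds on hosts without SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Adds the eight 16-bit lanes of two 128-bit vectors, wrapping modulo 2^16. */
static void add_8x16(const uint16_t a[8], const uint16_t b[8], uint16_t out[8])
{
#if defined(__SSE2__)
    /* On an SSE2 host the whole vector is handled by one packed add. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(va, vb));
#else
    /* Portable fallback with the same lane-wise semantics. */
    for (int i = 0; i < 8; i++) {
        out[i] = (uint16_t)(a[i] + b[i]);
    }
#endif
}
```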
Examples of mapping between Neon and MMX/SSE
A Neon instruction emits multiple micro-operations
The Neon instruction is not elementary,
It is split into several elementary micro-operations.
Translating the vsra.u32 (Neon) instruction
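Since vsra.u32 shifts each lane of the source right and accumulates the result into the destination, one plausible decomposition is a shift micro-operation followed by an add micro-operation. The sketch below assumes exactly that split; the names uop_shr_u32 and uop_add_i32 are hypothetical and the slides do not state which micro-operations are actually emitted.

```c
#include <stdint.h>

#define LANES 2   /* a 64-bit vector holds two 32-bit lanes */

/* Elementary micro-op 1 (hypothetical): logical shift right of each lane.
 * A shift by the full lane width yields 0 and is handled explicitly,
 * because shifting a uint32_t by 32 is undefined in C. */
static void uop_shr_u32(uint32_t v[LANES], unsigned imm)
{
    for (int i = 0; i < LANES; i++) {
        v[i] = (imm >= 32) ? 0 : (v[i] >> imm);
    }
}

/* Elementary micro-op 2 (hypothetical): lane-wise wrapping addition. */
static void uop_add_i32(uint32_t acc[LANES], const uint32_t v[LANES])
{
    for (int i = 0; i < LANES; i++) {
        acc[i] += v[i];
    }
}

/* acc += (src >> imm) on every lane, built from the two micro-ops above. */
static void vsra_u32(uint32_t acc[LANES], const uint32_t src[LANES],
                     unsigned imm)
{
    uint32_t tmp[LANES];
    for (int i = 0; i < LANES; i++) {
        tmp[i] = src[i];
    }
    uop_shr_u32(tmp, imm);
    uop_add_i32(acc, tmp);
}
```

Both micro-operations are reusable by other instructions, which is the benefit of keeping the IR elementary.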
Examples of mapping between Neon and MMX/SSE
A micro-operation generates multiple host instructions
No equivalent for this micro-operation on the host,
The micro-operation's behavior is reproduced with several host instructions,
Harder to implement in QEmu than the previous case.
The simd_128_shl_i8 micro-op emits several host instructions
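MMX/SSE offers packed shifts on 16-bit and wider lanes but not on 8-bit lanes, so an 8-bit shift has to be assembled from other operations. One common way, assumed here and not necessarily the sequence the authors emit, is a 16-bit shift followed by a mask that clears the bits leaking in from the neighbouring byte. The sketch models this in portable C with a uniform shift amount; shl_i8_via_16bit is a hypothetical name.

```c
#include <stdint.h>

/* Models a 128-bit vector as eight 16-bit words, the granularity of a
 * 16-bit packed shift; each word holds two 8-bit lanes.
 * The shift amount n is assumed to be in the range 0..7. */
static void shl_i8_via_16bit(uint16_t v[8], unsigned n)
{
    /* Per-byte mask clearing the n low bits, replicated into both bytes. */
    uint8_t  byte_mask = (uint8_t)(0xFFu << n);
    uint16_t mask = (uint16_t)(byte_mask * 0x0101u);

    for (int i = 0; i < 8; i++) {
        uint16_t shifted = (uint16_t)(v[i] << n);  /* step 1: 16-bit shift */
        v[i] = shifted & mask;                     /* step 2: drop leaked bits */
    }
}
```

Without the mask, the high bits of each low byte would spill into the low bits of the byte above it.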
What kind of tests?
Unit tests
Ensure correctness of the translation,
Detect regressions during the development phase.
Performance measurement
Execution time.
Test environment
Linux in QEmu
Minimalist Linux system,
Cross-compilation toolchain to compile some programs for the test system.
Real BeagleBoard system
Board embedding an ARM Cortex-A8 CPU with the Neon extension,
Used to validate our unit tests.
Performance tests
The three chosen instructions
vadd.i16,
vsra.u16,
vshl.u8.
For each instruction...
101 assembly functions,
Containing 0% to 100% of this Neon instruction,
Padded with ordinary (non-SIMD) instructions,
Executed several times in a loop,
Total execution time measured for both the helpers and the mapping strategies.
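The protocol can be summarized by two small computations: how many slots of a test function hold the Neon instruction, and how the two measured times are compared. The sketch below is hypothetical in every detail (function names, a fixed-length body, Neon slots placed first); it only illustrates the 0% to 100% mix and a time expressed relative to the helpers strategy.

```c
#include <stddef.h>

/* Fills body[0..len) for the test function containing `percent`% of the
 * Neon instruction under test: 'N' marks a Neon slot, 'c' a classical one.
 * Placing all Neon slots first is an arbitrary choice of this sketch.
 * Returns the number of Neon slots. */
static size_t fill_body(char *body, size_t len, unsigned percent)
{
    size_t neon = len * percent / 100;
    for (size_t i = 0; i < len; i++) {
        body[i] = (i < neon) ? 'N' : 'c';
    }
    return neon;
}

/* Execution time of the mapping strategy relative to the helpers strategy,
 * in percent: 100 means the same speed, lower means the mapping is faster. */
static double relative_time_percent(double t_mapping, double t_helpers)
{
    return 100.0 * t_mapping / t_helpers;
}
```

Calling fill_body for every percentage from 0 to 100 gives the 101 variants per instruction.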
Performance tests results
[Figure: relative execution time (%) compared to helpers, plotted against the proportion of SIMD instructions (%), with one curve each for vadd.i16, vsra.u16 and vshl.u8. Y axis from 0 to 110, X axis from 0 to 100.]
Take away message
Conclusion
Results are very encouraging, but Amdahl’s law still rules
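Amdahl's law bounds the overall gain by the fraction of time actually spent in the accelerated part. A minimal sketch in C, where p is the fraction of execution time spent in SIMD instructions and s the speedup obtained on that fraction; both are free parameters of the illustration, not measured values from this work.

```c
/* Amdahl's law: overall speedup when a fraction p of the execution time
 * is accelerated by a factor s and the remaining 1 - p is unchanged. */
static double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0 the result is 1 whatever s is, which is why code containing few SIMD instructions benefits little from a faster translation of them.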
What to do next?
Extend the implementation to more SIMD instruction sets,
Probably with the help of automation tools
Call to QEmu development community
Should this approach be promoted into mainstream QEmu?
Thanks for your attention
And now ready to answer your questions!