Project F2: Application Performance Analysis
Seth Koehler, John Curreri, Rafael Garcia
Outline
- Introduction
- Performance analysis overview
  - Historical background
  - Performance analysis today
  - Related research and tools
- RC performance analysis
  - Motivation
  - Instrumentation
  - Framework
  - Visualization
  - User's perspective
- Case studies
  - N-Queens
  - Collatz (3x+1) conjecture
- Conclusions & references
Introduction
- Goal for performance analysis in RC: productively identify and remedy performance bottlenecks in RC applications (CPUs and FPGAs)
- Motivations
  - Complex systems are difficult to analyze by hand
  - Manual instrumentation is unwieldy; it is difficult to make sense of a large volume of raw data
  - Tools can help quickly locate performance problems: collect and view performance data with little effort, and analyze that data to indicate potential bottlenecks
  - Such tools are a staple in HPC, limited in HPEC, and virtually non-existent in RC
- Challenges
  - How do we expand the notion of software performance analysis into the software-hardware realm of RC?
  - What are common bottlenecks for dual-paradigm applications?
  - What techniques are necessary to detect performance bottlenecks?
  - How do we analyze and present these bottlenecks to a user?
Historical Background
- gettimeofday and printf: very cumbersome, repetitive, manual, not optimized for speed
- Profilers date back to the 1970s with "prof" (gprof, 1982)
  - Provide the user with information about application behavior: percentage of time spent in a function; how often a function calls another function
- Simulators / emulators: too slow or too inaccurate; require significant development time
- PAPI (Performance Application Programming Interface)
  - Portable interface to hardware performance counters on modern CPUs
  - Provides information about caches, CPU functional units, main memory, and more

Processor      HW counters
UltraSPARC II  2
Pentium 3      2
AMD Athlon     4
IA-64          4
POWER4         8
Pentium 4      18
* Source: Wikipedia
Performance Analysis Today
[Figure: the performance analysis cycle — the original application is instrumented, then executed and measured in the execution environment, producing a measured data file; the data is analyzed (automatically or manually) and presented as visualizations and potential bottlenecks, which guide optimization into a modified and, eventually, an optimized application.]
What does performance analysis look like today?
- Goals: low impact on application behavior; high-fidelity performance data; flexible; portable; automated; concise visualization
- Techniques: event-based vs. sample-based; profile vs. trace
- Above all, we want to understand application behavior in order to locate performance problems!
Related Research and Tools: Parallel Performance Wizard (PPW)
- Open-source tool developed by the UPC Group at the University of Florida
- Performance analysis and optimization (PGAS* systems and MPI support)
- Performance data can be analyzed for bottlenecks
- Offers several ways of exploring performance data:
  - Graphs and charts to quickly view high-level performance information at a glance
  - In-depth execution statistics for identifying communication and computational bottlenecks
  - Interacts with popular trace viewers (e.g., Jumpshot) for detailed analysis of trace data
- Comprehensive support for correlating performance back to original source code

* Partitioned Global Address Space languages allow partitioned memory to be treated as global shared memory by software.
Motivation for RC Performance Analysis
- Dual-paradigm applications gaining more traction in HPC and HPEC
  - Design flexibility allows best use of FPGAs and traditional processors
  - Drawback: more challenging to design applications for dual-paradigm systems; parallel application tuning and FPGA core debugging are hard enough!
[Figure: debug and performance-tuning difficulty rises from sequential to parallel to dual-paradigm applications.]
- No existing holistic solutions for analyzing dual-paradigm applications
  - Software-only views leave out low-level details
  - Hardware-only views provide incomplete performance information
  - Need a complete system view for effective tuning of the entire application
Motivation for RC Performance Analysis
- Q: Is my runtime load-balancing strategy working? A: ???
  [Figure: ChipScope waveform]
- Q: How well is my core's pipelining strategy working? A: ???
  [Figure: gprof output (×N, one for each node!)]
Flat profile (each sample counts as 0.01 seconds):

  %    cumulative   self              self     total
 time    seconds   seconds   calls  ms/call  ms/call  name
 51.52      2.55      2.55       5   510.04   510.04  USURP_Reg_poll
 29.41      4.01      1.46      34    42.82    42.82  USURP_DMA_write
 11.97      4.60      0.59      14    42.31    42.31  USURP_DMA_read
  4.06      4.80      0.20       1   200.80   200.80  USURP_Finalize
  2.23      4.91      0.11       5    22.09    22.09  localp
  1.22      4.97      0.06       5    12.05    12.05  USURP_Load
  0.00      4.97      0.00      10     0.00     0.00  USURP_Reg_write
  0.00      4.97      0.00       5     0.00     0.00  USURP_Set_clk
  0.00      4.97      0.00       5     0.00   931.73  rcwork
  0.00      4.97      0.00       1     0.00     0.00  USURP_Init
What to Instrument in Hardware?
- Control: watch state machines, pipelines, etc.
- Replicated cores: understand distribution and parallelism inside the FPGA
[Figure: system hierarchy — a system contains machines; each machine contains node boards; a node board holds CPUs, main memory, and a network/primary interconnect, with a secondary interconnect leading to FPGA boards; each FPGA board holds on-board memory, a board/device interface, embedded CPU(s), and FPGAs containing replicated application cores. The legend distinguishes FPGA communication from traditional processor communication.]
- Communication
  - On-chip (components, Block RAMs, embedded processors)
  - On-board (on-board memory, other on-board FPGAs or processors)
  - Off-board (CPUs, off-board FPGAs, main memory)
Instrumentation Modifications
[Figure: the original RC application — user application (HLL) on the CPU(s) with FPGA access methods (wrapper), and user application (HDL) on the FPGA(s) with its original top-level file, modules, and submodules — is augmented by instrumentation with a Hardware Measurement Module (HMM) and a new top-level file on the FPGA side, and a hardware measurement thread/process, data transfer module, and lock on the CPU side; modified component interfaces are highlighted.]
- Process is automatable!
- Additions are temporary!
Performance Analysis Framework
- Instrument VHDL source (vs. binary or intermediate levels)
  - Pros: portable across devices; flexible (access to signals); low change in area/speed (optimized); relatively easy
  - Cons: must pass through place-and-route; language-specific (VHDL vs. Verilog)
- Store data with CPU-initiated transfers (vs. CPU-assisted or FPGA-initiated)
  - Pros: universally supported
  - Cons: not portable across APIs; inefficient (lock contention, wasteful); lower fidelity
[Figure: the CPU issues a request to the FPGA, which returns performance data.]
Hardware Measurement Extraction Module
- A separate thread (HMM_Main) periodically transfers data from the FPGA to memory
- Adaptive polling frequency can be employed to balance fidelity and overhead
- Measurement can be stopped and restarted (similar to a stopwatch)
[Figure: the application calls HMM_Init, HMM_Start, HMM_Stop, and HMM_Finalize, with HMM_Main running as a thread in between.]
Instrumentation Modifications (cont.)
- New top-level file arbitrates between the application and the performance framework for off-chip communication
- Splice into communication scheme
  - Acquire address space in memory map
  - Acquire network address or other unique identifier
- Connect hardware together
- Signal analysis
[Figure: tool flow — from the user's HLL/HDL source and a "what/how to instrument" specification, the instrumentation framework modifies the HLL main file and HDL files, creates a new top-level file, and modifies HMM_Main; the user then synthesizes/implements and compiles, and executes. The legend distinguishes automated instrumentation from steps performed by the user.]
- Challenges in automation
  - Custom APIs for FPGAs
  - Custom user schemes for communication
  - Application knowledge not available
Hardware Measurement Module
- Tracing, profiling, & sampling with signal analysis
[Figure: HMM architecture — monitored signals feed a signal analysis module whose signal/value comparators produce triggers; triggers update profile counters (0 .. P-1) and write trace data, timestamped by a cycle counter, into a buffer backed by Block RAM or on-board memory (DDR/QDR); module statistics and module control handle sample control and serve requests for performance data.]
Visualization
- Need unified visualizations that accentuate important statistics
- Must be scalable to many nodes
[Figure: example system-wide visualization — CPUs 0-5 and FPGAs 0-2 connected by interconnects and a network, with each link annotated with measured throughput and utilization (e.g., 904 MB/s at 88%, 2.50 GB/s at 100%, 0 MB/s at 0%); a throughput-vs-time plot; a state breakdown (IDLE 75%, PHASE 1 9%, PHASE 2 16%); and highlighted potential bottlenecks such as the CPU interconnect.]
Analysis
- Instrument and measure to locate common or expected bottlenecks
- Provide potential solutions or other aid to mitigate these bottlenecks
  - Best practices, common pitfalls, etc.
  - Hardware/platform-specific checks and solutions

Bottleneck pattern                                         Possible solution
FPGA idle waiting for data                                 Employ double-buffering
Frequent, small communication packets between CPU & FPGA   Buffer data on CPU or FPGA side
Some cores busy while others idle                          Improve distribution scheme / load balancing
Cray XD1 reads slow on CPU                                 Use FPGA to write data
Heavy CPU/FPGA communication                               Modify partitioning of CPU and FPGA work/data
Excessive time spent in miscellaneous states               Combine states
Performance Flow (User's Perspective)
- Instrument hardware through the VHDL Instrumenter GUI
  - Java/Perl program to simplify modifications to VHDL for performance analysis
  - Must resynthesize & implement hardware: requires adding the instrumented HDL files via the standard tool flow
- Instrument software through PPW compiler scripts
  - Run software with ppwucc instead of the standard compiler, i.e. ppwupcc
  - Use the -fpga-nallatech and -inst-functions command-line options
[Figure: tool flow — VHDL files plus a configuration (select what and how to record) yield instrumented VHDL files and then an instrumented FPGA binary; C/UPC files yield an instrumented executable; together they produce performance data files at run time.]
Case Study: N-Queens*
- Overview: find the number of distinct ways n queens can be placed on an n×n board without attacking each other
- Performance analysis overhead
  - Sixteen 32-bit profile counters
  - One 96-bit trace buffer (completed cores)
- Main state machine optimized based on data
  - Improved speedup (from 34 to 37 vs. Xeon code)

N-Queens results for board size of 16:

                                      XD1                        Xeon-H101
                                      Original  Instrumented     Original  Instrumented
Slices (% rel. to device)             9,041     9,901 (+4%)      23,086    26,218 (+6%)
Block RAMs (% rel. to device)         11        15 (+2%)         21        22 (0%)
Frequency in MHz (% rel. to orig.)    124       123 (-1%)        101       101 (0%)
Communication (KB/s)                  <1        33               <1        30

Application speedup over a single 3.2 GHz Xeon:
- 8-node 3.2 GHz Xeon: 7.9
- 8-node H101: 33.9
- Optimized 8-node H101: 37.1

* Standard backtracking algorithm employed
Case Study: Collatz Conjecture (3x+1)
- Application: search for sequences that do not reach 1 under the following function:
  f(n) = n/2 if n is even; 3n+1 if n is odd
- Setup
  - 3.2 GHz P4-Xeon CPU with Virtex-4 LX100 FPGA over PCI-X
  - Uses 88% of FPGA slices and 22% (53) of Block RAMs; runs at 100 MHz
  - 17 counters monitored 3 state machines
  - No frequency degradation observed
[Figure: breakdown of computation time among FPGA read, FPGA write, and FPGA data processing.]
- Results
  - Frequent, small FPGA communication: 31% performance improvement achieved by buffering data before sending to the FPGA
    - Unexpected... hardware had been tuned to work longer to eliminate communication problems
  - Distribution of data inside the FPGA: expected performance increase not large enough to merit implementation
- Conclusion: buffering data achieved a 31% increase in speed
Conclusions
- RC performance analysis is critical to understanding RC application behavior
- Need unified instrumentation, measurement, and visualization to handle diverse and massively parallel RC systems
- Automated analysis can be useful for locating common RC bottlenecks (though difficult to do)
- Framework developed
  - First RC performance concept and tool framework (per extensive literature review)
  - Automated instrumentation
  - Measurement via tracing, profiling, & sampling
- Application case studies
  - Observed minimal overhead from the tool
  - Speedup achieved due to performance analysis
References
- R. DeVille, I. Troxel, and A. George. "Performance monitoring for run-time management of reconfigurable devices." Proc. of International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), pp. 175-181, June 2005.
- P. Graham, B. Nelson, and B. Hutchings. "Instrumenting bitstreams for debugging FPGA circuits." Proc. of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 41-50, Washington, DC, USA, Apr. 2001.
- S. S. Shende and A. D. Malony. "The TAU parallel performance system." International Journal of High Performance Computing Applications, 20(2):287-311, May 2006.
- C. E. Wu, A. Bolmarcich, M. Snir, D. Wootton, F. Parpia, A. Chan, E. Lusk, and W. Gropp. "From trace generation to visualization: a performance framework for distributed parallel systems." Proc. of the 2000 ACM/IEEE Conference on Supercomputing (SC), p. 50, Washington, DC, USA, Nov. 2000.
- A. Leko and M. Billingsley, III. "Parallel Performance Wizard user manual." http://ppw.hcs.ufl.edu/docs/pdf/manual.pdf, 2007.
- S. Koehler, J. Curreri, and A. George. "Challenges for Performance Analysis in High-Performance Reconfigurable Computing." Proc. of Reconfigurable Systems Summer Institute (RSSI), Urbana, IL, July 17-20, 2007.