Data Partitioning for Reconfigurable Architectures with Distributed Block RAM
Wenrui Gong, Gang Wang, Ryan Kastner
Department of Electrical and Computer Engineering
University of California, Santa Barbara
{gong, wanggang, kastner}@ece.ucsb.edu
http://express.ece.ucsb.edu
June 10, 2005
3/22/2005GONG et al: Data Partitioning for Reconfigurable Architectures with Distributed Block RAM 2
What are we dealing with?
  Mapping high-level programs onto FPGA-based reconfigurable computing architectures with distributed block RAM modules
  Objective: improve utilization of available storage resources, optimize system performance, and meet design goals
Outline
  Target architectures
  Data partitioning problem
  Memory optimizations
  Experimental results
  Concluding remarks
Target Architecture
  FPGA-based fine-grained reconfigurable computing architecture with distributed block RAM modules
Memory Access Latencies
  Memory access delay consists of the block RAM access delay plus propagation delays; the propagation delays vary.
  One clock cycle to access nearby data; two or more to access data far away from the CLB.
  Before physical synthesis, it is difficult to tell which block RAM modules are near and which are remote.
  This makes the problem harder than traditional data partitioning in parallelizing compilation for NUMA machines.
Outline
  Target architectures
  Data partitioning problem
    Problem formulation
    Data partitioning algorithm
  Memory optimizations
  Experimental results
  Concluding remarks
Problem Formulation
  Inputs:
    An l-level nested loop L
    A set of n data arrays N
    An architecture with m block RAM (BRAM) modules M
  Assumptions:
    Index expressions of array references are affine functions of loop indices
    No indirect array references or other similar pointer operations
    All data arrays are assigned to block RAM modules
    No duplicate data
Problem Formulation (cont’d)
  Partitioning problem: partition the data arrays N into a set of data portions P, and seek an assignment from P to the block RAM modules M.
  Constraints:
    1) hardware resource constraint
    2) capacity constraint of each block RAM module
    3) all data arrays are assigned to block RAM, and each data element is assigned to one and only one block RAM module
  Objective: minimize the total execution time (or maximize the system throughput) under the above constraints.
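The constraints above can be read as a feasibility check on a candidate assignment. The following is a minimal sketch, not the authors' implementation. The function name, the dictionary-based data model, and the reduction of the hardware resource constraint to "use at most m modules" are all assumptions made for illustration.

```python
from collections import Counter

def check_partition(arrays, portions, assignment, num_brams, capacity):
    """Return True if a candidate assignment is feasible (hypothetical model).

    arrays:     {array_name: number_of_elements}
    portions:   {portion_id: set of (array_name, index) elements}
    assignment: {portion_id: block RAM index in 0..num_brams-1}
    capacity:   elements that fit in one block RAM module (assumed uniform)
    """
    all_elements = {(name, i) for name, size in arrays.items() for i in range(size)}

    # Constraint 3: every element is covered, and by exactly one portion.
    seen = Counter(e for elems in portions.values() for e in elems)
    covers_once = set(seen) == all_elements and all(c == 1 for c in seen.values())

    # Constraint 1 (simplified assumption): only the m available modules are used.
    in_range = all(
        p in assignment and 0 <= assignment[p] < num_brams for p in portions
    )

    # Constraint 2: the elements mapped to each module fit in that module.
    load = Counter()
    for p, elems in portions.items():
        if p in assignment:
            load[assignment[p]] += len(elems)
    fits = all(n <= capacity for n in load.values())

    return covers_once and in_range and fits
```

A partitioner would call a check like this on each candidate before estimating its execution time; the objective itself is not modeled here.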
Overview of Data Partitioning Algorithm
  Code analysis to determine possible partitioning directions
  Architectural-level synthesis to discover the design properties
    Resource allocation, scheduling, and binding
  Granularity adjustment
    Use an empirical cost function to estimate performance
Code Analysis
  Calculate the iteration space IS(L)
  Calculate the data space DS(Ni)
  Obtain the data access footprint F using the affine functions of loop indices
  Analyze F and IS(L) to obtain a set of possible partitioning directions
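The analysis steps above can be sketched for a small loop nest. This is an illustration under stated assumptions, not the authors' analysis: the representation of an affine reference as per-dimension (coefficients, offset) pairs, the helper names, and the criterion of counting elements shared between the two halves of a split are all hypothetical. The slides do not say how candidate directions are ranked.

```python
from itertools import product

def iteration_space(bounds):
    """All iteration vectors of a loop nest with half-open bounds [(lo, hi), ...]."""
    return list(product(*(range(lo, hi) for lo, hi in bounds)))

def apply_ref(ref, iteration):
    """Evaluate one affine array reference at one iteration vector.

    ref holds one (coeffs, offset) pair per array dimension, so the index in
    that dimension is sum(coeffs[k] * iteration[k]) + offset.
    """
    return tuple(
        sum(c * i for c, i in zip(coeffs, iteration)) + offset
        for coeffs, offset in ref
    )

def footprint(iterations, refs):
    """Set of array elements touched by the given iterations."""
    return {apply_ref(ref, it) for it in iterations for ref in refs}

def shared_elements(bounds, refs, dim):
    """Split the iteration space in half along loop `dim` and count the
    elements touched by both halves (fewer means a cleaner split)."""
    lo, hi = bounds[dim]
    mid = (lo + hi) // 2
    space = iteration_space(bounds)
    first = [it for it in space if it[dim] < mid]
    second = [it for it in space if it[dim] >= mid]
    return len(footprint(first, refs) & footprint(second, refs))

# Hypothetical example: for i in 0..3, for j in 0..3, read A[i][j] and A[i][j+1].
BOUNDS = [(0, 4), (0, 4)]
REFS = [
    [((1, 0), 0), ((0, 1), 0)],  # A[i][j]
    [((1, 0), 0), ((0, 1), 1)],  # A[i][j+1]
]
```

In this example, splitting along i shares no elements between the halves, while splitting along j shares one column, so the i direction is the more attractive candidate under this criterion.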
Architectural-level Synthesis
  Synthesize and pipeline the innermost iteration body, and collect the execution time T, the initiation interval II, and the resource utilizations u_m, u_r, and u_a
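To see why T and II matter, recall the standard cycle count for a pipelined loop: the first iteration takes the full latency, and each later iteration starts II cycles after the previous one. The sketch below assumes T is the latency of one iteration in clock cycles; the function name is hypothetical and the slides do not state this formula.

```python
def pipelined_cycles(latency, ii, iterations):
    """Cycles to run `iterations` loop iterations on a pipelined datapath.

    latency: cycles for one iteration to complete (assumed meaning of T)
    ii:      initiation interval, cycles between successive iteration starts
    """
    if iterations <= 0:
        return 0
    return latency + (iterations - 1) * ii
```

For long loops the II term dominates, which is why a partition that lowers II can pay off even if it lengthens a single iteration.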
Granularity Adjustment
  For each possible partitioning direction, try different granularities to find the best performance
    Calculate the finest and coarsest grain for a homogeneous partitioning
      Finest: as few iterations as possible in each block RAM module, using all block RAM modules
      Coarsest: use as few block RAM modules as possible
    Estimate the global memory accesses m_r, the total memory accesses m_t, and their ratio
    Use the cost function to estimate the execution time
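The granularity sweep can be sketched as follows, assuming a homogeneous partitioning in which every module has the same capacity and granularity is expressed as the number of modules used. The function names and this element-count model are assumptions for illustration; the cost function itself is not reproduced because the slides do not give its form.

```python
import math

def granularity_range(num_elements, capacity, num_brams):
    """Return (coarsest, finest) as numbers of block RAM modules used.

    Coarsest: the fewest modules that can hold all the data.
    Finest:   all available modules.
    """
    coarsest = math.ceil(num_elements / capacity)
    if coarsest > num_brams:
        raise ValueError("data does not fit in the available block RAM")
    return coarsest, num_brams

def candidate_partitions(num_elements, capacity, num_brams):
    """List (modules_used, elements_per_module) from coarsest to finest."""
    coarsest, finest = granularity_range(num_elements, capacity, num_brams)
    return [(k, math.ceil(num_elements / k)) for k in range(coarsest, finest + 1)]

def global_access_ratio(m_r, m_t):
    """Ratio of global memory accesses m_r to total memory accesses m_t."""
    return m_r / m_t if m_t else 0.0
```

Each candidate would then be scored with the cost function, and the candidate with the lowest estimated execution time kept.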
Cost Function
  An empirical formulation based on our architectural-level synthesis results
    Estimates the initiation interval for pipelined designs
    Favors memory accesses to nearby block RAM modules
    Different resource utilizations and granularities affect the initiation interval
Outline
  Target architectures
  Data partitioning problem
  Memory optimizations
    Scalar replacement
    Data prefetching
  Experimental results
  Concluding remarks
Scalar Replacement
  Scalar replacement increases data reuse and reduces memory accesses
    Some memory locations were already accessed in the previous iteration
    Use the contents already held in registers rather than accessing memory again
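A small example makes the saving concrete. The sketch below uses a hypothetical two-point sum over neighboring elements (not one of the benchmarks) and counts array reads explicitly. The local variable `prev` stands in for the register that holds the value read in the previous iteration.

```python
def sum_neighbors_naive(a):
    """out[i] = a[i] + a[i+1], reading both operands from memory each time."""
    reads = 0
    out = []
    for i in range(len(a) - 1):
        left = a[i]
        right = a[i + 1]
        reads += 2
        out.append(left + right)
    return out, reads

def sum_neighbors_scalar_replaced(a):
    """Same result, but a[i] is reused from the previous iteration's read."""
    if len(a) < 2:
        return [], 0
    prev = a[0]  # value kept in a scalar (a register in hardware)
    reads = 1
    out = []
    for i in range(1, len(a)):
        cur = a[i]  # the only memory read in this iteration
        reads += 1
        out.append(prev + cur)
        prev = cur  # carry the value forward instead of re-reading it
    return out, reads
```

For an array of n elements the naive loop issues 2(n - 1) reads and the scalar-replaced loop issues n, so the memory traffic is roughly halved.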
Data Prefetching and Buffer Insertion
  Buffer insertion shortens critical paths and improves clock frequencies
    Schedule the global memory access one cycle earlier
    Reduce the length of critical paths
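The trade-off can be illustrated with a toy timing model. Everything here is an assumption for illustration: the delay values, the class and function names, and the simplification that an unbuffered read has a one-cycle latency while the inserted buffer adds exactly one more cycle. The model splits one long path (wire plus logic) into two shorter register-to-register paths and issues the read earlier to compensate.

```python
from dataclasses import dataclass

@dataclass
class RemoteRead:
    wire_delay_ns: float   # propagation delay from a far block RAM (assumed value)
    logic_delay_ns: float  # delay of the logic consuming the data (assumed value)

def critical_path_ns(read, buffered):
    """Longest register-to-register delay on the read path."""
    if buffered:
        # The buffer register separates the wire from the consuming logic.
        return max(read.wire_delay_ns, read.logic_delay_ns)
    return read.wire_delay_ns + read.logic_delay_ns

def issue_cycle(use_cycle, buffered):
    """Cycle in which the read must be issued so data arrives by use_cycle."""
    latency = 2 if buffered else 1  # the buffer adds one cycle of latency
    return use_cycle - latency
```

With a 4 ns wire and 5 ns of logic, the unbuffered path is 9 ns and the buffered one is 5 ns, at the price of issuing the read one cycle earlier.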
Outline
  Target architectures
  Data partitioning problem
  Memory optimizations
  Experimental results
  Concluding remarks
Experimental Setup
  Target architecture: Xilinx Virtex II FPGA
  Target frequency: 150 MHz
  Benchmarks: image processing and DSP applications
    SOBEL edge detection
    Bilinear filtering
    2D Gauss blurring
    1D Gauss filter
    SUSAN principle
Results: Architectural Exploration
  Correlation bank: different partitions of the array S deliver a wide variety of candidate solutions, with quite different overall performance after synthesis and physical design.
Results: Execution Time
  The average speedup is 2.75 times; after further optimizations, it rises to 4.80 times.
Results: Achievable Clock Frequencies
  Clock frequencies of the partitioned designs are about 10 percent lower than those of the original designs.
  After optimizations, they are about 7 percent higher than those of the partitioned designs.
Outline
  Target architectures
  Data partitioning problem
  Memory optimizations
  Experimental results
  Concluding remarks
Concluding Remarks
  A data and iteration space partitioning approach for homogeneous block RAM modules
    integrated with existing architectural-level synthesis techniques
    parallelizes input designs
    dramatically improves system performance
  Future work
    Irregular memory access
    Heterogeneous block RAM modules
Thank You
  Prof. Ryan Kastner and Gang Wang
  Reviewers
  All audiences
Questions