The Group
Runtime Optimization for High-Performance Computing
An Install-Time System for Automatic Generation of Optimized Parallel Sorting Algorithms
Marek Olszewski and Michael Voss
ECE Department
University of Toronto
PDPTA 2004
Motivation
Sorting is a fundamental algorithm
Many algorithmic choices for sorting
Performance heavily influenced by
  Data being sorted (type, entropy)
  Target machine being used
How can we build the best sort for a given machine?
  An empirical install-time system
Outline of Talk
Motivation
An Overview of Sorting Algorithms
Our install-time empirical system
  An adaptive hybrid sequential sort
  An adaptive hybrid parallel sort
An Evaluation
Related Work
Conclusions
An overview of sorting algorithms
Art of Computer Programming V3 (Knuth)
  25 algorithms comprehensively studied
Comparison sorts
  Lower bound shown to be Ω(n log n)
  Examples include: insertion sort, quick sort and merge sort
Non-comparison sorts
  Can be linear time, i.e. O(n)
  But require knowing the range of the data
  Examples include: radix sort and bucket sort
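To illustrate why a non-comparison sort needs the range of the data, here is a minimal counting sort sketch. It is not taken from the talk; the function name `counting_sort` and the parameters `lo` and `hi` are assumptions made for this example.

```cpp
#include <cstddef>
#include <vector>

// Counting sort for integers known to lie in [lo, hi] (an assumption the caller must guarantee).
// Runs in O(n + k) time, where k = hi - lo + 1 is the size of the key range.
std::vector<int> counting_sort(const std::vector<int>& input, int lo, int hi) {
    // One counter per possible key value; this is why the range must be known.
    std::vector<std::size_t> counts(static_cast<std::size_t>(hi - lo) + 1, 0);
    for (int value : input) {
        ++counts[static_cast<std::size_t>(value - lo)];
    }
    // Emit each key as many times as it was seen, in increasing key order.
    std::vector<int> output;
    output.reserve(input.size());
    for (std::size_t k = 0; k < counts.size(); ++k) {
        output.insert(output.end(), counts[k], lo + static_cast<int>(k));
    }
    return output;
}
```

No element is ever compared with another, so the Ω(n log n) comparison bound does not apply, but the memory and time both grow with the key range.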
An overview of sorting algorithms
Hybrid sorts
  Divide and conquer sorts are recursive
  May be beneficial to switch algorithms
  Most C++ STL sorts are hybrid sorts
Gnu std::sort is a hybrid sort with pre-defined points to switch between heap sort, quick sort, merge sort and insertion sort
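A minimal sketch of a hybrid sort with one pre-defined switch point: quick sort hands small subranges to insertion sort. This is an illustration only, not the GNU implementation; `kInsertionCutoff` and its value of 16 are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical switch point; a tuned library would choose this per machine.
constexpr std::ptrdiff_t kInsertionCutoff = 16;

// Insertion sort on the half-open index range [first, last).
void insertion_sort(std::vector<int>& a, std::ptrdiff_t first, std::ptrdiff_t last) {
    for (std::ptrdiff_t i = first + 1; i < last; ++i) {
        int key = a[i];
        std::ptrdiff_t j = i;
        while (j > first && a[j - 1] > key) {
            a[j] = a[j - 1];
            --j;
        }
        a[j] = key;
    }
}

// Quick sort that switches to insertion sort once a subrange is small.
void hybrid_sort(std::vector<int>& a, std::ptrdiff_t first, std::ptrdiff_t last) {
    if (last - first <= kInsertionCutoff) {
        insertion_sort(a, first, last);
        return;
    }
    const int pivot = a[first + (last - first) / 2];
    // Three-way split: [first, lt) < pivot, [lt, gt) == pivot, [gt, last) > pivot.
    auto lt = std::partition(a.begin() + first, a.begin() + last,
                             [pivot](int x) { return x < pivot; });
    auto gt = std::partition(lt, a.begin() + last,
                             [pivot](int x) { return x == pivot; });
    hybrid_sort(a, first, lt - a.begin());
    hybrid_sort(a, gt - a.begin(), last);
}
```

Calling `hybrid_sort(v, 0, v.size())` sorts the whole vector. The fixed cutoff is exactly the kind of constant that an empirical system could choose per machine instead.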
An overview of parallel sorts
Ideally, O((n log n) / p)
  If p = n, then O(log n)
  Several parallel sorts demonstrate this bound, e.g. Column sort
Parallelized sequential sorts often better for low numbers of processors (our focus)
Parallelized divide and conquer algorithms
  Effective for small numbers of processors
  Use a work-queue model
  Tasks are placed in a shared work-queue
  Idle processors remove tasks from the queue
  Good load balance
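The work-queue model can be sketched with standard C++ threads as below. This is an assumed, simplified design rather than the authors' code: the name `work_queue_sort`, the `sequential_cutoff` parameter and its default are all hypothetical.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Sorts `data` with workers that share one task queue.
// A task is a half-open index range [first, last) that still needs sorting.
void work_queue_sort(std::vector<int>& data, unsigned num_threads,
                     std::size_t sequential_cutoff = 4096) {
    using Task = std::pair<std::size_t, std::size_t>;
    std::deque<Task> tasks;
    std::mutex queue_mutex;
    std::condition_variable ready;
    std::size_t unfinished = 1;  // tasks that are queued or being processed
    tasks.emplace_back(std::size_t{0}, data.size());

    auto worker = [&] {
        for (;;) {
            Task task;
            {
                std::unique_lock<std::mutex> lock(queue_mutex);
                ready.wait(lock, [&] { return !tasks.empty() || unfinished == 0; });
                if (tasks.empty()) return;  // all work is done
                task = tasks.front();
                tasks.pop_front();
            }
            const std::size_t lo = task.first, hi = task.second;
            std::vector<Task> children;
            if (hi - lo <= sequential_cutoff) {
                std::sort(data.begin() + lo, data.begin() + hi);
            } else {
                // One quick sort step; the two halves become new shared tasks.
                const int pivot = data[lo + (hi - lo) / 2];
                auto lt = std::partition(data.begin() + lo, data.begin() + hi,
                                         [pivot](int x) { return x < pivot; });
                auto gt = std::partition(lt, data.begin() + hi,
                                         [pivot](int x) { return x == pivot; });
                children.emplace_back(lo, static_cast<std::size_t>(lt - data.begin()));
                children.emplace_back(static_cast<std::size_t>(gt - data.begin()), hi);
            }
            {
                std::lock_guard<std::mutex> lock(queue_mutex);
                for (const Task& child : children) tasks.push_back(child);
                unfinished += children.size();
                --unfinished;  // the task just processed is complete
            }
            ready.notify_all();
        }
    };

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < std::max(1u, num_threads); ++t) workers.emplace_back(worker);
    for (std::thread& w : workers) w.join();
}
```

Because idle workers pull whatever range is next in the queue, an uneven partition does not leave one thread with all the work, which is the load-balance property the slide refers to.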
Our install-time system
Start
1. Sample input data provided to installer
2. Time sorts: random algorithms at each recursive step
3. Calculate best sorting algorithm for each data set size
4. C4.5 creates decision tree
5. Convert tree to C++
6. Specialized decision function placed in library
7. Parallel? If not, end
8. Time sorts: different input sizes and work-share points
9. Work-share cutoff point tree and C++ functions generated
End
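As a rough idea of what a decision tree converted to C++ might look like, here is a hypothetical decision function. The enum `SortChoice`, the function `choose_sort`, and every threshold below are invented for illustration; the talk does not give the generated code or its actual form.

```cpp
#include <cstddef>

// The five algorithms available to the hybrid sort.
enum class SortChoice { Insertion, Quick, Merge, InPlaceMerge, Heap };

// Hypothetical generated decision function: each `if` corresponds to one
// internal node of a decision tree keyed on the current subproblem size.
// All thresholds are made up and would differ per machine and data type.
SortChoice choose_sort(std::size_t n) {
    if (n <= 32) {
        return SortChoice::Insertion;
    }
    if (n <= 100000) {
        if (n <= 2048) {
            return SortChoice::Heap;
        }
        return SortChoice::Quick;
    }
    return SortChoice::Merge;
}
```

A hybrid sort would call such a function at each recursive step and dispatch on the result, so the switch points come from measurements instead of being fixed in the library source.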
Algorithms available to our hybrid sort:
Algorithm | Description
Insertion Sort | O(n^2), but with small lower-order terms. Efficient for small lists.
Merge Sort | O(n log n). Subtasks evenly divided, but has higher lower-order terms than quick sort.
Quick Sort | O(n log n) on average, but O(n^2) worst-case. Has smaller lower-order terms than merge sort.
In-place Merge Sort | O(n log n). Higher constant coefficient than merge sort, but uses less memory.
Heap Sort | O(n log n). Non-recursive algorithm. Can do well on medium-sized lists. Higher lower-order terms than quick sort.
Hybrid Adaptive Sequential Sort
Use random data to train system
  Up to 10 million elements
  Insertion sort not used for large inputs
  Not all inputs sorted to completion
Dynamic programming used to find best choice
  Assume best sort at each subsequent step
  Per step timings were measured
C4.5 decision tree used to analyze this data
  C4.5 tree converted to C++ template code
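The dynamic programming idea (pick the best algorithm for a size, assuming the best choice is made at every smaller size) can be sketched under a simplified cost model. The model is an assumption of this example, not the authors' formulation: sizes are powers of two, a recursive candidate splits its input into two halves, and `step_time[k]` is the measured cost of one step at size 2^k (or of the whole sort for a non-recursive candidate). The names `Candidate`, `Plan` and `plan_best_sorts` are hypothetical.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

struct Candidate {
    bool recursive;                 // true if one step splits the input into two halves
    std::vector<double> step_time;  // step_time[k]: measured cost at size 2^k (needs >= levels entries)
};

struct Plan {
    std::vector<double> best_time;         // best_time[k]: cheapest total cost at size 2^k
    std::vector<std::size_t> best_choice;  // index of the candidate that achieves it
};

// Bottom-up dynamic programming over sizes 2^0 .. 2^(levels-1).
Plan plan_best_sorts(const std::vector<Candidate>& candidates, std::size_t levels) {
    Plan plan;
    plan.best_time.assign(levels, std::numeric_limits<double>::infinity());
    plan.best_choice.assign(levels, 0);
    for (std::size_t k = 0; k < levels; ++k) {
        for (std::size_t c = 0; c < candidates.size(); ++c) {
            double total = candidates[c].step_time[k];
            if (candidates[c].recursive) {
                if (k == 0) continue;  // a single element cannot be split further
                // Both halves are assumed to be sorted by the best choice for their size.
                total += 2.0 * plan.best_time[k - 1];
            }
            if (total < plan.best_time[k]) {
                plan.best_time[k] = total;
                plan.best_choice[k] = c;
            }
        }
    }
    return plan;
}
```

The resulting (size, best choice) pairs are the kind of labeled data that a decision tree learner such as C4.5 could then generalize.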
Hybrid Adaptive Parallel Sort
Start with sequential hybrid sort
Determine work-sharing cutoff point
  When should a thread execute its own tasks
  When should a thread place tasks in work queue
Determines the point at which tasks become too small to amortize synchronization costs
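A minimal sketch of a work-share cutoff, using `std::async` in place of the shared work queue to keep the example short (a swapped-in mechanism, not the talk's). The function name `cutoff_merge_sort` and the parameter `share_cutoff` are assumptions of this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Merge sort on [first, last). A half is handed to another thread only if it
// holds at least `share_cutoff` elements; smaller halves are sorted by the
// current thread. `share_cutoff` should be large enough that only a few
// threads are created.
void cutoff_merge_sort(std::vector<int>& a, std::size_t first, std::size_t last,
                       std::size_t share_cutoff) {
    if (last - first < 2) return;
    const std::size_t mid = first + (last - first) / 2;
    if (mid - first >= share_cutoff) {
        // Large subtask: the work outweighs the cost of sharing it.
        auto left = std::async(std::launch::async, [&a, first, mid, share_cutoff] {
            cutoff_merge_sort(a, first, mid, share_cutoff);
        });
        cutoff_merge_sort(a, mid, last, share_cutoff);
        left.get();  // wait for the shared half before merging
    } else {
        // Small subtask: sharing would cost more than it saves, so run it locally.
        cutoff_merge_sort(a, first, mid, share_cutoff);
        cutoff_merge_sort(a, mid, last, share_cutoff);
    }
    std::inplace_merge(a.begin() + first, a.begin() + mid, a.begin() + last);
}
```

Timing this function for several values of `share_cutoff` on a given machine is one way to locate the point where sharing stops paying off.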
Methodology: Platforms
Sequential platforms
  Linux 2.4.18 on a 1.6 GHz Intel Pentium 4 Xeon
  Linux 2.4.24 on an AMD Athlon XP 1700+
  SunOS 5.8 on a 600 MHz Sparc workstation
Parallel platform
  4-processor 1.6 GHz Intel Xeon SMP
  Modified 2.4.18-smp kernel (allowed binding)
Methodology: Comparisons
Adaptive Hybrid Sequential Sort
Adaptive Hybrid Parallel Sort
Gnu G++ 2.96 std::sort and std::stable_sort
  Also hybrid sorts
  Complex, not easily parallelized
  8 equally sized merge sorts that called std::sort and std::stable_sort in parallel
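One plausible reading of this baseline is: split the input into 8 equal chunks, sort each chunk with `std::sort` on its own thread, then merge the sorted chunks. The sketch below follows that reading; it is an interpretation, not the authors' benchmark code, and `chunked_parallel_sort` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Sorts `a` by sorting `chunks` equal pieces in parallel and merging them.
void chunked_parallel_sort(std::vector<int>& a, std::size_t chunks = 8) {
    if (chunks == 0) chunks = 1;
    // bounds[i] .. bounds[i + 1] is the index range of chunk i.
    std::vector<std::size_t> bounds(chunks + 1);
    for (std::size_t i = 0; i <= chunks; ++i) bounds[i] = a.size() * i / chunks;

    // Sort every chunk on its own thread using the library sort.
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < chunks; ++i) {
        workers.emplace_back([&a, &bounds, i] {
            std::sort(a.begin() + bounds[i], a.begin() + bounds[i + 1]);
        });
    }
    for (std::thread& w : workers) w.join();

    // Merge neighbouring sorted runs pairwise until a single run remains.
    for (std::size_t width = 1; width < chunks; width *= 2) {
        for (std::size_t i = 0; i + width < chunks; i += 2 * width) {
            const std::size_t end = std::min(i + 2 * width, chunks);
            std::inplace_merge(a.begin() + bounds[i], a.begin() + bounds[i + width],
                               a.begin() + bounds[end]);
        }
    }
}
```

Replacing `std::sort` with `std::stable_sort` in the worker gives the second variant named on the slide. The merge phase here is sequential, which is one reason such a baseline may not scale as well as a sort designed for parallel execution.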
Serial Non-Optimized (without -O) Results
Serial Optimized (with -O) Results
Parallel Work-share Cutoff Point
Parallel Non-Optimized (without -O) Results
Parallel Optimized (with -O) Results
Parallel Sort Speedups
Related Work
Install-time empirical optimization systems
  ATLAS: Level 3 BLAS
  FFTW: FFT
STAPL: Adaptive Parallel C++ Library
  Uses decision trees like our approach
  Uses only single-level sorts, not hybrids
  Not available for comparison
A Dynamically Tuned Sorting Library (CGO’04)
  Install-time tuning of sequential sorts
  Only single-level sorts, not hybrid
Conclusion
Presented an install-time system for empirically constructing a “best” sorting algorithm for a target machine
Competitive with STL sort on 1 processor
Better than a parallelized STL sort on multiple processors