
The Group

Runtime Optimization for High-Performance Computing

An Install-Time System for Automatic Generation of Optimized Parallel Sorting Algorithms

Marek Olszewski and Michael Voss

ECE Department

University of Toronto

PDPTA 2004

Motivation

Sorting is a fundamental algorithm
Many algorithmic choices for sorting
Performance heavily influenced by
  Data being sorted (type, entropy)
  Target machine being used
How can we build the best sort for a given machine?
  An empirical install-time system


Outline of Talk

Motivation
An Overview of Sorting Algorithms
Our install-time empirical system
  An adaptive hybrid sequential sort
  An adaptive hybrid parallel sort
An Evaluation
Related Work
Conclusions


An overview of sorting algorithms

Art of Computer Programming V3 (Knuth)
  25 algorithms comprehensively studied
Comparison sorts
  Lower bound shown to be Ω(n log n)
  Examples include: insertion sort, quick sort and merge sort
Non-comparison sorts
  Can be linear time, i.e. O(n)
  But require knowing the range of the data
  Examples include: radix sort and bucket sort
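To make the "range of the data" requirement concrete, here is a minimal sketch of counting sort, a simple relative of bucket sort that the slides do not name. The function name and the `min_value`/`max_value` parameters are illustrative assumptions. The sort does no element comparisons, but it only works because the caller states the range up front.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counting sort: time is linear in the input size plus the width of the
// value range. The caller must supply the range [min_value, max_value].
std::vector<int> counting_sort(const std::vector<int>& input,
                               int min_value, int max_value) {
    assert(min_value <= max_value);
    const std::size_t range =
        static_cast<std::size_t>(max_value - min_value) + 1;
    std::vector<std::size_t> counts(range, 0);
    for (int v : input) {
        // Values outside the declared range cannot be handled.
        assert(v >= min_value && v <= max_value);
        ++counts[static_cast<std::size_t>(v - min_value)];
    }
    // Emit each value as many times as it was seen, in increasing order.
    std::vector<int> sorted;
    sorted.reserve(input.size());
    for (std::size_t i = 0; i < range; ++i) {
        sorted.insert(sorted.end(), counts[i],
                      min_value + static_cast<int>(i));
    }
    return sorted;
}
```

The memory cost grows with the width of the range, which is one reason such sorts are not a drop-in replacement for comparison sorts.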


An overview of sorting algorithms

Hybrid sorts
  Divide and conquer sorts are recursive
  May be beneficial to switch algorithms
  Most C++ STL sorts are hybrid sorts
    GNU std::sort is a hybrid sort with pre-defined points to switch between heap sort, quick sort, merge sort and insertion sort
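A minimal sketch of a hybrid sort with one pre-defined switch point: quick sort recursion that hands small ranges to insertion sort. This is not the GNU implementation. The names `hybrid_sort` and `insertion_sort` and the default cutoff of 16 are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Iter = std::vector<int>::iterator;

// Insertion sort: O(n^2), but cheap on very short ranges.
void insertion_sort(Iter first, Iter last) {
    for (Iter i = first; i != last; ++i) {
        // Rotate *i into its place within the sorted prefix [first, i).
        std::rotate(std::upper_bound(first, i, *i), i, i + 1);
    }
}

// Quick sort that switches to insertion sort at a fixed, hand-picked cutoff.
void hybrid_sort(Iter first, Iter last, std::ptrdiff_t cutoff = 16) {
    assert(cutoff >= 1);
    if (last - first <= cutoff) {
        insertion_sort(first, last);
        return;
    }
    const int pivot = *(first + (last - first) / 2);
    // Three-way split: [first, mid1) < pivot, [mid1, mid2) == pivot,
    // [mid2, last) > pivot.
    Iter mid1 = std::partition(first, last,
                               [pivot](int x) { return x < pivot; });
    Iter mid2 = std::partition(mid1, last,
                               [pivot](int x) { return !(pivot < x); });
    hybrid_sort(first, mid1, cutoff);
    hybrid_sort(mid2, last, cutoff);
}
```

Here the cutoff is a fixed constant. The point of the talk is that the best value depends on the machine and the data.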


An overview of parallel sorts

Ideally, O((n log n) / p)
  If p = n, then O(log n)
  Several parallel sorts demonstrate this bound, e.g. Column sort
Parallelized sequential sorts often better for low numbers of processors (our focus).
Parallelized divide and conquer algorithms
  Effective for small numbers of processors
  Use a work-queue model
    Tasks are placed in a shared work-queue
    Idle processors remove tasks from the queue
    Good load balance


Our install-time system

Flowchart (steps in order):

Start
Sample input data provided to installer
Time sorts: random algorithms at each recursive step
Calculate best sorting algorithm for each data set size
C4.5 creates decision tree
Convert tree to C++
Specialized decision function placed in library
Parallel?
  No: End
  Yes:
    Time sorts: different input sizes and work-share points
    Work-share cutoff point tree and C++ functions generated
    End
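The timing step can be sketched roughly as follows. This is a simplified illustration, not the installer from the talk: it times whole sorts on random data rather than choices at each recursive step, and `Candidate`, `candidates` and `fastest_for_size` are hypothetical names. `std::stable_sort` stands in for merge sort.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <random>
#include <string>
#include <vector>

struct Candidate {
    std::string name;
    std::function<void(std::vector<int>&)> sort;
};

// Candidate algorithms to time (a small, illustrative subset).
std::vector<Candidate> candidates() {
    return {
        {"insertion", [](std::vector<int>& v) {
             for (auto i = v.begin(); i != v.end(); ++i)
                 std::rotate(std::upper_bound(v.begin(), i, *i), i, i + 1);
         }},
        {"heap", [](std::vector<int>& v) {
             std::make_heap(v.begin(), v.end());
             std::sort_heap(v.begin(), v.end());
         }},
        {"merge", [](std::vector<int>& v) {
             std::stable_sort(v.begin(), v.end());
         }},
    };
}

// Time every candidate on the same random input of size n and return the
// name of the fastest one on this machine.
std::string fastest_for_size(std::size_t n, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist;
    std::vector<int> sample(n);
    for (int& x : sample) x = dist(rng);

    std::string best;
    auto best_time = std::chrono::steady_clock::duration::max();
    for (const Candidate& c : candidates()) {
        std::vector<int> copy = sample;  // each candidate sees identical data
        const auto start = std::chrono::steady_clock::now();
        c.sort(copy);
        const auto elapsed = std::chrono::steady_clock::now() - start;
        if (elapsed < best_time) {
            best_time = elapsed;
            best = c.name;
        }
    }
    return best;
}
```

Calling `fastest_for_size` over a range of sizes yields (size, best algorithm) pairs, the kind of training data a decision-tree learner such as C4.5 could consume. A single timing per size is noisy, so a real harness would repeat each measurement.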


Algorithms available to our hybrid sort:

Algorithm: Description

Insertion Sort: O(n^2) but with small lower-order terms. Efficient for small lists.

Merge Sort: O(n log n). Subtasks evenly divided, but has higher lower-order terms than quick sort.

Quick Sort: O(n log n) on average, but O(n^2) worst-case. Has smaller lower-order terms than merge sort.

In-place Merge Sort: O(n log n). Higher constant coefficient than merge sort, but uses less memory.

Heap Sort: O(n log n). Non-recursive algorithm. Can do well on medium-sized lists. Higher lower-order terms than quick sort.


Hybrid Adaptive Sequential Sort

Use random data to train system
  Up to 10 million elements
  Insertion sort not used for large inputs
  Not all inputs sorted to completion
Dynamic programming used to find best choice
  Assume best sort at each subsequent step
  Per-step timings were measured
C4.5 decision tree used to analyze this data
  C4.5 tree converted to C++ template code
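A decision tree converted to C++ might look roughly like the nested conditionals below, consulted again at every recursive step. This is a sketch only: the thresholds are placeholders rather than measured values, the names `Algo`, `choose_algorithm` and `adaptive_sort` are hypothetical, and plain functions are used instead of the template code the slide mentions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Iter = std::vector<int>::iterator;

enum class Algo { Insertion, Heap, Quick };

// Hypothetical shape of a generated decision function. On a real install
// the branch points would come from the learned tree, not from these
// placeholder constants.
inline Algo choose_algorithm(std::size_t n) {
    if (n <= 32) return Algo::Insertion;
    if (n <= 4096) return Algo::Heap;
    return Algo::Quick;
}

// Hybrid sort that asks the decision function what to do for each range.
void adaptive_sort(Iter first, Iter last) {
    const std::size_t n = static_cast<std::size_t>(last - first);
    switch (choose_algorithm(n)) {
    case Algo::Insertion:
        for (Iter i = first; i != last; ++i)
            std::rotate(std::upper_bound(first, i, *i), i, i + 1);
        return;
    case Algo::Heap:
        std::make_heap(first, last);
        std::sort_heap(first, last);
        return;
    case Algo::Quick: {
        const int pivot = *(first + (last - first) / 2);
        Iter mid1 = std::partition(first, last,
                                   [pivot](int x) { return x < pivot; });
        Iter mid2 = std::partition(mid1, last,
                                   [pivot](int x) { return !(pivot < x); });
        // The choice is re-evaluated for each sub-range.
        adaptive_sort(first, mid1);
        adaptive_sort(mid2, last);
        return;
    }
    }
}
```

Because the decision is made per range, a large input can start with quick sort and finish its sub-ranges with heap sort or insertion sort.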


Hybrid Adaptive Parallel Sort

Start with sequential hybrid sort
Determine work-sharing cutoff point
  When should a thread execute its own tasks?
  When should a thread place tasks in the work queue?
Determines the point at which tasks become too small to amortize synchronization costs
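The role of the cutoff can be sketched with a small work-queue quick sort. A thread places a sub-range in the shared queue only if it is larger than `cutoff`, and otherwise sorts it itself. This is an illustrative sketch, not the authors' implementation: `work_queue_sort`, `Range` and the parameters are assumed names, and `std::sort` stands in for the sequential hybrid sort.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Range = std::pair<std::size_t, std::size_t>;  // half-open [first, second)

void work_queue_sort(std::vector<int>& data, unsigned num_threads,
                     std::size_t cutoff) {
    int* const base = data.data();
    std::deque<Range> queue;           // shared work queue
    queue.emplace_back(0, data.size());
    std::size_t pending = 1;           // ranges queued or being processed
    std::mutex mutex;
    std::condition_variable ready;

    auto worker = [&]() {
        for (;;) {
            Range task;
            {
                std::unique_lock<std::mutex> lock(mutex);
                ready.wait(lock,
                           [&] { return !queue.empty() || pending == 0; });
                if (queue.empty()) return;  // pending == 0: all work is done
                task = queue.front();
                queue.pop_front();
            }
            std::size_t lo = task.first, hi = task.second;
            while (hi - lo > cutoff) {
                const int pivot = base[lo + (hi - lo) / 2];
                int* mid1 = std::partition(base + lo, base + hi,
                                           [pivot](int x) { return x < pivot; });
                int* mid2 = std::partition(mid1, base + hi,
                                           [pivot](int x) { return !(pivot < x); });
                const std::size_t right = static_cast<std::size_t>(mid2 - base);
                if (hi - right > cutoff) {
                    // Large enough to be worth the synchronization: share it.
                    {
                        std::lock_guard<std::mutex> lock(mutex);
                        queue.emplace_back(right, hi);
                        ++pending;
                    }
                    ready.notify_one();
                } else {
                    std::sort(mid2, base + hi);  // too small: do it ourselves
                }
                hi = static_cast<std::size_t>(mid1 - base);  // keep the left part
            }
            std::sort(base + lo, base + hi);
            std::lock_guard<std::mutex> lock(mutex);
            if (--pending == 0) ready.notify_all();
        }
    };

    if (num_threads == 0) num_threads = 1;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < num_threads; ++i) workers.emplace_back(worker);
    for (std::thread& t : workers) t.join();
}
```

A call such as `work_queue_sort(v, 4, 64)` uses four threads and shares only ranges longer than 64 elements. A small cutoff means many queue operations on tiny tasks, and a large one leaves threads idle, which is why the value is worth measuring on each machine.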


Methodology: Platforms

Sequential platforms
  Linux 2.4.18 on an Intel Pentium 4 1.6 GHz Xeon
  Linux 2.4.24 on an AMD Athlon XP 1700+
  SunOS 5.8 on a 600 MHz Sparc Workstation
Parallel platform
  4 processor 1.6 GHz Intel Xeon SMP
  Modified 2.4.18-smp kernel (allowed binding)


Methodology: Comparisons

Adaptive Hybrid Sequential Sort
Adaptive Hybrid Parallel Sort
GNU G++ 2.96 std::sort and std::stable_sort
  Also hybrid sorts
  Complex – not easily parallelized
8 equally sized merge sorts that called std::sort and std::stable_sort in parallel
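One reading of this baseline is sketched below: split the input into 8 equal chunks, sort each chunk with `std::sort` on its own thread, then merge the sorted chunks. The function name `chunked_parallel_sort` and the sequential pairwise merge phase are assumptions. The slides do not say how the merging was done.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

void chunked_parallel_sort(std::vector<int>& data, std::size_t chunks = 8) {
    assert(chunks >= 1);
    const std::size_t n = data.size();

    // bounds[i] is the start of chunk i; chunk sizes differ by at most one.
    std::vector<std::ptrdiff_t> bounds(chunks + 1);
    for (std::size_t i = 0; i <= chunks; ++i)
        bounds[i] = static_cast<std::ptrdiff_t>(n * i / chunks);

    // Sort every chunk in parallel; the chunks are disjoint, so no locking.
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < chunks; ++i) {
        workers.emplace_back([&data, &bounds, i] {
            std::sort(data.begin() + bounds[i], data.begin() + bounds[i + 1]);
        });
    }
    for (std::thread& t : workers) t.join();

    // Merge neighbouring sorted runs pairwise: widths of 1, 2, 4 chunks.
    for (std::size_t width = 1; width < chunks; width *= 2) {
        for (std::size_t i = 0; i + width < chunks; i += 2 * width) {
            const std::size_t end = std::min(i + 2 * width, chunks);
            std::inplace_merge(data.begin() + bounds[i],
                               data.begin() + bounds[i + width],
                               data.begin() + bounds[end]);
        }
    }
}
```

Swapping `std::sort` for `std::stable_sort` in the worker gives the second variant mentioned on the slide.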


Serial Non-Optimized (without -O) Results

Serial Optimized (with -O) Results

Parallel Work-share Cutoff Point

Parallel Non-Optimized (without -O) Results

Parallel Optimized (with -O) Results

Parallel Sort Speedups

Related Work

Install-time empirical optimization systems
  ATLAS: Level 3 BLAS
  FFTW: FFT
STAPL: Adaptive Parallel C++ Library
  Uses decision trees like our approach
  Uses only single-level sorts, not hybrids
  Not available for comparison
A Dynamically Tuned Sorting Library (CGO’04)
  Install-time tuning of sequential sorts
  Only single-level sorts, not hybrid


Conclusion

Presented an install-time system for empirically constructing a “best” sorting algorithm for a target machine
  Competitive with STL sort on 1 processor
  Better than a parallelized STL sort on multiple processors
