Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms
Yuanyuan Sun, Feiteng Yang

Upload: krysta

Post on 30-Jan-2016

DESCRIPTION

Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms. Yuanyuan Sun, Feiteng Yang. Source: ACM Symposium on Parallel Algorithms and Architectures. PowerPoint PPT presentation.

TRANSCRIPT

Page 1: Yuanyuan Sun Feiteng Yang

Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms

Yuanyuan Sun, Feiteng Yang

Page 2: Yuanyuan Sun Feiteng Yang

Source

ACM Symposium on Parallel Algorithms and Architectures: Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures

Authors

Timothy Furtak, José Nelson Amaral, Robert Niewiadomski

Page 3: Yuanyuan Sun Feiteng Yang

Outline

Introduction
Sorting network
Sorting algorithms
Experimental evaluation
Contributions

Page 4: Yuanyuan Sun Feiteng Yang

Introduction

Use SIMD resources to improve the performance of sorting algorithms for short sequences.

Initial inspiration: the need for fast sorting of short sequences in the implementation of graphics rendering for interactive video games.

SIMD machinery

Page 5: Yuanyuan Sun Feiteng Yang

Introduction

SIMD machinery

x86-64's SSE2 (Streaming SIMD Extensions 2)
G5's AltiVec

AltiVec and SSE2 are SIMD instruction sets; both feature 128-bit vector registers.

Page 6: Yuanyuan Sun Feiteng Yang

Sorting network

A sorting network is a comparator network that produces a sorted output for any possible input sequence.

COMP(a, b): the inputs a and b are two storage units (memory locations, registers, or vector-register elements), each containing a numerical input.

Page 7: Yuanyuan Sun Feiteng Yang

Sorting network

Size: the total number of comparators in the network.

Depth: the length of the critical path in the network's dependence graph.
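To make the two measures concrete, here is a minimal Python sketch that computes them for a network written as an ordered list of comparators. The list-of-pairs encoding and the function names are assumptions made for illustration only; the example network is the 4-input one that appears later in the slides.

```python
def network_size(comparators):
    """Size: the total number of comparators."""
    return len(comparators)


def network_depth(comparators):
    """Depth: length of the longest chain of dependent comparators."""
    ready_at = {}  # wire name -> level of the last comparator that touched it
    depth = 0
    for a, b in comparators:
        # A comparator can only run after both of its inputs are final.
        level = max(ready_at.get(a, 0), ready_at.get(b, 0)) + 1
        ready_at[a] = ready_at[b] = level
        depth = max(depth, level)
    return depth


# Hypothetical encoding of comp(a,c), comp(b,d), comp(a,b), comp(c,d), comp(b,c)
network = [("a", "c"), ("b", "d"), ("a", "b"), ("c", "d"), ("b", "c")]
print(network_size(network), network_depth(network))  # 5 3
```

Comparators that share no wire land on the same level, which is why five comparators need only three dependent steps.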

Page 8: Yuanyuan Sun Feiteng Yang

Sorting network

Page 9: Yuanyuan Sun Feiteng Yang

Sorting network

A comparator moves the larger value to the left and the smaller value to the right.

For instance, the network in Figure 1 has size = 5 and depth = 3.
Inputs: a = 7, b = 2, c = 5, d = 9
Output: a = 9, b = 7, c = 5, d = 2

Page 10: Yuanyuan Sun Feiteng Yang

Supporting hardware for Sorting Network

Min and max instructions:
min(a, b) = a if a ≤ b, b otherwise
max(a, b) = a if a ≥ b, b otherwise

The comparator required by a sorting network is easily constructed using these two operations, a copy instruction, and a temporary variable.
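The construction can be sketched in a few lines. This is a scalar Python illustration, not SSE2 code; the function name `comp` and the convention of returning the pair are assumptions.

```python
def comp(a, b):
    """Return (larger, smaller) using only min, max, one copy and a temporary."""
    temp = a          # copy: keep the old value of a
    a = max(a, b)     # a receives the larger value
    b = min(temp, b)  # b receives the smaller value, computed from the saved copy
    return a, b


print(comp(2, 7))  # (7, 2)
```

The temporary is needed because `a` is overwritten by the max before the min is taken.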

Page 11: Yuanyuan Sun Feiteng Yang

Supporting hardware for Sorting Network

The x86-64 architecture supports the SSE2 min and max operations, which return the minimum (maximum) of packed single-precision floating-point values.

Page 12: Yuanyuan Sun Feiteng Yang

Supporting hardware for Sorting Network

Width: the number of vectors being sorted.

x86-64 has 16 XMM vector registers, and each register can hold 4 single-precision floating-point values.

Sorting the values in n XMM registers using a sorting network produces 4 sorted streams of data, each of length n. Here 1 ≤ n < 16, because one register must be reserved as temporary storage for swapping values.
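The effect of running one network across whole registers can be modelled in plain Python, with each register represented as a list of 4 values. This is only a model of the lane-wise behaviour: the helper names are hypothetical, and the odd-even transposition network is a stand-in chosen for brevity, not necessarily the network used in the paper.

```python
LANES = 4  # one XMM register holds 4 single-precision values


def vmax(x, y):
    """Lane-wise maximum, modelling a packed max instruction."""
    return [max(p, q) for p, q in zip(x, y)]


def vmin(x, y):
    """Lane-wise minimum, modelling a packed min instruction."""
    return [min(p, q) for p, q in zip(x, y)]


def vcomp(regs, i, j):
    """Comparator on whole registers: larger lanes to regs[i], smaller to regs[j]."""
    temp = regs[i]  # plays the role of the reserved temporary register
    regs[i] = vmax(regs[i], regs[j])
    regs[j] = vmin(temp, regs[j])


def sort_lanes(regs):
    """Sort each of the 4 lanes independently across the n registers."""
    n = len(regs)
    for phase in range(n):  # odd-even transposition network (assumed stand-in)
        for i in range(phase % 2, n - 1, 2):
            vcomp(regs, i, i + 1)
    return regs


regs = sort_lanes([[3, 8, 1, 6], [9, 2, 7, 4], [5, 5, 0, 1]])
print(regs)  # [[9, 8, 7, 6], [5, 5, 1, 4], [3, 2, 0, 1]]
```

Each comparator handles 4 independent pairs at once, so reading any single lane down the registers gives one sorted stream of length n.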

Page 13: Yuanyuan Sun Feiteng Yang

Three sorting methods

Two-pass sorting with insertion sort

Two-pass sorting with merge sort

One-pass sorting (register sorting)

Page 14: Yuanyuan Sun Feiteng Yang

Two-pass sorting

In the first phase, the SIMD registers and instructions are used to generate a partially sorted output.

In the second phase, a standard sorting algorithm finishes the sort. Insertion sort and mergesort are the two investigated in this paper.
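A rough Python model of the two phases, using insertion sort for the second one, might look as follows. It assumes the input length is a multiple of 4, sorts in ascending order, and uses Python's built-in `sorted` on each column as a stand-in for the in-register sorting network; all function names are hypothetical.

```python
def simd_phase(values, lanes=4):
    """Phase 1 stand-in: view the input as rows of `lanes` values, sort each column."""
    rows = [values[i:i + lanes] for i in range(0, len(values), lanes)]
    cols = [sorted(col) for col in zip(*rows)]      # stand-in for the sorting network
    return [x for row in zip(*cols) for x in row]   # back to row-major memory order


def insertion_sort(seq):
    """Phase 2: plain insertion sort, cheap when the input is already partially sorted."""
    seq = list(seq)
    for i in range(1, len(seq)):
        key = seq[i]
        j = i - 1
        while j >= 0 and seq[j] > key:
            seq[j + 1] = seq[j]
            j -= 1
        seq[j + 1] = key
    return seq


def two_pass_sort(values, lanes=4):
    return insertion_sort(simd_phase(values, lanes))


data = [9, 3, 7, 1, 4, 8, 2, 6, 5, 0, 11, 10]
print(simd_phase(data))     # [4, 0, 2, 1, 5, 3, 7, 6, 9, 8, 11, 10]
print(two_pass_sort(data))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

After the first phase every fourth element forms a sorted subsequence, so the insertion sort has far fewer long shifts to perform than on the raw input.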

Page 15: Yuanyuan Sun Feiteng Yang

First phase: SIMD sort

Vector registers

A1 B1 C1 D1
A2 B2 C2 D2
……
An Bn Cn Dn

After SIMD sort:

Page 16: Yuanyuan Sun Feiteng Yang

Second phase

Insertion sort

A1 A2 A3 A4
A5 A6 A7 A8
A9 A10 A11 A12

A1<A5<A9, A2<A6<A10, A3<A7<A11, A4<A8<A12

Merge sort

A1 A4 A7 A10
A2 A5 A8 A11
A3 A6 A9 A12

A1<A2<A3, A4<A5<A6, A7<A8<A9, A10<A11<A12

A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12
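For the merge-sort variant, the first phase leaves contiguous sorted runs, which the second phase then merges. A minimal Python sketch of that merging step is shown here; the run representation (a list of sorted lists) and the function names are assumptions for illustration.

```python
def merge(left, right):
    """Standard two-way merge of two ascending lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]


def merge_runs(runs):
    """Merge sorted runs pairwise until a single sorted sequence remains."""
    runs = [list(r) for r in runs]
    while len(runs) > 1:
        runs = [merge(runs[k], runs[k + 1]) if k + 1 < len(runs) else runs[k]
                for k in range(0, len(runs), 2)]
    return runs[0] if runs else []


# Four sorted runs of length 3, as produced from 3 registers of 4 lanes
print(merge_runs([[1, 6, 10], [0, 3, 8], [2, 7, 11], [4, 5, 9]]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Because the runs are already sorted, the merge sort can skip its lowest levels of recursion and start from runs of length n.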

Page 17: Yuanyuan Sun Feiteng Yang

One pass sorting (Register sorting)

Algorithm input
Initial state
Align a set of comparators
Write values back to memory

Page 18: Yuanyuan Sun Feiteng Yang

4-element example

P1 = {comp(a,c), comp(b,d)}
P2 = {comp(a,b), comp(c,d)}
P3 = {comp(b,c)}
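The three phases can be traced with a short scalar Python sketch. The index constants and function names are illustrative assumptions; the comparator follows the convention stated earlier in the slides (larger value to the first operand).

```python
def comp(v, i, j):
    """Comparator: larger value goes to position i, smaller to position j."""
    temp = v[i]
    v[i] = max(v[i], v[j])
    v[j] = min(temp, v[j])


A, B, C, D = 0, 1, 2, 3
PHASES = [
    [(A, C), (B, D)],  # P1
    [(A, B), (C, D)],  # P2
    [(B, C)],          # P3
]


def sort4(values):
    v = list(values)
    for phase in PHASES:
        # Comparators within one phase touch disjoint elements,
        # so they are independent of each other.
        for i, j in phase:
            comp(v, i, j)
    return v


print(sort4([7, 2, 5, 9]))  # [9, 7, 5, 2]
```

With inputs a = 7, b = 2, c = 5, d = 9, P1 gives (7, 9, 5, 2), P2 gives (9, 7, 5, 2), and P3 leaves the values unchanged.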

Page 19: Yuanyuan Sun Feiteng Yang

One concrete example

Page 20: Yuanyuan Sun Feiteng Yang

SSE2 instructions used

Page 21: Yuanyuan Sun Feiteng Yang

The method is also applied to sorting key-pointer pairs and to d-heaps.

Page 22: Yuanyuan Sun Feiteng Yang

Evaluation

Page 23: Yuanyuan Sun Feiteng Yang

Contributions

Effective use of SIMD resources to improve the performance of sorting short sequences, through a reduction in memory references and an increase in ILP.

Page 24: Yuanyuan Sun Feiteng Yang

Contributions

1. Three algorithms that use the SIMD machinery for efficient in-register sorting of short sequences.

2. A method that uses iterative-deepening search to find fast instruction sequences for moving data within the SIMD registers.

3. An extensive experimental study indicating that the elimination of loads, stores, and branches correlates well with improved performance.
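As an illustration of the second contribution, the sketch below runs an iterative-deepening search for the shortest sequence of data-movement operations that turns one 4-lane arrangement into another. The three moves are hypothetical permutations invented for this example; they are not real SSE2 instructions, and the paper's actual search space and cost model are not reproduced here.

```python
# Hypothetical data-movement operations on a 4-lane register.
# Each permutation p means: new[k] = old[p[k]].
MOVES = {
    "rotate": (1, 2, 3, 0),
    "swap_pairs": (1, 0, 3, 2),
    "reverse": (3, 2, 1, 0),
}


def apply_move(state, perm):
    return tuple(state[k] for k in perm)


def depth_limited(state, target, limit, path):
    """Depth-first search that gives up after `limit` further moves."""
    if state == target:
        return path
    if limit == 0:
        return None
    for name, perm in MOVES.items():
        found = depth_limited(apply_move(state, perm), target, limit - 1, path + [name])
        if found is not None:
            return found
    return None


def iterative_deepening(start, target, max_depth=6):
    """Try limits 0, 1, 2, ... so the first sequence found is a shortest one."""
    for limit in range(max_depth + 1):
        found = depth_limited(start, target, limit, [])
        if found is not None:
            return found
    return None  # not reachable within max_depth moves


print(iterative_deepening((0, 1, 2, 3), (2, 3, 0, 1)))  # ['rotate', 'rotate']
```

Iterative deepening keeps the memory use of depth-first search while still returning a sequence with the fewest operations, which is the property that matters when each operation is an instruction on the critical path.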