a demonstration of exact string matching algorithms with cuda

Post on 03-Apr-2015








Demonstration of Exact String Matching Algorithms using CUDA

Author List

Raymond Tay (Autodesk, formerly Linden Lab)


In this chapter, the author presents a demonstration application of three commonly used exact string matching algorithms using NVIDIA CUDA technology: Brute-force, QuickSearch and Horspool. The author applies known CUDA techniques to implement, test and, where applicable, optimize them. The main challenge was mapping CUDA's threading and memory model onto algorithms that were designed to execute on a single CPU core. Through this effort, the author hopes to demonstrate the power of CUDA to the budding GPU developer.

Introduction, Problem Statement, and Context

String matching is a very important subject in the wider domain of text processing. String-matching algorithms are basic components in the implementation of practical software on most operating systems. String matching consists of finding one or more occurrences of a pattern in a body of text. All the algorithms in this work locate all occurrences of the pattern in the text body, aided by GPU acceleration. The algorithms were tested with patterns whose lengths are both shorter and longer than the alphabet size. The pattern is denoted by x = x[0..m-1], where m is its length; the text is denoted by y = y[0..n-1], where n is its length. The alphabet of the text and pattern is the set of all symbols used to represent strings and is denoted by ∑ (e.g. the alphabet of a binary string is ∑={0,1}); its size is denoted by ∂ (e.g. for binary strings, ∂=2).

The author is aware of the wide applicability of string matching algorithms, ranging from text editors and the popular Unix tool grep to virus scanning and locating DNA sequences. The author believes that the techniques devised here can be leveraged on current mid-range workstations, as they normally come equipped with CUDA/OpenCL-enabled graphics cards.

Core Method

The methods applied in the development include the following:

1) Find ways to parallelize the sequential code

2) Minimize data transfer between the host and device

3) Coalesce global memory accesses as much as possible

4) Avoid branch divergence within a CUDA warp

For all the algorithms, the work revolves around getting a CUDA thread to scan the text and locate a match; if a thread finds a match, it updates a data structure that records the position where the pattern was found. The data structures needed by the CUDA threads will be provided by the CUDA kernel.

Algorithms, Implementations, and Evaluations

Brute-force

The sequential form consists of a function, BF (short for BruteForce), which attempts to match the pattern against the text by scanning the text from left to right. In the sequential code, a single thread conducts the search, and when it finds a match the algorithm outputs the position at which it was found to the console.

In the CUDA version, N threads can conduct the same search in parallel. Each of the N threads scans for a match of the pattern in the text, and when it discovers a match it updates a data structure that stores the found indices.
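As a minimal sketch of the brute-force scan described above, written in plain C, the per-offset check can be separated from the loop that drives it. This is not the author's original listing; the names bf_match_at, bf_search, found and cap are hypothetical and chosen only for illustration. The per-offset function is the kind of work a single CUDA thread could perform for its own offset, while the loop plays the role of the single sequential thread.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if pattern x (length m) occurs in text y (length n) at offset j.
   In a CUDA port, one thread could run this check for its own offset j
   (an assumption of this sketch, not the author's kernel). */
int bf_match_at(const char *x, size_t m, const char *y, size_t n, size_t j)
{
    if (m == 0 || m > n || j > n - m)
        return 0;
    return memcmp(x, y + j, m) == 0;
}

/* Sequential driver: tries every offset from left to right and stores the
   matching offsets in found[] (capacity cap). Returns the total number of
   matches, which may exceed cap. */
size_t bf_search(const char *x, size_t m, const char *y, size_t n,
                 size_t *found, size_t cap)
{
    size_t count = 0;
    if (m == 0 || m > n)
        return 0;
    for (size_t j = 0; j <= n - m; ++j) {
        if (bf_match_at(x, m, y, n, j)) {
            if (count < cap)
                found[count] = j;
            ++count;
        }
    }
    return count;
}
```

Recording offsets in an array rather than printing them mirrors the data structure of found indices that the CUDA version updates.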

The source code for the sequential and parallelized (CUDA) versions is shown below for illustration purposes.

Illustration 1: Sequential Brute Force

In the worst case, where occurrences of the pattern follow one another in the text, each CUDA thread reads every character of the pattern and obtains a match; with N threads this translates to (N*m) bytes of data being read. Each CUDA thread potentially writes at most n/m times (assuming occurrences of the pattern follow one after another), but in general the text and pattern could be completely random.

QuickSearch

The sequential QuickSearch is a variant of the popular Boyer-Moore algorithm that does not suffer from sub-optimal performance when matching patterns drawn from small alphabets, such as DNA.

In the classic QuickSearch, the inventor of the algorithm dropped the “good suffix shift” (also known as the “matching shift”) computation in favour of the “bad-character shift” (also known as the “occurrence shift”) computation. The algorithm precomputes the “bad-character shift” table for the pattern and then uses that table to guide its search for the pattern in the text body.
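The precomputation and scan can be sketched in plain C as follows. This follows the textbook formulation of QuickSearch rather than the author's listing, which is not reproduced here; ASIZE (a byte-sized alphabet of 256 symbols), pre_qs_bc, qs_search, found and cap are names assumed for this sketch.

```c
#include <stddef.h>
#include <string.h>

#define ASIZE 256  /* assumed alphabet size: one table entry per byte value */

/* Bad-character (occurrence) shift for QuickSearch. For a character that
   occurs in the pattern, the shift is the distance from its rightmost
   occurrence to one position past the end of the pattern; for any other
   character it is m + 1. */
void pre_qs_bc(const unsigned char *x, size_t m, size_t qs_bc[ASIZE])
{
    for (size_t c = 0; c < ASIZE; ++c)
        qs_bc[c] = m + 1;
    for (size_t i = 0; i < m; ++i)
        qs_bc[x[i]] = m - i;
}

/* Sequential QuickSearch: compares the window starting at j, then shifts by
   the table entry of the text character just past the window. Stores up to
   cap match offsets in found[] and returns the total match count. */
size_t qs_search(const unsigned char *x, size_t m,
                 const unsigned char *y, size_t n,
                 size_t *found, size_t cap)
{
    size_t qs_bc[ASIZE];
    size_t count = 0;

    if (m == 0 || m > n)
        return 0;
    pre_qs_bc(x, m, qs_bc);

    size_t j = 0;
    while (j <= n - m) {
        if (memcmp(x, y + j, m) == 0) {
            if (count < cap)
                found[count] = j;
            ++count;
        }
        if (j + m >= n)   /* no character follows the window: stop */
            break;
        j += qs_bc[y[j + m]];
    }
    return count;
}
```

The explicit check on j + m avoids reading past the end of the text, so the sketch does not rely on the text being NUL-terminated.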

In the CUDA version, the classic QuickSearch has been reorganized so that the “bad-character shift” computation is parallelized. For the scanning code, a “skipping distance” data structure is pre-computed for use by the CUDA kernel; it is a 1D array containing the skipping distances regardless of a match or mismatch, and each valid element is a CUDA thread's id. In the CUDA kernel, a thread executes the scanning code only if it can locate its id in this “skipping distance” data structure.
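One possible reading of this two-phase scheme is sketched below in plain C, with a loop over thread ids standing in for a kernel launch. It is an interpretation under stated assumptions, not the author's kernel: the flag array is_candidate is used here in place of an array of thread ids so that each thread can look up its own id in constant time, and the names mark_candidates, scan_thread and found are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define ASIZE 256  /* assumed alphabet size */

/* Phase 1: walk the text with the QuickSearch shift and flag every window
   start that a sequential scan would visit, whether or not it matches.
   is_candidate must have room for n entries. */
void mark_candidates(const unsigned char *x, size_t m,
                     const unsigned char *y, size_t n,
                     unsigned char *is_candidate)
{
    size_t shift[ASIZE];

    memset(is_candidate, 0, n);
    if (m == 0 || m > n)
        return;
    for (size_t c = 0; c < ASIZE; ++c)
        shift[c] = m + 1;
    for (size_t i = 0; i < m; ++i)
        shift[x[i]] = m - i;

    for (size_t j = 0; j <= n - m; ) {
        is_candidate[j] = 1;
        if (j + m >= n)
            break;
        j += shift[y[j + m]];
    }
}

/* Phase 2: the body that one hypothetical thread with id tid would run.
   The thread compares the pattern only if its id was flagged in phase 1.
   found must have room for n entries and be zeroed by the caller. */
void scan_thread(size_t tid, const unsigned char *x, size_t m,
                 const unsigned char *y, size_t n,
                 const unsigned char *is_candidate, unsigned char *found)
{
    if (tid >= n || tid + m > n || !is_candidate[tid])
        return;
    if (memcmp(x, y + tid, m) == 0)
        found[tid] = 1;
}
```

Calling scan_thread for every tid from 0 to n-1 emulates the launch; threads whose id is not flagged return immediately, which is also where branch divergence within a warp would arise in a real kernel.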

Illustration 2: CUDA Brute Force

The source code for the sequential and CUDA versions of QuickSearch is presented below:

Illustration 3: Sequential QuickSearch

Illustration 4: CUDA QuickSearch

Horspool

In the classic Horspool algorithm, the implementation favours the use of the bad-character shift computation alone, and it is not very efficient when the pattern is shorter than the alphabet, i.e. m < ∂.

The “bad-character shift” computation is the same as the one shown in the sequential QuickSearch.
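For reference, a minimal plain C sketch of the textbook sequential Horspool algorithm follows. It is not the author's listing, and the table in the author's code may differ: in the textbook formulation shown here, the table is built from all pattern characters except the last and is indexed by the last character of the current text window. ASIZE, horspool_search, found and cap are names assumed for this sketch.

```c
#include <stddef.h>
#include <string.h>

#define ASIZE 256  /* assumed alphabet size: one table entry per byte value */

/* Textbook Horspool search. Stores up to cap match offsets in found[] and
   returns the total match count. */
size_t horspool_search(const unsigned char *x, size_t m,
                       const unsigned char *y, size_t n,
                       size_t *found, size_t cap)
{
    size_t bc[ASIZE];
    size_t count = 0;

    if (m == 0 || m > n)
        return 0;

    /* Bad-character table: distance from the rightmost occurrence of each
       character among x[0..m-2] to the last pattern position; m otherwise. */
    for (size_t c = 0; c < ASIZE; ++c)
        bc[c] = m;
    for (size_t i = 0; i + 1 < m; ++i)
        bc[x[i]] = m - 1 - i;

    for (size_t j = 0; j <= n - m; ) {
        unsigned char last = y[j + m - 1];
        /* Compare the last character first, then the rest of the window. */
        if (x[m - 1] == last && memcmp(x, y + j, m - 1) == 0) {
            if (count < cap)
                found[count] = j;
            ++count;
        }
        j += bc[last];   /* shift is always at least 1 */
    }
    return count;
}
```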

In the CUDA version, the author's approach is very similar to the CUDA implementation of QuickSearch: the “bad-character shift” computation is parallelized, and for the scanning code a “skipping distance” data structure is pre-computed for use by the CUDA kernel. This data structure is a 1D array containing the skipping distances regardless of a match or mismatch, and each valid element is a CUDA thread's id. In the CUDA kernel, a thread executes the scanning code only if it can locate its id in this data structure.

The source code for the sequential and CUDA versions of Horspool is shown below:

Illustration 5: Sequential Horspool

Illustration 6: CUDA Horspool

Evaluation

The author benchmarked the three sequential algorithms and their CUDA equivalents, applying some, but not all, CUDA techniques and technology. Each test was run for 100 iterations and the results were averaged. The tests were run on a 32-bit Ubuntu OS with an NVIDIA GTX 480 card, an 8-core Intel i7 CPU and 6 GB of system RAM.

Two sorts of tests were conducted: (1) the pattern was shorter than the alphabet size, and (2) the pattern was longer than the alphabet size.

One observation from the tests is that the speedup factor of the CUDA code over the sequential code ranges from 31 to 106. Another observation is that the CUDA versions of the code do exhibit branch divergence and bank conflicts, and this behavior is highly dependent on the pattern and the text involved.
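The two derived metrics can be expressed as small helper functions. This sketch assumes that the speedup factor is the sequential runtime divided by the CUDA runtime, and that effective bandwidth follows the definition in NVIDIA's best-practices guidance (bytes read plus bytes written, divided by 10^9 and by the elapsed time in seconds). The function names are hypothetical, and the byte counts used for the measurements are not stated in the source.

```c
/* Speedup factor: sequential runtime divided by CUDA runtime.
   Both arguments must use the same unit (milliseconds here). */
double speedup(double seq_ms, double cuda_ms)
{
    return seq_ms / cuda_ms;
}

/* Effective bandwidth in GB/s:
   (bytes read + bytes written) / 10^9 / elapsed seconds. */
double effective_bandwidth_gbps(double bytes_read, double bytes_written,
                                double time_ms)
{
    return (bytes_read + bytes_written) / 1e9 / (time_ms / 1e3);
}
```

For example, a sequential runtime of 24 ms against a CUDA runtime of 0.24 ms gives a ratio of 100.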

Here is the summary:

Algorithm    Type  Optimization        Search runtime (ms)  GPU effective bandwidth (GB/s)  Speedup factor
brute-force  SEQ   -O2                 24                   N/A                             -
brute-force  CUDA  None                0.24                 11.9                            100
brute-force  CUDA  Shared memory       0.24                 11.9
brute-force  CUDA  Page-locked memory  0.41                 7.1                             59
QuickSearch  SEQ   -O2                 16                   N/A                             -
QuickSearch  CUDA  None                0.18                 15.87                           88
QuickSearch  CUDA  Shared memory       0.15                 19.77                           106
Horspool     SEQ   -O2                 16                   N/A                             -
Horspool     CUDA  None                0.19                 15.62                           84
Horspool     CUDA  Shared memory       0.16                 18.55                           100

Table 1: Test results for pattern shorter than alphabet size

Algorithm    Type  Optimization   Search runtime (ms)  GPU effective bandwidth (GB/s)  Speedup factor
brute-force  SEQ   -O2            21.2                 N/A                             -
brute-force  CUDA  Shared memory  0.55                 5.35                            38
QuickSearch  SEQ   -O2            17.2                 N/A                             -
QuickSearch  CUDA  Shared memory  0.47                 6.29                            36
Horspool     SEQ   -O2            16.8                 N/A                             -
Horspool     CUDA  Shared memory  0.53                 5.56                            31

Table 2: Test results when pattern is longer than the size of the alphabet

Final Evaluation

The author believes that performance gains would be better if the implementation used asynchronous concurrent execution, since executing multiple kernels concurrently would possibly improve the run times. The author also found that optimizations beyond -O2 for the sequential algorithms did not seem to affect the overall run times.

The author's initial experimentation with page-locked/zero-copy memory was not encouraging, as effective bandwidth lagged significantly on the Linux operating system; the author cannot offer an explanation at this point in time as to why this is the case.

The author hoped to implement a multi-GPU solution, but due to a lack of resources it cannot be pursued in the near future, though the author would get a big kick out of it!

References

• David Kirk and Wen-mei Hwu, Programming Massively Parallel Processors, first edition, 2010.

• AHO, A.V., 1990, Algorithms for finding patterns in strings, in Handbook of Theoretical Computer Science, Volume A, Algorithms and complexity, J. van Leeuwen ed., Chapter 5, pp 255-300, Elsevier, Amsterdam.

• HORSPOOL R.N., 1980, Practical fast searching in strings, Software - Practice & Experience, 10(6):501-506.

• SUNDAY D.M., 1990, A very fast substring search algorithm, Communications of the ACM, 33(8):132-142.

• Quick Search Algorithm from http://www-igm.univ-mlv.fr/~lecroq/string/

• Horspool Algorithm from http://www-igm.univ-mlv.fr/~lecroq/string/

• Brute-force Algorithm from http://www-igm.univ-mlv.fr/~lecroq/string/

• NVIDIA CUDA Programming Guide 3.0

• NVIDIA CUDA Reference Manual 3.0

• NVIDIA CUDA Best Practices Guide
