Introduction to SeqAn, an Open-Source C++ Template Library
DESCRIPTION
SeqAn (www.seqan.de) is an open-source C++ template library (BSD license) that implements many efficient and generic data structures and algorithms for Next-Generation Sequencing (NGS) analysis. It contains gapped k-mer indices, enhanced suffix arrays (ESA), and an FM-index, as well as algorithms for fast and accurate alignment and read mapping. Based on these data types and fast I/O routines, users can easily develop tools that are extremely efficient and easy to maintain. Besides multi-core support, the research team at Freie Universität Berlin has started adding generic support for accelerators such as NVIDIA GPUs. Go through the slides to learn more. For your own bioinformatics development you can try GPUs for free here: www.nvidia.com/GPUTestDrive

TRANSCRIPT
Sign up for FREE GPU Test Drive on remotely hosted clusters
Develop your codes on latest GPUs today
Test Drive NVIDIA GPUs! Experience The Acceleration
www.nvidia.com/GPUTestDrive
Prof. Dr. Knut Reinert Algorithmische Bioinformatik, FB Mathematik und Informatik
Intro to SeqAn An Open-Source C++ template library for biological sequence analysis Knut Reinert, David Weese Freie Universität Berlin Berlin Institute for Computer Science
This talk
Why SeqAn?
SeqAn as SDK
Generic Parallelization
SeqAn concept/content
4 Nvidia Webinar, 22.10.2013
~ 15 years ago...
Data volume and cost: in 2000, the 3 billion base pairs of the human genome were sequenced for about 3 billion US dollars, at a throughput of roughly 100 million bp per day
Sequencing today...
Within roughly ten years sequencing has become about 10 million times cheaper
Illumina HiSeq: 100 billion bp per day
Future of NGS data analysis
Software libraries bridge gap
[Diagram] Computer scientists move from theoretical considerations and algorithm design to prototype implementations; experimentalists need maintainable tools and analysis pipelines; algorithm libraries connect the two. Application areas: RNA-Seq, ChIP-Seq, structural variants, metagenomics abundance, sequence assembly, cancer genomics. Library building blocks: FM-index, suffix arrays, multicore, hardware acceleration, k-mer filters, fast I/O, secondary memory.
SeqAn now
SeqAn and SeqAn-based tools have been cited more than 360 times. Among the citing institutions are (omitting German institutes): Department of Genetics, Harvard Medical School, Boston; European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton; J. Craig Venter Institute, Rockville MD, USA; Department of Molecular Biology, Princeton University; Applied Mathematics Program, Yale University, New Haven; IBM T.J. Watson Research Center, Yorktown Heights; The Ohio State University, Columbus; University of Minnesota; Australian National University, Canberra; Department of Statistics, University of Oxford; Swedish University of Agricultural Sciences (SLU), Uppsala; Graduate School of Life Sciences, University of Cambridge; Broad Institute, Cambridge, USA; EMBL-EBI; University of California; University of Chicago; Iowa State University, Ames; The Pennsylvania State University; Peking University, Beijing; University of Science and Technology of China; BGI-Shenzhen, China; Beijing Institute of Genomics; …
SeqAn is under the BSD license and hence free for academic AND commercial use.
SeqAn developers
[Chart] Number of SeqAn developers per year, 2003-2012 (ranging from 0 to 16), broken down by funding source: External, CSC, BMBF, DFG, IMPRS, FU.
SeqAn main concepts
length(str)
Value<T>::Type
String<Subclass>
void swap(string & str)
{
    char help = str[1];
    str[1] = str[0];
    str[0] = help;
}
template <typename T>
void swap(T & str)
{
    char help = str[1];
    str[1] = str[0];
    str[0] = help;
}
template <typename T>
void swap(String<T> & str)
{
    T help = str[1];
    str[1] = str[0];
    str[0] = help;
}
template <typename T>
void swap(T & str)
{
    typename T::value_type help = str[1];
    str[1] = str[0];
    str[0] = help;
}
template <typename T>
void swap(T & str)
{
    typename Value<T>::Type help = str[1];
    str[1] = str[0];
    str[0] = help;
}
template <typename T>
struct Value
{
    typedef T Type;
};
Metafunction
template <typename T>
struct Value< String<T> >
{
    typedef T Type;
};

template <typename T>
struct Value
{
    typedef T Type;
};
template <typename T>
struct Value< String<T> >
{
    typedef T Type;
};

template <typename T>
struct Value
{
    typedef T Type;
};

template <>
struct Value<char *>
{
    typedef char Type;
};
template <>
struct Value<char *>
{
    typedef char Type;
};

template <typename T>
struct Value< String<T> >
{
    typedef T Type;
};

template <size_t N>
struct Value<char [N]>
{
    typedef char Type;
};
template <typename T>
void swap(T & str)
{
    typename Value<T>::Type help = str[1];
    str[1] = str[0];
    str[0] = help;
}
template <typename T>
void swap(T & str)
{
    typename Value<T>::Type help = value(str, 1);
    value(str, 1) = value(str, 0);
    value(str, 0) = help;
}
template <typename T>
typename Value<T>::Type & value(T & str, int i)
{
    return str[i];
}
Shim Function
template <typename T>
void swap(T & str)
{
    typename Value<T>::Type help = value(str, 1);
    value(str, 1) = value(str, 0);
    value(str, 0) = help;
}
Generic Algorithm
SeqAn Content - SDK
SeqAn SDK Components - Tutorials
SeqAn SDK Components – Reference Manual
SeqAn SDK Components
• CDash/CTest to automatically compile and test across platforms
• Review Board to ensure code quality
• Code coverage reports
SeqAn Content algorithms & data structures
Standard DP algorithms: global & semi-global alignments, local alignments
Modified DP algorithms: split breakpoint detection, banded chain alignment
Unified Alignment Algorithms
For Example ...
Versatile & Extensible DP-Interface
Unified Alignment Algorithms, for example ...

Banded Smith-Waterman with affine gap costs:
DPBand<BandOn>(lowerDiag, upperDiag),
DPProfile<LocalAlignment<>, AffineGaps, TracebackOn<> >

Semi-global Gotoh without traceback:
DPProfile<GlobalAlignment<FreeEndGaps<True, False, True, False> >, AffineGaps, TracebackOff>

Split-breakpoint detection for right anchor:
DPProfile<SplitAlignment<>, AffineGaps, TracebackOn<GapsRight> >

Needleman-Wunsch with traceback:
DPProfile<GlobalAlignment<>, LinearGaps, TracebackOn<> >
Support for Common File Formats
Important file formats for HTS analysis:
• Sequences: FASTA, FASTQ; indexed FASTA (FAI) for random access
• Genomic features: GFF 2, GFF 3, GTF, BED
• Read mapping: SAM, BAM (plus BAM indices)
• Variants: VCF
… or write your own parser
Tutorials and helper routines for writing your own parsers.
SequenceStream ss("file.fa.gz");
while (!atEnd(ss))
{
    readRecord(id, seq, ss);
    cout << id << '\t' << seq << '\n';
}

BamStream bs("file.bam");
while (!atEnd(bs))
{
    readRecord(record, bs);
    cout << record.qName << '\t' << record.pos << '\n';
}
Journaled Sequences
Store multiple genomes, save storage capacity
String<Dna, Journaled<Alloc<> > >

StringSet<TJournaled, Owner<JournalSet> > set;
setGlobalReference(set, refSeq);
appendValue(set, seq1);
join(set, idx, JoinConfig<>());

[Diagram] Genomes G1 ... GN are stored as journals of differences against a common reference Ref.
Fragment Store (Multi) Read Alignments
Read alignments can be easily imported:

std::ifstream file("ex1.sam");
read(file, store, Sam());

... and accessed as a multiple alignment, e.g. for visualization:

AlignedReadLayout layout;
layoutAlignment(layout, store);
printAlignment(svgFile, Raw(), layout, store, 1, 0, 150, 0, 36);
Unified Full-Text Indexing Framework: Available Indices
All indices support multiple strings and external memory construction/usage.
Index<TSeq, IndexEsa<> > Index<StringSet<TSeq>, FMIndex<> >
Suffix trees:
• suffix array
• enhanced suffix array
• lazy suffix tree

Prefix trie:
• FM-index

q-gram indices:
• direct addressing
• open addressing
• gapped
All indices support the (sequential) find interface:
Finder<TIndex> finder(index);
while (find(finder, "TATAA"))
    cout << "Hit at position " << position(finder) << endl;
Index Lookup Interface
SeqAn Performance
Masai read mapper
The algorithm is based on the simultaneous traversal of two string indices (e.g., FM-index, enhanced suffix array, lazy suffix tree)
[Diagram] An index of the reads (radix tree of seeds, e.g. over ACGCTTCATCGCCCT...) is traversed simultaneously with an index of the genome (e.g. FM-index over Chr. 1, Chr. 2, Chr. X).
Masai read mapper
Read Mapping: Masai
Faster and more accurate than BWA and Bowtie 2. Timings on a single core.
Easily exchange the index ...
Collaboration to parallelize indices and verification algorithms in SeqAn, to speed up any applications making use of indices
What about multi-core implementation?
SeqAn going parallel
GOAL: parallelize the finder interface of SeqAn so that it works on the CPU and on accelerators like GPUs
Will be replaced by hg18 and 10 million 20-mers
SeqAn going parallel
Construct FM-index on the reverse genome
Set # OMP threads, call generic count function
SeqAn going parallel: NVIDIA GPUs
SAME count function as on CPU!
Copy needles and index to GPU
SeqAn going parallel
Count occurrences of 10 million 20-mers in the human genome using an FM-index:
• Intel i7, 3.2 GHz, 1 thread: 18.6 sec (1x)
• Intel i7, 3.2 GHz, 12 threads: 2.66 sec (7x)
• Intel Xeon Phi 7120, 244 threads: 2.18 sec (8.5x)
• NVIDIA Tesla K20: 0.4 sec (47x)
SeqAn going parallel
Approx. count occurrences of 1.2 million 33-mers in the human genome using an FM-index:
• Intel i7, 3.2 GHz, 1 thread: 66.1 s (1x)
• Intel i7, 3.2 GHz, 12 threads: 9.0 s (7.3x)
• Intel Xeon Phi 7120, 244 threads: 3.9 s (16.9x)
• NVIDIA Tesla K20: 3.2 s (20.7x)
Part II: The details
Parallelization on the GPU
CUDA preliminaries
In order to use CUDA we first had to adapt some parts of SeqAn:
• CUDA requires each function to be prefixed with the domain qualifiers __host__ or __device__ in order to generate CPU/GPU code
• We prefixed all basic template functions with a SEQAN_HOST_DEVICE macro
• Static const arrays are not allowed in the way SeqAn defines them
• We replaced alphabet conversion lookup tables (e.g. Dna <-> char) by conversion functions
#ifdef __CUDACC__
#define SEQAN_HOST_DEVICE inline __device__ __host__
#else
#define SEQAN_HOST_DEVICE inline
#endif
• Instead of defining a new CUDA string we simply use the Thrust library:
• Provides host_vector and device_vector classes, which are vectors with buffers in host or device memory
• However, Thrust functions are callable only from the host side
• We made both vectors accessible from SeqAn:
• SeqAn strings have to provide a set of global (meta-)functions, e.g. Value<>, resize(), ...
• We simply defined the required wrapper functions for these two vectors
Strings
Standard Strings
• Up to here, all strings can only be used on the side (host or device) that owns their buffer
[Diagram] Host memory: seqan::String and thrust::host_vector, each owning a buffer. Device memory: thrust::device_vector, owning a buffer.
• How to access a device_vector from the device side?
• We could pass (POD) iterators to the kernel
• However, many SeqAn algorithms work on more complex containers
• We need the same interface of the container on the device side
• For strings we developed a so-called ContainerView (POD type)
• Provides a container interface given the begin/end pointers of a vector buffer
• The view() function creates the ContainerView object for a given device_vector
Host-Device String
[Diagram] On the host, view() creates a seqan::ContainerView over the thrust::device_vector buffer; the kernel launch passes it by value to the device, where it provides the same interface.
• How to use a device_vector on the device?
• For generic GPU programming:
• The Device metafunction returns the device-memory equivalent of a class
• The View metafunction returns the (POD) view type of a class
// Replaces String with thrust::device_vector.
template <typename TValue, typename TSpec>
struct Device<String<TValue, TSpec> >
{
    typedef thrust::device_vector<TValue> Type;
};

// Returns a view type that can be passed to a CUDA kernel.
template <typename TValue, typename TAlloc>
struct View<thrust::device_vector<TValue, TAlloc> >
{
    typedef ContainerView<thrust::device_vector<TValue, TAlloc> > Type;
};
Device and View metafunctions
• A simple example to reverse a string on the GPU
// A standard SeqAn string over the Dna alphabet.
String<Dna> myString = "ACGT";

// A Dna string in device global memory.
typename Device<String<Dna> >::Type myDeviceString;

// Copy the string to global memory.
assign(myDeviceString, myString);

// Pass a view of the device string to the CUDA kernel.
myKernel<<<1,1>>>(view(myDeviceString));

// TString is ContainerView<device_vector<Dna> >.
template <typename TString>
__global__ void myKernel(TString string)
{
    printf("length(string) = %d\n", length(string));
    reverse(string);
}
Hello world
• More complex structures (e.g. Index, Graph) can only be ported to the GPU if they ...
• don't use pointers
• use only strings of POD types (String<Dna>, but not String<String<...> >)
• use only 1-dimensional StringSets (ConcatDirect)
• Nested classes are no problem:
• The View metafunction converts all member types into their view types
• The view() function is called recursively on all members
Porting complex data structures
Example: FM Index
The FM-index (BWT, LF-mapping)
The FM-index (search ssi)
a3 = C('i') + Occ('i', 0) + 1 = 1 + 0 + 1
b3 = C('i') + Occ('i', 12)    = 1 + 4
The FM-index (backwards search)
a1 = C('s') + Occ('s', 8) + 1 = 8 + 2 + 1
b1 = C('s') + Occ('s', 10)    = 8 + 4
• The FM-index can be implemented using a number of string-based lookup tables
• ... as well as other indices, e.g. enhanced suffix array, q-gram index
• There is a space-time tradeoff between all these indices
• The FM-index has the minimal memory requirements
The FM-index in SeqAn
• SeqAn's FM-index consists of some nested classes storing Strings
FM-index (host-only)
A generic FM-index
• The Device type of the FM-index uses device_vector instead of String
• The view of this object (= device part) is the same tree, where leaves are replaced by ContainerViews of device_vectors
GPU FM-index (host part)
A generic FM-index
CPU vs. GPU
• Invoking an FM-index based search on CPU and GPU:
// Select the index type.
typedef Index<DnaString, FMIndex<> > TIndex;

// Type is Index<device_vector<Dna>, FMIndex<> >.
typedef typename Device<TIndex>::Type TDeviceIndex;

// ======== On CPU ========

// Create an index.
TIndex index("ACGTTGCAA");

// Use the FM-index on CPU.
findCPU(index, …);

template <typename TIndex>
void findCPU(TIndex & index, …);

// ======== On GPU ========

// Create a device index.
TIndex index("ACGTTGCAA");
TDeviceIndex deviceIndex;
assign(deviceIndex, index);

// Use the FM-index in a CUDA kernel.
findGPU<<<...>>>(view(deviceIndex), …);

template <typename TIndex>
__global__ void findGPU(TIndex index, …);
The findGPU kernel AND the findCPU function will invoke many instances of the SAME generic function, which performs a backtracking algorithm on our generic index interface:
do
{
    if (finder.score == finder.scoreThreshold)
    {
        if (goDown(textIt, suffix(pattern, patternIt))) delegate(finder);
        goUp(textIt);
        if (isRoot(textIt)) break;
    }
    else if (finder.score < finder.scoreThreshold)
    {
        if (atEnd(patternIt)) delegate(finder);
        else if (goDown(textIt))
        {
            finder.score += parentEdgeLabel(textIt) != value(patternIt);
            goNext(patternIt);
            continue;
        }
    }

    do
    {
        goPrevious(patternIt);
        finder.score -= parentEdgeLabel(textIt) != value(patternIt);
    } while (!goRight(textIt) && goUp(textIt));

    if (isRoot(textIt)) break;
    finder.score += parentEdgeLabel(textIt) != value(patternIt);
    goNext(patternIt);
} while (true);
Approximate search via backtracking
Outlook for GPU support
• Our next steps are:
• Provide parallelFor() to hide the CUDA kernel call / OpenMP for-loop
• Develop classes for concurrent access (strings, job queues)
• Port more indices and index iterators to be used with CUDA
• Port SeqAn's alignment module
• Develop a CPU/GPU version of the FM-index based read mapper Masai
• ...
• Follow our development:
• Sources: https://github.com/seqan/seqan/tree/develop
• Code examples: http://trac.seqan.de/wiki/HowTo/DevelopCUDA
Generic Parallelization
Multicore parallelization
struct Serial_;
typedef Tag<Serial_> Serial;

struct Parallel_;
typedef Tag<Parallel_> Parallel;
• We first introduced Tags to switch between serial and parallel algorithms:
template <typename T>
inline T atomicInc(T & x, Serial)
{
    return ++x;
}

template <typename T>
inline T atomicInc(volatile T & x, Parallel)
{
    return __sync_add_and_fetch(&x, 1);
}
• Then we defined the basic atomic operations required for thread safety:
• To this end, we developed the Splitter<TValue, TSpec> to compute a partition into subintervals of (almost) equal length ...
Splitter<unsigned> splitter(10, 20, 3);
for (unsigned i = 0; i < length(splitter); ++i)
    cout << '[' << splitter[i] << ',' << splitter[i+1] << ')' << endl;

// [10,14)
// [14,17)
// [17,20)
Splitter
• The Splitter can also be used with iterators directly
• The Serial / Parallel tag divides an interval range into 1 / #threads many intervals
• The Serial tag can thus be used to switch off the parallel behaviour
template <typename TIter, typename TVal, typename TParallelTag>
inline void arrayFill(TIter begin_, TIter end_,
                      TVal const & value, Tag<TParallelTag> parallelTag)
{
    Splitter<TIter> splitter(begin_, end_, parallelTag);

    SEQAN_OMP_PRAGMA(parallel for)
    for (int job = 0; job < (int)length(splitter); ++job)
        arrayFill(splitter[job], splitter[job + 1], value, Serial());
}
Splitter
Thank you for your attention