GPGPU in NGS Bioinformatics
A massively parallel solution to a massively parallel problem
HPC4NGS Workshop, May 21-22, 2012
Principe Felipe Research Center, Valencia
Brian Lam, PhD
Talk Outline
Next-generation sequencing and current challenges
What is GPGPU computing?
The use of GPGPU in NGS bioinformatics
How it works – the BarraCUDA project
System requirements for GPU computing
What if I want to develop my own GPU code?
What to look out for in the near future?
Conclusions
DNA sequencing
“DNA sequencing includes several methods and technologies that are used for determining the order of the nucleotide bases A, G, C, T in a molecule of DNA.” - Wikipedia
Next-generation DNA sequencing
Commercialised in 2005 by 454 Life Sciences
Sequencing in a massively parallel fashion
How it works
An example: Whole genome sequencing
Extract genomic DNA from blood/mouth swabs
Break into small DNA fragments of 200-400 bp
Attach DNA fragments to a surface (flow cells/slides/microtitre plates) at a high density
Perform concurrent “cyclic sequencing reaction” to obtain the sequence of each of the attached DNA fragments
An Illumina HiSeq 2000 can interrogate 825K spots/mm²
Capturing the sequencing signals
[Figure: sequencing signals captured and read out as short sequences, e.g. GTCCTGA, TATTTTT, ATTCNGG. Not to scale.]
What do we get from the sequencers?
Billions of short DNA sequences, also called sequence reads, ranging from 25 to 400 bp
The throughput of NGS has increased dramatically
Current bioinformatics pipeline
Image Analysis, Base Calling → Sequence Alignment → Variant Calling, Peak Calling
Just how complex is the pipeline?
Source: Genologics
Many-core computing architectures
A physical die that contains a ‘large number’ of processing cores
i.e. computation can be done in a massively parallel manner
Modern graphics cards (GPUs) consist of hundreds to thousands of computing cores
Just how fast are today’s GPUs?
If GPUs are so fast why do we need CPUs?
GPUs are fast, but there is a catch:
SIMD – Single instruction, multiple data
vs
CPUs are powerful multi-purpose processors:
MIMD – Multiple instructions, multiple data
CPU vs GPU
SIMD – pros and cons
The very same (low-level) instruction is applied to multiple data at the same time
e.g. a GTX680 can perform an addition on 1536 data points at a time, versus 16 on a 16-core CPU
Branching results in serialisation
The ALUs on a GPU are usually much more primitive than their CPU counterparts
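Why branching serialises can be shown with a small CPU-side simulation. This is only a sketch in Python, not GPU code: the function name `simd_branch_passes` and the four-lane group are assumptions made for illustration. It counts how many passes a group of lanes sharing one instruction stream needs to get through an if/else.

```python
def simd_branch_passes(values, predicate):
    """Simulate one SIMD group running `if predicate(x): A else: B` in lockstep.

    All lanes share one instruction stream, so when lanes disagree the group
    runs branch A with some lanes masked off, then branch B with the others
    masked off. Returns the number of branch passes executed.
    """
    mask = [predicate(v) for v in values]
    passes = 0
    if any(mask):        # at least one lane takes branch A
        passes += 1
    if not all(mask):    # at least one lane takes branch B
        passes += 1
    return passes


def is_even(v):
    return v % 2 == 0


uniform = [2, 4, 6, 8]      # every lane takes the same branch
divergent = [1, 2, 3, 4]    # lanes disagree, so both branches run in turn

print(simd_branch_passes(uniform, is_even))    # 1
print(simd_branch_passes(divergent, is_even))  # 2
```

When the lanes agree, the branch costs one pass. When they disagree, the two sides run one after the other, which is the serialisation mentioned above.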
GPU in scientific computing
Scientific computing often deals with large amounts of data and, on many occasions, applies the same set of instructions to these data. Examples:
▪ Monte Carlo simulations
▪ Image analysis
▪ Next-generation sequencing data analysis
Why bother?
Low capital cost and energy efficient
Dell 12-core workstation: £5,000, ~1kW
Dell 40-core computing cluster: £20,000+, ~6kW
NVIDIA GeForce GTX680 (1536 cores): £400, <0.2kW
NVIDIA C2070 (448 cores): £1000, 0.2kW
Many supercomputers also contain multiple GPU nodes for parallel computations
GPGPU in bioinformatics
Examples:
CUDASW++ 6.3x
MUMmerGPU 3.5x
GPU-HMMer 60-100x
A few GPGPU programs are available
Ion Torrent server (T7500 workstation) uses GPUs for base-calling
MUMmerGPU – comparisons among genomes
BarraCUDA, SOAP3 – short read alignments
Heterogeneous computing
Current bioinformatics pipeline
Image Analysis, Base Calling → Sequence Alignment → Variant Calling, Peak Calling
Sequence alignment
Sequence alignment is a crucial step in the bioinformatics pipeline for downstream analyses
This step often takes many CPU hours to perform
Usually done on HPC clusters
The BarraCUDA Project – an example of GPGPU computing
The main objective of the BarraCUDA project is to develop software that runs on GPU/many-core architectures
i.e. to map sequence reads in parallel, the same way as they come out of the NGS instrument
Software Pipeline
[Figure: data flow between CPU and GPU]
CPU (inputs: genome, read library): copy genome to GPU; copy read library to GPU
GPU: many alignments run concurrently
CPU: copy alignment results to CPU; write alignment results to disk
Burrows-Wheeler transform
Originally intended for data compression; performs a reversible transformation of a string
In 2000, Ferragina and Manzini introduced a BWT-based index data structure for fast substring matching in O(n) time, where n is the length of the query
Sub-string matching is performed in a tree traversal-like manner
Used in major sequencing read mapping programs, e.g. BWA, Bowtie, SOAP2
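To make the "reversible transformation" concrete, here is a minimal Python sketch of the transform and its inverse. It is a naive illustration that sorts all rotations, not how BWA or BarraCUDA build their indexes. The function names and the use of '$' as an end marker (assumed absent from the input and smaller than every other character) are assumptions.

```python
def bwt(text):
    """Return the Burrows-Wheeler transform of text, using '$' as end marker."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(row[-1] for row in rotations)


def inverse_bwt(transformed):
    """Rebuild the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    # The row that ends with the marker is the original text plus '$'.
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]


encoded = bwt("banana")
print(encoded)               # annb$aa
print(inverse_bwt(encoded))  # banana
```

The transform groups identical characters together, which is what made it attractive for compression, and the round trip shows that no information is lost.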
How it works – a backward search algorithm
matching substring ‘banan’
In mathematical terms
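$\underline{R}(aW) = C(a) + O(a, \underline{R}(W) - 1) + 1$, $\overline{R}(aW) = C(a) + O(a, \overline{R}(W))$
where C(a) is the number of symbols in the reference that are lexicographically smaller than a, O(a, i) is the number of occurrences of a in B[0, i], and W occurs in the reference if and only if $\underline{R}(W) \le \overline{R}(W)$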
Modified from Li & Durbin Bioinformatics 2009, 25:14, 1754-1760
Programming code
BWT_exactmatch(READ, i, k, l) {
    // RESULTS: the interval [k, l] of BWT rows that match the read
    if (i < 0) then return RESULTS;
    k = C(READ[i]) + O(READ[i], k-1) + 1;
    l = C(READ[i]) + O(READ[i], l);
    if (k <= l) then BWT_exactmatch(READ, i-1, k, l);
}

main() {
    Calculate reverse BWT string B from reference string X
    Calculate arrays C(.) and O(.,.) from B
    Load READS from disk
    For every READ in READS do {
        i = |READ| - 1;   // Position (last base of the read)
        k = 0;            // Lower bound
        l = |X|;          // Upper bound
        BWT_exactmatch(READ, i, k, l);
    }
    Write RESULTS to disk
}
Modified from Li & Durbin Bioinformatics 2009, 25:14, 1754-1760
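The backward search above can be tried out on a toy reference. This is a minimal Python sketch under stated assumptions: the index is built naively by sorting suffixes, the helper names `build_index` and `exact_match` are made up for this example, and the O table is stored with an offset of one so that O(a, -1) = 0 is available at index 0. It is not the BWA or BarraCUDA implementation.

```python
def build_index(reference):
    """Build the suffix array, the C table and the O table for `reference`."""
    s = reference + "$"                            # '$' marks the end of the text
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    b = [s[i - 1] for i in suffix_array]           # BWT: char preceding each suffix
    alphabet = sorted(set(reference))
    # C[a]: number of symbols in the reference that are smaller than a
    c = {a: sum(ch < a for ch in reference) for a in alphabet}
    # O[a][i + 1]: occurrences of a in B[0..i]; O[a][0] stands in for O(a, -1)
    o = {a: [0] for a in alphabet}
    for ch in b:
        for a in alphabet:
            o[a].append(o[a][-1] + (ch == a))
    return suffix_array, c, o


def exact_match(read, index):
    """Return the sorted start positions of `read` in the reference."""
    suffix_array, c, o = index
    k, l = 0, len(suffix_array) - 1                # interval for the empty string
    for a in reversed(read):                       # backward search: last base first
        if a not in c:
            return []
        k = c[a] + o[a][k] + 1                     # O(a, k-1) lives at index k
        l = c[a] + o[a][l + 1]                     # O(a, l) lives at index l+1
        if k > l:                                  # interval is empty: no match
            return []
    return sorted(suffix_array[i] for i in range(k, l + 1))


index = build_index("banana")
print(exact_match("banan", index))  # [0]
print(exact_match("ana", index))    # [1, 3]
print(exact_match("nab", index))    # []
```

Each base of the read costs one update of k and l, so the work grows with the read length rather than with the size of the reference.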
Porting to GPU
Simple data parallelism
The GPU is used mainly for matching
Porting to GPU
__device__ BWT_GPU_exactmatch(W, i, k, l) {
    if (i < 0) then return RESULTS;
    k = C(W[i]) + O(W[i], k-1) + 1;
    l = C(W[i]) + O(W[i], l);
    if (k <= l) then BWT_GPU_exactmatch(W, i-1, k, l);
}

__global__ GPU_kernel() {
    W = READS[thread_no];   // one read per GPU thread
    i = |W| - 1;            // Position (last base of the read)
    k = 0;                  // Lower bound
    l = |X|;                // Upper bound
    BWT_GPU_exactmatch(W, i, k, l);
}

main() {
    Calculate reverse BWT string B from reference string X
    Calculate arrays C(.) and O(.,.) from B
    Load READS from disk
    Copy B, C(.) and O(.,.) to GPU
    Copy READS to GPU
    Launch GPU_kernel with <<|READS|>> concurrent threads
    Copy alignment results back from GPU
    Write RESULTS to disk
}
How fast is that?
Very fast indeed: using a Tesla C2050, we can match 25 million 100 bp reads to the BWT in just over 1 min.
But… is this relevant?
Inexact Matching?
matching substring ‘anb’ where ‘b’ is substituted with an ‘a’
Inexact matching requires base substitution within the query substring
Search space complexity = O(9^n)!
Klus et al., BMC Res Notes 2012 5:27
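The ‘anb’ example can be reproduced with a small recursive search that tries every base at every position and charges one substitution whenever the base differs from the read. This Python sketch is a simplification under stated assumptions: it handles substitutions only (no insertions or deletions), the names `build_index` and `inexact_match` are hypothetical, and it is not the BWA or BarraCUDA code.

```python
def build_index(reference):
    """Naively build the suffix array, C table and O table for `reference`."""
    s = reference + "$"
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    b = [s[i - 1] for i in sa]
    alphabet = sorted(set(reference))
    c = {a: sum(ch < a for ch in reference) for a in alphabet}
    o = {a: [0] for a in alphabet}                 # o[a][i + 1] == O(a, i)
    for ch in b:
        for a in alphabet:
            o[a].append(o[a][-1] + (ch == a))
    return sa, c, o


def inexact_match(read, index, max_subs):
    """Return start positions matching `read` with at most `max_subs` substitutions."""
    sa, c, o = index
    hits = set()

    def recurse(i, k, l, subs_left):
        if i < 0:                                  # whole read consumed: record hits
            hits.update(sa[j] for j in range(k, l + 1))
            return
        for a in c:                                # try every base at position i
            cost = 0 if a == read[i] else 1
            if cost > subs_left:
                continue
            new_k = c[a] + o[a][k] + 1
            new_l = c[a] + o[a][l + 1]
            if new_k <= new_l:                     # prune branches with no match
                recurse(i - 1, new_k, new_l, subs_left - cost)

    recurse(len(read) - 1, 0, len(sa) - 1, max_subs)
    return sorted(hits)


index = build_index("banana")
print(inexact_match("anb", index, 0))  # []
print(inexact_match("anb", index, 1))  # [1, 3]
```

With no substitutions allowed the search follows a single path. Each allowed difference multiplies the number of branches to explore, which is why the search space grows so quickly.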
BarraCUDA started its life as a GPU version of BWA
It partially worked!
10% faster than 8x X5472 cores @ 3GHz
BWA uses a greedy breadth-first search approach (takes up to 40MB per thread)
Not enough workspace for thousands of concurrent kernel threads (@ 4KB) – i.e. reduced accuracy – NOT GOOD ENOUGH!
BarraCUDA uses a depth-first search approach
[Figure: depth-first traversal of the search tree, with hits marked]
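The memory argument for depth-first search can be illustrated with a toy model. The Python sketch below walks the same complete search tree with a queue (breadth-first) and with a stack (depth-first) and records the largest number of pending states. The depth and branching factor are hypothetical, real aligners prune the tree heavily, and the function name is an assumption.

```python
from collections import deque


def peak_pending_states(depth, branching, breadth_first):
    """Traverse a complete search tree and return the peak size of the work list."""
    pending = deque([0])          # each entry is just the level of a node
    peak = 1
    while pending:
        # Queue order gives breadth-first, stack order gives depth-first.
        level = pending.popleft() if breadth_first else pending.pop()
        if level < depth:
            pending.extend([level + 1] * branching)
        peak = max(peak, len(pending))
    return peak


print(peak_pending_states(3, 4, breadth_first=True))   # 64
print(peak_pending_states(3, 4, breadth_first=False))  # 10
```

In this model the breadth-first work list grows exponentially with depth, while the depth-first one grows linearly, which is why a depth-first search is easier to fit into a few kilobytes of workspace per GPU thread.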
SIMD – the catch
The very same (low-level) instruction is applied to multiple data at the same time
e.g. a GTX680 can perform an addition on 1536 data points at a time, versus 16 on a 16-core CPU
Branching results in serialisation
The ALUs on a GPU are usually much more primitive than their CPU counterparts
Branch serialisation
Branch divergence
Multi-kernel design
[Figure: search split across kernels A and B, with hits marked; CPU thread queue: Thread 1.1, Thread 1.2, Thread 1]
Mapping accuracy
           BWA 0.5.8   BarraCUDA
Mapping    89.95%      91.42%
Error      0.06%       0.05%
Klus et al., BMC Res Notes 2012 5:27
Mapping speed
[Figure: bar chart, y-axis "Time Taken (min)" from 0 to 160. Bar labels in extraction order: 5160 (1C), X5670 (1C), 5160 (2C), 5160 (4Cs), X5670 (6Cs), M2090. Bar values in extraction order: 153, 105, 86, 52, 21, 20.]
Klus et al., BMC Res Notes 2012 5:27
Scalability
Klus et al., BMC Res Notes 2012 5:27
System requirements
Hardware
The system must have at least one decent GPU
▪ NVIDIA GeForce 210 will not work!
One or more PCIe x16 slots
A decent power supply with appropriate power connectors
▪ 550W for one Tesla C2075, and +225W for each additional card
▪ Don’t use any Molex converters!
Ideally, dedicate cards for computation and use a separate cheap card for display
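The power supply rule of thumb above is simple arithmetic. As a small Python sketch, using the slide's figures for a Tesla C2075 as defaults (the function name is an assumption, and other cards will need different numbers):

```python
def min_psu_watts(num_cards, first_card_watts=550, extra_card_watts=225):
    """Rule-of-thumb PSU rating: 550W for one card, +225W per additional card."""
    if num_cards < 1:
        raise ValueError("at least one card is required")
    return first_card_watts + extra_card_watts * (num_cards - 1)


for n in (1, 2, 4):
    print(n, "card(s):", min_psu_watts(n), "W")
# 1 card(s): 550 W
# 2 card(s): 775 W
# 4 card(s): 1225 W
```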
System requirements
Software
CUDA
▪ CUDA toolkit
▪ Appropriate NVIDIA drivers
OpenCL
▪ CUDA toolkit (NVIDIA)
▪ AMD APP SDK (AMD)
▪ Appropriate AMD/NVIDIA drivers
System requirements
For example: CUDA toolkit v4.2
DEMO
Coding effort
In the end, how much effort have we put in so far?
3300 lines of new code for the alignment core
▪ Compared to 1000 lines in BWA
▪ Still ongoing!
1000 lines for SA to linear space conversion
CUDA vs OpenCL
CUDA (in general) is faster and more complete
Tied to NVIDIA hardware
OpenCL
Still new; AMD and Apple are pushing hard on this
Can run on a variety of hardware, including CPUs
Where to start
NVIDIA developer zone: http://developer.nvidia.com/
OpenCL developer zone: http://developer.amd.com/zones/OpenCLZone/Pages/default.aspx
Things to consider while coding
Different hardware has different capabilities
GT200 is very different from GF100 in terms of cache arrangement and concurrent kernel execution
GF100 is also different from GK110, where the latter can do dynamic parallelism
Low-level data management is often required, e.g. earmarking data for memory type, coalescing memory access
OpenACC directives
The easy way out!
A set of compiler directives
It allows programmers to use an ‘accelerator’ without explicitly writing GPU code
Supported by CAPS, Cray, NVIDIA, PGI
How it works
Source: NVIDIA
More information?
http://www.openacc-standard.org/
Future Outlook
The game is changing quickly
CPUs are also getting more cores: AMD 16-core Opteron 6200 series
Many-core platforms are evolving
Intel MIC processors (50 x86 cores)
NVIDIA Kepler platform (1536 cores)
AMD Radeon 7900 series (2048 cores)
SIMD nodes are becoming more common in supercomputers
OpenCL
Conclusions
Few GPGPU programs are available for NGS bioinformatics yet, with more to come
BarraCUDA is one of the first attempts to accelerate the NGS bioinformatics pipeline
A significant amount of coding is usually required, but more programming tools are becoming available
Many-core is still evolving rapidly and things can be very different in the next 2-3 years
Acknowledgements
IMS-MRL, Cambridge: Giles Yeo, Petr Klus, Simon Lam
NIHR-CBRC, Cambridge: Ian McFarlane, Cassie Ragnauth
Whittle Lab, Cambridge: Graham Pullan, Tobias Brandvik
Microbiology, University College Cork: Dag Lyberg
Gurdon Institute, Cambridge: Nicole Cheung
HPCS, Cambridge: Stuart Rankin
NVIDIA Corporation: Thomas Bradley, Timothy Lanfear
Thank you.