NGS data analysis on the grid
Barbera van Schaik, b.d.vanschaik@amc.uva.nl
NGS Bioassist meeting, 20-08-2010
Outline
Introduction
Port BWA to the grid
Hardware sequencing
Run  Reads        Size (GB)
---  -----------  ---------
a    388,850,958  82
b    518,902,304  108
c    500,529,852  105
Initial analysis
First analysis
Match sequence reads to the human genome
Generate a SNP list
Visualize the results
Tools:
BWA, Samtools and IGV
http://bio-bwa.sourceforge.net/
http://samtools.sourceforge.net/
http://www.broadinstitute.org/igv/
Data size
Data used in the grid-enabled BWA workflow:
D10: sequencing data in the csFasta format, 25-35 GB
D20: quality files in the .qual format, 50-80 GB
D30: reference DB in the Fasta BS format, 3.2 GB (human genome), 140 MB (one chromosome)
D35: reference BWA index, 4.5 GB (human genome), 240 MB (one chromosome)
D40: sequencing data in the FastQ format (fastq.gz), 20-30 GB
D45: results in .sai format (direct output of BWA), 2-3 GB
D50: results in .sam format, 55-75 GB
D60: results in .bam format, 20-30 GB
Small cluster
Existing hardware
PC
Buy a bigger cluster (centralized model)
Grid computing
Distributed resources
Computing
Data storage
Open protocols
It's about sharing
Resources
Methods
Collaborations
Dutch life science grid (hardware)
grid
http://www.biggrid.nl/
Software at the AMC
http://www.vl-e.nl/vlemed
http://www.bioinformaticslaboratory.nl/
Olabarriaga SD, Glatard T, de Boer PT: A Virtual Laboratory for Medical Image Analysis. IEEE Transactions on Information Technology in Biomedicine, in press
User interface
http://www.vl-e.nl/vbrowser/
SARA
AMC
Workflow pre-processing and split data
Sequence reads:
csFasta + quality values -> solid2fastq.pl -> fastq
fastq -> split (multiples of 4 lines) -> fastq, fastq, fastq, fastq
fastq, fastq, fastq, fastq -> gzip -> fastq.gz
Reference database:
chr9.fa -> bwa index -> chr9.fa.amb, chr9.fa.ann, chr9.fa.bwt, chr9.fa.pac, chr9.fa.rbwt, chr9.fa.rpac, chr9.fa.rsa, chr9.fa.sa
chr9.fa.* -> tar zcvf -> chr9.tar.gz
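The "split (multiples of 4 lines)" step can be sketched in Python as follows. This is a minimal illustration, not the script used in the workflow: it assumes every FASTQ record occupies exactly four lines, and the function name `split_fastq` and its `records_per_chunk` parameter are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List

LINES_PER_RECORD = 4  # header, sequence, '+' separator, quality string


def split_fastq(lines: Iterable[str], records_per_chunk: int) -> Iterator[List[str]]:
    """Yield chunks of lines, each holding at most records_per_chunk whole records."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    it = iter(lines)
    chunk_size = records_per_chunk * LINES_PER_RECORD
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        # Assumption: a well-formed input always ends on a record boundary.
        if len(chunk) % LINES_PER_RECORD != 0:
            raise ValueError("input does not end on a 4-line record boundary")
        yield chunk


if __name__ == "__main__":
    # Three made-up records, split into chunks of two records each.
    demo = ["@r1", "T0120", "+", "!!!!",
            "@r2", "T1231", "+", "!!!!",
            "@r3", "T3210", "+", "!!!!"]
    for index, part in enumerate(split_fastq(demo, records_per_chunk=2)):
        print(index, len(part))
```

Cutting only at multiples of four lines keeps each record intact, so every chunk can be aligned independently on a separate grid node.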
Align reads with BWA
Input: chr9.tar.gz, fastq.gz
tar zxvf -> chr9.fa.*
bwa aln -> result.sai
bwa samse (sai to sam) -> result.sam
samtools view (sam to bam) -> result.bam
samtools sort -> result_sorted.bam
Do this for every split fastq file
Do this for every chromosome
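The per-file alignment chain can be written out as a small Python sketch that builds the command lines for one split fastq file. It rests on several assumptions: `samse` is invoked as a `bwa` subcommand (which is where it lives), the `samtools` calls follow the 0.1.x syntax current in 2010 (newer releases use a different `sort` syntax), and any SOLiD-specific options are left out. The helper name `alignment_commands` and its parameters are hypothetical.

```python
from typing import List


def alignment_commands(ref_prefix: str, fastq: str, out_prefix: str) -> List[str]:
    """Return the shell commands for one split FASTQ file, in execution order."""
    sai = f"{out_prefix}.sai"
    sam = f"{out_prefix}.sam"
    bam = f"{out_prefix}.bam"
    return [
        # Align reads against the indexed reference (produces .sai).
        f"bwa aln {ref_prefix} {fastq} > {sai}",
        # Convert single-end alignments from .sai to .sam.
        f"bwa samse {ref_prefix} {sai} {fastq} > {sam}",
        # Convert .sam to binary .bam.
        f"samtools view -bS {sam} > {bam}",
        # samtools 0.1.x: the last argument is an output prefix, not a file name.
        f"samtools sort {bam} {out_prefix}_sorted",
    ]


if __name__ == "__main__":
    for command in alignment_commands("chr9.fa", "part1.fastq.gz", "result"):
        print(command)
```

Generating the commands as plain strings keeps the sketch free of side effects; a grid job wrapper would run them in order for every split file and every chromosome.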
Merge results and create a SNP list
result_sorted.bam (multiple files) -> samtools merge -> result_sorted_merged.bam
samtools index -> result_sorted_merged.bai
samtools pileup -> SNP-list.pileup
samtools.pl varFilter raw.pileup | awk '$6>=20' > final.pileup -> SNP-list-filtered.pileup
pileup-to-bed.pl SNP-list-filtered.pileup
Local server
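The `awk '$6>=20'` stage keeps only pileup lines whose sixth column is at least 20. A rough Python equivalent is sketched below; the function name `filter_snp_quality` is hypothetical, and lines with fewer than six fields or a non-numeric sixth field are simply skipped, which may differ from awk's behaviour in edge cases.

```python
from typing import Iterable, Iterator


def filter_snp_quality(lines: Iterable[str], min_quality: float = 20) -> Iterator[str]:
    """Yield lines whose sixth whitespace-separated field is >= min_quality."""
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # too short to carry a sixth column
        try:
            quality = float(fields[5])
        except ValueError:
            continue  # non-numeric field; skipped in this sketch
        if quality >= min_quality:
            yield line


if __name__ == "__main__":
    rows = [
        "chr9 869 c G 36 36 25 4 GGG, SF]!",   # row taken from the SNP list slide
        "chrX 1 a G 10 12 30 5 .,G, ]]]]",     # made-up row with sixth column < 20
    ]
    for kept in filter_snp_quality(rows):
        print(kept)
```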
Integrative Genomics Viewer (IGV)
SNP list
Column  Definition
------  --------------------------------------------------------
1       Chromosome
2       Position (1-based)
3       Reference base at that position
4       Consensus bases
5       Consensus quality
6       SNP quality
7       Maximum mapping quality
8       Coverage (# reads aligning over that position)
9       Bases within reads where (see Galaxy wiki for more info)
10      Quality values (phred33 scale, see Galaxy wiki for more)

chr9 49 * */+C 47 47 36 14 * +C 10 3 1 1 0
chr9 152 * */+A 530 530 33 40 * +A 21 19 0 0 0
chr9 190 * */-t 1037 1037 36 78 * -t 47 31 0 0 0
chr9 274 * */-c 521 521 30 67 * -c 50 17 0 0 0
chr9 340 * +A/+A 13 59 35 5 +A * 3 2 0 0 0
chr9 362 c Y 39 40 32 8 .,+1t,+1a,+1atgtt :]J5F/LA
chr9 469 g S 52 52 36 11 .CCCC....^F.^F. ]]]Y]]]]]][
chr9 576 c S 27 27 35 33 .$.,......,...Gg.gg......,g...,,,^F, ][]]]X]RY]]]]]]R]]]]]]]]]]Z]]LNO]
chr9 712 a R 59 59 36 24 ,....g.G.G....,..g,,...G ]]W]]]]]]]W]]V]]UH]U]]]]
chr9 869 c G 36 36 25 4 GGG, SF]!
chr9 1508 c S 34 34 34 24 ,$,,.,,,..,g,GG,,..GG.,.. ]]]+Y[]SHI\X]]]Z]Z]]]W]]
chr9 1547 t Y 157 157 33 32 ..CCCCCC...,,..,...cA,cC.CC,cC., BGYWOSTT\O]K/T]M]T]L!DB]]]]!8]]Q
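To make the column definitions concrete, here is a small Python sketch that reads the first eight columns of a pileup row into named fields. The class and function names are hypothetical. Indel rows (reference base `*`) carry more trailing columns than SNP rows, so everything from column 9 onward is kept as an unparsed list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PileupRow:
    chrom: str
    position: int              # 1-based
    ref_base: str
    consensus: str
    consensus_quality: int
    snp_quality: int
    max_mapping_quality: int
    coverage: int
    extra: List[str]           # columns 9 onward, left unparsed


def parse_pileup_row(line: str) -> PileupRow:
    """Parse one whitespace-separated pileup line into a PileupRow."""
    fields = line.split()
    if len(fields) < 8:
        raise ValueError("expected at least 8 columns")
    return PileupRow(
        chrom=fields[0],
        position=int(fields[1]),
        ref_base=fields[2],
        consensus=fields[3],
        consensus_quality=int(fields[4]),
        snp_quality=int(fields[5]),
        max_mapping_quality=int(fields[6]),
        coverage=int(fields[7]),
        extra=fields[8:],
    )


if __name__ == "__main__":
    # Row copied from the SNP list shown on the slide.
    print(parse_pileup_row("chr9 869 c G 36 36 25 4 GGG, SF]!"))
```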
VarScan
Chrom  Position  Ref  Var  Reads1  Reads2  VarFreq  Strands1  Strands2  Qual1  Qual2  Pvalue  MapQual1  MapQual2  Reads1Plus  Reads1Minus  Reads2Plus  Reads2Minus
-----  --------  ---  ---  ------  ------  -------  --------  --------  -----  -----  ------  --------  --------  ----------  -----------  ----------  -----------
chr10  83042     G    A    29      2       6.45%    2         2         48     60     0.98    1         1         17          12           1           1
chr10  83161     G    A    36      11      23.40%   2         2         57     58     0.98    1         1         16          20           3           8
chr10  83763     T    C    33      4       10.81%   2         2         57     53     0.98    1         1         13          20           2           2
chr10  83816     C    T    14      4       22.22%   2         1         52     60     0.98    1         1         7           7            4           0
chr10  84005     G    A    22      3       12.00%   2         2         58     60     0.98    1         1         12          10           1           2
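The VarFreq values in this table are consistent with Reads2 / (Reads1 + Reads2), for example 2 / (29 + 2) = 6.45%. The slide does not state the formula, so the Python sketch below is an inference from the numbers, assuming Reads1 counts reads supporting the reference base and Reads2 counts reads supporting the variant. The function name `var_freq` is hypothetical.

```python
def var_freq(reads1: int, reads2: int) -> str:
    """Variant frequency as a percentage string with two decimals."""
    total = reads1 + reads2
    if total == 0:
        raise ValueError("no reads at this position")
    return f"{100.0 * reads2 / total:.2f}%"


if __name__ == "__main__":
    # (Reads1, Reads2) pairs copied from the table rows.
    for reads1, reads2 in [(29, 2), (36, 11), (33, 4), (14, 4), (22, 3)]:
        print(reads1, reads2, var_freq(reads1, reads2))
```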
Port BWA to grid
Simple shell script to run BWA
Create GASW description (component description)
Create workflow description (Perl script or with Taverna)
Copy all binaries, Perl scripts, the GASW description and the workflow description to the grid
Copy test-dataset to grid
Test workflow
Execute workflow on real datasets
BWA on grid – user interface
lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/bwa/Scufl/BWAparam.scufl
BWA on grid – component description
BWA on grid – workflow description
BWA on grid – workflow design in Taverna
BWA on grid – monitor jobs
Overview BWA workflow
Mark Santcroos
Implementation workflows
SNP analysis
Compare SNPs with known SNPs/mutations
Prioritize SNPs
Quality control
Basic experiment statistics
Quality of reads
Check exon coverage
Quality of dbSNP
Mutation consequences (silent, aa change)
SNP conservation
Optimization / IT
Optimal input file size
Split and merge
Security of sequence data
Copy large datasets to grid storage
Data storage
Lisa cluster
Cloud computing
Other workflow systems