Vertebrate Resequencing Informatics 8th December, 2010
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale resequencing
Thomas Keane, Vertebrate Resequencing Informatics, Wellcome Trust Sanger Institute, Cambridge, UK
1000G Update
Total Number of Base Pairs: 23,416GB
Aligned Base Pairs: 13,527GB
Number of Samples: 1103
Samples with > 10GB raw sequence: 1078
Samples with > 10GB aligned sequence: 718
Laura Clarke
1000G update – Raw Sequence Growth
[Figure: raw sequence growth over time by population (CEU, YRI, JPT, TSI, CHB, ASW, LWK, MXL, GBR, CHS, FIN, PUR, CLM, IBS); vertical axis runs from 0 to 25,000]
Laura Clarke
UK10K
Large scale population/medical based sequencing project
UK10K project recently funded by WT
4,000 cohort samples genome wide @ 6x
  Deeply phenotyped TwinsUK and ALSPAC cohorts
6,000 exomes from extreme samples
  Protein coding exons from GENCODE
  Extreme end of traits of medical interest, and from collections of familial cases
  Accumulation of rare variants within genes or pathways
Utilise computational methods, data formats and workflows developed during 1000 genomes project
Data release via EGA under access control
Estimating 100Tbp of raw sequence data
http://www.uk10k.org
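As a rough cross-check of the 100Tbp estimate, the whole-genome arm can be sized from the sample count and depth quoted above. This is only a sketch: it assumes a human genome length of about 3.1 Gbp, which the slide does not state, and it leaves the exome contribution as an unexplained remainder rather than guessing at capture sizes or depths.

```python
# Assumption (not from the slide): haploid human genome of ~3.1 Gbp.
GENOME_BP = 3.1e9


def whole_genome_tbp(samples: int, depth: float, genome_bp: float = GENOME_BP) -> float:
    """Raw sequence in Tbp needed to cover `samples` genomes at `depth`x."""
    return samples * depth * genome_bp / 1e12


wgs = whole_genome_tbp(samples=4000, depth=6)
remainder = 100 - wgs  # what is left of 100 Tbp for the exomes and any overhead
print(f"whole-genome arm: {wgs:.1f} Tbp, remainder of 100 Tbp: {remainder:.1f} Tbp")
```

Under that assumption the 4,000 low-coverage genomes account for most of the estimate.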
1000G BAM File Evolutions
BAM
  Until now BAMs included all raw data
  Recently tag removal
    OQ: original qualities
    Non-standard tags: XM, XG, XO
  Also added BAQ differences to indicate non-confidently aligned bases
Space saving of 30%
  E.g. NA19625: 1.45 vs 0.98 bytes per bp
  Primary gain is from removal of original qualities
Further proposals
  Replace base calls with '=' sign to indicate agreement with reference
  Rejected due to lack of tool support
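The tag removal described above can be illustrated on SAM text records. This is a minimal sketch, not the 1000G production code: it works on tab-separated SAM lines in plain Python (in practice this would be done with a BAM-aware tool), and the example read is invented. The second helper simply reproduces the space-saving arithmetic from the NA19625 figures.

```python
# Tags named on the slide as removed from the 1000G BAMs.
DROPPED_TAGS = {"OQ", "XM", "XG", "XO"}


def strip_tags(sam_line: str, dropped=DROPPED_TAGS) -> str:
    """Return a SAM alignment line without the optional fields in `dropped`."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 SAM columns are mandatory; everything after is TAG:TYPE:VALUE.
    mandatory, optional = fields[:11], fields[11:]
    kept = [field for field in optional if field[:2] not in dropped]
    return "\t".join(mandatory + kept)


def fraction_saved(before: float, after: float) -> float:
    """Relative reduction in bytes per bp."""
    return (before - after) / before


# Toy record (made-up read, for illustration only).
record = "\t".join([
    "read1", "0", "1", "100", "60", "4M", "*", "0", "0",
    "ACGT", "IIII", "RG:Z:lane1", "OQ:Z:HHHH", "XM:i:0", "NM:i:0",
])
print(strip_tags(record))
print(f"{fraction_saved(1.45, 0.98):.0%}")
```

For the quoted 1.45 and 0.98 bytes per bp the reduction works out to roughly 32%, in line with the 30% figure on the slide.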
Population/Transposed BAM
Traditionally BAM files have been produced per sample with all of the lanes/libraries merged
  Lanes -> Library -> Platform -> Sample (1 per individual)
Problem: population based SNP calling needs to be aware of the reads across multiple samples at the same loci
  Problems with opening hundreds/thousands of file handles simultaneously
  Distributed/parallel file systems prefer reading a few large striped files
Solution: Transposed BAMs
  Genome slices with multiple samples within a single BAM
    E.g. entire CEU population
  Header information to separate read groups into samples
  Samtools mpileup, GATK etc. support this functionality
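The transposition idea can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not how the production BAMs are built: reads are reduced to (chromosome, position, read group) tuples, the sample names, positions and read group IDs are illustrative, and a dictionary stands in for the BAM header that maps read groups back to samples.

```python
from collections import defaultdict


def transpose(per_sample_reads, slice_size):
    """Regroup {sample: [(chrom, pos, read_group), ...]} into genome slices.

    Returns (slices, rg_to_sample). `slices` maps (chrom, slice_index) to the
    reads from every sample that fall in that slice, sorted by position.
    `rg_to_sample` plays the role of the header tying read groups to samples.
    """
    slices = defaultdict(list)
    rg_to_sample = {}
    for sample, reads in per_sample_reads.items():
        for chrom, pos, read_group in reads:
            rg_to_sample[read_group] = sample
            slices[(chrom, pos // slice_size)].append((pos, read_group))
    for reads in slices.values():
        reads.sort()
    return dict(slices), rg_to_sample


# Toy input: positions and read group names are invented.
reads = {
    "NA19294": [("1", 150, "rgA"), ("1", 2_500_000, "rgA")],
    "NA18943": [("1", 90, "rgB"), ("2", 10, "rgB")],
}
slices, rg_to_sample = transpose(reads, slice_size=1_000_000)
for key in sorted(slices):
    print(key, [(pos, rg_to_sample[rg]) for pos, rg in slices[key]])
```

A caller working on one slice then opens a single file and recovers per-sample reads through the read group mapping, instead of opening one file per sample.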
Horizontal/Transposed BAM
[Diagram: horizontal BAMs hold one sample each (NA19294, NA18943, NA19305, ...) spanning Chr1, Chr2, ...; transposed BAMs hold one chromosome slice each (Chr1, Chr2, ...) spanning many samples]
Key questions
  Slice size – chromosome? 1Mbp, 10Mbp or 100Mbp?
  Size of individual groupings – 10, 50, 100, 500 individuals?
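One way to explore these questions is to count how many transposed BAM files each combination of slice size and grouping would produce. The sketch below is a rough aid only: the two chromosome lengths are hypothetical rounded values chosen for illustration, and the sample count of 1103 is taken from the 1000G update earlier in the talk.

```python
import math


def file_count(chrom_lengths, n_samples, slice_bp, group_size):
    """Number of transposed BAMs for a given slice size and sample grouping."""
    slices = sum(math.ceil(length / slice_bp) for length in chrom_lengths.values())
    groups = math.ceil(n_samples / group_size)
    return slices * groups


# Hypothetical, rounded chromosome lengths used only for illustration.
toy_genome = {"chr1": 250_000_000, "chr2": 240_000_000}

for slice_bp in (1_000_000, 10_000_000, 100_000_000):
    for group_size in (10, 50, 100, 500):
        n = file_count(toy_genome, n_samples=1103, slice_bp=slice_bp, group_size=group_size)
        print(f"slice={slice_bp:>11,} bp  group={group_size:>3}  files={n}")
```

Small slices and small groupings multiply the file count quickly, which runs against the preference for a few large files noted earlier.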
VCF Format
Fully adopted by 1000G group as interchange format for variant calls
  SNPs, indels, and recently SVs
  Genotyping calls for all samples
  Annotation of variants via user-defined tags
VCF APIs and tools via http://vcftools.sourceforge.net
Scaling issues with VCF – BCF format in development
Petr Danecek
Vertebrate Resequencing Informatics 8th December, 2010
VCF (useful) Bloat
Every release of 1000G adds more tags to VCF files
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
UK10K propose rich annotation of VCF files
  Known SNPs/indels
    RS IDs, G1K unaccessioned SNPs
  Geographical information
  Ensembl annotation (coding, exonic, intronic, UTR, splice..)
    microRNA, eQTL, known disease loci
  Coding consequences
    Synonymous/non-synonymous, splice, stop, GERP score
  Functional interpretation
    PolyPhen, SIFT, PANTHER
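Each annotation added to a VCF is declared in the header with an `##INFO` line like the ones listed above, so tools can discover which tags a file carries. The following is a minimal sketch of reading those declarations in plain Python; it is not part of VCFtools, and the function name `parse_info_header` is hypothetical.

```python
import re

# Matches key=value pairs inside ##INFO=<...>, keeping quoted values whole
# so that commas inside a Description do not split the field.
_PAIR = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def parse_info_header(line: str) -> dict:
    """Parse one '##INFO=<...>' meta line into a dict of its fields."""
    prefix, suffix = "##INFO=<", ">"
    line = line.strip()
    if not (line.startswith(prefix) and line.endswith(suffix)):
        raise ValueError("not an INFO meta line")
    body = line[len(prefix):-len(suffix)]
    return {key: value.strip('"') for key, value in _PAIR.findall(body)}


header = [
    '##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, '
    'for each ALT allele, in the same order as listed">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
]
for line in header:
    info = parse_info_header(line)
    print(info["ID"], info["Number"], info["Type"], "-", info["Description"])
```

A richer annotation scheme such as the one proposed for UK10K would add one such declaration per new tag.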
Storage Challenges
Storage
  Try to reduce the proportion of raw data we keep (e.g. images, OQ in BAM, remove base calls in BAM etc.)
  However there’s still a LOT of data to store and analyse!
Estimation for our group based on ~200Tbp of sequencing data over next 2-3 years
  1.5 Pbytes
  Permanent: Lane alignments, transposed BAMs, horizontal BAMs, bi-monthly releases, backup of lane BAMs, Variant calls
  Transient: Library BAMs, Local assemblies
Storage type optimality criteria
  Cost per Tbyte
  Proximity to compute resources
  Scalability – room for expansion/future proofing
  I/O throughput
  Disaster recovery
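The 1.5 Pbyte figure can be related back to the per-base BAM size quoted earlier in the talk (0.98 bytes per bp after tag removal). The sketch below is only back-of-envelope arithmetic: it assumes decimal units, and treating the whole budget as "BAM-sized copies" is an interpretation, not something the slides state.

```python
def implied_bytes_per_bp(total_bytes: float, total_bp: float) -> float:
    """Storage budget per sequenced base pair, summed over every copy kept."""
    return total_bytes / total_bp


def copies_equivalent(budget_per_bp: float, bam_bytes_per_bp: float) -> float:
    """How many BAM-sized copies of each base the budget would allow."""
    return budget_per_bp / bam_bytes_per_bp


PB, TBP = 1e15, 1e12  # decimal units assumed
budget = implied_bytes_per_bp(1.5 * PB, 200 * TBP)
print(f"{budget:.1f} bytes per bp overall")
print(f"{copies_equivalent(budget, 0.98):.1f} copies at 0.98 bytes/bp")
```

An overall budget of several bytes per base is plausible given that the same reads appear in lane BAMs, their backups, horizontal and transposed BAMs, and successive releases.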
A Tiered Solution
3 tiered storage model
  Trade off cost, quantity, I/O throughput
  Similar to caching strategies in computer design
Level 0: Local disk, closest proximity to CPU, intermediate temp files e.g. local assemblies, reference files
Level 1: High-performance, highly parallel, close proximity to compute, expensive, suitable for high I/O tasks
Level 2: Mid-tier storage, some type of NFS technology, discrete units with some local compute, suitable for low I/O tasks that are compute intensive, scalable by adding more discrete units
Level 3: High latency storage, warehouse storage, not suitable to compute against, occasional access e.g. old data releases
  (Level 3a: Off-site replication of data in level 3)
A Tiered Solution
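[Diagram: CPU farm linked to the storage tiers, with link speeds labelled 3Gb/sec and 800Mb/sec, and relative Cost (2, 1, 1) and Size (1, 2, 2) labels]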
Level 1: High performance
  Data: Current release horizontal + transposed BAMs
  Processes: BAM merging + splitting, Variant calling (SNPs, indels, SVs)
Level 2: Middle tier/nfs
  Data: Lane level BAMs
  Processes: Alignment, recalibration, local realignment
Level 3: Backup/warehouse
  Data: Old release BAMs + variant calls backup
Level 3a: Off-site replication
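The placement of data and processes on tiers can be written down as a small lookup table, which is one way a pipeline could decide where a job should read and write. This is a hypothetical sketch: the `Tier` class and `tier_for_process` helper are invented for illustration, and only the tier contents come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    level: int
    description: str
    data: tuple
    processes: tuple = ()


# Tier contents as listed on the slide; the structure itself is an assumption.
TIERS = (
    Tier(1, "High performance",
         data=("current release horizontal BAMs", "current release transposed BAMs"),
         processes=("BAM merging", "BAM splitting", "variant calling")),
    Tier(2, "Middle tier/NFS",
         data=("lane level BAMs",),
         processes=("alignment", "recalibration", "local realignment")),
    Tier(3, "Backup/warehouse",
         data=("old release BAMs", "variant calls backup")),
)


def tier_for_process(name: str) -> Tier:
    """Return the tier whose storage a given process should run against."""
    for tier in TIERS:
        if name in tier.processes:
            return tier
    raise KeyError(f"no tier hosts process {name!r}")


print(tier_for_process("recalibration").level)
```

Level 3 deliberately hosts no processes, matching the point that warehouse storage is not suitable to compute against.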
Compute Challenges
Compute
  New algorithms continually developed for more accurate variant calling
  In 2010 several new processes were added into the production pipeline
    BAM Improvement
      Local realignment around indels to correct mapping biases (e.g. GATK)
      Adding BAQ differences up front
    Indel calling by local assembly/alternative haplotype analysis (e.g. dindel)
    Local reassembly of SV breakpoints
Easy to estimate runtime for known processes (e.g. mapping, recalibration, duplicate removal)
  Challenge to estimate runtime for next 2-3 years for new algorithms
  E.g. more use of assembly methods – more complex references?
I/O has become a significant bottleneck and is the most difficult thing to measure
  All computations need to minimise I/O
    E.g. transforming BAM files to different sort orders
Project Data Release
Do we need to release BAMs?
  Large scale human phenotype driven sequencing projects going forward
  Participants are more interested in the variants than the raw data
  BAM files may contain too much data and be too large to ship around amongst project members
UK10K proposals
  Lane BAM files submitted to the archives
  BAM files not released via project ftp
  Project data release consists solely of annotated VCF files
  Raw data can be obtained from the archives