1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale resequencing

Vertebrate Resequencing Informatics 8th December, 2010

Thomas Keane, Vertebrate Resequencing Informatics, Wellcome Trust Sanger Institute, Cambridge, UK


1000G Update

Total number of base pairs: 23,416GB
Aligned base pairs: 13,527GB
Number of samples: 1103
Samples with > 10GB raw sequence: 1078
Samples with > 10GB aligned sequence: 718

Laura Clarke


1000G update – Raw Sequence Growth

[Figure: raw sequence growth over time, one series per population (CEU, YRI, JPT, TSI, CHB, ASW, LWK, MXL, GBR, CHS, FIN, PUR, CLM, IBS). Y-axis ticks run from 0 to 25000; x-axis tick labels as extracted run monthly from 12/17/13 to 10/17/14.]

Laura Clarke


UK10K

Large scale population/medical based sequencing project
UK10K project recently funded by WT
- 4,000 cohort samples genome wide @ 6x
  - Deeply phenotyped TwinsUK and ALSPAC cohorts
- 6,000 exomes from extreme samples
  - Protein coding exons from GenCode
  - Extreme end of traits of medical interest, and from collections of familial cases
  - Accumulation of rare variants within genes or pathways
- Utilise computational methods, data formats and workflows developed during 1000 genomes project
- Data release via EGA under access control
- Estimating 100Tbp of raw sequence data
- http://www.uk10k.org
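
As a rough sanity check on the 100Tbp estimate, the whole-genome part of the project can be worked out from the figures on this slide. This is only a back-of-envelope sketch: the genome length of about 3.1 Gbp is an assumption, not a figure from the slides, and the exome share is simply whatever is left over.

```python
# Back-of-envelope estimate of raw sequence volume for the whole-genome cohort.
# ASSUMPTION: haploid human genome length of ~3.1 Gbp (not stated on the slide).
GENOME_BP = 3.1e9
COHORT_SAMPLES = 4_000       # from the slide
COHORT_DEPTH = 6             # 6x coverage, from the slide
EXOME_SAMPLES = 6_000        # from the slide
TOTAL_ESTIMATE_BP = 100e12   # 100 Tbp, from the slide


def cohort_bp(samples: int, depth: float, genome_bp: float = GENOME_BP) -> float:
    """Expected raw base pairs for `samples` genomes at `depth`-fold coverage."""
    return samples * depth * genome_bp


wgs_bp = cohort_bp(COHORT_SAMPLES, COHORT_DEPTH)
remaining_bp = TOTAL_ESTIMATE_BP - wgs_bp

print(f"Whole-genome cohort: {wgs_bp / 1e12:.1f} Tbp")
print(f"Left for exomes: {remaining_bp / 1e12:.1f} Tbp "
      f"({remaining_bp / EXOME_SAMPLES / 1e9:.1f} Gbp per exome)")
```

Under that assumption the low-coverage genomes account for roughly three quarters of the estimate, with the remainder available for the exomes.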


1000G BAM File Evolutions

BAM
- Until now BAMs included all raw data
- Recently tag removal
  - OQ: original qualities
  - Non-standard tags: XM, XG, XO
- Also added BAQ differences to indicate non-confidently aligned bases
- Space saving of 30%
  - E.g. NA19625: 1.45 vs 0.98 bytes per bp
  - Primary gain is from removal of original qualities

Further proposals
- Replace base calls with '=' sign to indicate agreement with reference
  - Rejected due to lack of tool support
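
The tag removal described above can be sketched on SAM text records, where optional tags follow the 11 mandatory tab-separated fields as TAG:TYPE:VALUE. This is an illustrative sketch, not the project's actual pipeline code; the record below is made up, and the function names are hypothetical.

```python
def strip_tags(sam_line: str, drop=("OQ", "XM", "XG", "XO")) -> str:
    """Return a SAM alignment line with the named optional tags removed."""
    fields = sam_line.rstrip("\n").split("\t")
    mandatory, optional = fields[:11], fields[11:]
    # Optional fields look like TAG:TYPE:VALUE; keep those whose TAG is not dropped.
    kept = [f for f in optional if f.split(":", 1)[0] not in drop]
    return "\t".join(mandatory + kept)


def saving(before: float, after: float) -> float:
    """Fractional space saving when going from `before` to `after` bytes per bp."""
    return 1.0 - after / before


# Made-up record for illustration only.
record = "\t".join([
    "read1", "0", "1", "100", "60", "4M", "*", "0", "0",
    "ACGT", "IIII", "RG:Z:lane1", "OQ:Z:HHHH", "XM:i:0", "NM:i:0",
])

print(strip_tags(record))
print(f"NA19625 saving: {saving(1.45, 0.98):.0%}")
```

The NA19625 figures from the slide (1.45 vs 0.98 bytes per bp) work out to a saving of about a third, in line with the quoted 30%.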


Population/Transposed BAM

Traditionally BAM files have been produced per sample with all of the lanes/libraries merged
- Lanes -> Library -> Platform -> Sample (1 per individual)

Problem: population based SNP calling needs to be aware of the reads across multiple samples at same loci
- Problems with opening hundreds/thousands of file handles simultaneously
- Distributed/parallel file systems like reading a few large striped files

Solution: Transposed BAMs
- Genome slices with multiple samples within single BAM
  - E.g. entire CEU population
- Header information to separate read groups into samples
- Samtools mpileup, GATK etc support this functionality
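
A minimal sketch of how header information can separate read groups into samples inside a multi-sample BAM, shown on SAM text lines. It relies on the standard @RG header fields ID and SM and the per-read RG:Z tag. The lane identifiers and reads are made up, and the function names are hypothetical rather than taken from any tool.

```python
from collections import defaultdict


def read_group_samples(header_lines):
    """Map each @RG ID to its SM (sample) value."""
    rg_to_sample = {}
    for line in header_lines:
        if not line.startswith("@RG"):
            continue
        tags = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        rg_to_sample[tags["ID"]] = tags["SM"]
    return rg_to_sample


def reads_by_sample(alignment_lines, rg_to_sample):
    """Group read names by sample using each read's RG:Z tag."""
    by_sample = defaultdict(list)
    for line in alignment_lines:
        fields = line.rstrip("\n").split("\t")
        rg = next(f[5:] for f in fields[11:] if f.startswith("RG:Z:"))
        by_sample[rg_to_sample[rg]].append(fields[0])
    return dict(by_sample)


# Made-up header and reads; lane IDs are hypothetical.
header = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@RG\tID:lane1\tSM:NA19294",
    "@RG\tID:lane2\tSM:NA19294",
    "@RG\tID:lane3\tSM:NA18943",
]
reads = [
    "r1\t0\t1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:lane1",
    "r2\t0\t1\t101\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:lane3",
    "r3\t0\t1\t102\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:lane2",
]

print(reads_by_sample(reads, read_group_samples(header)))
```

A caller reading one transposed BAM therefore needs a single file handle per genome slice, and recovers the per-sample view from the read groups.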


Horizontal/Transposed BAM

[Figure: grid of samples (NA19294, NA18943, NA19305, ...) against chromosomes (Chr1, Chr2, ...), with the transposed BAMs marked.]

Key questions
- Slice size: chromosome? 1Mbp, 10Mbp or 100Mbp?
- Size of individual groupings: 10, 50, 100, 500 individuals?
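
One way to frame the key questions is to count how many files each combination of slice size and grouping would produce. The sketch below assumes a genome length of about 3.1 Gbp and ignores chromosome boundaries, neither of which comes from the slides; the sample count of 1103 is the figure from the 1000G update slide.

```python
import math

GENOME_BP = 3_100_000_000   # ASSUMPTION: approximate human genome length
N_SAMPLES = 1103            # from the 1000G update slide


def n_files(slice_bp: int, group_size: int,
            genome_bp: int = GENOME_BP, n_samples: int = N_SAMPLES) -> int:
    """Number of transposed BAMs for a given slice size and samples per file."""
    slices = math.ceil(genome_bp / slice_bp)
    groups = math.ceil(n_samples / group_size)
    return slices * groups


for slice_mbp in (1, 10, 100):
    for group in (10, 50, 100, 500):
        count = n_files(slice_mbp * 1_000_000, group)
        print(f"{slice_mbp:>3} Mbp slices, {group:>3} samples per file: {count} files")
```

Small slices with small groupings give hundreds of thousands of files, while large slices with large groupings give under a hundred much larger ones, which is the trade-off the slide is asking about.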


VCF Format

Fully adopted by 1000G group as interchange format for variant calls
- SNPs, indels, and recently SVs
- Genotyping calls for all samples
- Annotation of variants via user-defined tags
- VCF APIs and tools via http://vcftools.sourceforge.net
- Scaling issues with VCF; BCF format in development

Petr Danecek
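
To make the format concrete, here is a minimal sketch of reading one VCF data line into its fixed columns, INFO tags, and per-sample genotype fields. It is not the vcftools API: the function name is hypothetical, the record is made up, and real files need more care (missing values, multiple ALT alleles, header parsing).

```python
def parse_vcf_line(line: str, sample_names):
    """Split a VCF data line into fixed columns, INFO tags and per-sample fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True   # flag tags carry no value
    fmt_keys = fields[8].split(":")
    genotypes = {
        name: dict(zip(fmt_keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","), "qual": qual, "filter": flt,
        "info": info_dict, "genotypes": genotypes,
    }


# Made-up record for illustration only.
line = "1\t10583\t.\tG\tA\t100\tPASS\tAC=1;AN=4;DS\tGT:DP\t0|1:12\t0|0:9"
record = parse_vcf_line(line, ["NA19294", "NA18943"])
print(record["info"], record["genotypes"]["NA19294"]["GT"])
```

Because every sample adds a column to every line, the text representation grows with both the number of sites and the number of samples, which is the scaling issue noted above.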


VCF (useful) Bloat

Every release of 1000G adds more tags to VCF files
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">

UK10K propose rich annotation of VCF files
- Known SNPs/indels
  - RS IDs, G1K unaccessioned SNPs
- Geographical information
- Ensembl annotation (coding, exonic, intronic, UTR, splice..)
  - microRNA, eQTL, known disease loci
- Coding consequences
  - Synonymous/non-synonymous, splice, stop, GERP score
- Functional interpretation
  - Polyphen, Sift, PANTHER
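
Each new annotation is declared once in the header as an ##INFO line and then appended to the INFO column of the records it applies to. The sketch below parses a header declaration of the kind listed on this slide and appends a tag to an INFO string. The tag name EXAMPLE_CSQ and both function names are hypothetical, chosen only for illustration.

```python
import re

# Matches header lines of the form shown on the slide. The Description may
# itself contain commas, so it is captured up to the closing quote.
INFO_RE = re.compile(
    r'##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),'
    r'Type=(?P<type>[^,]+),Description="(?P<description>.*)">'
)


def parse_info_header(line: str) -> dict:
    """Return the ID, Number, Type and Description of an ##INFO header line."""
    match = INFO_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an ##INFO header line: {line!r}")
    return match.groupdict()


def add_info(info_field: str, key: str, value: str) -> str:
    """Append key=value to a semicolon-separated INFO string."""
    entry = f"{key}={value}"
    return entry if info_field in ("", ".") else f"{info_field};{entry}"


header = ('##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in '
          'genotypes, for each ALT allele, in the same order as listed">')
print(parse_info_header(header))
# EXAMPLE_CSQ is a hypothetical tag name used only for this illustration.
print(add_info("AC=1;AN=4", "EXAMPLE_CSQ", "non_synonymous"))
```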


Storage Challenges

Storage
- Try to reduce the proportion of raw data we keep (e.g. images, OQ in BAM, remove base calls in BAM etc.)
- However there's still a LOT of data to store and analyse!
- Estimation for our group based on ~200Tbp of sequencing data over next 2-3 years
  - 1.5 Pbytes
  - Permanent: Lane alignments, transposed BAMs, horizontal BAMs, bi-monthly releases, backup of lane BAMs, Variant calls
  - Transient: Library BAMs, Local assemblies
- Storage type optimality criteria
  - Cost per Tbyte
  - Proximity to compute resources
  - Scalability: room for expansion/future proofing
  - I/O throughput
  - Disaster recovery
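
The 1.5 Pbytes figure can be put next to the sequencing volume to see what storage budget per base pair it implies. This is a rough sketch only: it assumes decimal units (1 Pbyte = 1e15 bytes) and reuses the 0.98 bytes per bp figure quoted for NA19625 on the BAM slide. Reading the result as "several copies of the data" is an interpretation, not something the slides state.

```python
SEQUENCE_BP = 200e12     # ~200 Tbp over the next 2-3 years, from the slide
STORAGE_BYTES = 1.5e15   # 1.5 Pbytes; ASSUMPTION: decimal units
BYTES_PER_BP = 0.98      # NA19625 after tag removal, from the BAM slide

# Storage budget per sequenced base pair implied by the estimate.
budget_bytes_per_bp = STORAGE_BYTES / SEQUENCE_BP

# Size of a single BAM copy of all the data at the reduced density.
one_copy_tbytes = SEQUENCE_BP * BYTES_PER_BP / 1e12

# How many such copies would fit in the estimate.
copies = STORAGE_BYTES / (SEQUENCE_BP * BYTES_PER_BP)

print(f"Budget: {budget_bytes_per_bp:.1f} bytes per bp")
print(f"One BAM copy: {one_copy_tbytes:.0f} Tbytes")
print(f"Equivalent copies: {copies:.1f}")
```

The permanent and transient items listed above (lane, library, horizontal and transposed BAMs, releases, backups) each hold the reads again in some form, which is consistent with a budget of several bytes per sequenced base.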


A Tiered Solution

3 tiered storage model
Trade off cost, quantity, i/o throughput
Similar to caching strategies in computer design
- Level 0: Local disk, closest proximity to CPU, intermediate temp files e.g. local assemblies, reference files
- Level 1: High-performance, highly parallel, close proximity to compute, expensive, suitable for high i/o tasks
- Level 2: Mid-tier storage, some type of nfs technology, discrete units with some local compute, suitable for low i/o tasks that are compute intensive, scalable by adding more discrete units
- Level 3: High latency storage, warehouse storage, not suitable to compute against, occasional access e.g. old data releases
  - (Level 3a: Off-site replication of data in level 3)


A Tiered Solution

[Figure: CPU farm connected to the storage levels, with link labels 3Gb/sec and 800Mb/sec and relative Cost (2, 1, 1) and Size (1, 2, 2) annotations as extracted; the assignment of these values to levels is not recoverable from the extracted text.]

Level 1: High performance
- Data: Current release horizontal + transposed BAMs
- Processes: BAM merging + splitting, Variant calling (SNPs, indels, SVs)

Level 2: Middle tier/nfs
- Data: Lane level BAMs
- Processes: Alignment, recalibration, local realignment

Level 3: Backup/warehouse
- Data: Old release BAMs + variant calls backup

Level 3a: Off-site replication
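
The level assignments on this slide can be written down as a small lookup table, which is one way a pipeline might decide where a job should read and write. The data and process names are transcribed from the slide; the dictionary layout and the function name are illustrative assumptions, not the group's actual configuration.

```python
# Tier assignments transcribed from the slide; the layout is illustrative.
TIERS = {
    1: {
        "name": "High performance",
        "data": ["current release horizontal BAMs", "current release transposed BAMs"],
        "processes": ["BAM merging", "BAM splitting", "variant calling"],
    },
    2: {
        "name": "Middle tier/nfs",
        "data": ["lane level BAMs"],
        "processes": ["alignment", "recalibration", "local realignment"],
    },
    3: {
        "name": "Backup/warehouse",
        "data": ["old release BAMs", "variant calls backup"],
        "processes": [],   # not suitable to compute against
    },
}


def tier_for_process(process: str) -> int:
    """Return the storage level a process is assigned to on the slide."""
    for level, tier in TIERS.items():
        if process in tier["processes"]:
            return level
    raise KeyError(f"no tier listed for process: {process}")


for name in ("alignment", "variant calling"):
    level = tier_for_process(name)
    print(f"{name}: level {level} ({TIERS[level]['name']})")
```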


Compute Challenges

Compute
- New algorithms continually developed for more accurate variant calling
- In 2010, several new processes were added to the production pipeline
  - BAM improvement
    - Local realignment around indels to correct mapping biases (e.g. GATK)
    - Adding BAQ differences up front
  - Indel calling by local assembly/alternative haplotype analysis (e.g. dindel)
  - Local reassembly of SV breakpoints
- Easy to estimate runtime for known processes (e.g. mapping, recalibration, duplicate removal)
- Challenge to estimate runtime for next 2-3 years for new algorithms
  - E.g. more use of assembly methods, more complex references?

I/O has become a significant bottleneck and is the most difficult thing to measure
- All computations need to minimise I/O
  - E.g. transforming BAM files to different sort orders
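
Changing sort order is a good example of an I/O-heavy step: every record has to be read, reordered, and written out again, even though nothing about the alignments changes. The sketch below shows the two orderings on made-up (read name, reference index, position) tuples. Plain string comparison is used for the name sort as a simplification; real tools may order read names differently.

```python
# Made-up alignment records: (read name, reference index, 1-based position).
records = [
    ("readB", 1, 500),
    ("readA", 2, 100),
    ("readC", 1, 100),
]


def coordinate_sorted(recs):
    """Order by reference sequence, then position (typical for variant calling)."""
    return sorted(recs, key=lambda r: (r[1], r[2]))


def name_sorted(recs):
    """Order by read name, so that mates of a pair sit next to each other."""
    return sorted(recs, key=lambda r: r[0])


print([r[0] for r in coordinate_sorted(records)])
print([r[0] for r in name_sorted(records)])
```

Keeping files in the order that the next step needs, rather than converting back and forth, avoids a full pass over the data.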


Project Data Release

Do we need to release BAMs?
Large scale human phenotype driven sequencing projects going forward
- Participants are more interested in the variants than the raw data

BAM files may contain too much data and be too large to ship around amongst project members
UK10K proposals
- Lane BAM files submitted to the archives
- BAM files not released via project ftp
- Project data release to consist solely of annotated VCF files
- Raw data can be obtained from the archives
