EBI is an Outstation of the European Molecular Biology Laboratory.
CRAM: reference-based compression format
developed by Vadim Zalunin
Data horror
EMBL-EBI: 10 petabytes
SRA: ~1 petabyte
Over 2 million DVDs, or 2.5 km
Complete Genomics: 0.5 TB for a single file
Compression, when we know what to expect.
LOSSLESS: BMP, 145 kb; PNG, 2 kb
LOSSY: JPG, 6 kb; JPG, 3 kb
But the actual message is only 40 characters (bytes) long!
Compression at its best
IMAGE, 145 kb
"Five little ducks went swimming one day"
TEXT, 40 b IMAGE, 145 kb
~3500 times more efficient
compress uncompress
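The ratio quoted on this slide can be checked with a line of arithmetic. This is a minimal sketch, assuming "kb" means 1,000 bytes (with 1,024 the ratio comes out slightly higher); the variable names are illustrative only.

```python
# Sizes taken from the slide; "kb" is assumed to mean 1,000 bytes here.
image_bytes = 145 * 1000

message = "Five little ducks went swimming one day"
text_bytes = len(message.encode("ascii"))  # 39 characters; the slide rounds to 40

# Use the slide's rounded figure of 40 bytes for the text.
ratio = image_bytes / 40
print(f"text: {text_bytes} bytes, image/text ratio: about {ratio:.0f}x")
```

The result lands in the same range as the slide's "~3500 times" figure; the exact value depends on how the kilobyte and the message length are rounded.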
What are we talking about
sample
sequencing machines
bug
bunch of huge files
The bug’s DNA is hidden somewhere
Looking closer at the data
bunch of huge files
read 1, read 2, read 3, ... read bazillion
It boils down to a long list of reads:
Each read represents a short nucleotide sequence from the genome.
Additional information may be attached to it, for example error estimates.
What is a Read?
@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…
An excerpt from a FASTQ file: the first line is the read name, the second the read bases, and the fourth the read quality scores.
Bases: ACGTN
Quality scores: from ‘!’ (ASCII 33) to ‘~’ (ASCII 126)
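The record layout described above (read name, read bases, a "+" separator, quality scores) can be sketched as a small parser. This is a minimal illustration that assumes the common four-lines-per-record FASTQ layout; `FastqRead` and `parse_fastq` are hypothetical names, not part of any CRAM tool, and the sample read is truncated to its first 10 bases because the slide elides the rest.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRead(NamedTuple):
    name: str
    bases: str
    qualities: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRead]:
    """Yield reads from a four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        bases = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        qualities = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(bases) != len(qualities):
            raise ValueError(f"bases and qualities differ in length for {header!r}")
        yield FastqRead(header[1:], bases, qualities)


# First 10 bases and scores of the read shown on the slide.
example = io.StringIO("@SRR081241.20758946\nCCAGATCCTG\n+\nIDCEFFGGHH\n")
reads = list(parse_fastq(example))
print(reads[0].name, reads[0].bases, reads[0].qualities)
```

Each base has exactly one quality character, which is why the parser rejects records where the two strings differ in length.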
What is a quality score?
The quality score is a Phred quality score encoded as ASCII symbols 33-126.
Basically: higher scores are better, so ‘!’ is bad and ‘I’ is good.
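The encoding above can be made concrete with a short decoder. This sketch assumes the ASCII offset of 33 named on the slide and uses the standard Phred definition, in which a score Q corresponds to an error probability of 10^(-Q/10); the function names are illustrative.

```python
def phred_scores(quality_string: str, offset: int = 33) -> list[int]:
    """Convert ASCII-encoded quality characters to integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(score: int) -> float:
    """Estimated probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


scores = phred_scores("!I")
print(scores)  # '!' decodes to 0 and 'I' decodes to 40
print([error_probability(q) for q in scores])
```

A score of 0 means the base call is no better than a guess, while a score of 40 corresponds to roughly one error in 10,000 calls.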
Reference based encoding
Reference sequence  T G A G C T C T A A G T A G C C G C G G T C T G T C C G
read 1              T G A G C T C T T A G T A G C
read 2                    G C T C T A A G T A G C C G C
read 3                      C T C T A A G T A G C C G C G
read 4                                  G T A G C C G C G G A C T G T
read 5                                                C G G T C T G T C C G
Read start position, read end position
Reference based encoding
Reference sequence  T G A G C T C T A A G T A G C C G C G G T C T G T C C G
read 1              . . . . . . . . T . . . . . .
read 2                    . . . . . . . . . . . . . . .
read 3                      . . . . . . . . . . . . . . .
read 4                                  . . . . . . . . . . A . . . .
read 5                                                . . . . . . . . . . .
Mismatching bases: the T in read 1 and the A in read 4. Bases that match the reference are shown as dots.
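The idea on these slides, storing a read as a position on the reference plus its mismatching bases, can be sketched as follows. This is an illustration of the principle only, not the actual CRAM format; `EncodedRead`, `encode`, and `decode` are hypothetical names. The reference string is written with G as its 14th base, which is what the dot diagram implies, since every read covering that position matches there.

```python
from typing import NamedTuple

# Reference from the slide, with G as the 14th base (see the note above).
REFERENCE = "TGAGCTCTAAGTAGCCGCGGTCTGTCCG"


class EncodedRead(NamedTuple):
    start: int  # 0-based start position on the reference
    length: int
    mismatches: tuple[tuple[int, str], ...]  # (offset within read, read base)


def encode(read: str, start: int, reference: str = REFERENCE) -> EncodedRead:
    """Keep only the read's position, length, and bases that differ."""
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    diffs = tuple(
        (i, base) for i, (base, ref) in enumerate(zip(read, window)) if base != ref
    )
    return EncodedRead(start, len(read), diffs)


def decode(enc: EncodedRead, reference: str = REFERENCE) -> str:
    """Rebuild the read by copying the reference and patching in mismatches."""
    bases = list(reference[enc.start:enc.start + enc.length])
    for offset, base in enc.mismatches:
        bases[offset] = base
    return "".join(bases)


read_1 = "TGAGCTCTTAGTAGC"  # read 1 from the slide, starts at reference base 1
read_4 = "GTAGCCGCGGACTGT"  # read 4 from the slide, starts at reference base 11
enc_1 = encode(read_1, 0)
enc_4 = encode(read_4, 10)
print(enc_1)
print(enc_4)
assert decode(enc_1) == read_1 and decode(enc_4) == read_4
```

A read that matches the reference perfectly, such as read 2, reduces to a start position and a length, which is where most of the saving comes from.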
Lossy quality scores
Approach 1: Quality scores are usually values from 0 to 39.
Let’s shrink them so that they run from 0 to 7.
Approach 2: Let’s treat quality scores using alignment information.
For example: preserve quality scores only for mismatching bases.
horizontal
vertical
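The two approaches on this slide can be sketched in a few lines. The slide does not say how the 0 to 39 range is mapped onto 0 to 7, so uniform bins are an assumption here, and the quality values in the example are invented placeholders; the function names are hypothetical.

```python
def shrink_quality(score: int, bins: int = 8, max_score: int = 39) -> int:
    """Approach 1: map a 0..max_score quality onto 0..bins-1 (uniform bins assumed)."""
    clamped = min(max(score, 0), max_score)
    return clamped * bins // (max_score + 1)


def keep_mismatch_qualities(
    read: str, reference_window: str, qualities: list[int]
) -> dict[int, int]:
    """Approach 2: keep quality scores only where the read differs from the reference."""
    return {
        i: q
        for i, (base, ref, q) in enumerate(zip(read, reference_window, qualities))
        if base != ref
    }


# Approach 1 on a few scores.
print([shrink_quality(q) for q in (0, 4, 5, 20, 39)])

# Approach 2 on read 1 from the reference-based encoding slide.
read_1 = "TGAGCTCTTAGTAGC"
window = "TGAGCTCTAAGTAGC"  # reference bases under read 1
qualities = [30] * 15       # placeholder scores, not real data
qualities[8] = 12
print(keep_mismatch_qualities(read_1, window, qualities))
```

Approach 1 loses precision on every score but keeps them all; approach 2 keeps full precision but only for the few bases that carry new information.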
Comparison study: 1K Genomes exomes
BAM -> compress -> CRAM -> uncompress -> BAM
Original BAM -> some analysis pipeline -> original SNPs
Restored BAM -> some analysis pipeline -> restored SNPs
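The comparison on this slide comes down to checking how well the two SNP sets agree. The sketch below assumes each SNP call can be represented as a (chromosome, position, alternate base) tuple; the function name is hypothetical and the calls are invented placeholders, not results from the study.

```python
def compare_snp_calls(original: set, restored: set) -> dict:
    """Count calls shared by both sets, lost after the round trip, or newly gained."""
    return {
        "shared": len(original & restored),
        "lost": len(original - restored),
        "gained": len(restored - original),
    }


# Placeholder calls for illustration only.
original_snps = {("chr1", 1000, "A"), ("chr1", 2000, "T"), ("chr2", 500, "G")}
restored_snps = {("chr1", 1000, "A"), ("chr2", 500, "G"), ("chr2", 900, "C")}

summary = compare_snp_calls(original_snps, restored_snps)
print(summary)
```

For lossless compression the two sets should be identical; for lossy settings the "lost" and "gained" counts show what the discarded information cost.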
CRAM NGS data compression
[Chart: bits/base, on a scale from (bad) to (good), for four treatments: do nothing (untreated), CRAM lossless, CRAM lossy, and CRAM very lossy. The first two are lossless, the last two lossy.]
Progressive application of compression
[Figure: compression level chosen by sample value (low to high) and sample accessibility (easy to hard). Levels shown: lossless, 2-fold, 20-fold, 200-fold.]
References
More information:
http://www.ebi.ac.uk/ena/about/cram_toolkit
Mailing list:
http://listserver.ebi.ac.uk/mailman/listinfo/cram-dev
Publications:
Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21(5), 734-40.
Cochrane, G., Cook, C.E. and Birney, E. (2012) The future of DNA sequence archiving. GigaScience 1.