HUGO: Hierarchical mUlti-reference Genome cOmpression tool for aligned short reads


Upload: vinnie

Post on 22-Feb-2016


integrating Data for Analysis, Anonymization, and SHaring

Supported by the NIH Grant U54HL108460 to the University of California, San Diego

[Figure: HUGO framework flowchart. The input (.bam/.sam) is split into three streams: Sequence, Other Fields, and Quality value. Boxes: Fast reference-based alignment tool; Matched reads; Mismatched reads; "Read length > threshold?" (Yes/No); Use another reference and shorten the read length; Encode aligning information; K-means clustering; Encoder (delta code, Huffman code, gamma code, LZW code, etc.); Output 1, Output 2, Output 3, Output 4.]

Pinghao Li,1 Xiaoqian Jiang,2 Shuang Wang,2 Jihoon Kim,2 Hongkai Xiong,1 and Lucila Ohno-Machado2

1 EE Department, Shanghai Jiaotong University, Shanghai, China
2 Division of Biomedical Informatics, University of California–San Diego, La Jolla, California, USA

Summary of Conclusions

Storage and transmission are major challenges in working with large-scale sequencing ‘Big Data’. We developed the HUGO framework, a novel technique for compressing aligned reads. The method hierarchically matches gradually shortened reads against multiple references in order to make full use of available reference genomes. In our experiments against other state-of-the-art compression algorithms, HUGO outperformed CRAM and achieved compression performance similar to Samcomp.

Introduction

Short-read sequencing is becoming the standard of practice for studying structural variants associated with disease. However, sequence data are growing far faster than reasonable storage capacity, so the biomedical community faces challenges in managing, transferring, archiving, and storing these data.

HUGO framework

We developed Hierarchical mUlti-reference Genome cOmpression (HUGO) [1], a novel compression algorithm for aligned reads in the Sequence Alignment/Map (SAM) format. We first aligned short reads against a reference genome and stored the exactly mapped reads for compression. Inexactly mapped and unmapped reads were realigned against different reference genomes using an adaptive scheme that gradually shortens the read length. For base quality values, we offer both lossy and lossless compression. The lossy mechanism uses k-means clustering, which lets the user adjust the balance between decompression quality and compression rate. Lossless compression is obtained by setting k (the number of clusters) to the number of distinct quality values.
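The hierarchical matching idea can be illustrated with a toy sketch. This is not the authors' implementation: it assumes exact substring search in place of a real alignment tool, halves a read whenever no reference contains it, and stores segments at or below a hypothetical `min_len` threshold as literals. The names `find_exact`, `encode_read`, `decode_read`, and `min_len` are all invented for illustration.

```python
def find_exact(segment, references):
    """Return (reference index, 0-based position) of the first exact hit, else None."""
    for idx, ref in enumerate(references):
        pos = ref.find(segment)
        if pos != -1:
            return idx, pos
    return None


def encode_read(read, references, min_len=4):
    """Encode a read as ("match", ref, pos, length) and ("literal", bases) tokens.

    Assumption: a segment with no exact hit is split in half and each half is
    retried against every reference; segments of length <= min_len are kept
    verbatim.
    """
    tokens = []
    pending = [read]  # segments still to encode, in left-to-right order
    while pending:
        segment = pending.pop(0)
        hit = find_exact(segment, references)
        if hit is not None:
            tokens.append(("match", hit[0], hit[1], len(segment)))
        elif len(segment) > min_len:
            mid = len(segment) // 2
            pending[:0] = [segment[:mid], segment[mid:]]
        else:
            tokens.append(("literal", segment))
    return tokens


def decode_read(tokens, references):
    """Rebuild the read from its tokens and the same list of references."""
    parts = []
    for token in tokens:
        if token[0] == "match":
            _, ref_idx, pos, length = token
            parts.append(references[ref_idx][pos:pos + length])
        else:
            parts.append(token[1])
    return "".join(parts)


references = ["ACGTACGTTTGACCA", "GGGCATTACGGA"]  # toy stand-ins for two genomes
for read in ["TACGTTTG", "CGTTTGCATTAC", "CGTTTGNNNN"]:
    tokens = encode_read(read, references)
    print(read, tokens, decode_read(tokens, references) == read)
```

In this toy run the second read matches neither reference as a whole, but its two halves match the first and second reference respectively, which is the situation multi-reference matching is meant to exploit.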

Image source: http://www.ncbi.nlm.nih.gov/Traces/sra/i/g.png

#   Name   Description
1   QNAME  Query NAME of the read or the read pair
2   FLAG   Bitwise FLAG (pairing, strand, mate strand, etc.)
3   RNAME  Reference sequence NAME
4   POS    1-based leftmost POSition of clipped alignment
5   MAPQ   MAPping Quality (Phred-scaled)
6   CIGAR  Extended CIGAR string (operations: MIDNSHP)
7   MRNM   Mate Reference NaMe (‘=’ if same as RNAME)
8   MPOS   1-based leftmost Mate POSition
9   ISIZE  Inferred Insert SIZE
10  SEQ    Query SEQuence on the same strand as the reference
11  QUAL   Query QUALity (ASCII-33=Phred base quality)
12  OPT    Optional fields
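As a minimal sketch of how these fields map onto one tab-separated SAM line, the snippet below splits a record into the columns listed above and converts QUAL characters to Phred scores by subtracting 33. The `SamRecord` type, the function names, and the example line are made up for illustration; a production pipeline would normally rely on an established SAM/BAM library instead.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost position
    mapq: int
    cigar: str
    mrnm: str
    mpos: int
    isize: int
    seq: str
    qual: str
    optional: List[str]


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line into the 11 fixed fields plus optional fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]),
                     cols[5], cols[6], int(cols[7]), int(cols[8]),
                     cols[9], cols[10], cols[11:])


def phred_scores(qual: str) -> List[int]:
    """Decode a QUAL string: Phred score = ASCII code - 33."""
    return [ord(ch) - 33 for ch in qual]


# Hypothetical record, not taken from the poster's data sets.
example = "read1\t99\tchr20\t1000\t60\t8M\t=\t1200\t208\tACGTACGT\tIIIIHHGG\tNM:i:0"
record = parse_sam_line(example)
print(record.rname, record.pos, record.cigar, phred_scores(record.qual))
```

Separating the record into columns like this is what allows the sequence, the quality values, and the remaining fields to be routed to different encoders.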

Methodology

[Figure labels: Child; Reference from Mother; Reference from Father]

EMR: exact mapped reads
IMR: inexact mapped reads (fewer than 4 mismatches)
UMR: unmapped reads (more than 4 mismatches)
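The three read classes can be sketched as a mismatch count against a reference window. This is only an illustration: it ignores insertions and deletions, the helper names are hypothetical, and because the legend does not say how a read with exactly 4 mismatches is classified, the code assumes it is treated as unmapped.

```python
def count_mismatches(read, reference, pos):
    """Count differing bases between the read and the reference window at pos (0-based)."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    return sum(a != b for a, b in zip(read, window))


def classify_read(mismatches, limit=4):
    """Map a mismatch count to EMR, IMR, or UMR.

    Assumption: exactly `limit` mismatches is treated as UMR; the poster
    leaves that boundary case unspecified.
    """
    if mismatches == 0:
        return "EMR"
    if mismatches < limit:
        return "IMR"
    return "UMR"


reference = "ACGTACGTAC"  # toy reference
for read in ["ACGTAC", "ACGAAC", "TTTTTT"]:
    n = count_mismatches(read, reference, 0)
    print(read, n, classify_read(n))
```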

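Experimental Results

[Figure: compressed size (MB) and quantization error (MAPE) plotted against the number of clusters K, for K from 5 to 51. Legend: Compressed size; Error.]

Compression using k-means clustering followed by bzip2, where the quantization error is measured by the Mean Absolute Percentage Error (MAPE).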
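The trade-off in this figure can be reproduced in miniature with the sketch below, which clusters quality values with a plain one-dimensional k-means (Lloyd's algorithm), replaces each value by its cluster's rounded center, and reports the MAPE and the bzip2-compressed size. The initialization, the rounding of centers, the skipping of zero-valued scores in the MAPE, and the sample data are all assumptions made for illustration rather than details taken from HUGO.

```python
import bz2


def kmeans_1d(values, k, iters=100):
    """Plain 1-D k-means; centers start at evenly spaced distinct values."""
    distinct = sorted(set(values))
    k = max(1, min(k, len(distinct)))
    if k == 1:
        centers = [sum(values) / len(values)]
    else:
        step = (len(distinct) - 1) / (k - 1)
        centers = [float(distinct[round(i * step)]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center when a cluster ends up empty.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


def quantize(values, centers):
    """Replace each value with the nearest rounded cluster center."""
    levels = [round(c) for c in centers]
    return [min(levels, key=lambda q: abs(v - q)) for v in values]


def mape(original, quantized):
    """Mean Absolute Percentage Error; zero-valued originals are skipped."""
    pairs = [(o, q) for o, q in zip(original, quantized) if o != 0]
    return 100.0 * sum(abs(o - q) / o for o, q in pairs) / len(pairs)


def bzip2_size(scores):
    """Size in bytes of the scores stored as ASCII-33 characters and bzip2-compressed."""
    return len(bz2.compress(bytes(s + 33 for s in scores)))


# Hypothetical Phred scores, not taken from the poster's data sets.
quals = [40, 40, 39, 38, 38, 37, 30, 25, 20, 12, 10, 8, 2, 2] * 50
for k in (2, 3, 5, len(set(quals))):
    q = quantize(quals, kmeans_1d(quals, k))
    print(k, round(mape(quals, q), 2), bzip2_size(q))
```

With k equal to the number of distinct quality values every value is its own cluster, so the quantized stream equals the input and the MAPE is zero, which corresponds to the lossless setting described in the framework.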

ID  Sequence Name    BAM size  SAM size
1   NA12878chrom20   356 MB    1.58 GB
2   HG00096chrom11   661 MB    2.65 GB
3   HG00103chrom11   717 MB    2.91 GB
4   HG01028chrom11   964 MB    3.95 GB
5   NA06984chrom11   1.19 GB   5.16 GB
6   NA06985chrom11   2.33 GB   9.41 GB

[Figure: HUGO with lossy compression. Compression ratio (y-axis, 0 to 10) for sequences 1 and 2 (x-axis). Legend: bzip2, CRAM, SAMcomp, HUGO, HUGO L(30), HUGO L(20), HUGO L(10), HUGO L(1). Bar annotations: bzip2, CRAM, SAMcomp, HUGO lossless, HUGO lossy.]

[Figure: Encoding memory usage. Memory usage in GB (y-axis, 0 to 20) for sequences 1 to 6 (x-axis). Legend: CRAM, Samcomp, HUGO.]

[Figure: Decoding memory usage. Memory usage in GB (y-axis, 0 to 8) for sequences 1 to 6 (x-axis). Legend: CRAM, Samcomp, HUGO.]

HUGO lossless with multi-reference

ID  Program name  Reference    BAM size  Compressed size
3   bzip2         hg19         717 MB    720 MB
    CRAM [2]      hg19                   453.6 MB
    Samcomp [3]   hg19                   349 MB
    HUGO          hg19                   392.5 MB
    HUGO          hg19, HuRef            390.2 MB
4   bzip2         hg19         964 MB    967 MB
    CRAM          hg19                   585 MB
    Samcomp       hg19                   480 MB
    HUGO          hg19                   548.1 MB
    HUGO          hg19, HuRef            545.3 MB
5   bzip2         hg19         1.19 GB   1.192 GB
    CRAM          hg19                   736.8 MB
    Samcomp       hg19                   538 MB
    HUGO          hg19                   653.1 MB
    HUGO          hg19, HuRef            650.2 MB
6   bzip2         hg19         2.33 GB   2.34 GB
    CRAM          hg19                   1570 MB
    Samcomp       hg19                   1247 MB
    HUGO          hg19                   1496 MB
    HUGO          hg19, HuRef            1491 MB
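To make the table easier to compare across tools, the short sketch below converts the sizes reported for sequence 3 into compression ratios. It assumes the ratio is defined as BAM size divided by compressed size; the poster does not state which definition its compression-ratio figure uses, so this definition is an assumption.

```python
def compression_ratio(original_mb, compressed_mb):
    """Ratio of original to compressed size; values above 1 mean the file shrank."""
    return original_mb / compressed_mb


bam_mb = 717.0  # sequence 3 (HG00103chrom11), BAM size from the table
compressed_mb = {
    "bzip2": 720.0,
    "CRAM": 453.6,
    "Samcomp": 349.0,
    "HUGO (hg19)": 392.5,
    "HUGO (hg19, HuRef)": 390.2,
}

ratios = {name: compression_ratio(bam_mb, size) for name, size in compressed_mb.items()}
for name, ratio in ratios.items():
    print(f"{name:20s} {ratio:.2f}")
```

Under this definition bzip2 gives a ratio just below 1 on the already compressed BAM file, and adding HuRef as a second reference lowers HUGO's output for this sequence from 392.5 MB to 390.2 MB.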

References

[1] Li P, Jiang X, Wang S, Kim J, Xiong H, Ohno-Machado L. HUGO: Hierarchical mUlti-reference Genome cOmpression for aligned reads. Journal of the American Medical Informatics Association 2013;0:1–11. doi:10.1136/amiajnl-2013-002147
[2] Fritz MH-Y, Leinonen R, Cochrane G, et al. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res 2011;21:734–40.
[3] Bonfield JK, Mahoney MV. Compression of FASTQ and SAM format sequencing data. PLoS ONE 2013;8:e59190.