file types in bioinformatics · file types in bioinformatics 150915 martin dahlö ... sam –...

31

Upload: others

Post on 05-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file
Page 2: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

File Types in Bioinformatics

150915Martin Dahlö

[email protected]

Page 3: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

http://xkcd.com

Page 4: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

Overwhelming at first Overview

FASTA – reference sequences FASTQ – reads in raw form SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file GTF/GFF/BED – annotations

Page 5: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

FASTA

Used for: nucleotide or peptide sequences Simple structure

> header

sequence

Page 6: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

FASTA

Used for: nucleotide or peptide sequences Simple structure

Page 7: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

FASTQ

Just like FASTA, but with quality values Used for: raw data from sequencing (unaligned reads)

@ header

sequence

+

quality

Page 8: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

FASTQ

Just like FASTA, but with quality values Used for: raw data from sequencing (unaligned reads)

Page 9: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

FASTQ

Quality 0-40 (Illumina 1.8+ = 41)

40 = best

Page 10: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

FASTQ

Quality 0-40 (Illumina 1.8+ = 41)

40 = best

ASCII encoded

Page 11: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

FASTQ

Quality 0-40 (Illumina 1.8+ = 41)

40 = best

ASCII encoded

Page 12: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

FASTQ

Quality 0-40 (Illumina 1.8+ = 41)

40 = best

ASCII encoded

Page 13: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

SAM

Used for: aligned reads Lots of columns..

Page 14: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

SAM

Page 15: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

SAM

Used for: aligned reads Lots of columns..

Read name

Start position bp chr Sequence Quality

Page 16: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

BAM

Binary SAM (compressed) 25% of the size SAMtools to convert .bai = BAM index

Page 17: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file
Page 18: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

BAM

Random order Have to sort before indexing

Page 19: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

BAM

Random order Have to sort before indexing

Page 20: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

BAM

Random order Have to sort before indexing

Chr1 Chr2 Chr3 Chr4 Chr5

Page 21: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

BAM

Page 22: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

BAM

Page 23: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

BAM

Page 24: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

CRAM

Very complex format Used together with a reference genome

Page 25: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

CRAM

Quality scores? 3 modes:

Lossless Binned No quality

Page 26: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

1 2 3 4 5 6 7 8 9 10 11 12 13 14 … 32 33 34 35 36 37 38 39 40 41

1-5 6-10 11-15 16-20 21-25 26-30 31-35 35-40 41-45

Page 27: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

CRAM

Quality scores? 3 modes:

Lossless Binned No quality

Not widespread, yet

Page 28: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

GTF/GFF/BED

Used for: annotations Simple structure

Usually:

chr start stop extra info

Page 29: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

GTF/GFF/BED

Used for: annotations Simple structure

Usually:

chr start stop extra info

BED

Page 30: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

GTF/GFF/BED

Used for: annotations Simple structure

Usually:

chr start stop extra info

GFF

Page 31: File Types in Bioinformatics · File Types in Bioinformatics 150915 Martin Dahlö ... SAM – aligned reads BAM – compressed SAM file CRAM – even more compressed SAM file

Laboratory time! (yet again)