Mapping NGS sequences to a reference genome

Upload: irene-owens

Post on 25-Dec-2015


Page 1: Mapping NGS sequences to a reference genome. Why? Resequencing studies (DNA) – Structural variation – SNP identification RNAseq – Mapping transcripts

Mapping NGS sequences to a reference genome

Why?

• Resequencing studies (DNA)
  – Structural variation
  – SNP identification

• RNAseq
  – Mapping transcripts to a genome sequence
    • Genome annotation
    • Transcript enumeration
    • Identification of splice junctions/variants

BLAST is too slow

• Different alignment algorithms are necessary
• Burrows-Wheeler alignment
  – Sequence database (genome) is transformed to produce an index
  – Individual sequence reads are searched against this index
• STAR aligner (Dobin et al. 2012, Bioinformatics)
  – Uncompressed suffix arrays

BWT of “banana”
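
The figure for this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of the transform using the textbook "sort all rotations" construction. The function names and the `$` terminator are illustrative choices, and real aligners build the index with far more memory-efficient methods.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (textbook method)."""
    if terminator in text:
        raise ValueError("terminator must not occur in the input")
    s = text + terminator
    # Build every rotation of the terminated string and sort them.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Invert the transform by repeatedly prepending and re-sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    # The original string is the row that ends with the terminator.
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(bwt("banana"))           # annb$aa
print(inverse_bwt("annb$aa"))  # banana
```

Note how the transform groups identical characters together ("nn", "aa"), which is what makes the indexed genome compressible and searchable.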

TopHat2

• Based on the Bowtie alignment engine
  – Bowtie, matching with no gaps
  – TopHat2, gapped matches
• Aligns reads to a Burrows-Wheeler transformed index of the genome
• 1st pass: non-gapped matches
• 2nd pass: splits unmapped reads and attempts to align the fragments

The STAR Aligner

• Start at the first base of the sequence read
• Find the Maximal Mappable Prefix (MMP)
• Repeat the process using the unmapped portion of the read
• Reported as 50x faster than other aligners (Dobin et al. 2012)
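
The MMP idea can be illustrated with a deliberately naive sketch. It uses Python's `str.find` for exact matching instead of a suffix array, ignores mismatches and the reverse strand, and runs on a made-up toy genome, so it shows only the greedy "longest prefix, then repeat" logic and not STAR's actual implementation. All names here are hypothetical.

```python
from typing import List, Tuple


def maximal_mappable_prefix(genome: str, read: str) -> Tuple[int, int]:
    """Return (length, 0-based genome position) of the longest read prefix
    that occurs exactly in the genome; position is -1 if nothing matches."""
    best_len, best_pos = 0, -1
    for length in range(1, len(read) + 1):
        pos = genome.find(read[:length])
        if pos == -1:
            break  # a longer prefix cannot match if this one does not
        best_len, best_pos = length, pos
    return best_len, best_pos


def split_read(genome: str, read: str) -> List[Tuple[int, int, int]]:
    """Greedily cut the read into (read_offset, length, genome_position)."""
    segments = []
    offset = 0
    while offset < len(read):
        length, pos = maximal_mappable_prefix(genome, read[offset:])
        if length == 0:
            segments.append((offset, 1, -1))  # unmappable base, skip it
            offset += 1
        else:
            segments.append((offset, length, pos))
            offset += length
    return segments


# Toy genome: two "exons" separated by a run of A's standing in for an intron.
genome = "TTGACCATGG" + "AAAAAAAAAA" + "CGTTAGCCTA"
read = "ACCATGG" + "CGTTAG"  # a read spanning the splice junction
print(split_read(genome, read))  # [(0, 7, 3), (7, 6, 20)]
```

The two segments land on either side of the toy intron, which is how a spliced read reveals a junction.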

OUTPUTS

• TopHat (Bowtie)
  – .bam file (binary alignment/map)
  – .sam file (sequence alignment/map)
  – Single .sam file entry:

I8MVR:53:837 0 17_dna:chromosome 14090858 255 21M * 0 0 TAACTACGAATACCTGTCGAT **%-**,00%-*-%---*-*- NM:i:7 XX:Z:C5T3C2T2CT2C XM:Z:h..H......h.H...x...h XR:Z:CT XG:Z:CT

.sam fields
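
The table for this slide is missing from the transcript. As a rough substitute, the sketch below names the 11 mandatory tab-separated SAM columns and parses the single entry shown on the OUTPUTS slide. The `SamRecord` type and `parse_sam_line` function are illustrative names, not part of any library, and the optional `TAG:TYPE:VALUE` fields are kept as plain strings except for integer (`i`) tags.

```python
from typing import Dict, NamedTuple, Union


class SamRecord(NamedTuple):
    qname: str   # 1. read name
    flag: int    # 2. bitwise flag
    rname: str   # 3. reference sequence name
    pos: int     # 4. 1-based leftmost mapping position
    mapq: int    # 5. mapping quality (255 = not available)
    cigar: str   # 6. CIGAR string
    rnext: str   # 7. reference name of the mate ("*" = none)
    pnext: int   # 8. position of the mate
    tlen: int    # 9. observed template length
    seq: str     # 10. read sequence
    qual: str    # 11. base qualities (ASCII encoded)
    tags: Dict[str, Union[int, str]]  # optional fields


def parse_sam_line(line: str) -> SamRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    tags: Dict[str, Union[int, str]] = {}
    for item in fields[11:]:
        tag, type_code, value = item.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10], tags)


# The entry from the slide, rebuilt with explicit tab separators.
EXAMPLE = "\t".join([
    "I8MVR:53:837", "0", "17_dna:chromosome", "14090858", "255", "21M",
    "*", "0", "0", "TAACTACGAATACCTGTCGAT", "**%-**,00%-*-%---*-*-",
    "NM:i:7", "XX:Z:C5T3C2T2CT2C", "XM:Z:h..H......h.H...x...h",
    "XR:Z:CT", "XG:Z:CT",
])
record = parse_sam_line(EXAMPLE)
print(record.rname, record.pos, record.cigar, record.tags["NM"])
```

For real data a dedicated library is usually the better choice; this sketch is only meant to make the column layout concrete.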

.sam flags

1. 1
2. 2
3. 1+2
4. 0+4
5. 1+4
6. 0+2+4
7. 1+2+4
8. 0+8
9. 1+8
10. 0+2+8
11. 1+2+8
12. 0+4+8
13. 1+4+8
14. 0+2+4+8
15. 1+2+4+8
16. …etc.
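
The list above shows that every flag value is a sum of powers of two. A short sketch of decoding such a value follows; the bit meanings are those defined in the SAM specification, while the `decode_flag` helper and the wording of the descriptions are illustrative.

```python
# Bit values and their meanings as defined in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> list:
    """Return the description of every bit that is set in the flag."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


print(decode_flag(11))  # 11 = 1+2+8
print(decode_flag(0))   # flag 0: no bits set (unpaired read, forward strand)
```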

CIGAR format

I8MVR:104:144 0 7_dna:chromosome 120102744 255 62M1I14M * 0 0 GGTTTTTTGGAAGAGTAGTTCGCGTTTCATTAATTAGTTATTTTTTAGTTTTTAAATAAAATAAAATTTTAAAAAAA
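
In the entry above, the CIGAR string `62M1I14M` reads as 62 aligned bases, 1 base inserted relative to the reference, then 14 more aligned bases. A minimal sketch of parsing a CIGAR string and deriving the read and reference lengths is given below; the operation letters follow the SAM specification, and the function names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that consume read bases and reference bases, per the SAM spec.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings with characters the pattern silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def query_length(cigar: str) -> int:
    """Number of read bases the alignment accounts for."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_length(cigar: str) -> int:
    """Number of reference bases the alignment spans."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


print(parse_cigar("62M1I14M"))       # [(62, 'M'), (1, 'I'), (14, 'M')]
print(query_length("62M1I14M"))      # 77, the length of the read sequence
print(reference_length("62M1I14M"))  # 76, the inserted base is not in the reference
```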

Quantifying alignments

• How many reads overlap a given interval on a chromosome (scaffold)?

• How do these regions correspond to known genes?
  – .gtf file

• How many transcripts from my gene of interest?

• How confident can I be about a variant call?
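
The first question, how many reads overlap an interval, reduces to a simple coordinate test. The sketch below assumes alignments have already been reduced to `(chromosome, start, end)` tuples with 1-based inclusive coordinates; the read positions are made up for illustration, and `count_overlaps` is a hypothetical helper rather than a function from any counting tool.

```python
def count_overlaps(alignments, chrom, start, end):
    """Count alignments sharing at least one base with chrom:start-end.

    Coordinates are 1-based and inclusive on both ends.
    """
    return sum(
        1
        for a_chrom, a_start, a_end in alignments
        if a_chrom == chrom and a_start <= end and a_end >= start
    )


# Hypothetical 50-base reads.
reads = [
    ("Chromosome_8.1", 90150, 90199),  # overlaps the start of the interval
    ("Chromosome_8.1", 90200, 90249),  # overlaps the end of the interval
    ("Chromosome_8.1", 90800, 90849),  # downstream, no overlap
    ("Chromosome_8.2", 90162, 90211),  # same coordinates, other scaffold
]
print(count_overlaps(reads, "Chromosome_8.1", 90162, 90231))  # 2
```

Dedicated counting tools use indexed lookups rather than a linear scan, but the overlap condition is the same.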

Annotate regions - GTF files

1 2 3 4 5 6 7 8 9

Chromosome_8.1 Cufflinks transcript 90162 90766 1000 + . gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "110.6292802224"; frac "1.000000"; conf_lo "41.668327"; conf_hi "132.581041"; cov "6.415537";

Chromosome_8.1 Cufflinks exon 90162 90231 1000 + . gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1"; FPKM "110.6292802224"; frac "1.000000"; conf_lo "41.668327"; conf_hi "132.581041"; cov "6.415537";

Chromosome_8.1 Cufflinks exon 90314 90766 1000 + . gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "2"; FPKM "110.6292802224"; frac "1.000000"; conf_lo "41.668327"; conf_hi "132.581041"; cov "6.415537";

Chromosome_8.1 Cufflinks transcript 90889 91620 1000 . . gene_id "CUFF.2"; transcript_id "CUFF.2.1"; FPKM "49.8117204717"; frac "1.000000"; conf_lo "21.651798"; conf_hi "73.074820"; cov "2.193724";

GTF fields

1. Sequence ID
2. Source
3. Feature
4. Start
5. End
6. Score
7. Strand
8. Frame
9. Attribute
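
The nine fields can be read with a few lines of code. This is a minimal sketch that assumes tab-separated columns and attributes written as `key "value";` pairs, as in the Cufflinks records on this slide; `parse_gtf_line` and the column names are illustrative, and the attribute list in the example is shortened.

```python
import re

GTF_COLUMNS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute")
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line: str) -> dict:
    """Parse one GTF record into a dict keyed by column name."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("a GTF record has exactly 9 tab-separated columns")
    record = dict(zip(GTF_COLUMNS, columns))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-terminated key "value" pairs.
    record["attribute"] = dict(ATTRIBUTE.findall(columns[8]))
    return record


# First exon of CUFF.1 from the slide, rebuilt with explicit tabs.
line = "\t".join([
    "Chromosome_8.1", "Cufflinks", "exon", "90162", "90231", "1000", "+", ".",
    'gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1"; '
    'FPKM "110.6292802224";',
])
exon = parse_gtf_line(line)
print(exon["attribute"]["gene_id"], exon["end"] - exon["start"] + 1)  # CUFF.1 70
```

GTF coordinates are 1-based and inclusive, which is why the exon length is `end - start + 1`.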

Variant Calling

• The .bam/.sam file contains the read alignments needed to call variants

• Variant calls can’t be extracted from the .bam file alone
  – Must provide the genome sequence

I8MVR:53:837 0 17_dna:chromosome 14090858 255 21M * 0 0 TAACTACGAATACCTGTCGAT **%-**,00%-*-%---*-*- NM:i:7 XX:Z:C5T3C2T2CT2C XM:Z:h..H......h.H...x...h XR:Z:CT XG:Z:CT
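
Why the genome sequence is required can be seen in a toy example: a read only becomes evidence for a variant once it is compared with the reference bases at its mapped position. The sketch below handles a single ungapped read against a made-up reference string. Real variant callers pile up many reads and weigh base and mapping qualities, none of which is modelled here, and `mismatches` is a hypothetical helper.

```python
def mismatches(reference: str, pos: int, read: str) -> list:
    """List (position, reference_base, read_base) for an ungapped alignment.

    pos is the 1-based leftmost mapping position, as in SAM column 4.
    """
    window = reference[pos - 1:pos - 1 + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    return [
        (pos + offset, ref_base, read_base)
        for offset, (ref_base, read_base) in enumerate(zip(window, read))
        if ref_base != read_base
    ]


reference = "ACGTACGTAC"  # made-up reference sequence
print(mismatches(reference, 3, "GTTCG"))  # [(5, 'A', 'T')]
```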

Today’s exercises

Variant Analysis

• Extract variant information from provided .bam file

• Examine output file and learn about the information contained in the various fields

Introducing… Dr. Eric Rouchka

• Bioinformatics Core Director
• Department of Computer Engineering and Computer Science
• University of Louisville
• Kentucky Biomedical Research Infrastructure Network