TOPHAT Next-Generation Sequencing Workshop RNA-Seq Mapping Center for Bioinformatics Hanqing Zhao 2011-07-11

TOPHAT

Next-Generation Sequencing Workshop

RNA-Seq Mapping

Center for Bioinformatics

Hanqing Zhao

2011-07-11

Missions for RNA-Seq Mapping

Before Tophat

• Previous software for aligning RNA-Seq data relies on known splice junctions and cannot identify novel ones.

TOPHAT

• TopHat is designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.

• TopHat is free and available from http://tophat.cbcb.umd.edu

Patterns of alternative splicing

Xing et al. 2006

Tophat pipeline

Trapnell et al. 2009

Step I: mapping with Bowtie

Adjustable parameters:
– mismatches
– multireads

Step II. island assembly

1. Use the Maq assembly module to produce pseudo-consensus exons (islands).
2. Use the reference genome to call bases.
3. Merge islands separated by short gaps (6 bp).
4. Extend each island by 45 bp on both sides.
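
The merge and extend steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not TopHat's actual code: the function name `merge_and_extend` and the half-open `(start, end)` interval representation are hypothetical, while the 6 bp gap and 45 bp flank come from the slide.

```python
def merge_and_extend(islands, max_gap=6, flank=45):
    """Merge islands separated by at most `max_gap` bp, then extend each by `flank` bp.

    `islands` is a list of (start, end) half-open intervals on one chromosome
    (a hypothetical representation chosen for this sketch).
    """
    merged = []
    for start, end in sorted(islands):
        # The gap to the previous island is start minus the previous end.
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Extend both sides, clamping at the start of the chromosome.
    return [(max(0, s - flank), e + flank) for s, e in merged]


if __name__ == "__main__":
    print(merge_and_extend([(100, 200), (204, 300), (1000, 1100)]))
```

The first two islands are 4 bp apart, so they are merged before the flanks are added; the third stays separate.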

Step III. Creating candidate junction database

• TopHat first enumerates all canonical donor and acceptor sites within the island sequences (as well as their reverse complements).

• Next, it considers all pairings of these sites that could form canonical (GT–AG) introns between neighboring (but not necessarily adjacent) islands.

• By default, TopHat only examines potential introns longer than 70 bp and shorter than 20 000 bp.
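
A much simplified sketch of the enumeration described above: it scans one forward-strand sequence for GT donors and AG acceptors and keeps the pairings whose length falls within the default intron bounds. The function name `candidate_introns` is hypothetical, and the sketch deliberately ignores the island structure and the reverse complement that TopHat also considers.

```python
def candidate_introns(seq, min_len=70, max_len=20000):
    """Return (start, end) half-open spans of possible canonical GT-AG introns.

    Simplification for illustration: one sequence, forward strand only.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    introns = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d  # the intron covers seq[d : a + 2]
            if min_len < length < max_len:
                introns.append((d, a + 2))
    return introns


if __name__ == "__main__":
    # Toy sequence: a GT donor, 76 filler bases, then an AG acceptor.
    toy = "CC" + "GT" + "C" * 76 + "AG" + "CC"
    print(candidate_introns(toy))
```

Pairing every donor with every acceptor is quadratic, which is one reason the search is restricted to neighboring islands and to a bounded intron length.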

Single island junctions

• To detect junctions that fall within a single island without sacrificing performance and specificity, TopHat looks for introns within islands that are deeply sequenced.

Step IV. Looking for junction reads

• Each possible intron is checked against the initially unmapped (IUM) reads for reads that span the splice junction.

• A seed-and-extend strategy is used to match reads to possible splice sites.
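
To make the seed-and-extend idea concrete, here is a toy sketch assuming exact matching only. The function `spans_junction`, its arguments, and the `seed_half` size are hypothetical and do not reflect TopHat's real index-based implementation: the seed is built from the bases on either side of the candidate junction, located in the read, and then the rest of the read is compared against the two exon flanks.

```python
def spans_junction(read, left_flank, right_flank, seed_half=4):
    """Return True if `read` matches exactly across the junction left_flank|right_flank.

    left_flank  : exon sequence ending at the donor side of the junction
    right_flank : exon sequence starting at the acceptor side of the junction
    """
    # Seed: last bases of the upstream exon joined to first bases of the downstream exon.
    seed = left_flank[-seed_half:] + right_flank[:seed_half]
    pos = read.find(seed)
    while pos != -1:
        cut = pos + seed_half  # position in the read where the junction falls
        # Extend: everything left of the cut must end the upstream exon,
        # everything right of the cut must begin the downstream exon.
        if left_flank.endswith(read[:cut]) and right_flank.startswith(read[cut:]):
            return True
        pos = read.find(seed, pos + 1)
    return False


if __name__ == "__main__":
    left, right = "AAACCCGGGT", "TTTACGATCA"
    print(spans_junction("CCGGGTTTTACG", left, right))
```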

Step V. Filtering false junctions

• Wang et al. (2008) observed that 86% of the minor isoforms were expressed at 15% or more of the level of the major isoform.

• For each junction, the average depth of read coverage is computed for the left and right flanking regions of the junction separately.

• The number of alignments crossing the junction is divided by the coverage of the more deeply covered side to obtain an estimate of the minor isoform frequency.

• 15% is the default cut-off.
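
The filter described above reduces to one division and one comparison. A minimal sketch, assuming the read counts and flanking coverages are already computed; the function names are hypothetical, and only the 15% cut-off comes from the slide.

```python
def minor_isoform_fraction(junction_alignments, left_coverage, right_coverage):
    """Junction-crossing alignments divided by the coverage of the deeper flank."""
    deeper = max(left_coverage, right_coverage)
    return junction_alignments / deeper if deeper > 0 else 0.0


def keep_junction(junction_alignments, left_coverage, right_coverage, min_fraction=0.15):
    """Keep the junction only if its estimated isoform fraction reaches the cut-off."""
    fraction = minor_isoform_fraction(junction_alignments, left_coverage, right_coverage)
    return fraction >= min_fraction


if __name__ == "__main__":
    print(keep_junction(3, 10.0, 40.0))   # 3 / 40 = 0.075, below the cut-off
    print(keep_junction(12, 10.0, 40.0))  # 12 / 40 = 0.30, above the cut-off
```

Dividing by the more deeply covered side makes the estimate conservative: a junction needs more supporting reads to survive next to a highly expressed exon.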

Old Tophat’s pipeline

Trapnell et al. 2009

Reads are becoming longer, and paired-end sequencing is more and more common …

Current Tophat (latest 1.3.1)

Tophat

Segment Search

Butterfly search

Closure search

Coverage Search

Gene model annotations

I. Segment search

--segment-length
--segment-mismatches
--min-segment-intron
--max-segment-intron

I. Segment search

II. Closure search

• Closure search is only used when TopHat is run with paired-end reads.
• Closure search should only be used when the expected inner distance between mates is small (<= 50bp).

--closure-search
--no-closure-search
--min-closure-intron
--max-closure-intron

III. Coverage search

--coverage-search (coverage search is disabled by default for reads 75bp or longer)
--no-coverage-search
--min-coverage-intron
--max-coverage-intron

IV. Butterfly search

--butterfly-search
Consider using this if you expect that your experiment produced a lot of reads from pre-mRNA that fall within the introns of your transcripts.

V. Junction annotations

--no-novel-junctions
Only look for reads across junctions indicated in the supplied GFF or junctions file.

-G/--GTF <GTF 2 or GFF3>
-j/--raw-juncs <.juncs file>

Input

tophat [options]* <index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]

• Reference sequence indexed with bowtie-build
• FASTQ sequences
– Quality format ?
    phred33 (default)
    --solexa-quals
    --solexa1.3-quals
– Paired-ends ?
– Strand-specific ?
– Multi-files ?
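
The quality-format question matters because the same FASTQ character decodes to different scores under different encodings. A small sketch, assuming the usual conventions that the default is Phred+33 and that --solexa1.3-quals means Phred+64; the function name is hypothetical, and the older Solexa-scaled encoding selected by --solexa-quals is not handled here.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 : Phred+33 (the default input TopHat expects)
    offset=64 : Phred+64 (what --solexa1.3-quals declares)
    """
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    print(phred_scores("II5+"))             # Phred+33 encoded
    print(phred_scores("hhTJ", offset=64))  # same scores, Phred+64 encoded
```

Decoding Phred+64 data with the default offset would inflate every score by 31, which is why the quality flag has to match the data.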

• The software is optimized for reads 75bp or longer.

• Mixing paired- and single- end reads together is not supported.

Strand-specific data

--library-type TopHat will treat the reads as strand specific.

Paired-end data

-r/--mate-inner-dist <int>
This is the expected (mean) inner distance between mate pairs.

--mate-std-dev <int>
The standard deviation for the distribution on inner distances between mate pairs.
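
The inner distance is the part of the fragment that neither mate covers, so it is the fragment length minus both read lengths. A short worked sketch; the function name and the 200 bp fragment length are assumptions chosen for illustration.

```python
def mate_inner_distance(fragment_length, read_length):
    """Expected inner distance for -r: fragment length minus the two mates."""
    return fragment_length - 2 * read_length


if __name__ == "__main__":
    # Assumed 200 bp fragments sequenced at two different read lengths.
    print(mate_inner_distance(200, 50))  # 50 nt mates
    print(mate_inner_distance(200, 75))  # 75 nt mates
```

Longer reads from the same library therefore need a smaller -r value, and the value can become negative when the mates overlap.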

Other parameters

--bowtie-n (after tophat 1.3.0)
-g/--max-multihits
-a/--min-anchor-length (>=3, default 8)
-m/--splice-mismatches (default 0)
-F/--min-isoform-fraction <0.0-1.0>
-p/--num-threads
--keep-tmp

Output

• accepted_hits.bam: a list of read alignments in BAM format (the binary form of SAM).
• junctions.bed
• insertions.bed
• deletions.bed
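
A hedged sketch of reading one record from junctions.bed. It assumes the layout described in the TopHat manual: a 12-column BED line with two blocks, one per flanking overhang, and the score column holding the number of alignments that span the junction. The function name and the sample line are made up for illustration, and header lines such as `track` are not handled.

```python
def parse_junction(line):
    """Extract intron coordinates and supporting-read count from a BED12 junction line."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, score, strand = fields[0], int(fields[1]), int(fields[4]), fields[5]
    block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    return {
        "chrom": chrom,
        "strand": strand,
        "reads": score,
        # The intron begins where the first block ends ...
        "intron_start": start + block_sizes[0],
        # ... and ends where the second block begins (0-based, half-open).
        "intron_end": start + block_starts[1],
    }


if __name__ == "__main__":
    # Hypothetical record, not real TopHat output.
    sample = "chr1\t1000\t1200\tJUNC00000001\t12\t+\t1000\t1200\t255,0,0\t2\t40,30\t0,170"
    print(parse_junction(sample))
```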

References

• Trapnell C, Pachter L, Salzberg SL (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. doi:10.1093/bioinformatics/btp120

• TopHat manual: http://tophat.cbcb.umd.edu/manual.html

• Further reading: TopHat-Fusion

Practice time

All the files are at: ngs_vm1:
• Reference sequence: REF.fa
• RNA-seq data 1: SampleA.Run01 SampleA.Run02
– paired-end
– 50nt at each end
– Phred33 quality
• RNA-seq data 2: SampleB.Run01
– paired-end
– 75nt at each end
– strand-specific
– solexa1.3-quals

• Index the genome sequence

    bowtie-build REF.fa REF

• Run tophat

    tophat --version    # updates are frequent; the version is important

    tophat              # go through all the parameters

    tophat \
        -o sampleA.output \
        -r 100 \
        --mate-std-dev 30 \
        REF \
        SampleA.Run01.1.fastq,SampleA.Run02.1.fastq \
        SampleA.Run01.2.fastq,SampleA.Run02.2.fastq

• Run tophat

    tophat \
        -o sampleB.output \
        -r 50 \
        --mate-std-dev 30 \
        --library-type fr-firststrand \
        --solexa1.3-quals \
        REF \
        SampleB.Run01.1.fastq \
        SampleB.Run01.2.fastq
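
When the same run has to be repeated for several samples, it can help to assemble the command line programmatically. A minimal sketch, assuming a hypothetical helper `tophat_command` that only knows the options used in the practice commands; it builds the argument list without executing anything.

```python
def tophat_command(index_base, reads_1, reads_2, output_dir, inner_dist, std_dev,
                   library_type=None, solexa13_quals=False):
    """Build a tophat argument list for paired-end reads (illustrative helper)."""
    cmd = ["tophat", "-o", output_dir,
           "-r", str(inner_dist), "--mate-std-dev", str(std_dev)]
    if library_type:
        cmd += ["--library-type", library_type]
    if solexa13_quals:
        cmd.append("--solexa1.3-quals")
    # Multiple runs of one sample are passed as comma-separated file lists.
    cmd += [index_base, ",".join(reads_1), ",".join(reads_2)]
    return cmd


if __name__ == "__main__":
    print(" ".join(tophat_command(
        "REF",
        ["SampleA.Run01.1.fastq", "SampleA.Run02.1.fastq"],
        ["SampleA.Run01.2.fastq", "SampleA.Run02.2.fastq"],
        "sampleA.output", 100, 30)))
```

The returned list could then be passed to `subprocess.run(cmd, check=True)` on a machine where tophat is installed; keeping the arguments as a list avoids shell quoting problems.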
