
TOPHAT

Next-Generation Sequencing Workshop

RNA-Seq Mapping

Center for Bioinformatics

Hanqing Zhao

2011-07-11

Missions for RNA-Seq Mapping

Before Tophat

• Earlier software for aligning RNA-Seq data relied on known splice junctions and could not identify novel ones.

TOPHAT

• TopHat is designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.

• Tophat is free and available from http://tophat.cbcb.umd.edu


Patterns of alternative splicing

Xing et al. 2006

Tophat pipeline

Trapnell et al. 2009

Step I. Mapping with Bowtie

Adjustable parameters:
– mismatches
– multireads

Step II. Island assembly

1. Use the Maq assembly module to produce pseudo-consensus exons (islands).

2. Use the reference genome to call bases.

3. Merge islands separated by short exon gaps (6 bp).

4. Elongate each island by 45 bp on both sides.
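The merging and elongation steps above can be sketched in a few lines of Python. This is only an illustration, not TopHat's own code: the function name `merge_and_extend`, the half-open `(start, end)` interval representation, and the reading of the 6 bp rule as "merge islands whose gap is at most 6 bp" are all assumptions made for the example.

```python
def merge_and_extend(islands, max_gap=6, flank=45):
    """Merge islands separated by at most max_gap bases, then pad both sides.

    islands: iterable of (start, end) half-open intervals on one chromosome.
    The interval representation and the "<= max_gap" rule are assumptions.
    """
    merged = []
    for start, end in sorted(islands):
        if merged and start - merged[-1][1] <= max_gap:
            # The gap to the previous island is small: fuse the two islands.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Elongate both sides, never going below coordinate 0.
    return [(max(0, s - flank), e + flank) for s, e in merged]


# Hypothetical islands: the first two are 4 bp apart and get merged.
islands = [(100, 180), (184, 260), (1000, 1090)]
print(merge_and_extend(islands))
```

The sketch does not re-merge islands that come to overlap after elongation; whether that matters depends on how the islands are used downstream.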

Step III. Creating candidate junction database

• TopHat first enumerates all canonical donor and acceptor sites within the island sequences (as well as their reverse complements).

• Next, it considers all pairings of these sites that could form canonical (GT–AG) introns between neighboring (but not necessarily adjacent) islands.

• By default, TopHat only examines potential introns longer than 70 bp and shorter than 20 000 bp.
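As a rough illustration of the enumeration described above, the following Python sketch pairs GT donor sites with downstream AG acceptor sites in a single forward-strand sequence and keeps the pairs whose implied intron length lies between the default limits. It is a simplification under stated assumptions: the function names are hypothetical, it ignores the island structure and the reverse complement, and it is not how TopHat itself is implemented.

```python
def find_sites(seq, motif):
    """Return every 0-based start position of motif in seq (overlaps allowed)."""
    positions = []
    pos = seq.find(motif)
    while pos != -1:
        positions.append(pos)
        pos = seq.find(motif, pos + 1)
    return positions


def candidate_introns(seq, min_len=70, max_len=20000):
    """Pair GT donors with AG acceptors whose intron length is within limits.

    Returns half-open (intron_start, intron_end) pairs: the intron starts at
    the G of GT and ends just after the G of AG.
    """
    donors = find_sites(seq, "GT")
    acceptors = find_sites(seq, "AG")
    introns = []
    for d in donors:
        for a in acceptors:
            length = (a + 2) - d
            # "longer than 70 bp and shorter than 20 000 bp": strict bounds.
            if min_len < length < max_len:
                introns.append((d, a + 2))
    return introns


# Made-up sequence with one GT...AG pair spanning 84 bp.
seq = "CCGT" + "C" * 80 + "AGCC"
print(candidate_introns(seq))
```

The nested loop is quadratic in the number of sites, which is fine for a toy example; a real implementation would restrict the pairing to neighboring islands first.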

Single island junctions

• In order to detect such junctions without sacrificing performance and specificity, TopHat looks for introns within islands that are deeply sequenced.

• Each possible intron is checked against the initially unmapped (IUM) reads for reads that span the splice junction.

• The seed-and-extend strategy is used to match reads to possible splice sites.

Step IV. Looking for junction reads

Step V. Filtering false junctions

• Wang et al. (2008) observed that 86% of the minor isoforms were expressed at at least 15% of the level of the major isoform.

• For each junction, the average depth of read coverage is computed for the left and right flanking regions of the junction separately.

• The number of alignments crossing the junction is divided by the coverage of the more deeply covered side to obtain an estimate of the minor isoform frequency.

• 15% is the default cut-off.
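The filter above is a simple ratio, which the following Python sketch spells out. The function names and the example numbers are hypothetical, and treating a fraction exactly equal to the cut-off as passing is an assumption.

```python
def minor_isoform_fraction(junction_reads, left_depth, right_depth):
    """Junction-spanning alignments divided by the deeper flank's coverage."""
    deeper = max(left_depth, right_depth)
    if deeper == 0:
        return 0.0  # no coverage on either side: nothing to support the junction
    return junction_reads / deeper


def keep_junction(junction_reads, left_depth, right_depth, min_fraction=0.15):
    """True if the estimated minor isoform frequency reaches the cut-off."""
    fraction = minor_isoform_fraction(junction_reads, left_depth, right_depth)
    return fraction >= min_fraction  # inclusive comparison is an assumption


# Made-up junctions: 10 and 3 spanning reads, flank coverage 20x and 40x.
print(keep_junction(10, 20.0, 40.0))  # 10 / 40 = 0.25
print(keep_junction(3, 20.0, 40.0))   # 3 / 40 = 0.075
```

Dividing by the more deeply covered side makes the estimate conservative: a junction has to be well supported relative to the best-expressed flank to survive.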

The original TopHat pipeline

Trapnell et al. 2009

Reads are becoming longer, and paired-end sequencing is increasingly common …

Current TopHat (latest version: 1.3.1)

Tophat

Segment Search

Butterfly search

Closure search

Coverage Search

Gene model annotations

I. Segment search

--segment-length
--segment-mismatches
--min-segment-intron
--max-segment-intron
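To make --segment-length concrete: the TopHat manual describes reads being cut into segments, each at least this long, which are then mapped independently. The Python sketch below shows one way such a split could look. The function name is hypothetical, the default of 25 is taken from the manual for this TopHat generation, and appending leftover bases to the last segment is an assumption rather than a description of TopHat's internals.

```python
def split_into_segments(read, segment_length=25):
    """Cut a read into consecutive segments of at least segment_length bases.

    Leftover bases that cannot form a full segment are appended to the last
    segment (an assumption). Reads shorter than segment_length stay whole.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    n_segments = max(1, len(read) // segment_length)
    segments = [
        read[i * segment_length:(i + 1) * segment_length]
        for i in range(n_segments - 1)
    ]
    # The final segment takes everything that remains.
    segments.append(read[(n_segments - 1) * segment_length:])
    return segments


read = "ACGT" * 15  # a made-up 60 nt read
print([len(s) for s in split_into_segments(read)])
```

A read that spans a junction will typically have at least one segment that fails to align contiguously, and that failure is the signal the segment search exploits.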


II. Closure search

• Closure search is only used when TopHat is run with paired-end reads.

• Closure search should only be used when the expected inner distance between mates is small (<= 50 bp).

--closure-search
--no-closure-search
--min-closure-intron
--max-closure-intron

III. Coverage search

--coverage-search: disabled by default for reads 75 bp or longer
--no-coverage-search
--min-coverage-intron
--max-coverage-intron

IV. Butterfly search

--butterfly-search

Consider using this if you expect that your experiment produced a lot of reads from pre-mRNA that fall within the introns of your transcripts.

V. Junction annotations

--no-novel-junctions Only look for reads across junctions indicated in the supplied GFF or junctions file.

-G/--GTF <GTF 2 or GFF3>
-j/--raw-juncs <.juncs file>

Input

tophat [options]* <index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]

• Reference sequence indexed with bowtie-build

• Fastq sequences
– Quality format?
   phred33 (default)
   --solexa-quals
   --solexa1.3-quals
– Paired-end?
– Strand-specific?
– Multiple files?

• The software is optimized for reads 75bp or longer.

• Mixing paired- and single- end reads together is not supported.
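The quality format question in the list above matters because the same character means different scores under different encodings. The following Python sketch decodes a quality string under the three formats named there. It is a minimal illustration: the function and table names are hypothetical, and the offsets (33 for phred33, 64 for both Solexa variants, with the older --solexa-quals scale converted to Phred) reflect the commonly documented conventions rather than anything stated on the slides.

```python
import math

# ASCII offsets per format (commonly documented conventions; assumption here).
OFFSETS = {"phred33": 33, "solexa1.3-quals": 64, "solexa-quals": 64}


def decode_qualities(qual, fmt="phred33"):
    """Convert a FASTQ quality string into a list of Phred scores."""
    scores = [ord(ch) - OFFSETS[fmt] for ch in qual]
    if fmt == "solexa-quals":
        # Older Solexa scores use a log-odds scale; map them onto Phred.
        scores = [round(10 * math.log10(10 ** (q / 10) + 1)) for q in scores]
    return scores


print(decode_qualities("I5+", "phred33"))
print(decode_qualities("hTJ", "solexa1.3-quals"))
```

Decoding a Phred+64 file as phred33 would inflate every score by 31, which is why the format flag has to match the data.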

Strand-specific data

--library-type TopHat will treat the reads as strand specific.

Paired-end data

-r/--mate-inner-dist <int>
This is the expected (mean) inner distance between mate pairs.

--mate-std-dev <int>
The standard deviation for the distribution of inner distances between mate pairs.
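The inner distance is the fragment length minus the two read lengths, so both values can be estimated from a sample of fragment sizes. The Python sketch below shows the arithmetic; the function names and the fragment lengths are made up for illustration.

```python
import statistics


def mate_inner_distance(fragment_length, read_length):
    """Inner distance = fragment length minus both mates' read lengths."""
    return fragment_length - 2 * read_length


def estimate_mate_params(fragment_lengths, read_length):
    """Return (mean inner distance, standard deviation), both rounded to int."""
    inner = [mate_inner_distance(f, read_length) for f in fragment_lengths]
    return round(statistics.mean(inner)), round(statistics.stdev(inner))


# Hypothetical library: fragments of about 200 bp, sequenced 50 nt per end.
print(mate_inner_distance(200, 50))
print(estimate_mate_params([170, 200, 230], 50))
```

With 50 nt mates and fragments around 200 bp, the values come out as -r 100 and --mate-std-dev 30; with 75 nt mates from the same fragment size, the inner distance shrinks to 50.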

Other parameters

--bowtie-n (TopHat 1.3.0 and later)
-g/--max-multihits
-a/--min-anchor-length (>= 3, default 8)
-m/--splice-mismatches (default 0)
-F/--min-isoform-fraction <0.0-1.0>
-p/--num-threads
--keep-tmp

Output

• accepted_hits.bam: a list of read alignments, stored as BAM (compressed binary SAM).

• junctions.bed

• insertions.bed

• deletions.bed
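For downstream work it is often useful to turn junctions.bed into intron coordinates. According to the TopHat manual, each junction is written as two connected BED blocks whose lengths are the maximal read overhangs, and the score is the number of alignments spanning the junction. The Python sketch below parses one such line under that description. The function name and the example line are made up, and the BED12 column layout is assumed.

```python
def parse_junction(line):
    """Parse one junctions.bed line into a dict with intron coordinates.

    Returns None for track/header lines. Coordinates are 0-based, half-open.
    """
    if line.startswith("track") or not line.strip():
        return None
    f = line.rstrip("\n").split("\t")
    start = int(f[1])
    sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    starts = [int(x) for x in f[11].rstrip(",").split(",")]
    return {
        "chrom": f[0],
        "name": f[3],
        "reads": int(f[4]),                 # alignments spanning the junction
        "strand": f[5],
        "intron_start": start + sizes[0],   # first base after the left block
        "intron_end": start + starts[1],    # first base of the right block
    }


# Made-up example line in BED12 layout.
line = "chr1\t100\t400\tJUNC00000001\t12\t+\t100\t400\t255,0,0\t2\t30,40\t0,260"
junction = parse_junction(line)
print(junction["intron_start"], junction["intron_end"], junction["reads"])
```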

References

• Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics doi:10.1093/bioinformatics/btp120

• TopHat manual: http://tophat.cbcb.umd.edu/manual.html

• Further reading: TopHat-Fusion


Practice time

All the files are at: ngs_vm1:

• Reference sequence: REF.fa

• RNA-seq data 1: SampleA.Run01, SampleA.Run02
– paired-end
– 50 nt at each end
– Phred33 quality

• RNA-seq data 2: SampleB.Run01
– paired-end
– 75 nt at each end
– strand-specific
– solexa1.3-quals

• Index the genome sequence

bowtie-build REF.fa REF

• Run tophat

tophat --version    # update is frequent; version is important

tophat              # go through all the parameters

tophat \
    -o sampleA.ouput \
    -r 100 \
    --mate-std-dev 30 \
    REF \
    SampleA.Run01.1.fastq,SampleA.Run02.1.fastq \
    SampleA.Run01.2.fastq,SampleA.Run02.2.fastq

• Run tophat

tophat \
    -o sampleB.ouput \
    -r 50 \
    --mate-std-dev 30 \
    --library-type fr-firststrand \
    --solexa1.3-quals \
    REF \
    SampleB.Run01.1.fastq \
    SampleB.Run01.2.fastq
