
TOPHAT

Next-Generation Sequencing Workshop

RNA-Seq Mapping

Center for Bioinformatics

Hanqing Zhao

2011-07-11

Missions for RNA-Seq Mapping

Before Tophat

• Earlier software for aligning RNA-Seq data relied on known splice junctions and could not identify novel ones.

TOPHAT

• TopHat is designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.

• Tophat is free and available from http://tophat.cbcb.umd.edu

Patterns of alternative splicing

Xing et al. 2006

Tophat pipeline

Trapnell et al. 2009

Step I: mapping with Bowtie

Adjustable parameters:

– mismatches

– multireads

Step II. island assembly

1. Use the Maq assembly module to produce pseudo-consensus exons (islands).

2. Use the reference genome to call bases.

3. Merge islands separated by coverage gaps of up to 6 bp.

4. Extend each island by 45 bp on both sides.
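The merge and extension steps can be illustrated with a small interval routine. This is only a sketch under stated assumptions, not TopHat's code: the function name `build_islands`, the half-open `(start, end)` coordinates, and the optional `chrom_len` clamp are hypothetical, and overlaps created by padding neighbouring islands are left unresolved.

```python
def build_islands(covered, max_gap=6, flank=45, chrom_len=None):
    """Merge covered intervals separated by small gaps, then pad both sides.

    covered: list of half-open (start, end) intervals that have read coverage.
    max_gap and flank mirror the 6 bp and 45 bp values quoted on the slide.
    """
    merged = []
    for start, end in sorted(covered):
        # Number of uncovered bases between the previous island and this one.
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    islands = []
    for start, end in merged:
        lo = max(0, start - flank)
        hi = end + flank
        if chrom_len is not None:
            hi = min(chrom_len, hi)  # do not run past the end of the reference
        islands.append((lo, hi))
    return islands


if __name__ == "__main__":
    # Two blocks 4 bp apart are merged; the distant third block stays separate.
    print(build_islands([(100, 150), (154, 200), (400, 450)]))
```

In this toy input the first two blocks become one island because the 4 bp gap is within the 6 bp limit.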

Step III. Creating candidate junction database

• TopHat first enumerates all canonical donor and acceptor sites within the island sequences (as well as their reverse complements).

• Next, it considers all pairings of these sites that could form canonical (GT–AG) introns between neighboring (but not necessarily adjacent) islands.

• By default, TopHat only examines potential introns longer than 70 bp and shorter than 20 000 bp.
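The enumeration described above can be sketched in a few lines. This is an illustrative simplification, not TopHat's implementation: it scans only the forward strand (the reverse-complement pass is omitted), and the function name `candidate_introns` and the half-open coordinates are assumptions made for the example.

```python
def candidate_introns(seq, islands, min_intron=70, max_intron=20000):
    """List candidate GT-AG introns between islands on the forward strand.

    seq: reference sequence as an upper-case string.
    islands: sorted list of half-open (start, end) island intervals.
    Returns (intron_start, intron_end) pairs; the intron spans seq[start:end],
    so it begins with GT and ends with AG.
    """
    donors, acceptors = [], []
    for idx, (start, end) in enumerate(islands):
        for pos in range(start, end - 1):
            dinuc = seq[pos:pos + 2]
            if dinuc == "GT":
                donors.append((pos, idx))          # intron starts at the G
            elif dinuc == "AG":
                acceptors.append((pos + 2, idx))   # intron ends after the G

    introns = []
    for d_pos, d_island in donors:
        for a_pos, a_island in acceptors:
            # Pair a donor with an acceptor in any later island, keeping only
            # introns longer than min_intron and shorter than max_intron.
            if a_island > d_island and min_intron < a_pos - d_pos < max_intron:
                introns.append((d_pos, a_pos))
    return introns


if __name__ == "__main__":
    toy = "C" * 10 + "GT" + "C" * 96 + "AG" + "C" * 10
    print(candidate_introns(toy, [(0, 12), (108, 120)]))
```

The toy sequence contains one donor and one acceptor 100 bp apart, so a single candidate intron is reported.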

Single island junctions

• To detect junctions that lie within a single island without sacrificing performance and specificity, TopHat looks for introns within islands that are deeply sequenced.

• Each possible intron is checked against the IUM (initially unmapped) reads for reads that span the splice junction.

• The seed-and-extend strategy is used to match reads to possible splice sites.

Step IV. Looking for junction reads

Step V. Filtering false junctions

• Wang et al. (2008) observed that 86% of the minor isoforms were expressed at 15% or more of the level of the major isoform.

• For each junction, the average depth of read coverage is computed for the left and right flanking regions of the junction separately.

• The number of alignments crossing the junction is divided by the coverage of the more deeply covered side to obtain an estimate of the minor isoform frequency.

• 15% is the default cut-off.
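The filter described above amounts to a single ratio. The sketch below assumes hypothetical function names and treats the cut-off as inclusive (`>=`), which the slide does not specify.

```python
def minor_isoform_fraction(junction_reads, left_cov, right_cov):
    """Junction-spanning alignments divided by the deeper flank's coverage."""
    deeper = max(left_cov, right_cov)
    if deeper == 0:
        return 0.0  # no flanking coverage: nothing to support the junction
    return junction_reads / deeper


def keep_junction(junction_reads, left_cov, right_cov, min_fraction=0.15):
    """Apply the 15% default cut-off (inclusive here, by assumption)."""
    fraction = minor_isoform_fraction(junction_reads, left_cov, right_cov)
    return fraction >= min_fraction


if __name__ == "__main__":
    # 3 spanning alignments against flanks covered 10x and 40x.
    print(minor_isoform_fraction(3, 10.0, 40.0), keep_junction(3, 10.0, 40.0))
```

With 3 spanning alignments and a deeper flank at 40x, the estimate is 3/40 = 7.5%, so the junction would be discarded at the default cut-off.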

Old Tophat’s pipeline

Trapnell et al. 2009

Reads are becoming longer, and paired-end sequencing is increasingly common …

Current Tophat (latest 1.3.1)

Tophat

Segment Search

Butterfly search

Closure search

Coverage Search

Gene model annotations

I. Segment search

--segment-length

--segment-mismatches

--min-segment-intron

--max-segment-intron


II. Closure search

• Closure search is only used when TopHat is run with paired-end reads.

• Closure search should only be used when the expected inner distance between mates is small (<= 50 bp).

--closure-search

--no-closure-search

--min-closure-intron

--max-closure-intron

III. Coverage search

--coverage-search (coverage search is disabled by default for reads 75 bp or longer)

--no-coverage-search

--min-coverage-intron

--max-coverage-intron

IV. Butterfly search

--butterfly-search

Consider using this if you expect that your experiment produced a lot of reads from pre-mRNA that fall within the introns of your transcripts.

V. Junction annotations

--no-novel-junctions

Only look for reads across junctions indicated in the supplied GFF or junctions file.

-G/--GTF <GTF 2 or GFF3>

-j/--raw-juncs <.juncs file>

Input

tophat [options]* <index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]

• Reference sequence indexed by bowtie_index

• Fastq sequences

– Quality format?

phred33 (default)

--solexa-quals

--solexa1.3-quals

– Paired-ends?

– Strand-specific?

– Multi-files?

• The software is optimized for reads 75bp or longer.

• Mixing paired-end and single-end reads together is not supported.
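For the quality-format question above, it can help to see how the encodings differ. The sketch below decodes a FASTQ quality string for the phred33 default and for the Phred+64 encoding selected by --solexa1.3-quals; the older Solexa-scaled scores (--solexa-quals) are left out. The function names are hypothetical, and `guess_encoding` is only a rough heuristic: a file made up entirely of high-quality phred33 bases would be misclassified.

```python
# ASCII offsets for the two Phred-scaled encodings.
OFFSETS = {"phred33": 33, "solexa1.3": 64}


def decode_quals(qual_string, encoding="phred33"):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    offset = OFFSETS[encoding]
    return [ord(ch) - offset for ch in qual_string]


def guess_encoding(qual_string):
    """Heuristic guess: characters below ';' (ASCII 59) cannot occur in the
    64-offset encodings, so their presence indicates phred33."""
    if min(ord(ch) for ch in qual_string) < 59:
        return "phred33"
    return "solexa1.3"


if __name__ == "__main__":
    print(decode_quals("II5!"))                 # phred33-encoded example
    print(decode_quals("hhT@", "solexa1.3"))    # same scores, offset 64
```

Both example strings encode the same four scores, which shows why passing the wrong quality option shifts every score by 31.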

Strand-specific data

--library-type

TopHat will treat the reads as strand-specific.

Paired-end data

-r/--mate-inner-dist <int>

This is the expected (mean) inner distance between mate pairs.

--mate-std-dev <int>

The standard deviation for the distribution on inner distances between mate pairs.
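The value for -r is usually derived from the library's fragment size. A minimal sketch, assuming the fragment length excludes adapters and both mates have the same length (the function name is hypothetical):

```python
def mate_inner_dist(fragment_len, read_len):
    """Expected inner distance: fragment length minus the two mate lengths."""
    return fragment_len - 2 * read_len


if __name__ == "__main__":
    # Assumed 200 bp fragments, sequenced as 2 x 50 nt and as 2 x 75 nt.
    print(mate_inner_dist(200, 50), mate_inner_dist(200, 75))
```

Under the assumed 200 bp fragment size, 50 nt mates give 100 and 75 nt mates give 50, which matches the -r values used in the practice commands later in the deck.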

Other parameters

--bowtie-n (after TopHat 1.3.0)

-g/--max-multihits

-a/--min-anchor-length (>=3, default 8)

-m/--splice-mismatches (default 0)

-F/--min-isoform-fraction <0.0-1.0>

-p/--num-threads

--keep-tmp

Output

• accepted_hits.bam: a list of read alignments in BAM format (binary SAM).

• junctions.bed

• insertions.bed

• deletions.bed
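To work with junctions.bed downstream, the intron coordinates have to be recovered from the BED blocks. The sketch below assumes the layout described in the TopHat manual, where each junction is a 12-column BED line with two blocks (the read overhangs on either side) and the score is the number of spanning alignments. The sample line and the function name are made up for illustration.

```python
def parse_junction(line):
    """Extract intron coordinates and read support from one junctions.bed line.

    Coordinates are 0-based and half-open, as in BED.
    """
    f = line.rstrip("\n").split("\t")
    chrom, start = f[0], int(f[1])
    score, strand = int(f[4]), f[5]
    block_sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in f[11].rstrip(",").split(",")]
    # The intron lies between the end of block 1 and the start of block 2.
    intron_start = start + block_sizes[0]
    intron_end = start + block_starts[1]
    return {
        "chrom": chrom,
        "intron_start": intron_start,
        "intron_end": intron_end,
        "strand": strand,
        "reads": score,
    }


if __name__ == "__main__":
    sample = "chr1\t1000\t1250\tJUNC00000001\t12\t+\t1000\t1250\t255,0,0\t2\t30,40\t0,210"
    print(parse_junction(sample))
```

When reading a real file, skip any line that starts with "track" before passing lines to the parser.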

References

• Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (2009). doi:10.1093/bioinformatics/btp120

• TopHat manual: http://tophat.cbcb.umd.edu/manual.html

• Further reading: TopHat-Fusion

Practice time

All the files are at: ngs_vm1:

• Reference sequence: REF.fa

• RNA-seq data 1: SampleA.Run01 SampleA.Run02

– paired-end

– 50nt at each end

– Phred33 quality

• RNA-seq data 2: SampleB.Run01

– paired-end

– 75nt at each end

– strand-specific

– solexa1.3-quals

• Index the genome sequence

bowtie-build REF.fa REF

• Run tophat

tophat --version # update is frequent; version is important

tophat # go through all the parameters

tophat \
    -o sampleA.ouput \
    -r 100 \
    --mate-std-dev 30 \
    REF \
    SampleA.Run01.1.fastq,SampleA.Run02.1.fastq \
    SampleA.Run01.2.fastq,SampleA.Run02.2.fastq

• Run tophat

tophat \
    -o sampleB.ouput \
    -r 50 \
    --mate-std-dev 30 \
    --library-type fr-firststrand \
    --solexa1.3-quals \
    REF \
    SampleB.Run01.1.fastq \
    SampleB.Run01.2.fastq
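When several runs of one sample have to be passed together, the comma-joined file lists are easy to get wrong by hand. The helper below is a hypothetical convenience, not part of TopHat: it only assembles the argument list, using the options that appear in the commands above, and leaves execution to the caller.

```python
import shlex


def tophat_command(index_base, mates1, mates2, out_dir,
                   inner_dist, std_dev, extra=()):
    """Assemble a paired-end tophat invocation as an argument list.

    mates1 / mates2: FASTQ files for the first and second mates, given in the
    same run order; each list is joined with commas as tophat expects.
    extra: additional options placed before the index name.
    """
    if len(mates1) != len(mates2):
        raise ValueError("each run needs both a first-mate and a second-mate file")
    return [
        "tophat",
        "-o", out_dir,
        "-r", str(inner_dist),
        "--mate-std-dev", str(std_dev),
        *extra,
        index_base,
        ",".join(mates1),
        ",".join(mates2),
    ]


if __name__ == "__main__":
    cmd = tophat_command(
        "REF",
        ["SampleA.Run01.1.fastq", "SampleA.Run02.1.fastq"],
        ["SampleA.Run01.2.fastq", "SampleA.Run02.2.fastq"],
        out_dir="sampleA.ouput", inner_dist=100, std_dev=30,
    )
    print(shlex.join(cmd))  # shlex.join requires Python 3.8 or later
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where tophat is installed.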
