TOPHAT
Next-Generation Sequencing Workshop
RNA-Seq Mapping
Center for Bioinformatics
Hanqing Zhao
2011-07-11
Missions for RNA-Seq Mapping
Before Tophat
• Previous software for aligning RNA-Seq data relies on known splice junctions and cannot identify novel ones.
TOPHAT
• Tophat is designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.
• Tophat is free and available from http://tophat.cbcb.umd.edu
Patterns of alternative splicing
Xing et al. 2006
Tophat pipeline
Trapnell et al. 2009
Step I: mapping with Bowtie
Adjustable parameters:
– mismatches
– multireads
Step II. island assembly
1. Use Maq assembly module to produce pseudo-consensus exons (islands).
2. Use reference genome to call bases.
3. Merge exon gaps (6 bp).
4. Extend each island by 45 bp on both sides.
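The gap-merging and extension steps above can be illustrated with a small interval sketch. This is not TopHat's code: the function name `merge_and_extend`, the half-open coordinates, and the toy islands are assumptions made for illustration. Only the 6 bp gap and 45 bp flank come from the slide.

```python
def merge_and_extend(islands, max_gap=6, flank=45):
    """Merge islands whose gap is <= max_gap bp, then pad each by flank bp.

    Islands are half-open (start, end) intervals on one chromosome
    (an assumed convention for this sketch).
    """
    merged = []
    for start, end in sorted(islands):
        # Gap between this island and the previous merged one.
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Extend both sides, never going below coordinate 0.
    return [(max(0, s - flank), e + flank) for s, e in merged]


# Hypothetical islands: the first two are 4 bp apart and get merged.
islands = [(100, 150), (154, 200), (400, 450)]
print(merge_and_extend(islands))
```

Extending the islands gives the later junction search some reference sequence beyond the covered region, so splice sites at poorly covered exon ends are not lost.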
Step III. Creating candidate junction database
• TopHat first enumerates all canonical donor and acceptor sites within the island sequences (as well as their reverse complements).
• Next, it considers all pairings of these sites that could form canonical (GT–AG) introns between neighboring (but not necessarily adjacent) islands.
• By default, TopHat only examines potential introns longer than 70 bp and shorter than 20 000 bp.
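A minimal sketch of the pairing step, under simplifying assumptions: it scans a single forward-strand sequence rather than a set of islands, ignores reverse complements, and uses a hypothetical function name `candidate_introns`. The length bounds are the defaults quoted above.

```python
def candidate_introns(seq, min_len=70, max_len=20000):
    """Return (start, end) of every GT...AG span with min_len < length < max_len.

    start is the index of the G in GT; end is one past the G in AG
    (half-open, an assumed convention for this sketch).
    """
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    pairs = []
    for d in donors:
        for a in acceptors:
            if min_len < a - d < max_len:
                pairs.append((d, a))
    return pairs


# Toy sequence: one GT donor, 80 filler bases, then an AG acceptor.
seq = "AAGT" + "C" * 80 + "AGTT"
print(candidate_introns(seq))
```

The all-against-all pairing is quadratic in the number of sites, which is one reason the intron length limits matter in practice.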
Single island junctions
• In order to detect such junctions without sacrificing performance and specificity, TopHat looks for introns within islands that are deeply sequenced.
• Each possible intron is checked against the IUM reads for reads that span the splice junction.
• The seed-and-extend strategy is used to match reads to possible splice sites.
Step IV. Looking for junction reads
Step V. Filtering false junctions
• Wang et al. (2008) observed that 86% of the minor isoforms were expressed at 15% or more of the level of the major isoform.
• For each junction, the average depth of read coverage is computed for the left and right flanking regions of the junction separately.
• The number of alignments crossing the junction is divided by the coverage of the more deeply covered side to obtain an estimate of the minor isoform frequency.
• 15% is the default cut-off.
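The filter described above comes down to one ratio. The sketch below is an illustration only: the function name, the handling of zero coverage, and the use of ">=" at the cut-off are assumptions, not details confirmed by the slides.

```python
def passes_isoform_filter(junction_reads, left_depth, right_depth,
                          min_fraction=0.15):
    """Keep a junction if its estimated minor isoform frequency reaches the cut-off.

    junction_reads: number of alignments crossing the junction
    left_depth, right_depth: average coverage of the two flanking regions
    """
    deeper = max(left_depth, right_depth)
    if deeper == 0:
        # No flanking coverage: nothing to normalise against (assumed behaviour).
        return False
    return junction_reads / deeper >= min_fraction


# Hypothetical numbers: 3 / 40 = 0.075 is rejected, 8 / 40 = 0.20 is kept.
print(passes_isoform_filter(3, 40.0, 10.0))
print(passes_isoform_filter(8, 40.0, 10.0))
```

Dividing by the more deeply covered side makes the estimate conservative, since a junction next to a highly expressed exon needs more supporting reads to pass.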
Old Tophat’s pipeline
Trapnell et al. 2009
Reads are becoming longer, and paired-end sequencing is more and more common …
Current Tophat (latest 1.3.1)
Tophat
Segment Search
Butterfly search
Closure search
Coverage Search
Gene model annotations
I. Segment search
--segment-length
--segment-mismatches
--min-segment-intron
--max-segment-intron
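To make --segment-length concrete, here is a rough sketch of cutting a read into fixed-size segments that could then be mapped independently. The function name is hypothetical, the default of 25 follows the TopHat manual, and the way a leftover tail is handled here is an assumption that may differ from TopHat itself.

```python
def split_into_segments(read, segment_length=25):
    """Cut a read into consecutive segments of segment_length bases.

    A shorter final segment is kept as-is (assumed behaviour for this sketch).
    """
    return [read[i:i + segment_length]
            for i in range(0, len(read), segment_length)]


# A made-up 75 bp read yields three 25 bp segments.
read = "ACGTA" * 15
segments = split_into_segments(read)
print([len(s) for s in segments])
```

If the outer segments of a read map to the genome but far apart from each other, the unmapped or split middle segment is a hint that the read crosses a splice junction.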
II. Closure search
• Closure search is only used when TopHat is run with paired-end reads.
• Closure search should only be used when the expected inner distance between mates is small (<= 50bp).
--closure-search
--no-closure-search
--min-closure-intron
--max-closure-intron
III. Coverage search
--coverage-search (coverage search is disabled by default for reads 75bp or longer; this option enables it)
--no-coverage-search
--min-coverage-intron
--max-coverage-intron
IV. Butterfly search
--butterfly-search
Consider using this if you expect that your experiment produced a lot of reads from pre-mRNA that fall within the introns of your transcripts.
V. Junction annotations
--no-novel-junctions
Only look for reads across junctions indicated in the supplied GFF or junctions file.
-G/--GTF <GTF 2 or GFF3>
-j/--raw-juncs <.juncs file>
Input
tophat [options]* <index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]
• Reference sequence indexed by bowtie_index
• Fastq sequences
– Quality format ?
    phred33 (default)
    --solexa-quals
    --solexa1.3-quals
– Paired-ends ?
– Strand-specific ?
– Multi-files ?
• The software is optimized for reads 75bp or longer.
• Mixing paired- and single- end reads together is not supported.
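The quality format question matters because the same FASTQ character decodes to different scores under different offsets. A small sketch, assuming the usual conventions that phred33 uses ASCII offset 33 and that --solexa1.3-quals means Phred scores at offset 64. The older --solexa-quals encoding uses a different score scale and is not covered here. The function name and the example strings are made up.

```python
def decode_phred(qual, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    return [ord(c) - offset for c in qual]


# The same scores written in the two encodings (hypothetical strings).
print(decode_phred("II5!", offset=33))
print(decode_phred("hhT@", offset=64))
```

Reading offset-64 data as phred33 would inflate every score by 31, so it is worth checking the encoding before choosing the option.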
Strand-specific data
--library-type TopHat will treat the reads as strand specific.
Paired-end data
-r/--mate-inner-dist <int>
This is the expected (mean) inner distance between mate pairs.
--mate-std-dev <int>
The standard deviation for the distribution on inner distances between mate pairs.
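The value for -r is easy to get wrong, because it is the gap between the two mates and not the fragment size. A short worked example, assuming both mates have the same length. The function name is hypothetical, and the 300 bp / 50 bp figures follow the example in the TopHat manual.

```python
def mate_inner_distance(fragment_length, read_length):
    """Inner distance = fragment length minus the two sequenced ends."""
    return fragment_length - 2 * read_length


# 300 bp fragments sequenced 50 bp from each end: use -r 200.
print(mate_inner_distance(300, 50))
# 200 bp fragments with 75 bp reads: use -r 50.
print(mate_inner_distance(200, 75))
```

A negative result means the two mates overlap, which happens when the fragments are shorter than twice the read length.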
Other parameters
--bowtie-n (after tophat 1.3.0)
-g/--max-multihits
-a/--min-anchor-length (>=3, default 8)
-m/--splice-mismatches (default 0)
-F/--min-isoform-fraction <0.0-1.0>
-p/--num-threads
--keep-tmp
Output
• accepted_hits.bam: a list of read alignments in BAM (binary SAM) format.
• junctions.bed
• insertions.bed
• deletions.bed
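As a sketch of how junctions.bed can be read downstream: according to the TopHat manual, each junction is a BED line with two blocks (the overhangs on either side) and a score equal to the number of alignments spanning the junction. The parser below relies on that layout. The function name and the sample line are hypothetical, and coordinates are 0-based, half-open as in BED.

```python
def parse_junction(line):
    """Extract intron coordinates and read support from one junctions.bed line."""
    f = line.rstrip("\n").split("\t")
    chrom, chrom_start, score, strand = f[0], int(f[1]), int(f[4]), f[5]
    block_sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in f[11].rstrip(",").split(",")]
    return {
        "chrom": chrom,
        "strand": strand,
        "reads": score,
        # Intron begins where the left block ends.
        "intron_start": chrom_start + block_sizes[0],
        # Intron ends where the right block begins.
        "intron_end": chrom_start + block_starts[1],
    }


# Made-up example line, not real output.
line = "chr1\t1000\t2000\tJUNC00000001\t12\t+\t1000\t2000\t255,0,0\t2\t30,40\t0,960"
print(parse_junction(line))
```

A real file starts with a track header line, which a caller would need to skip before passing lines to this function.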
References
• Trapnell C, Pachter L, Salzberg SL (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. doi:10.1093/bioinformatics/btp120
• Tophat manual: http://tophat.cbcb.umd.edu/manual.html
• Further Readings: Tophat-fusion
Practice time
All the files are at: ngs_vm1:
• Reference sequence: REF.fa
• RNA-seq data 1: SampleA.Run01 SampleA.Run02
– paired-end
– 50nt at each end
– Phred33 quality
• RNA-seq data 2: SampleB.Run01
– paired-end
– 75nt at each end
– strand-specific
– solexa1.3-quals
• Index the genome sequence
bowtie-build REF.fa REF
• Run tophat
tophat --version    # update is frequent; version is important
tophat              # go through all the parameters
tophat \
    -o sampleA.output \
    -r 100 \
    --mate-std-dev 30 \
    REF \
    SampleA.Run01.1.fastq,SampleA.Run02.1.fastq \
    SampleA.Run01.2.fastq,SampleA.Run02.2.fastq
• Run tophat
tophat \
    -o sampleB.output \
    -r 50 \
    --mate-std-dev 30 \
    --library-type fr-firststrand \
    --solexa1.3-quals \
    REF \
    SampleB.Run01.1.fastq \
    SampleB.Run01.2.fastq