sequence analysis with scripture manuel garber. the dynamic genome repaired transcribed wrapped in...

Download Sequence analysis with Scripture Manuel Garber. The dynamic genome Repaired Transcribed Wrapped in marked histones Folded into a fractal globule Bound

If you can't read please download the document

Upload: cynthia-osmond

Post on 14-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

  • Slide 1

Sequence analysis with Scripture Manuel Garber Slide 2 The dynamic genome Repaired Transcribed Wrapped in marked histones Folded into a fractal globule Bound by Transcription Factors Slide 3 With a very dynamic epigenetic state Catherine Dulac, Nature 2010 Slide 4 Which define the state of its functional elements Motivation: find the genome state using sequencing data Slide 5 We can use sequencing to find the genome state (Protein-DNA) ChIP-Seq Histone Marks Transcription Factors Park, P Slide 6 We can use sequencing to find the genome state RNA-Seq Transcription Wang, Z Nature Reviews Genetics 2009 Slide 7 Once sequenced the problem becomes computational Sequenced reads cells sequencer cDNA ChIP genome read coverage Alignment Slide 8 Overview of the session Well cover the 3 main computational challenges of sequence analysis for counting applications: Read mapping: Placing short reads in the genome Reconstruction: Finding the regions that originated the reads Quantification: Assigning scores to regions Finding regions that are differentially represented between two or more samples. Slide 9 Trapnell, Salzberg, Nature Biotechnology 2009 Slide 10 Short read mapping software for ChIP-Seq Seed-extendShort indelsUse base qualB-WUse base qual MaqNoYESBWAYES BFASTYesNOBowtieNO GASSSTYesNOSoap2NO RMAPYesYES SeqMapYesNO SHRiMPYesNO Slide 11 What software to use If read quality is good (error rate < 1%) and there is a reference. BWA is a very good choice. If read quality is not good or the reference is phylogenetically far (e.g. Wolf to dog) and you have a server with enough memory SHRiMP or BFAST should be a sensitive but relatively fast choice. What about RNA-Seq? Slide 12 RNA-Seq read mapping is more complex 10s kb100s bp RNA-Seq reads can be spliced, and spliced reads are most informative Slide 13 Method 1: Seed-extend spliced alignment Slide 14 Method 1I: Exon-first spliced alignment Slide 15 Short read mapping software for RNA-Seq Seed-extendShort indelsUse base qualExon-firstUse base qual GSNAPNoNOMapSpliceNO QPALMAYesNOSpliceMapNO STAMPYYesYESTopHatNO BLATYesNO Exon-first alignments will map contiguous first at the expense of spliced hits Slide 16 Exon-first aligners are faster but at cost How do we visualize the results of these programs Slide 17 The Broad Institute of MIT and Harvard A desktop application for the visualization and interactive exploration of genomic data IGV: Integrative Genomics Viewer Microarrays Epigenomics RNA-Seq NGS alignments Comparative genomics Slide 18 Visualizing read alignments with IGV Long marks Medium marks Punctuate marks Slide 19 Visualizing read alignments with IGV RNASeq Gap between reads spanning exons Slide 20 Visualizing read alignments with IGV RNASeq close-up What are the gray reads? We will revisit later. Slide 21 Visualizing read alignments with IGV zooming out How can we identify regions enriched in sequencing reads? Slide 22 Overview of the session The 3 main computational challenges of sequence analysis for counting applications: Read mapping: Placing short reads in the genome Reconstruction: Finding the regions that originate the reads Quantification: Assigning scores to regions Finding regions that are differentially represented between two or more samples. Slide 23 Scripture was originally designed to identify ChIP-Seq peaks Mikkelsen et al. 2007 Goal: Identify regions enriched in the chromatin mark of interest Challenge: As we saw, Chromatin marks come in very different forms. Slide 24 Chromatin domains demarcate interesting surprises in the transcriptome XIST ??? K4me3 K36me3 Slide 25 Scripture is a method to solve this general question How can we identify these chromatin marks and the genes within? Short modification Long modification Discontinuous data Slide 26 We have an efficient way to compute read count p-values Our approach =0.05 Permutation Poisson Slide 27 The genome is big: A lot happens by chance So we want to compute a multiple hypothesis correction Expected ~150,000,000 bases Slide 28 Bonferroni correction is way to conservative Bonferroni corrects the number of hits but misses many true hits because its too conservative How do we get more power? Correction factor 3,000,000,000 Slide 29 Given a region of size w and an observed read count n. What is the probability that one or more of the 3x10 9 regions of size w has read count >= n under the null distribution? Controlling FWER Count distribution (Poisson) Max Count distribution =0.05 We could go back to our permutations and compute an FWER: max of the genome-wide distributions of same sized region) but really really really slow!!! FWER =0.05 Slide 30 Thankfully, there is a distribution called the Scan Distribution which computes a closed form for this distribution. ACCOUNTS for dependency of overlapping windows thus more powerful! =0.05 FWER =0.05 Scan distribution, an old problem Poisson distribution Scan distribution Is the observed number of read counts over our region of interest high? Given a set of Geiger counts across a region find clusters of high radioactivity Are there time intervals where assembly line errors are high? Slide 31 Scan distribution for a Poisson process The probability of observing k reads on a window of size w in a genome of size L given a total of N reads can be approximated by (Alm 1983): where The scan distribution gives a computationally very efficient way to estimate the FWER Slide 32 By utilizing the dependency of overlapping windows we have greater power, while still controlling the same genome-wide false positive rate. Slide 33 Example : PolII ChIP Significant windows using the FWER corrected p-value Merge Trim Segmentation method for contiguous regions But, which window? Slide 34 We use multiple windows Small windows detect small punctuate regions. Longer windows can detect regions of moderate enrichmentover long spans. In practice we scan different windows, finding significant onesin each scan. In practice, it helps to use some prior information in pickingthe windows although globally it might be ok. Slide 35 Applying Scripture to a variety of ChIP-Seq data 100 bp windows 200, 500 & 1000 bp windows Slide 36 lincRNA K4me3 K36me3 Application of scripture to mouse chromatin state maps Identifed ~1500 lincRNAs Conserved Noncoding Robustly expressed lincRNA Mitch Guttman Slide 37 Can we identify enriched regions across different data types? Short modification Long modification Discontinuous data: RNA-Seq to find gene structures for this gene-like regions Using chromatin signatures we discovered hundreds of putative genes. What is their structure? Slide 38 Scripture for RNA-Seq: Extending segmentation to discontiguous regions Slide 39 The transcript reconstruction problem 10s kb100s bp Challenges: Genes exist at many different expression levels, spanning several orders ofmagnitude. Reads originate from both mature mRNA (exons) and immature mRNA (introns)and it can be problematic to distinguish between them. Reads are short and genes can have many isoforms making it challenging todetermine which isoform produced each read. There are two main approaches to this problem, first lets discuss Scriptures Slide 40 Scripture: A statistical genome-guided transcriptome reconstruction Statistical segmentation of chromatin modifications uses continuity of segments to increase power for interval detection If we know the connectivity of fragments, we can increase our power to detect transcripts RNA-Seq Slide 41 Longer (76) reads provide increased number of junction reads intron Exon junction spanning reads provide the connectivity information. Slide 42 Exon-exon junctions Alternative isoforms ES RNASeq 76 base reads Aligned with TopHat Aligned read Gap Protein coding gene with 2 isoforms Read coverage The power of spliced alignments Slide 43 Step 1: Align Reads to the genome allowing gaps flanked by splice sites The connectivity graph connects all bases that are directly connected within the transcriptome Step 2: Build an oriented connectivity map oriented every spliced alignment using the flanking motifs genome Statistical reconstruction of the transcriptome Slide 44 Step 3: Identify segments across the graph Step 4: Find significant transcripts Statistical reconstruction of the transcriptome Slide 45 Merge windows & build transcript graph Filter & report isoforms Scripture Overview Map reads Scan discontiguous windows Slide 46 Can we identify enriched regions across different data types? Short modification Long modification Discontinuous data Are we really sure reconstructions are complete? Slide 47 RNA-Seq data is incomplete for comprehensive annotation Library construction can help provide more information. More on this later Slide 48 Applying scripture: Annotating the mouse transcriptome Slide 49 Reconstructing the transcriptome of mouse cell types Mouse Cell Types SequenceReconstruct Slide 50 Sensitivity across expression levels Even at low expression (20 th percentile), we have: average coverage of transcript is ~95% and 60% have full coverage 20 th percentile Slide 51 Sensitivity at low expression levels improves with depth As coverage increases we are able to fully reconstruct a larger percentage of known protein-coding genes Slide 52 Novel variation in protein-coding genes 1,804 1,310 588 3,137 2,477 903 3 cell typesES cells Slide 53 Novel variation in protein-coding genes ~85% contain K4me3 Slide 54 Novel variation in protein-coding genes ~50% contain polyA motif Compared to ~6% for random Slide 55 Novel variation in protein-coding genes ~80% retain ORF Slide 56 What about novel genes? Class 2: Large Intergenic ncRNA (lincRNA) Class 1: Overlapping ncRNA Class 3: Novel protein-coding genes Slide 57 Class 1: Overlapping ncRNA 201446 3 cell typesES cells Slide 58 Overlapping ncRNAs: Assessing their evolutionary conservation SiPhy-Garber et al. 2009 Conservation HighLow Overlapping ncRNAs show little evolutionary conservation Slide 59 What about novel genes? Class 2: Large Intergenic ncRNA (lincRNA) Class 1: Overlapping ncRNA Class 3: Novel protein-coding genes Slide 60 Class 2: Intergenic ncRNA (lincRNA) ~500~1500 3 cell typesES cells Slide 61 lincRNAs: How do we know they are non-coding? >95% do not encode proteins ORF LengthCSF (ORF Conservation) Slide 62 lincRNAs: Assessing their evolutionary conservation Conservation HighLow Slide 63 What about novel genes? Class 2: Large Intergenic ncRNA (lincRNA) Class 1: Overlapping ncRNA Class 3: Novel protein-coding genes Slide 64 Finding novel coding genes ~40 novel protein-coding genes Slide 65 lincRNAs: Comparison to K4-K36 RNA-Seq reconstruction and chromatin signature synergize to identify lincRNAs Chromatin Approach RNA-Seq First Approach 80% have reconstructions 85% have chromatin Slide 66 Other transcript reconstruction methods Slide 67 Method I: Direct assembly Slide 68 Method II: Genome-guided Slide 69 Transcriptome reconstruction method summary Slide 70 Pros and cons of each approach Transcript assembly methods are the obvious choice fororganisms without a reference sequence. Genome-guided approaches are ideal for annotating high-quality genomes and expanding the catalog of expressedtranscripts and comparing transcriptomes of different celltypes or conditions. Hybrid approaches for lesser quality or transcriptomes thatunderwent major rearrangements, such as in cancer cell. More than 1000 fold variability in expression leves makesassembly a harder problem for transcriptome assemblycompared with regular genome assembly. Genome guided methods are very sensitive to alignmentartifacts. Slide 71 RNA-Seq transcript reconstruction software AssemblyPublishedGenome Guided OasisNOCufflinks Trans-ABySSYESScripture TrinityNO Slide 72 Differences between Cufflinks and Scripture Scripture was designed with annotation in mind. It reports all possible transcripts that are significantly expressed given the aligned data ( Maximum sensitivity ). Cuffl links was designed with quantification in mind. It limits reported isoforms to the minimal number that explains thedata ( Maximum precision ). Slide 73 Differences between Cufflinks and Scripture - Example Annotation Scripture Cufflinks Alignments Slide 74 Different ways to go at the problem CufflinksScripture Total loci called27,70019,400 Median # isoforms per loci11 3 rd quantile # isoforms per loci12 Max # isoforms191,400 Slide 75 Different ways to go at the problem Many of the bogus locus and isoforms are due to alignment artifacts Slide 76 Why so many isoforms Annotation Reconstructions Every such splicing event or alignment artifact doubles the number of isoforms reported Slide 77 Alignment revisited spliced alignment is still work in progress Slide 78 Exon-first aligners are faster but at cost Alignment artifacts can also decrease sensitivity Slide 79 Sfrs3 TopHat New Pipeline Read mapped uniquely Read ambiguously mapped Missing spliced reads for highly expressed genes Slide 80 Use gapped aligners (e.g. BLAT) to map reads Align all reads with BLAT Filter hits and build candidate junction database from BLAT hits (Scripturelight). Use a short read aligner (Bowtie) to map reads against the connectivity graphinferred transcriptome Map transcriptome alignments to the genome Can more sensitive alignments overcome this problem? Slide 81 Sfrs3 TopHat BLAT pipeline ` ScriptAlign: Can increase alignment across junctions Slide 82 Map first reconstruction approaches directly benefit with mapping improvements We even get more uniquely aligned reads (not just spliced reads) ScriptAlign: Can increase alignment across junctions Slide 83 Overview of the session The 3 main computational challenges of sequence analysis for counting applications: Read mapping: Placing short reads in the genome Reconstruction: Finding the regions that originate the reads Quantification: Assigning scores to regions Finding regions that are differentially represented between two or more samples. Slide 84 Quantification Fragmentation of transcripts results in length bias: longer transcripts have higher counts Different experiments have different yields. Normalization is required for cross lane comparisons: Reads per kilobase of exonic sequence per million mapped reads (Mortazavi et al Nature methods 2008) This is all good when genes have one isoform. Slide 85 Quantification with multiple isoforms How do we define the gene expression? How do we compute the expression of each isoform? Slide 86 Computing gene expression Idea1: RPKM of the constitutive reads (Neuma, Alexa-Seq, Scripture) Slide 87 Computing gene expression isoform deconvolution Slide 88 If we knew the origin of the reads we could compute each isoforms expression. The genes expression would be the sum of the expression of all its isoforms. E = RPKM 1 + RPKM 2 + RPKM 3 Slide 89 Programs to measure transcript expression Implemented method Alexa-seqGene expression by constitutive exons ERANGEGene expression by using all Exons ScriptureGene expression by constitutive exons CufflinksTranscript deconvolution by solving the maximum likelihood problem MISOTranscript deconvolution by solving the maximum likelihood problem RSEMTranscript deconvolution by solving the maximum likelihood problem Slide 90 Impact of library construction methods Slide 91 Library construction improvements Paired-end sequencing Adapted from the Helicos website Slide 92 Paired-end reads are easier to associate to isoforms P1P1 P2P2 P3P3 Isoform 1 Isoform 2 Isoform 3 Paired ends increase isoform deconvolution confidence P 1 originates from isoform 1 or 2 but not 3. P 2 and P 3 originate from isoform 1 Do paired-end reads also help identifying reads originating in isoform 3? Slide 93 We can estimate the insert size distribution P1P1 P2P2 d1d1 d2d2 Splice and compute insert distance Estimate insert size empirical distribution Get all single isoform reconstructions Slide 94 and use it for probabilistic read assignment Isoform 1 Isoform 2 Isoform 3 d1d1 d2d2 d1d1 d2d2 P(d > d i ) Slide 95 And improve quantification Katz et al Nature Methods 2008 Slide 96 Paired-end improve reconstructions Paired-end data complements the connectivity graph Slide 97 And merge regions Single reads Paired reads Slide 98 Or split regions Single reads Paired reads Slide 99 Summary Paired-end reads are now routine in Illumina and SOLiD sequencers. Paired end alignment is supported by most short read aligners Transcript quantification depends heavily in paired-end data Transcript reconstruction is greatly improved when using paired-ends (work in progress) Slide 100 Giving orientation to transcripts Strand specific libraries Scripture relies on splice motifs to orient transcripts. It orients every edge in the connectivity graph. Single exon genes are left unoriented ? Slide 101 Strand specific library construction results in oriented reads. Sequence depends on the adapters ligated The second strand is destroyed, thus the cDNA read is always in reverse orientation to the RNA Scripture & Cufflinks allow the user to specify the orientation of the reads. Adapted from Levine et al Nature Methods Slide 102 The libraries we will work with are strand sepcific Slide 103 Summary Several methods now exist to build strand sepecific RNA-Seqlibraries. Quantification methods support strand specific libraries. For example Scripture will compute expression on both strand ifdesired. Slide 104 In the works Complementing RNA-Seq libraries with 3 and 5 taggedsegments to pinpoint the start and end of reconstructions. Slide 105 Overview of the session The 3 main computational challenges of sequence analysis for counting applications: Read mapping: Placing short reads in the genome Reconstruction: Finding the regions that originate the reads Quantification: Assigning scores to regions Finding regions that are differentially represented between two or more samples. Slide 106 The problem. Finding genes that have different expression between two or moreconditions. Find gene with isoforms expressed at different levels between twoor more conditions. Find differentially used slicing events Find alternatively used transcription start sites Find alternatively used 3 UTRs Slide 107 Differential gene expression using RNA-Seq (Normalized) read counts Hybridization intensity We observe the individual events. Slide 108 The Poisson model Suppose you have 2 conditions and R replicates for each conditions and each replicate in its own lane L. Lets consider a single gene G. Let C ik the number of reads aligned to G in lane i of condition k then (k=1,2) and i=(1,R). Assume for simplicity that all lanes give the same number of reads (otherwise introduce a normalization constant) Assume C ik distributes Poisson with unknown mean m ik. Use a GLN to estimate m ik using two parameters, a gene dependent parameter a and a sample dependent parameter s k log(m ik ) = a + s k to obtain two estimators m 1 and m 2 Alternatively estimate a mean m using all replicates for all conditions Slide 109 The Poisson model G is differentially expressed when m 1 != m 2 Is P(C 1k,C 2k |m) is close to P(C 1k |m 1 )P(C 2k |m 2 ) The likelihood ratio test is ideal to see this and since the difference between the two models is one variable it distributes X 2 of degree 1. The X 2 can be used to assess significance. For details see Auer and Doerge - Statistical Design and Analysis of RNA Sequencing Data genomics 2010. Marioni et al RNASeq: An assessment technical of reproducibility and comparison with gene expression arrays Genome Reasearch 2008. Slide 110 Cufflinks differential issoform ussage Let a gene G have n isoforms and let p 1, , p n the estimated fraction of expression of each isoform. Call this a the isoform expression distribution P for G Given two samples we the differential isoform usage amounts to determine whether H 0 : P 1 = P 2 or H 1 : P 1 != P 2 are true. To compare distributions Cufflinks utilizes an information content based metric of how different two distributions are called the Jensen-Shannon divergence: The square root of the JS distributes normal. Slide 111 RNA-Seq differential expression software Underlying modelNotes DegSeqNormal. Mean and variance estimated from replicates Works directly from reference transcriptome and read alignment EdgeRNegative BionomialGene expression table DESeqPoissonGene expression table MyrnaEmpiricalSequence reads and reference transcriptome Slide 112 Dendritic cells are one of the major regulators of inflammation in our body DC guardians of homeostasis pamlpspiccpg Liew et al 2005 Mouse Dendritic Cell stimulation timecourse Slide 113 Acknowledgements Mitchell Guttman New Contributors: Moran Cabili Hayden Metsky RNA-Seq analysis: Cole Trapnel Manfred Grabher Could not do it without the support of: Ido Amit Eric Lander Aviv Regev Slide 114 What are the gray colored reads? This is a paired-end library, gray reads aligned without their mate Slide 115 RPKM We now can include novel genes in differential expression analysis Slide 116 Tiny modification (TFs) Long modifications (K36/K27) Short modifications (K4) Discontinuous (RNA) Alignment Tiling arrays Scripture: Genomic jack of all trades Slide 117 Conclusions A method for transcriptome reconstruction Lots of cell-type specific variation in protein- coding annotations www.broadinstitute.org/scripture Chromatin and RNA maps synergize to annotated the genome Slide 118 Significant windows Merge & trim In practice we scan a range of windows and integrated the results of each scan Longer windows for dimmer but longer regions Slide 119 But which window? Slide 120 This works given a known window size how do we find this So since the Scan Dist depends on 3 params (genome, number of reads, and window size) we can vary window size and get an adjusted dist for each We can then scan the genome using different window sizes allowing the distributions to account directly for the variable enrichment sizes that we observe. So in practice, we scan with multiple different sizes (from low-> high) and merge results While the scan dists accounts for these different sizes, there are practical which some times can confound. For example, if one has short enrichment which are close together they can be merged by long windows bec they make the long window significant. In practice, it helps to use some prior information in picking the windows although globally it might be ok.