From Sequence to Expression: A Probabilistic Framework

Eran Segal (Stanford). Joint work with: Yoseph Barash (Hebrew U.), Itamar Simon (Whitehead Inst.), Nir Friedman (Hebrew U.), Daphne Koller (Stanford)


Post on 09-Jan-2016



TRANSCRIPT

  • From Sequence to Expression: A Probabilistic Framework

    Eran Segal (Stanford)
    Joint work with: Yoseph Barash (Hebrew U.), Itamar Simon (Whitehead Inst.), Nir Friedman (Hebrew U.), Daphne Koller (Stanford)

  • Understanding Cellular Processes

    Complex biological processes (e.g. cell cycle)
    Coordination of multiple events
    Each event requires different modules
    Can we recover the regulatory circuits that control such processes?

  • Gene Structure

  • Gene Regulation

    [Figure labels: A, mRNA]

  • Gene Regulation

    [Figure labels: Swi5, mRNA]

  • Gene Regulation

    [Figure labels: mRNA]

  • Goal

    ACTAGTGCTGACTATTATTGCACTGATGCTAGC +

  • Model of Gene Regulation

    Probabilistic Relational Models (PRMs): Pfeffer and Koller (1998); Friedman et al (1999); Segal et al (2001)
    [Diagram labels: Gene, Experiment, Expression, Sequence, Context, Cluster]
    Promoter sequences
    Regulation by transcription factors
    Expression measurements

  • Regulation to Expression

    [Diagram labels: Gene, Experiment, Expression, Level, R(t1), R(t2), Exp. type, Exp. cluster]
    R(t1) = yes: t1 regulates gene
    R(t1) = no: t1 does not regulate gene

  • Modeling Context Specificity

    [Diagram labels: Gene, Experiment, Expression, Level, R(t1), R(t2), Exp. type, Exp. cluster]
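
    One way to picture context specificity is a tree-structured conditional distribution for Level, where each root-to-leaf path defines a context and some parents are ignored in some contexts. The sketch below is only an illustration: the function name `expression_cpd`, the `Gaussian` class, the split order, and every number are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gaussian:
    """Leaf distribution over the expression level (hypothetical values)."""
    mean: float
    std: float


def expression_cpd(r_t1: bool, r_t2: bool, exp_cluster: int) -> Gaussian:
    """Return the leaf reached by walking a small, made-up decision tree.

    Each branch below is one context. When t1 does not regulate the gene,
    R(t2) and the experiment cluster are never consulted.
    """
    if not r_t1:
        return Gaussian(0.0, 1.0)      # baseline context
    if exp_cluster == 1:
        return Gaussian(1.5, 0.5)      # t1 active, experiments in cluster 1
    if r_t2:
        return Gaussian(-1.0, 0.5)     # t1 and t2 together, other clusters
    return Gaussian(0.2, 0.8)          # t1 alone, other clusters
```

    In this toy tree, `expression_cpd(False, True, 1)` and `expression_cpd(False, False, 2)` return the same leaf, which is the sense in which the dependence on R(t2) and Exp. cluster is specific to a context.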

  • Sequence Model

    [Diagram labels: Gene, Experiment, Expression, Level, R(t1), R(t2), Exp. type, Exp. cluster, Sequence]
    Assumptions:
    Binding site is of length k
    Binding may occur at any k-mer
    TF regulates gene if binding occurs anywhere

  • From Sequence to Regulation

    Assumptions:
    Binding site is of length k
    Binding may occur at any k-mer
    TF regulates gene if binding occurs anywhere

  • From Sequence to Regulation

    Model for one gene g, promoter region of length 5 and k=2
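
    The three assumptions (fixed site length k, binding possible at any k-mer, regulation if binding occurs anywhere) can be sketched for a promoter of length 5 with k=2. This is a minimal illustration: the promoter string, the per-k-mer binding probabilities in `toy_bind_prob`, and the assumption that binding events at different positions are independent are all hypothetical choices made for the example, not details given in the slides.

```python
from typing import Callable, List


def kmers(promoter: str, k: int) -> List[str]:
    """All overlapping windows of length k, left to right."""
    return [promoter[i:i + k] for i in range(len(promoter) - k + 1)]


def prob_regulated(promoter: str, k: int,
                   bind_prob: Callable[[str], float]) -> float:
    """P(R = yes) when the TF regulates the gene if it binds anywhere.

    Assumes (for illustration) that binding at each k-mer is independent,
    so P(R = yes) = 1 - product over k-mers of (1 - P(binding at k-mer)).
    """
    p_no_binding = 1.0
    for site in kmers(promoter, k):
        p_no_binding *= 1.0 - bind_prob(site)
    return 1.0 - p_no_binding


def toy_bind_prob(site: str) -> float:
    """Hypothetical motif model: strong affinity for 'TG', weak otherwise."""
    return 0.9 if site == "TG" else 0.05


# Promoter of length 5 with k = 2 gives four candidate sites.
sites = kmers("ACTGA", 2)                       # ['AC', 'CT', 'TG', 'GA']
p_reg = prob_regulated("ACTGA", 2, toy_bind_prob)
```

    A promoter without the preferred k-mer (for example "ACACA" under this toy model) yields a much lower regulation probability, since only the weak background binding contributes.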

  • Joint Probabilistic Model

    [Diagram labels: Gene, Experiment, Expression, Level, R(t1), R(t2), Exp. type, Exp. Cluster, k-mer, s1, sk, B(t1), B(t2)]

  • Localization Assay

    [Figure, built up over several slides: Swi5 bound to target promoters; genes marked Gene Bound / Gene Not Bound]
    Induce TF protein level
    TF binds to targets
    Measure TF binding to promoter of every gene
    Assign confidence for each binding

  • Localization Assay

    Simon et al (2001)
    Localization data: measure TF binding to promoter of each gene (assign binding confidence)

  • Is Regulation Observed? Not quite

    Localization is measured for specific conditions
    Localization is measured for large DNA regions
    Localization is noisy

  • Incorporating Localization

    [Diagram labels: Gene, Experiment, Expression, Level, R(t1), R(t2), Exp. type, Exp. Cluster, L(t1), L(t2); L is the observed localization]
    Localization p-value is noisy sensor of actual regulation
    If regulation occurs, p-value likely to be low
    If no regulation, p-value likely to be high

  • Localization Model

    [Diagram labels: Gene, R(t1), L(t1) (observed)]
    Localization p-value is noisy sensor of actual regulation
    If regulation occurs, p-value likely to be low
    If no regulation, p-value likely to be high
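
    The noisy-sensor idea can be made concrete with Bayes' rule: given an observed p-value, update the belief that the TF regulates the gene. The densities used here are assumptions made for illustration only. Under no regulation the p-value is taken to be Uniform(0, 1); under regulation it is taken to follow a Beta(shape, 1) density with shape < 1, which concentrates mass near zero. The function name, the prior, and the shape value are hypothetical and do not come from the slides.

```python
def posterior_regulated(p_value: float, prior: float = 0.1,
                        shape: float = 0.2) -> float:
    """P(R = yes | localization p-value) under the assumed densities.

    Regulation:    p-value ~ Beta(shape, 1), density shape * p**(shape - 1)
    No regulation: p-value ~ Uniform(0, 1),  density 1
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must lie in (0, 1]")
    like_regulated = shape * p_value ** (shape - 1.0)
    like_not_regulated = 1.0
    numerator = prior * like_regulated
    return numerator / (numerator + (1.0 - prior) * like_not_regulated)


low = posterior_regulated(0.001)   # strong binding evidence
high = posterior_regulated(0.9)    # no binding evidence
```

    With these settings a low p-value raises the posterior well above the prior and a high p-value pushes it below the prior, which matches the qualitative behavior described on the slide.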

  • Joint Probabilistic Model

    [Diagram labels: Gene, Experiment, Expression, Level, R(t1), R(t2), Exp. type, Exp. Cluster, promoter, s1, sk, L(t1), L(t2)]

  • Learning the Models

    Experimental Details
    Localization Data

  • Model Learning

    Structure Learning: Tree structure
    Missing Data: Experiment cluster, Regulation variables
    Motif Model: Parameter estimation

  • Model Learning

    [Diagram labels: Gene, Expression, Experiment, Level, R(t1), R(t2), Exp. type, Exp. cluster, promoter, s1, sk, L(t1), L(t2)]

  • Resulting Bayesian Network

    [Figure: unrolled network with nodes Level i,j (i = 1..3, j = 1..2); R(t1)i, R(t2)i, L(t1)i, L(t2)i and s1..sk for i = 1..3; Exp. type and Exp. cluster per experiment]

  • Model Learning: E-Step

    [Figure: same unrolled network]
    Loopy belief propagation

  • Model Learning: M-Step

    [Figure: same unrolled network]
    Standard ML estimation
    Conjugate Gradient
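
    The alternation between an E-step and an M-step can be illustrated on a deliberately tiny model: one hidden binary regulation variable per gene and one observed expression level drawn from a Gaussian whose mean depends on that variable. This sketch is not the procedure from the slides. The toy model has no loops, so the E-step computes exact posteriors instead of running loopy belief propagation, and the M-step uses closed-form maximum-likelihood updates instead of conjugate gradient. The function names, the fixed unit variance, and the data are all hypothetical.

```python
import math
from typing import List, Tuple


def normal_pdf(x: float, mean: float, std: float) -> float:
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_two_state(levels: List[float],
                 n_iter: int = 50) -> Tuple[float, float, float]:
    """Fit P(R = yes) and the two state means with a fixed std of 1.0.

    Returns (pi, mu_not_regulated, mu_regulated).
    """
    pi, mu0, mu1 = 0.5, min(levels), max(levels)   # crude initialization
    n = len(levels)
    for _ in range(n_iter):
        # E-step: posterior probability that each gene is regulated.
        resp = []
        for x in levels:
            a = pi * normal_pdf(x, mu1, 1.0)
            b = (1.0 - pi) * normal_pdf(x, mu0, 1.0)
            resp.append(a / (a + b))
        # M-step: maximum-likelihood updates from the expected counts.
        total = sum(resp)
        pi = total / n
        mu1 = sum(r * x for r, x in zip(resp, levels)) / total
        mu0 = sum((1.0 - r) * x for r, x in zip(resp, levels)) / (n - total)
    return pi, mu0, mu1


# Hypothetical expression levels: three low genes and three high genes.
pi_hat, mu0_hat, mu1_hat = em_two_state([-2.1, -1.9, -2.0, 2.0, 2.2, 1.8])
```

    On this well-separated toy data the two means settle near the two groups and the mixing weight near one half. In the full model the hidden variables are coupled across genes and experiments, which is why an approximate inference method is needed in the E-step.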

  • Experimental Results

    Yeast
    Cell Cycle expression data (Spellman et al)
    Localization data for 9 TFs (Simon et al)
    Yeast genome (promoters)

  • Generalization

    [Diagram labels: Gene, Expression, Experiment, Level, R(t1), R(t2), Exp. Cluster]
    Gene log-likelihood: -112.24

  • Generalization

    [Diagram labels: Gene, Expression, Experiment, Level, L(t1), L(t2), Exp. type]
    Gene log-likelihood: -121.48 (Localization), -112.24

  • Generalization

    [Diagram labels: Gene, Expression, Experiment, Level, R(t1), R(t2), Exp. type, Exp. Cluster]
    Gene log-likelihood: -112.24

  • Generating Hypotheses

  • Expression vs Regulation

    [Figure: alpha, cdc15, cdc28 and elu time courses; scale from -1 to 1]
    Genes predicted to be regulated by Swi5 are probably real Swi5 targets

  • Combinatorial Effects

    [Figure: alpha, cdc15, cdc28 and elu time courses; scale from -1 to 1; Phase; Fkh2 & Swi4, Fkh2 & Ndd1]

  • Combinatorial Effects

    [Figure: alpha, cdc15, cdc28 and elu time courses; scale from -1 to 1; Phase; Mcm1 & Ndd1, Mcm1 & Ace2, Mcm1 & Swi5]

  • Localization Assignment Changes

  • Motifs Found

    Ndd1
    [Figure labels: Simon et al., Expanded Set, Remaining Genes; values: 17128]
    Expanded set identified additional genes regulated by Ndd1

  • TF     Simon   Expanded   Rest   P-Value
    Ace2   10      9          1      1.4e-6
    Fkh1   29      25         8      4.4e-10
    Fkh2   29      29         10     5.4e-11
    Mbp1   66      56         8      1.9e-45
    Mcm1   28      24         2      4.2e-18
    Ndd1   17      28         1      1.9e-24
    Swi4   41      37         5      6.4e-26
    Swi5   28      23         2      4.9e-15
    Swi6   50      52         6      2.3e-48

  • Induced Interaction Network

    TF pairs whose regulation predicts expression of same gene cluster
    [Figure: network over Ace2, Swi5, Ndd1, Fkh2, Fkh1, Swi4, Swi6, Mcm1, Mbp1, with cell cycle phases M, G1, G2, S]

  • Conclusions

    Unified probabilistic model explaining gene regulation using sequence, localization and expression data
    Models complex interactions between regulators
    Discriminative model maximizing P(Expr. | Seq.)
    Sequence data helps explain expression patterns

  • Big Picture

    Goal: unified probabilistic framework
    Models complex biological domains
    Incorporates heterogeneous data
    Framework incorporates explicitly within model basic biological building blocks:
    Genes, TFs, proteins, patients, cells, species,
    Much closer connection between biology and model
    Can read biology directly from model
    Can incorporate prior knowledge easily
    Can explicitly represent and learn biological models

    This could be a model of the system: the expression level of a particular gene, in a specific context defined by the experiment type, depends on whether it is regulated and by which TFs. This regulation event, for each TF, depends in turn on the presence of the DNA binding site for that TF in the upstream sequence of the gene. So, are we done? Well, not quite: biologists don't know how to build this model. Instead