Introduction to Pathway and Network Analysis
Alison Motsinger-Reif, PhD
Associate Professor
Bioinformatics Research Center, Department of Statistics
North Carolina State University
Pathway and Network Analysis
• High-throughput genetic/genomic technologies enable comprehensive monitoring of a biological system
• Analysis of high-throughput data typically yields a list of differentially expressed genes, proteins, metabolites…
  – Typically provides lists of single genes, etc.
  – Will use “genes” throughout, mostly interchangeably with proteins, metabolites, etc.
• This list often fails to provide mechanistic insights into the underlying biology of the condition being studied
• How to extract meaning from a long list of differentially expressed genes → pathway/network analysis
What makes an airplane fly?
Image: Chas' Stainless Steel, Mark Thompson's Airplane Parts, About 1000 Pounds of Stainless Steel Wire, and Gagosian's Beverly Hills Space
From components to networks
A biological function is a result of many interacting molecules and cannot be attributed to just a single molecule.
Pathway and Network Analysis
• One approach: simplify analysis by grouping long lists of individual genes into smaller sets of related genes, which reduces the complexity of analysis
  – A large number of knowledge bases have been developed to help with this task
• Knowledge bases
  – describe biological processes, components, or structures in which individual genes are known to be involved
  – describe how and where gene products interact with each other
Pathway and Network Analysis
• Analysis at the functional level is appealing for two reasons:
  – First, grouping thousands of genes by the pathways they are involved in reduces the complexity to just several hundred pathways for the experiment
  – Second, identifying active pathways that differ between two conditions can have more explanatory power than a simple list of genes
Pathway and Network Analysis
• What kinds of data are used for such analysis?
  – Gene expression data
    • Microarrays
    • RNA-seq
  – Proteomic data
  – Metabolomics data
  – Single nucleotide polymorphisms (SNPs)
  – ….
Pathway and Network Analysis
• What kinds of questions can we ask/answer with these approaches?
Pathway and Network Analysis
• The term “pathway analysis” gets used often, and often in different ways
  – applied to the analysis of Gene Ontology (GO) terms (also referred to as a “gene set”)
  – physical interaction networks (e.g., protein–protein interactions)
  – kinetic simulation of pathways
  – steady-state pathway analysis (e.g., flux-balance analysis)
  – inference of pathways from expression and sequence data
• May or may not actually describe biological pathways
Pathway and Network Analysis
• For the first part of this module, we will focus on methods that exploit pathway knowledge in public repositories rather than on methods that infer pathways from molecular measurements
  – Use repositories such as GO or Kyoto Encyclopedia of Genes and Genomes (KEGG)
→ knowledge base–driven pathway analysis
A History of Pathway Analysis Approaches
• Over a decade of development of pathway analysis approaches
• Can be roughly divided into three generations:
  – 1st: Over-Representation Analysis (ORA) Approaches
  – 2nd: Functional Class Scoring (FCS) Approaches
  – 3rd: Pathway Topology (PT)-Based Approaches
Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.
• The data generated by an experiment using a high-throughput technology (e.g., microarray, proteomics, metabolomics), along with functional annotations (pathway database) of the corresponding genome, are input to virtually all pathway analysis methods.
• ORA methods require that the input is a list of differentially expressed genes
• FCS methods use the entire data matrix as input
• PT-based methods additionally utilize the number and type of interactions between gene products, which may or may not be a part of a pathway database.
• The result of every pathway analysis method is a list of significant pathways in the condition under study.
Over-Representation Analysis (ORA) Approaches
• Earliest methods → over-representation analysis (ORA)
• Statistically evaluates the fraction of genes in a particular pathway found among the set of genes showing changes in expression
• It is also referred to as the “2×2 table method” in the literature
Over-Representation Analysis (ORA)
• Uses one or more variations of the following strategy:
  – First, an input list is created using a certain threshold or criterion
    • For example, may choose genes that are differentially over- or under-expressed in a given condition at a false discovery rate (FDR) of 5%
  – Then, for each pathway, input genes that are part of the pathway are counted
  – This process is repeated for an appropriate background list of genes
    • (e.g., all genes measured on a microarray)
  – Next, every pathway is tested for over- or under-representation in the list of input genes
    • The most commonly used tests are based on the hypergeometric, chi-square, or binomial distribution
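The ORA strategy above can be sketched in a few lines of Python. This is a minimal illustration, not any specific tool's implementation: it assumes the hypergeometric test (one of the options listed above), and the gene names, pathway names, and the helper functions `hypergeom_upper_tail` and `ora` are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k): chance of seeing at least k pathway genes when n input
    genes are drawn without replacement from N background genes, K of
    which belong to the pathway."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def ora(input_genes, pathways, background):
    """Return {pathway: (overlap_count, p_value)} for over-representation."""
    background = set(background)
    input_genes = set(input_genes) & background
    results = {}
    for name, members in pathways.items():
        members = set(members) & background
        overlap = len(input_genes & members)          # input genes in the pathway
        p = hypergeom_upper_tail(overlap, len(members),
                                 len(input_genes), len(background))
        results[name] = (overlap, p)
    return results

# Toy data (hypothetical): 20 measured genes, 5 of them called significant.
background = [f"g{i}" for i in range(1, 21)]
input_genes = ["g1", "g2", "g3", "g4", "g15"]
pathways = {
    "pathway_A": ["g1", "g2", "g3", "g4"],
    "pathway_B": ["g15", "g16", "g17", "g18"],
}
results = ora(input_genes, pathways, background)
for name, (overlap, p) in results.items():
    print(name, overlap, p)
```

Note that only counts enter the test: the size of each expression change never appears, which is the first limitation discussed on the following slides.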
Limitations of ORA Approaches
• First, the different statistics used by ORA are independent of the measured changes
  – (e.g., hypergeometric distribution, binomial distribution, chi-square distribution, etc.)
• Tests consider the number of genes alone but ignore any values associated with them
  – such as probe intensities
• By discarding this data, ORA treats each gene equally
  – Information about the extent of regulation (e.g., fold-changes, significance of a change, etc.) can be useful in assigning different weights to input genes/pathways
  – This can provide more information
Limitations of ORA Approaches
• Second, ORA typically uses only the most significant genes and discards the others
  – The input list of genes is usually obtained using an arbitrary threshold (e.g., genes with fold-change > 2 and/or p-value < 0.05)
• Marginally less significant genes are missed, resulting in information loss
  – (e.g., fold-change = 1.999 or p-value = 0.051)
  – A few methods avoid thresholds
    • They use an iterative approach that adds one gene at a time to find a set of genes for which a pathway is most significant
Limitations of ORA Approaches
• Third, ORA assumes that each gene is independent of the other genes
• However, biology is a complex web of interactions between gene products that constitute different pathways
  – One goal might be to gain insights into how interactions between gene products are manifested as changes in expression
  – A strategy that assumes the genes are independent is significantly limited in its ability to provide insights
• Furthermore, assuming independence between genes amounts to “competitive null hypothesis” testing (more later), which ignores the correlation structure between genes
  – The estimated significance of a pathway may be biased or incorrect
Limitations of ORA Approaches
• Fourth, ORA assumes that each pathway is independent of other pathways → NOT TRUE!
• Examples of dependence:
  – GO defines a biological process as a series of events accomplished by one or more ordered assemblies of molecular functions
  – The cell cycle pathway in KEGG, where the presence of a growth factor activates the MAPK signaling pathway
    • This, in turn, activates the cell cycle pathway
• No ORA methods account for this dependence between molecular functions in GO and signaling pathways in KEGG
Functional Class Scoring (FCS) Approaches
• The hypothesis of functional class scoring (FCS) is that although large changes in individual genes can have significant effects on pathways, weaker but coordinated changes in sets of functionally related genes (i.e., pathways) can also have significant effects
• With few exceptions, all FCS methods use a variation of a general framework that consists of the following three steps.
Step 1
• First, a gene-level statistic is computed using the molecular measurements from an experiment
  – Involves computing differential expression of individual genes or proteins
• Statistics currently used at the gene level include:
  – correlation of molecular measurements with phenotype
  – ANOVA
  – Q-statistic
  – signal-to-noise ratio
  – t-test
  – Z-score
Step 1
• Choice of a gene-level statistic generally has a negligible effect on the identification of significantly enriched gene sets
  – However, when there are few biological replicates, a regularized statistic may be better
• Untransformed gene-level statistics can fail to identify pathways with both up- and down-regulated genes
  – In this case, transformation of gene-level statistics (e.g., absolute values, squared values, ranks, etc.) is better
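To illustrate why transformation matters, the sketch below computes a Welch t-statistic per gene and then averages it over a pathway with and without taking absolute values. The expression values, gene names, and the helper `t_statistic` are hypothetical and chosen only to make the cancellation visible.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(control, treated):
    """Welch t-statistic for one gene (treated minus control)."""
    se = sqrt(variance(control) / len(control) + variance(treated) / len(treated))
    return (mean(treated) - mean(control)) / se

# Hypothetical pathway: one gene goes up, one goes down by the same amount.
expression = {
    "up_gene":   ([1.0, 1.2, 0.8], [3.0, 3.2, 2.8]),
    "down_gene": ([3.0, 3.2, 2.8], [1.0, 1.2, 0.8]),
}

gene_stats = {g: t_statistic(ctrl, trt) for g, (ctrl, trt) in expression.items()}

# Untransformed statistics cancel; absolute values keep the signal.
signed_score = mean(gene_stats.values())
absolute_score = mean(abs(t) for t in gene_stats.values())
print(gene_stats, signed_score, absolute_score)
```

The signed average is essentially zero even though both genes change strongly, whereas the average of absolute values stays large.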
Step 2
• Second, the gene-level statistics for all genes in a pathway are aggregated into a single pathway-level statistic
  – can be multivariate and account for interdependencies among genes
  – can be univariate and disregard interdependencies among genes
• The pathway-level statistics used include:
  – Kolmogorov-Smirnov statistic
  – sum, mean, or median of gene-level statistic
  – Wilcoxon rank sum
  – maxmean statistic
Step 2
• Irrespective of its type, the power of a pathway-level statistic depends on
  – the proportion of differentially expressed genes in a pathway
  – the size of the pathway
  – the amount of correlation between genes in the pathway
• Univariate statistics show more power at stringent cutoffs when applied to real biological data, and power equal to that of multivariate statistics at less stringent cutoffs
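As a rough sketch of Step 2, the code below aggregates gene-level statistics into several of the univariate pathway-level statistics listed above. The gene-level values are hypothetical, and `maxmean` follows the usual description of that statistic (average the positive and negative parts separately over the whole set, then keep the larger); treat it as an assumption rather than a reference implementation.

```python
from statistics import mean, median

def maxmean(stats):
    """Maxmean: average of positive parts and of negative parts, each taken
    over the full gene set; return the larger of the two magnitudes."""
    n = len(stats)
    positive_part = sum(s for s in stats if s > 0) / n
    negative_part = -sum(s for s in stats if s < 0) / n
    return max(positive_part, negative_part)

# Hypothetical gene-level statistics (e.g., t-statistics) for all genes.
gene_stats = {"g1": 2.5, "g2": 1.5, "g3": -0.5, "g4": 0.5, "g5": 0.1}
pathway = ["g1", "g2", "g3", "g4"]

values = [gene_stats[g] for g in pathway]
scores = {
    "sum": sum(values),
    "mean": mean(values),
    "median": median(values),
    "maxmean": maxmean(values),
}
print(scores)
```

All four are univariate: they use each gene's statistic but ignore correlation between genes in the pathway.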
Step 3
• Assessing the statistical significance of the pathway-level statistic
• When computing statistical significance, the null hypothesis tested by current pathway analysis approaches can be broadly divided into two categories:
  – i) competitive null hypothesis
  – ii) self-contained null hypothesis
• A test of a self-contained null hypothesis permutes class labels (i.e., phenotypes) for each sample and compares the set of genes in a given pathway with itself, while ignoring the genes that are not in the pathway
• A test of a competitive null hypothesis permutes gene labels for each pathway, and compares the set of genes in the pathway with a set of genes that are not in the pathway
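The two kinds of permutation test can be contrasted in a short sketch. Everything here is illustrative: the data are simulated, the pathway-level statistic (mean absolute difference in class means) is one arbitrary choice, and the function names are hypothetical. The competitive version is approximated by drawing random gene sets of the same size from all measured genes.

```python
import random
from statistics import mean

def gene_stat(values, labels):
    """Difference in mean expression between class 1 and class 0 for one gene."""
    class0 = [v for v, l in zip(values, labels) if l == 0]
    class1 = [v for v, l in zip(values, labels) if l == 1]
    return mean(class1) - mean(class0)

def pathway_stat(data, labels, genes):
    """Pathway-level statistic: mean absolute gene-level statistic."""
    return mean(abs(gene_stat(data[g], labels)) for g in genes)

def self_contained_p(data, labels, pathway, n_perm=1000, seed=0):
    """Permute class labels; only genes in the pathway are used."""
    rng = random.Random(seed)
    observed = pathway_stat(data, labels, pathway)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if pathway_stat(data, shuffled, pathway) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def competitive_p(data, labels, pathway, n_perm=1000, seed=0):
    """Permute gene labels: compare the pathway with random gene sets."""
    rng = random.Random(seed)
    observed = pathway_stat(data, labels, pathway)
    all_genes = sorted(data)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(all_genes, len(pathway))
        if pathway_stat(data, labels, random_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Simulated data: 20 genes, 8 samples; three pathway genes shifted in class 1.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
data_rng = random.Random(1)
data = {f"g{i}": [data_rng.gauss(0, 1) for _ in labels] for i in range(20)}
pathway = ["g0", "g1", "g2"]
for g in pathway:
    data[g] = [v + 5 * l for v, l in zip(data[g], labels)]

p_self = self_contained_p(data, labels, pathway)
p_comp = competitive_p(data, labels, pathway)
print(p_self, p_comp)
```

With only 8 samples there are few distinct label permutations, so the self-contained p-value cannot get very small; the competitive p-value instead depends on how many genes outside the pathway also change.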
Advantages of FCS Methods
FCS methods address three limitations of ORA
1. Don’t require an arbitrary threshold for dividing expression data into significant and non-significant pools. Rather, FCS methods use all available molecular measurements for pathway analysis.
2. While ORA completely ignores molecular measurements when identifying significant pathways, FCS methods use this information in order to detect coordinated changes in the expression of genes in the same pathway
3. By considering the coordinated changes in gene expression, FCS methods account for dependence between genes in a pathway
Limitations of FCS Methods
• First, similar to ORA, FCS analyzes each pathway independently
  – A gene can function in more than one pathway, so pathways can cross and overlap
  – Consequently, while only one pathway may be affected in an experiment, one may observe other pathways being significantly affected due to the set of overlapping genes
• Such a phenomenon is very common when using GO terms to define pathways, due to the hierarchical nature of GO
Limitations of FCS Methods
• Second, many FCS methods use changes in gene expression to rank genes in a given pathway, and discard the changes from further analysis
  – For instance, assume that two genes in a pathway, A and B, are changing by 2-fold and 20-fold, respectively
  – As long as they both have the same respective ranks in comparison with other genes in the pathway, most FCS methods will treat them equally, although the gene with the higher fold-change should probably get more weight
• Importantly, however, considering only the ranks of genes is also advantageous, as it is more robust to outliers.
  – A notable exception to this scenario is approaches that use gene-level statistics (e.g., t-statistic) to compute pathway-level scores.
  – For example, an FCS method that computes a pathway-level statistic as a sum or mean of the gene-level statistic accounts for a relative difference in measurements (e.g., Category, SAFE).
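The A/B example can be made concrete with a small sketch. The fold-changes for genes C and D and the helper `ranks` are hypothetical; the point is only that a rank-based score cannot tell a 20-fold change from a 2.5-fold change when the ordering of genes is unchanged, while a mean of the values can.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; ties are not handled in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

genes = ["A", "B", "C", "D"]
pathway = ["A", "B"]

# Two hypothetical experiments: B changes 20-fold in one, 2.5-fold in the other.
fold_changes_1 = [2.0, 20.0, 1.1, 1.5]
fold_changes_2 = [2.0, 2.5, 1.1, 1.5]

def rank_sum(fold_changes):
    r = dict(zip(genes, ranks(fold_changes)))
    return sum(r[g] for g in pathway)

def mean_change(fold_changes):
    fc = dict(zip(genes, fold_changes))
    return sum(fc[g] for g in pathway) / len(pathway)

rank_score_1, rank_score_2 = rank_sum(fold_changes_1), rank_sum(fold_changes_2)
mean_score_1, mean_score_2 = mean_change(fold_changes_1), mean_change(fold_changes_2)
print(rank_score_1, rank_score_2, mean_score_1, mean_score_2)
```

The same insensitivity is what makes rank-based scores robust to outliers.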
Pathway Topology (PT)-Based Approaches
• A large number of publicly available pathway knowledge bases provide information beyond simple lists of genes for each pathway
  – KEGG
  – MetaCyc
  – Reactome
  – RegulonDB
  – STKE
  – BioCarta
  – PantherDB
  – ….
• Unlike GO and MSigDB, these knowledge bases also provide information about gene products that interact with each other in a given pathway, how they interact (e.g., activation, inhibition, etc.), and where they interact (e.g., cytoplasm, nucleus, etc.)
Pathway Topology (PT)-Based Approaches
• ORA and FCS methods consider only the number of genes in a pathway or gene coexpression to identify significant pathways, and ignore the additional information available from these knowledge bases
  – Even if the pathways are completely redrawn with new links between the genes, as long as they contain the same set of genes, ORA and FCS will produce the same results
• Pathway topology (PT)-based methods have been developed to use the additional information
  – PT-based methods are essentially the same as FCS methods in that they perform the same three steps as FCS methods
  – The key difference between the two is the use of pathway topology to compute gene-level statistics
Pathway Topology (PT)-Based Approaches
• Rahnenfuhrer et al. proposed ScorePAGE, which computes similarity between each pair of genes in a pathway (e.g., correlation, covariance, etc.)
  – The similarity measurement between each pair of genes is analogous to gene-level statistics in FCS methods
  – The pairwise similarities are averaged to compute a pathway-level score
• Instead of giving equal weight to all pairwise similarities, ScorePAGE divides the pairwise similarities by the number of reactions needed to connect two genes in a given pathway
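The idea of down-weighting similarities by topological distance can be sketched as follows. This is not the ScorePAGE implementation: it assumes Pearson correlation as the similarity, uses shortest-path length in a toy undirected gene graph as a stand-in for "number of reactions", and all names and values are hypothetical.

```python
from collections import deque
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def distance(graph, start, goal):
    """Breadth-first search: number of edges on the shortest path, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, steps = queue.popleft()
        if node == goal:
            return steps
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, steps + 1))
    return None

def topology_weighted_score(expr, graph):
    """Average of pairwise similarity divided by pairwise distance."""
    terms = []
    for g1, g2 in combinations(sorted(expr), 2):
        d = distance(graph, g1, g2)
        if d:  # skip pairs that are not connected
            terms.append(pearson(expr[g1], expr[g2]) / d)
    return sum(terms) / len(terms)

# Toy pathway A - B - C (a chain), with hypothetical expression profiles.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
expr = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [1, 3, 2, 4]}

score = topology_weighted_score(expr, graph)
print(score)
```

Redrawing the graph (for example, linking A directly to C) changes the score even though the gene set is identical, which is exactly what ORA and FCS cannot capture.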
Pathway Topology (PT)-Based Approaches
• Impact factor (IF) analysis
  – IF considers the structure and dynamics of an entire pathway by incorporating a number of important biological factors, including changes in gene expression, types of interactions, and the positions of genes in a pathway
Ali will talk more about these approaches in detail!
IF Analysis
• Briefly…
  – Models a signaling pathway as a graph, where nodes represent genes and edges represent interactions between them
  – Defines a gene-level statistic, called the perturbation factor (PF) of a gene, as a sum of its measured change in expression and a linear function of the perturbation factors of all genes in a pathway
  – Because the PF of each gene is defined by a linear equation, the entire pathway is defined as a linear system
    • addresses loops in the pathways
  – The IF of a pathway (pathway-level statistic) is defined as a sum of the PFs of all genes in a pathway
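The linear-system view can be sketched directly. Writing PF = ΔE + B·PF, where ΔE holds the measured expression changes and B holds interaction coefficients, gives (I − B)·PF = ΔE, which is solvable even when the pathway contains a loop. The coefficient matrix `beta`, the three-gene pathway, and the values below are hypothetical; this follows the simplified description above rather than the full published method.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

# Hypothetical pathway with a loop: g1 -> g2 -> g3 -| g1.
# beta[i][j] is the (assumed) influence of gene j on gene i.
genes = ["g1", "g2", "g3"]
beta = [
    [0.0, 0.0, -0.5],   # g3 inhibits g1
    [0.5, 0.0, 0.0],    # g1 activates g2
    [0.0, 0.5, 0.0],    # g2 activates g3
]
delta_e = [2.0, 0.0, 1.0]  # measured expression changes (hypothetical)

# Build (I - B) and solve for the perturbation factors.
n = len(genes)
system = [[(1.0 if i == j else 0.0) - beta[i][j] for j in range(n)] for i in range(n)]
pf = solve_linear(system, delta_e)

pathway_score = sum(pf)  # pathway-level statistic as described above
print(dict(zip(genes, pf)), pathway_score)
```

Gene g2 has no measured change of its own but still receives a nonzero PF because it sits downstream of g1.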
Pathway Topology (PT)-Based Approaches
• FCS methods that use correlations among genes implicitly assume that the underlying network, as defined by the correlation structure, does not change as the experimental conditions change
• This assumption may be inaccurate → PT approaches improve on this
Pathway Topology (PT)-Based Approaches
• NetGSA accounts for the change in correlation as well as the change in network structure as experimental conditions change
  – Like IF analysis, it models gene expression as a linear function of other genes in the network
• It differs from IF in two aspects
  – First, it accounts for a gene's baseline expression by representing it as a latent variable in the model
  – Second, it requires that the pathways be represented as directed acyclic graphs (DAGs)
    • If a pathway contains cycles, NetGSA requires additional latent variables affecting the nodes in the cycle.
    • In contrast, IF analysis does not impose any constraint on the structure of a pathway
Limitations of PT-Based Approaches
• True pathway topology depends on the type of cell (due to cell-specific gene expression profiles) and on the condition being studied
  – This information is rarely available
  – It is fragmented across knowledge bases, if available
  – As annotations improve, these approaches are expected to become more useful
• Inability to model dynamic states of a system
• Inability to consider interactions between pathways: inter-pathway links in the knowledge bases are too weak to account for interdependence between pathways
R package: netgsa
Outstanding Challenges
• Broad categories:
  1. annotation challenges
  2. methodological challenges
Outstanding Challenges
• Next generation approaches will require improvement of the existing annotations
  – It is necessary to create accurate, high resolution knowledge bases with detailed condition-, tissue-, and cell-specific functions of each gene
    • PharmGKB ….
  – These knowledge bases will allow investigators to model an organism's biology as a dynamic system, and will help predict changes in the system due to factors such as mutations or environmental changes
Annotation Challenges
• Low resolution knowledge bases
• Incomplete and inaccurate annotations
• Missing condition- and cell-specific information
Figure: Green arrows represent abundantly available information, and red arrows represent missing and/or incomplete information. The ultimate goal of pathway analysis is to analyze a biological system as a large, single network. However, the links between smaller individual pathways are not yet well known. Furthermore, the effects of a SNP on a given pathway are also missing from current knowledge bases. While some pathways are known to be related to a few diseases, it is not clear whether the changes in pathways are the cause for those diseases or the downstream effects of the diseases.
Low Resolution Knowledge Bases
• Knowledge bases are not as high resolution as the technologies
  – Based on RNA-seq, more than 90% of human genes are estimated to be alternatively spliced
  – Multiple transcripts from the same gene may have related, distinct, or even opposing functions
  – GWAS have identified a large number of SNPs that may be involved in different conditions and diseases.
  – However, current knowledge bases only specify which genes are active in a given pathway
  – It is essential that they also begin specifying other information, such as transcripts that are active in a given pathway or how a given SNP affects a pathway
Low Resolution Knowledge Bases
• Because of these low resolution knowledge bases, every available pathway analysis tool first maps the input to a non-redundant namespace, typically an Entrez Gene ID
  – This type of mapping is advantageous, although it can be non-trivial, as it allows the existing pathway analysis approaches to be independent of the technology used in the experiment
  – However, mapping in this way also results in the loss of important information that may have been provided because a specific technology was used
    • XRN2a, a variant of gene XRN2, is expressed in several human tissues, whereas another variant of the same gene, XRN2b, is mainly expressed in blood leukocytes
    • Although RNA-seq can quantify expression of both variants, mapping both transcripts to a single gene causes loss of tissue-specific information, and possibly even condition-specific information
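A small sketch shows how collapsing transcripts to a gene identifier discards the variant-level signal. The expression values and tissue names are hypothetical, and summing transcript values is just one assumed way of collapsing; real tools may use other rules.

```python
from collections import defaultdict

# Mapping from transcript variants to a single gene identifier.
transcript_to_gene = {"XRN2a": "XRN2", "XRN2b": "XRN2"}

# Hypothetical transcript-level expression in two tissues.
transcript_expr = {
    "colon":     {"XRN2a": 8.0, "XRN2b": 0.5},
    "leukocyte": {"XRN2a": 6.0, "XRN2b": 9.0},
}

def collapse_to_genes(expr, mapping):
    """Sum transcript-level values per gene (one possible collapsing rule)."""
    totals = defaultdict(float)
    for transcript, value in expr.items():
        totals[mapping[transcript]] += value
    return dict(totals)

gene_expr = {tissue: collapse_to_genes(expr, transcript_to_gene)
             for tissue, expr in transcript_expr.items()}
print(gene_expr)
```

After collapsing, each tissue has a single XRN2 value, and the fact that XRN2b drives the leukocyte signal is no longer recoverable.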
Low Resolution Knowledge Bases
• Therefore, before pathway analysis can exploit current and future technological advances in biotechnology, it is critically important to annotate exact transcripts and SNPs that participate in a given pathway
• While new approaches are being developed in this regard, they may not yet be adequate
  – Braun et al. proposed a method for analyzing SNP data from a GWAS
  – It still relies on mapping multiple SNPs to a single gene, followed by gene-to-pathway mapping
Incomplete and Inaccurate Annotation
• A surprisingly large number of genes are still not annotated
• Many of the genes are hypothetical, predicted, or pseudogenes
  – Although the number of protein-coding genes in the human genome is estimated to be between 20,000 and 25,000, according to Entrez Gene there are 45,283 human genes, of which 14,162 are pseudogenes
  – One could argue that the pseudogenes should not be included when evaluating functional annotation coverage
  – However, pseudogene-derived small interfering RNAs have been shown to regulate gene expression in mouse oocytes
  – GO provides annotations for 271 pseudogenes
  – A widely used DNA microarray, Affymetrix HG-U133 Plus 2.0, contains 1,026 probe sets that correspond to 823 pseudogenes
  – Should pseudogenes be included in the count when estimating annotation coverage for the human genome?
Incomplete and Inaccurate Annotation
Figure: Number of GO-annotated genes (left panel) and number of GO annotations (right panel) for human from January 2003 to November 2009. As the estimated number of known genes in the human genome is adjusted (between January 2003 and December 2003) and annotation practices are modified (between December 2004 and December 2005, and between October 2008 and November 2009), one can argue that, although the number of annotated genes and the annotations are decreasing (which is mainly due to the adjusted number of genes in the human genome and changes in the annotation process), the quality of annotations is improving, as demonstrated by the steady increase in non-IEA annotations and the number of genes with non-IEA annotations. However, the increase in the number of genes with non-IEA annotations is very slow. In almost 7 years, between January 2003 and November 2009, only 2,039 new genes received non-IEA annotations. At the same time, the number of non-IEA annotations increased from 35,925 to 65,741, indicating a strong research bias for a small number of genes. doi:10.1371/journal.pcbi.1002375.g003
Incomplete and Inaccurate Annotation
• Additionally, many of the existing annotations are of low quality and may be inaccurate
  – >90% of the annotations in the October 2015 release of GO had the evidence code “inferred from electronic annotations (IEA)”
  – IEA annotations are the only ones in GO that are not curated manually
  – Annotations inferred from indirect evidence are considered to be of lower quality than those derived from direct experimental evidence
  – If the annotations with IEA code are removed, the number of genes with good quality annotations in the November 2015 release of human GO annotations is reduced from ~18K to ~12K
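Filtering by evidence code is straightforward once annotations are in tabular form. The sketch below uses a tiny hand-made list of (gene, GO term, evidence code) records; the gene names are placeholders and the records are illustrative, not taken from any GO release.

```python
# Hypothetical annotation records: (gene, GO term, evidence code).
annotations = [
    ("GENE_A", "GO:0006915", "IEA"),  # electronically inferred
    ("GENE_A", "GO:0008283", "IDA"),  # inferred from direct assay
    ("GENE_B", "GO:0006915", "IEA"),
    ("GENE_C", "GO:0007165", "TAS"),  # traceable author statement
]

def drop_iea(records):
    """Keep only annotations whose evidence code is not IEA."""
    return [r for r in records if r[2] != "IEA"]

curated = drop_iea(annotations)

genes_all = {gene for gene, _, _ in annotations}
genes_curated = {gene for gene, _, _ in curated}
print(len(genes_all), len(genes_curated), sorted(genes_all - genes_curated))
```

Here GENE_B disappears entirely once IEA annotations are dropped, the same effect that reduces the number of annotated human genes in the bullet above.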
Incomplete and Inaccurate Annotation
• It is very likely that the reduced number of annotations and annotated genes since January 2003 is an indicator of improving quality
• This is due in part to the fact that the number of genes in a genome is continuously being adjusted and the functional annotation algorithms are being improved
  – The number of non-IEA annotations is continuously increasing
• However, the rate of increase for non-IEA annotations is very slow (approximately 2,000 genes annotated in 7 years)
Incomplete and Inaccurate Annotation
• Manual curation of the entire genome is expected to take a very long time (~13–25 years)
• The entire research community could participate in the curation process
• One approach to facilitate participation of a large number of researchers is to adopt a standard annotation format similar to Minimum Information About a Microarray Experiment (MIAME)
  – Should this be required, like GEO?
• A format for functional annotation can be designed or adopted from the existing formats (e.g., BioPAX, SBML)
  – Such a format could allow researchers to specify an experimentally confirmed role of a specific transcript or a SNP in a pathway along with experimental and biological conditions
Missing Condition- and Cell-Specific Information
• Most pathway knowledge bases are built by curating experiments performed in different cell types at different time points under different conditions
• These details are typically not available in the knowledge bases!
• One effect of this omission is that multiple independent genes are annotated to participate in the same interaction in a pathway
• This effect is so widespread that many pathway knowledge bases represent a set of distinct genes as a single node in a pathway
Missing Condition- and Cell-Specific Information
• Example: Wnt/beta-catenin pathway in STKE
  – The node labeled “Genes” represents 19 genes directly targeted by Wnt in different organisms (Xenopus and human) in different cells and tissues (colon carcinoma cells and epithelial cells)
  – These non-specific genes introduce bias for these pathways in all existing analysis approaches
  – For instance, any ORA method will assign higher significance (typically an order of magnitude lower p-value) to a pathway with more genes
  – Similarly, more genes in a pathway also increase the probability of a higher pathway-level statistic in FCS approaches, yielding higher significance for a given pathway.
Missing Condition- and Cell-Specific Information
• This contextual information is typically not available from most of the existing knowledge bases
• A standard functional annotation format, as discussed above, would make this information available to curators and developers
  – For instance, the recently proposed Biological Connection Markup Language (BCML) allows pathway representation to specify the cell or organism in which each pathway interaction occurs.
  – BCML can generate cell-, condition-, or organism-specific pathways based on user-defined query criteria, which in turn can be used for targeted analysis
Missing Condition- and Cell-Specific Information
• Existing knowledge bases do not describe the effects of an abnormal condition on a pathway
  – For example, it is not clear how the Alzheimer's disease pathway in KEGG differs from a normal pathway
  – Nor is it clear which set of interactions leads to Alzheimer's disease
• We are now coming to understand that context plays an important role in pathway interactions
• Information about how cell and tissue type, age, and environmental exposures affect pathway interactions will add complexity that is currently lacking
Methodological Challenges
• Benchmark data sets for comparing different methods
• Inability to model and analyze dynamic response
• Inability to model effects of external stimuli
Comparing Different Methods
• How do we compare different pathway analysis methods?
• Simulated data
  – Advantages:
    • Real signal is simulated, so the “true” answer is known
  – Disadvantages:
    • Cannot contain all the complexity of real data
    • The success of a method can reflect how well the simulation matches the knowledge base structure used
Comparing Different Methods
• Benchmark data
  – Advantages:
    • Can compare sensitivity and specificity
    • Several data sets have been consistently used in the literature
    • Includes all the complexity of real biological data
  – Disadvantages:
    • Affected by confounding factors
      – absence of a pure division into classes
      – presence of outliers
      – ….
    • No true answer known for grounded comparisons
      – actual biology isn't known
Comparing Different Methods
• A general challenge: different definitions of the same pathway in different knowledge bases can affect performance assessment
  – GO defines different pathways for apoptosis in different cells
    • (e.g., cardiac muscle cell apoptosis, B cell apoptosis, T cell apoptosis)
    • Further distinguishes between induction and regulation of apoptosis
  – KEGG defines a single signaling pathway for apoptosis
    • does not distinguish between induction and regulation
  – An approach using KEGG would identify a single pathway as significant, whereas GO could identify multiple pathways, and/or specific aspects of a single apoptosis pathway
Inability to model and analyze dynamic response
• No existing approach can collectively model and analyze high-throughput data as a single dynamic system
• Current approaches analyze a snapshot, assuming that each pathway is independent of the others at a given time
  – They measure expression changes at multiple time points and analyze each time point individually
  – This implicitly assumes that pathways at different time points are independent
• Need models that account for dependence among pathways at different time points
  – Much of this limitation is due to technology/experimental design → not all bioinformatics limitations
Inability to model effects of external stimuli
• Gene set–based approaches often consider only genes and their products
• They completely ignore the effects of other molecules participating in a pathway
  – such as the rate-limiting step of a multi-step pathway.
• Example:
  – The amount/strength of Ca2+ causes different transcription factors to be activated
  – This information is usually not available.
Summary
• In the last decade, pathway analysis has matured and become the standard for trying to dissect the biology of high-throughput experiments.
• Many similarities across the three main generations of pathway analysis tools.
• Will discuss more details of some of these choices, knowledge bases, and specific approaches next.
• Many open methods development challenges!
Overview of Module
• First half:
  – Overview of gene set and pathway analysis
    • Commonly used databases and annotation issues
    • 1st and 2nd generation tools
      – Basic differences in methods
      – Details on very popular methods
    • Issues with different “omics” data types
• Second half:
  – “3rd generation” methods
  – Network analysis modeling
Questions?