Post on 10-Aug-2019

Introduction to Pathway and Network Analysis

Alison Motsinger-Reif, PhD
Associate Professor
Bioinformatics Research Center
Department of Statistics
North Carolina State University

Pathway and Network Analysis
• High-throughput genetic/genomic technologies enable comprehensive monitoring of a biological system
• Analysis of high-throughput data typically yields a list of differentially expressed genes, proteins, metabolites, …
  – Typically provides lists of single genes, etc.
  – We will use “genes” throughout, but the term is mostly interchangeable with proteins, metabolites, etc.
• This list often fails to provide mechanistic insights into the underlying biology of the condition being studied
• How to extract meaning from a long list of differentially expressed genes → pathway/network analysis

What makes an airplane fly?

Chas' Stainless Steel, Mark Thompson's Airplane Parts, About 1000 Pounds of Stainless Steel Wire, and Gagosian's Beverly Hills Space

From components to networks
A biological function is a result of many interacting molecules and cannot be attributed to just a single molecule.

Pathway and Network Analysis
• One approach: simplify analysis by grouping long lists of individual genes into smaller sets of related genes, which reduces the complexity of analysis
  – A large number of knowledge bases have been developed to help with this task
• Knowledge bases
  – describe biological processes, components, or structures in which individual genes are known to be involved
  – describe how and where gene products interact with each other

Pathway and Network Analysis
• Analysis at the functional level is appealing for two reasons:
  – First, grouping thousands of genes by the pathways they are involved in reduces the complexity to just several hundred pathways for the experiment
  – Second, identifying active pathways that differ between two conditions can have more explanatory power than a simple list of genes

Pathway and Network Analysis
• What kinds of data are used for such analysis?
  – Gene expression data
    • Microarrays
    • RNA-seq
  – Proteomic data
  – Metabolomics data
  – Single nucleotide polymorphisms (SNPs)
  – …

Pathway and Network Analysis
• What kinds of questions can we ask/answer with these approaches?

Pathway and Network Analysis
• The term “pathway analysis” gets used often, and often in different ways
  – applied to the analysis of Gene Ontology (GO) terms (also referred to as a “gene set”)
  – physical interaction networks (e.g., protein–protein interactions)
  – kinetic simulation of pathways
  – steady-state pathway analysis (e.g., flux-balance analysis)
  – inference of pathways from expression and sequence data
• May or may not actually describe biological pathways

Pathway and Network Analysis
• For the first part of this module, we will focus on methods that exploit pathway knowledge in public repositories rather than on methods that infer pathways from molecular measurements
  – Use repositories such as GO or the Kyoto Encyclopedia of Genes and Genomes (KEGG)
  → knowledge base-driven pathway analysis

A History of Pathway Analysis Approaches
• Over a decade of development of pathway analysis approaches
• Can be roughly divided into three generations:
  – 1st: Over-Representation Analysis (ORA) Approaches
  – 2nd: Functional Class Scoring (FCS) Approaches
  – 3rd: Pathway Topology (PT)-Based Approaches

Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.

• The data generated by an experiment using a high-throughput technology (e.g., microarray, proteomics, metabolomics), along with functional annotations (pathway database) of the corresponding genome, are input to virtually all pathway analysis methods.
• ORA methods require that the input is a list of differentially expressed genes
• FCS methods use the entire data matrix as input
• PT-based methods additionally utilize the number and type of interactions between gene products, which may or may not be a part of a pathway database.
• The result of every pathway analysis method is a list of significant pathways in the condition under study.

Over-Representation Analysis (ORA) Approaches
• Earliest methods → over-representation analysis (ORA)
• Statistically evaluates the fraction of genes in a particular pathway found among the set of genes showing changes in expression
• It is also referred to as the “2×2 table method” in the literature

Over-Representation Analysis (ORA)
• Uses one or more variations of the following strategy:
  – First, an input list is created using a certain threshold or criterion
    • For example, one may choose genes that are differentially over- or under-expressed in a given condition at a false discovery rate (FDR) of 5%
  – Then, for each pathway, input genes that are part of the pathway are counted
  – This process is repeated for an appropriate background list of genes
    • (e.g., all genes measured on a microarray)
  – Next, every pathway is tested for over- or under-representation in the list of input genes
    • The most commonly used tests are based on the hypergeometric, chi-square, or binomial distribution
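
The counting-and-testing strategy above can be sketched in a few lines. This is a minimal illustration, assuming the hypergeometric test for over-representation only; the gene names, the 20-gene background, and the function name `ora_pvalue` are hypothetical and not taken from any particular tool.

```python
from math import comb

def ora_pvalue(n_background, n_pathway, n_input, n_overlap):
    """Upper-tail hypergeometric p-value: P(overlap >= n_overlap)."""
    total = comb(n_background, n_input)
    upper = min(n_pathway, n_input)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_input - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical toy data: 20 measured genes, one 5-gene pathway,
# and an input list of 4 differentially expressed genes.
background = {f"g{i}" for i in range(1, 21)}
pathway = {"g1", "g2", "g3", "g4", "g5"}
input_list = {"g1", "g2", "g3", "g10"}

# Count input genes that belong to the pathway, then test.
overlap = len(pathway & input_list)
p = ora_pvalue(len(background), len(pathway), len(input_list), overlap)
print(f"overlap={overlap}, p={p:.4f}")
```

Note that only the four counts enter the test; the measured expression values never do, which is the first limitation discussed for ORA.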


Limitations of ORA Approaches
• First, the different statistics used by ORA are independent of the measured changes
  – (e.g., hypergeometric distribution, binomial distribution, chi-square distribution, etc.)
• Tests consider the number of genes alone but ignore any values associated with them
  – such as probe intensities
• By discarding this data, ORA treats each gene equally
  – Information about the extent of regulation (e.g., fold-changes, significance of a change, etc.) can be useful in assigning different weights to input genes/pathways
  – This can provide more information

Limitations of ORA Approaches
• Second, ORA typically uses only the most significant genes and discards the others
  – The input list of genes is usually obtained using an arbitrary threshold (e.g., on fold-change and/or p-value)
• Marginally less significant genes are missed, resulting in information loss
  – (e.g., fold-change = 1.999 or p-value = 0.051)
  – A few methods avoid thresholds
    • They use an iterative approach that adds one gene at a time to find a set of genes for which a pathway is most significant

Limitations of ORA Approaches
• Third, ORA assumes that each gene is independent of the other genes
• However, biology is a complex web of interactions between gene products that constitute different pathways
  – One goal might be to gain insights into how interactions between gene products are manifested as changes in expression
  – A strategy that assumes the genes are independent is significantly limited in its ability to provide insights
• Furthermore, assuming independence between genes amounts to “competitive null hypothesis” testing (more later), which ignores the correlation structure between genes
  – The estimated significance of a pathway may be biased or incorrect

Limitations of ORA Approaches
• Fourth, ORA assumes that each pathway is independent of other pathways → NOT TRUE!
• Examples of dependence:
  – GO defines a biological process as a series of events accomplished by one or more ordered assemblies of molecular functions
  – The cell cycle pathway in KEGG, where the presence of a growth factor activates the MAPK signaling pathway
    • This, in turn, activates the cell cycle pathway
• No ORA methods account for this dependence between molecular functions in GO and signaling pathways in KEGG

Functional Class Scoring (FCS) Approaches
• The hypothesis of functional class scoring (FCS) is that although large changes in individual genes can have significant effects on pathways, weaker but coordinated changes in sets of functionally related genes (i.e., pathways) can also have significant effects
• With few exceptions, all FCS methods use a variation of a general framework that consists of the following three steps.

Step 1
• First, a gene-level statistic is computed using the molecular measurements from an experiment
  – Involves computing differential expression of individual genes or proteins
• Statistics currently used at the gene level include:
  – correlation of molecular measurements with phenotype
  – ANOVA
  – Q-statistic
  – signal-to-noise ratio
  – t-test
  – Z-score

Step 1
• Choice of a gene-level statistic generally has a negligible effect on the identification of significantly enriched gene sets
  – However, when there are few biological replicates, a regularized statistic may be better
• Untransformed gene-level statistics can fail to identify pathways with up- and down-regulated genes
  – In this case, transformation of gene-level statistics (e.g., absolute values, squared values, ranks, etc.) is better
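
A small sketch of why the transformation matters, assuming Welch's t-statistic as the gene-level statistic and a hypothetical two-gene pathway with one gene up and one gene down. All values and names here are illustrative only.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(case, control):
    """Welch's t-statistic comparing two groups for one gene."""
    se = sqrt(variance(case) / len(case) + variance(control) / len(control))
    return (mean(case) - mean(control)) / se

# Hypothetical pathway: (case values, control values) per gene.
expression = {
    "gene_up": ([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]),
    "gene_down": ([1.0, 2.0, 3.0], [5.0, 6.0, 7.0]),
}

t_stats = {g: welch_t(case, ctrl) for g, (case, ctrl) in expression.items()}

# Averaging raw statistics lets the up and down genes cancel out;
# averaging absolute values keeps the signal.
raw_summary = mean(list(t_stats.values()))
abs_summary = mean([abs(t) for t in t_stats.values()])
print(t_stats, raw_summary, abs_summary)
```

Both genes change strongly, yet the untransformed average is zero, so a pathway score built on it would miss this pathway.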

Step 2
• Second, the gene-level statistics for all genes in a pathway are aggregated into a single pathway-level statistic
  – can be multivariate and account for interdependencies among genes
  – can be univariate and disregard interdependencies among genes
• The pathway-level statistics used include:
  – Kolmogorov-Smirnov statistic
  – sum, mean, or median of gene-level statistic
  – Wilcoxon rank sum
  – maxmean statistic

Step 2
• Irrespective of its type, the power of a pathway-level statistic depends on
  – the proportion of differentially expressed genes in a pathway
  – the size of the pathway
  – the amount of correlation between genes in the pathway
• Univariate statistics show more power at stringent cutoffs when applied to real biological data, and power equal to that of multivariate statistics at less stringent cutoffs

Step 3
• Assessing the statistical significance of the pathway-level statistic
• When computing statistical significance, the null hypothesis tested by current pathway analysis approaches can be broadly divided into two categories:
  – i) competitive null hypothesis
  – ii) self-contained null hypothesis
• A self-contained null hypothesis permutes class labels (i.e., phenotypes) for each sample and compares the set of genes in a given pathway with itself, while ignoring the genes that are not in the pathway
• A competitive null hypothesis permutes gene labels for each pathway, and compares the set of genes in the pathway with a set of genes that are not in the pathway
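
The two kinds of permutation can be contrasted in a short sketch. It assumes a deliberately simple setup: the gene-level statistic is a difference in group means, the pathway-level statistic is the mean absolute gene-level statistic, and the competitive null is approximated by drawing random gene sets of the same size. The data and all function names are hypothetical.

```python
import random
from statistics import mean

def gene_stat(values, labels):
    """Difference in group means (case minus control) for one gene."""
    case = [v for v, lab in zip(values, labels) if lab == 1]
    ctrl = [v for v, lab in zip(values, labels) if lab == 0]
    return mean(case) - mean(ctrl)

def pathway_score(data, labels, genes):
    """Mean absolute gene-level statistic over the genes in a set."""
    return mean([abs(gene_stat(data[g], labels)) for g in genes])

def self_contained_p(data, labels, pathway, n_perm, rng):
    """Permute sample (class) labels; only pathway genes are used."""
    observed = pathway_score(data, labels, pathway)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if pathway_score(data, shuffled, pathway) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def competitive_p(data, labels, pathway, n_perm, rng):
    """Permute gene labels: compare against random sets of equal size."""
    observed = pathway_score(data, labels, pathway)
    all_genes = sorted(data)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(all_genes, len(pathway))
        if pathway_score(data, labels, random_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: 8 genes x 6 samples (3 cases, then 3 controls).
labels = [1, 1, 1, 0, 0, 0]
data = {
    "g1": [5.1, 4.9, 5.3, 3.0, 3.2, 2.9],
    "g2": [1.0, 1.2, 0.9, 3.1, 2.8, 3.0],
    "g3": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    "g4": [4.0, 3.9, 4.1, 4.1, 4.0, 3.9],
    "g5": [1.5, 1.4, 1.6, 1.5, 1.6, 1.4],
    "g6": [3.3, 3.2, 3.4, 3.3, 3.1, 3.5],
    "g7": [0.9, 1.1, 1.0, 1.0, 0.9, 1.1],
    "g8": [2.5, 2.6, 2.4, 2.6, 2.5, 2.4],
}
pathway = ["g1", "g2"]

rng = random.Random(0)
observed = pathway_score(data, labels, pathway)
p_self = self_contained_p(data, labels, pathway, 200, rng)
p_comp = competitive_p(data, labels, pathway, 200, rng)
print(observed, p_self, p_comp)
```

With only six samples there are few distinct label arrangements, so the self-contained p-value cannot become very small here; the competitive p-value instead depends on how the remaining genes behave.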


Advantages of FCS Methods
FCS methods address three limitations of ORA
1. They don't require an arbitrary threshold for dividing expression data into significant and non-significant pools. Rather, FCS methods use all available molecular measurements for pathway analysis.
2. While ORA completely ignores molecular measurements when identifying significant pathways, FCS methods use this information in order to detect coordinated changes in the expression of genes in the same pathway.
3. By considering the coordinated changes in gene expression, FCS methods account for dependence between genes in a pathway.

Limitations of FCS Methods
• First, similar to ORA, FCS analyzes each pathway independently
  – A gene can function in more than one pathway, meaning that pathways can cross and overlap
  – Consequently, while one pathway may be affected in an experiment, one may observe other pathways being significantly affected due to the set of overlapping genes
• Such a phenomenon is very common when using GO terms to define pathways, due to the hierarchical nature of GO
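
One simple way to see how much two gene sets overlap is the Jaccard index. The sketch below uses hypothetical gene sets, with a "child" set nested inside a "parent" set to mimic a hierarchy; it is an illustration, not a method from the slides.

```python
def jaccard(set_a, set_b):
    """Shared genes divided by all genes found in either set."""
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical gene sets: a parent term, a more specific child term,
# and an unrelated term.
gene_sets = {
    "parent_term": {"g1", "g2", "g3", "g4", "g5", "g6"},
    "child_term": {"g1", "g2", "g3"},
    "unrelated_term": {"g7", "g8"},
}

overlap_parent_child = jaccard(gene_sets["parent_term"], gene_sets["child_term"])
overlap_parent_unrelated = jaccard(gene_sets["parent_term"], gene_sets["unrelated_term"])
print(overlap_parent_child, overlap_parent_unrelated)
```

If g1 to g3 change, both the parent and the child term will tend to score as significant even though only one underlying process may be affected.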

Limitations of FCS Methods
• Second, many FCS methods use changes in gene expression to rank genes in a given pathway, and discard the changes from further analysis
  – For instance, assume that two genes in a pathway, A and B, are changing by 2-fold and 20-fold, respectively
  – As long as they both have the same respective ranks in comparison with other genes in the pathway, most FCS methods will treat them equally, although the gene with the higher fold-change should probably get more weight
• Importantly, however, considering only the ranks of genes is also advantageous, as it is more robust to outliers.
  – A notable exception to this scenario is approaches that use gene-level statistics (e.g., t-statistic) to compute pathway-level scores.
  – For example, an FCS method that computes a pathway-level statistic as a sum or mean of the gene-level statistic accounts for a relative difference in measurements (e.g., Category, SAFE).

Pathway Topology (PT)-Based Approaches
• A large number of publicly available pathway knowledge bases provide information beyond simple lists of genes for each pathway
  – KEGG
  – MetaCyc
  – Reactome
  – RegulonDB
  – STKE
  – BioCarta
  – PantherDB
  – …
• Unlike GO and MSigDB, these knowledge bases also provide information about gene products that interact with each other in a given pathway, how they interact (e.g., activation, inhibition, etc.), and where they interact (e.g., cytoplasm, nucleus, etc.)

Pathway Topology (PT)-Based Approaches
• ORA and FCS methods consider only the number of genes in a pathway or gene coexpression to identify significant pathways, and ignore the additional information available from these knowledge bases
  – Even if the pathways are completely redrawn with new links between the genes, as long as they contain the same set of genes, ORA and FCS will produce the same results
• Pathway topology (PT)-based methods have been developed to use the additional information
  – PT-based methods are essentially the same as FCS methods in that they perform the same three steps as FCS methods
  – The key difference between the two is the use of pathway topology to compute gene-level statistics

Pathway Topology (PT)-Based Approaches
• Rahnenfuhrer et al. proposed ScorePAGE, which computes similarity between each pair of genes in a pathway (e.g., correlation, covariance, etc.)
  – the similarity measurement between each pair of genes is analogous to gene-level statistics in FCS methods
  – the similarities are averaged to compute a pathway-level score
• Instead of giving equal weight to all pairwise similarities, ScorePAGE divides the pairwise similarities by the number of reactions needed to connect two genes in a given pathway
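
The distance-weighting idea can be sketched as follows. This is a simplified illustration of the description above, assuming Pearson correlation as the similarity and a hand-written table of reaction distances; it does not reproduce the published ScorePAGE normalization, and all names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression profiles for three genes in one pathway.
profiles = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, 3.0, 2.0, 4.0],
}
# Hypothetical number of reactions needed to connect each gene pair.
reactions_between = {("a", "b"): 1, ("a", "c"): 2, ("b", "c"): 1}

pairs = list(combinations(sorted(profiles), 2))
similarity = {p: pearson(profiles[p[0]], profiles[p[1]]) for p in pairs}

# Equal weights versus similarities divided by reaction distance.
unweighted_score = mean([similarity[p] for p in pairs])
weighted_score = mean([similarity[p] / reactions_between[p] for p in pairs])
print(unweighted_score, weighted_score)
```

Pairs that are further apart in the pathway contribute less, so redrawing the links between the same genes would change the weighted score but not the unweighted one.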

Pathway Topology (PT)-Based Approaches
• Impact factor (IF) analysis
  – IF considers the structure and dynamics of an entire pathway by incorporating a number of important biological factors, including changes in gene expression, types of interactions, and the positions of genes in a pathway

Ali will talk more about these approaches in detail!

IF Analysis
• Briefly…
  – Models a signaling pathway as a graph, where nodes represent genes and edges represent interactions between them
  – Defines a gene-level statistic, called the perturbation factor (PF) of a gene, as a sum of its measured change in expression and a linear function of the perturbation factors of all genes in a pathway
  – Because the PF of each gene is defined by a linear equation, the entire pathway is defined as a linear system
    • addresses loops in the pathways
  – The IF of a pathway (pathway-level statistic) is defined as a sum of the PFs of all genes in a pathway
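
Read as a linear system, the description above says PF = ΔE + B·PF, i.e. (I − B)·PF = ΔE, where B holds the interaction weights. The sketch below solves such a system for a hypothetical three-gene pathway that contains a loop. It follows only the wording on this slide; the edge weights and expression changes are made up, and normalization details of the published method are not reproduced.

```python
def solve_linear(a, b):
    """Solve a·x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

genes = ["g1", "g2", "g3"]
# Hypothetical measured expression change per gene.
delta_e = [1.0, 0.0, 0.0]
# weights[i][j] = assumed influence of gene j on gene i.
# Edges: g1 -> g2 (1.0), g2 -> g3 (1.0), g3 -> g1 (0.5), forming a loop.
weights = [
    [0.0, 0.0, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

# Build (I - B) and solve for the perturbation factors.
n = len(genes)
system = [
    [(1.0 if i == j else 0.0) - weights[i][j] for j in range(n)]
    for i in range(n)
]
pf = solve_linear(system, delta_e)

# Pathway-level statistic as described above: sum of the PFs.
impact = sum(pf)
print(dict(zip(genes, pf)), impact)
```

Solving the whole system at once is what lets the loop g1 → g2 → g3 → g1 be handled without iterating around the cycle.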

Pathway Topology (PT)-Based Approaches
• FCS methods that use correlations among genes implicitly assume that the underlying network, as defined by the correlation structure, does not change as the experimental conditions change
• This assumption may be inaccurate → PT approaches improve on this

Pathway Topology (PT)-Based Approaches
• NetGSA accounts for the change in correlation as well as the change in network structure as experimental conditions change
  – like IF analysis, it models gene expression as a linear function of other genes in the network
• It differs from IF in two aspects
  – First, it accounts for a gene's baseline expression by representing it as a latent variable in the model
  – Second, it requires that the pathways be represented as directed acyclic graphs (DAGs)
    • If a pathway contains cycles, NetGSA requires additional latent variables affecting the nodes in the cycle.
    • In contrast, IF analysis does not impose any constraint on the structure of a pathway

Limitations of PT-based Approaches
• True pathway topology is dependent on the type of cell due to cell-specific gene expression profiles and the condition being studied
  – information is rarely available
  – fragmented in knowledge bases if available
  – As annotations improve, these approaches are expected to become more useful
• Inability to model dynamic states of a system
• Inability to consider interactions between pathways due to weak inter-pathway links to account for interdependence between pathways


R package netgsa

Outstanding Challenges
• Broad categories:
  1. annotation challenges
  2. methodological challenges

Outstanding Challenges
• Next generation approaches will require improvement of the existing annotations
  – necessary to create accurate, high resolution knowledge bases with detailed condition-, tissue-, and cell-specific functions of each gene
    • PharmGKB …
  – these knowledge bases will allow investigators to model an organism's biology as a dynamic system, and will help predict changes in the system due to factors such as mutations or environmental changes

Annotation Challenges
• Low resolution knowledge bases
• Incomplete and inaccurate annotations
• Missing condition- and cell-specific information

Green arrows represent abundantly available information, and red arrows represent missing and/or incomplete information. The ultimate goal of pathway analysis is to analyze a biological system as a large, single network. However, the links between smaller individual pathways are not yet well known. Furthermore, the effects of a SNP on a given pathway are also missing from current knowledge bases. While some pathways are known to be related to a few diseases, it is not clear whether the changes in pathways are the cause for those diseases or the downstream effects of the diseases.

Low Resolution Knowledge Bases
• Knowledge bases are not as high resolution as the technologies
  – using RNA-seq, more than 90% of the human genome is estimated to be alternatively spliced
  – multiple transcripts from the same gene may have related, distinct, or even opposing functions
  – GWAS have identified a large number of SNPs that may be involved in different conditions and diseases.
  – However, current knowledge bases only specify which genes are active in a given pathway
  – It is essential that they also begin specifying other information, such as transcripts that are active in a given pathway or how a given SNP affects a pathway

Low Resolution Knowledge Bases
• Because of these low resolution knowledge bases, every available pathway analysis tool first maps the input to a non-redundant namespace, typically an Entrez Gene ID
  – this type of mapping is advantageous, although it can be non-trivial, as it allows the existing pathway analysis approaches to be independent of the technology used in the experiment
  – However, mapping in this way also results in the loss of important information that may have been provided because a specific technology was used
    • XRN2a, a variant of gene XRN2, is expressed in several human tissues, whereas another variant of the same gene, XRN2b, is mainly expressed in blood leukocytes
    • Although RNA-seq can quantify expression of both variants, mapping both transcripts to a single gene causes loss of tissue-specific information, and possibly even condition-specific information
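
The information loss from mapping transcripts to a single gene identifier can be shown with a toy example. The expression values, the tissue labels, and the choice to sum transcript values are all assumptions made for illustration; only the XRN2a/XRN2b variant names come from the slide.

```python
from collections import defaultdict

def collapse_to_gene(transcript_expr, transcript_to_gene):
    """Sum transcript-level values into one value per gene and tissue."""
    gene_expr = defaultdict(lambda: defaultdict(float))
    for transcript, by_tissue in transcript_expr.items():
        gene = transcript_to_gene[transcript]
        for tissue, value in by_tissue.items():
            gene_expr[gene][tissue] += value
    return {gene: dict(tissues) for gene, tissues in gene_expr.items()}

# Mapping of transcript variants to a non-redundant gene identifier.
transcript_to_gene = {"XRN2a": "XRN2", "XRN2b": "XRN2"}

# Hypothetical expression values (arbitrary units) in two tissues.
transcript_expr = {
    "XRN2a": {"liver": 8.0, "leukocytes": 1.0},
    "XRN2b": {"liver": 0.5, "leukocytes": 9.0},
}

gene_level = collapse_to_gene(transcript_expr, transcript_to_gene)
print(gene_level)
```

After collapsing, the gene looks broadly expressed in both tissues, and the fact that each variant dominates in a different tissue is no longer visible to the pathway analysis.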

Low Resolution Knowledge Bases
• Therefore, before pathway analysis can exploit current and future technological advances in biotechnology, it is critically important to annotate exact transcripts and SNPs that participate in a given pathway
• While new approaches are being developed in this regard, they may not yet be adequate
  – Braun et al. proposed a method for analyzing SNP data from a GWAS
  – It still relies on mapping multiple SNPs to a single gene, followed by gene-to-pathway mapping

Incomplete and Inaccurate Annotation
• A surprisingly large number of genes are still not annotated
• Many of the genes are hypothetical, predicted, or pseudogenes
  – Although the number of protein-coding genes in the human genome is estimated to be between 20,000 and 25,000, according to Entrez Gene, there are 45,283 human genes, of which 14,162 are pseudogenes
  – One could argue that the pseudogenes should not be included when evaluating functional annotation coverage
  – However, pseudogene-derived small interfering RNAs have been shown to regulate gene expression in mouse oocytes
  – GO provides annotations for 271 pseudogenes
  – A widely used DNA microarray, Affymetrix HG U133 Plus 2.0, contains 1,026 probe sets that correspond to 823 pseudogenes
  – Should pseudogenes be included in the count when estimating annotation coverage for the human genome?

Incomplete and Inaccurate Annotation
Number of GO-annotated genes (left panel) and number of GO annotations (right panel) for human from January 2003 to November 2009. As the estimated number of known genes in the human genome is adjusted (between January 2003 and December 2003) and annotation practices are modified (between December 2004 and December 2005, and between October 2008 and November 2009), one can argue that, although the number of annotated genes and the annotations are decreasing (which is mainly due to the adjusted number of genes in the human genome and changes in the annotation process), the quality of annotations is improving, as demonstrated by the steady increase in non-IEA annotations and the number of genes with non-IEA annotations. However, the increase in the number of genes with non-IEA annotations is very slow. In almost 7 years, between January 2003 and November 2009, only 2,039 new genes received non-IEA annotations. At the same time, the number of non-IEA annotations increased from 35,925 to 65,741, indicating a strong research bias for a small number of genes. doi:10.1371/journal.pcbi.1002375.g003

Incomplete and Inaccurate Annotation
• Additionally, many of the existing annotations are of low quality and may be inaccurate
  – >90% of the annotations in the October 2015 release of GO had the evidence code “inferred from electronic annotation (IEA)”
  – These are the only annotations in GO that are not curated manually
  – Annotations inferred from indirect evidence are considered to be of lower quality than those derived from direct experimental evidence
  – If the annotations with the IEA code are removed, the number of genes with good quality annotations in the November 2015 release of human GO annotations is reduced from ~18K to ~12K
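
The effect of dropping IEA annotations can be checked with a simple filter on the evidence code. The records below are hypothetical placeholders (gene, term, evidence code), not real GO annotations; IEA, IDA, and IMP are GO evidence codes.

```python
# Hypothetical annotation records: (gene, term, evidence code).
annotations = [
    ("geneA", "term1", "IEA"),
    ("geneA", "term2", "IDA"),
    ("geneB", "term1", "IEA"),
    ("geneC", "term3", "IMP"),
    ("geneC", "term4", "IEA"),
]

def genes_with_annotations(records, exclude_codes=()):
    """Genes that keep at least one annotation after excluding codes."""
    return {gene for gene, _term, code in records if code not in exclude_codes}

all_genes = genes_with_annotations(annotations)
curated_genes = genes_with_annotations(annotations, exclude_codes={"IEA"})

# Fraction of annotation records that are electronic only.
iea_fraction = sum(code == "IEA" for _g, _t, code in annotations) / len(annotations)
print(len(all_genes), len(curated_genes), iea_fraction)
```

In this toy set, geneB disappears entirely once IEA records are excluded, which mirrors how the number of annotated genes shrinks when only manually curated evidence is kept.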

Incomplete and Inaccurate Annotation
• It is very likely that the reduced number of annotations and annotated genes since January 2003 is an indicator of improving quality
• This is due in part to the fact that the number of genes in a genome is continuously being adjusted and the functional annotation algorithms are being improved
  – the number of non-IEA annotations is continuously increasing
• However, the rate of increase for non-IEA annotations is very slow (approximately 2,000 genes annotated in 7 years)

Incomplete and Inaccurate Annotation
• Manual curation of the entire genome is expected to take a very long time (~13–25 years)
• The entire research community could participate in the curation process
• One approach to facilitate participation of a large number of researchers is to adopt a standard annotation format similar to Minimum Information About a Microarray Experiment (MIAME)
  – should this be required like GEO?
• A format for functional annotation can be designed or adopted from the existing formats (e.g., BioPAX, SBML)
  – Such a format could allow researchers to specify an experimentally confirmed role of a specific transcript or a SNP in a pathway along with experimental and biological conditions

Missing Condition- and Cell-Specific Information
• Most pathway knowledge bases are built by curating experiments performed in different cell types at different time points under different conditions
• These details are typically not available in the knowledge bases!
• One effect of this omission is that multiple independent genes are annotated to participate in the same interaction in a pathway
• This effect is so widespread that many pathway knowledge bases represent a set of distinct genes as a single node in a pathway

Missing Condition- and Cell-Specific Information
• Example: Wnt/beta-catenin pathway in STKE
  – the node labeled “Genes” represents 19 genes directly targeted by Wnt in different organisms (Xenopus and human) in different cells and tissues (colon carcinoma cells and epithelial cells)
  – these non-specific genes introduce bias for these pathways in all existing analysis approaches
  – For instance, any ORA method will assign higher significance (typically an order of magnitude lower p-value) to a pathway with more genes
  – Similarly, more genes in a pathway also increase the probability of a higher pathway-level statistic in FCS approaches, yielding higher significance for a given pathway.

Missing Condition- and Cell-Specific Information
• This contextual information is typically not available from most of the existing knowledge bases
• A standard functional annotation format, as discussed above, would make this information available to curators and developers
  – For instance, the recently proposed Biological Connection Markup Language (BCML) allows pathway representation to specify the cell or organism in which each pathway interaction occurs.
  – BCML can generate cell-, condition-, or organism-specific pathways based on user-defined query criteria, which in turn can be used for targeted analysis

Missing Condition- and Cell-Specific Information
• Existing knowledge bases do not describe the effects of an abnormal condition on a pathway
  – For example, it is not clear how the Alzheimer's disease pathway in KEGG differs from a normal pathway
  – Nor is it clear which set of interactions leads to Alzheimer's disease
• We are now understanding that context plays an important role in pathway interactions
• Information about how cell and tissue type, age, and environmental exposures affect pathway interactions will add complexity that is currently lacking

Methodological Challenges
• Benchmark data sets for comparing different methods
• Inability to model and analyze dynamic response
• Inability to model effects of external stimuli

Comparing Different Methods
• How do we compare different pathway analysis methods?
• Simulated data
  – Advantages:
    • The real signal is simulated, so the “true” answer is known
  – Disadvantages:
    • Cannot contain all the complexity of real data
    • The success of a method can reflect how closely the simulation matches the knowledge base structure used

Comparing Different Methods
• Benchmark data
  – Advantages:
    • Can compare sensitivity and specificity
    • Several data sets have been consistently used in the literature
    • Includes all the complexity of real biological data
  – Disadvantages:
    • Affected by confounding factors
      – absence of a pure division into classes
      – presence of outliers
      – …
    • No true answer known for grounded comparisons
      – actual biology isn't known

Comparing Different Methods
• A general challenge: different definitions of the same pathway in different knowledge bases can affect performance assessment
  – GO defines different pathways for apoptosis in different cells
    • (e.g., cardiac muscle cell apoptosis, B cell apoptosis, T cell apoptosis)
    • Further distinguishes between induction and regulation of apoptosis
  – KEGG defines a single signaling pathway for apoptosis
    • does not distinguish between induction and regulation
  – An approach using KEGG would identify a single pathway as significant, whereas GO could identify multiple pathways, and/or specific aspects of a single apoptosis pathway

Inability to Model and Analyze Dynamic Response
• No existing approach can collectively model and analyze high-throughput data as a single dynamic system
• Current approaches analyze a snapshot, assuming that each pathway is independent of the others at a given time
  – measure expression changes at multiple time points, and analyze each time point individually
  – this implicitly assumes that pathways at different time points are independent
• Need models that account for dependence among pathways at different time points
  – Much of this limitation is due to technology/experimental design → not all bioinformatics limitations

Inability to Model Effects of External Stimuli
• Gene set-based approaches often only consider genes and their products
• They completely ignore the effects of other molecules participating in a pathway
  – such as the rate-limiting step of a multi-step pathway.
• Example:
  – The amount/strength of Ca2+ causes different transcription factors to be activated
  – This information is usually not available.

Summary
• In the last decade, pathway analysis has matured, and become the standard for trying to dissect the biology of high-throughput experiments.
• Many similarities across the three main generations of pathway analysis tools.
• Will discuss more details of some of these choices, knowledge bases, and specific approaches next.
• Many open methods development challenges!

Overview of Module
• First Half:
  – Overview of gene set and pathway analysis
    • Commonly used databases and annotation issues
    • 1st and 2nd generation tools
  – Basic differences in methods
  – Details on very popular methods
  • Issues with different “omics” data types
• Second Half:
  – “3rd generation” methods
  – Network analysis modeling

Questions?

motsinger@stat.ncsu.edu
