Introduction to Pathway and Network Analysis Alison Motsinger-Reif, PhD Associate Professor Bioinformatics Research Center Department of Statistics North Carolina State University

Upload: lykhanh

Post on 10-Aug-2019


Page 1: Introduction to Pathway and Network Analysis · Pathway and Network Analysis • High-throughput genetic/genomic technologies enable comprehensive monitoring of a biological system

Introduction to Pathway and Network Analysis

Alison Motsinger-Reif, PhD
Associate Professor
Bioinformatics Research Center
Department of Statistics
North Carolina State University


Pathway and Network Analysis

• High-throughput genetic/genomic technologies enable comprehensive monitoring of a biological system
• Analysis of high-throughput data typically yields a list of differentially expressed genes, proteins, metabolites, …
  – Typically provides lists of single genes, etc.
  – Will use “genes” throughout, mostly interchangeably with proteins, metabolites, etc.
• This list often fails to provide mechanistic insights into the underlying biology of the condition being studied
• How to extract meaning from a long list of differentially expressed genes → pathway/network analysis


What makes an airplane fly?

Chas' Stainless Steel, Mark Thompson's Airplane Parts, About 1000 Pounds of Stainless Steel Wire, and Gagosian's Beverly Hills Space


From components to networks

A biological function is a result of many interacting molecules and cannot be attributed to just a single molecule.


Pathway and Network Analysis

• One approach: simplify analysis by grouping long lists of individual genes into smaller sets of related genes, which reduces the complexity of analysis
  – A large number of knowledge bases have been developed to help with this task
• Knowledge bases
  – describe biological processes, components, or structures in which individual genes are known to be involved
  – describe how and where gene products interact with each other


Pathway and Network Analysis

• Analysis at the functional level is appealing for two reasons:
  – First, grouping thousands of genes by the pathways they are involved in reduces the complexity to just several hundred pathways for the experiment
  – Second, identifying active pathways that differ between two conditions can have more explanatory power than a simple list of genes


Pathway and Network Analysis

• What kinds of data are used for such analysis?
  – Gene expression data
    • Microarrays
    • RNA-seq
  – Proteomic data
  – Metabolomics data
  – Single nucleotide polymorphisms (SNPs)
  – …


Pathway and Network Analysis

• What kinds of questions can we ask/answer with these approaches?


Pathway and Network Analysis

• The term “pathway analysis” is used often, and often in different ways. It has been applied to:
  – the analysis of Gene Ontology (GO) terms (also referred to as “gene sets”)
  – physical interaction networks (e.g., protein–protein interactions)
  – kinetic simulation of pathways
  – steady-state pathway analysis (e.g., flux-balance analysis)
  – inference of pathways from expression and sequence data
• These analyses may or may not actually describe biological pathways


Pathway and Network Analysis

• For the first part of this module, we will focus on methods that exploit pathway knowledge in public repositories rather than on methods that infer pathways from molecular measurements
  – Use repositories such as GO or the Kyoto Encyclopedia of Genes and Genomes (KEGG)
  → knowledge base–driven pathway analysis


A History of Pathway Analysis Approaches

• Over a decade of development of pathway analysis approaches
• Can be roughly divided into three generations:
  – 1st: Over-Representation Analysis (ORA) Approaches
  – 2nd: Functional Class Scoring (FCS) Approaches
  – 3rd: Pathway Topology (PT)-Based Approaches

Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.


• The data generated by an experiment using a high-throughput technology (e.g., microarray, proteomics, metabolomics), along with functional annotations (pathway database) of the corresponding genome, are input to virtually all pathway analysis methods.
• ORA methods require that the input is a list of differentially expressed genes
• FCS methods use the entire data matrix as input
• PT-based methods additionally utilize the number and type of interactions between gene products, which may or may not be a part of a pathway database.
• The result of every pathway analysis method is a list of significant pathways in the condition under study.


Over-Representation Analysis (ORA) Approaches

• Earliest methods → over-representation analysis (ORA)
• Statistically evaluates the fraction of genes in a particular pathway found among the set of genes showing changes in expression
• It is also referred to as the “2×2 table method” in the literature


Over-Representation Analysis (ORA)

• Uses one or more variations of the following strategy:
  – First, an input list is created using a certain threshold or criterion
    • For example, one may choose genes that are differentially over- or under-expressed in a given condition at a false discovery rate (FDR) of 5%
  – Then, for each pathway, input genes that are part of the pathway are counted
  – This process is repeated for an appropriate background list of genes
    • (e.g., all genes measured on a microarray)
  – Next, every pathway is tested for over- or under-representation in the list of input genes
    • The most commonly used tests are based on the hypergeometric, chi-square, or binomial distribution
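
The ORA strategy above can be sketched in a few lines. This is a minimal illustration, assuming a one-sided hypergeometric test for over-representation only; the gene names, pathway names, and the helper names `hypergeom_upper_tail` and `ora` are hypothetical and not taken from any specific tool.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn without replacement from N
    background genes, K of which belong to the pathway."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def ora(input_genes, pathways, background):
    """Return an over-representation p-value for each pathway."""
    background = set(background)
    hits = set(input_genes) & background          # input list restricted to background
    results = {}
    for name, members in pathways.items():
        members = set(members) & background       # pathway genes that were measured
        k = len(hits & members)                   # input genes found in the pathway
        results[name] = hypergeom_upper_tail(k, len(members), len(hits), len(background))
    return results


# Toy data (hypothetical): 20 measured genes, two pathways of 5 genes each.
background = [f"g{i}" for i in range(1, 21)]
pathways = {
    "pathway_A": ["g1", "g2", "g3", "g4", "g5"],
    "pathway_B": ["g6", "g7", "g8", "g9", "g10"],
}
input_genes = ["g1", "g2", "g3", "g6"]            # e.g., genes passing an FDR cutoff

pvals = ora(input_genes, pathways, background)
for name, p in pvals.items():
    print(f"{name}: p = {p:.4f}")
```

Note that the p-value depends only on counts: the size of the input list, the pathway, their overlap, and the background. In practice the p-values would also be corrected for testing many pathways.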


Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.


Limitations of ORA Approaches

• First, the different statistics used by ORA are independent of the measured changes
  – (e.g., hypergeometric distribution, binomial distribution, chi-square distribution, etc.)
• Tests consider the number of genes alone but ignore any values associated with them
  – such as probe intensities
• By discarding this data, ORA treats each gene equally
  – Information about the extent of regulation (e.g., fold-changes, significance of a change, etc.) can be useful in assigning different weights to input genes/pathways
  – This can provide more information


Limitations of ORA Approaches

• Second, ORA typically uses only the most significant genes and discards the others
  – The input list of genes is usually obtained using an arbitrary threshold (e.g., genes with fold-change > 2 and/or p-value < 0.05)
• Marginally less significant genes are missed, resulting in information loss
  – (e.g., fold-change = 1.999 or p-value = 0.051)
  – A few methods avoid thresholds
    • They use an iterative approach that adds one gene at a time to find a set of genes for which a pathway is most significant


Limitations of ORA Approaches

• Third, ORA assumes that each gene is independent of the other genes
• However, biology is a complex web of interactions between gene products that constitute different pathways
  – One goal might be to gain insights into how interactions between gene products are manifested as changes in expression
  – A strategy that assumes the genes are independent is significantly limited in its ability to provide insights
• Furthermore, assuming independence between genes amounts to “competitive null hypothesis” testing (more later), which ignores the correlation structure between genes
  – The estimated significance of a pathway may be biased or incorrect


Limitations of ORA Approaches

• Fourth, ORA assumes that each pathway is independent of other pathways → NOT TRUE!
• Examples of dependence:
  – GO defines a biological process as a series of events accomplished by one or more ordered assemblies of molecular functions
  – The cell cycle pathway in KEGG, where the presence of a growth factor activates the MAPK signaling pathway
    • This, in turn, activates the cell cycle pathway
• No ORA methods account for this dependence between molecular functions in GO and signaling pathways in KEGG


Functional Class Scoring (FCS) Approaches

• The hypothesis of functional class scoring (FCS) is that although large changes in individual genes can have significant effects on pathways, weaker but coordinated changes in sets of functionally related genes (i.e., pathways) can also have significant effects
• With few exceptions, all FCS methods use a variation of a general framework that consists of the following three steps.


Step 1

• First, a gene-level statistic is computed using the molecular measurements from an experiment
  – Involves computing differential expression of individual genes or proteins
• Statistics currently used at the gene level include:
  – correlation of molecular measurements with phenotype
  – ANOVA
  – Q-statistic
  – signal-to-noise ratio
  – t-test
  – Z-score


Step 1

• Choice of a gene-level statistic generally has a negligible effect on the identification of significantly enriched gene sets
  – However, when there are few biological replicates, a regularized statistic may be better
• Untransformed gene-level statistics can fail to identify pathways with up- and down-regulated genes
  – In this case, transformation of gene-level statistics (e.g., absolute values, squared values, ranks, etc.) is better
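
A small sketch of Step 1 may help show why a transformation matters. It assumes a Welch t-statistic as the gene-level statistic (one of several options listed above) and uses hypothetical expression values for a pathway with one up-regulated and one down-regulated gene.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch t-statistic comparing two groups of measurements."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


# Hypothetical expression values: (case samples, control samples) per gene.
expr = {
    "geneUp": ([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]),
    "geneDown": ([1.0, 2.0, 3.0], [5.0, 6.0, 7.0]),
}

t_stats = {g: welch_t(case, ctrl) for g, (case, ctrl) in expr.items()}

# Averaging raw statistics lets opposite directions cancel out;
# averaging absolute values keeps the signal.
raw_mean = mean(list(t_stats.values()))
abs_mean = mean([abs(t) for t in t_stats.values()])

print(t_stats)
print("mean of raw t:", raw_mean)
print("mean of |t|:  ", abs_mean)
```

Both genes change strongly, yet the untransformed average is zero, so a pathway-level score built on it would miss this pathway.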


Step 2

• Second, the gene-level statistics for all genes in a pathway are aggregated into a single pathway-level statistic
  – can be multivariate and account for interdependencies among genes
  – can be univariate and disregard interdependencies among genes
• The pathway-level statistics used include:
  – Kolmogorov-Smirnov statistic
  – sum, mean, or median of gene-level statistic
  – Wilcoxon rank sum
  – maxmean statistic
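
As a rough illustration of Step 2, the sketch below aggregates gene-level scores into a few univariate pathway-level statistics. The scores and gene names are hypothetical, and `maxmean` is written here as the larger of the averaged positive parts and the averaged negative parts, which is the usual definition of that statistic.

```python
from statistics import mean, median


def maxmean(scores):
    """Larger of the mean positive part and the mean negative part (in magnitude)."""
    pos = sum(s for s in scores if s > 0) / len(scores)
    neg = sum(-s for s in scores if s < 0) / len(scores)
    return max(pos, neg)


# Hypothetical gene-level statistics (e.g., t-statistics) for all measured genes.
gene_scores = {"g1": 2.0, "g2": -3.0, "g3": 1.0, "g4": -2.0, "g5": 0.5, "g6": 4.0}
pathway = ["g1", "g2", "g3", "g4"]

scores = [gene_scores[g] for g in pathway]
pathway_stats = {
    "sum": sum(scores),
    "mean": mean(scores),
    "median": median(scores),
    "maxmean": maxmean(scores),
}
print(pathway_stats)
```

All four statistics here are univariate: they ignore correlation between the genes in the pathway.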


Step 2

• Irrespective of its type, the power of a pathway-level statistic depends on
  – the proportion of differentially expressed genes in a pathway
  – the size of the pathway
  – the amount of correlation between genes in the pathway
• When applied to real biological data, univariate statistics show more power than multivariate statistics at stringent cutoffs, and equal power at less stringent cutoffs


Step 3

• Third, the statistical significance of the pathway-level statistic is assessed
• When computing statistical significance, the null hypothesis tested by current pathway analysis approaches can be broadly divided into two categories:
  – i) competitive null hypothesis
  – ii) self-contained null hypothesis
• Testing a self-contained null hypothesis permutes class labels (i.e., phenotypes) for each sample and compares the set of genes in a given pathway with itself, while ignoring the genes that are not in the pathway
• Testing a competitive null hypothesis permutes gene labels for each pathway, and compares the set of genes in the pathway with a set of genes that are not in the pathway
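
The two kinds of null hypothesis differ mainly in what gets permuted. The sketch below is a simplified illustration under stated assumptions: the pathway-level statistic is the mean absolute difference in group means, the competitive test draws random gene sets of the same size from all measured genes, and all data and function names are hypothetical.

```python
import random
from statistics import mean


def gene_stat(values, labels):
    """Difference in mean expression between cases (1) and controls (0)."""
    case = [v for v, lab in zip(values, labels) if lab == 1]
    ctrl = [v for v, lab in zip(values, labels) if lab == 0]
    return mean(case) - mean(ctrl)


def pathway_stat(expr, labels, genes):
    return mean([abs(gene_stat(expr[g], labels)) for g in genes])


def self_contained_p(expr, labels, genes, n_perm, rng):
    """Permute sample (class) labels; only genes in the pathway are used."""
    observed = pathway_stat(expr, labels, genes)
    count = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if pathway_stat(expr, shuffled, genes) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)


def competitive_p(expr, labels, genes, n_perm, rng):
    """Permute gene labels: compare with random gene sets of the same size."""
    observed = pathway_stat(expr, labels, genes)
    all_genes = sorted(expr)
    count = 0
    for _ in range(n_perm):
        random_set = rng.sample(all_genes, len(genes))
        if pathway_stat(expr, labels, random_set) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)


# Hypothetical data: 3 cases, 3 controls; only p1 and p2 differ between groups.
labels = [1, 1, 1, 0, 0, 0]
expr = {"p1": [5, 6, 7, 1, 2, 3], "p2": [1, 2, 3, 5, 6, 7]}
expr.update({f"b{i}": [2, 4, 3, 2, 4, 3] for i in range(1, 7)})

rng = random.Random(0)
p_self = self_contained_p(expr, labels, ["p1", "p2"], 500, rng)
p_comp = competitive_p(expr, labels, ["p1", "p2"], 500, rng)
print("self-contained p:", p_self)
print("competitive p:   ", p_comp)
```

With only six samples there are few distinct label permutations, so the self-contained p-value cannot become very small, while the competitive p-value depends on how many genes outside the pathway also change.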


Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.


Advantages of FCS Methods

FCS methods address three limitations of ORA

1. They don’t require an arbitrary threshold for dividing expression data into significant and non-significant pools. Rather, FCS methods use all available molecular measurements for pathway analysis.
2. While ORA completely ignores molecular measurements when identifying significant pathways, FCS methods use this information in order to detect coordinated changes in the expression of genes in the same pathway
3. By considering the coordinated changes in gene expression, FCS methods account for dependence between genes in a pathway


Limitations of FCS Methods

• First, similar to ORA, FCS analyzes each pathway independently
  – A gene can function in more than one pathway, meaning that pathways can cross and overlap
  – Consequently, while one pathway may be affected in an experiment, one may observe other pathways being significantly affected due to the set of overlapping genes
• Such a phenomenon is very common when using GO terms to define pathways, due to the hierarchical nature of GO


Limitations of FCS Methods

• Second, many FCS methods use changes in gene expression to rank genes in a given pathway, and discard the changes from further analysis
  – For instance, assume that two genes in a pathway, A and B, are changing by 2-fold and 20-fold, respectively
  – As long as they both have the same respective ranks in comparison with other genes in the pathway, most FCS methods will treat them equally, although the gene with the higher fold-change should probably get more weight
• Importantly, however, considering only the ranks of genes is also advantageous, as it is more robust to outliers.
  – A notable exception to this scenario is approaches that use gene-level statistics (e.g., t-statistic) to compute pathway-level scores.
  – For example, an FCS method that computes a pathway-level statistic as a sum or mean of the gene-level statistic accounts for a relative difference in measurements (e.g., Category, SAFE).
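
The rank-versus-magnitude point can be made concrete with a toy comparison. The fold-changes below are hypothetical; gene B changes 20-fold in one scenario and only 2.1-fold in the other, yet the ranks are identical.

```python
from math import log2


def ranks(values):
    """Rank genes from largest (rank 1) to smallest fold-change."""
    order = sorted(values, key=values.get, reverse=True)
    return {gene: i + 1 for i, gene in enumerate(order)}


# Hypothetical fold-changes for the same pathway in two scenarios.
scenario_1 = {"A": 2.0, "B": 20.0, "C": 1.2}
scenario_2 = {"A": 2.0, "B": 2.1, "C": 1.2}

same_ranks = ranks(scenario_1) == ranks(scenario_2)

# A sum of log2 fold-changes (a magnitude-based score) does tell them apart.
sum_1 = sum(log2(v) for v in scenario_1.values())
sum_2 = sum(log2(v) for v in scenario_2.values())

print("identical ranks:", same_ranks)
print("sum of log2 fold-changes:", round(sum_1, 3), "vs", round(sum_2, 3))
```

A purely rank-based score cannot distinguish the two scenarios, whereas a sum or mean of gene-level statistics can, at the cost of being more sensitive to outliers.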


Pathway Topology (PT)-Based Approaches

• A large number of publicly available pathway knowledge bases provide information beyond simple lists of genes for each pathway
  – KEGG
  – MetaCyc
  – Reactome
  – RegulonDB
  – STKE
  – BioCarta
  – PantherDB
  – …
• Unlike GO and MSigDB, these knowledge bases also provide information about gene products that interact with each other in a given pathway, how they interact (e.g., activation, inhibition, etc.), and where they interact (e.g., cytoplasm, nucleus, etc.)


Pathway Topology (PT)-Based Approaches

• ORA and FCS methods consider only the number of genes in a pathway or gene coexpression to identify significant pathways, and ignore the additional information available from these knowledge bases
  – Even if the pathways are completely redrawn with new links between the genes, as long as they contain the same set of genes, ORA and FCS will produce the same results
• Pathway topology (PT)-based methods have been developed to use the additional information
  – PT-based methods are essentially the same as FCS methods in that they perform the same three steps as FCS methods
  – The key difference between the two is the use of pathway topology to compute gene-level statistics


Pathway Topology (PT)-Based Approaches

• Rahnenfuhrer et al. proposed ScorePAGE, which computes similarity between each pair of genes in a pathway (e.g., correlation, covariance, etc.)
  – The similarity measurement between each pair of genes is analogous to gene-level statistics in FCS methods
  – The pairwise similarities are averaged to compute a pathway-level score
• Instead of giving equal weight to all pairwise similarities, ScorePAGE divides the pairwise similarities by the number of reactions needed to connect two genes in a given pathway
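
The idea of down-weighting distant gene pairs can be sketched as follows. This is not the ScorePAGE implementation; it only illustrates the description above, assuming Pearson correlation as the similarity and a hand-written table of reaction distances. The expression profiles, distances, and function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def topology_weighted_score(expr, distance):
    """Average of pairwise similarities, each divided by the number of
    reactions needed to connect the two genes."""
    pairs = list(combinations(sorted(expr), 2))
    weighted = [pearson(expr[a], expr[b]) / distance[(a, b)] for a, b in pairs]
    return sum(weighted) / len(weighted)


# Hypothetical expression profiles across four samples.
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
# Hypothetical reaction distances in the pathway graph (g1 - g2 - g3).
distance = {("g1", "g2"): 1, ("g1", "g3"): 2, ("g2", "g3"): 1}

score = topology_weighted_score(expr, distance)
unweighted = sum(pearson(expr[a], expr[b]) for a, b in distance) / len(distance)
print("topology-weighted score:", round(score, 4))
print("unweighted average:     ", round(unweighted, 4))
```

The pair g1/g3 is two reactions apart, so its similarity contributes only half as much as the adjacent pairs.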


Pathway Topology (PT)-Based Approaches

• Impact factor (IF) analysis
  – IF considers the structure and dynamics of an entire pathway by incorporating a number of important biological factors, including changes in gene expression, types of interactions, and the positions of genes in a pathway

Ali will talk more about these approaches in detail!!!


IF Analysis

• Briefly…
  – Models a signaling pathway as a graph, where nodes represent genes and edges represent interactions between them
  – Defines a gene-level statistic, called the perturbation factor (PF) of a gene, as a sum of its measured change in expression and a linear function of the perturbation factors of all genes in a pathway
  – Because the PF of each gene is defined by a linear equation, the entire pathway is defined as a linear system
    • addresses loops in the pathways
  – The IF of a pathway (pathway-level statistic) is defined as a sum of the PFs of all genes in a pathway
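
Because each PF is a linear equation, a pathway with loops can still be solved in one step. The sketch below assumes a particular linear form, PF(g) = ΔE(g) + Σ over upstream genes u of β(u→g) · PF(u) / (number of genes downstream of u), with β = +1 for activation and negative values for inhibition. That weighting, the toy pathway, and the function names are assumptions made for illustration; the slide itself only states that the PF includes a linear function of other PFs and that the pathway score is their sum.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting
    (adequate for small dense systems)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def perturbation_factors(genes, edges, delta_e):
    """Solve (I - B) PF = delta_E, where B holds the assumed edge weights."""
    idx = {g: i for i, g in enumerate(genes)}
    n_down = {g: 0 for g in genes}
    for src, _dst, _beta in edges:
        n_down[src] += 1
    n = len(genes)
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for src, dst, beta in edges:
        a[idx[dst]][idx[src]] -= beta / n_down[src]
    pf = solve_linear(a, [delta_e[g] for g in genes])
    return dict(zip(genes, pf))


# Hypothetical pathway with a feedback loop: g1 -> g2 -> g3 -| g1.
genes = ["g1", "g2", "g3"]
edges = [("g1", "g2", 1.0), ("g2", "g3", 1.0), ("g3", "g1", -0.5)]
delta_e = {"g1": 1.0, "g2": 0.5, "g3": 0.0}   # measured expression changes

pf = perturbation_factors(genes, edges, delta_e)
pathway_score = sum(pf.values())               # sum of PFs, as described above
print(pf)
print("pathway-level score:", pathway_score)
```

The feedback inhibition from g3 reduces the perturbation of g1 below its measured change, which a method ignoring topology could not capture.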


Pathway Topology (PT)-Based Approaches

• FCS methods that use correlations among genes implicitly assume that the underlying network, as defined by the correlation structure, does not change as the experimental conditions change
• This assumption may be inaccurate → PT approaches improve on this


Pathway Topology (PT)-Based Approaches

• NetGSA accounts for the change in correlation as well as the change in network structure as experimental conditions change
  – Like IF analysis, it models gene expression as a linear function of other genes in the network
• It differs from IF analysis in two aspects
  – First, it accounts for a gene's baseline expression by representing it as a latent variable in the model
  – Second, it requires that the pathways be represented as directed acyclic graphs (DAGs)
    • If a pathway contains cycles, NetGSA requires additional latent variables affecting the nodes in the cycle.
    • In contrast, IF analysis does not impose any constraint on the structure of a pathway


Limitations of PT-based Approaches

• True pathway topology depends on the type of cell (due to cell-specific gene expression profiles) and on the condition being studied
  – This information is rarely available
  – If available, it is fragmented across knowledge bases
  – As annotations improve, these approaches are expected to become more useful
• Inability to model dynamic states of a system
• Inability to consider interactions between pathways: inter-pathway links are too weak to account for interdependence between pathways


Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.

R package: netgsa


Outstanding Challenges

• Broad categories:
  1. annotation challenges
  2. methodological challenges


Outstanding Challenges

• Next-generation approaches will require improvement of the existing annotations
  – It is necessary to create accurate, high-resolution knowledge bases with detailed condition-, tissue-, and cell-specific functions of each gene
    • PharmGKB …
  – These knowledge bases will allow investigators to model an organism's biology as a dynamic system, and will help predict changes in the system due to factors such as mutations or environmental changes


Annotation Challenges

• Low-resolution knowledge bases
• Incomplete and inaccurate annotations
• Missing condition- and cell-specific information


Green arrows represent abundantly available information, and red arrows represent missing and/or incomplete information. The ultimate goal of pathway analysis is to analyze a biological system as a large, single network. However, the links between smaller individual pathways are not yet well known. Furthermore, the effects of a SNP on a given pathway are also missing from current knowledge bases. While some pathways are known to be related to a few diseases, it is not clear whether the changes in pathways are the cause of those diseases or the downstream effects of the diseases.


Low Resolution Knowledge Bases

• Knowledge bases are not as high resolution as the technologies
  – Based on RNA-seq, more than 90% of human genes are estimated to be alternatively spliced
  – Multiple transcripts from the same gene may have related, distinct, or even opposing functions
  – GWAS have identified a large number of SNPs that may be involved in different conditions and diseases.
  – However, current knowledge bases only specify which genes are active in a given pathway
  – It is essential that they also begin specifying other information, such as transcripts that are active in a given pathway or how a given SNP affects a pathway


Low Resolution Knowledge Bases

• Because of these low-resolution knowledge bases, every available pathway analysis tool first maps the input to a non-redundant namespace, typically an Entrez Gene ID
  – This type of mapping is advantageous, although it can be non-trivial, as it allows the existing pathway analysis approaches to be independent of the technology used in the experiment
  – However, mapping in this way also results in the loss of important information that may have been provided because a specific technology was used
    • XRN2a, a variant of gene XRN2, is expressed in several human tissues, whereas another variant of the same gene, XRN2b, is mainly expressed in blood leukocytes
    • Although RNA-seq can quantify expression of both variants, mapping both transcripts to a single gene causes loss of tissue-specific information, and possibly even condition-specific information
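
A toy example shows how collapsing transcripts to a gene-level identifier hides the tissue pattern. The expression values are hypothetical, and summing transcript values per gene is only one possible collapsing rule, chosen here for illustration.

```python
from collections import defaultdict

# Transcript-to-gene mapping for the two XRN2 variants mentioned above.
transcript_to_gene = {"XRN2a": "XRN2", "XRN2b": "XRN2"}

# Hypothetical transcript-level expression per tissue.
transcript_expr = {
    "XRN2a": {"liver": 8.0, "leukocytes": 1.0},
    "XRN2b": {"liver": 0.5, "leukocytes": 9.0},
}


def collapse_to_gene(transcript_expr, mapping):
    """Sum transcript-level expression into gene-level expression per tissue."""
    gene_expr = defaultdict(lambda: defaultdict(float))
    for transcript, by_tissue in transcript_expr.items():
        gene = mapping[transcript]
        for tissue, value in by_tissue.items():
            gene_expr[gene][tissue] += value
    return {gene: dict(by_tissue) for gene, by_tissue in gene_expr.items()}


gene_expr = collapse_to_gene(transcript_expr, transcript_to_gene)
print(gene_expr)
```

At the gene level XRN2 appears expressed in both tissues, and the fact that different variants dominate in each tissue is no longer visible.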


Low Resolution Knowledge Bases

• Therefore, before pathway analysis can exploit current and future technological advances in biotechnology, it is critically important to annotate the exact transcripts and SNPs that participate in a given pathway
• While new approaches are being developed in this regard, they may not yet be adequate
  – Braun et al. proposed a method for analyzing SNP data from a GWAS
  – It still relies on mapping multiple SNPs to a single gene, followed by gene-to-pathway mapping


Incomplete and Inaccurate Annotation

• A surprisingly large number of genes are still not annotated
• Many of the genes are hypothetical, predicted, or pseudogenes
  – Although the number of protein-coding genes in the human genome is estimated to be between 20,000 and 25,000, according to Entrez Gene there are 45,283 human genes, of which 14,162 are pseudogenes
  – One could argue that the pseudogenes should not be included when evaluating functional annotation coverage
  – However, pseudogene-derived small interfering RNAs have been shown to regulate gene expression in mouse oocytes
  – GO provides annotations for 271 pseudogenes
  – A widely used DNA microarray, Affymetrix HG U133 Plus 2.0, contains 1,026 probe sets that correspond to 823 pseudogenes
  – Should pseudogenes be included in the count when estimating annotation coverage for the human genome?


Incomplete and Inaccurate Annotation

Number of GO-annotated genes (left panel) and number of GO annotations (right panel) for human from January 2003 to November 2009. As the estimated number of known genes in the human genome is adjusted (between January 2003 and December 2003) and annotation practices are modified (between December 2004 and December 2005, and between October 2008 and November 2009), one can argue that, although the number of annotated genes and the annotations are decreasing (which is mainly due to the adjusted number of genes in the human genome and changes in the annotation process), the quality of annotations is improving, as demonstrated by the steady increase in non-IEA annotations and the number of genes with non-IEA annotations. However, the increase in the number of genes with non-IEA annotations is very slow. In almost 7 years, between January 2003 and November 2009, only 2,039 new genes received non-IEA annotations. At the same time, the number of non-IEA annotations increased from 35,925 to 65,741, indicating a strong research bias for a small number of genes. doi:10.1371/journal.pcbi.1002375.g003


Incomplete and Inaccurate Annotation

• Additionally, many of the existing annotations are of low quality and may be inaccurate
  – >90% of the annotations in the October 2015 release of GO had the evidence code “inferred from electronic annotations (IEA)”
  – These are the only ones in GO that are not curated manually
  – Annotations inferred from indirect evidence are considered to be of lower quality than those derived from direct experimental evidence
  – If the annotations with the IEA code are removed, the number of genes with good quality annotations in the November 2015 release of human GO annotations is reduced from ~18K to ~12K
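
Filtering by evidence code is straightforward once annotations are available as (gene, term, evidence code) records. The records below are hypothetical placeholders rather than real GO annotations; IEA, IDA, and IMP are GO evidence codes, and the function name is an illustrative choice.

```python
# Hypothetical annotation records: (gene, term, evidence code).
annotations = [
    ("geneA", "term1", "IEA"),
    ("geneA", "term2", "IDA"),
    ("geneB", "term1", "IEA"),
    ("geneC", "term3", "IMP"),
    ("geneD", "term2", "IEA"),
]


def annotated_genes(annotations, exclude_codes=()):
    """Genes that keep at least one annotation after dropping the given codes."""
    return {gene for gene, _term, code in annotations if code not in exclude_codes}


all_annotated = annotated_genes(annotations)
manually_curated = annotated_genes(annotations, exclude_codes={"IEA"})

print("genes with any annotation:    ", sorted(all_annotated))
print("genes with non-IEA annotation:", sorted(manually_curated))
```

In this toy set, half of the annotated genes disappear once electronically inferred annotations are removed, which mirrors the kind of drop described above.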


Incomplete and Inaccurate Annotation

• It is very likely that the reduced number of annotations and annotated genes since January 2003 is an indicator of improving quality
• This is due in part to the fact that the number of genes in a genome is continuously being adjusted and the functional annotation algorithms are being improved
  – The number of non-IEA annotations is continuously increasing
• However, the rate of increase for non-IEA annotations is very slow (approximately 2,000 genes annotated in 7 years)


Incomplete and Inaccurate Annotation

• Manual curation of the entire genome is expected to take a very long time (~13–25 years)
• The entire research community could participate in the curation process
• One approach to facilitate participation of a large number of researchers is to adopt a standard annotation format similar to Minimum Information About a Microarray Experiment (MIAME)
  – Should this be required, like GEO?
• A format for functional annotation can be designed or adopted from the existing formats (e.g., BioPAX, SBML)
  – Such a format could allow researchers to specify an experimentally confirmed role of a specific transcript or a SNP in a pathway, along with experimental and biological conditions


Missing Condition- and Cell-Specific Information

• Most pathway knowledge bases are built by curating experiments performed in different cell types at different time points under different conditions
• These details are typically not available in the knowledge bases!
• One effect of this omission is that multiple independent genes are annotated to participate in the same interaction in a pathway
• This effect is so widespread that many pathway knowledge bases represent a set of distinct genes as a single node in a pathway


Missing Condition- and Cell-Specific Information

• Example: Wnt/beta-catenin pathway in STKE
  – The node labeled “Genes” represents 19 genes directly targeted by Wnt in different organisms (Xenopus and human) in different cells and tissues (colon carcinoma cells and epithelial cells)
  – These non-specific genes introduce bias for these pathways in all existing analysis approaches
  – For instance, any ORA method will assign higher significance (typically an order of magnitude lower p-value) to a pathway with more genes
  – Similarly, more genes in a pathway also increase the probability of a higher pathway-level statistic in FCS approaches, yielding higher significance for a given pathway.

Missing Condition- and Cell-Specific Information

• This contextual information is typically not available from most of the existing knowledge bases

• A standard functional annotation format, as discussed above, would make this information available to curators and developers
– For instance, the recently proposed Biological Connection Markup Language (BCML) allows pathway representation to specify the cell or organism in which each pathway interaction occurs

– BCML can generate cell-, condition-, or organism-specific pathways based on user-defined query criteria, which in turn can be used for targeted analysis
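
The idea of deriving a context-specific pathway from annotated interactions can be sketched as a simple filter. This is not BCML syntax or any BCML tool's API: the `Interaction` record, the `context_specific` function, and the placeholder gene names are all hypothetical, chosen only to show what a query on organism and cell type might return.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Interaction:
    """One pathway edge plus the context in which it was observed (hypothetical schema)."""
    source: str
    target: str
    organism: str
    cell_type: str


def context_specific(
    interactions: List[Interaction],
    organism: Optional[str] = None,
    cell_type: Optional[str] = None,
) -> List[Interaction]:
    """Keep only interactions matching every criterion that was supplied."""
    return [
        edge
        for edge in interactions
        if (organism is None or edge.organism == organism)
        and (cell_type is None or edge.cell_type == cell_type)
    ]


# Placeholder targets (gene_A..gene_D), not real annotations.
generic_pathway = [
    Interaction("Wnt", "gene_A", "human", "colon carcinoma cell"),
    Interaction("Wnt", "gene_B", "human", "epithelial cell"),
    Interaction("Wnt", "gene_C", "Xenopus", "epithelial cell"),
    Interaction("Wnt", "gene_D", "Xenopus", "colon carcinoma cell"),
]

human_colon = context_specific(generic_pathway, organism="human",
                               cell_type="colon carcinoma cell")
xenopus_all = context_specific(generic_pathway, organism="Xenopus")

print([edge.target for edge in human_colon])
print([edge.target for edge in xenopus_all])
```

A targeted analysis would then use only the filtered edges, so the pathway's gene set is no longer inflated by targets observed in other organisms or cell types.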

Missing Condition- and Cell-Specific Information

• Existing knowledge bases do not describe the effects of an abnormal condition on a pathway
– For example, it is not clear how the Alzheimer's disease pathway in KEGG differs from a normal pathway
– Nor is it clear which set of interactions leads to Alzheimer's disease

• We are now understanding that context plays an important role in pathway interactions

• Information about how cell and tissue type, age, and environmental exposures affect pathway interactions will add complexity that is currently lacking

Methodological Challenges

• Benchmark data sets for comparing different methods

• Inability to model and analyze dynamic response

• Inability to model effects of external stimuli

Comparing Different Methods

• How do we compare different pathway analysis methods?

• Simulated data
– Advantages:
  • Real signal is simulated, so the “true” answer is known
– Disadvantages:
  • Cannot contain all the complexity of real data
  • A method's apparent success can reflect how closely the simulation matches the structure of the knowledge base used
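
Because the truth is known in a simulation, a method's calls can be scored directly. The sketch below computes sensitivity and specificity from sets of pathway names. The pathway labels and the set of "called" pathways are invented for illustration, and `sensitivity_specificity` is a hypothetical helper, not part of any published benchmark.

```python
from typing import Set, Tuple


def sensitivity_specificity(
    all_pathways: Set[str],
    truly_affected: Set[str],
    called_significant: Set[str],
) -> Tuple[float, float]:
    """Score one method's calls against the simulated ground truth."""
    true_pos = len(truly_affected & called_significant)
    false_neg = len(truly_affected - called_significant)
    false_pos = len(called_significant - truly_affected)
    true_neg = len(all_pathways - truly_affected - called_significant)

    # Guard against empty classes so the function never divides by zero.
    sensitivity = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else float("nan")
    specificity = true_neg / (true_neg + false_pos) if (true_neg + false_pos) else float("nan")
    return sensitivity, specificity


# Invented example: 10 pathways, 4 with simulated signal.
all_pathways = {f"P{i}" for i in range(1, 11)}
truly_affected = {"P1", "P2", "P3", "P4"}
called_by_method = {"P1", "P2", "P3", "P7"}   # output of some hypothetical method

sens, spec = sensitivity_specificity(all_pathways, truly_affected, called_by_method)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

The same function could score several methods on one simulated data set, keeping in mind the caveat above that the simulation design itself may favor some methods.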

Comparing Different Methods

• Benchmark data
– Advantages:
  • Can compare sensitivity and specificity
  • Several data sets have been consistently used in the literature
  • Includes all the complexity of real biological data
– Disadvantages:
  • Affected by confounding factors
    – absence of a pure division into classes
    – presence of outliers
    – …
  • No true answer is known for grounded comparisons
    – the actual biology isn't known

Comparing Different Methods

• A general challenge: different definitions of the same pathway in different knowledge bases can affect performance assessment

– GO defines different pathways for apoptosis in different cells
  • (e.g., cardiac muscle cell apoptosis, B cell apoptosis, T cell apoptosis)
  • Further distinguishes between induction and regulation of apoptosis

– KEGG defines a single signaling pathway for apoptosis
  • Does not distinguish between induction and regulation

– An approach using KEGG would identify a single pathway as significant, whereas GO could identify multiple pathways and/or specific aspects of a single apoptosis pathway

Inability to Model and Analyze Dynamic Response

• No existing approach can collectively model and analyze high-throughput data as a single dynamic system

• Current approaches analyze a snapshot, assuming that each pathway is independent of the others at a given time
– Measure expression changes at multiple time points, and analyze each time point individually
– Implicitly assumes that pathways at different time points are independent

• Need models that account for dependence among pathways at different time points
– Much of this limitation is due to technology/experimental design → not all of the limitations are bioinformatic
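
One way to see what is lost when each time point is analyzed individually: the set of per-time-point results is the same no matter how the snapshots are ordered. The sketch below uses a toy pathway score (mean absolute log fold change). The score, the gene names, and the values are all assumptions made for illustration, not a published method or real data.

```python
from statistics import mean
from typing import Dict, List

Snapshot = Dict[str, float]  # gene -> log fold change at one time point


def pathway_score(snapshot: Snapshot, pathway_genes: List[str]) -> float:
    """Toy pathway-level statistic: mean absolute log fold change."""
    return mean(abs(snapshot[gene]) for gene in pathway_genes)


def analyze_independently(series: Dict[str, Snapshot], pathway_genes: List[str]) -> Dict[str, float]:
    """Score every time point on its own, ignoring all other time points."""
    return {time: pathway_score(snap, pathway_genes) for time, snap in series.items()}


pathway = ["g1", "g2", "g3"]  # hypothetical pathway members

# Made-up time course: a response that rises and then decays.
time_course = {
    "0h": {"g1": 0.1, "g2": -0.2, "g3": 0.0},
    "2h": {"g1": 1.5, "g2": -1.0, "g3": 0.8},
    "8h": {"g1": 0.4, "g2": -0.3, "g3": 0.2},
}

# Same snapshots attached to the time labels in reverse order.
labels = list(time_course)
scrambled = dict(zip(labels, reversed(list(time_course.values()))))

scores = analyze_independently(time_course, pathway)
scrambled_scores = analyze_independently(scrambled, pathway)

print(scores)
print(scrambled_scores)
```

Both runs yield the same collection of scores, only attached to different labels, so nothing in the per-time-point analysis uses the temporal order. A model of the dynamic response would need to.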

Inability to Model Effects of External Stimuli

• Gene set–based approaches often consider only genes and their products

• They completely ignore the effects of other molecules participating in a pathway
– such as the rate-limiting step of a multi-step pathway

• Example:
– The amount/strength of Ca2+ causes different transcription factors to be activated

– This information is usually not available

Summary

• In the last decade, pathway analysis has matured and become the standard for trying to dissect the biology of high-throughput experiments.

• Many similarities across the three main generations of pathway analysis tools.

• Will discuss more details of some of these choices, knowledge bases, and specific approaches next.

• Many open methods-development challenges!

Overview of Module

• First Half:
– Overview of gene set and pathway analysis
  • Commonly used databases and annotation issues
  • 1st and 2nd generation tools
    – Basic differences in methods
    – Details on very popular methods
  • Issues with different “omics” data types

• Second Half:
– “3rd generation” methods
– Network analysis modeling

Questions?
