Genome and Proteome Annotation Using Automatically Recognized Concepts and Functional Networks

Upload: amia

Post on 03-Apr-2018


  • 7/28/2019 Genome and Proteome Annotation Using Automatically Recognized Concepts and Functional Networks

    1/21

    Genome and proteome annotation using automatically recognized concepts and functional networks

    Adrian Bivol, Tobias Wittkop, Darcy Davis, and Sean Mooney

    Mooney laboratory, Buck Institute for Research on Aging, Novato, CA

    National Center for Biomedical Ontology, Stanford University, Stanford, CA


    2/21

    Gene function/disease prediction

    Typically uses Gene Ontology (GO) or disease annotation (e.g. OMIM)

    Many tools use a similar set of features/networks, e.g. PPI networks, co-expression networks, sequence similarity, ...

    Input: Set of genes with known function/disease

    Output: ranked list of remaining genes (closest at the top)

    Can these tools be used for annotations other than GO or disease?


    3/21

    Systematic evaluation of automated annotations

    1. Annotate all (human) genes to terms from ontologies outside GO and OMIM, e.g. Phenotype Ontology, CHEBI, or Pathway Ontology.

    2. For each term (gene set), evaluate predictability, i.e. how well we can predict the genes that are annotated to it using existing gene function prediction methods.
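Step 2 treats every ontology term as a gene set. The following is a minimal sketch of that inversion, assuming the annotations are available as a gene-to-terms mapping. The function name, the toy annotations, and the data layout are illustrative assumptions. The default size bounds (5 to 1000 genes) are the ones stated later in the deck for the cross-validation.

```python
from collections import defaultdict


def term_gene_sets(gene_to_terms, min_size=5, max_size=1000):
    """Invert gene -> terms into term -> gene set, keeping sets within the size range."""
    sets = defaultdict(set)
    for gene, terms in gene_to_terms.items():
        for term in terms:
            sets[term].add(gene)
    return {t: g for t, g in sets.items() if min_size <= len(g) <= max_size}


# Toy annotations (illustrative only, not real data).
annotations = {
    "NAA30": {"acetyltransferase", "apoptosis"},
    "NAA35": {"acetyltransferase"},
    "TP53": {"apoptosis"},
}

# A lower bound of 2 is used here only because the toy data is tiny.
gene_sets = term_gene_sets(annotations, min_size=2)
```

Each resulting gene set can then be handed to a gene prioritization method as the set of genes with a "known" annotation.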


    4/21

    Gene annotations outside of GO

    NCBO currently includes over 250 ontologies

    Ontologies are structured controlled vocabularies

    Gene/protein summaries in Entrez Gene and UniProt are often more up-to-date than manually curated GO annotations

    NCBO provides an annotator service¹ that matches text to terms

    ¹ C. Jonquet et al. AMIA Summit on Translational Bioinformatics (2009)


    5/21

    Automatic gene annotation pipeline

    1. Collect genes/proteins from Entrez Gene and UniProt

    2. Collect descriptive text for each gene/protein from Entrez Gene/UniProt

    3. Annotate text to over 200 ontologies via NCBO Annotator

    [Figure: the three pipeline steps illustrated for the human protein Q147X3. The gene/protein text from the genome/proteome sources is matched against over 200 biomedical ontologies, yielding terms such as "acetyltransferase" and "apoptosis" in, for example, Gene Ontology and Cell cycle ontology.]


    6/21

    Gene/protein specific text as annotation source

    Gene text from Entrez Gene

    Protein text from UniProt

    Gene/Protein summary

    Publication titles

    GO annotations

    Pathway annotations

    GeneRIFs

    Protein complexes, domains, interactions

    We filter out author names, database names, and numbers
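The filtering step can be pictured with a small token filter. This is only a sketch under stated assumptions: the stop lists below are hypothetical placeholders, and the deck does not describe how the authors actually detect author names or database names.

```python
import re

# Hypothetical stop lists; the lists actually used by the authors are not given.
DB_NAMES = {"kegg", "ucsc", "genecards", "hgnc", "pfam", "interpro"}
AUTHOR_NAMES = {"jonquet", "wittkop"}


def filter_text(text, db_names=DB_NAMES, author_names=AUTHOR_NAMES):
    """Drop tokens that are database names, author names, or plain numbers."""
    kept = []
    for token in text.split():
        core = token.strip(".,;:()").lower()
        if not core:
            continue  # token was punctuation only
        if core in db_names or core in author_names:
            continue
        if re.fullmatch(r"[\d.,]+", core):
            continue  # numeric token such as an identifier or count
        kept.append(token)
    return " ".join(kept)


cleaned = filter_text("KEGG; 122830; Catalytic subunit of the NatC complex.")
```

Removing such tokens before annotation avoids spurious matches, for example a database name that also happens to be a term label in some ontology.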

    Q147X3 human The status, quality, and expansion of the NIH full-length cDNA project: the
    Mammalian Gene Collection (MGC). ; Kinase-selective enrichment enables quantitative
    phosphoproteomics of the kinome across the cell cycle. ; A quantitative atlas of mitotic
    phosphorylation. ; A synopsis of eukaryotic Nalpha-terminal acetyltransferases:
    nomenclature, subunits and substrates. ; Knockdown of human N alpha-terminal
    acetyltransferase complex C leads to p53-dependent apoptosis and aberrant human Arl8b
    localization. ; Lysine acetylation targets protein complexes and co-regulates major
    cellular functions. ; -!- FUNCTION: Catalytic subunit of the N-terminal acetyltransferase
    C (NatC) complex. Catalyzes acetylation of the N-terminal methionine residues of peptides
    beginning with Met-Leu-Ala and Met-Leu-Gly. Necessary for the lysosomal localization and
    function of ARL8B. -!- CATALYTIC ACTIVITY: Acetyl-CoA + peptide = N(alpha)-acetylpeptide +
    CoA. -!- SUBUNIT: Component of the N-terminal acetyltransferase C (NatC) complex, which is
    composed of NAA35, LSMD1 and NAA30. -!- SUBCELLULAR LOCATION: Cytoplasm. -!- ALTERNATIVE
    PRODUCTS: Event=Alternative splicing; Named isoforms=2; Name=1; IsoId=Q147X3-1;
    Sequence=Displayed; Name=2; IsoId=Q147X3-2; Sequence=VSP_031581; Note=No experimental
    confirmation available; -!- SIMILARITY: Belongs to the acetyltransferase family. MAK3
    subfamily. -!- SIMILARITY: Contains 1 N-acetyltransferase domain. -.KEGG; hsa:122830;
    -.UCSC; uc001xcx.2; human.CTD; 122830; -.GeneCards; GC14P038022; -.H-InvDB; HIX0011696;
    -.HGNC; HGNC:19844; NAA30.neXtProt; NX_Q147X3; -.PharmGKB; PA134931315; -.eggNOG;
    prNOG15463; -.GeneTree; ENSGT00390000005665; -.HOGENOM; HBG282398; -.HOVERGEN; HBG082671;
    -.InParanoid; Q147X3; -.OMA; AGVHSGE; -.OrthoDB; EOG4KKZ4S; -.PhylomeDB; Q147X3;
    -.NextBio; 81013; -.ArrayExpress; Q147X3; -.Bgee; Q147X3; -.CleanEx; HS_NAT12;
    -.Genevestigator; Q147X3; -.GO; GO:0005737; C:cytoplasm; IEA:UniProtKB-SubCell.GO;
    GO:0004596; F:peptide alpha-N-acetyltransferase activity; IEA:EC.InterPro; IPR000182;
    AcTrfase_GCN5-related_dom.InterPro; IPR016181; Acyl_CoA_acyltransferase.Gene3D;
    G3DSA:3.40.630.30; Acyl_CoA_acyltransferase; 1.Pfam; PF00583; Acetyltransf_1; 1.SUPFAM;
    SSF55729; Acyl_CoA_acyltransferase; 1.PROSITE; PS51186; GNAT; 1.


    7/21

    The NCBO Annotator¹

    Simple string matching using mgrep

    Synonyms are annotated

    Annotations are propagated to the root

    No NLP

    Very fast
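The three properties above (string matching, synonym handling, propagation to the root) can be illustrated with a toy matcher. This sketch does not use mgrep or the NCBO Annotator service: it substitutes a naive regular-expression lookup, and the three-term ontology, its identifiers, and the single-parent structure are assumptions made for illustration (real ontologies can have several parents per term).

```python
import re

# Toy ontology: term id -> (preferred label, synonyms, parent id or None).
ONTOLOGY = {
    "T0": ("biological process", [], None),
    "T1": ("cellular process", [], "T0"),
    "T2": ("apoptosis", ["programmed cell death"], "T1"),
}


def build_lexicon(ontology):
    """Map every label and synonym (lower-cased) to its term id."""
    lexicon = {}
    for term_id, (label, synonyms, _parent) in ontology.items():
        for name in [label, *synonyms]:
            lexicon[name.lower()] = term_id
    return lexicon


def annotate(text, ontology):
    """Return (directly matched terms, matched terms plus all ancestors)."""
    lowered = text.lower()
    direct = {
        term_id
        for name, term_id in build_lexicon(ontology).items()
        if re.search(r"\b" + re.escape(name) + r"\b", lowered)
    }
    expanded = set(direct)
    for term_id in direct:
        parent = ontology[term_id][2]
        while parent is not None:  # walk up to the root
            expanded.add(parent)
            parent = ontology[parent][2]
    return direct, expanded


direct, expanded = annotate("Knockdown leads to programmed cell death.", ONTOLOGY)
```

Here the synonym "programmed cell death" yields a direct match to the toy term T2, and propagation adds its ancestors T1 and T0.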

    [Figure: matched terms shown within their ontologies. Gene Ontology: Biological process, Apoptosis, signaling, Molecular function, Cellular function. Cell cycle ontology: Biological process, DNA replication initiation, Cytokinetic process, Biological continuant, Acetyltransferase.]

    ¹ C. Jonquet et al. AMIA Summit on Translational Bioinformatics (2009)


    8/21

    Annotation results

    683,753,623 annotations of 426,392 genes and proteins to 529,544 terms from 267 ontologies for 7 organisms (human, mouse, rat, fly, worm, yeast, E. coli)

    For human:

    94,844,772 annotations of 43,823 genes to 436,576 terms

    146,221,448 annotations of 68,079 proteins to 373,222 terms

    Availability:

    RESTful web service at: rest.mooneygroup.org

    Term enrichment tool STOP¹: mooneygroup.org/stop

    ¹ Wittkop et al. BMC Bioinformatics (2013)


    9/21

    Systematic evaluation of automated annotations

    1. Annotate all (human) genes to terms from ontologies outside GO and OMIM, e.g. Phenotype Ontology, CHEBI, or Pathway Ontology.

    2. For each term (gene set), evaluate predictability, i.e. how well we can predict the genes that are annotated to it using existing gene function prediction methods.


    10/21

    Systematic evaluation of automated annotations

    Use GeneMANIA¹ for gene prioritization:

    Combine biological networks with more weight to networks that connect input genes

    Find closest genes in genome

    Fast, accurate and can be executed locally
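The idea of weighting networks by how well they connect the input genes can be sketched as follows. This is not GeneMANIA's algorithm: it swaps in a simple heuristic (each network is weighted by the share of its edge weight that links two input genes, and candidates are scored by their weighted edges to input genes). The function names, the edge-dictionary layout, and the toy networks are assumptions for illustration.

```python
from collections import defaultdict


def network_weight(edges, seeds):
    """Share of a network's total edge weight that links two input (seed) genes."""
    total = sum(edges.values())
    within = sum(w for (a, b), w in edges.items() if a in seeds and b in seeds)
    return within / total if total else 0.0


def prioritize(networks, seeds):
    """Rank non-seed genes by their weighted connectivity to the seed genes."""
    scores = defaultdict(float)
    for edges in networks:
        nw = network_weight(edges, seeds)
        for (a, b), w in edges.items():
            if a in seeds and b not in seeds:
                scores[b] += nw * w
            elif b in seeds and a not in seeds:
                scores[a] += nw * w
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy undirected networks; each gene pair is stored once.
ppi = {("A", "B"): 1.0, ("A", "C"): 1.0, ("D", "E"): 1.0}
coexpr = {("A", "D"): 1.0, ("C", "E"): 1.0}

ranking = prioritize([ppi, coexpr], seeds={"A", "B"})
```

In this toy case the PPI network links the two input genes and therefore dominates, so gene C (a PPI neighbour of A) ranks above gene D (connected only through the co-expression network, which gets zero weight). Genes with no edge to any input gene do not appear in the ranking.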

    ¹ Mostafavi et al. Genome Biol. (2008)


    11/21

    Systematic evaluation of automated annotations

    3-fold cross-validation of all terms (between 5 and 1000 genes) using the gene prioritization tool GeneMANIA

    3-fold cross-validation of random controls using two distributions (uniform and gene-annotation-frequency based)

    3-fold cross-validation of GO annotations from GOA¹ (including/excluding IEA annotations) and from DAVID²

    Use AUROC as quality measure to compare predictability
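The evaluation loop for a single term can be sketched as below: hold out one third of the gene set, rank all remaining genes from the other two thirds, and compute the AUROC of the held-out genes. The scoring function is a hypothetical stand-in for a prioritization tool such as GeneMANIA, and the fold assignment, gene names, and coordinates are toy assumptions.

```python
def auroc(scores, positives):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auroc(gene_set, all_genes, score_fn, k=3):
    """Mean AUROC over k folds; score_fn(train, candidates) returns gene -> score."""
    genes = sorted(gene_set)
    folds = [genes[i::k] for i in range(k)]  # deterministic split; shuffle in practice
    results = []
    for held_out in folds:
        train = set(genes) - set(held_out)
        candidates = [g for g in all_genes if g not in train]
        results.append(auroc(score_fn(train, candidates), set(held_out)))
    return sum(results) / k


# Toy stand-in scorer: genes are points on a line, closer to training genes is better.
coords = {f"g{i}": i for i in range(6)}
coords.update({f"h{i}": 10 + i for i in range(6)})


def nearest_score(train, candidates):
    return {c: -min(abs(coords[c] - coords[t]) for t in train) for c in candidates}


mean_auc = cross_validated_auroc({f"g{i}" for i in range(6)}, list(coords), nearest_score)
```

The same loop can be run unchanged on random control gene sets and on GO gene sets, which makes the resulting AUROC distributions directly comparable.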

    ¹ E. Camon et al. NAR (2004), ² D. Huang et al. Genome Biol. (2007)


    12/21

    Automated annotations are more predictable than random...

    For human genes: 127,000 out of 200,000 analyzed terms are predicted significantly better than random

    [Figure legend: Control (uniform); Control (annotation-frequency-based); Automated annotations]
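The two random controls can be sketched as two ways of drawing a gene set of a given size. The reading of "annotation-frequency-based" used here (genes drawn with probability proportional to their number of annotations) is an assumption, as are the function names and the toy counts. Weighted sampling without replacement is done with Efraimidis-Spirakis keys.

```python
import random


def uniform_control(genes, size, rng):
    """Random gene set in which every gene is equally likely."""
    return set(rng.sample(sorted(genes), size))


def frequency_control(annotation_counts, size, rng):
    """Random gene set in which heavily annotated genes are more likely.

    Weighted sampling without replacement: each gene gets the key u ** (1 / weight)
    for a uniform random u, and the genes with the largest keys are kept.
    """
    keyed = [
        (rng.random() ** (1.0 / count), gene)
        for gene, count in sorted(annotation_counts.items())
        if count > 0
    ]
    keyed.sort(reverse=True)
    return {gene for _key, gene in keyed[:size]}


# Toy annotation counts per gene (illustrative only).
counts = {"A": 100, "B": 50, "C": 1, "D": 0}
rng = random.Random(0)
uniform_set = uniform_control(counts, 2, rng)
frequency_set = frequency_control(counts, 2, rng)
```

The frequency-based control guards against the possibility that a term looks predictable only because its genes are well studied and therefore highly connected in the networks.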


    13/21

    ...and perform comparably to existing annotations

    [Figure legend: GOA (EXP); GOA (IEA); DAVID (GO db); Automated annotations]

    Note that GOA annotations are sparser and have, on average, smaller gene sets


    14/21

    Differences between ontologies demand further analysis

    [Figure: ontologies with more than 1000 terms, ordered by average AUROC (axis range 0.78 to 0.87)]

    PRotein Ontology (PRO)

    Molecule role (INOH Protein name/family name ontology)

    Cell Cycle Ontology

    Online Mendelian Inheritance in Man

    Gene Ontology Extension

    Gene Ontology

    Neural-Immune Gene Ontology

    Medical Subject Headings

    NIFSTD

    National Drug File

    NCI Thesaurus

    CRISP Thesaurus 2006

    Logical Observation Identifier Names and Codes

    MedDRA

    Chemical entities of biological interest

    SNOMED Clinical Terms

    Experimental Factor Ontology

    Suggested Ontology for Pharmacogenomics

    Read Codes Clinical Terms Version 3 (CTV3)

    Bone Dysplasia Ontology

    RadLex

    Galen

    Human developmental anatomy timed version


    15/21

    Examples

    ribonuclease P protein subunit p40 (Protein Ontology): AUC = 1


    16/21

    Examples

    GO:0032041 (NAD-dependent histone deacetylase activity) (Gene Ontology): AUC = 1


    17/21

    Examples

    GO:0072599 (establishment of protein localization in endoplasmic reticulum) (Gene Ontology): AUC = 0.99


    18/21

    Examples

    Pancytopenia (OMIM): AUC = 0.96


    19/21

    Examples

    Severe combined immunodeficiency (OMIM): AUC = 0.97


    20/21

    Conclusions

    Existing gene function prediction methods can be applied to other gene annotations

    Automated annotations have predictive power

    Prediction performance differs between ontologies and between terms

    Future directions: What are the important features for individual ontologies or subgroups of terms?


    21/21

    Thank you for your attention

    Special Thanks to ...

    Buck Institute for Research on Aging

    Adrian Bivol, Darcy Davis, Emily TerAvest, Uday Evani, Ari Berman, Tal Oron Ronnen, Mathew Fleisch, Corey Powell

    NCBO

    Nigam Shah and Trish Wetzel, Stanford University

    Funding

    NIH R01 LM009722 (PI: Mooney), National Center for Biomedical Ontology U54 HG004028, and the Buck Trust.