Genome and Proteome Annotation Using Automatically Recognized Concepts and Functional Networks
-
7/28/2019 Genome and Proteome Annotation Using Automatically Recognized Concepts and Functional Networks
Genome and proteome annotation using automatically recognized concepts and functional networks
Adrian Bivol, Tobias Wittkop, Darcy Davis, and Sean Mooney
Mooney laboratory, Buck Institute for Research on Aging, Novato, CA
National Center for Biomedical Ontology, Stanford University, Stanford, CA
-
Gene function/disease prediction
Typically uses Gene Ontology (GO) or disease annotation (e.g. OMIM)
Many tools utilize a similar set of features/networks, e.g. PPI networks, co-expression networks, sequence similarity, ...
Input: set of genes with known function/disease
Output: ranked list of remaining genes (closest at the top)
Can these tools be used for annotations other than GO or disease?
-
Systematic evaluation of automated annotations
1. Annotate all (human) genes to terms from ontologies outside GO and OMIM, e.g. Phenotype Ontology, CHEBI, or Pathway Ontology.
2. For each term (gene set), evaluate predictability, i.e. how well we can predict the genes that are annotated to it using existing gene function prediction methods.
-
Gene annotations outside of GO
NCBO currently includes over 250 ontologies
Ontologies are structured controlled vocabularies
Gene/protein summaries in Entrez Gene and UniProt are often more up to date than manually curated GO
NCBO provides an annotator service¹ that matches text to terms
¹ C. Jonquet et al., AMIA Summit on Translational Bioinformatics (2009)
-
Automatic gene annotation pipeline
1. Collect genes/proteins from Entrez Gene and UniProt
2. Collect descriptive text for each gene/protein from Entrez Gene/UniProt
3. Annotate text to over 200 ontologies via NCBO Annotator
[Figure: pipeline diagram with numbered steps 1 to 3. Panels: "Genome/Proteome" (UniProt text for Q147X3, human; the same text appears on the next slide); "Over 200 biomedical ontologies", illustrated by Gene Ontology (Biological process, Apoptosis signaling, Molecular function, Cellular function) and Cell cycle ontology (Biological process, DNA replication initiation, Cytokinetic process, Biological continuant, Acetyltransferase); recognized terms "acetyltransferase" and "apoptosis".]
-
Gene/protein-specific text as annotation source
Gene text from Entrez Gene
Protein text from UniProt
Gene/Protein summary
Publication titles
GO annotations
Pathway annotations
GeneRIFs
Protein complexes, domains, interactions
We filter out author names, database names, and numbers
Example text (UniProt Q147X3, human):
The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). ; Kinase-selective enrichment enables quantitative phosphoproteomics of the kinome across the cell cycle. ; A quantitative atlas of mitotic phosphorylation. ; A synopsis of eukaryotic Nalpha-terminal acetyltransferases: nomenclature, subunits and substrates. ; Knockdown of human N alpha-terminal acetyltransferase complex C leads to p53-dependent apoptosis and aberrant human Arl8b localization. ; Lysine acetylation targets protein complexes and co-regulates major cellular functions. ;
-!- FUNCTION: Catalytic subunit of the N-terminal acetyltransferase C (NatC) complex. Catalyzes acetylation of the N-terminal methionine residues of peptides beginning with Met-Leu-Ala and Met-Leu-Gly. Necessary for the lysosomal localization and function of ARL8B.
-!- CATALYTIC ACTIVITY: Acetyl-CoA + peptide = N(alpha)-acetylpeptide + CoA.
-!- SUBUNIT: Component of the N-terminal acetyltransferase C (NatC) complex, which is composed of NAA35, LSMD1 and NAA30.
-!- SUBCELLULAR LOCATION: Cytoplasm.
-!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; Named isoforms=2; Name=1; IsoId=Q147X3-1; Sequence=Displayed; Name=2; IsoId=Q147X3-2; Sequence=VSP_031581; Note=No experimental confirmation available;
-!- SIMILARITY: Belongs to the acetyltransferase family. MAK3 subfamily.
-!- SIMILARITY: Contains 1 N-acetyltransferase domain.
-. KEGG; hsa:122830; -. UCSC; uc001xcx.2; human. CTD; 122830; -. GeneCards; GC14P038022; -. H-InvDB; HIX0011696; -. HGNC; HGNC:19844; NAA30. neXtProt; NX_Q147X3; -. PharmGKB; PA134931315; -. eggNOG; prNOG15463; -. GeneTree; ENSGT00390000005665; -. HOGENOM; HBG282398; -. HOVERGEN; HBG082671; -. InParanoid; Q147X3; -. OMA; AGVHSGE; -. OrthoDB; EOG4KKZ4S; -. PhylomeDB; Q147X3; -. NextBio; 81013; -. ArrayExpress; Q147X3; -. Bgee; Q147X3; -. CleanEx; HS_NAT12; -. Genevestigator; Q147X3; -. GO; GO:0005737; C:cytoplasm; IEA:UniProtKB-SubCell. GO; GO:0004596; F:peptide alpha-N-acetyltransferase activity; IEA:EC. InterPro; IPR000182; AcTrfase_GCN5-related_dom. InterPro; IPR016181; Acyl_CoA_acyltransferase. Gene3D; G3DSA:3.40.630.30; Acyl_CoA_acyltransferase; 1. Pfam; PF00583; Acetyltransf_1; 1. SUPFAM; SSF55729; Acyl_CoA_acyltransferase; 1. PROSITE; PS51186; GNAT; 1.
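As a rough illustration of the filtering step listed on this slide (read here as removing database names and numbers), the sketch below cleans a short UniProt-style fragment before annotation. The database-name list, the regular expressions, and the function name `clean_text` are assumptions made for this example; the slides do not state the actual filter rules, and author-name filtering is left out because it would need a name list or a parser.

```python
import re

# Hypothetical list of database names to drop; the real list is not given in the slides.
DB_NAMES = {"KEGG", "UCSC", "CTD", "GeneCards", "HGNC", "Pfam", "InterPro", "PROSITE"}

def clean_text(text: str) -> str:
    """Remove database names and stand-alone numbers from free text."""
    # Split into word-like tokens, keeping accession-style punctuation inside tokens.
    tokens = re.findall(r"[A-Za-z0-9_:.\-]+", text)
    kept = []
    for tok in tokens:
        tok = tok.strip(".:-")          # drop sentence punctuation at token edges
        if not tok:
            continue
        if tok in DB_NAMES:             # database name
            continue
        if re.fullmatch(r"[0-9.]+", tok):  # bare number such as "1" or "2.0"
            continue
        kept.append(tok)
    return " ".join(kept)

example = "Pfam; PF00583; Acetyltransf_1; 1. Catalytic subunit of the NatC complex. KEGG; hsa:122830;"
print(clean_text(example))
```

Accession-like tokens such as `PF00583` survive this simple filter on purpose: dropping every token that contains a digit would also remove names like `Arl8b` or `p53`.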
-
The NCBO annotator¹
Simple string matching using mgrep
Synonyms are annotated
Annotations are propagated to the root
No NLP
Very fast
[Figure: ontology tree excerpts. Gene Ontology: Biological process, Apoptosis signaling, Molecular function, Cellular function. Cell cycle ontology: Biological process, DNA replication initiation, Cytokinetic process, Biological continuant, Acetyltransferase.]
¹ C. Jonquet et al., AMIA Summit on Translational Bioinformatics (2009)
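The three properties named on this slide (plain string matching, synonym matching, propagation to the root) can be sketched with a toy dictionary-based annotator. This is not mgrep or the NCBO Annotator API: the term IDs, the three-term ontology, and all function names are invented for illustration, and each term has a single parent, whereas real ontologies are usually directed acyclic graphs.

```python
import re

# Toy ontology: term id -> (preferred name, synonyms, parent id or None).
# IDs and structure are made up for illustration.
ONTOLOGY = {
    "T:root": ("biological process", [], None),
    "T:death": ("cell death", [], "T:root"),
    "T:apoptosis": ("apoptosis", ["programmed cell death"], "T:death"),
}

def build_lexicon(ontology):
    """Map every lower-cased name and synonym to its term id."""
    lexicon = {}
    for term_id, (name, synonyms, _parent) in ontology.items():
        for label in [name, *synonyms]:
            lexicon[label.lower()] = term_id
    return lexicon

def ancestors(term_id, ontology):
    """Yield all ancestors of a term up to the root."""
    parent = ontology[term_id][2]
    while parent is not None:
        yield parent
        parent = ontology[parent][2]

def annotate(text, ontology):
    """Return direct matches plus terms reached by propagation to the root."""
    lowered = text.lower()
    direct = set()
    for label, term_id in build_lexicon(ontology).items():
        # Whole-word, case-insensitive string match; no linguistic analysis.
        if re.search(r"\b" + re.escape(label) + r"\b", lowered):
            direct.add(term_id)
    propagated = set(direct)
    for term_id in direct:
        propagated.update(ancestors(term_id, ontology))
    return direct, propagated

direct, propagated = annotate("Knockdown leads to p53-dependent apoptosis.", ONTOLOGY)
print(sorted(direct))      # only the term matched in the text
print(sorted(propagated))  # the match plus its ancestors
```

A linear scan over all labels is enough for a toy lexicon; with hundreds of thousands of terms, a dedicated multi-pattern matcher would be needed to keep annotation fast.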
-
Annotation results
683,753,623 annotations of 426,392 genes and proteins to 529,544 terms from 267 ontologies for 7 organisms (human, mouse, rat, fly, worm, yeast, E. coli)
For human:
94,844,772 annotations of 43,823 genes to 436,576 terms
146,221,448 annotations of 68,079 proteins to 373,222 terms
Availability:
RESTful web service at: rest.mooneygroup.org
Term enrichment tool STOP¹: mooneygroup.org/stop
¹ Wittkop et al., BMC Bioinformatics (2013)
-
Systematic evaluation of automated annotations
Use GeneMANIA¹ for gene prioritization:
Combine biological networks with more weight to networks that connect input genes
Find closest genes in genome
Fast, accurate and can be executed locally
¹ Mostafavi et al., Genome Biol. (2008)
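The idea on this slide, giving more weight to networks that connect the input genes and then ranking the remaining genes by closeness, can be illustrated with a deliberately simplified sketch. It is not GeneMANIA's algorithm, whose weighting and propagation are more sophisticated; the weighting rule (fraction of linked seed pairs), the scoring rule (weighted count of direct links to seeds), and the gene names are all assumptions for this example.

```python
from itertools import combinations

def network_weight(network, seeds):
    """Fraction of seed pairs that are directly linked in this network."""
    pairs = list(combinations(sorted(seeds), 2))
    if not pairs:
        return 0.0
    linked = sum(1 for a, b in pairs if frozenset((a, b)) in network)
    return linked / len(pairs)

def rank_genes(networks, seeds, genes):
    """Rank non-seed genes by weighted connectivity to the seed set."""
    weights = [network_weight(net, seeds) for net in networks]
    scores = {}
    for gene in genes:
        if gene in seeds:
            continue
        scores[gene] = sum(
            w * sum(1 for s in seeds if frozenset((gene, s)) in net)
            for w, net in zip(weights, networks)
        )
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scores, key=lambda g: (-scores[g], g))

# Two toy networks as sets of undirected edges (gene names are placeholders).
ppi = {frozenset(e) for e in [("A", "B"), ("B", "C"), ("A", "D"), ("B", "D")]}
coexpr = {frozenset(e) for e in [("A", "E"), ("C", "E")]}

ranking = rank_genes([ppi, coexpr], seeds={"A", "B"}, genes=["A", "B", "C", "D", "E"])
print(ranking)
```

In this toy case the co-expression network does not link the two seed genes, so it receives zero weight and gene E, which is connected to a seed only through that network, ends up at the bottom of the ranking.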
-
Systematic evaluation of automated annotations
3-fold cross-validation of all terms (between 5 and 1000 genes) using the gene prioritization tool GeneMANIA
3-fold cross-validation of random control using two distributions (uniform and gene-annotation-frequency based)
3-fold cross-validation of GO annotations from GOA¹ (including/excluding IEA annotations) and from DAVID²
Use AUROC as quality measure to compare predictability
¹ E. Camon et al., NAR (2004); ² D. Huang et al., Genome Biol. (2007)
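The evaluation protocol above can be sketched as follows: the genes of one term are split into three folds, two folds serve as input genes, and the AUROC measures how well the held-out genes are ranked above genes outside the term. The function names, the `score_fn` interface, and the toy "oracle" scorer are assumptions for this example; in the study the scoring is done by GeneMANIA, which is not reproduced here.

```python
import random

def auroc(scores, positives):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cross_validate(term_genes, all_genes, score_fn, k=3, seed=0):
    """Mean AUROC over k folds; held-out term genes are the positives."""
    genes = sorted(term_genes)
    random.Random(seed).shuffle(genes)
    folds = [genes[i::k] for i in range(k)]
    results = []
    for held_out in folds:
        seeds = set(genes) - set(held_out)           # input genes for this fold
        candidates = [g for g in all_genes if g not in seeds]
        scores = {g: score_fn(g, seeds) for g in candidates}
        results.append(auroc(scores, set(held_out)))
    return sum(results) / k

# Toy data: six term genes inside a universe of nineteen genes.
term = {"g1", "g2", "g3", "g4", "g5", "g6"}
universe = sorted(term) + ["g%d" % i for i in range(7, 20)]
# Hypothetical perfect scorer standing in for a real prioritization tool.
oracle = lambda gene, seeds: 1.0 if gene in term else 0.0
mean_auc = cross_validate(term, universe, oracle)
print(mean_auc)
```

The same `cross_validate` call can be run on randomly drawn gene sets of matching size to obtain the control distributions against which real terms are compared.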
-
Automated annotations are more predictable than random...
For human genes: 127,000 out of 200,000 analyzed terms are statistically significantly above random
[Figure legend: Control uniform; Control annotation-frequency-based; Automated annotations]
-
...and perform comparably to existing annotations
[Figure legend: GOA (EXP); GOA (IEA); DAVID (GO db); Automated annotations]
Note that GOA annotations are sparser and have, on average, smaller gene sets
-
Differences between ontologies demand further analysis
[Bar chart: ontologies with more than 1000 terms ordered by average AUROC; axis ticks 0.78, 0.803, 0.825, 0.848, 0.87]
PRotein Ontology (PRO)
Molecule role (INOH Protein name/family name ontology)
Cell Cycle Ontology
Online Mendelian Inheritance in Man
Gene Ontology Extension
Gene Ontology
Neural-Immune Gene Ontology
Medical Subject Headings
NIFSTD
National Drug File
NCI Thesaurus
CRISP Thesaurus 2006
Logical Observation Identifier Names and Codes
MedDRA
Chemical entities of biological interest
SNOMED Clinical Terms
Experimental Factor Ontology
Suggested Ontology for Pharmacogenomics
Read Codes Clinical Terms Version 3 (CTV3)
Bone Dysplasia Ontology
RadLex
Galen
Human developmental anatomy timed version
-
Examples
ribonuclease P protein subunit p40 (Protein Ontology): AUC = 1
-
Examples
GO:0032041 (NAD-dependent histone deacetylase activity), Gene Ontology: AUC = 1
-
Examples
GO:0072599 (establishment of protein localization in endoplasmic reticulum), Gene Ontology: AUC = 0.99
-
Examples
Pancytopenia (OMIM): AUC = 0.96
-
Examples
Severe combined immunodeficiency (OMIM): AUC = 0.97
-
Conclusions
Existing gene function prediction methods might be applied to other gene annotations
Automated annotations have predictive power
Differences in prediction performance between ontologies/terms exist
Future directions: What are the important features for individual ontologies or subgroups of terms?
-
Thank you for your attention
Special thanks to ...
Buck Institute for Research on Aging: Adrian Bivol, Darcy Davis, Emily TerAvest, Uday Evani, Ari Berman, Tal Oron Ronnen, Mathew Fleisch, Corey Powell
NCBO: Nigam Shah and Trish Wetzel, Stanford University
Funding: NIH R01 LM009722 (PI: Mooney), National Center for Biomedical Ontology U54 HG004028, and the Buck Trust.