Gene Annotation Databases

Page 1: Gene Annotation Databases

Gene Annotation Databases

[Figure: interlocking Genes, Diseases, Anatomy and Physiology annotations linking Medical Informatics with Genomics and Bioinformatics, leading to novel relationships & deeper insights]

Page 2: Gene Annotation Databases

Identification and Prioritization of Novel Disease Candidate Genes: Systems Biology Based Integrative Approaches

Anil Jegga

Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center (CCHMC)

Department of Pediatrics, University of Cincinnati

Cincinnati, Ohio

http://anil.cchmc.org

Bioinformatics to Systems Biology

November 16, 2007

Page 3: Gene Annotation Databases

Acknowledgements

• Jing Chen
• Eric Bardes
• Bruce Aronow

Cincinnati Children’s Hospital Medical Center; Computational Medical Center, Cincinnati; Mouse Models of Human Cancers Consortium; University of Cincinnati College of Medicine

Support

• All the publicly available gene annotation resources, especially NCBI, MGI and UCSC

Page 4: Gene Annotation Databases

Medical Informatics, Bioinformatics & the “omes”

Two Separate Worlds….. With Some Data Exchange…

Medical Informatics: Patient Records, Disease Database, PubMed, Clinical Trials, OMIM Clinical Synopsis, Disease World

Disease Database: Name → Synonyms → Related/Similar Diseases → Subtypes → Etiology → Predisposing Causes → Pathogenesis → Molecular Basis → Population Genetics → Clinical findings → System(s) involved → Lesions → Diagnosis → Prognosis → Treatment → Clinical Trials……

Bioinformatics & the “omes”: Genome, Transcriptome, Proteome, miRNAome, Interactome, Metabolome, Physiome, Regulome, Variome, Pathome, Pharmacogenome

382 “omes” so far……… and there is “UNKNOME” too: genes with no function known

http://omics.org/index.php/Alphabetically_ordered_list_of_omics (as on November 15, 2007)

Page 5: Gene Annotation Databases

Integrative Genomics - Biomedical Informatics: the Ultimate Goal…….

Medical Informatics: Patient Records, Disease Database, PubMed, Clinical Trials, OMIM, Disease World

Disease Database: Name → Synonyms → Related/Similar Diseases → Subtypes → Etiology → Predisposing Causes → Pathogenesis → Molecular Basis → Population Genetics → Clinical findings → System(s) involved → Lesions → Diagnosis → Prognosis → Treatment → Clinical Trials……

Bioinformatics: Genome, Transcriptome, Proteome, miRNAome, Interactome, Metabolome, Physiome, Regulome, Variome, Pathome, Pharmacogenome

► Personalized Medicine
► Decision Support System
► Course/Outcome Predictor
► Diagnostic Test Selector
► Clinical Trials Design
► Hypothesis Generator
► Novel Gene/Drug Targets…..

Page 6: Gene Annotation Databases

No Integrative Genomics is Complete without Ontologies

• Gene Ontology (GO): Gene World

• Unified Medical Language System (UMLS): Biomedical World

Page 7: Gene Annotation Databases

The 3 Gene Ontologies

• Molecular Function = elemental activity/task
– the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
– what a product ‘does’, its precise activity

• Biological Process = biological goal or objective
– broad biological goals, such as DNA repair or purine metabolism, that are accomplished by ordered assemblies of molecular functions
– biological objective, accomplished via one or more ordered assemblies of functions

• Cellular Component = location or complex
– subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
– ‘is located in’ (‘is a subcomponent of’)

http://www.geneontology.org

Page 8: Gene Annotation Databases

Example: Gene Product = hammer

Function (what)                   Process (why)
Drive a nail into wood            Carpentry
Drive a stake into soil           Gardening
Smash a bug                       Pest Control
A performer’s juggling object     Entertainment

http://www.geneontology.org

Page 9: Gene Annotation Databases

Unified Medical Language System Knowledge Server (UMLSKS)

• The UMLS Metathesaurus contains information about biomedical concepts and terms from many controlled vocabularies and classifications used in patient records, administrative health data, bibliographic and full-text databases, and expert systems.

• The Semantic Network, through its semantic types, provides a consistent categorization of all concepts represented in the UMLS Metathesaurus. The links between the semantic types provide the structure for the Network and represent important relationships in the biomedical domain.

• The SPECIALIST Lexicon is an English language lexicon with many biomedical terms, containing syntactic, morphological, and orthographic information for each term or word.

http://umlsks.nlm.nih.gov/kss

Page 10: Gene Annotation Databases

Unified Medical Language System Metathesaurus

• Over 1 million biomedical concepts

• About 5 million concept names from more than 100 controlled vocabularies and classifications (some in multiple languages) used in patient records, administrative health data, bibliographic and full-text databases, and expert systems.

• The Metathesaurus is organized by concept or meaning. Alternate names for the same concept (synonyms, lexical variants, and translations) are linked together.

• Each Metathesaurus concept has attributes that help to define its meaning, e.g., the semantic type(s) or categories to which it belongs, its position in the hierarchical contexts from various source vocabularies, and, for many concepts, a definition.

• Customizable: Users can exclude vocabularies that are not relevant for specific purposes or not licensed for use in their institutions. MetamorphoSys, the multi-platform Java install and customization program distributed with the UMLS resources, helps users to generate pre-defined or custom subsets of the Metathesaurus.

• Uses:
– linking between different clinical or biomedical vocabularies
– information retrieval from databases with human-assigned subject index terms and from free-text information sources
– linking patient records to related information in bibliographic, full-text, or factual databases
– natural language processing and automated indexing research
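The concept-oriented organization of the Metathesaurus (many names, one concept) can be pictured as a lookup in which every synonym or lexical variant resolves to the same identifier. The sketch below is a toy illustration only: the identifiers (`DEMO:0001`, `DEMO:0002`), the `ConceptIndex` class and the `normalize` helper are invented for this example and are neither UMLS data nor a UMLS API.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Crude lexical normalization: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


class ConceptIndex:
    """Toy concept-oriented thesaurus: many names map to one concept identifier."""

    def __init__(self) -> None:
        self._names_by_concept = defaultdict(set)
        self._concept_by_name = {}

    def add(self, concept_id: str, name: str) -> None:
        """Link a name (synonym, variant, translation) to a concept."""
        self._names_by_concept[concept_id].add(name)
        self._concept_by_name[normalize(name)] = concept_id

    def lookup(self, name: str):
        """Return the concept identifier for a name, or None if unknown."""
        return self._concept_by_name.get(normalize(name))

    def synonyms(self, name: str) -> set:
        """Return every name linked to the same concept as `name`."""
        concept_id = self.lookup(name)
        if concept_id is None:
            return set()
        return set(self._names_by_concept[concept_id])


index = ConceptIndex()
# Hypothetical identifiers; real Metathesaurus identifiers are not reproduced here.
index.add("DEMO:0001", "Dilated cardiomyopathy")
index.add("DEMO:0001", "Cardiomyopathy, dilated")
index.add("DEMO:0002", "Breast cancer")
index.add("DEMO:0002", "Malignant neoplasm of breast")
```

Here `index.lookup("cardiomyopathy,  DILATED")` and `index.lookup("Dilated cardiomyopathy")` resolve to the same toy concept, which is the property that makes linking across vocabularies possible.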

Page 11: Gene Annotation Databases

Open biomedical ontologies

http://obo.sourceforge.net/

Page 12: Gene Annotation Databases

Mammalian Phenotype Ontology

1. The Mammalian Phenotype (MP) Ontology enables robust annotation of mammalian phenotypes in the context of mutations, quantitative trait loci and strains that are used as models of human biology and disease.

2. Each node in the MP Ontology (MPO) represents a category of phenotypes, and each MP ontology term has a unique identifier, a definition and synonyms, and is associated with gene variants causing these phenotypes in genetically engineered or mutagenesis experiments.

3. In the current version of MPO, there are >4250 terms associated with >4300 unique Entrez mouse genes (extrapolated to ~4300 orthologous human genes).

http://www.informatics.jax.org

Page 13: Gene Annotation Databases

Disease Gene Identification and Prioritization

Hypothesis: The majority of genes that impact or cause disease share membership in one or more of several functional relationships, OR functionally similar or related genes cause similar phenotypes.

Functional Similarity – Common/shared:
• Gene Ontology term
• Pathway
• Phenotype
• Chromosomal location
• Expression
• Cis regulatory elements (transcription factor binding sites)
• miRNA regulators
• Interactions
• Other features…..
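One simple way to make "common/shared" features concrete is to compare two genes' annotation sets per feature type. The sketch below is an illustration under assumptions, not the method used in the talk: the gene names, terms, the `shared_features` helper and the use of the Jaccard index are all choices made for this example.

```python
def shared_features(annotations, gene_a, gene_b):
    """Return, per feature type, the annotation terms that two genes share."""
    shared = {}
    for feature_type, terms_a in annotations.get(gene_a, {}).items():
        terms_b = annotations.get(gene_b, {}).get(feature_type, set())
        common = terms_a & terms_b
        if common:
            shared[feature_type] = common
    return shared


def jaccard(terms_a, terms_b):
    """Jaccard index of two term sets (0.0 when both are empty)."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


# Invented placeholder genes and terms, keyed by feature type.
annotations = {
    "GENE_A": {
        "go": {"term1", "term2"},
        "pathway": {"p1"},
        "phenotype": {"ph1", "ph2"},
    },
    "GENE_B": {
        "go": {"term2", "term3"},
        "pathway": {"p2"},
        "phenotype": {"ph1", "ph2"},
    },
}

common = shared_features(annotations, "GENE_A", "GENE_B")
go_similarity = jaccard(annotations["GENE_A"]["go"], annotations["GENE_B"]["go"])
```

In this toy data the two genes share one GO term and both phenotype terms but no pathway, so they would count as functionally related through two of the three feature types.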

Page 14: Gene Annotation Databases

Background, Problems & Issues

1. Most common diseases are multi-factorial and modified by genetically and mechanistically complex polygenic interactions and environmental factors.

2. High-throughput genome-wide studies, such as linkage analysis and gene expression profiling, tend to be most useful for classification and characterization but do not provide sufficient information to identify or prioritize specific disease-causing genes.

Page 15: Gene Annotation Databases

Background, Problems & Issues

3. Since multiple genes are associated with the same or similar disease phenotypes, it is reasonable to expect the underlying genes to be functionally related.

4. Such functional relatedness (common pathway, interaction, biological process, etc.) can be exploited to aid in finding novel disease genes. For example, genetically heterogeneous hereditary diseases such as Hermansky-Pudlak syndrome and Fanconi anaemia have been shown to be caused by mutations in different interacting proteins.

Page 16: Gene Annotation Databases

Background, Problems & Issues

Disease candidate gene studies

• Linkage, gene expression
• Potential candidate genes (too many!)
• Fine mapping, hand/cherry picking, or a prioritization approach
• Biological experiments (expensive, time consuming)

Example: dilated cardiomyopathy

• Linkage analysis: locus region 10q25-26 (Ellinor et al. J Am Coll Cardiol 2006.)
• ~9.5 Mb with 68 genes
• 7 candidates selected by experts
• ADRB1 missing

Page 17: Gene Annotation Databases

Background, Problems & Issues

Current candidate gene prioritization tools

Assumption: genes involved in the same complex disease will have similar functions

Approach without training
• Input: multiple locus regions
• Enriched functions
• Prioritize genes based on the functions

Approach with training (example: dilated cardiomyopathy)
• Training: known disease genes (10 from OMIM)
• Test: 68 genes at 10q25-26
• Score test genes based on their similarity to training set
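The approach with training can be sketched as: build an annotation profile from the known disease genes, then score and rank each test gene by how much of its annotation overlaps that profile. The scoring rule below (average fraction of training genes sharing each of a gene's terms) is an assumption chosen for readability, not the scoring used by any of the tools discussed; all gene names and terms are placeholders.

```python
from collections import Counter


def build_profile(training_genes, annotations):
    """Count how many training genes carry each annotation term."""
    profile = Counter()
    for gene in training_genes:
        profile.update(annotations.get(gene, set()))
    return profile


def score_gene(gene, profile, n_training, annotations):
    """Average, over the gene's terms, of the fraction of training genes sharing the term."""
    terms = annotations.get(gene, set())
    if not terms or n_training == 0:
        return 0.0
    return sum(profile[term] / n_training for term in terms) / len(terms)


def prioritize(training_genes, test_genes, annotations):
    """Rank test genes by similarity to the training set, best first."""
    profile = build_profile(training_genes, annotations)
    n_training = len(training_genes)
    scored = [(score_gene(g, profile, n_training, annotations), g) for g in test_genes]
    # Highest score first; ties broken alphabetically so the output is deterministic.
    return [gene for _, gene in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Invented placeholder annotations (one flat term set per gene).
annotations = {
    "TRAIN1": {"heart", "muscle"},
    "TRAIN2": {"heart", "kinase"},
    "CAND_X": {"heart", "muscle"},
    "CAND_Y": {"kinase", "brain"},
    "CAND_Z": {"brain"},
}

ranking = prioritize(["TRAIN1", "TRAIN2"], ["CAND_Z", "CAND_Y", "CAND_X"], annotations)
```

With these toy annotations `CAND_X` ranks first because both of its terms occur in the training genes, while `CAND_Z` shares none and ranks last.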

Page 18: Gene Annotation Databases

TOPPGene: Transcriptome Ontology Pathway based Prioritization of Genes

http://toppgene.cchmc.org

Chen J, Xu H, Aronow BJ, Jegga AG. 2007. Improved human disease candidate gene prioritization using mouse phenotype. BMC Bioinformatics 8(1): 392 [Epub ahead of print]

Applications:
1. For functional enrichment
2. For candidate gene prioritization

Why another gene prioritization method?

Page 19: Gene Annotation Databases

Comparison with other related approaches

Method   POCUS   Prospectr   SUSPECTS   ENDEAVOUR   ToppGene
Year     2003    2005        2006       2006        2007

Feature types compared: Sequence Features, GO Annotations, Transcript Features, Protein Features, Literature, Phenotype Annotations, Training set

Page 20: Gene Annotation Databases

Comparison with other related approaches: Feature Details

Method         POCUS   Prospectr   SUSPECTS   ENDEAVOUR   ToppGene
Year           2003    2005        2006       2006        2007
Training set   No      No          Yes        Yes         Yes

Entries per feature type, in the order they appear on the slide:

• Sequence Features & Annotations: gene length, homology, base composition; gene length, homology, base composition; Blast, cis-element; cytoband, cis-element, miRNA targets, GeneSets
• Gene Annotations: Gene Ontology; Gene Ontology; Gene Ontology; Gene Ontology, Mouse Phenotype
• Transcript Features: gene expression; gene expression; EST expression; gene expression
• Protein Features: domains; protein domains; domains, interactions, pathways; domains, interactions, pathways
• Literature: keywords; co-citation

Page 21: Gene Annotation Databases

Mammalian Phenotype Ontology

We do not check whether the human ortholog of a mouse gene causes a similar phenotype. Rather, we assume that orthologous genes cause an “orthologous phenotype” and test the potential of the extrapolated mouse phenotype terms as a similarity measure to prioritize human disease candidate genes.
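The extrapolation step amounts to copying phenotype terms from each mouse gene to its human ortholog. A minimal sketch, assuming a precomputed human-to-mouse ortholog table; the gene symbols, term identifiers and the `extrapolate_phenotypes` function are placeholders invented for this example and are not MGI data.

```python
def extrapolate_phenotypes(mouse_phenotypes, human_to_mouse):
    """Assign each human gene the phenotype terms of its mouse ortholog(s)."""
    human_phenotypes = {}
    for human_gene, mouse_genes in human_to_mouse.items():
        terms = set()
        for mouse_gene in mouse_genes:
            terms |= mouse_phenotypes.get(mouse_gene, set())
        if terms:  # human genes whose orthologs lack annotations stay unannotated
            human_phenotypes[human_gene] = terms
    return human_phenotypes


# Placeholder mouse annotations and ortholog pairs (not real identifiers).
mouse_phenotypes = {
    "Genea": {"MP:demo_cardiac"},
    "Geneb": {"MP:demo_cardiac", "MP:demo_growth"},
}
human_to_mouse = {
    "GENEA": ["Genea"],
    "GENEB": ["Geneb"],
    "GENEC": ["Genec"],  # ortholog exists but has no phenotype annotation
}

human_phenotypes = extrapolate_phenotypes(mouse_phenotypes, human_to_mouse)
```

The resulting term sets can then be treated like any other annotation when comparing a candidate gene with known disease genes; no check is made that the human gene really produces the same phenotype.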

Page 22: Gene Annotation Databases

Mammalian Phenotype Ontology

77 human genes explicitly associated with “heart development” (GO:0007507)

Mouse orthologs cause various types of cardiac phenotype (MPO)

Page 23: Gene Annotation Databases

ToppGene – General Schema

Page 24: Gene Annotation Databases

TOPPGene - Data Sources

1. Gene Ontology: GO and NCBI Entrez Gene
2. Mouse Phenotype: MGI (used for the first time for human disease gene prioritization)
3. Pathways: KEGG, BioCarta, BioCyc, Reactome, GenMAPP, MSigDB
4. Domains: UniProt (Pfam, InterPro, etc.)
5. Interactions: NCBI Entrez Gene (BioGRID, Reactome, BIND, HPRD, etc.)
6. PubMed IDs: NCBI Entrez Gene
7. Expression: GEO
8. Cytoband: MSigDB
9. Cis-Elements: MSigDB
10. miRNA Targets: MSigDB

New features added

Page 25: Gene Annotation Databases

TOPPGene - Validation

• Random-gene cross-validation
– Disease-gene relations from OMIM and GAD databases
– Training set: disease genes with one gene (“target”) removed
– Test set: 100 genes = “target” gene + 99 random genes
– Rank of “target” gene
– Control: random training sets
– AUC and Sensitivity/Specificity
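The random-gene cross-validation loop can be sketched as follows: each disease gene is held out in turn, mixed with 99 randomly drawn genes, and ranked by a scoring function trained on the remaining disease genes. The function name `leave_one_out_ranks`, the `score_fn` interface, the toy gene names and the tie-handling are assumptions made for this illustration.

```python
import random


def leave_one_out_ranks(disease_genes, background_genes, score_fn,
                        n_random=99, seed=0):
    """Hold out each disease gene in turn and rank it among random genes.

    score_fn(gene, training_genes) is a caller-supplied similarity score
    (higher means more similar to the training set).
    """
    rng = random.Random(seed)
    disease_set = set(disease_genes)
    pool = [g for g in background_genes if g not in disease_set]
    ranks = {}
    for target in disease_genes:
        training = [g for g in disease_genes if g != target]
        # Random genes first, target last: the sort is stable, so the target
        # loses every tie, which keeps the estimate conservative.
        test_set = rng.sample(pool, n_random) + [target]
        scores = {g: score_fn(g, training) for g in test_set}
        ranked = sorted(test_set, key=lambda g: scores[g], reverse=True)
        ranks[target] = ranked.index(target) + 1
    return ranks


# Toy data: hypothetical gene names and a deliberately trivial scorer.
disease_genes = [f"D{i}" for i in range(5)]
background_genes = [f"R{i}" for i in range(200)]


def toy_score(gene, training_genes):
    """Pretend every disease gene is perfectly recognizable."""
    return 1.0 if gene.startswith("D") else 0.0


ranks = leave_one_out_ranks(disease_genes, background_genes, toy_score)
```

The control described on the slide would repeat the same loop with randomly chosen training sets in place of the disease genes; the collected ranks are what the AUC and sensitivity/specificity are computed from.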

Page 26: Gene Annotation Databases

TOPPGene - Validation

Random-gene cross-validation: breast cancer example

Disease genes: ATM, BARD1, BRCA1, BRCA2, BRIP1, CASP8, CHEK2, KRAS, PALB2, PIK3CA, PPM1D, RAD51, RB1CC1, SLC22A18, TP53

Training set: BARD1, BRCA1, BRCA2, BRIP1, CASP8, CHEK2, KRAS, PALB2, PIK3CA, PPM1D, RAD51, RB1CC1, SLC22A18, TP53

Test set (“target” gene ATM + 99 random genes): KIAA1333, PQLC3, RBMY2OP, ZNF133, LOC402643, FBL, SLEB4, FAM32A, AACSL, ATM, NDUFB5, DENND4A, C14orf106, ……, KCNJ16

Ranked list after prioritization:
1. ATM
2. KIAA1333
3. PQLC3
4. RBMY2OP
5. ZNF133
6. LOC402643
7. FBL
8. SLEB4
9. FAM32A
10. AACSL
11. NDUFB5
12. DENND4A
13. C14orf106
……
100. KCNJ16

Page 27: Gene Annotation Databases

Random-gene cross-validation result

• Training: 19 diseases with 693 genes
• Control: 20 random sets of 35 genes each
• Sensitivity/Specificity: 77/90
• AUC: 0.916

Sensitivity: frequency of “target” genes that are ranked above a particular threshold position

Specificity: the percentage of genes ranked below the threshold

[Figure: ROC curves; x-axis: false positive rate (1 - specificity), y-axis: true positive rate (sensitivity)]
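Given the rank of each held-out "target" gene in its 100-gene test set, sensitivity, specificity and AUC can be computed directly from the ranks. The sketch below follows the definitions on the slide under two assumptions of its own: specificity is computed over the non-target genes only, and AUC is taken as the average, over runs, of the fraction of non-target genes ranked below the target. The example ranks are invented.

```python
def sensitivity_at(ranks, cutoff):
    """Fraction of held-out target genes ranked at or above the cutoff position."""
    return sum(r <= cutoff for r in ranks) / len(ranks)


def specificity_at(ranks, cutoff, n_test):
    """Fraction of non-target genes ranked below the cutoff position."""
    negatives_per_run = n_test - 1
    true_negatives = 0
    for r in ranks:
        # Non-target genes occupying the top `cutoff` positions in this run.
        false_positives = cutoff - (1 if r <= cutoff else 0)
        true_negatives += negatives_per_run - false_positives
    return true_negatives / (negatives_per_run * len(ranks))


def auc_from_ranks(ranks, n_test):
    """Mean fraction of non-target genes that the target outranks."""
    return sum((n_test - r) / (n_test - 1) for r in ranks) / len(ranks)


# Invented ranks from four cross-validation runs with 100-gene test sets.
example_ranks = [1, 1, 5, 50]
sens = sensitivity_at(example_ranks, cutoff=10)
spec = specificity_at(example_ranks, cutoff=10, n_test=100)
auc = auc_from_ranks(example_ranks, n_test=100)
```

Sweeping `cutoff` from 1 to 100 and plotting sensitivity against 1 - specificity traces out a ROC curve of the kind shown in the figure.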

Page 28: Gene Annotation Databases

Random-gene cross-validation with only one feature

[Figure: AUC of different feature sets. Bar chart over the feature sets All, GO:MF, GO:BP, MP, Pathway, Domain, Pubmed, Interaction and Expression; left axis: AUC (0 to 1); right axis: Coverage (0% to 100%); series: AUC (random control), AUC (p-value score), Coverage]

Using Mouse Phenotype as a feature of similarity measure improves human disease gene prioritization

Page 29: Gene Annotation Databases

Random-gene cross-validation by leaving one feature out

[Figure: ROC curves for All, All – MP, and All – MP – Pubmed; x-axis: false positive rate (1 - specificity), 0.00 to 0.20; y-axis: true positive rate (sensitivity), 0.0 to 1.0]

Overall performance
• All features: 0.913
• All – MP: 0.893
• All – MP – PubMed: 0.888

Sensitivity: true positive rate at a cutoff score

Specificity: true negative rate at the same cutoff

Using Mouse Phenotype as a feature of similarity measure improves human disease gene prioritization

Page 30: Gene Annotation Databases

Locus-region cross-validation using different feature sets

Features               Average rank ratio    Number of times “target”   Number of times “target”
                       of “target” genes     genes were ranked top 5%   genes were ranked top 10%
All                    7.39%                 118                        125
GO + MP + PubMed       7.50%                 118                        126
MP + PubMed            7.08%                 121                        126
Without GO             6.84%                 117                        123
Without Pathway        7.66%                 118                        124
Without Domain         6.71%                 118                        124
Without Interaction    7.17%                 120                        124
Without Expression     7.28%                 118                        128
Without MP             9.77%                 110                        117
Without Pubmed         9.91%                 100                        111
Without MP & Pubmed    22.61%                71                         80
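The columns of this table can be derived from per-run ranks. The sketch below assumes that the rank ratio is the target gene's rank divided by the number of genes in its test set (here, the genes in the surrounding locus region); that definition, the `summarize` helper and the example numbers are assumptions for illustration rather than details given on the slide.

```python
def rank_ratio(rank, n_test_genes):
    """Rank of the target gene as a fraction of the test-set size."""
    return rank / n_test_genes


def summarize(results):
    """Summarize (rank, n_test_genes) pairs from a set of cross-validation runs."""
    ratios = [rank_ratio(rank, n) for rank, n in results]
    return {
        "average_rank_ratio": sum(ratios) / len(ratios),
        "top_5_percent": sum(ratio <= 0.05 for ratio in ratios),
        "top_10_percent": sum(ratio <= 0.10 for ratio in ratios),
    }


# Invented runs: (rank of the target gene, number of genes in the locus region).
example_runs = [(1, 100), (3, 80), (30, 60)]
summary = summarize(example_runs)
```

A lower average rank ratio means the target genes sit closer to the top of their regions, which is why removing MP and PubMed together (22.61%) stands out as the largest loss in the table.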

Page 31: Gene Annotation Databases

ToppGene web server (http://toppgene.cchmc.org): for functional enrichment analysis


Page 35: Gene Annotation Databases

PPI - Predicting Disease Genes

1. Direct protein–protein interactions (PPI) are one of the strongest manifestations of a functional relation between genes.

2. Hypothesis: Interacting proteins lead to the same or similar disease phenotypes when mutated.

3. Several genetically heterogeneous hereditary diseases have been shown to be caused by mutations in different interacting proteins, for example Hermansky-Pudlak syndrome and Fanconi anaemia. Hence, protein–protein interactions might in principle be used to identify potentially interesting disease gene candidates.

Page 36: Gene Annotation Databases

Mining human interactome (HPRD, BioGrid)

• Known Disease Genes: 7
• Direct Interactants of Disease Genes: 66
• Indirect Interactants of Disease Genes: 778

Which of these interactants are potential new candidates?

Prioritize candidate genes in the interacting partners of the disease-related genes
• Training sets: disease-related genes
• Test sets: interacting partners of the training genes
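Collecting the direct and indirect interactants of a set of disease genes is a two-step neighbourhood expansion over the interaction network. A minimal sketch, assuming interactions are available as undirected gene pairs; the `interactant_levels` function and the toy network are invented for this example and do not reflect HPRD or BioGrid data formats.

```python
def interactant_levels(seed_genes, interactions):
    """Split the neighbourhood of seed genes into direct and indirect interactants."""
    neighbours = {}
    for a, b in interactions:  # treat every interaction as undirected
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    seeds = set(seed_genes)

    # Level 1: partners of any seed gene, excluding the seeds themselves.
    direct = set()
    for gene in seeds:
        direct |= neighbours.get(gene, set())
    direct -= seeds

    # Level 2: partners of level-1 genes not already seen at level 0 or 1.
    indirect = set()
    for gene in direct:
        indirect |= neighbours.get(gene, set())
    indirect -= seeds | direct

    return direct, indirect


# Toy network with placeholder gene names; S1 and S2 play the disease genes.
toy_interactions = [
    ("S1", "A"), ("S1", "S2"), ("A", "B"),
    ("B", "C"), ("S2", "D"), ("D", "A"),
]
direct, indirect = interactant_levels(["S1", "S2"], toy_interactions)
```

In the training/test setup described above, the seed genes would form the training set and `direct` (optionally with `indirect`) the test set to be prioritized. The rapid growth from level 1 to level 2 is the reason prioritization is needed at all.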

Page 37: Gene Annotation Databases

Example: Breast cancer

• OMIM genes (level 0): 15
• Directly interacting genes (level 1): 342
• Indirectly interacting genes (level 2): 2469!

Page 38: Gene Annotation Databases

ToppGene web server (http://toppgene.cchmc.org): for candidate gene prioritization


Page 41: Gene Annotation Databases

Example: Breast cancer study. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007 May 27.

rs id        Location   Gene    Training set    Test set
rs2981582    10q26      FGFR2   15 OMIM genes   83 genes in the region

Prioritization result:

Rank   Gene    Description                                        P-value
1      BUB3    budding uninhibited by benzimidazoles 3 homolog    0.003865
2      FGFR2   fibroblast growth factor receptor 2                0.018906
3      BCCIP   BRCA2 and CDKN1A interacting protein               0.04784


Page 43: Gene Annotation Databases

ToppGene Prioritization

Example: Breast cancer

Training set    Test set
15 OMIM genes   342 interacting genes

Ranked Interactants

Rank   Gene         Description
1      ATR          ataxia telangiectasia and Rad3 related
2      FANCD2       Fanconi anemia, complementation group D2
3      NBN (NBS1)   Nibrin

Page 44: Gene Annotation Databases

Limitations

General limitations of any training-test strategy:
• Prior knowledge of disease-gene associations.
• Assumption that the disease genes yet to be discovered will be consistent with what is already known about a disease.
• Dependence on the accuracy and completeness of the functional annotations.
– Only one-fifth of the known human genes have pathway or phenotype annotations, and the functions of more than 40% of genes are still not defined!

Chen et al., 2007; BMC Bioinformatics

Page 45: Gene Annotation Databases

Mouse Phenotype - Limitations

1. MP is not a disease-centric ontology, and the phenotype of the same gene mutation can vary depending on specific mouse strains or their genetic backgrounds.

2. Orthologous genes need not necessarily result in orthologous phenotypes.

Possible Solutions - Future Directions

More efficient cross-species phenome extrapolation, wherein the mouse phenotype terms are mapped semantically to human phenotype concepts (from UMLS) (“orthologous phenotype”) and the resultant orthologous genes associated with an orthologous phenotype are identified.

Chen et al., 2007; BMC Bioinformatics

Page 46: Gene Annotation Databases

PPIs for disease gene identification

Limitations

1. Noisy interactome data
• In vitro vs. in vivo (e.g., only 5.8% of yeast two-hybrid predicted interactions were confirmed by HPRD)
• Extrapolation of interactions from one species to another
• Bias towards “well-studied” genes/proteins
2. Too many interactants! Hub proteins
3. Two interacting proteins need not lead to a similar phenotype when mutated
4. Disease proteins may lie at different points in a pathway and need not interact directly
5. Lastly, disease mutations need not always involve proteins

Oti et al., 2006; J Med Genet

Page 47: Gene Annotation Databases

http://sbw.kgi.edu/

http://anil.cchmc.org (under presentations)

Thank You!

And PRIORITIZATION too!