Integrative Functional Genomics
Medical Informatics
Bioinformatics & the “omes”
Patient Records
Disease Database → Name → Synonyms → Related/Similar Diseases → Subtypes → Etiology → Predisposing Causes → Pathogenesis → Molecular Basis → Population Genetics → Clinical findings → System(s) involved → Lesions → Diagnosis → Prognosis → Treatment → Clinical Trials……
PubMed
Clinical Trials
Two Separate Worlds…..
With Some Data Exchange…
Genome
Transcriptome
miRNAome
Interactome
Metabolome
Physiome
Regulome
Variome
Pathome
Pharmacogenome
OMIM Clinical Synopsis
Disease World
>380 “omes” so far………
and there is “UNKNOME” too - genes with no known function
http://en.wikipedia.org/wiki/List_of_omics_topics_in_biology
http://omics.org/index.php/Alphabetically_ordered_list_of_omics
Proteome
To correlate diseases with anatomical parts affected, the genes/proteins involved, and the underlying physiological processes (interactions, pathways, processes). In other words, bringing the disciplines of Medical Informatics (MI) and BioInformatics (BI) together (Biomedical Informatics - BMI) to support personalized or “tailor-made” medicine.
Motivation
How can multiple types of genome-scale data be integrated across experiments and phenotypes in order to find genes associated with diseases and drug response?
Model Organism Databases: Common Issues
• Heterogeneous Data Sets - Data Integration
  – From Genotype to Phenotype
  – Experimental and Consensus Views
• Incorporation of Large Datasets
  – Whole genome annotation pipelines
  – Large scale mutagenesis/variation projects (dbSNP)
• Computational vs. Literature-based Data Collection and Evaluation (MEDLINE)
• Data Mining
  – extraction of new knowledge
  – testable hypotheses (Hypothesis Generation)
Support Complex Queries
• Show me all genes involved in brain development that are expressed in the Central Nervous System.
• Show me all genes involved in brain development in human and mouse that also show iron ion binding activity.
• For this set of genes, what aspects of function and/or cellular localization do they share?
• For this set of genes, what mutations are reported to cause pathological conditions?
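The queries above amount to set filters over annotated gene records. A minimal sketch of that idea, assuming a hypothetical in-memory record type; the `GeneRecord` fields, the gene symbols and every annotation below are illustrative placeholders, not curated data from any model organism database:

```python
from dataclasses import dataclass, field


@dataclass
class GeneRecord:
    symbol: str
    species: str
    processes: set = field(default_factory=set)     # biological processes
    functions: set = field(default_factory=set)     # molecular functions
    expressed_in: set = field(default_factory=set)  # tissues / systems


# Hypothetical toy records: symbols and annotations are placeholders.
RECORDS = [
    GeneRecord("GENE_A", "human", {"brain development"},
               {"iron ion binding"}, {"central nervous system"}),
    GeneRecord("GENE_B", "mouse", {"brain development"},
               {"DNA binding"}, {"central nervous system"}),
    GeneRecord("GENE_C", "human", {"heart development"},
               {"iron ion binding"}, {"heart"}),
]


def query(records, process=None, function=None, tissue=None, species=None):
    """Return symbols of records that satisfy every criterion that was supplied."""
    hits = []
    for rec in records:
        if process is not None and process not in rec.processes:
            continue
        if function is not None and function not in rec.functions:
            continue
        if tissue is not None and tissue not in rec.expressed_in:
            continue
        if species is not None and rec.species not in species:
            continue
        hits.append(rec.symbol)
    return hits


def shared_functions(records, symbols):
    """Return the molecular functions common to all of the named genes."""
    sets = [rec.functions for rec in records if rec.symbol in symbols]
    return set.intersection(*sets) if sets else set()
```

For example, `query(RECORDS, process="brain development", tissue="central nervous system")` corresponds to the first query, and `shared_functions` to the "what do they share" question. A real system would answer these from integrated annotation sources rather than a hand-written list.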
Bioinformatic Data - 1978 to present
• DNA sequence
• Gene expression
• Protein expression
• Protein Structure
• Genome mapping
• SNPs & Mutations
• Metabolic networks
• Regulatory networks
• Trait mapping
• Gene function analysis
• Scientific literature
• and others………..
Human Genome Project – Data Deluge
No. of Human Gene Records currently in NCBI: ~30K (excluding pseudogenes, mitochondrial genes and obsolete records).
Includes ~700 microRNAs
NCBI Human Genome Statistics – as on November 4, 2009
The Gene Expression Data Deluge
Till 2000: 413 papers on microarray!

Year    PubMed Articles
2001    834
2002    1557
2003    2421
2004    3508
2005    4400
2006    4824
2007    5108
2008    5884
2009    5207…..
Problems Deluge!
Allison DB, Cui X, Page GP, Sabripour M. 2006. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet. 7(1): 55-65.
• 3 scientific journals in 1750
• Now - >120,000 scientific journals!
• >500,000 medical articles/year
• >4,000,000 scientific articles/year
• >16 million abstracts in PubMed derived from >32,500 journals
Information Deluge…..
A researcher would have to scan 130 different journals and read 27 papers per day to follow a single disease, such as breast cancer (Baasiri et al., 1999 Oncogene 18: 7958-7965).
Gene names
• Accelerin
• Antiquitin
• Bang Senseless
• Bride of Sevenless
• Christmas Factor
• Cockeye
• Crack
• Draculin
• Dickie’s small eye
• Fidgetin
• Gleeful
• Knobhead
• Lunatic Fringe
• Mortalin
• Orphanin
• Profilactin
• Sonic Hedgehog
Disease names
• Mobius Syndrome with Poland’s Anomaly
• Werner’s syndrome
• Down’s syndrome
• Angelman’s syndrome
• Creutzfeldt-Jakob disease
Data-driven Problems…..
Gene Nomenclature
• How to name or describe proteins, genes, drugs, diseases and conditions consistently and coherently?
• How to ascribe and name a function, process or location consistently?
• How to describe interactions, partners, reactions and complexes?
• Develop/Use controlled or restricted vocabularies (IUPAC-like naming conventions, HGNC, MGI, UMLS, etc.)
• Create/Use thesauruses, central repositories or synonym lists (MeSH, UMLS, etc.)
• Work towards synoptic reporting and structured abstracting
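The synonym-list idea can be sketched as a lookup that maps any alias to one approved name. This is only an illustration: the `SYNONYMS` table is a hypothetical hand-written stand-in for a real resource such as HGNC or MeSH, and contains just two name pairs that appear in these slides.

```python
# Hypothetical synonym table: approved name -> set of aliases.
# A real table would be loaded from a nomenclature resource, not typed by hand.
SYNONYMS = {
    "TP53": {"p53"},
    "Pax6": {"Dickie's small eye"},
}


def build_lookup(synonyms):
    """Build a case-insensitive map from every alias (and approved name) to the approved name."""
    lookup = {}
    for approved, aliases in synonyms.items():
        lookup[approved.casefold()] = approved
        for alias in aliases:
            lookup[alias.casefold()] = approved
    return lookup


def normalize(name, lookup):
    """Return the approved name for a free-text name, or None if it is unknown."""
    return lookup.get(name.strip().casefold())


LOOKUP = build_lookup(SYNONYMS)
```

With such a lookup, text mentioning "p53" and text mentioning "TP53" resolve to the same record. Case-insensitive exact matching is an assumption made for brevity; real names also vary in punctuation and spacing.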
Some Solutions
1. Generally, the names refer to some feature of the mutant phenotype
2. Dickie’s small eye (Thieler et al., 1978, Anat Embryol (Berl), 155: 81-86) is now Pax6
3. Gleeful: "This gene encodes a C2H2 zinc finger transcription factor with high sequence similarity to vertebrate Gli proteins, so we have named the gene gleeful (Gfl)." (Furlong et al., 2001, Science 293: 1632)
What’s in a name!
Rose is a rose is a rose is a rose!
Rose is a rose is a rose is a rose….. Not Really!
What is a cell?
• any small compartment;
• (biology) the basic structural and functional unit of all organisms; they may exist as independent units of life (as in monads) or may form colonies or tissues as in higher plants and animals
• a device that delivers an electric current as the result of a chemical reaction
• a small unit serving as part of or as the nucleus of a larger political movement
• cellular telephone: a hand-held mobile radiotelephone for use in an area divided into small sections, each with its own short-range transmitter/receiver
• small room in which a monk or nun lives
• a room where a prisoner is kept
Foundation Model Explorer
Semantic Groups, Types and Concepts:
• Semantic Group Biology – Semantic Type Cell
• Semantic Groups Object OR Devices – Semantic Types Manufactured Device or Electrical Device or Communication Device
• Semantic Group Organization – Semantic Type Political Group
No. of Records per database and query

Database name    Query = p53    Query = TP53 (HGNC)    Query = p53 OR TP53
PubMed           48,679         3,360                  49,469
PMC              21,193         1,529                  21,564
Book             782            504                    820
Nucleotide       9,473          592                    9,773
Protein          6,219          509                    6,377
Genome           22             1                      23
OMIM             403            141                    414
SNP              424            337                    453
Gene             1,642          338                    1,750
Homologene       63             9                      68
GEO Profiles     352,684        15,140                 358,999
Cancer Chr       302            161                    463
Hepatocellular Carcinoma
CTNNB1
MET
TP53
1. COLORECTAL CANCER [3-BP DEL, SER45DEL]
2. COLORECTAL CANCER [SER33TYR]
3. PILOMATRICOMA, SOMATIC [SER33TYR]
4. HEPATOBLASTOMA, SOMATIC [THR41ALA]
5. DESMOID TUMOR, SOMATIC [THR41ALA]
6. PILOMATRICOMA, SOMATIC [ASP32GLY]
7. OVARIAN CARCINOMA, ENDOMETRIOID TYPE, SOMATIC [SER37CYS]
8. HEPATOCELLULAR CARCINOMA SOMATIC [SER45PHE]
9. HEPATOCELLULAR CARCINOMA SOMATIC [SER45PRO]
10. MEDULLOBLASTOMA, SOMATIC [SER33PHE]
1. HEPATOCELLULAR CARCINOMA SOMATIC [ARG249SER]
TP53*
Aflatoxin B1, a mycotoxin, induces a very specific G-to-T mutation at codon 249 of the tumor suppressor gene TP53 (p53).
Environmental Effects
Many disease states are complex because they involve many genes (alleles & ethnicity, gene families, etc.), environmental effects (life style, exposure, etc.) and the interactions among them.
The REAL Problems
HEPATOCELLULAR CARCINOMA
LIVER:
• Hepatocellular carcinoma
• Micronodular cirrhosis
• Subacute progressive viral hepatitis
NEOPLASIA:
• Primary liver cancer
CTNNB1
MET
TP53
1. ALK in cardiac myocytes
2. Cell to Cell Adhesion Signaling
3. Inactivation of Gsk3 by AKT causes accumulation of β-catenin in Alveolar Macrophages
4. Multi-step Regulation of Transcription by Pitx2
5. Presenilin action in Notch and Wnt signaling
6. Trefoil Factors Initiate Mucosal Healing
7. WNT Signaling Pathway
1. CBL mediated ligand-induced downregulation of EGF receptors
2. Signaling of Hepatocyte Growth Factor Receptor
1. Estrogen-responsive protein Efp controls cell cycle and breast tumors growth
2. ATM Signaling Pathway
3. BTG family proteins and cell cycle regulation
4. Cell Cycle
5. RB Tumor Suppressor/Checkpoint Signaling in response to DNA damage
6. Regulation of transcriptional activity by PML
7. Regulation of cell cycle progression by Plk3
8. Hypoxia and p53 in the Cardiovascular system
9. p53 Signaling Pathway
10. Apoptotic Signaling in Response to DNA Damage
11. Role of BRCA1, BRCA2 and ATR in Cancer Susceptibility
….Many More…..
The REAL Problems
Integrative Genomics - what is it?
Another buzzword or a meaningful concept useful for biomedical research?
Acquisition, Integration, Curation, and Analysis of biological data
Integrative Genomics: the study of complex interactions between genes, organism and environment, the triple helix of biology.
Gene <-> Organism <-> Environment
It is definitely beyond the buzzword stage - Universities now have programs named 'Integrated Genomics.'
Hypothesis
Information is not knowledge - Albert Einstein
1. Link-driven federations
   • Explicit links between databanks.
2. Warehousing
   • Data is downloaded, filtered, integrated and stored in a warehouse. Answers to queries are taken from the warehouse.
3. Others….. Semantic Web, etc………
Methods for Integration
1. Create explicit links between databanks
2. Query: retrieve results of interest, then use web links to reach related data in other databanks
Examples: NCBI-Entrez, SRS
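The link-following idea can be sketched with a few toy databanks whose records carry explicit cross-references. Everything here is hypothetical (databank names, record IDs, the `links` field); it does not reproduce how Entrez or SRS store or resolve their links.

```python
# Hypothetical databanks: each record lists explicit links as (databank, record_id) pairs.
DATABANKS = {
    "gene": {
        "G1": {"name": "example gene",
               "links": [("protein", "P1"), ("literature", "L1")]},
    },
    "protein": {
        "P1": {"name": "example protein", "links": [("literature", "L1")]},
    },
    "literature": {
        "L1": {"name": "example article", "links": []},
    },
}


def follow_links(databank, record_id, databanks):
    """Return (databank, id, name) for every record directly linked from one record."""
    record = databanks[databank][record_id]
    return [(db, rid, databanks[db][rid]["name"]) for db, rid in record["links"]]
```

Starting from a hit in one databank, the user hops to related records elsewhere; nothing is merged, so each databank keeps its own terminology.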
Link-driven Federations
http://www.ncbi.nlm.nih.gov/Database/datamodel/
1. Advantages
   • complex queries
   • fast
2. Disadvantages
   • require good knowledge
   • syntax based
   • terminology problem not solved
Link-driven Federations
Data is downloaded, filtered, integrated and stored in a warehouse. Answers to queries are taken from the warehouse.
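A minimal sketch of the download, filter, integrate, store and query cycle, using Python's built-in `sqlite3` module with an in-memory database. The two "downloaded" source lists, the table names and the filtering rules are assumptions chosen for illustration only.

```python
import sqlite3

# Hypothetical downloaded source data (placeholders, not real records).
GENE_SOURCE = [("G1", "GENE_A"), ("G2", "GENE_B"), ("G3", None)]  # (gene id, symbol)
DISEASE_SOURCE = [("G1", "disease X"), ("G2", "disease Y"), ("G9", "disease Z")]


def build_warehouse(genes, diseases):
    """Filter the downloaded records, integrate them, and store them locally."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (id TEXT PRIMARY KEY, symbol TEXT NOT NULL)")
    conn.execute("CREATE TABLE gene_disease (gene_id TEXT, disease TEXT)")
    # Filter: drop gene records that have no symbol.
    clean = [(gid, sym) for gid, sym in genes if sym]
    conn.executemany("INSERT INTO gene VALUES (?, ?)", clean)
    # Integrate: keep only associations whose gene made it into the warehouse.
    known = {gid for gid, _ in clean}
    linked = [(gid, dis) for gid, dis in diseases if gid in known]
    conn.executemany("INSERT INTO gene_disease VALUES (?, ?)", linked)
    conn.commit()
    return conn


def diseases_for(conn, symbol):
    """Answer a query from the warehouse alone, without contacting the sources."""
    rows = conn.execute(
        "SELECT d.disease FROM gene g "
        "JOIN gene_disease d ON d.gene_id = g.id "
        "WHERE g.symbol = ? ORDER BY d.disease",
        (symbol,),
    ).fetchall()
    return [row[0] for row in rows]
```

Because answers come from the local copy, the warehouse is only as current as its last rebuild.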
Data Warehousing
Advantages
1. Good for very specific, task-based queries and studies.
2. Since it is custom-built and usually expert-curated, relatively less error-prone.
Disadvantages
1. Can quickly become outdated – needs constant updates.
2. Limited functionality – e.g., one disease-based or one system-based.
No Integrative Genomics is Complete without Ontologies
• Gene Ontology (GO)
• Unified Medical Language System (UMLS)
Gene World Biomedical World
• Molecular Function = elemental activity/task
  – the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
  – What a product ‘does’, precise activity
• Biological Process = biological goal or objective
  – broad biological goals, such as DNA repair or purine metabolism, that are accomplished by ordered assemblies of molecular functions
  – Biological objective, accomplished via one or more ordered assemblies of functions
• Cellular Component = location or complex
  – subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
  – ‘is located in’ (‘is a subcomponent of’)
The 3 Gene Ontologies
http://www.geneontology.org
Function (what)                   Process (why)
Drive a nail - into wood          Carpentry
Drive stake - into soil           Gardening
Smash a bug                       Pest Control
A performer’s juggling object     Entertainment
Example: Gene Product = hammer
• ISS: Inferred from sequence or structural similarity
• IDA: Inferred from direct assay
• IPI: Inferred from physical interaction
• TAS: Traceable author statement
• IMP: Inferred from mutant phenotype
• IGI: Inferred from genetic interaction
• IEP: Inferred from expression pattern
• ND: No data available
GO term associations: Evidence Codes
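Evidence codes let a user decide which annotations to trust for a given analysis. A small sketch of filtering annotations by code; the `ANNOTATIONS` tuples are hypothetical placeholders, and which codes to accept is left to the caller as an assumption.

```python
# Evidence codes as listed on the slide.
EVIDENCE_CODES = {
    "ISS": "Inferred from sequence or structural similarity",
    "IDA": "Inferred from direct assay",
    "IPI": "Inferred from physical interaction",
    "TAS": "Traceable author statement",
    "IMP": "Inferred from mutant phenotype",
    "IGI": "Inferred from genetic interaction",
    "IEP": "Inferred from expression pattern",
    "ND": "No data available",
}

# Hypothetical annotations: (gene, GO term label, evidence code).
ANNOTATIONS = [
    ("GENE_A", "DNA repair", "IDA"),
    ("GENE_A", "nucleus", "ISS"),
    ("GENE_B", "ATPase activity", "IMP"),
    ("GENE_C", "purine metabolism", "ND"),
]


def filter_by_evidence(annotations, allowed):
    """Keep only annotations whose evidence code is in the allowed set."""
    unknown = set(allowed) - EVIDENCE_CODES.keys()
    if unknown:
        raise ValueError(f"unknown evidence codes: {sorted(unknown)}")
    return [ann for ann in annotations if ann[2] in allowed]
```

For instance, `filter_by_evidence(ANNOTATIONS, {"IDA", "IMP"})` keeps only the annotations backed by a direct assay or a mutant phenotype.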
• Access gene product functional information
• Find how much of a proteome is involved in a process/ function/ component in the cell
• Map GO terms and incorporate manual annotations into own databases
• Provide a link between biological knowledge and
  – gene expression profiles
  – proteomics data
What can researchers do with GO?
• Getting the GO and GO_Association Files
• Data Mining
  – My Favorite Gene
  – By GO
  – By Sequence
• Analysis of Data
  – Clustering by function/process
• Other Tools
And how?
http://www.geneontology.org/
Gene list enrichment analysis tools (DAVID, FatiGO, ToppGene)
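One common way to score enrichment of an annotation term in a gene list is a one-sided hypergeometric test. The sketch below shows that calculation using only the standard library; it is an illustration of the general idea and is not claimed to be the statistic that DAVID, FatiGO or ToppGene actually implement. The function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, annotated_genes, list_size, list_hits):
    """Probability of seeing at least `list_hits` annotated genes when
    `list_size` genes are drawn without replacement from `total_genes`,
    of which `annotated_genes` carry the term."""
    upper = min(annotated_genes, list_size)
    denominator = comb(total_genes, list_size)
    numerator = sum(
        comb(annotated_genes, i) * comb(total_genes - annotated_genes, list_size - i)
        for i in range(list_hits, upper + 1)
    )
    return numerator / denominator
```

In the notation of the code, with N total genes, K annotated genes, a list of n genes and k hits, the value returned is the sum over i from k to min(K, n) of C(K, i) · C(N − K, n − i) / C(N, n). When many terms are tested, the resulting p-values need a multiple-testing correction, which this sketch omits.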
Open biomedical ontologies
http://obo.sourceforge.net/
Unified Medical Language System Knowledge Server – UMLSKS
http://umlsks.nlm.nih.gov/kss/
• The UMLS Metathesaurus contains information about biomedical concepts and terms from many controlled vocabularies and classifications used in patient records, administrative health data, bibliographic and full-text databases, and expert systems.
• The Semantic Network, through its semantic types, provides a consistent categorization of all concepts represented in the UMLS Metathesaurus. The links between the semantic types provide the structure for the Network and represent important relationships in the biomedical domain.
• The SPECIALIST Lexicon is an English language lexicon with many biomedical terms, containing syntactic, morphological, and orthographic information for each term or word.
Unified Medical Language System Metathesaurus
• More than 1 million biomedical concepts
• About 5 million concept names from more than 100 controlled vocabularies and classifications (some in multiple languages) used in patient records, administrative health data, bibliographic and full-text databases and expert systems.
• The Metathesaurus is organized by concept or meaning. Alternate names for the same concept (synonyms, lexical variants, and translations) are linked together.
• Each Metathesaurus concept has attributes that help to define its meaning, e.g., the semantic type(s) or categories to which it belongs, its position in the hierarchical contexts from various source vocabularies, and, for many concepts, a definition.
• Customizable: Users can exclude vocabularies that are not relevant for specific purposes or not licensed for use in their institutions. MetamorphoSys, the multi-platform Java install and customization program distributed with the UMLS resources, helps users to generate pre-defined or custom subsets of the Metathesaurus.
• Uses:
  – linking between different clinical or biomedical vocabularies
  – information retrieval from databases with human assigned subject index terms and from free-text information sources
  – linking patient records to related information in bibliographic, full-text, or factual databases
  – natural language processing and automated indexing research
UMLSKS – Semantic Network
• Complexity reduced by grouping concepts according to the semantic types that have been assigned to them.
• There are currently 15 semantic groups that provide a partition of the UMLS Metathesaurus for 99.5% of the concepts.
ACTI|Activities & Behaviors|T053|Behavior
ANAT|Anatomy|T024|Tissue
CHEM|Chemicals & Drugs|T195|Antibiotic
CONC|Concepts & Ideas|T170|Intellectual Product
DEVI|Devices|T074|Medical Device
DISO|Disorders|T047|Disease or Syndrome
GENE|Genes & Molecular Sequences|T085|Molecular Sequence
GEOG|Geographic Areas|T083|Geographic Area
LIVB|Living Beings|T005|Virus
OBJC|Objects|T073|Manufactured Object
OCCU|Occupations|T091|Biomedical Occupation or Discipline
ORGA|Organizations|T093|Health Care Related Organization
PHEN|Phenomena|T038|Biologic Function
PHYS|Physiology|T040|Organism Function
PROC|Procedures|T061|Therapeutic or Preventive Procedure
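The pipe-delimited lines above can be read into a lookup from semantic type to semantic group. A minimal sketch, assuming each line has exactly the four fields shown (group abbreviation, group name, type identifier, type name); the `SAMPLE` text simply repeats three of the lines listed above.

```python
# Three of the lines shown above, used as sample input.
SAMPLE = """\
ANAT|Anatomy|T024|Tissue
DISO|Disorders|T047|Disease or Syndrome
LIVB|Living Beings|T005|Virus
"""


def parse_semantic_groups(text):
    """Map each semantic type ID to (group abbreviation, group name, type name)."""
    types = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        abbrev, group, type_id, type_name = line.split("|")
        types[type_id] = (abbrev, group, type_name)
    return types
```

Once loaded, a concept tagged with type `T047` can be rolled up to the coarser "Disorders" group, which is how grouping reduces complexity.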
Semantic Groups (15)
Semantic Types (135)
Concepts (millions)
UMLSKS – Semantic Navigator
Part 2: Integrative Functional Genomic Approaches to Identify and Prioritize Disease Genes
Disease Gene Identification and Prioritization
Hypothesis: The majority of genes that impact or cause disease share membership in one or more functional relationships; in other words, functionally similar or related genes cause similar phenotypes.
Functional Similarity – Common/shared
• Gene Ontology term
• Pathway
• Phenotype
• Chromosomal location
• Expression
• Cis regulatory elements (Transcription factor binding sites)
• miRNA regulators
• Interactions
• Other features…..
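Shared features can be turned into a simple similarity score. The sketch below uses Jaccard overlap between feature sets and ranks candidates by their mean similarity to a set of training genes. This is a deliberately simplified stand-in chosen for illustration, not the scoring method of any tool named in these slides; gene names and features are hypothetical.

```python
# Hypothetical feature sets (GO terms, pathways, phenotypes, ...) per gene.
FEATURES = {
    "GENE_A": {"GO:process_1", "pathway_1", "phenotype_1"},
    "GENE_B": {"GO:process_1", "pathway_1", "phenotype_2"},
    "GENE_C": {"GO:process_9"},
}


def jaccard(a, b):
    """Shared features divided by all features of the two genes (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_by_similarity(training, candidates, features):
    """Rank candidates by mean Jaccard similarity to the training genes, best first."""
    scores = {
        cand: sum(jaccard(features[cand], features[t]) for t in training) / len(training)
        for cand in candidates
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Here `GENE_B` shares two of four distinct features with `GENE_A` and so outranks `GENE_C`, which shares none.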
1. Most of the common diseases are multi-factorial and modified by genetically and mechanistically complex polygenic interactions and environmental factors.
2. High-throughput genome-wide studies, such as linkage analysis and gene expression profiling, tend to be most useful for classification and characterization but do not provide sufficient information to identify or prioritize specific disease-causing genes.
Background, Problems & Issues
3. Since multiple genes are associated with the same or similar disease phenotypes, it is reasonable to expect the underlying genes to be functionally related.
4. Such functional relatedness (common pathway, interaction, biological process, etc.) can be exploited to aid in finding novel disease genes. For example, genetically heterogeneous hereditary diseases such as Hermansky-Pudlak syndrome and Fanconi anaemia have been shown to be caused by mutations in different interacting proteins.
1. Direct protein–protein interactions (PPI) are one of the strongest manifestations of a functional relation between genes.
2. Hypothesis: Interacting proteins lead to the same or similar disease phenotypes when mutated.
3. Several genetically heterogeneous hereditary diseases have been shown to be caused by mutations in different interacting proteins, for example Hermansky-Pudlak syndrome and Fanconi anaemia. Hence, protein–protein interactions might in principle be used to identify potentially interesting disease gene candidates.
PPI - Predicting Disease Genes
Mining human interactome (HPRD, BioGRID)
• Known Disease Genes
• Direct Interactants of Disease Genes
• Indirect Interactants of Disease Genes
Which of these interactants are potential new candidates?
[Figure: gene counts shown in the diagram: 7, 66, 778]
Prioritize candidate genes in the interacting partners of the disease-related genes
• Training sets: disease-related genes
• Test sets: interacting partners of the training genes
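Collecting the direct and indirect interactants of known disease genes is a neighborhood lookup in an interaction graph. A minimal sketch, assuming a hypothetical list of undirected interaction pairs; a real analysis would load the pairs from an interaction resource such as those named above.

```python
# Hypothetical undirected protein-protein interaction pairs.
INTERACTIONS = [
    ("D1", "A"), ("D1", "B"), ("A", "C"),
    ("B", "C"), ("C", "E"), ("D2", "B"),
]


def build_graph(pairs):
    """Build an adjacency map from undirected interaction pairs."""
    graph = {}
    for left, right in pairs:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)
    return graph


def interactants(graph, disease_genes):
    """Return (direct, indirect) interactants of the disease genes.

    Direct: neighbors of any disease gene that are not disease genes themselves.
    Indirect: neighbors of direct interactants, excluding both earlier sets.
    """
    seeds = set(disease_genes)
    direct = set().union(*(graph.get(gene, set()) for gene in seeds)) - seeds
    indirect = set().union(*(graph.get(gene, set()) for gene in direct)) - direct - seeds
    return direct, indirect
```

The disease genes then serve as the training set and the returned interactants as the test set to be prioritized.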
ToppGene Suite – General Schema
http://toppgene.cchmc.org
Application: ToppFun
• Description: Detects functional enrichment of input gene list based on Transcriptome (gene expression), Proteome (protein domains and interactions), Regulome (TFBS and miRNA), Ontologies (GO, Pathway), Phenotype (human disease and mouse phenotype), Pharmacome (Drug-Gene associations), and Bibliome (literature co-citation).
• Input: Single gene list. Supported identifiers include NCBI Entrez gene IDs, approved human gene symbols, and NCBI Reference Sequence accession numbers.
• Output: HTML output; tab-delimited downloadable text file; graphical charts

Application: ToppGene
• Description: Prioritize or rank candidate genes based on functional similarity to training gene list.
• Input: Same as above but with two gene lists (training and test)
• Output: HTML output

Application: ToppNet
• Description: Prioritize or rank candidate genes based on topological features in protein-protein interaction network.
• Input: Same as above
• Output: HTML output; Cytoscape compatible input file; graphical networks

Application: ToppGeNet
• Description: Identify and prioritize the neighboring genes of the “seeds” in protein-protein interaction network based on functional similarity to the “seed” list (ToppGene) or topological features in protein-protein interaction network (ToppNet).
• Input: Single gene list
• Output: Same as above
ToppGene Suite – Applications
http://toppgene.cchmc.org
Disease            Reference                       Gene       ToppGene Rank   ToppNet Rank
Bipolar Disorder   Le-Niculescu et al.             KLF12      2               15
Bipolar Disorder   Le-Niculescu et al.             RORB       4               18
Bipolar Disorder   Le-Niculescu et al.             RORA       7               13
Bipolar Disorder   Le-Niculescu et al.             ALDH1A1    10              No interaction data
Bipolar Disorder   Le-Niculescu et al.             AK3L1      11              No interaction data
Cardiomyopathy     Dhandapany et al.               MYBPC3     1               2
Celiac Disease     Hunt et al.                     SH2B3      1               8
Celiac Disease     Hunt et al.                     CCR3       2               3
Celiac Disease     Hunt et al.                     IL18R1     3               29
Celiac Disease     Hunt et al.                     RGS1       9               26
Celiac Disease     Hunt et al.                     TAGAP      14              No interaction data
Celiac Disease     Hunt et al.                     IL12A      14              10
Crohn’s Disease    Fisher et al.                   MST1       1               27
Crohn’s Disease    Fisher et al.                   NKX2-3     1               27
Crohn’s Disease    Fisher et al.                   IRGM       2               No interaction data
Crohn’s Disease    Villani et al.                  NLRP3      5               1
Crohn’s Disease    Fisher et al.                   IL12B      7               1
Crohn’s Disease    Barrett et al.; Franke et al.   STAT3      11              1
Crohn’s Disease    Franke et al.                   PTPN2      30              6
Obesity            Renstrom et al.                 MC4R       1               1
Mean                                                          6.8             11.75
Results of the genetic disease prioritizations using ToppGene and ToppNet
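The reported mean ranks can be checked with a short calculation. The rank values below are transcribed from the results table; treating "No interaction data" entries as missing (`None`) and leaving them out of the ToppNet mean is an assumption, though it is the reading under which the reported mean is reproduced.

```python
from statistics import mean

# (gene, ToppGene rank, ToppNet rank); None stands for "No interaction data".
RANKS = [
    ("KLF12", 2, 15), ("RORB", 4, 18), ("RORA", 7, 13),
    ("ALDH1A1", 10, None), ("AK3L1", 11, None), ("MYBPC3", 1, 2),
    ("SH2B3", 1, 8), ("CCR3", 2, 3), ("IL18R1", 3, 29),
    ("RGS1", 9, 26), ("TAGAP", 14, None), ("IL12A", 14, 10),
    ("MST1", 1, 27), ("NKX2-3", 1, 27), ("IRGM", 2, None),
    ("NLRP3", 5, 1), ("IL12B", 7, 1), ("STAT3", 11, 1),
    ("PTPN2", 30, 6), ("MC4R", 1, 1),
]


def mean_rank(rows, column):
    """Mean of one rank column, skipping genes without a rank."""
    values = [row[column] for row in rows if row[column] is not None]
    return mean(values)


toppgene_mean = mean_rank(RANKS, 1)  # averaged over all 20 genes
toppnet_mean = mean_rank(RANKS, 2)   # averaged over the 16 genes with interaction data
```

Under that assumption the two values come out to 6.8 and 11.75, matching the "Mean" row, so the ToppNet mean is based on fewer genes than the ToppGene mean.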
Training sets: Compiled using “phenotype/disease” annotations in NCBI’s Entrez Gene records and OMIM
Test set genes: Artificial linkage interval - Candidate gene + 99 nearest neighboring genes based on their genomic distance on the same chromosome.
The gene-disease associations were from recently reported GWAS and include novel disease gene associations.
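The artificial linkage interval described above (a candidate gene plus its nearest neighbors on the same chromosome) can be sketched as follows. The coordinates are hypothetical, each gene is reduced to a single position, and the example uses 2 neighbors instead of 99 to stay small.

```python
# Hypothetical gene coordinates: symbol -> (chromosome, position in base pairs).
POSITIONS = {
    "CAND": ("1", 1000),
    "N1": ("1", 1100),
    "N2": ("1", 700),
    "N3": ("1", 5000),
    "OTHER": ("2", 1001),
}


def artificial_interval(candidate, positions, n_neighbors=99):
    """Return the candidate followed by its nearest same-chromosome neighbors."""
    chrom, pos = positions[candidate]
    by_distance = sorted(
        (abs(other_pos - pos), gene)
        for gene, (other_chrom, other_pos) in positions.items()
        if other_chrom == chrom and gene != candidate
    )
    return [candidate] + [gene for _, gene in by_distance[:n_neighbors]]


test_set = artificial_interval("CAND", POSITIONS, n_neighbors=2)
```

Genes on other chromosomes are never included, however close their coordinate values are. The resulting list is the test set in which the known candidate should rank near the top.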
ToppGene Suite (http://toppgene.cchmc.org)
Why is a test set gene ranked higher?
Part 3: Drug Repositioning
What is Drug Repositioning
1. Drug development: It takes about 15 years and $800 million to bring a drug to market!
2. The number of new drugs approved by the FDA each year remains at just 20–30 compounds. At this rate it will take more than 300 years for the number of approved drugs to double!
3. Instead, start from existing drugs (already on the market) or failed drugs (late-stage failures discontinued in development), and test them to uncover new applications.
4. Bypass the early stages of drug development required to assess toxicity and enter clinical trials comparatively quickly.
Discovery of novel disease indications for existing drugs
“The most fruitful basis for the discovery of a new drug is to start with an old drug” - Sir James Black, Nobel Laureate, Physiology and Medicine, 1988
1. Because existing drugs have known pharmacokinetics and safety profiles, and are often approved by regulatory agencies for human use, any newly identified use can be rapidly evaluated in phase II clinical trials, which last ~two years and cost much less (~$17 million).
2. In 2008, of the 31 new medicines that reached their first markets, drug repositioning accounted for one-third.
3. Since this strategy is economically more attractive than de novo drug discovery and development, pharmaceutical and biotech companies have directed their efforts towards it.
Viagra
Rogaine
Topiramate: From epilepsy to obesity
Integrative Functional Genomics Approaches
PRADAR (Pharmacoinformatics Radar): Pattern Recognition Algorithms for Drug Analysis and Repositioning
Adverse Drug Reactions – Mouse Phenotype: New Indications?
From serendipity to “systematic serendipity”
PubMed
Medical Informatics
Patient Records
Disease Database → Name → Synonyms → Related/Similar Diseases → Subtypes → Etiology → Predisposing Causes → Pathogenesis → Molecular Basis → Population Genetics → Clinical findings → System(s) involved → Lesions → Diagnosis → Prognosis → Treatment → Clinical Trials……
Clinical Trials
Bioinformatics
Genome
Transcriptome
Proteome
Interactome
Metabolome
Physiome
Regulome
Variome
Pathome
Pharmacogenome
Disease
World
OMIM
► Personalized Medicine
► Decision Support System
► Outcome Predictor
► Course Predictor
► Diagnostic Test Selector
► Clinical Trials Design
► Better therapeutics
► Hypothesis Generator…..
Integrative Genomics - Biomedical Informatics
the Ultimate Goal…….