Prioritization of targets for Structural Genomics
Peer Bork
EMBL & MDC
Heidelberg & Berlin
[email protected]
http://www.bork.embl-heidelberg.de/
www.bork.embl-heidelberg.de
Prioritising targets for Structural Genomics
Homology-based coverage
Complexes and functional modules
Candidates for complex diseases
Associating genes to diseases
[Figure: coverage plotted against time; annotation: Intellectual challenge]
Xray selection protocol (Oct 1999)

Filters                                                        Proteins
Human, <500aa, annotated in sequence databases:                32349
Filter for 98% redundancy, splice forms, fragments:            20724
Match to clones available at German resource center:           6016
EST match protein in N-terminal region:                        4755
Proteins have no homologue with known 3D (fast check):         1827
No transmembrane region or other composition bias:             1102
Proteins have no homologue with known 3D (sensitive check):    602

Functional features of the 602: known 347, unknown 255
Medical relevance likely: 71
Distinct expression protocol
Distinct NMR protocol
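The selection protocol is a chain of filters applied in order, with the number of surviving proteins recorded after each step. A minimal sketch of that bookkeeping follows. The `Protein` fields and the three predicates are hypothetical stand-ins for the real checks and are not the pipeline behind the slide.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Protein:
    # Hypothetical record; the field names are illustrative assumptions.
    name: str
    length: int
    has_clone: bool
    has_tm_region: bool


Filter = tuple[str, Callable[[Protein], bool]]


def run_funnel(proteins: Iterable[Protein], filters: list[Filter]) -> list[tuple[str, int]]:
    """Apply the filters in order and report how many proteins survive each one."""
    survivors = list(proteins)
    report = []
    for label, keep in filters:
        survivors = [p for p in survivors if keep(p)]
        report.append((label, len(survivors)))
    return report


# Assumed example filters, loosely modelled on the criteria named on the slide.
FILTERS: list[Filter] = [
    ("shorter than 500 aa", lambda p: p.length < 500),
    ("clone available", lambda p: p.has_clone),
    ("no transmembrane region", lambda p: not p.has_tm_region),
]

toy = [
    Protein("a", 120, True, False),
    Protein("b", 800, True, False),
    Protein("c", 300, False, False),
    Protein("d", 450, True, True),
]

for label, count in run_funnel(toy, FILTERS):
    print(f"{count:>3}  {label}")
```

Keeping each filter as a labelled predicate makes it easy to reorder steps or to swap a fast check for a sensitive one later in the chain.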
Criteria for target selection from sequence
- No similar sequence with known fold
- Everything that crystallizes in a given species
- Everything from certain pathways/complexes/compartments
- Everything with certain properties (e.g. thermophilic, kinase-function)
- 'All' disease genes
- Everything else left over ….
Structural Biology and Bioinformatics
Target prediction for structural genomics
Zooming out: Protein interactions
Zooming in: SNPs and 3D structures

Zooming out: Protein interactions
Rich Copley
Berend Snel
+Martijn Huynen
Interaction prediction
Function prediction via genomic context information
Gene context:
- Gene fusion as distinct neighborhood subset
- Conserved gene neighborhood in genomes
- Conserved co-occurrence of genes in species ('phylogenetic profile', 'COG pattern')
- Surrounding and shared regulatory elements
Knowledge-based context:
- Pathway data (can overrule homology!)
- Gene expression data (co-expression etc.)
- Protein interaction / localisation
- Scientific literature
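Conserved co-occurrence ('phylogenetic profile') can be illustrated by comparing the sets of species in which two genes are present. The sketch below uses Jaccard similarity and a fixed threshold; both are assumptions chosen for illustration, since the slide does not name a similarity measure, and the presence data are invented.

```python
def profile_similarity(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two presence sets (species in which each gene occurs)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical presence data: gene -> genomes that contain an orthologue.
presence = {
    "geneA": {"sp1", "sp2", "sp4"},
    "geneB": {"sp1", "sp2", "sp4"},
    "geneC": {"sp3"},
}


def co_occurring_pairs(presence: dict[str, set[str]], threshold: float = 0.8):
    """Return gene pairs whose profiles are at least `threshold` similar."""
    genes = sorted(presence)
    return [
        (g, h)
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
        if profile_similarity(presence[g], presence[h]) >= threshold
    ]


print(co_occurring_pairs(presence))
```

Genes that are gained and lost together across genomes end up with near-identical profiles, which is the signal this context method exploits.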
Context methods in Mycoplasma: Fusion, neighborhood, co-occurrence
MG total: 480 genes
Presence in conserved operons: 213
[Venn diagram: Conserved neighborhood, Fusion, Co-occurrence in genomes; counts shown: 27, 54, 178]
STRING server for context retrieval
Tryptophan biosynthesis
www.bork.embl-heidelberg.de/STRING
Gene neighborhood reflects connections between Tryptophan and Shikimate biosynthesis
[Network figure; nodes: hemK, tyrA, aroB, aroE, aroC, asd, truA, hyp, hyp, 2c-rr, trpF, trpC, trpA, trpB, trpD, trpG, trpE]
Modularity in “genomic association space”
Tryptophan synthesis pathway
Shikimate pathway
Networks based on conserved gene neighborhood reveal ‘natural’ subsystems
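One simple way to read 'subsystems' off such a network is to take its connected components. The sketch below does that for a list of pairwise neighborhood links. Treating a subsystem as a connected component is an assumption made here for illustration, and the links are invented rather than taken from STRING or from the figure.

```python
from collections import defaultdict


def modules(links):
    """Group genes into the connected components of an undirected link graph."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


# Illustrative links only; not real neighborhood data.
toy_links = [("trpA", "trpB"), ("trpB", "trpC"), ("aroB", "aroE")]
print(modules(toy_links))
```

Adding a single link between a trp gene and an aro gene would merge the two groups, which mirrors how a shared neighborhood can connect two pathways.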
Applications of interaction predictions
3885 interactions (involving 1995 genes) predicted based on genomic context, 27% overlap, complementary
Goal: Functional characterization of all multiprotein assemblies as fast as possible (2001), move to human
The Cellzome* yeast factory
Methods: TAP tagging/co-purification + mass-spec
Results based on ca 1400 human orthologues: 1700 genes in 230 complexes, ca 130 of them novel
*proteomics company founded at EMBL in June 2000, currently >90 employees
Data provided by
Predicting candidate genes for genetically inherited diseases
Association of genes to diseases
Analysis of non-synonymous SNPs
Shamil Sunyaev
[Chart: Growth of known 3D-structures; x-axis: Years (1995 to 2000); y-axis: Number of PDB entries (0 to 15000)]
[Chart: Growth of the number of complete genomes; x-axis: Years (1995 to 2000); y-axis: Number of genomes (0 to 40)]
[Chart: Growth of SNP data; x-axis: Quarters of a year (1998 (3) to 2000 (4)); y-axis: Number of SNP submissions (0 to 2.5E+06)]
SNP data currently have the fastest growth rate
Integration with other data is the key to more understanding
SNPs and mutations
90% of human genetic variation due to single nucleotide polymorphism (SNP)
• mapping tool
• association with complex phenotypes (multifactorial diseases / drug responses etc.)
• human evolution
cSNP - SNP in coding region
nonsynonymous SNP - affects amino acid sequence
SNP - allele frequency >1%
Disease mutation - usually allele frequency <<1%
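The definitions above translate directly into two small checks: whether a single-base change in a codon alters the encoded amino acid, and whether a variant is frequent enough to count as a SNP. The sketch below uses the standard genetic code. The function names are illustrative, and the 1% cutoff is the one stated on the slide.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_csnp(codon: str, position: int, new_base: str) -> str:
    """Return 'synonymous' or 'nonsynonymous' for a single-base change.

    `position` is 0-based within the codon. A change to or from a stop
    codon is reported as nonsynonymous.
    """
    codon = codon.upper()
    mutated = codon[:position] + new_base.upper() + codon[position + 1:]
    same = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "synonymous" if same else "nonsynonymous"


def is_common_variant(allele_frequency: float, cutoff: float = 0.01) -> bool:
    """Slide convention: a SNP has allele frequency above 1%."""
    return allele_frequency > cutoff


print(classify_csnp("GAA", 1, "T"))  # GAA (Glu) -> GTA (Val)
print(classify_csnp("GAA", 2, "G"))  # GAA (Glu) -> GAG (Glu)
```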
ESTs reveal SNPs and alternative splice sites...
...but also lots of errors!!! (>700 libraries!) (many different tissues and age groups!)
[Figure: EST1 to EST6 aligned to an mRNA (5' UTR, coding, 3' UTR) with differing bases marked; labels: SNP prediction, Prediction of alternative splicing]
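Because EST data contain many sequencing errors, a candidate SNP is more credible when each allele is seen in more than one EST. The sketch below applies that idea to pre-aligned EST strings. The minimum-support rule and its default of 2 are assumptions for illustration, not the criterion used in the talk, and the sequences are invented.

```python
from collections import Counter


def candidate_snps(aligned_ests, min_support=2):
    """Return {column: allele counts} for columns where at least two alleles
    are each seen in `min_support` or more ESTs. '-' marks no coverage."""
    length = max(len(e) for e in aligned_ests)
    hits = {}
    for col in range(length):
        counts = Counter(
            e[col] for e in aligned_ests if col < len(e) and e[col] != "-"
        )
        # Alleles seen too rarely are treated as likely sequencing errors.
        supported = {base: n for base, n in counts.items() if n >= min_support}
        if len(supported) >= 2:
            hits[col] = supported
    return hits


toy_ests = [
    "ACGTAC",
    "ACGTAC",
    "ACATAC",
    "ACATAC",
    "ACGTTC",  # lone mismatch at column 4: not enough support
]
print(candidate_snps(toy_ests))
```

Raising `min_support` trades sensitivity for fewer false positives, which matters when ESTs come from hundreds of libraries of uneven quality.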
Mapping SNPs onto 3D: Identifying those that damage proteins
Rules taken from protein engineering and multiple sequence analysis
Selected polymorphic sites mapped onto 3D
[Figure legend: Minor allele frequency: High (≥5%), Low (<5%)]
Selection of Mutations for 3D mapping
Data sources: SWISSPROT, OMIM, HGBASE, Chakravarti WEB, HSSP
Filter:
- Keywords: '3D STRUCTURE' and 'DISEASE MUTATION'
- Keywords: '3D STRUCTURE' and 'POLYMORPHISM' but not 'DISEASE MUTATION'
- Allelic variants with frequency >1% in a pool of 'normal' individuals
- Blastx search against PDB
- Check all proteins identified in the resulting sets above for close homologues in other species (>90% identity) and take mutations
Resulting 3 sets:
1. 551 disease mutations (baddies)
2. 86 allelic variants ('don't know')
3. 225 and 261 neutral mutations between species (goodies) in proteins of set 1 and 2, respectively
How many sites are in structurally and functionally “important” regions?
Disease mutation sites (baddies) 90%
Polymorphic sites (don’t know) 29%
Interspecies mutations (goodies) 8%
Hence: Predicting phenotypic effects of cSNPs!
‘important’ = surface accessibility <10%, active site, S-S bond
Sunyaev/Ramensky/Bork, Trends Genet. 16(00)191
Sunyaev/Ramensky/Koch/Lathe/Bork, unpubl.
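The slide's definition of an 'important' site is a simple disjunction of three structural criteria. A minimal sketch of that rule follows. The `Site` record and its field names are hypothetical, and the per-residue annotations would have to come from a structure analysis that is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Site:
    # Hypothetical per-residue annotation; field names are assumptions.
    relative_accessibility: float  # percent of surface exposed, 0 to 100
    in_active_site: bool = False
    in_disulfide_bond: bool = False


def is_important(site: Site, buried_cutoff: float = 10.0) -> bool:
    """True if the site is buried (<10% accessible), in an active site,
    or part of an S-S bond, following the definition on the slide."""
    return (
        site.relative_accessibility < buried_cutoff
        or site.in_active_site
        or site.in_disulfide_bond
    )


print(is_important(Site(relative_accessibility=4.0)))
print(is_important(Site(relative_accessibility=55.0)))
print(is_important(Site(relative_accessibility=55.0, in_disulfide_bond=True)))
```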
Prediction of risk factors
Of 36 SNPs with predicted phenotypic effects (from a well-characterized SNP pool), 5 are already known to be disease-associated:

Gene                            Disease risk                                    Frequency   Mutation effect
HFE                             hemochromatosis                                 6%          destroyed SS-bond
Fructose-bisphosphate aldolase  fructose intolerance                            >1%         destroyed core
NAD(P)H dehydrogenase           benzene toxicity (post-chemotherapy leukemia)   4-20%       unfavorable substitution
α-1-antichymotrypsin            familial obstructive lung disease               >1%         destroyed core
α-1-antitrypsin                 emphysema                                       2-4%        destroyed core
Structural Biology and Bioinformatics
Target prediction for structural genomics
Zooming out: Protein interactions
Zooming in: SNPs and 3D structures
Credits g2D
Carolina Perez, Miguel Andrade
Literature mining for associating genotypes to phenotypes
[Diagram: MEDLINE (10 725 796 articles) and RefSeq (10 329 sequences) connect MeSH C (phenotype), MeSH D (chemistry) and Gene Ontology (gene biochemistry); counts shown: 6 023 924 pairs, 98 969 pairs; 6 992 terms, 5 070 terms, 2 379 terms]
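The term pairs counted in this approach come from annotations that occur together on the same record. The sketch below shows one way such pairs could be tallied from articles annotated with two term vocabularies. It is an assumed illustration of co-annotation counting, not the scoring used by the authors, and the term names are placeholders.

```python
from collections import Counter
from itertools import product


def count_pairs(articles):
    """Count how many articles carry each (term_c, term_d) annotation pair.

    `articles` is an iterable of (terms_c, terms_d) tuples, one per article.
    """
    pairs = Counter()
    for terms_c, terms_d in articles:
        # Deduplicate within an article so each pair counts once per article.
        pairs.update(product(sorted(set(terms_c)), sorted(set(terms_d))))
    return pairs


# Placeholder annotations; not real MEDLINE records.
toy_articles = [
    (["phenotype_1"], ["chemical_1"]),
    (["phenotype_1", "phenotype_2"], ["chemical_1"]),
]
for pair, n in sorted(count_pairs(toy_articles).items()):
    print(pair, n)
```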
Phenotype C MeSH
Acidosis, Renal Tubular
Acidosis
Hypokalemia
Nephrocalcinosis
Sjogren’s Syndrome
Alkalosis
Kidney Diseases
Kidney Failure, Chronic
Nephritis, Interstitial
Fanconi Syndrome
…
GO Gene Ontology
Carbonate dehydratase
Hydrogen-transporting ATP synthase
Hydrogen/potassium-exchanging ATPase
Hydrogen-transporting two-sector ATPase
Proton transport
Vacuolar hydrogen-transporting ATPase (synonym: VATPase)
Pyruvate carboxylase
Aminobutyrate catabolism
Succinate-semialdehyde dehydrogenase
…
[Diagram labels: MeSH D; MEDLINE; RefSeq; LocusLink; Golden Path; steps I and II; chromosomal band 7q33-q34; etc...]
Association to Craniofrontonasal dysplasia
MeSH C (symptoms, manifestations):
• Craniosynostoses [15]
• Craniofacial Dysostosis [7]
• Mental Retardation [4]
• Hypertelorism [6]
• Bone Diseases, Developmental [3]
MeSH D (chemicals, proteins, drugs):
• Receptors, Fibroblast Growth Factor
• Fibroblast Growth Factor
• DNA probes
• Chondroitin
• Keratan sulfate
• Collagen
• Bone morphogenetic proteins
GO (functions):
• 0.0130 fibroblast growth factor receptor (function)
• 0.0241 FGF receptor signaling pathway (process)
• 0.0061 MAPKKK cascade (process)
• 0.0001 integral plasma membrane protein (component)
• 0.0011 skeletal development (process)
• 0.0000 signal transduction (process)
[Figure: weighted links connect MeSH C terms to MeSH D terms and MeSH D terms to GO terms; link weights shown range from 0.0010 to 0.4615]
0.0123 NP_002002 fibroblast growth factor receptor 4, isoform 1 precursor - Human
    0.0130 fibroblast growth factor receptor (function)
    0.0241 FGF receptor signaling pathway (process)
    0.0000 integral plasma membrane protein (component)
0.0083 NP_006644 suc1-associated neurotrophic factor target 2 - Human
    0.0000 signal transduction (process)
    0.0241 FGF receptor signaling pathway (process)
    0.0009 peripheral plasma membrane protein (component)
0.0075 NP_000595 fibroblast growth factor receptor 1, isoform 1 precursor - Human
    0.0130 fibroblast growth factor receptor (function)
    0.0061 MAPKKK cascade (process)
    0.0011 skeletal development (process)
    0.0007 oncogenesis (process)
    0.0241 FGF receptor signaling pathway (process)
    0.0000 integral plasma membrane protein (component)
0.0026 NP_034336 fibroblast growth factor receptor 1 - Mouse
    0.0000 ATP binding (function)
    0.0000 membrane fraction (component)
    0.0000 signal transduction (process)
    0.0000 protein tyrosine kinase (function)
    0.0130 fibroblast growth factor receptor (function)
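The four sequence scores listed here agree, to within the four displayed decimals, with the arithmetic mean of each sequence's GO-term scores. That reading is an inference from the numbers on the slide rather than a stated formula. The sketch below checks it, with the values copied from the entries above.

```python
from statistics import mean

# GO-term scores per RefSeq entry, as listed on the slide.
go_scores = {
    "NP_002002": [0.0130, 0.0241, 0.0000],
    "NP_006644": [0.0000, 0.0241, 0.0009],
    "NP_000595": [0.0130, 0.0061, 0.0011, 0.0007, 0.0241, 0.0000],
    "NP_034336": [0.0000, 0.0000, 0.0000, 0.0000, 0.0130],
}

# Sequence scores as listed on the slide.
listed = {
    "NP_002002": 0.0123,
    "NP_006644": 0.0083,
    "NP_000595": 0.0075,
    "NP_034336": 0.0026,
}


def sequence_score(scores):
    """Assumed scoring rule: average of the GO-term scores of a sequence."""
    return mean(scores)


for accession, scores in go_scores.items():
    computed = sequence_score(scores)
    print(accession, f"computed={computed:.5f}", f"listed={listed[accession]:.4f}")
```

Under this reading, a sequence with one strongly associated GO term and many unrelated ones is ranked below a sequence whose few annotations are all relevant, as the mouse entry shows.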
band Xp22
chromosome X
RefSeq
...
From homology to disease association
Benchmark of 100 disease genes
Score correlates with prediction accuracy
[Chart: GO-scores; x-axis: Log R-score (1.E-04 to 1.E+00); y-axis: Rank of true gene (0 to 80); series: bench 10, bench 100, not annotated]