The STRING database: Quality scores for heterogeneous interaction data
Lyon, France, April 23-25, 2007
Lars Juhl Jensen
EMBL Heidelberg
data integration
Jensen et al., Drug Discovery Today: Targets, 2004
functional interactions
Genomic neighborhood
Species co-occurrence
Gene fusions
Database imports
Experimental interaction data
Microarray expression data
Literature mining
373 proteomes
model organism databases
Ensembl
Genome Reviews
RefSeq
genomic context methods
gene fusion
gene neighborhood
phylogenetic profiles
scoring schemes
benchmarking
cross-species transfer
primary experimental data
many sources
different formats
different gene identifiers
redundancy
physical protein interactions
IntAct
BIND (Biomolecular Interaction Network Database)
MINT (Molecular Interactions Database)
DIP (Database of Interacting Proteins)
GRID (General Repository for Interaction Datasets)
HPRD (Human Protein Reference Database)
PSI-MI
reference proteomes
merge data by publication
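A minimal sketch of what merging by publication could look like, assuming every record has already been mapped to a common reference proteome. The record layout (source database, publication ID, two protein IDs) and the function name are hypothetical, not STRING's actual import format.

```python
from collections import defaultdict

def merge_by_publication(records):
    """Group interaction records by publication and unordered protein pair.

    Each record is a hypothetical tuple:
    (source_db, publication_id, protein_a, protein_b).
    The same interaction imported from several databases collapses into
    one entry that remembers which databases reported it.
    """
    merged = defaultdict(lambda: defaultdict(set))
    for source_db, publication_id, protein_a, protein_b in records:
        pair = tuple(sorted((protein_a, protein_b)))  # A-B equals B-A
        merged[publication_id][pair].add(source_db)
    return {pub: dict(pairs) for pub, pairs in merged.items()}

# Toy records with made-up identifiers.
records = [
    ("IntAct", "pub:1", "P1", "P2"),
    ("DIP", "pub:1", "P2", "P1"),   # same interaction, other database
    ("MINT", "pub:2", "P1", "P3"),
]
merged = merge_by_publication(records)
```

Keying on the publication keeps one copy of each reported interaction no matter how many databases carry it.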
thousands of interactions
correct interactions
wrong interactions
scoring scheme
complex pull-down
von Mering et al., Nucleic Acids Research, 2005
log[(N12·N)/((N1+1)·(N2+1))]
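The pull-down formula above can be sketched in a few lines. The slide does not define the symbols; this sketch assumes N is the total number of purifications, N1 and N2 the numbers of purifications containing each protein, and N12 the number containing both. The function name and toy data are illustrative only.

```python
import math
from collections import Counter
from itertools import combinations

def pulldown_scores(purifications):
    """Raw score log[(N12*N)/((N1+1)*(N2+1))] for each co-purified pair.

    Assumed symbol meanings (not stated on the slide):
    N   = total number of purifications
    N1  = purifications containing protein 1
    N2  = purifications containing protein 2
    N12 = purifications containing both proteins
    """
    n_total = len(purifications)
    single = Counter()
    pair = Counter()
    for proteins in purifications:
        members = sorted(set(proteins))        # ignore duplicates in one run
        single.update(members)
        pair.update(combinations(members, 2))  # every unordered pair once
    return {
        (a, b): math.log((n12 * n_total) / ((single[a] + 1) * (single[b] + 1)))
        for (a, b), n12 in pair.items()
    }

# Toy data: each inner list is one hypothetical purification.
toy_purifications = [["A", "B"], ["A", "B", "C"], ["C", "D"], ["A", "B"]]
pulldown = pulldown_scores(toy_purifications)
```

In the toy data, A and B co-purify three times and score higher than A and C, which co-purify once.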
yeast two-hybrid
non-shared interactors
-log((N1+1)·(N2+1))
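A sketch of the yeast two-hybrid score, reading N1 and N2 as the numbers of non-shared interactors of the two proteins, as the slide suggests. Whether the tested partner itself is excluded from the count is an assumption made here; the function name and toy data are hypothetical.

```python
import math

def y2h_score(interactions, protein_a, protein_b):
    """Raw score -log((N1+1)*(N2+1)) for one reported two-hybrid pair.

    N1 and N2 are taken to be the partners of each protein that the
    other protein does not share (assumption: the pair itself is not
    counted as a partner).
    """
    partners = {}
    for x, y in interactions:
        partners.setdefault(x, set()).add(y)
        partners.setdefault(y, set()).add(x)
    of_a = partners.get(protein_a, set())
    of_b = partners.get(protein_b, set())
    shared = of_a & of_b
    n1 = len(of_a - shared - {protein_b})
    n2 = len(of_b - shared - {protein_a})
    return -math.log((n1 + 1) * (n2 + 1))

# Toy network with made-up protein names.
toy_y2h = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
           ("B", "C"), ("F", "G")]
promiscuous = y2h_score(toy_y2h, "A", "B")  # A has extra partners D and E
isolated = y2h_score(toy_y2h, "F", "G")     # no other partners at all
```

The score penalizes proteins with many unrelated partners, so the isolated pair F-G scores higher than A-B.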
not directly comparable
calibrate vs. gold standard
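One simple way to calibrate raw scores against a gold standard is to rank the scored pairs, cut the ranking into bins, and record the fraction of gold-standard pairs in each bin. This is a sketch of that idea under stated assumptions, not necessarily the procedure used for STRING; the function name, bin size, and toy data are hypothetical.

```python
def calibrate(scored_pairs, gold_positive, bin_size):
    """Return one (mean raw score, fraction in gold standard) point per bin.

    scored_pairs  : list of ((protein_a, protein_b), raw_score)
    gold_positive : set of pairs trusted to be true (assumed to be given)
    Simplification: every pair not in gold_positive counts as wrong.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    curve = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        mean_score = sum(score for _, score in chunk) / len(chunk)
        hits = sum(1 for pair, _ in chunk if pair in gold_positive)
        curve.append((mean_score, hits / len(chunk)))
    return curve

# Toy data with made-up pairs and scores.
scored = [(("A", "B"), 2.0), (("A", "C"), 1.0),
          (("B", "D"), -1.0), (("C", "D"), -3.0)]
gold = {("A", "B"), ("A", "C")}
curve = calibrate(scored, gold, bin_size=2)
```

A curve fitted through these points turns a raw score from any method into a probability, which is what makes scores from different methods comparable.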
other types of evidence
co-expression
GEO (Gene Expression Omnibus)
species-specific datasets
correlation coefficient
calibrate vs. gold standard
directly comparable
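As an illustration of the co-expression step, the sketch below computes a correlation coefficient between two expression profiles. The slide does not say which coefficient is used; Pearson correlation is assumed here, and the profiles are made-up numbers.

```python
import math

def pearson(profile_x, profile_y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(profile_x)
    mean_x = sum(profile_x) / n
    mean_y = sum(profile_y) / n
    covariance = sum((x - mean_x) * (y - mean_y)
                     for x, y in zip(profile_x, profile_y))
    var_x = sum((x - mean_x) ** 2 for x in profile_x)
    var_y = sum((y - mean_y) ** 2 for y in profile_y)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical expression values of three genes across four arrays.
gene_1 = [1.0, 2.0, 3.0, 4.0]
gene_2 = [2.0, 4.0, 6.0, 8.0]   # rises with gene_1
gene_3 = [4.0, 3.0, 2.0, 1.0]   # falls as gene_1 rises
r_same = pearson(gene_1, gene_2)
r_opposite = pearson(gene_1, gene_3)
```

The coefficient is only a raw score; like the other raw scores, it still has to be calibrated against the gold standard before it can be compared with other evidence.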
curated knowledge
many sources
different formats
different gene identifiers
redundancy
protein complexes
MIPS (Munich Information Center for Protein Sequences)
Gene Ontology
pathway databases
KEGG (Kyoto Encyclopedia of Genes and Genomes)
Reactome
PID (NCI-Nature Pathway Interaction Database)
STKE (Signal Transduction Knowledge Environment)
BioPAX
reference proteomes
literature mining
MEDLINE
SGD (Saccharomyces Genome Database)
The Interactive Fly
OMIM (Online Mendelian Inheritance in Man)
different gene identifiers
synonyms lists
black list
flexible matching
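The three ingredients above (synonyms lists, a black list, flexible matching) can be sketched as a small dictionary tagger. Everything here is illustrative: the identifiers are placeholders, the normalization rule (ignore case, hyphens, and spaces) is an assumption about what "flexible" means, and only single-token names are handled.

```python
import re

def normalize(name):
    """Flexible matching: ignore case, hyphens, underscores, and spaces."""
    return re.sub(r"[\s\-_]", "", name).lower()

def build_index(synonyms, black_list):
    """Map each normalized synonym to its identifier, skipping blocked names."""
    blocked = {normalize(name) for name in black_list}
    index = {}
    for identifier, names in synonyms.items():
        for name in names:
            key = normalize(name)
            if key not in blocked:
                index[key] = identifier
    return index

def tag(text, index):
    """Return (matched token, identifier) for every recognized name in text."""
    hits = []
    for token in re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", text):
        identifier = index.get(normalize(token))
        if identifier is not None:
            hits.append((token, identifier))
    return hits

# Placeholder identifiers; "AND" stands in for an ambiguous gene name.
synonyms = {"ID_1": ["CYC1"], "ID_2": ["HAP1"], "ID_3": ["AND"]}
index = build_index(synonyms, black_list=["and"])
hits = tag("Cyc-1 and HAP1 were studied.", index)
```

The black list keeps names that double as common words from flooding the results, while normalization lets spelling variants such as "Cyc-1" resolve to the same identifier as "CYC1".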
co-occurrence
log[(N12·N)/((N1+1)·(N2+1))]
NLP (Natural Language Processing)
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxgene The GAL4 gene]
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
calibrate vs. gold standard
directly comparable
combine all evidence
spread over many species
transfer by orthology
von Mering et al., Nucleic Acids Research, 2005
two modes
orthologous groups
von Mering et al., Nucleic Acids Research, 2005
fuzzy orthology
von Mering et al., Nucleic Acids Research, 2005
add probabilistic scores
P = 1-(1-P1)·(1-P2)·(1-P3)·…
Genomic neighborhood
Species co-occurrence
Gene fusions
Database imports
Experimental interaction data
Microarray expression data
Literature mining
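The combination rule P = 1-(1-P1)·(1-P2)·(1-P3)·… treats the evidence channels as independent: the combined probability is one minus the probability that every channel is wrong. A minimal sketch, with a hypothetical function name and made-up channel scores:

```python
def combine_scores(probabilities):
    """Combine per-channel probabilities as P = 1 - product of (1 - Pi)."""
    all_wrong = 1.0
    for p in probabilities:
        all_wrong *= (1.0 - p)   # probability that this channel is wrong too
    return 1.0 - all_wrong

# Made-up scores for one protein pair from three evidence channels,
# e.g. genomic neighborhood, experiments, and literature mining.
combined = combine_scores([0.4, 0.5, 0.6])   # about 0.88
```

The combined score is never lower than the best single channel, and several moderate scores can add up to a confident one.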
Acknowledgments
The STRING team
– Christian von Mering
– Michael Kuhn
– Berend Snel
– Martijn Huynen
– Sean Hooper
– Samuel Chaffron
– Julien Lagarde
– Mathilde Foglierini
– Peer Bork
Literature mining project
– Jasmin Saric
– Rossitza Ouzounova
– Isabel Rojas