Biological databases
Challenges in organization and usability
Lars Juhl Jensen
Ph.D.
postdoc
staff scientist
group leader
cofounder
challenges
buzzword du jour
big data
semantic web
cognitive computing
Underpants Gnomes
elephant in the room
heterogeneous data
many databases
different formats
different identifiers
variable quality
difficult to interpret
organization
identifier mapping
pick a reference
map all else to that
hard work
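A minimal sketch of "pick a reference, map all else to that", assuming a hypothetical alias table. The reference identifiers (`REF0001`, ...), the namespaces, and the function names are illustrative assumptions, not taken from any real resource.

```python
# Hypothetical alias table: (namespace, foreign identifier) -> reference identifier.
# The reference identifiers and the "other_db" accession are made up.
ALIASES = {
    ("gene_name", "CDK1"): "REF0001",
    ("gene_name", "CDC2"): "REF0001",   # older synonym for the same protein
    ("other_db", "X12345"): "REF0001",
    ("gene_name", "TP53"): "REF0002",
}


def to_reference(namespace, identifier):
    """Return the reference identifier, or None if the identifier is unknown."""
    return ALIASES.get((namespace, identifier.upper()))


def map_records(records):
    """Re-key (namespace, identifier, payload) records to reference identifiers.

    Returns (mapped, unmapped) so that unmappable records are reported
    instead of being silently dropped.
    """
    mapped, unmapped = [], []
    for namespace, identifier, payload in records:
        ref = to_reference(namespace, identifier)
        if ref is None:
            unmapped.append((namespace, identifier, payload))
        else:
            mapped.append((ref, payload))
    return mapped, unmapped
```

Keeping the unmapped records visible is a deliberate choice here: the size of that list is a quick indicator of how much of a source is lost in mapping.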
database import
automatic updating
separate parsers
error checking
formats change
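One way a per-source parser with error checking might look, assuming a hypothetical three-column tab-separated file (identifier, identifier, score). The layout and the function name `parse_tsv` are assumptions for illustration only.

```python
def parse_tsv(lines, expected_columns=3):
    """Parse tab-separated 'id_a, id_b, score' lines.

    Returns (rows, errors). Malformed lines are collected with their line
    number rather than ignored, so a format change upstream shows up as a
    burst of errors during an automatic update.
    """
    rows, errors = [], []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != expected_columns:
            errors.append((number, "expected %d columns, got %d"
                           % (expected_columns, len(fields))))
            continue
        try:
            score = float(fields[2])
        except ValueError:
            errors.append((number, "score is not a number: %r" % fields[2]))
            continue
        rows.append((fields[0], fields[1], score))
    return rows, errors
```

An automatic update could then refuse to replace the previous import whenever the error list exceeds some agreed fraction of the input.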
unstructured data
text mining
dictionary-based methods
co-occurrence statistics
steep learning curve
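A toy sketch of dictionary-based tagging followed by co-occurrence counting. The dictionary, the entity identifiers (`P1`, ...), and the function names are hypothetical; real dictionaries are far larger and need handling of ambiguous names, which is omitted here.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical dictionary: lower-cased name -> entity identifier.
DICTIONARY = {"cdk1": "P1", "cdc2": "P1", "cyclin b": "P2", "p53": "P3"}


def tag(text):
    """Return the set of entity identifiers whose names occur in the text."""
    lowered = text.lower()
    found = set()
    for name, entity in DICTIONARY.items():
        # Word boundaries avoid matching a name inside a longer word.
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            found.add(entity)
    return found


def cooccurrence_counts(documents):
    """Count, for each entity pair, the number of documents mentioning both."""
    pairs = Counter()
    for text in documents:
        for a, b in combinations(sorted(tag(text)), 2):
            pairs[(a, b)] += 1
    return pairs
```

These are raw counts only. A usable co-occurrence statistic would also account for how often each entity is mentioned on its own, so that frequently mentioned entities do not dominate.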
quality assessment
high error rates
don’t filter it
score it
von Mering et al., Nucleic Acids Research, 2005
calibrate vs. gold standard
von Mering et al., Nucleic Acids Research, 2005
control error rate
improves comparability
helps interpretation
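A rough sketch of calibrating raw scores against a gold standard: bin the raw scores and estimate, per bin, the fraction of pairs that the gold standard supports. The function name, the binning scheme, and the simplification that every pair absent from the gold standard counts as wrong are assumptions of this example, not the published procedure.

```python
def calibrate(scored_pairs, gold_positive, bin_edges):
    """Estimate, per raw-score bin, the fraction of pairs in the gold standard.

    scored_pairs:  iterable of (pair, raw_score)
    gold_positive: set of pairs the gold standard considers true
    bin_edges:     ascending list of lower bin edges
    Returns one fraction per bin, or None for an empty bin.
    """
    totals = [0] * len(bin_edges)
    hits = [0] * len(bin_edges)
    for pair, raw in scored_pairs:
        # Find the last bin whose lower edge does not exceed the raw score.
        index = None
        for i, edge in enumerate(bin_edges):
            if raw >= edge:
                index = i
        if index is None:
            continue  # score lies below the lowest bin edge
        totals[index] += 1
        if pair in gold_positive:
            hits[index] += 1
    return [h / t if t else None for h, t in zip(hits, totals)]
```

Once raw scores from each source are translated into such fractions, evidence of different types sits on one scale, which is what makes the scores comparable and easier to interpret.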
usability
for bioinformaticians
common identifiers
common format
cannot ask for more
for biologists
web interfaces
unified information portal
nobody will use it
focused resources
STRING
protein associations
computational predictions
Korbel et al., Nature Biotechnology, 2004
experimental data
Jensen & Bork, Science, 2008
curated knowledge
Letunic & Bork, Trends in Biochemical Sciences, 2008
text mining
>10 km
general approach
COMPARTMENTS
TISSUES
DISEASES
visualization
quick overview
protein networks
string-db.org
subcellular localization
compartments.jensenlab.org
tissue expression
tissues.jensenlab.org
access to more details
tables are boring
summary
common identifiers
quality scores
focused resources
visualization