The Semantic Web: New-style data-integration
(and how it works for life-scientists too!)
Frank van Harmelen, AI Department
Vrije Universiteit Amsterdam
What’s the problem?
(data-mess in bio-inf)
Source: PhRMA & FDA 2003
Pharmaceutical Productivity
The Industry’s Problem
Too much unintegrated data:
– from a variety of incompatible sources
– no standard naming convention
– each with a custom browsing and querying mechanism (no common interface)
– and poor interaction with other data sources
Kenneth Griffiths and Richard Resnick, Tut. at Intell. Systems for Molec. Biol., 2003
What are the Data Sources?
• Flat Files
• URLs
• Proprietary Databases
• Public Databases
• Data Marts
• Spreadsheets
• Emails
• …
Sample Problem: Hyperprolactinemia
Overproduction of prolactin
– prolactin stimulates mammary gland development and milk production
Hyperprolactinemia is characterized by:
– inappropriate milk production
– disruption of menstrual cycle
– can lead to conception difficulty
Understanding transcription factors for prolactin production
“Show me all genes in the public literature that are putatively related to hyperprolactinemia, have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells, and are homologous to known transcription factors.”
Q1 (SEQUENCE): “Show me all genes that are homologous to known transcription factors”
Q2 (EXPRESSION): “Show me all genes that have more than 3-fold expression differential between hyperprolactinemic and normal pituitary cells”
Q3 (LITERATURE): “Show me all genes in the public literature that are putatively related to hyperprolactinemia”
(Q1 ∩ Q2 ∩ Q3)
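Combining the three sub-queries amounts to a set intersection. The following is a minimal sketch, assuming each source has already returned a set of gene identifiers; the gene names and the three result sets are made up for illustration and do not come from any real data source.

```python
# Hypothetical result sets; in practice each would come from a different
# data source (sequence, expression, literature) with its own interface.
q1_sequence = {"GENE_A", "GENE_B", "GENE_C"}    # homologous to known transcription factors
q2_expression = {"GENE_B", "GENE_C", "GENE_D"}  # more than 3-fold expression differential
q3_literature = {"GENE_B", "GENE_C", "GENE_E"}  # linked to hyperprolactinemia in the literature

# The full question is the conjunction of the three sub-queries.
candidates = q1_sequence & q2_expression & q3_literature

print(sorted(candidates))
```

The intersection itself is trivial. It only works if all three sources use the same identifier for the same gene, which is exactly what the lack of a standard naming convention prevents.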
The Medical Tower of Babel
MeSH: Medical Subject Headings, National Library of Medicine, 22.000 descriptions
EMTREE: Commercial (Elsevier), drugs and diseases, 45.000 terms, 190.000 synonyms
UMLS: integrates 100 different vocabularies
SNOMED: 200.000 concepts, College of American Pathologists
Gene Ontology: 15.000 terms in molecular biology
NCI Cancer Ontology: 17,000 classes (about 1M definitions)
Stitching this all together by hand?
Source: Stephens et al. J Web Semantics 2006
Why would Semantic technology help?
machine accessible meaning (What it’s like to be a machine)
<name>
<symptoms>
<drug>
<drugadministration>
<disease>
<treatment>
IS-A
alleviates
META-DATA
What is meta-data?
it's just data
it's data describing other data
it's meant for machine consumption
disease
name
symptoms
drug
administration
Required are:
1. one or more standard vocabularies, so search engines, producers and consumers all speak the same language
2. a standard syntax, so meta-data can be recognised as such
3. lots of resources with meta-data attached
mechanisms for attribution and trust
is this page really about Pamela Anderson?
no shared understanding
Conceptual and terminological confusion
Actors: both humans and machines
Agree on a conceptualization
Make it explicit in some language.
world
concept
language
What are ontologies & what are they used for
Standard vocabularies (“Ontologies”)
Identify the key concepts in a domain
Identify a vocabulary for these concepts
Identify relations between these concepts
Make these precise enough so that they can be shared between
– humans and humans
– humans and machines
– machines and machines
Biomedical ontologies (a few..)
MeSH: Medical Subject Headings, National Library of Medicine, 22.000 descriptions
EMTREE: Commercial (Elsevier), drugs and diseases, 45.000 terms, 190.000 synonyms
UMLS: integrates 100 different vocabularies
SNOMED: 200.000 concepts, College of American Pathologists
Gene Ontology: 15.000 terms in molecular biology
NCI Cancer Ontology: 17,000 classes (about 1M definitions)
Remember “required are”:
1. one or more standard vocabularies, so search engines, producers and consumers all speak the same language
2. a standard syntax, so meta-data can be recognised as such
3. lots of resources with meta-data attached
Stack of languages
XML: Surface syntax, no semantics
XML Schema: Describes structure of XML documents
RDF: Datamodel for “relations” between “things”
RDF Schema: RDF Vocabulary Definition Language
OWL: A more expressive Vocabulary Definition Language
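To make the RDF and RDF Schema layers concrete, here is a small sketch in plain Python. It only mimics the idea of triples ("relations" between "things") and of a subclass hierarchy in the vocabulary; it is not a real RDF library, and all resource and class names are hypothetical.

```python
# Triples are (subject, predicate, object). All names are hypothetical.
triples = {
    ("Aspirin", "type", "Painkiller"),
    ("Painkiller", "subClassOf", "Drug"),
    ("Drug", "subClassOf", "Treatment"),
    ("Aspirin", "alleviates", "Headache"),
}

def superclasses(cls, triples):
    """Return all direct and indirect superclasses of cls."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found

def types_of(resource, triples):
    """Asserted types plus the types implied by the subclass hierarchy."""
    asserted = {o for s, p, o in triples if s == resource and p == "type"}
    inferred = set()
    for cls in asserted:
        inferred |= superclasses(cls, triples)
    return asserted | inferred

print(sorted(types_of("Aspirin", triples)))
```

The data only states that Aspirin is a Painkiller. That it is also a Drug and a Treatment follows from the vocabulary definition, which is the kind of inference a schema language adds on top of the bare data model.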
Remember “required are”:
1. one or more standard vocabularies, so search engines, producers and consumers all speak the same language
2. a standard syntax, so meta-data can be recognised as such
3. lots of resources with meta-data attached
Question: who writes the ontologies?
Professional bodies, scientific communities, companies, publishers, …
– See previous slide on Biomedical ontologies
– Same developments in many other fields
Good old fashioned Knowledge Engineering
Convert from DB-schema, UML, etc.
Question: Who writes the meta-data?
- Automated learning
- shallow natural language analysis
- Concept extraction
Example: Encyclopedia Britannica on “Amsterdam”
Extracted concepts: amsterdam, trade, antwerp, europe, merchant, city, town, center, netherlands
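Concept extraction by shallow text analysis can be sketched as a lookup of words against a controlled vocabulary. This is a deliberately simple illustration, not the method used by any particular tool; the vocabulary and the sample text are made up, and the text is not a quotation from any encyclopedia.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary: surface word -> preferred concept.
VOCABULARY = {
    "amsterdam": "amsterdam",
    "city": "city",
    "town": "city",
    "merchant": "merchant",
    "merchants": "merchant",
    "trade": "trade",
    "netherlands": "netherlands",
}

def extract_concepts(text, vocabulary):
    """Count how often each known concept is mentioned in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(vocabulary[w] for w in words if w in vocabulary)

# Made-up sample text for illustration only.
sample = ("Amsterdam is a city in the Netherlands. Merchants made the town "
          "a center of trade, and trade made the city rich.")

concepts = extract_concepts(sample, VOCABULARY)
print(concepts.most_common(3))
```

Words outside the vocabulary (here "center") are ignored, and synonyms such as "town" and "city" are counted under one concept, so the resulting counts can serve as meta-data for the document.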
Question: Who writes the meta-data?
exploit existing legacy-data
– Databases
– Lab equipment
– (Amazon)
side-effect from user interaction
– email keyword extraction
NOT from manual effort
Remember “required are”:
1. one or more standard vocabularies, so search engines, producers and consumers all speak the same language
2. a standard syntax, so meta-data can be recognised as such
3. lots of resources with meta-data attached
Some working examples?
• DOPE
DOPE: Background
Vertical Information Provision
– Buy a topic instead of a journal!
– Web provides new opportunities
Business driver: drug development
– Rich, information-hungry market
– Good thesaurus (EMTREE)
The Data
Document repositories:
– ScienceDirect: approx. 500.000 fulltext articles
– MEDLINE: approx. 10.000.000 abstracts
Extracted Metadata
– The Collexis Metadata Server: concept extraction ("semantic fingerprinting")
Thesauri and Ontologies
– EMTREE: 60.000 preferred terms, 200.000 synonyms
Architecture:
[Figure: query interface, RDF Schema (EMTREE), RDF Datasource 1 … RDF Datasource n]
Ontology disambiguates query
Ontology groups results
Ontology clusters results
Ontology refines query
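The roles listed above (disambiguating a query, refining it, grouping results) can be sketched with a tiny thesaurus. This is an illustrative sketch only, not how DOPE is implemented; the terms are invented and are not taken from EMTREE, and the function names are hypothetical.

```python
# Hypothetical mini-thesaurus; the terms are illustrative, not taken from EMTREE.
SYNONYMS = {
    "asa": "aspirin",
    "acetylsalicylic acid": "aspirin",
    "aspirin": "aspirin",
}
NARROWER = {"analgesic": ["aspirin", "ibuprofen"]}

def normalise(term):
    """Map a user term to its preferred term (disambiguation step)."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)

def refine(term):
    """Suggest narrower terms the user could refine the query with."""
    return NARROWER.get(normalise(term), [])

def group_results(documents):
    """Group document titles by the preferred term of their keyword."""
    groups = {}
    for title, keyword in documents:
        groups.setdefault(normalise(keyword), []).append(title)
    return groups

docs = [("Paper 1", "ASA"), ("Paper 2", "aspirin"), ("Paper 3", "ibuprofen")]
print(normalise("Acetylsalicylic acid"))
print(refine("Analgesic"))
print(group_results(docs))
```

Because both the query and the document keywords are mapped to preferred terms first, documents annotated with different synonyms end up in the same group.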
Some working examples?
• DOPE
• HCLS (http://www.w3.org/2001/sw/hcls/)
Architecture:
[Figure: query interface, RDF Schema (EMTREE), RDF Schema (Gene Ontology), …, RDF Datasource 1 … RDF Datasource n]
Summarising…
Data integration on the Web:
– machine processable data besides human processable data
Syntax for meta-data
– Representation
– Inference
Vocabularies for meta-data
– Lots of them in bio-inf.
Actual meta-data:
– Lots in bio-inf.
Will enable:
– Better search engines (recall, precision, concepts)
– Combining information across pages (inference)
– …
Things to do for you
Practical:
– Use existing software to construct new use-scenarios
Conceptual:
– Create an ontology for some area of bio-medical expertise, either from scratch or as a refinement of an existing ontology
Technical:
– Transform an existing data-set into meta-data format, and provide a query interface (for humans and machines)
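As a starting point for the technical task, the sketch below turns a small table into subject/predicate/object triples and offers a pattern-matching query function. It is one possible approach under simplifying assumptions, in plain Python rather than a real RDF toolkit; the CSV content, column names, and function names are all made up.

```python
import csv
import io

# Made-up sample data standing in for an existing data-set.
CSV_DATA = """gene,organism,function
GENE_A,human,transcription factor
GENE_B,mouse,kinase
"""

def rows_to_triples(csv_text, key_column):
    """Turn each table cell into a (subject, predicate, object) triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = row[key_column]
        for column, value in row.items():
            if column != key_column:
                triples.append((subject, column, value))
    return triples

def query(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

triples = rows_to_triples(CSV_DATA, key_column="gene")
print(query(triples, p="organism", o="human"))
```

The same `query` function serves a program directly and could sit behind a simple form for human users, which covers both audiences named in the task.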