knowledge assembly at scale with semantic and probabilistic techniques
Post on 15-Apr-2017
Knowledge Assembly at Scale
with Semantic and Probabilistic Techniques
Szymon Klarman
Department of Computer Science, Brunel University London
Connected Data London 2016
Scientific publishing deluge
50 million papers published since 1665
2.5 million papers published last year
publication output doubling every 9 years
Effects:
narrowing of science and scholarship – we cite a small pool of mostly recent papers
narrowing of expertise
the “publish or perish” principle affects the quality of results
Challenges
• ambiguity and vagueness of natural language
• general quality and reliability of the sources
• the inaccuracy of the information extraction tools
• the typical “Vs” of big data: volume, variety, volatility, velocity
• inconsistent, inconclusive or non-reproducible results
• gaps, omissions, contextual assumptions
In vitro curcumin downregulated the expression of Bcl-2, and Bcl-XL and upregulated the expression of
p53, Bax, Bak, PUMA, Noxa, and Bim at mRNA and protein levels in prostate cancer cells [14].
[Figure: pipeline from evidence to knowledge through extraction, reconciliation, filtering, aggregation and model formation]
Knowledge assembly is a process of reconstructing complex knowledge from contextually
asserted atomic statements and data fragments (evidence).
Knowledge assembly
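[Figure: knowledge assembly maps the text “[…] A can associate with B […]” to the event <A binding B>]
Probabilistic knowledge assembly
[Figure: update-assembly loop with the stages extraction, evidence, assembly and (probabilistic) knowledge, fed by probabilistic inference, learning, model updates and expert input]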
In the Probabilistic Knowledge Assembly (PANDA) framework, evidence with all its contextual
information is kept in the knowledge base, which enables a continuous update-assembly loop.
[Figure: the same loop annotated with an example]
• extracted text: “A can associate with B”, extraction accuracy = 0.7
• published in: “Molecular Cancer”
• <A binding B> is supported to degree 0.7
• <A binding B> is experimentally confirmed
• evidence contradicts the model to degree 0.7
Ontologies:
• biomedical (GO, BioPAX, MI)
• uncertainty (UNO)
• information/document/provenance description (IAO, PROV-O, VoID, Dublin Core)
(Linked) open data via SPARQL endpoints and APIs:
• PubMed
• journal rankings (SCImago)
• bioinformatics databases (UniProt, ChEBI, HGNC)
Unique identifiers:
• biochemical entities
• journals / articles
Linked data resources
[Figure: knowledge graph data model]
Node types: Statement, Event, Biochemical entity / Event, Molecular interaction, Article, Journal, Textual evidence, Truth value evidence, Uncertainty level
Relations: represents, is extracted from, published in, has participant, type, has evidence, has truth value, has uncertainty (of type X)
Knowledge graph: data model
[Figure: knowledge graph for a single statement]
• statement_1 (type: Statement) has the textual evidence “In addition, GRB2 can associate with GAB1”, extraction prob 0.8 and truth value True
• statement_1 is extracted from PMC123456 (type: Article), with provenance prob 0.7
Knowledge graph: example
[Figure: the same graph extended with the represented event]
• statement_1 (truth value True, extraction prob 0.8, extracted from PMC123456 with provenance prob 0.7) represents the event GRB2 binding GAB1
• the event is of type Binding, a subclass of Event
• the event has participant A GRB2_MOUSE and participant B GAB1_MOUSE, both of type Protein
[Figure: the same graph extended with a contradicting statement]
• statement_1 (truth value True, extraction prob 0.8) is extracted from PMC123456 (provenance prob 0.7): “In addition, GRB2 can associate with GAB1”
• statement_..99 (truth value False, extraction prob 0.6) is extracted from PMC654321 (provenance prob 0.7): “GRB2 does not interact directly with GAB1”
• both statements represent the same event, GRB2 binding GAB1 (type: Binding, a subclass of Event), with participants GRB2_MOUSE and GAB1_MOUSE (type: Protein)
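The example graph above can be mirrored in a few plain Python records. This is only a minimal sketch of the data model: the class and field names are hypothetical, the slides describe a linked-data graph rather than Python objects, and attaching the provenance probability to the article is an assumption read off the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """A source document; provenance_prob is assumed to sit on the article."""
    pmc_id: str
    provenance_prob: float


@dataclass(frozen=True)
class Event:
    """A molecular interaction between two participants."""
    label: str
    event_type: str
    participant_a: str
    participant_b: str


@dataclass(frozen=True)
class Statement:
    """One extracted assertion about an event, kept together with its evidence."""
    statement_id: str
    represents: Event
    extracted_from: Article
    textual_evidence: str
    extraction_prob: float
    truth_value: bool


binding = Event("GRB2 binding GAB1", "Binding", "GRB2_MOUSE", "GAB1_MOUSE")

statement_1 = Statement(
    statement_id="statement_1",
    represents=binding,
    extracted_from=Article("PMC123456", provenance_prob=0.7),
    textual_evidence="In addition, GRB2 can associate with GAB1",
    extraction_prob=0.8,
    truth_value=True,
)

contradicting = Statement(
    statement_id="statement_..99",  # identifier is elided on the slide, kept verbatim
    represents=binding,
    extracted_from=Article("PMC654321", provenance_prob=0.7),
    textual_evidence="GRB2 does not interact directly with GAB1",
    extraction_prob=0.6,
    truth_value=False,
)

statements = [statement_1, contradicting]


def statements_about(event, all_statements):
    """Return every statement that represents the given event."""
    return [s for s in all_statements if s.represents == event]


for s in statements_about(binding, statements):
    print(s.statement_id, s.truth_value, s.extraction_prob,
          s.extracted_from.provenance_prob)
```

Keeping the evidence (text, source, probabilities) on the statement rather than on the event is what lets two contradicting statements point at the same event.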
So what can we really say about
the truth of events?
event = <A binding B>
[Figure: chart on a 0 to 1 scale showing positive support, negative support and inconsistency for the statement sets {s1}, {s1, s2} and {s1, s2, s3}]
Statement           | Extraction accuracy | Provenance uncertainty
S1 = event is true  | 0.8                 | 0.7
S2 = event is false | 0.8                 | 0.7
S3 = event is false | 0.9                 | 0.6
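The slides do not spell out how the three statements are combined, so the following is only one plausible sketch, not the PANDA aggregation itself. It assumes that (a) a statement's weight is its extraction accuracy times its provenance value, (b) statements of the same polarity are combined with a noisy-OR, and (c) inconsistency is the product of positive and negative support. All three choices, and every name in the code, are assumptions made for illustration.

```python
from math import prod

# (claims event is true, extraction accuracy, provenance value), from the table above
STATEMENTS = {
    "s1": (True, 0.8, 0.7),
    "s2": (False, 0.8, 0.7),
    "s3": (False, 0.9, 0.6),
}


def noisy_or(weights):
    """Probability that at least one independent piece of evidence holds."""
    return 1.0 - prod(1.0 - w for w in weights)


def support(names):
    """Return (positive support, negative support, inconsistency) for a statement set."""
    rows = [STATEMENTS[n] for n in names]
    # Assumption: a statement counts only if extraction and source are both right.
    positive = noisy_or(acc * prov for truth, acc, prov in rows if truth)
    negative = noisy_or(acc * prov for truth, acc, prov in rows if not truth)
    # Assumption: inconsistency is the chance that both polarities are supported.
    inconsistency = positive * negative
    return positive, negative, inconsistency


for names in (["s1"], ["s1", "s2"], ["s1", "s2", "s3"]):
    pos, neg, inc = support(names)
    print(names, round(pos, 3), round(neg, 3), round(inc, 3))
```

Under these assumptions positive support stays at 0.56 once S1 is in, while negative support grows from 0 to 0.56 and then to about 0.80 as S2 and S3 are added, so inconsistency rises with each contradicting statement.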
Support aggregation
[Figure: aggregation network. Nodes shown: Doc_1, Doc_2, Doc..., Stat_1, Stat_2, Stat..., provenance uncertainty, extraction accuracy, textual uncertainty, document part weight, positive support, negative support, event likelihood]
Total uncertainty aggregation
Probabilistic model (~Bayes net) over linked data, expressed via probabilistic logic
programming (ProbLog).
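To give a feel for what a probabilistic logic program computes, the sketch below enumerates possible worlds in plain Python instead of calling ProbLog. The fact and rule names are hypothetical and the actual program behind the slides is not shown; the probabilities reuse the rows for S2 and S3 from the statement table. In ProbLog notation the facts would be written as `0.8::extraction_ok(s2).` and the rule as `negative_support :- extraction_ok(S), source_ok(S).`

```python
from itertools import product

# Independent probabilistic facts (hypothetical names; values from rows S2 and S3).
FACTS = {
    "extraction_ok(s2)": 0.8,
    "source_ok(s2)": 0.7,
    "extraction_ok(s3)": 0.9,
    "source_ok(s3)": 0.6,
}


def query_probability(facts, holds):
    """Sum the weights of all truth assignments (worlds) in which the query holds."""
    names = list(facts)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        weight = 1.0
        for name, value in world.items():
            weight *= facts[name] if value else 1.0 - facts[name]
        if holds(world):
            total += weight
    return total


def negative_support(world):
    """Rule: the event is contradicted if some statement is extracted and sourced correctly."""
    return any(
        world[f"extraction_ok({s})"] and world[f"source_ok({s})"]
        for s in ("s2", "s3")
    )


p_negative = query_probability(FACTS, negative_support)
print(round(p_negative, 4))
```

Exhaustive enumeration is exponential in the number of facts, so it only serves to illustrate the semantics; a real engine compiles the program into a more compact form.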
[Figure: network with the nodes Extraction Accuracy, Provenance Uncertainty, Experimental Confirmation and Total Uncertainty]
Experimental Confirmation values:
T   | F   | -
0.9 | 0.1 | 0.5
Molecule | Interaction         | Gene         | Total Uncertainty Before Experiment | Experimental Confirmation | Total Uncertainty After Experiment
curcumin | negative regulation | BCL2_MOUSE   | 0.3941                              | TRUE                      | 0.7489
curcumin | positive regulation | P53_HUMAN    | 0.3924                              | FALSE                     | 0.1569
curcumin | negative regulation | Q9H014_HUMAN | 0.3929                              | -                         | 0.3929
...      | ...                 | ...          | ...                                 | ...                       | ...
Expert input
Big Mechanism technology
We need to find generic solutions for extracting Big Mechanisms and making them available to
computational agents.
The Probabilistic Knowledge Assembly framework (semantics + probabilistic reasoning) offers:
• a powerful framework for scalable and flexible knowledge assembly tasks
• a uniform knowledge representation model and data access interface based on generic
tools and technologies (particularly W3C standards)
• declarative formalisms that facilitate provenance tracking
• a continuous update-assembly loop for dynamic environments