The Empirical Turn in Knowledge Representation


Upload: frank-van-harmelen

Post on 21-Jan-2018


TRANSCRIPT

Page 1: The Empirical Turn in Knowledge Representation

Creative Commons CC BY 3.0:

allowed to share & remix

(also commercial)

but must attribute

Frank van Harmelen

The empirical turn in

Knowledge Representation

Contributions from many people in the KR&R group over many years.

And thanks to NWO for a 750k€ TOP grant for this

Page 2: The Empirical Turn in Knowledge Representation

KR in the pre-empirical era

Page 3: The Empirical Turn in Knowledge Representation

Handbook of Knowledge Representation (1000 pages, ToC alone is 14 pages)

• propositional logic & satisfiability solvers

• first order logic & resolution

• description logic

• constraint (logic) programming

• nonmonotonic reasoning

• belief revision

• qualitative reasoning

• model-based diagnosis

• bayesian networks

• temporal logic

• spatial reasoning

• epistemic logic

• deontic logic

• situation calculus

• default logic

• event calculus

• ……

Page 4: The Empirical Turn in Knowledge Representation

KR metrics in the pre-empirical era

KR = logic

• Show small examples

• Prove properties (expressivity, complexity)

• Give algorithms (sound, complete)

KR = engineering

• Build applications

• Show high performance

• Show low engineering costs

Page 5: The Empirical Turn in Knowledge Representation

BUT AN EXPERIMENT IN THE PAST 10 YEARS

MADE IT POSSIBLE TO DO SOMETHING VERY DIFFERENT:

OBSERVE HOW KNOWLEDGE REPRESENTATIONS BEHAVE

AT VERY LARGE SCALE

Page 6: The Empirical Turn in Knowledge Representation
Page 7: The Empirical Turn in Knowledge Representation

Rest of the talk

• Which KR’s were part of the experiment?

• How much of it was there to observe?

• How did we manage to observe it?

• What did we learn from observing it?

Page 8: The Empirical Turn in Knowledge Representation

Which KR’s ?

Page 9: The Empirical Turn in Knowledge Representation

RDF (for non-logicians)

Page 10: The Empirical Turn in Knowledge Representation

RDF (for logicians)

• ground binary predicate: $P(O_1, O_2)$

• Limited existential variables: $\exists x: P(C_1, x) \land P(C_2, x)$

• Type is unary predicate: $T_i(x)$

• Subtypes: $\forall x: T_1(x) \rightarrow T_2(x)$

• Type restrictions: $\forall x, y: P(x, y) \rightarrow T_1(x) \land T_2(y)$

• Equality: $O_1 = O_2$

• Extensions to DL:

– Disjointness of types

– Cardinality restrictions (0,1)

– always decidable: sub-FOL.
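The subtype and type-restriction rules above can be applied mechanically to a set of facts. The following is a minimal sketch, not the RDF/OWL semantics in full: it forward-chains only those two rules until nothing new is derived. The predicate names (`type`, `subTypeOf`, `domain`, `range`) and the toy facts are assumptions made for illustration.

```python
# Facts are (subject, predicate, object) triples.
# Predicate names and example facts are hypothetical.
TYPE, SUBTYPE, DOMAIN, RANGE = "type", "subTypeOf", "domain", "range"


def deduce(triples):
    """Forward-chain the subtype and type-restriction rules to a fixpoint."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                # Subtypes: T1(x) and "T1 subTypeOf T2" give T2(x).
                if p == TYPE and p2 == SUBTYPE and o == s2:
                    new.add((s, TYPE, o2))
                # Type restrictions: P(x, y) gives T1(x) ...
                if p2 == DOMAIN and p == s2:
                    new.add((s, TYPE, o2))
                # ... and T2(y).
                if p2 == RANGE and p == s2:
                    new.add((o, TYPE, o2))
        if new <= facts:  # nothing new was derived
            return facts
        facts |= new


toy_facts = {
    ("aspirin", "treats", "headache"),
    ("treats", DOMAIN, "analgesic"),
    ("treats", RANGE, "pain"),
    ("analgesic", SUBTYPE, "drug"),
    ("pain", SUBTYPE, "symptom"),
}
derived = deduce(toy_facts) - toy_facts
for fact in sorted(derived):
    print(fact)
```

The nested loop is quadratic per round, which is fine for a toy example but not for anything at the scale discussed later in the talk.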

Page 11: The Empirical Turn in Knowledge Representation

RDF deduction

Page 12: The Empirical Turn in Knowledge Representation

OWL Semantics

Page 13: The Empirical Turn in Knowledge Representation

How much is there to observe?

Page 14: The Empirical Turn in Knowledge Representation

± 45-100 billion facts

Page 15: The Empirical Turn in Knowledge Representation

1 fact

How big is 100 billion

Page 16: The Empirical Turn in Knowledge Representation

Denny Vrandečić – AIFB, Universität Karlsruhe

≈ 1 fact per web-page

100 billion golfballs ≈ Jupiter

Page 17: The Empirical Turn in Knowledge Representation

x T

[<x> IsOfType <T>]

different owners & locations

< analgesic >

BTW: How did it get so big?

On the Web, anybody can say anything about anything

Page 18: The Empirical Turn in Knowledge Representation

BTW: How did it get so big?

On the Web, anybody can say anything about anything

x T

R

Page 19: The Empirical Turn in Knowledge Representation

How did you manage to observe it?

Page 20: The Empirical Turn in Knowledge Representation
Page 21: The Empirical Turn in Knowledge Representation
Page 22: The Empirical Turn in Knowledge Representation

LOD Laundromat: Beek & Rietveld et al. 2014, LOD laundromat: a uniform way of publishing other people's dirty data
http://lodlaundromat.org/pdf/lodlaundry.pdf

HDT: Fernández & Martínez-Prieto & Gutiérrez, 2013, Binary RDF representation for publication and exchange (HDT)

LDF: Verborgh & Vander Sande et al. 2014, Web-Scale Querying through Linked Data Fragments

Page 23: The Empirical Turn in Knowledge Representation

LOD-a-lot: http://lod-a-lot.lod.labs.vu.nl/

Page 24: The Empirical Turn in Knowledge Representation

Surprisingly efficient

1 file

28,362,198,927 unique triples

>650K data documents

524 GB of disk space

16 GB of RAM

Only €305,- hardware cost

Meta-Data for a lot of LOD: http://www.semantic-web-journal.net/content/meta-data-lot-lod-2
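As a rough back-of-the-envelope check on "surprisingly efficient", the figures above give the average disk footprint per triple. This sketch assumes that "GB" means 10^9 bytes; the slide does not say which convention is used.

```python
# Figures taken from the slide; the decimal-GB reading is an assumption.
triples = 28_362_198_927
disk_bytes = 524 * 10**9

bytes_per_triple = disk_bytes / triples
print(f"{bytes_per_triple:.1f} bytes per triple on disk")
```

Under that assumption the file needs about 18.5 bytes per triple on average.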

Page 25: The Empirical Turn in Knowledge Representation

Statistics (boring)

triples 28,362,198,927

subjects 3,214,347,198

predicates 1,168,932

objects 3,178,409,386

literals 5.3B

Page 26: The Empirical Turn in Knowledge Representation

Re-use is fairly high… or not…

Page 27: The Empirical Turn in Knowledge Representation

Analysing Logical identity

Joe Raad, Wouter Beek (ESWC 2018, under submission)

Page 28: The Empirical Turn in Knowledge Representation

Identity clusters

1. Extracting all owl:sameAs statements on the LOD

LOD-a-lot file: http://lod-a-lot.lod.labs.vu.nl [Fernández 2017]

558 million owl:sameAs statements (309 million distinct terms)

≈ 4 hours

HDT File (4.5 GB)

Page 29: The Empirical Turn in Knowledge Representation

2. Generating the Identity Closure

HDT File (4.5 GB)

Identity Closure 1

Identity Closure 2

Identity Closure 89 387 082…

- The largest Identity Closure contains 177 794 terms (contains all the countries in the world, Albert Einstein, « empty string », etc.)

- The smallest Identity Closure contains 2 terms

Example: x owl:sameAs y, z owl:sameAs y

Identity Closure: x, y, z
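Grouping terms into identity closures amounts to finding connected components of the owl:sameAs graph. The slides do not say which algorithm was used; the following is a minimal sketch using a union-find structure, one common choice for this task. The function name and the toy pairs are hypothetical.

```python
from collections import defaultdict


def identity_closures(same_as_pairs):
    """Group terms linked (directly or indirectly) by owl:sameAs pairs."""
    parent = {}

    def find(term):
        # Register unseen terms as their own representative.
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two closures

    closures = defaultdict(set)
    for term in list(parent):
        closures[find(term)].add(term)
    return list(closures.values())


# The example from the slide, plus one unrelated pair.
pairs = [("x", "y"), ("z", "y"), ("a", "b")]
for closure in identity_closures(pairs):
    print(sorted(closure))
```

Because owl:sameAs is treated here as symmetric and transitive, the direction of each pair does not matter.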

Page 30: The Empirical Turn in Knowledge Representation
Page 31: The Empirical Turn in Knowledge Representation

Identity Closure « Cities »

3. Detecting Communities (using the Louvain Algorithm)

This network (i.e. identity closure) has a community structure, as it can be grouped into different sets of nodes, with each set of nodes being densely connected internally.

Goal: find (and later evaluate) the most “suspicious” identity links (i.e. the links between different communities)
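Once a community has been assigned to every term, picking out the "suspicious" links is a simple filter. The sketch below does not implement the Louvain algorithm; it assumes the community assignment is already available as a dictionary, and all term names and community numbers are hypothetical.

```python
from collections import Counter


def suspicious_links(links, community_of):
    """Return the identity links whose two ends lie in different communities."""
    return [(a, b) for a, b in links if community_of[a] != community_of[b]]


def links_between_communities(links, community_of):
    """Count suspicious links per (unordered) pair of communities."""
    counts = Counter()
    for a, b in suspicious_links(links, community_of):
        pair = tuple(sorted((community_of[a], community_of[b])))
        counts[pair] += 1
    return counts


# Hypothetical output of a community-detection step.
community_of = {"a": 0, "b": 0, "c": 0, "p": 3, "q": 3}
links = [("a", "b"), ("b", "c"), ("p", "q"), ("c", "p"), ("a", "q")]

print(suspicious_links(links, community_of))
print(links_between_communities(links, community_of))
```

Community pairs joined by only a few links are natural candidates for manual inspection.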

Page 32: The Empirical Turn in Knowledge Representation

4. Application: debugging identity statements

Identity closure containing the term

“dbpedia.org/page/Barack_Obama”

This Identity Closure contains 388 terms (i.e. 387 distinct terms are owl:sameAs this term)

95 communities detected

largest community = 99 terms

Page 33: The Empirical Turn in Knowledge Representation

4. Application: debugging identity statements

comm0 and comm3: 2 links

Community 0

1. dbpedia.org/resource/B_hussein_obama
2. dbpedia.org/resource/Barack_H_Obama,_Jr
3. dbpedia.org/resource/Barak_hussein_obama
4. dbpedia.org/resource/President_Barack
5. dbpedia.org/resource/Senator_Barack_Obama
6. dbpedia.org/resource/Obama
…
99. dbpedia.org/resource/Hussein_Obama

Community 3

1. dbpedia.org/resource/Presidency_of_Barack_Obama
2. dbpedia.org/resource/Barack_Obama_Administration
3. dbpedia.org/resource/Barack_Obama_Cabinet
4. dbpedia.org/resource/Obama_White_House
5. dbpedia.org/resource/Obama_regime
6. dbpedia.org/resource/America_under_Obama
…
52. dbpedia.org/resource/Presidential_transition_of_Barack_Obama

Page 34: The Empirical Turn in Knowledge Representation

Symbols or words?

Steven de Rooij, Peter Bloem, Wouter Beek (ISWC 2016)
http://www.cs.vu.nl/~frankh/postscript/ISWC2016.pdf

Page 35: The Empirical Turn in Knowledge Representation

Symbols or words?

Symbol names are supposed to be meaningless

[Figure: "Aspirin treats headache" and "analgesic treats pain"; further labels: drug, symptom]

Page 36: The Empirical Turn in Knowledge Representation

Measure mutual information content between string and semantics of a symbol

$E(x)$ = length of an efficient encoding of $x$

Mutual information content:

$M(x, y) = E(x) + E(y) - E(x, y)$

Take $x$ = the symbol's name as a string

Take $y_1$ = {types of $x$} ≈ semantics of $x$

Take $y_2$ = {properties of $x$} ≈ semantics of $x$

Calculate $M(x, y_1)$ and $M(x, y_2)$ for all symbols in 600k datasets
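The measure above can be approximated with any off-the-shelf compressor standing in for the efficient encoding E. The sketch below uses zlib and takes E(x, y) as the compressed size of the concatenation; both choices are assumptions for illustration and not necessarily the encoding used in the ISWC 2016 study. The example IRIs are hypothetical.

```python
import zlib


def encoded_size(data: bytes) -> int:
    """E(x): size in bytes of a compressed encoding (zlib is an assumption)."""
    return len(zlib.compress(data, 9))


def mutual_information(x: bytes, y: bytes) -> int:
    """M(x, y) = E(x) + E(y) - E(x, y), with E(x, y) taken on x followed by y."""
    return encoded_size(x) + encoded_size(y) - encoded_size(x + y)


# Hypothetical symbol: its name, and a serialisation of its set of types.
name = b"http://example.org/city/Amsterdam"
types = b"http://example.org/ontology/City http://example.org/ontology/Place"
print(mutual_information(name, types))
```

On strings this short the compressor's fixed overhead dominates, so single values are noisy; the estimate only becomes informative when aggregated over many symbols.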

Page 37: The Empirical Turn in Knowledge Representation

But variables do encode meaning!

Fraction of datasets with redundancy for types/predicates at significance level > 0.99

BTW, this is 600,000 datapoints (RDF docs)

Page 38: The Empirical Turn in Knowledge Representation

Very different network structures

for different predicates

Tobias Kuhn, Wouter Beek
http://ceur-ws.org/Vol-1946/paper-05.pdf

Page 43: The Empirical Turn in Knowledge Representation

Summary &

So what…

Page 44: The Empirical Turn in Knowledge Representation

• We now have larger KB’s than ever before

• We now have the instruments to observe and analyse these very large KB’s

• We can use these insights for better tools:

– query & inference

– publish & maintain

– visualise & explain

– …

Page 45: The Empirical Turn in Knowledge Representation

But my secret hope is that this will help us to understand the patterns of knowledge:

AI as a computational theory of knowledge