the empirical turn in knowledge representation

Creative Commons CC BY 3.0: allowed to share & remix (also commercial), but must attribute

Frank van Harmelen

The empirical turn in Knowledge Representation

Contributions from many people in the KR&R group over many years.

And thanks to NWO for a 750k€ TOP grant for this

KR in the pre-empirical era

Handbook of Knowledge Representation (1000 pages, ToC alone is 14 pages)

• propositional logic & satisfiability solvers

• first order logic & resolution

• description logic

• constraint (logic) programming

• nonmonotonic reasoning

• belief revision

• qualitative reasoning

• model-based diagnosis

• bayesian networks

• temporal logic

• spatial reasoning

• epistemic logic

• deontic logic

• situation calculus

• default logic

• event calculus

• …

KR metrics in the pre-empirical era

KR = logic

• Show small examples

• Prove properties (expressivity, complexity)

• Give algorithms (sound, complete)

KR = engineering

• Build applications

• Show high performance

• Show low engineering costs

BUT AN EXPERIMENT IN THE PAST 10 YEARS

MADE IT POSSIBLE TO DO SOMETHING VERY DIFFERENT:

OBSERVE HOW KNOWLEDGE REPRESENTATIONS BEHAVE

AT VERY LARGE SCALE

Rest of the talk

• Which KR’s were part of the experiment?

• How much of it was there to observe?

• How did we manage to observe it?

• What did we learn from observing it?

Which KR’s?

RDF (for non-logicians)

RDF (for logicians)

• ground binary predicate: $P(O_1, O_2)$

• Limited existential variables: $\exists x: P(C_1, x) \land P(C_2, x)$

• Type is a unary predicate: $T_i(x)$

• Subtypes: $\forall x: T_1(x) \rightarrow T_2(x)$

• Type restrictions: $\forall x, y: P(x, y) \rightarrow T_1(x) \land T_2(y)$

• Equality: $O_1 = O_2$

• Extensions to DL:

– Disjointness of types

– Cardinality restrictions (0,1)

– always decidable: sub-FOL.
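
The subtype and type-restriction rules above can be sketched as a tiny forward-chaining loop over triples. This is a minimal illustration, not the official RDFS rule set: the predicate names `type`, `subTypeOf`, `domain` and `range` are stand-ins for the real vocabulary, and the example facts (including `substance`) are made up for the sketch.

```python
# Minimal forward-chaining sketch over (subject, predicate, object) triples.
# "type", "subTypeOf", "domain" and "range" are illustrative stand-ins,
# not the official RDF/RDFS vocabulary IRIs.

def deduce(triples):
    """Apply the subtype and type-restriction rules until nothing new appears."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                # Subtypes: T1(x) and (T1 subTypeOf T2) give T2(x)
                if p == "type" and p2 == "subTypeOf" and o == s2:
                    new.add((s, "type", o2))
                # Type restriction on the first argument: P(x, y) gives T1(x)
                if p2 == "domain" and p == s2:
                    new.add((s, "type", o2))
                # Type restriction on the second argument: P(x, y) gives T2(y)
                if p2 == "range" and p == s2:
                    new.add((o, "type", o2))
        if new <= facts:          # fixpoint reached
            return facts
        facts |= new


# Hypothetical example knowledge base
kb = [
    ("aspirin", "treats", "headache"),
    ("treats", "domain", "drug"),
    ("treats", "range", "symptom"),
    ("drug", "subTypeOf", "substance"),
]

for fact in sorted(deduce(kb) - set(kb)):
    print(fact)
```

The quadratic loop is only acceptable for toy inputs; at the scale discussed in this talk one would need indexed or streaming rule application.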

RDF deduction

OWL Semantics

How much is there to observe?

± 45-100 billion facts

1 fact

How big is 100 billion

Denny Vrandečić – AIFB, Universität Karlsruhe ≈ 1 fact per web-page

100 billion golfballs ≈ Jupiter

x T

[<x> IsOfType <T>]

different owners & locations

< analgesic >

BTW: How did it get so big?

On the Web, anybody can say anything about anything

x T

R

How did you manage to observe it?

LOD Laundromat: Beek & Rietveld et al. 2014, LOD Laundromat: a uniform way of publishing other people's dirty data, http://lodlaundromat.org/pdf/lodlaundry.pdf

HDT: Fernández & Martínez-Prieto & Gutiérrez, 2013, Binary RDF representation for publication and exchange (HDT)

LDF: Verborgh & Vander Sande et al. 2014, Web-Scale Querying through Linked Data Fragments

http://lodlaundromat.org/

LOD-a-lot: http://lod-a-lot.lod.labs.vu.nl/

Surprisingly efficient

1 file

28,362,198,927 unique triples

>650K data documents

524 GB of disk space

16 GB of RAM

Only €305,- hardware cost
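
To put these numbers in perspective, the arithmetic below derives storage and hardware cost per triple from the figures on this slide. It is a back-of-the-envelope sketch that assumes decimal gigabytes (10^9 bytes); the variable names are illustrative.

```python
# Figures taken from the slide; GB is assumed to mean 10**9 bytes.
TRIPLES = 28_362_198_927
DISK_BYTES = 524 * 10**9
HARDWARE_EUR = 305

bytes_per_triple = DISK_BYTES / TRIPLES
eur_per_billion_triples = HARDWARE_EUR / (TRIPLES / 10**9)

print(f"{bytes_per_triple:.1f} bytes of disk per triple")
print(f"{eur_per_billion_triples:.2f} EUR of hardware per billion triples")
```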

Meta-Data for a lot of LOD: http://www.semantic-web-journal.net/content/meta-data-lot-lod-2

Statistics (boring)

triples 28,362,198,927

subjects 3,214,347,198

predicates 1,168,932

objects 3,178,409,386

literals 5.3B

Re-use is fairly high… or not…

Analysing Logical identity

Joe Raad, Wouter Beek (ESWC 2018, under submission)

Identity clusters

LOD-a-lot File: http://lod-a-lot.lod.labs.vu.nl

[Fernández 2017]

558 million owl:sameAs statements (309 million distinct terms)

≈ 4 hours

1. Extracting all owl:sameAs statements on the LOD
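
The extraction step can be sketched as a single pass over a stream of triples that keeps only the owl:sameAs statements and records the distinct terms they mention. This is a simplified stand-in for the HDT-based extraction used in the talk; the function name and the sample triples are hypothetical.

```python
OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"

def extract_same_as(triples):
    """Return (list of sameAs pairs, set of distinct terms) from an iterable of triples."""
    pairs = []
    terms = set()
    for s, p, o in triples:
        if p == OWL_SAMEAS:
            pairs.append((s, o))
            terms.update((s, o))
    return pairs, terms


# Hypothetical sample input
sample = [
    ("ex:x", OWL_SAMEAS, "ex:y"),
    ("ex:z", OWL_SAMEAS, "ex:y"),
    ("ex:x", "ex:treats", "ex:headache"),
]
pairs, terms = extract_same_as(sample)
print(len(pairs), "sameAs statements over", len(terms), "distinct terms")
```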

HDT File (4.5 GB)

Identity Closure 1, Identity Closure 2, … Identity Closure 89 387 082

- The largest Identity Closure contains 177 794 terms (contains all the countries in the world, Albert Einstein, « empty string », etc.)

- The smallest Identity Closure contains 2 terms

x owl:sameAs y, z owl:sameAs y ⇒ Identity Closure {x, y, z}

2. Generating the Identity Closure
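
One common way to compute such closures is a union-find (disjoint-set) structure, which treats owl:sameAs as symmetric and transitive. The sketch below is an assumption about how this could be done, not necessarily the method used in the work presented here; all names are illustrative.

```python
def identity_closures(same_as_pairs):
    """Group terms into identity closures using union-find with path compression."""
    parent = {}

    def find(t):
        parent.setdefault(t, t)
        root = t
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path directly at the root
        while parent[t] != root:
            parent[t], t = root, parent[t]
        return root

    for a, b in same_as_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    closures = {}
    for t in list(parent):
        closures.setdefault(find(t), set()).add(t)
    # A closure needs at least 2 terms; "x sameAs x" alone is dropped
    return [c for c in closures.values() if len(c) > 1]


closures = identity_closures([("x", "y"), ("z", "y"), ("a", "b")])
print(sorted(sorted(c) for c in closures))
```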

Identity Closure « Cities »

3. Detecting Communities (using the Louvain Algorithm)

This network (i.e. identity closure) has a community structure, as it can be grouped into different sets of nodes, with each set of nodes being densely connected internally.

Goal: Find (and later evaluate) the most “suspicious” identity links (i.e. the links between different communities)
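
Once communities are known, finding the suspicious links is a simple filter. The sketch below does not implement the Louvain algorithm itself: it assumes a community assignment has already been computed (here a hand-written, hypothetical `community_of` mapping) and only selects the links that cross community boundaries.

```python
from collections import Counter

def links_between_communities(same_as_pairs, community_of):
    """Return the cross-community links and a count per pair of communities."""
    suspicious = []
    counts = Counter()
    for a, b in same_as_pairs:
        ca, cb = community_of[a], community_of[b]
        if ca != cb:
            suspicious.append((a, b))
            counts[tuple(sorted((ca, cb)))] += 1
    return suspicious, counts


# Hypothetical input: term names and community ids are made up
community_of = {"a0": 0, "b0": 0, "a3": 3, "b3": 3}
links = [("a0", "b0"), ("a3", "b3"), ("a0", "a3"), ("b3", "b0")]

suspicious, counts = links_between_communities(links, community_of)
print(suspicious)
print(dict(counts))
```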

4. Application: debugging identity statements

Identity closure containing the term

“dbpedia.org/page/Barack_Obama”

This Identity Closure contains 388 terms (i.e. 387 distinct terms are owl:sameAs this term)

95 communities detected

largest community = 99 terms

[Figure: communities comm0 and comm3, connected by 2 links]

Community 0

1. dbpedia.org/resource/B_hussein_obama

2. dbpedia.org/resource/Barack_H_Obama,_Jr

3. dbpedia.org/resource/Barak_hussein_obama

4. dbpedia.org/resource/President_Barack

5. dbpedia.org/resource/Senator_Barack_Obama

6. dbpedia.org/resource/Obama

…

99. dbpedia.org/resource/Hussein_Obama

Community 3

1. dbpedia.org/resource/Presidency_of_Barack_Obama

2. dbpedia.org/resource/Barack_Obama_Administration

3. dbpedia.org/resource/Barack_Obama_Cabinet

4. dbpedia.org/resource/Obama_White_House

5. dbpedia.org/resource/Obama_regime

6. dbpedia.org/resource/America_under_Obama

…

52. dbpedia.org/resource/Presidential_transition_of_Barack_Obama

Symbols or words?

Steven de Rooij, Peter Bloem, Wouter Beek (ISWC 2016): http://www.cs.vu.nl/~frankh/postscript/ISWC2016.pdf

Symbol names are supposed to be meaningless

[Figure: Aspirin treats headache; analgesic treats pain; node types: drug, symptom]

Measure mutual information content between string and semantics of a symbol

E(x) = efficient encoding of x

Mutual information content

$M(x, y) = E(x) + E(y) - E(x, y)$

Take $x$ = symbol name of x as a string

Take $y_1$ = {types of x} ≈ semantics of x

Take $y_2$ = {properties of x} ≈ semantics of x

Calculate $M(x, y_1)$ and $M(x, y_2)$ for all symbols in 600k datasets
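
The definition of M can be approximated with any off-the-shelf compressor standing in for the efficient encoding E. The sketch below uses zlib purely as an assumption (the encoder used in the cited work is not specified on these slides) and feeds it synthetic byte strings rather than real symbol names, so that the two extreme cases are easy to see.

```python
import hashlib
import zlib

def E(data: bytes) -> int:
    """Length in bytes of a zlib encoding of data (a stand-in for an efficient code)."""
    return len(zlib.compress(data, 9))

def M(x: bytes, y: bytes) -> int:
    """Mutual information estimate: E(x) + E(y) - E(x, y), with (x, y) as concatenation."""
    return E(x) + E(y) - E(x + y)

def noise(seed: str, blocks: int = 50) -> bytes:
    """Deterministic, practically incompressible bytes (synthetic test data)."""
    return b"".join(hashlib.sha256(f"{seed}{i}".encode()).digest() for i in range(blocks))

names = noise("names")
unrelated = noise("types")

shared = M(names, names)           # y fully determined by x: large value
independent = M(names, unrelated)  # y unrelated to x: close to zero
print(shared, independent)
```

A high M(x, y) means that knowing the symbol names makes the types (or properties) cheaper to encode, which is the sense in which names carry meaning.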

But variables do encode meaning!

Fraction of datasets with redundancy for types/predicates at significance level > 0.99

BTW, this is 600,000 data points (RDF docs)

Very different network structures

for different predicates

Tobias Kuhn, Wouter Beek: http://ceur-ws.org/Vol-1946/paper-05.pdf

skos:exactMatch

https://wouterbeek.github.io/img/exact_match.png

foaf:knows

https://wouterbeek.github.io/img/foaf_knows.png

osspr:contains

https://wouterbeek.github.io/img/contains.png

Geopolitics:hasborderWith

https://wouterbeek.github.io/img/has_border_with.png
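
Differences like those in the four images above can also be quantified. The sketch below groups triples by predicate and computes a few simple structural measures for each induced graph; the choice of measures and the sample triples are illustrative assumptions, not the analysis from the cited paper.

```python
from collections import Counter, defaultdict

def structure_per_predicate(triples):
    """Basic graph statistics for the network induced by each predicate."""
    edges = defaultdict(set)
    for s, p, o in triples:
        edges[p].add((s, o))

    stats = {}
    for p, pairs in edges.items():
        nodes = {n for pair in pairs for n in pair}
        out_degree = Counter(s for s, _ in pairs)
        symmetric = sum((o, s) in pairs for s, o in pairs)
        stats[p] = {
            "nodes": len(nodes),
            "edges": len(pairs),
            "max_out_degree": max(out_degree.values()),
            "symmetric_fraction": symmetric / len(pairs),
        }
    return stats


# Hypothetical sample: a symmetric relation and a tree-like relation
sample_triples = [
    ("a", "knows", "b"), ("b", "knows", "a"),
    ("box", "contains", "x"), ("box", "contains", "y"),
]
stats = structure_per_predicate(sample_triples)
for predicate, values in stats.items():
    print(predicate, values)
```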

Summary & So what…

• We now have larger KB’s than ever before

• We now have the instruments to observe and analyse these very large KB’s

• We can use these insights for better tools:

– query & inference

– publish & maintain

– visualise & explain

– …

But my secret hope is that this will help us to understand the patterns of knowledge:

AI as a computational theory of knowledge

the empirical turn in knowledge representation

Science