
Interactive Reasoning in Large and Uncertain RDF Knowledge Bases

Martin Theobald

Joint work with: Maximilian Dylla, Timm Meiser, Ndapa Nakashole, Christina Teflioudi, Yafang Wang, Mohamed Yahya, Mauro Sozio, and Fabian Suchanek

Max Planck Institute for Informatics

French Marriage Problem

...

marriedTo: person × person

marriedTo_French: person × person

∀x,y,z: marriedTo(x,y) ∧ marriedTo(x,z) ⇒ y=z

French Marriage Problem

Facts in KB:
marriedTo (Hillary, Bill)
marriedTo (Carla, Nicolas)
marriedTo (Angelina, Brad)

New facts or fact candidates:
marriedTo (Cecilia, Nicolas)
marriedTo (Carla, Benjamin)
marriedTo (Carla, Mick)
marriedTo (Michelle, Barack)
marriedTo (Yoko, John)
marriedTo (Kate, Leonardo)
marriedTo (Carla, Sofie)
marriedTo (Larry, Google)

1) for recall: pattern-based harvesting
2) for precision: consistency reasoning

∀x,y,z: marriedTo(x,y) ∧ marriedTo(x,z) ⇒ y=z
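As a minimal sketch of the precision step, the functional dependency on marriedTo can be checked for each fact candidate against the facts already in the KB. The helper name `violates_fd` and the accept/reject split are assumptions made for illustration; they are not taken from the slides.

```python
# Facts and candidates as listed on the slide, stored as (x, y) pairs.
kb = [("Hillary", "Bill"), ("Carla", "Nicolas"), ("Angelina", "Brad")]
candidates = [
    ("Cecilia", "Nicolas"), ("Carla", "Benjamin"), ("Carla", "Mick"),
    ("Michelle", "Barack"), ("Yoko", "John"), ("Kate", "Leonardo"),
    ("Carla", "Sofie"), ("Larry", "Google"),
]


def violates_fd(candidate, facts):
    """Hypothetical helper: True if marriedTo(x,y) and marriedTo(x,z) with y != z."""
    x, y = candidate
    return any(fx == x and fy != y for fx, fy in facts)


rejected = [c for c in candidates if violates_fd(c, kb)]
accepted = [c for c in candidates if not violates_fd(c, kb)]
print("rejected:", rejected)
print("accepted:", accepted)
```

This check only enforces the constraint exactly as written (on the first argument), so marriedTo (Cecilia, Nicolas) and marriedTo (Larry, Google) pass it; catching those would need the type signature or a symmetric variant of the constraint.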

Agenda

– URDF: Reasoning in Uncertain Knowledge Bases
  • Resolving uncertainty at query-time
  • Lineage of answers
  • Propositional vs. probabilistic reasoning
  • Temporal reasoning extensions

– UViz: The URDF Visualization Frontend
  • Demo!

URDF: Reasoning in Uncertain KB’s

• Knowledge harvesting from the Web may yield knowledge bases which are
  – Incomplete: bornIn(Albert_Einstein,?x) → {}
  – Incorrect: bornIn(Albert_Einstein,?x) → {Stuttgart}
  – Inconsistent: bornIn(Albert_Einstein,?x) → {Ulm, Stuttgart}

• Combine grounding of first-order logic rules with an additional step of consistency reasoning
  – Propositional: Constrained Weighted MaxSat
  – Probabilistic: Lineage & Possible Worlds Semantics

At query time!

[Theobald, Sozio, Suchanek, Nakashole: MPII Tech-Report ’10]

0.7 0.2

Soft Rules vs. Hard Constraints

(Soft) Inference Rules vs. (Hard) Consistency Constraints

• People may live in more than one place
  livesIn(x,y) ∧ marriedTo(x,z) ⇒ livesIn(z,y)
  livesIn(x,y) ∧ hasChild(x,z) ⇒ livesIn(z,y)

• People are not born in different places/on different dates
  bornIn(x,y) ∧ bornIn(x,z) ⇒ y=z

• People are not married to more than one person (at the same time, in most countries?)
  marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z ⇒ disjoint(t1,t2)

[0.6] [0.2]

Soft Rules vs. Hard Constraints (ct’d)

Enforce FDs (e.g., mutual exclusion) as hard constraints:
  hasAdvisor(x,y) ∧ hasAdvisor(x,z) ⇒ y=z

Generalize to other forms of constraints:
  Hard constraint: hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
  Soft constraint: firstPaper(x,p) ∧ firstPaper(y,q) ∧ author(p,x) ∧ author(p,y) ∧ inYear(p) > inYear(q)+5years ⇒ hasAdvisor(x,y) [0.6]

Combine soft and hard constraints
  → No longer regular MaxSat
  → Constrained (weighted) MaxSat instead

Datalog-style grounding (deductive & potentially recursive soft rules):
  livesIn(x,y) ∧ type(y,City) ∧ locatedIn(y,z) ∧ type(z,Country) ⇒ livesIn(x,z)

Deductive Grounding (SLD Resolution/Datalog)

Query: livesIn(Bill, ?x)

First-Order Rules (Horn clauses)
R1: livesIn(?x, ?y) :- marriedTo(?x, ?z), livesIn(?z, ?y)
R2: livesIn(?x, ?y) :- represents(?x, ?y)
R3: livesIn(?x, ?y) :- governorOf(?x, ?y)

RDF Base Facts
F1: marriedTo(Bill, Hillary)
F2: represents(Hillary, New_York)
F3: governorOf(Bill, Arkansas)

[Figure: AND/OR resolution tree over the rules R1, R2, R3 and the base facts F1, F2, F3; failed branches are marked X]

Answers (derived facts):
livesIn(Bill, Arkansas)
livesIn(Bill, New_York)
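The grounding of this example can be sketched as a small top-down (SLD-style) resolver in Python. This is an illustrative toy, not the URDF implementation: the tuple encoding of atoms, the function names, and the depth bound that cuts off the recursive rule R1 are all assumptions.

```python
from itertools import count

FACTS = [
    ("marriedTo", "Bill", "Hillary"),       # F1
    ("represents", "Hillary", "New_York"),  # F2
    ("governorOf", "Bill", "Arkansas"),     # F3
]
RULES = [  # (head, body) pairs for R1, R2, R3
    (("livesIn", "?x", "?y"), [("marriedTo", "?x", "?z"), ("livesIn", "?z", "?y")]),
    (("livesIn", "?x", "?y"), [("represents", "?x", "?y")]),
    (("livesIn", "?x", "?y"), [("governorOf", "?x", "?y")]),
]
_fresh = count()


def is_var(term):
    return term.startswith("?")


def walk(term, subst):
    """Follow variable bindings until a constant or an unbound variable."""
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def unify(a, b, subst):
    """Return an extended substitution, or None if the atoms do not unify."""
    if len(a) != len(b):
        return None
    subst = dict(subst)
    for x, y in zip(a, b):
        x, y = walk(x, subst), walk(y, subst)
        if x == y:
            continue
        if is_var(x):
            subst[x] = y
        elif is_var(y):
            subst[y] = x
        else:
            return None
    return subst


def rename(rule):
    """Give the rule fresh variable names for each use."""
    n = next(_fresh)
    fresh = lambda atom: tuple(f"{t}_{n}" if is_var(t) else t for t in atom)
    head, body = rule
    return fresh(head), [fresh(atom) for atom in body]


def solve(goals, subst, depth=5):
    """Yield every substitution that proves all goals (depth-bounded)."""
    if not goals:
        yield subst
        return
    if depth == 0:
        return
    goal, rest = goals[0], goals[1:]
    for fact in FACTS:
        s2 = unify(goal, fact, subst)
        if s2 is not None:
            yield from solve(rest, s2, depth)
    for rule in RULES:
        head, body = rename(rule)
        s2 = unify(goal, head, subst)
        if s2 is not None:
            yield from solve(body + rest, s2, depth - 1)


answers = sorted({walk("?x", s) for s in solve([("livesIn", "Bill", "?x")], {})})
print(answers)
```

Only facts and rules reachable from the query are ever touched, which is the point of grounding at query time rather than materializing all derivable facts.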

URDF: Reasoning Example

Rules
  hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z) [0.4]
  graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y=z

KB: Base Facts
[Figure: knowledge graph over the nodes Jeff, Surajit, David, Stanford, Princeton, Computer Scientist, and University, with weighted edges type[1.0] (five edges), worksAt[0.9], hasAdvisor[0.8], hasAdvisor[0.7], graduatedFrom[0.6], graduatedFrom[0.7], and graduatedFrom[0.9]]

Derived Facts
  gradFr(Surajit,Stanford)
  gradFr(David,Stanford)
  (both shown in the graph as graduatedFrom[?] edges)

URDF: CNF Construction & MaxSat Solving

[Theobald, Sozio, Suchanek, Nakashole: MPII Tech-Report ’10]

Query: graduatedFrom(?x,?y)

CNF
  (¬graduatedFrom(Surajit, Stanford) ∨ ¬graduatedFrom(Surajit, Princeton))
∧ (¬graduatedFrom(David, Stanford) ∨ ¬graduatedFrom(David, Princeton))
∧ (¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford)) [0.4]
∧ (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford)) [0.4]
∧ worksAt(Jeff, Stanford) [0.9]
∧ hasAdvisor(Surajit, Jeff) [0.8]
∧ hasAdvisor(David, Jeff) [0.7]
∧ graduatedFrom(Surajit, Princeton) [0.6]
∧ graduatedFrom(Surajit, Stanford) [0.7]
∧ graduatedFrom(David, Princeton) [0.9]
∧ graduatedFrom(David, Stanford) [0.0]

1) Deductive Grounding
  – Yields only facts and rules which are relevant for answering the query (dependency graph D)

2) Boolean Formula in CNF consisting of
  – Grounded hard rules
  – Grounded soft rules (weighted)
  – Base facts (weighted)

3) Propositional Reasoning
  – Compute truth assignment for all facts in D such that the sum of weights is maximized
  → Compute “most likely” possible world
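To make step 3 concrete, here is a brute-force sketch of constrained weighted MaxSat over the grounded example. It enumerates all truth assignments, which is only feasible for a handful of variables; it is not the URDF solver. The short variable names are hypothetical, and the weights are assigned to the literals in the order they appear on the slide, which is an assumption.

```python
from itertools import product

# Short, hypothetical names for the seven ground facts.
VARS = ["worksAt_Jeff_Stanford", "adv_Surajit", "adv_David",
        "grad_Surajit_Princeton", "grad_Surajit_Stanford",
        "grad_David_Princeton", "grad_David_Stanford"]

# A clause is a list of (variable, polarity) literals.
HARD = [  # grounded mutual-exclusion constraints, must always hold
    [("grad_Surajit_Stanford", False), ("grad_Surajit_Princeton", False)],
    [("grad_David_Stanford", False), ("grad_David_Princeton", False)],
]
SOFT = [  # (weight, clause): grounded soft rules and weighted base facts
    (0.4, [("adv_Surajit", False), ("worksAt_Jeff_Stanford", False),
           ("grad_Surajit_Stanford", True)]),
    (0.4, [("adv_David", False), ("worksAt_Jeff_Stanford", False),
           ("grad_David_Stanford", True)]),
    (0.9, [("worksAt_Jeff_Stanford", True)]),
    (0.8, [("adv_Surajit", True)]),
    (0.7, [("adv_David", True)]),
    (0.6, [("grad_Surajit_Princeton", True)]),
    (0.7, [("grad_Surajit_Stanford", True)]),
    (0.9, [("grad_David_Princeton", True)]),
    (0.0, [("grad_David_Stanford", True)]),
]


def satisfied(clause, world):
    return any(world[var] == polarity for var, polarity in clause)


def solve():
    """Return (weight, world) maximizing soft weight under the hard clauses."""
    best_weight, best_world = -1.0, None
    for bits in product([False, True], repeat=len(VARS)):
        world = dict(zip(VARS, bits))
        if not all(satisfied(c, world) for c in HARD):
            continue  # violates a hard constraint
        weight = sum(w for w, c in SOFT if satisfied(c, world))
        if weight > best_weight:
            best_weight, best_world = weight, world
    return best_weight, best_world


best_weight, best_world = solve()
print(round(best_weight, 2), [v for v in VARS if best_world[v]])
```

Under these weights the best assignment keeps graduatedFrom(Surajit, Stanford) and graduatedFrom(David, Princeton), and leaves the soft rule for David violated because the hard constraint forbids both graduation facts at once.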

URDF: Lineage & Possible Worlds

1) Deductive Grounding
  – Same as before, but trace lineage of query answers

2) Lineage DAG (not CNF!) consisting of
  – Grounded hard rules
  – Grounded soft rules
  – Base facts
  plus: derivation structure

3) Probabilistic Inference
  – Marginalization: aggregate probabilities of all possible worlds where the answer is “true”
  – Drop “impossible worlds”

Query: graduatedFrom(Surajit,?y)

Lineage of the answers (base-fact confidences in brackets):
  graduatedFrom(Surajit, Princeton) ← graduatedFrom(Surajit, Princeton)[0.7]
  graduatedFrom(Surajit, Stanford) ← graduatedFrom(Surajit, Stanford)[0.6] ∨ (hasAdvisor(Surajit,Jeff)[0.8] ∧ worksAt(Jeff,Stanford)[0.9])

Confidence computation:
  hasAdvisor(Surajit,Jeff) ∧ worksAt(Jeff,Stanford): 0.8 × 0.9 = 0.72
  graduatedFrom(Surajit, Stanford): 1 − (1 − 0.72) × (1 − 0.6) = 0.888
  graduatedFrom(Surajit, Princeton) and not Stanford: 0.7 × (1 − 0.888) = 0.078
  graduatedFrom(Surajit, Stanford) and not Princeton: (1 − 0.7) × 0.888 = 0.266
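The numbers on this slide can be reproduced by enumerating the possible worlds over the four independent base facts. The following is a sketch under that independence assumption; the dictionary keys and function names are hypothetical.

```python
from itertools import product

# Independent base facts with the confidences shown on the slide.
BASE = {
    "grad_Princeton": 0.7,  # graduatedFrom(Surajit, Princeton)
    "grad_Stanford": 0.6,   # graduatedFrom(Surajit, Stanford)
    "hasAdvisor": 0.8,      # hasAdvisor(Surajit, Jeff)
    "worksAt": 0.9,         # worksAt(Jeff, Stanford)
}


def stanford(world):
    """Lineage of the answer Stanford: base fact OR (advisor AND worksAt)."""
    return world["grad_Stanford"] or (world["hasAdvisor"] and world["worksAt"])


def princeton(world):
    return world["grad_Princeton"]


def probability(event):
    """Sum the probabilities of all possible worlds in which event holds."""
    names = list(BASE)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        if event(world):
            p = 1.0
            for name in names:
                p *= BASE[name] if world[name] else 1 - BASE[name]
            total += p
    return total


p_stanford = probability(stanford)
# Worlds where both answers hold are "impossible" under the hard constraint.
p_princeton_only = probability(lambda w: princeton(w) and not stanford(w))
p_stanford_only = probability(lambda w: stanford(w) and not princeton(w))
print(round(p_stanford, 3), round(p_princeton_only, 3), round(p_stanford_only, 3))
```

Full enumeration is exponential in the number of base facts, which is why the slides turn to sampling for larger lineage formulas.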

Classes & Complexities

Grounding first-order Horn formulas (Datalog)
  – Decidable
  – EXPTIME-complete, PSPACE-complete (including recursion, but in P w/o recursion)

Max-Sat (Constrained & Weighted)
  – NP-complete

Probabilistic inference in graphical models
  – #P-complete

[Figure: diagram of the logic classes FOL, OWL, OWL-DL/lite, and Horn]

Monte Carlo Simulation (I)

[Karp, Luby, Madras: J.Alg. ’89]

Boolean formula:
  F = X1X2 ∨ X1X3 ∨ X2X3

Naïve sampling:
  cnt = 0
  repeat N times
    randomly choose X1, X2, X3 ∈ {0,1}
    if F(X1, X2, X3) = 1 then cnt = cnt+1
  P = cnt/N
  return P   /* Pr'(F) */

Zero/One-estimator theorem:
  If N ≥ (1/Pr(F)) × (4 ln(2/δ)/ε²) then Pr[ |P/Pr(F) − 1| > ε ] < δ
  → N may be very big for small Pr(F)

Works for any F (not in PTIME)
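The naïve sampling loop translates almost line by line into Python. In this sketch each variable is true with probability p = 0.5; that choice, the fixed seed, and the function names are assumptions made so the run is reproducible.

```python
import random


def f(x1, x2, x3):
    """F = X1X2 ∨ X1X3 ∨ X2X3."""
    return (x1 and x2) or (x1 and x3) or (x2 and x3)


def naive_estimate(n, p=0.5, seed=0):
    """Estimate Pr(F) by sampling n independent assignments."""
    rng = random.Random(seed)
    cnt = 0
    for _ in range(n):
        x1, x2, x3 = (rng.random() < p for _ in range(3))
        if f(x1, x2, x3):
            cnt += 1
    return cnt / n


estimate = naive_estimate(20000)
print(estimate)  # close to the exact value 3p²(1-p) + p³ = 0.5 for p = 0.5
```

When Pr(F) is tiny, almost every sample misses F, which is the weakness the theorem's 1/Pr(F) factor expresses.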

Monte Carlo Simulation (II)

[Karp, Luby, Madras: J.Alg. ’89]

Boolean formula in DNF:
  F = C1 ∨ C2 ∨ . . . ∨ Cm

Improved sampling:
  cnt = 0; S = Pr(C1) + … + Pr(Cm)
  repeat N times
    randomly choose i ∈ {1,2,…, m}, with prob. Pr(Ci)/S
    randomly choose X1, …, Xn ∈ {0,1} s.t. Ci = 1
    if C1=0 and C2=0 and … and Ci-1=0 then cnt = cnt+1
  P = S × cnt/N
  return P   /* Pr'(F) */

Theorem: If N ≥ m × (4 ln(2/δ)/ε²) then Pr[ |P/Pr(F) − 1| > ε ] < δ
  → Now it’s better: the bound no longer depends on Pr(F)

Only for F in DNF, in PTIME
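A compact sketch of the improved estimator for DNF formulas over independent variables follows. The clause encoding (a dict from variable name to required truth value), the function names, and the fixed seed are assumptions; the estimator scales the hit rate by S so that it estimates Pr(F) itself.

```python
import math
import random


def clause_prob(clause, p):
    """Probability that a conjunction of literals holds (independent variables)."""
    return math.prod(p[v] if val else 1 - p[v] for v, val in clause.items())


def holds(clause, world):
    return all(world[v] == val for v, val in clause.items())


def karp_luby(clauses, p, n, seed=0):
    """Estimate Pr(C1 ∨ ... ∨ Cm) with n samples."""
    rng = random.Random(seed)
    weights = [clause_prob(c, p) for c in clauses]
    s = sum(weights)
    cnt = 0
    for _ in range(n):
        # Pick clause i with probability Pr(Ci)/S.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Sample a world conditioned on Ci being true.
        world = {v: rng.random() < p[v] for v in p}
        world.update(clauses[i])
        # Count the sample only if no earlier clause is also satisfied.
        if not any(holds(c, world) for c in clauses[:i]):
            cnt += 1
    return s * cnt / n


clauses = [{"X1": True, "X2": True}, {"X1": True, "X3": True}, {"X2": True, "X3": True}]
p = {"X1": 0.5, "X2": 0.5, "X3": 0.5}
estimate = karp_luby(clauses, p, 20000)
print(estimate)  # close to the exact value 0.5
```

Every sampled world satisfies at least one clause, so the counted fraction is at least 1/m; this is why the required number of samples grows with m rather than with 1/Pr(F).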

Learning “Soft” Rules

Extend Inductive Logic Programming (ILP) techniques to large and incomplete knowledge bases

Goal: learn livesIn(?x,?y) :- bornIn(?x,?y)

Positive Examples: livesIn(?x,?y) ∧ bornIn(?x,?y)

Negative Examples: ¬livesIn(?x,?y) ∧ bornIn(?x,?y) ∧ livesIn(?x,?z)

[Figure: Venn diagram of livesIn(x,y), bornIn(x,y), and livesIn(x,z) over the background knowledge]

Software tools:
http://alchemy.cs.washington.edu/
http://www.doc.ic.ac.uk/~shm/progol.html
http://dtai.cs.kuleuven.be/ml/systems/claudien
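One way to turn positive and negative examples into a rule weight is to count them over the KB and take the ratio. The sketch below does this for the candidate rule livesIn(?x,?y) :- bornIn(?x,?y); the toy facts are invented for illustration, and the confidence formula pos / (pos + neg) is an assumption, not necessarily the measure used by the authors.

```python
# Hypothetical toy KB; none of these facts come from the slides.
born_in = {("Ann", "Paris"), ("Bob", "Rome"), ("Cem", "Oslo"), ("Dee", "Lima")}
lives_in = {("Ann", "Paris"), ("Bob", "Berlin"), ("Cem", "Oslo")}


def rule_confidence(body, head):
    """Count examples for head(x,y) :- body(x,y) in an incomplete KB.

    Positive: body(x,y) and head(x,y) both hold.
    Negative: body(x,y) holds and head(x,z) is known only for some z != y.
    Entities without any head fact are skipped, since the KB is incomplete.
    """
    known = {x for x, _ in head}
    pos = sum(1 for fact in body if fact in head)
    neg = sum(1 for fact in body if fact not in head and fact[0] in known)
    total = pos + neg
    return pos, neg, (pos / total if total else 0.0)


pos, neg, confidence = rule_confidence(born_in, lives_in)
print(pos, neg, confidence)
```

Skipping entities with no known residence (Dee in the toy data) reflects the incompleteness point: a missing livesIn fact is not evidence against the rule.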

More Variants of Consistency Reasoning

• Propositional Reasoning
  – Constrained Weighted MaxSat solver

• Lineage & Possible Worlds (independent base facts)
  – Monte Carlo simulations (Luby-Karp)

• First-Order Logic & Probabilistic Graphical Models
  – Markov Logic (currently via interface to Alchemy*) [Richardson & Domingos: ML’06]
  – Even more general: Factor Graphs [McCallum et al. 2008]
  – MCMC sampling for probabilistic inference

*Alchemy – Open-Source AI: http://alchemy.cs.washington.edu/

Experiments

• YAGO Knowledge Base: 2 million entities, 20 million facts
• Basic query answering: SLD grounding & MaxSat solving of 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …)
• Asymptotic runtime checks: runtime comparisons for synthetic soft rule expansions

• URDF: SLD grounding & MaxSat solving
  |C| - # literals in soft rules
  |S| - # literals in hard rules

• URDF vs. Markov Logic (MAP inference & MC-SAT)

French Marriage Problem (Revisited)

Facts in KB:
1: marriedTo (Hillary, Bill)
2: marriedTo (Carla, Nicolas)
3: marriedTo (Angelina, Brad)

New fact candidates:
4: marriedTo (Cecilia, Nicolas)
5: marriedTo (Carla, Benjamin)
6: marriedTo (Carla, Mick)
7: divorced (Madonna, Guy)
8: domPartner (Angelina, Brad)

Temporal annotations:
validFrom (2, 2008)
validFrom (4, 1996), validUntil (4, 2007)
validFrom (5, 2010)
validFrom (6, 2006)
validFrom (7, 2008)
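With validity intervals attached to facts, the temporal marriage constraint (two marriages of the same person to different partners must have disjoint intervals) can be checked pairwise. In this sketch the fact ids follow the numbering on the slide; treating a missing validFrom or validUntil as an open-ended interval is an assumption.

```python
import math
from itertools import combinations

# marriedTo facts and candidates, keyed by fact id.
married = {
    1: ("Hillary", "Bill"), 2: ("Carla", "Nicolas"), 3: ("Angelina", "Brad"),
    4: ("Cecilia", "Nicolas"), 5: ("Carla", "Benjamin"), 6: ("Carla", "Mick"),
}
valid_from = {2: 2008, 4: 1996, 5: 2010, 6: 2006}
valid_until = {4: 2007}


def interval(fact_id):
    """Validity interval of a fact; missing bounds are treated as open-ended."""
    return (valid_from.get(fact_id, -math.inf), valid_until.get(fact_id, math.inf))


def disjoint(a, b):
    return a[1] < b[0] or b[1] < a[0]


# marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z ⇒ disjoint(t1,t2)
conflicts = [
    (i, j)
    for i, j in combinations(sorted(married), 2)
    if married[i][0] == married[j][0]
    and married[i][1] != married[j][1]
    and not disjoint(interval(i), interval(j))
]
print(conflicts)
```

All three Carla facts conflict pairwise because none of them has a validUntil; harvesting an end date for facts 5 or 6 could resolve some of these conflicts without discarding a fact.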

Challenge: Temporal Knowledge Harvesting

For all people in Wikipedia (100,000’s) gather all spouses, incl. divorced & widowed, and corresponding time periods! >95% accuracy, >95% coverage, in one night!

Difficult Dating

(Even More Difficult) Implicit Dating

• vague dates
• relative dates
• narrative text
• relative order

TARSQI: Extracting Time Annotations

Hong Kong is poised to hold the first election in more than half <TIMEX3 tid="t3" TYPE="DURATION" VAL="P100Y">a century</TIMEX3> that includes a democracy advocate seeking high office in territory controlled by the Chinese government in Beijing. A pro-democracy politician, Alan Leong, announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations to appear on the ballot to become the territory’s next chief executive. But he acknowledged that he had no chance of beating the Beijing-backed incumbent, Donald Tsang, who is seeking re-election. Under electoral rules imposed by Chinese officials, only 796 people on the election committee – the bulk of them with close ties to mainland China – will be allowed to vote in the <TIMEX3 tid="t5" TYPE="DATE" VAL="20070325">March 25</TIMEX3> election. It will be the first contested election for chief executive since Britain returned Hong Kong to China in <TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>. Mr. Tsang, an able administrator who took office during the early stages of a sharp economic upturn in <TIMEX3 tid="t7" TYPE="DATE" VAL="2005">2005</TIMEX3>, is popular with the general public. Polls consistently indicate that three-fifths of Hong Kong’s people approve of the job he has been doing. It is of course a foregone conclusion – Donald Tsang will be elected and will hold office for <TIMEX3 tid="t9" beginPoint="t0" endPoint="t8“ TYPE="DURATION" VAL="P5Y">another five years </TIMEX3>, said Mr. Leong, the former chairman of the Hong Kong Bar Association.

[Verhagen et al: ACL’05] http://www.timeml.org/site/tarsqi/

extraction errors!

13 Relations between Time Intervals

A Before B      B After A
A Meets B       B MetBy A
A Overlaps B    B OverlappedBy A
A Starts B      B StartedBy A
A During B      B Contains A
A Finishes B    B FinishedBy A
A Equal B

[Figure: interval diagrams of A and B for each relation]

[Allen, 1984; Allen & Hayes, 1989]
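The 13 relations can be decided from the interval endpoints alone. Below is a minimal classifier, assuming each interval is a (start, end) pair with start < end; the function name `allen` is hypothetical.

```python
def allen(a, b):
    """Return the Allen relation that holds between intervals a and b."""
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start:
        return "Before"
    if b_end < a_start:
        return "After"
    if a_end == b_start:
        return "Meets"
    if b_end == a_start:
        return "MetBy"
    if a_start == b_start and a_end == b_end:
        return "Equal"
    if a_start == b_start:
        return "Starts" if a_end < b_end else "StartedBy"
    if a_end == b_end:
        return "Finishes" if a_start > b_start else "FinishedBy"
    if b_start < a_start and a_end < b_end:
        return "During"
    if a_start < b_start and b_end < a_end:
        return "Contains"
    # Remaining case: the intervals properly overlap without nesting.
    return "Overlaps" if a_start < b_start else "OverlappedBy"


print(allen((1996, 2007), (2008, 2012)))  # Before
print(allen((2006, 2010), (2008, 2012)))  # Overlaps
```

Swapping the arguments always yields the inverse relation from the right-hand column, for example `allen((2008, 2012), (2006, 2010))` gives OverlappedBy.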

Possible Worlds in Time (I)

[Wang, Yahya, Theobald: VLDB/MUD Workshop ’10]

Base Facts (state relations):
  playsFor(Beckham, Real)
  playsFor(Ronaldo, Real)

Rule:
  playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1,T2) ⇒ teamMates(Beckham, Ronaldo, T3)

Derived Facts (state relation):
  teamMates(Beckham, Ronaldo, T3)

[Figure: confidence histograms over time bins between ’00 and ’07 for the two base facts and for the derived fact]

Possible Worlds in Time (II)

[Wang, Yahya, Theobald: VLDB/MUD Workshop ’10]

Base Facts:
  playsFor(Beckham, United) (state)
  wonCup(United, ChampionsLeague) (event)

Rule:
  playsFor(Beckham, United, T1) ∧ wonCup(United, ChampionsL, T2) ∧ overlaps(T1,T2) ⇒ won(Beckham, ChampionsL, T3)

Derived Facts:
  won(Beckham, ChampionsL, T3) (event)

[Figure: confidence histograms over time bins between ’95 and ’02 for the base facts and the derived fact, with independent and non-independent bins marked]

Need Lineage!

• Closed and complete representation model (incl. lineage)
  → Stanford Trio project [Widom: CIDR’05, Benjelloun et al: VLDB’06]
• Interval computation remains linear in the number of bins
• Confidence computation per bin is #P-complete
  → In general requires possible-worlds-based sampling techniques (Luby-Karp, Gibbs sampling, etc.)
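The claim that interval computation stays linear in the number of bins can be illustrated with a single merge pass over two sorted histograms. This sketch multiplies the bin confidences, which is only valid if the two base facts are independent (the non-independent case is exactly where lineage is needed); the bin encoding and the toy numbers are hypothetical and not taken from the figures.

```python
def join_histograms(h1, h2):
    """Intersect two sorted lists of (start, end, confidence) bins.

    Each step advances one of the two cursors, so the running time is
    linear in the total number of bins.
    """
    i = j = 0
    out = []
    while i < len(h1) and j < len(h2):
        start = max(h1[i][0], h2[j][0])
        end = min(h1[i][1], h2[j][1])
        if start < end:  # the two bins overlap
            out.append((start, end, h1[i][2] * h2[j][2]))
        # Advance whichever bin ends first.
        if h1[i][1] <= h2[j][1]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical histograms for two playsFor facts.
plays_a = [(2003, 2005, 0.5), (2005, 2007, 0.5)]
plays_b = [(2004, 2006, 0.5)]
team_mates = join_histograms(plays_a, plays_b)
print(team_mates)
```

The derived histogram only has bins where both inputs overlap, and its bin boundaries are the union of the input boundaries inside that region.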


UViz: The URDF Visualization Engine

• UViz System Architecture
  – Flash client
  – Tomcat server (JRE)
  – Relational backend (JDBC)
  – Remote Method Invocation & Object Serialization (BlazeDS)

Demo!
