10 Years of Probabilistic Querying – What Next?
Martin Theobald, University of Antwerp
Joint work with Maximilian Dylla, Sairam Gurajada, Angelika Kimmig,
Andre Melo, Iris Miliaraki, Luc de Raedt, Mauro Sozio, Fabian Suchanek
“ The important thing is to not stop questioning ... One cannot help but be in awe when contemplating the mysteries of eternity, of life, of the marvelous structure of reality. It is enough if one tries merely to comprehend a little of this mystery every day.”
- Albert Einstein, 1936
“The Marvelous Structure of Reality”, Joseph M. Hellerstein, Keynote at WebDB 2003, San Diego
Look, There is Structure!
Plethora of natural-language-processing techniques & tools: Part-of-Speech (POS) Tagging, Named-Entity Recognition & Disambiguation (NERD), Dependency Parsing, Semantic Role Labeling
Text is not just “unstructured data” But:
Even the best NLP tools frequently yield errors
Facts found on the Web are logically inconsistent
Web-extracted knowledge bases are inherently incomplete
bornOn(Jeff, 09/22/42), gradFrom(Jeff, Columbia), hasAdvisor(Jeff, Arthur), hasAdvisor(Surajit, Jeff), knownFor(Jeff, Theory)
type(Jeff, Author) [0.9], author(Jeff, Drag_Book) [0.8], author(Jeff, Cind_Book) [0.6]
worksAt(Jeff, Bell_Labs) [0.7], type(Jeff, CEO) [0.4]
Information Extraction: YAGO/DBpedia et al.
New fact candidates
>120 M facts for YAGO2 (mostly from Wikipedia infoboxes)
100's of millions of additional facts from Wikipedia free-text
http://www.mpi-inf.mpg.de/yago-naga/
YAGO Knowledge Base
[Diagram: excerpt of the YAGO entity graph. Entities such as Max_Planck, Erwin_Planck, Angela_Merkel, Kiel, Schleswig-Holstein, Germany, the Max_Planck Society, and the Nobel Prize are connected via relations such as bornOn ("Apr 23, 1858"), diedOn ("Oct 4, 1947"), bornIn, fatherOf, hasWon, citizenOf, and locatedIn, carry name strings via means (e.g. "Max Planck", "Max Karl Ernst Ludwig Planck", "Angela Merkel", "Angela Dorothea Merkel"), and are linked via instanceOf and subclass edges to classes such as Person, Scientist, Physicist, Biologist, Politician, City, State, Country, Location, and Organization.]
3 M entities, 120 M facts, 100 relations, 200k classes; accuracy ~95%
http://linkeddata.org/
Linked Open Data
As of Sept. 2011: >200 linked-data sources, >30 billion RDF triples, >400 million owl:sameAs links
Maybe Even More Importantly: Linked Vocabularies!
Source: http://en.wikipedia.org/wiki/Linked_data
LinkedData.org: instance & class links between DBpedia, WordNet, OpenCyc, GeoNames, and many more…
Schema.org: common vocabulary released by Google, Yahoo!, and Bing to annotate Web pages, incl. links to DBpedia.
Micro-Formats: RDFa (W3C)
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:dc="http://purl.org/dc/elements/1.1/" version="XHTML+RDFa 1.0" xml:lang="en"> <head><title>Martin's Home Page</title> <base href="http://adrem.ua.ac.be/~tmartin/" /> <meta property="dc:creator" content= "Martin" /> </head>
As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase
Application I: Enrichment of Search Results
“Recent Advances in Structured Data and the Web.” Alon Y. Halevy, Keynote at ICDE 2013, Brisbane
It’s about the disappearance forty years ago of Harriet Vanger, a young scion of one of the wealthiest families in Sweden, and about her uncle, determined to know the truth about what he believes was her murder. Blomkvist visits Henrik Vanger at his estate on the tiny island of Hedeby. The old man draws Blomkvist in by promising solid evidence against Wennerström. Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real assignment: the disappearance of Vanger's niece Harriet some 40 years earlier. Hedeby is home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist becomes acquainted with the members of the extended Vanger family, most of whom resent his presence. He does, however, start a short-lived affair with Cecilia, the niece of Henrik. After discovering that Salander has hacked into his computer, he persuades her to assist him with research. They eventually become lovers, but Blomkvist has trouble getting close to Lisbeth, who treats virtually everyone she meets with hostility. Ultimately the two discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer. A 24-year-old computer hacker sporting an assortment of tattoos and body piercings supports herself by doing deep background investigations for Dragan Armansky, who, in turn, worries that Lisbeth Salander is “the perfect victim for anyone who wished her ill."
Application II: Machine Reading
[Diagram: relations extracted from the text, e.g. uncleOf, owns, hires, headOf, affairWith, enemyOf, plus "same" links connecting co-referent mentions of the same entities.]
Etzioni, Banko, Cafarella: Machine Reading. AAAI'06
Mitchell, Carlson et al.: Toward an Architecture for Never-Ending Language Learning. AAAI'10
Application III: Natural-Language Question Answering
evi.com (formerly trueknowledge.com)
Application III: Natural-Language Question Answering
wolframalpha.com: >10 trillion(!) facts, >50,000 search algorithms, >5,000 visualizations
IBM Watson: Deep Question Answering
99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain
This town is known as "Sin City" & its downtown is "Glitter Gulch"
William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel
As of 2010, this is the only former Yugoslav republic in the EU
www.ibm.com/innovation/us/watson/index.htm
Knowledge back-ends
Question classification & decomposition
D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.
http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/
Multilingual Question Answering over Linked Data (QALD-3), CLEF 2011-13
Natural-Language QA over Linked Data
<question id="4" answertype="resource" aggregation="false" onlydbo="true"> <string lang="en">Which river does the Brooklyn Bridge cross?</string> <string lang="de">Welchen Fluss überspannt die Brooklyn Bridge?</string> <string lang="es">¿Por qué río cruza la Brooklyn Bridge?</string> <string lang="it">Quale fiume attraversa il ponte di Brooklyn?</string> <string lang="fr">Quelle cours d'eau est traversé par le pont de Brooklyn?</string> <string lang="nl">Welke rivier overspant de Brooklyn Bridge?</string> <keywords lang="en">river, cross, Brooklyn Bridge</keywords> <keywords lang="de">Fluss, überspannen, Brooklyn Bridge</keywords> <keywords lang="es">río, cruza, Brooklyn Bridge</keywords> <keywords lang="it">fiume, attraversare, ponte di Brooklyn</keywords> <keywords lang="fr">cours d'eau, pont de Brooklyn</keywords> <keywords lang="nl">rivier, Brooklyn Bridge, overspant</keywords> <query> PREFIX dbo: <http://dbpedia.org/ontology/> PREFIX res: <http://dbpedia.org/resource/> SELECT DISTINCT ?uri WHERE { res:Brooklyn_Bridge dbo:crosses ?uri . } </query></question>
<topic id="2012374" category="Politics"> <jeopardy_clue>Which German politician is a successor of another politician who stepped down before his or her actual term was over, and what is the name of their political ancestor?</jeopardy_clue> <keyword_title>German politicians successor other stepped down before actual term name ancestor</keyword_title> <sparql_ft> SELECT ?s ?s1 WHERE { ?s rdf:type <http://dbpedia.org/class/yago/GermanPoliticians> . ?s1 <http://dbpedia.org/property/successor> ?s . FILTER FTContains (?s, "stepped down early") . } </sparql_ft></topic>
https://inex.mmci.uni-saarland.de/tracks/lod/
INEX Linked Data Track, CLEF 2012-13
Natural-Language QA over Linked Data
Outline
Probabilistic Databases: Stanford's Trio System (data, uncertainty & lineage); handling uncertain RDF data with URDF (Max Planck Institute / U Antwerp)
Probabilistic & Temporal Databases: sequenced vs. non-sequenced semantics; interval alignment & probabilistic inference
Probabilistic Programming: statistical relational learning; learning "interesting" deduction rules
Summary & Challenges
Probabilistic Databases: A Panacea for All of the Aforementioned Tasks?
Probabilistic databases combine first-order logic and probability theory in an elegant way:
Declarative: queries formulated in SQL/Relational Algebra/Datalog; support for updates, transactions, etc.
Deductive: well-studied resolution algorithms for SQL/Relational Algebra/Datalog (top-down/bottom-up), indexes, automatic query optimization.
Scalable (?): polynomial data complexity (SQL), but #P-complete for the probabilistic inference.
Probabilistic Database
A probabilistic database Dp (compactly) encodes a probability distribution over a finite set of deterministic database instances Di.

Example: a tuple-independent PDB over WorksAt(Sub, Obj) with P(Jeff, Stanford) = 0.6 and P(Jeff, Princeton) = 0.7 encodes four possible instances:
{ (Jeff, Stanford), (Jeff, Princeton) }   0.42
{ (Jeff, Stanford) }   0.18
{ (Jeff, Princeton) }   0.28
{ }   0.12

Query semantics ("marginal probabilities"): run query Q against each instance Di; for each answer tuple t, sum up the probabilities of all instances Di in which t exists.

Special cases:
(I) Tuple-independent PDB: WorksAt(Jeff, Stanford) with p = 0.6; WorksAt(Jeff, Princeton) with p = 0.7.
(II) Block-independent PDB: WorksAt(Jeff, Stanford) with p = 0.6 ∥ WorksAt(Jeff, Princeton) with p = 0.4 (alternatives within one block).
Note: (I) and (II) are not equivalent!
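To make the possible-worlds encoding concrete, here is a minimal Python sketch (not part of the original slides) that enumerates the worlds of the tuple-independent example above and derives the marginal probability of each answer tuple:

```python
from itertools import product

# Tuple-independent PDB: each tuple is present independently with its probability.
pdb = [(("Jeff", "Stanford"), 0.6),
       (("Jeff", "Princeton"), 0.7)]

def possible_worlds(tuples):
    """Yield every deterministic instance Di together with its probability."""
    for bits in product([True, False], repeat=len(tuples)):
        world = {t for (t, _), present in zip(tuples, bits) if present}
        prob = 1.0
        for (_, p), present in zip(tuples, bits):
            prob *= p if present else 1.0 - p
        yield world, prob

# Marginal probability of an answer tuple = sum of the probabilities of the worlds containing it.
marginals, total = {}, 0.0
for world, prob in possible_worlds(pdb):
    total += prob              # the four world probabilities: 0.42, 0.18, 0.28, 0.12
    for t in world:
        marginals[t] = marginals.get(t, 0.0) + prob

print(round(total, 4))         # 1.0
print({t: round(p, 4) for t, p in marginals.items()})
# marginals: 0.6 for (Jeff, Stanford) and 0.7 for (Jeff, Princeton)
```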
Stanford Trio System
1. Alternatives  2. ‘?’ (Maybe) Annotations  3. Confidence values  4. Lineage
Uncertainty-Lineage Databases (ULDBs)
[Widom: CIDR 2005]
Trio’s Data Model
1. Alternatives: uncertainty about value
Saw (witness, color, car)
Amy red, Honda ∥ red, Toyota ∥ orange, Mazda
Three possible instances
Six possible instances
Trio’s Data Model
1. Alternatives2. ‘?’ (Maybe): uncertainty about presence
?
Saw (witness, color, car)
Amy red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty blue, Acura
Trio’s Data Model
1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences: weighted uncertainty
Still six possible instances, each with a probability
?
Saw (witness, color, car)
Amy red, Honda 0.5 ∥ red, Toyota 0.3 ∥ orange, Mazda 0.2
Betty blue, Acura 0.6
So Far: Model is Not Closed
Saw (witness, car): Cathy: Honda ∥ Mazda
Drives (person, car): Jimmy, Toyota ∥ Jimmy, Mazda; Billy, Honda ∥ Frank, Honda; Hank, Honda
Suspects = πperson(Saw ⋈ Drives): Jimmy; Billy ∥ Frank; Hank
Problem: this does not correctly capture the possible instances in the result.
Example with Lineage

Saw (witness, car):
11  Cathy  Honda ∥ Mazda

Drives (person, car):
21  Jimmy, Toyota ∥ Jimmy, Mazda
22  Billy, Honda ∥ Frank, Honda
23  Hank, Honda

Suspects = πperson(Saw ⋈ Drives):
31  Jimmy
32  Billy ∥ Frank
33  Hank

Lineage:
λ(31) = (11,2) ∧ (21,2)
λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
λ(33) = (11,1) ∧ 23

With lineage, the result correctly captures the possible instances.
Operational Semantics
Closure: a representation Dp′ of the query result always exists.
Completeness: any (finite) set of possible instances can be represented.
[Diagram: Dp expands into its possible instances D1, D2, …, Dn; running Q on each instance yields D1′, D2′, …, Dm′, which are again representable as Dp′; the direct implementation computes Dp′ from Dp without enumerating the instances.]
But: data complexity is #P-complete!
Summary on Trio’s Data Model
1. Alternatives  2. ‘?’ (Maybe) Annotations  3. Confidence values  4. Lineage
Uncertainty-Lineage Databases (ULDBs)
Theorem: ULDBs are closed and complete.
Formally studied properties like minimization, equivalence, approximation, and membership based on lineage. [Benjelloun, Das Sarma, Halevy, Widom, Theobald: VLDB J. 2008]
Basic Complexity Issue
Theorem [Valiant 1979]: For a Boolean expression E, computing Pr(E) is #P-complete.
NP = the class of problems of the form "is there a witness?" (e.g., SAT); #P = the class of problems of the form "how many witnesses?" (e.g., #SAT).
The decision problem for 2CNF is in PTIME; the counting problem for 2CNF is already #P-complete.
(We will come back to this later.)
[Suciu & Dalvi: SIGMOD’05 Tutorial on "Foundations of Probabilistic Answers to Queries"]
…back to Information Extraction
bornIn(Barack, Honolulu)
bornIn(Barack, Kenya)
Uncertain RDF (URDF): Facts & Rules
Extensional knowledge (the "facts"):
High-confidence facts: existing knowledge base ("ground truth")
New fact candidates: extracted fact candidates with confidences
Linked Data & integration of various knowledge sources: ontology merging or explicitly linked facts (owl:sameAs, owl:equivalentProperty)
→ a large "probabilistic database" of RDF facts
Intensional knowledge (the "rules"):
Soft rules: deductive grounding & lineage (Datalog/SLD resolution)
Hard rules: consistency constraints (more general FOL rules)
→ propositional & probabilistic inference, at query time!
Soft Rules vs. Hard Rules: (Soft) Deduction Rules vs. (Hard) Consistency Constraints

People may live in more than one place:
livesIn(x,y) ∧ marriedTo(x,z) → livesIn(z,y)   [0.8]
livesIn(x,y) ∧ hasChild(x,z) → livesIn(z,y)   [0.5]
→ Deductive database: Datalog, the core of SQL & Relational Algebra, RDF/S, OWL2-RL, etc.

People are not born in different places/on different dates:
bornIn(x,y) ∧ bornIn(x,z) → y=z
bornOn(x,y) ∧ bornOn(x,z) → y=z
People are not married to more than one person (at the same time, in most countries?):
marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z → disjoint(t1,t2)
→ More general FOL constraints: Datalog plus constraints, X-tuples in PDBs, owl:FunctionalProperty, owl:disjointWith, etc.
URDF Running Example

KB: RDF base facts
worksAt(Jeff, Stanford) [0.9]
type(Jeff, Computer_Scientist) [1.0], type(Surajit, Computer_Scientist) [1.0], type(David, Computer_Scientist) [1.0]
type(Stanford, University) [1.0], type(Princeton, University) [1.0]
graduatedFrom(Surajit, Princeton) [0.7], graduatedFrom(Surajit, Stanford) [0.6], graduatedFrom(David, Princeton) [0.9]
hasAdvisor(Surajit, Jeff) [0.8], hasAdvisor(David, Jeff) [0.7]

Derived facts
gradFrom(Surajit, Stanford) [?], gradFrom(David, Stanford) [?]

Rules
hasAdvisor(x,y) ∧ worksAt(y,z) → graduatedFrom(x,z)   [0.4]
graduatedFrom(x,y) ∧ graduatedFrom(x,z) → y=z
Basic Types of Inference
MAP inference: find the most likely assignment to the query variables y under a given evidence x. Compute arg max_y P(y | x) (NP-complete; MaxSAT).
Marginal/success probabilities: the probability that a query y is true in a random world under a given evidence x. Compute ∑_y P(y | x) (#P-complete already for conjunctive queries).
General Route: Grounding & MaxSAT Solving

Query: graduatedFrom(x, y)

Grounded CNF:
(¬graduatedFrom(Surajit, Stanford) ∨ ¬graduatedFrom(Surajit, Princeton))   [1000]
(¬graduatedFrom(David, Stanford) ∨ ¬graduatedFrom(David, Princeton))   [1000]
(¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))   [0.4]
(¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))   [0.4]
Weighted base facts (unit clauses): worksAt(Jeff, Stanford) [0.9], hasAdvisor(Surajit, Jeff) [0.8], hasAdvisor(David, Jeff) [0.7], graduatedFrom(Surajit, Princeton) [0.7], graduatedFrom(Surajit, Stanford) [0.6], graduatedFrom(David, Princeton) [0.9]

1) Grounding: consider only the facts (and rules) that are relevant for answering the query.
2) Build a propositional formula in CNF, consisting of the grounded soft & hard rules and the weighted base facts.
3) Propositional reasoning: find a truth assignment to the facts such that the total weight of the satisfied clauses is maximized.

MAP inference: compute the "most likely" possible world.
Find arg max_y P(y | x); this resolves to a variant of MaxSAT for propositional formulas.
[Theobald, Sozio, Suchanek, Nakashole: VLDS'12]
URDF: MaxSAT Solving with Soft & Hard Rules

Special case: Horn clauses as soft rules & mutex constraints as hard rules.

S: mutex constraints:
{ graduatedFrom(Surajit, Stanford), graduatedFrom(Surajit, Princeton) }
{ graduatedFrom(David, Stanford), graduatedFrom(David, Princeton) }

C: weighted Horn clauses (CNF):
(¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))   [0.4]
(¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))   [0.4]
worksAt(Jeff, Stanford) [0.9], hasAdvisor(Surajit, Jeff) [0.8], hasAdvisor(David, Jeff) [0.7], graduatedFrom(Surajit, Princeton) [0.7], graduatedFrom(Surajit, Stanford) [0.6], graduatedFrom(David, Princeton) [0.9]

MaxSAT algorithm:
Compute W0 = ∑_{clauses C} w(C) P(C is satisfied);
For each hard constraint S_t {
  For each fact f in S_t {
    Compute W_f+,t = ∑_{clauses C} w(C) P(C is sat. | f = true);
  }
  Compute W_S-,t = ∑_{clauses C} w(C) P(C is sat. | S_t = false);
  Choose the truth assignment to the facts f in S_t that maximizes W_f+,t, W_S-,t;
  Remove satisfied clauses C; t++;
}

Runtime: O(|S||C|); approximation guarantee of 1/2.
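To make the MAP/MaxSAT step tangible, here is a small Python sketch (not from the talk) that grounds the running example into weighted clauses and simply brute-forces the maximum-weight assignment; the actual URDF algorithm instead performs the greedy pass over the mutex constraints sketched above, with the stated 1/2-approximation guarantee. The variable names are abbreviations introduced here.

```python
import itertools

# Weighted clauses over the running example; hard mutex constraints get a large weight.
# Each clause: (weight, [(variable, required_truth_value), ...]).
facts = {"worksAt_J_Stan": 0.9, "adv_S_J": 0.8, "adv_D_J": 0.7,
         "grad_S_Pri": 0.7, "grad_S_Stan": 0.6, "grad_D_Pri": 0.9,
         "grad_D_Stan": 0.0}  # derived fact, no base confidence of its own
clauses = [
    (1000.0, [("grad_S_Stan", False), ("grad_S_Pri", False)]),  # mutex (hard)
    (1000.0, [("grad_D_Stan", False), ("grad_D_Pri", False)]),  # mutex (hard)
    (0.4, [("adv_S_J", False), ("worksAt_J_Stan", False), ("grad_S_Stan", True)]),  # soft rule
    (0.4, [("adv_D_J", False), ("worksAt_J_Stan", False), ("grad_D_Stan", True)]),  # soft rule
] + [(w, [(f, True)]) for f, w in facts.items() if w > 0]        # weighted unit clauses

def satisfied_weight(assignment):
    """Total weight of the clauses satisfied by a complete truth assignment."""
    return sum(w for w, lits in clauses
               if any(assignment[v] == val for v, val in lits))

best = max(itertools.product([False, True], repeat=len(facts)),
           key=lambda bits: satisfied_weight(dict(zip(facts, bits))))
print(dict(zip(facts, best)))
# e.g. keeps graduatedFrom(Surajit, Stanford) but drops graduatedFrom(Surajit, Princeton)
```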
Experiment (I): MAP Inference
URDF: grounding & MaxSAT solving.
|C| = number of literals in the grounded soft rules; |S| = number of literals in the grounded hard rules.
URDF MaxSAT vs. Markov Logic (MAP inference & MC-SAT).
YAGO knowledge base: 2 million entities, 20 million facts.
Query answering: deductive grounding & MaxSAT solving for 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …).
Asymptotic runtime checks via synthetic (random) soft-rule expansions.
Basic Types of Inference
✔ MAP inference: find the most likely assignment to the query variables y under a given evidence x. Compute arg max_y P(y | x) (NP-complete; MaxSAT).
Marginal/success probabilities: the probability that a query y is true in a random world under a given evidence x. Compute ∑_y P(y | x) (#P-complete already for conjunctive queries).
[Yahya, Theobald: RuleML'11; Dylla, Miliaraki, Theobald: ICDE'13]
Deductive Grounding with Lineage (SLD Resolution in Datalog/Prolog)

Query: graduatedFrom(Surajit, y)
Answers: Q1 = graduatedFrom(Surajit, Princeton), Q2 = graduatedFrom(Surajit, Stanford)

[Lineage DAG over the base facts A = graduatedFrom(Surajit, Princeton) [0.7], B = graduatedFrom(Surajit, Stanford) [0.6], C = hasAdvisor(Surajit, Jeff) [0.8], D = worksAt(Jeff, Stanford) [0.9]:
Q1 = A ∧ ¬(B ∨ (C ∧ D)), Q2 = ¬A ∧ (B ∨ (C ∧ D))]

Rules
hasAdvisor(x,y) ∧ worksAt(y,z) → graduatedFrom(x,z)   [0.4]
graduatedFrom(x,y) ∧ graduatedFrom(x,z) → y=z

Base facts
graduatedFrom(Surajit, Princeton) [0.7]
graduatedFrom(Surajit, Stanford) [0.6]
graduatedFrom(David, Princeton) [0.9]
hasAdvisor(Surajit, Jeff) [0.8]
hasAdvisor(David, Jeff) [0.7]
worksAt(Jeff, Stanford) [0.9]
type(Princeton, University) [1.0]
type(Stanford, University) [1.0]
type(Jeff, Computer_Scientist) [1.0]
type(Surajit, Computer_Scientist) [1.0]
type(David, Computer_Scientist) [1.0]
Lineage & Possible Worlds

1) Deductive grounding: dependency graph of the query; trace the lineage of individual query answers.
2) Lineage DAG (not in CNF), consisting of the grounded soft & hard rules and the probabilistic base facts.
3) Probabilistic inference: compute marginals. P(Q): sum up the probabilities of all possible worlds that entail the query answer's lineage. P(Q|H): drop the "impossible" worlds.

Query: graduatedFrom(Surajit, y)
Answers: Q1 = graduatedFrom(Surajit, Princeton), Q2 = graduatedFrom(Surajit, Stanford)
Lineage (with A = graduatedFrom(Surajit, Princeton) [0.7], B = graduatedFrom(Surajit, Stanford) [0.6], C = hasAdvisor(Surajit, Jeff) [0.8], D = worksAt(Jeff, Stanford) [0.9]):
Q1 = A ∧ ¬(B ∨ (C ∧ D)), Q2 = ¬A ∧ (B ∨ (C ∧ D))
P(C ∧ D) = 0.8 × 0.9 = 0.72
P(B ∨ (C ∧ D)) = 1 − (1 − 0.72) × (1 − 0.6) = 0.888
P(Q1) = 0.7 × (1 − 0.888) = 0.0784
P(Q2) = (1 − 0.7) × 0.888 = 0.2664

[Das Sarma, Theobald, Widom: ICDE'08; Dylla, Miliaraki, Theobald: ICDE'13]
Possible Worlds Semantics

Base facts: A: 0.7, B: 0.6, C: 0.8, D: 0.9
Q2 = ¬A ∧ (B ∨ (C ∧ D)); hard rule H: A and (B ∨ (C ∧ D)) are mutually exclusive

A B C D | Q2 | P(W)
1 1 1 1 | 0 | 0.7×0.6×0.8×0.9 = 0.3024
1 1 1 0 | 0 | 0.7×0.6×0.8×0.1 = 0.0336
1 1 0 1 | 0 | 0.0756
1 1 0 0 | 0 | 0.0084
1 0 1 1 | 0 | 0.2016
1 0 1 0 | 0 | 0.0224
1 0 0 1 | 0 | 0.0504
1 0 0 0 | 0 | 0.0056
0 1 1 1 | 1 | 0.3×0.6×0.8×0.9 = 0.1296
0 1 1 0 | 1 | 0.3×0.6×0.8×0.1 = 0.0144
0 1 0 1 | 1 | 0.3×0.6×0.2×0.9 = 0.0324
0 1 0 0 | 1 | 0.3×0.6×0.2×0.1 = 0.0036
0 0 1 1 | 1 | 0.3×0.4×0.8×0.9 = 0.0864
0 0 1 0 | 0 | 0.0096
0 0 0 1 | 0 | 0.0216
0 0 0 0 | 0 | 0.0024
(sum = 1.0)

P(Q1) = 0.0784; P(Q1|H) = 0.0784 / 0.412 = 0.1903
P(Q2) = 0.2664; P(Q2|H) = 0.2664 / 0.412 = 0.6466
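As a sanity check (not on the original slides), the marginals in the table above can be reproduced by a few lines of brute-force Python over the four base facts, using the lineage formulas for Q1 and Q2 as given:

```python
from itertools import product

# A = gradFrom(Surajit, Princeton), B = gradFrom(Surajit, Stanford),
# C = hasAdvisor(Surajit, Jeff),    D = worksAt(Jeff, Stanford)
prob = {"A": 0.7, "B": 0.6, "C": 0.8, "D": 0.9}

def marginal(lineage):
    """Sum the probabilities of all possible worlds that satisfy the lineage formula."""
    total = 0.0
    for bits in product([True, False], repeat=4):
        world = dict(zip("ABCD", bits))
        p_world = 1.0
        for v, present in world.items():
            p_world *= prob[v] if present else 1.0 - prob[v]
        if lineage(world):
            total += p_world
    return total

q2_raw = lambda w: w["B"] or (w["C"] and w["D"])   # lineage before the hard rule
q1 = lambda w: w["A"] and not q2_raw(w)            # gradFrom(Surajit, Princeton)
q2 = lambda w: (not w["A"]) and q2_raw(w)          # gradFrom(Surajit, Stanford)

print(round(marginal(q2_raw), 4))   # 0.888
print(round(marginal(q1), 4))       # 0.0784
print(round(marginal(q2), 4))       # 0.2664
```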
Inference in Probabilistic Databases
Safe query plans [Dalvi, Suciu: VLDB-J'07]: confidences can be propagated along with the relational operators.
Read-once functions [Sen, Deshpande, Getoor: PVLDB'10]: the Boolean formula can be factorized (in polynomial time) into read-once form, where every variable occurs at most once.
Knowledge compilation [Olteanu et al.: ICDT'10, ICDT'11]: the Boolean formula can be decomposed into an ordered binary decision diagram (OBDD), such that inference resolves to independent-and and independent-or operations over the decomposed formula.
Top-k pruning [Ré, Dalvi, Suciu: ICDE'07; Karp, Luby, Madras: J-Alg.'89]: the top-k answers can be returned based on lower and upper bounds, even without knowing their exact marginal probabilities. Multi-simulation: run multiple Markov-Chain-Monte-Carlo (MCMC) simulations in parallel.
Monte Carlo Simulation (I)
[Suciu & Dalvi: SIGMOD'05 Tutorial on "Foundations of Probabilistic Answers to Queries"; Karp, Luby, Madras: J-Alg.'89]

Boolean formula: E = X1X2 ∨ X1X3 ∨ X2X3

Naïve sampling (works for any E, but not in PTIME):
cnt = 0
repeat N times:
  randomly choose X1, X2, X3 ∈ {0,1}
  if E(X1, X2, X3) = 1 then cnt = cnt + 1
P = cnt / N
return P   /* estimate for the true Pr(E) */

Zero/One-Estimator Theorem: if N ≥ (1/Pr(E)) × (4 ln(2/δ)/ε²), then Pr[ |P/Pr(E) − 1| > ε ] < δ.
But N may be very big for small Pr(E).
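A runnable version of the naïve estimator (a sketch added here, not from the tutorial slides), with the variables' truth probabilities made explicit:

```python
import random

def naive_mc(E, probs, N=200_000):
    """Zero/one estimator: sample each Xi independently, count satisfying samples."""
    cnt = 0
    for _ in range(N):
        x = [random.random() < p for p in probs]
        if E(x):
            cnt += 1
    return cnt / N

# E = X1X2 v X1X3 v X2X3; choosing each Xi uniformly from {0,1} means probability 0.5.
E = lambda x: (x[0] and x[1]) or (x[0] and x[2]) or (x[1] and x[2])
print(naive_mc(E, [0.5, 0.5, 0.5]))   # close to the exact value Pr(E) = 0.5
```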
Monte Carlo Simulation (II)
[Suciu & Dalvi: SIGMOD'05 Tutorial on "Foundations of Probabilistic Answers to Queries"; Karp, Luby, Madras: J-Alg.'89]

Boolean formula in DNF: E = C1 ∨ C2 ∨ … ∨ Cm

Importance sampling (only for E in DNF; in PTIME):
cnt = 0; S = Pr(C1) + … + Pr(Cm)
repeat N times:
  randomly choose i ∈ {1, 2, …, m} with probability Pr(Ci)/S
  randomly choose X1, …, Xn ∈ {0,1} s.t. Ci = 1
  if C1 = 0 and C2 = 0 and … and Ci−1 = 0 then cnt = cnt + 1
P = cnt / N
return P   /* estimate for the true Pr(E) */

Theorem: if N ≥ (1/m) × (4 ln(2/δ)/ε²), then Pr[ |P/Pr(E) − 1| > ε ] < δ.
This is better!
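And a sketch of the Karp-Luby importance sampler for DNF formulas (again added here only for illustration); note that the standard estimator rescales the observed success frequency by S = Pr(C1) + … + Pr(Cm):

```python
import random

def karp_luby(clauses, probs, N=200_000):
    """Karp-Luby estimator for Pr(E), E = C1 v ... v Cm in DNF.
    clauses: list of clauses, each a list of (var_index, required_value) literals.
    probs:   independent truth probability of each variable."""
    S = []
    for c in clauses:
        p = 1.0
        for v, val in c:
            p *= probs[v] if val else 1.0 - probs[v]
        S.append(p)
    total = sum(S)
    cnt = 0
    for _ in range(N):
        # pick clause Ci with probability Pr(Ci)/S, then a world in which Ci is true
        i = random.choices(range(len(clauses)), weights=S)[0]
        x = [random.random() < probs[v] for v in range(len(probs))]
        for v, val in clauses[i]:
            x[v] = val
        # count only if no earlier clause is already satisfied (avoids double counting)
        if not any(all(x[v] == val for v, val in clauses[j]) for j in range(i)):
            cnt += 1
    return (cnt / N) * total   # rescale to estimate Pr(E)

# E = X1X2 v X1X3 v X2X3 with (hypothetical) probabilities 0.5 each
clauses = [[(0, True), (1, True)], [(0, True), (2, True)], [(1, True), (2, True)]]
print(karp_luby(clauses, [0.5, 0.5, 0.5]))   # close to the exact value 0.5
```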
Top-k Ranking by Marginal Probabilities

Query: graduatedFrom(Surajit, y)
Answers: Q1 = graduatedFrom(Surajit, Princeton), Q2 = graduatedFrom(Surajit, Stanford)
(A = graduatedFrom(Surajit, Princeton) [0.7], B = graduatedFrom(Surajit, Stanford) [0.6], C = hasAdvisor(Surajit, Jeff) [0.8], D = worksAt(Jeff, Stanford) [0.9])

Datalog/SLD resolution: top-down grounding allows us to compute lower and upper bounds on the marginal probabilities of answer candidates before the rules are fully grounded. Subgoals may represent entire sets of answer candidates.

First-order lineage formulas:
Φ(Q1) = A
Φ(Q2) = B ∨ ∃y gradFrom(Surajit, y)

This lets us prune an entire set of answer candidates represented by Φ.
[Dylla, Miliaraki, Theobald: ICDE'13]
Bounds for First-Order Formulas

Theorem 1: Given a (partially grounded) first-order lineage formula Φ, e.g. Φ(Q2) = B ∨ ∃y gradFrom(S,y):
Lower bound Plow (for all query answers that can be obtained from grounding Φ): substitute ∃y gradFrom(S,y) with false (or true if negated). Plow(Q2) = P(B ∨ false) = P(B) = 0.6.
Upper bound Pup (for all query answers that can be obtained from grounding Φ): substitute ∃y gradFrom(S,y) with true (or false if negated). Pup(Q2) = P(B ∨ true) = P(true) = 1.0.
Proof (sketch): substituting a subformula with false reduces the number of models (possible worlds) that satisfy Φ; substituting it with true increases this number.
[Dylla, Miliaraki, Theobald: ICDE'13]

Convergence of Bounds

Theorem 2: Let Φ1, …, Φn be a series of first-order lineage formulas obtained from grounding Φ via SLD resolution, and let φ be the propositional lineage formula of an answer obtained from this grounding procedure. Then rewriting each Φi according to Theorem 1 into Pi,low and Pi,up creates a monotonic series of lower and upper bounds that converges to P(φ):
0 = P(false) ≤ P(B ∨ false) = 0.6 ≤ P(B ∨ (C ∧ D)) = 0.888 ≤ P(B ∨ true) = P(true) = 1
Proof (sketch, via induction): replacing true by a further-grounded subformula reduces the number of models that satisfy Φ; replacing false by a subformula increases this number.
[Dylla, Miliaraki, Theobald: ICDE'13]
Top-k Pruning ("Fagin's Algorithm")
Maintain two disjoint queues: a Top-k queue sorted by Plow and a Candidates queue sorted by Pup.
At the t-th grounding step, whenever Pt,low(Qk) > Pt,up(Qj) holds for the answers Qk in the Top-k queue and a candidate Qj, drop Qj from the Candidates queue.
[Plot: marginal probability (0 to 1) vs. the number of SLD steps t; a candidate Qj's bounds P1,low/P1,up, P2,low/P2,up, …, Pn,low/Pn,up tighten with each grounding step and are compared against the k-th lower bound.]
[Fagin et al.'01; Balke, Kießling'02; Dylla, Miliaraki, Theobald: ICDE'13]
Top-k Stopping Condition ("Fagin's Algorithm")
Maintain two disjoint queues: a Top-k queue sorted by Plow and a Candidates queue sorted by Pup.
Return the Top-k queue at the t-th grounding step when Pt,low(Qk) > Pt,up(Qj) holds for every Qk in the Top-k queue and every Qj in the Candidates queue: stop and return the top-k (here k = 2) query answers.
[Plot: marginal probability (0 to 1) vs. SLD step t; bounds Pt,low/Pt,up for Q1 and Q2 versus the remaining candidates Qm; the 2nd lower bound exceeds all remaining upper bounds.]
[Fagin et al.'01; Balke, Kießling'02; Dylla, Miliaraki, Theobald: ICDE'13]
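The stopping test itself is simple; here is a small Python sketch of the Fagin-style condition, with hypothetical bounds that are not taken from the paper:

```python
def topk_decided(bounds, k):
    """Fagin-style stopping test: the top-k answers are final as soon as the k-th
    largest lower bound exceeds every upper bound outside the current top-k."""
    ranked = sorted(bounds, key=lambda a: bounds[a][0], reverse=True)
    topk, rest = ranked[:k], ranked[k:]
    if not rest:
        return topk
    kth_lower = bounds[topk[-1]][0]
    if all(bounds[a][1] < kth_lower for a in rest):
        return topk
    return None   # keep grounding: some candidate could still enter the top-k

# Hypothetical answer candidates with (P_low, P_up) bounds after some SLD steps:
bounds = {"Q1": (0.70, 0.80), "Q2": (0.60, 0.95), "Qm": (0.10, 0.55)}
print(topk_decided(bounds, k=2))   # ['Q1', 'Q2']; Qm can be pruned since 0.55 < 0.60
```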
Experiment (II): Computing Marginals
IMDB data with 26 million facts about movies, directors, actors, etc.
4 query patterns, each instantiated to 1,000 queries (showing runtime averages):
Q1: safe, non-repeating hierarchical; Q2: unsafe, repeating hierarchical; Q3: unsafe, head-hierarchical; Q4: general unsafe.
[Chart: average runtime in ms (log scale, 10 to 100,000) per query pattern for Top-10/Top-20/Top-50 pruning, MultiSim Top-10/20/50, Postgres, MayBMS, and Trio.]

Experiment (II): Computing Marginals
[Charts: runtime vs. number of top-k results for a single join query, and the percentage of tuples scanned from the input relations; IMDB data set, 26 million facts.]
Basic Types of Inference
✔ MAP inference: find the most likely assignment to the query variables y under a given evidence x. Compute arg max_y P(y | x) (NP-complete; MaxSAT).
✔ Marginal/success probabilities: the probability that a query y is true in a random world under a given evidence x. Compute ∑_y P(y | x) (#P-complete already for conjunctive queries).
Probabilistic & Temporal Databases

A temporal-probabilistic database DTp (compactly) encodes a probability distribution over a finite set of deterministic database instances Di and a finite time domain T.

Sequenced semantics & snapshot reducibility: built-in semantics that reduces temporal-relational operators to their non-temporal counterparts at each snapshot of the database; coalesce/split tuples with consecutive time intervals based on their lineages. [Dignös, Gamper, Böhlen: SIGMOD'12]

Non-sequenced semantics: queries can freely manipulate timestamps just like regular attributes; a single temporal operator ≤T supports all of Allen's 13 temporal relations; deduplicate tuples with overlapping time intervals based on their lineages. [Dylla, Miliaraki, Theobald: PVLDB'13]

BornIn(Sub, Obj, T, p): DeNiro, Greenwich, [1943, 1944), 0.9; DeNiro, Tribeca, [1998, 1999), 0.6
Wedding(Sub, Obj, T, p): DeNiro, Abbott, [1936, 1940), 0.3; DeNiro, Abbott, [1976, 1977), 0.7
Divorce(Sub, Obj, T, p): DeNiro, Abbott, [1988, 1989), 0.8
Temporal Alignment & Deduplication

Non-sequenced semantics:
MarriedTo(X,Y)[Tb1, tmax) ← Wedding(X,Y)[Tb1, Te1) ∧ ¬Divorce(X,Y)[Tb2, Te2)
MarriedTo(X,Y)[Tb1, Te2) ← Wedding(X,Y)[Tb1, Te1) ∧ Divorce(X,Y)[Tb2, Te2) ∧ Te1 ≤T Tb2

[Timeline diagram over [tmin, tmax): base facts f1 = Wedding(DeNiro, Abbott) starting 1936, f2 = Wedding(DeNiro, Abbott) starting 1976, and f3 = Divorce(DeNiro, Abbott) starting 1988. The deduced MarriedTo facts carry conjunctive lineages such as f1 ∧ f3, f1 ∧ ¬f3, f2 ∧ f3, f2 ∧ ¬f3; the deduplicated facts combine them disjunctively per aligned time interval, e.g. (f1 ∧ f3) ∨ (f1 ∧ ¬f3), (f1 ∧ f3) ∨ (f1 ∧ ¬f3) ∨ (f2 ∧ f3) ∨ (f2 ∧ ¬f3), and (f1 ∧ f3) ∨ (f2 ∧ ¬f3).]
Inference in Temporal-Probabilistic Databases
[Wang, Yahya, Theobald: MUD'10; Dylla, Miliaraki, Theobald: PVLDB'13]

[Timeline diagram: base facts playsFor(Beckham, Real, T1) from 2003 to 2007 with confidences 0.4 and 0.6, and playsFor(Ronaldo, Real, T2) from 2000 to 2007 with confidences 0.2, 0.2, 0.1, 0.4; the derived fact teamMates(Beckham, Ronaldo, T3) holds on the aligned overlap intervals between 2003 and 2007 with confidences 0.08, 0.12, 0.16.]

Deduction rule: playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1, T2, T3) → teamMates(Beckham, Ronaldo, T3)
[Extended diagram: adding playsFor(Zidane, Real, T3) yields the further derived facts teamMates(Beckham, Zidane, T5) and teamMates(Ronaldo, Zidane, T6). teamMates(Beckham, Ronaldo, T4) is independent of these, whereas the derived facts sharing the base fact playsFor(Zidane, Real, T3) are non-independent.]
Inference in Temporal-Probabilistic Databases
[Wang, Yahya, Theobald: MUD'10; Dylla, Miliaraki, Theobald: PVLDB'13]
Derived facts that share base facts (here, the teamMates facts derived from playsFor(Zidane, Real, T3)) are non-independent: this is why we need lineage.
Closed and complete representation model (incl. lineage).
Temporal alignment is linear in the number of input intervals.
Confidence computation per interval remains #P-hard; in general it requires Monte Carlo approximations (Karp-Luby for DNF, MCMC-style sampling), decompositions, or top-k pruning.
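A minimal Python sketch of the interval-alignment step for the teamMates example, assuming independent base facts over half-open year intervals and illustrative (made-up) confidences; the full model additionally splits and coalesces intervals and keeps the lineage of each output interval:

```python
def align(intervals_a, intervals_b):
    """Intersect two interval-annotated fact lists; confidences multiply because
    the base facts are assumed independent (the lineage would be a conjunction)."""
    out = []
    for (a_start, a_end, p_a) in intervals_a:
        for (b_start, b_end, p_b) in intervals_b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:                              # non-empty overlap
                out.append((start, end, round(p_a * p_b, 4)))
    return out

# Hypothetical playsFor histories (half-open year intervals [begin, end)):
beckham = [(2003, 2005, 0.4), (2005, 2007, 0.6)]
ronaldo = [(2002, 2004, 0.2), (2004, 2007, 0.4)]
print(align(beckham, ronaldo))
# [(2003, 2004, 0.08), (2004, 2005, 0.16), (2005, 2007, 0.24)]
```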
Experiment (III): Temporal Alignment & Probabilistic Inference
1,827 base facts with temporal annotations, extracted from free-text biographies from Wikipedia, IMDB.com, and biography.com.
11 handcrafted temporal deduction rules, e.g.:
MarriedTo(X,Y)[Tb1, Te2) ← Wedding(X,Y)[Tb1, Te1) ∧ Divorce(X,Y)[Tb2, Te2) ∧ Te1 ≤T Tb2
21 handcrafted temporal consistency constraints, e.g.:
BornIn(X,Y)[Tb1, Te1) ∧ MarriedTo(X,Y)[Tb2, Te2) → Te1 ≤T Tb2
Statistical Relational Learning & Probabilistic Programming
SRL combines first-order logic and probabilistic inference.
It employs relational data as input, but with a focus also on learning the relations (facts, rules & weights).
Knowledge compilation for probabilistic inference, including recent techniques for "lifted inference".
Markov Logic Networks (U. Washington): grounding of weighted first-order rules over a function-free Herbrand base into an undirected graphical model (→ Markov Random Field).
Probabilistic Programming (ProbLog, KU Leuven): deductive grounding over a set of base facts into a directed graphical model (SLD proofs → Bayesian Network).
Learning Soft Deduction Rules
Inductive learning algorithm based on dynamic programming.
A-priori-style pre-filtering & pruning of low-support join patterns.
Adaptation of confidence and support measures from data mining.
Learning "interesting" rules with constants and type constraints.

Goal: inductively learn the soft rule S: livesIn(x,y) :- bornIn(x,y)
[Venn diagram: G = ground truth for livesIn (only partially known); KB = knowledge base for livesIn (known positive examples); R = facts inferred for livesIn from the body of the rule, bornIn (only partially correct).]

confidence(S) := P(Head | Body) = |Head ∧ Body| / |Body|
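For illustration (with a made-up toy KB, not the census data used in the talk), support and confidence of a candidate rule are just counts over the groundings of its body:

```python
# Toy KB; the entity names are made up for illustration.
born_in  = {("Alice", "Oslo"), ("Bob", "Kiel"), ("Carol", "Lyon"), ("Dave", "Rome")}
lives_in = {("Alice", "Oslo"), ("Bob", "Berlin"), ("Carol", "Lyon")}

# Rule S: livesIn(x, y) :- bornIn(x, y)
body = born_in                          # groundings of the rule body
head_and_body = body & lives_in         # groundings where the head also holds

support = len(head_and_body)
confidence = support / len(body)        # confidence(S) = |Head ∧ Body| / |Body|
print(support, confidence)              # 2 0.5
```

Since the knowledge base contains only known positive examples and the ground truth is only partially known, such counts merely approximate the rule's true confidence.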
Learning "Interesting" Deduction Rules (I)
Plots for the distribution of income versus quarterOfBirth and educationLevel over actual US census data from Oct. 2009 (>1 billion RDF facts).
Divergence from the "overall population" curve shows a strong correlation of income with educationLevel, but not with quarterOfBirth.
[Plots: relative frequency vs. income for the overall population and the four quarterOfBirth groups, and likewise for the educationLevel groups.]
Candidate rule bodies: income(x, y), quarterOfBirth(x, z) and income(x, y), educationLevel(x, z)
Learning "Interesting" Deduction Rules (II)
Divergence measured using Kullback-Leibler or χ² between the "overall population" and the "Nursery school to Grade 4" and "Professional school degree" groups over the discretized income domain (low, medium, high).
[Plot: relative frequency vs. income for the overall population, "Nursery school to Grade 4", and "Professional school degree".]
income(x, y) :- educationLevel(x, z)
income(x, "low") :- educationLevel(x, "Nursery school to Grade 4")
income(x, "medium") :- educationLevel(x, "Professional school degree")
income(x, "high") :- educationLevel(x, "Professional school degree")
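A small sketch of the divergence test, with hypothetical relative frequencies that only illustrate the idea:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions (same support)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical relative frequencies over the discretized income domain (low, medium, high):
overall        = [0.4, 0.4, 0.2]
grade4_only    = [0.7, 0.25, 0.05]   # "Nursery school to Grade 4"
prof_degree    = [0.1, 0.4, 0.5]     # "Professional school degree"
quarter_birth1 = [0.41, 0.39, 0.2]   # almost identical to the overall population

for name, dist in [("grade4", grade4_only), ("prof", prof_degree), ("qob-1", quarter_birth1)]:
    print(name, round(kl(dist, overall), 3))
# High divergence for the educationLevel groups, near zero for quarterOfBirth:
# rules over educationLevel are "interesting", rules over quarterOfBirth are not.
```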
Summary & Challenges (I): Web-Scale Information Extraction

[Diagram: the landscape of extraction systems along two axes, ontological rigor (names & patterns vs. entities & relations) and human effort (open-domain & unsupervised vs. domain-oriented training data/facts).
Names & patterns: <"N. Portman", "honored with", "Academy Award">, <"Jeff Bridges", "expected to win", "Oscar">, <"Bridges", "nominated for", "Academy Award">.
Entities & relations: wonAward: Person × Prize; type(Meryl_Streep, Actor); wonAward(Meryl_Streep, Academy_Award); wonAward(Natalie_Portman, Academy_Award); wonAward(Ethan_Coen, Palme_d'Or).
Systems placed on the map: TextRunner, ReadTheWeb/NELL, Probase, WebTables/FusionTables, Sofie/Prospera, StatSnowball/EntityCube, Freebase, YAGO2, DBpedia 3.8; the open-domain yet ontologically rigorous corner is still marked with a question mark.]
Summary & Challenges (II): RDF is Not Enough!
HMMs, CRFs, and PCFGs (not in this talk) yield much richer output structures than just triples.
Extraction of facts → beliefs, modifiers, modalities, etc., and of intensional knowledge ("rules").
More expressive but canonical representations of natural language: trees, graphs, objects, frames (F-Logic, KL-ONE, CycL, OWL, etc.).
All combined with structured probabilistic inference.
Summary & Challenges (III): Scalable Probabilistic Inference
"Domain-liftable" FO formula: ∀X,Y ∈ People: smokes(X) ∧ friends(X,Y) → smokes(Y)
Exact lifted inference via Weighted First-Order Model Counting (WFOMC): the probability of a query depends only on the size(s) of the domain(s), a weight function for the first-order predicates, and the weighted model count over the FO d-DNNF.
[Van den Broeck'11]: compilation rules and inference algorithms for FO d-DNNFs.
[Jha & Suciu'11]: classes of SQL queries which admit polynomial-size (propositional) d-DNNFs.
Approximate inference via belief propagation, MCMC-style sampling, etc.
Scale-out via distributed grounding & inference: TrinityRDF (MSR), GraphLab2 (MIT).
[Figure: the corresponding FO d-DNNF circuit]
Final Summary
Text is not just unstructured data.
Probabilistic databases combine first-order logic and probability theory in an elegant way.
Natural-Language-Processing people, Database guys, and Machine-Learning folks: it's about time to join forces!
References
Maximilian Dylla, Iris Miliaraki, Martin Theobald: A Temporal-Probabilistic Database Model for Information Extraction. PVLDB 6(14), 2013 (to appear).
Maximilian Dylla, Iris Miliaraki, Martin Theobald: Top-k Query Processing in Probabilistic Databases with Non-Materialized Views. ICDE 2013.
Ndapandula Nakashole, Mauro Sozio, Fabian Suchanek, Martin Theobald: Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules. VLDS 2012: 15-20.
Mohamed Yahya, Martin Theobald: D2R2: Disk-Oriented Deductive Reasoning in a RISC-Style RDF Engine. RuleML America 2011: 81-96.
Timm Meiser, Maximilian Dylla, Martin Theobald: Interactive Reasoning in Uncertain RDF Knowledge Bases. CIKM 2011: 2557-2560.
Ndapandula Nakashole, Martin Theobald, Gerhard Weikum: Scalable Knowledge Harvesting with High Precision and High Recall. WSDM 2011: 227-236.
Maximilian Dylla, Mauro Sozio, Martin Theobald: Resolving Temporal Conflicts in Inconsistent RDF Knowledge Bases. BTW 2011: 474-493.
Yafang Wang, Mohamed Yahya, Martin Theobald: Time-aware Reasoning in Uncertain Knowledge Bases. MUD 2010: 51-65.
Ndapandula Nakashole, Martin Theobald, Gerhard Weikum: Find your Advisor: Robust Knowledge Gathering from the Web. WebDB 2010.
Anish Das Sarma, Martin Theobald, Jennifer Widom: LIVE: A Lineage-Supported Versioned DBMS. SSDBM 2010: 416-433.
Anish Das Sarma, Martin Theobald, Jennifer Widom: Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases. ICDE 2008: 1023-1032.
Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, Martin Theobald, Jennifer Widom: Databases with Uncertainty and Lineage. VLDB J. 17(2): 243-264 (2008).