10 Years of Probabilistic Querying – What Next?
Martin Theobald, University of Antwerp
Joint work with Maximilian Dylla, Sairam Gurajada, Angelika Kimmig,
Andre Melo, Iris Miliaraki, Luc de Raedt, Mauro Sozio, Fabian Suchanek
“ The important thing is to not stop questioning ... One cannot help but be in awe when contemplating the mysteries of eternity, of life, of the marvelous structure of reality. It is enough if one tries merely to comprehend a little of this mystery every day.”
- Albert Einstein, 1936
“The Marvelous Structure of Reality”, Joseph M. Hellerstein, Keynote at WebDB 2003, San Diego
Look, There is Structure!
Plethora of natural-language-processing techniques & tools: Part-of-Speech (POS) Tagging, Named-Entity Recognition & Disambiguation (NERD), Dependency Parsing, Semantic Role Labeling
Text is not just “unstructured data” But:
Even the best NLP tools frequently yield errors
Facts found on the Web are logically inconsistent
Web-extracted knowledge bases are inherently incomplete
bornOn(Jeff, 09/22/42), gradFrom(Jeff, Columbia), hasAdvisor(Jeff, Arthur), hasAdvisor(Surajit, Jeff), knownFor(Jeff, Theory)
type(Jeff, Author) [0.9], author(Jeff, Drag_Book) [0.8], author(Jeff, Cind_Book) [0.6]
worksAt(Jeff, Bell_Labs) [0.7], type(Jeff, CEO) [0.4]
Information Extraction: YAGO/DBpedia et al.
New fact candidates
>120 M facts for YAGO2 (mostly from Wikipedia infoboxes)
100's of millions of additional facts from Wikipedia free-text
http://www.mpi-inf.mpg.de/yago-naga/
YAGO Knowledge Base
[Diagram: excerpt of the YAGO entity graph. Entities such as Max_Planck, Erwin_Planck, Angela_Merkel, Kiel, Schleswig-Holstein, Germany, the Max_Planck Society, and the Nobel Prize are connected via relations such as bornOn ("Apr 23, 1858"), diedOn ("Oct 4, 1947"), bornIn, fatherOf, hasWon, citizenOf, and locatedIn, carry name strings via means (e.g. "Max Planck", "Max Karl Ernst Ludwig Planck", "Angela Merkel", "Angela Dorothea Merkel"), and are linked via instanceOf and subclass edges to classes such as Person, Scientist, Physicist, Biologist, Politician, City, State, Country, Location, and Organization.]
3 M entities, 120 M facts, 100 relations, 200k classes; accuracy ~95%
http://linkeddata.org/
Linked Open Data
As of Sept. 2011: >200 linked-data sources, >30 billion RDF triples, >400 million owl:sameAs links
Maybe Even More Importantly: Linked Vocabularies!
Source: http://en.wikipedia.org/wiki/Linked_data
LinkedData.org: instance & class links between DBpedia, WordNet, OpenCyc, GeoNames, and many more…
Schema.org: common vocabulary released by Google, Yahoo!, and Bing to annotate Web pages, incl. links to DBpedia.
Micro-Formats: RDFa (W3C)
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:dc="http://purl.org/dc/elements/1.1/" version="XHTML+RDFa 1.0" xml:lang="en"> <head><title>Martin's Home Page</title> <base href="http://adrem.ua.ac.be/~tmartin/" /> <meta property="dc:creator" content= "Martin" /> </head>
As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase
Application I: Enrichment of Search Results
“Recent Advances in Structured Data and the Web.” Alon Y. Halevy, Keynote at ICDE 2013, Brisbane
It’s about the disappearance forty years ago of Harriet Vanger, a young scion of one of the wealthiest families in Sweden, and about her uncle, determined to know the truth about what he believes was her murder. Blomkvist visits Henrik Vanger at his estate on the tiny island of Hedeby. The old man draws Blomkvist in by promising solid evidence against Wennerström. Blomkvist agrees to spend a year writing the Vanger family history as a cover for the real assignment: the disappearance of Vanger's niece Harriet some 40 years earlier. Hedeby is home to several generations of Vangers, all part owners in Vanger Enterprises. Blomkvist becomes acquainted with the members of the extended Vanger family, most of whom resent his presence. He does, however, start a short-lived affair with Cecilia, the niece of Henrik. After discovering that Salander has hacked into his computer, he persuades her to assist him with research. They eventually become lovers, but Blomkvist has trouble getting close to Lisbeth, who treats virtually everyone she meets with hostility. Ultimately the two discover that Harriet's brother Martin, CEO of Vanger Industries, is secretly a serial killer. A 24-year-old computer hacker sporting an assortment of tattoos and body piercings supports herself by doing deep background investigations for Dragan Armansky, who, in turn, worries that Lisbeth Salander is “the perfect victim for anyone who wished her ill."
Application II: Machine Reading
[Diagram: relations extracted from the text, e.g. uncleOf, owns, hires, headOf, affairWith, enemyOf, plus "same" links connecting co-referent mentions of the same entities.]
Etzioni, Banko, Cafarella: Machine Reading. AAAI'06
Mitchell, Carlson et al.: Toward an Architecture for Never-Ending Language Learning. AAAI'10
Application III: Natural-Language Question Answering
evi.com (formerly trueknowledge.com)
Application III: Natural-Language Question Answering
wolframalpha.com: >10 trillion(!) facts, >50,000 search algorithms, >5,000 visualizations
IBM Watson: Deep Question Answering
99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain
This town is known as "Sin City" & its downtown is "Glitter Gulch"
William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel
As of 2010, this is the only former Yugoslav republic in the EU
www.ibm.com/innovation/us/watson/index.htm
Knowledge back-ends
Question classification & decomposition
D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.
http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/
Multilingual Question Answering over Linked Data (QALD-3), CLEF 2011-13
Natural-Language QA over Linked Data
<question id="4" answertype="resource" aggregation="false" onlydbo="true"> <string lang="en">Which river does the Brooklyn Bridge cross?</string> <string lang="de">Welchen Fluss überspannt die Brooklyn Bridge?</string> <string lang="es">¿Por qué río cruza la Brooklyn Bridge?</string> <string lang="it">Quale fiume attraversa il ponte di Brooklyn?</string> <string lang="fr">Quelle cours d'eau est traversé par le pont de Brooklyn?</string> <string lang="nl">Welke rivier overspant de Brooklyn Bridge?</string> <keywords lang="en">river, cross, Brooklyn Bridge</keywords> <keywords lang="de">Fluss, überspannen, Brooklyn Bridge</keywords> <keywords lang="es">río, cruza, Brooklyn Bridge</keywords> <keywords lang="it">fiume, attraversare, ponte di Brooklyn</keywords> <keywords lang="fr">cours d'eau, pont de Brooklyn</keywords> <keywords lang="nl">rivier, Brooklyn Bridge, overspant</keywords> <query> PREFIX dbo: <http://dbpedia.org/ontology/> PREFIX res: <http://dbpedia.org/resource/> SELECT DISTINCT ?uri WHERE { res:Brooklyn_Bridge dbo:crosses ?uri . } </query></question>
<topic id="2012374" category="Politics"> <jeopardy_clue>Which German politician is a successor of another politician who stepped down before his or her actual term was over, and what is the name of their political ancestor?</jeopardy_clue> <keyword_title>German politicians successor other stepped down before actual term name ancestor</keyword_title> <sparql_ft> SELECT ?s ?s1 WHERE { ?s rdf:type <http://dbpedia.org/class/yago/GermanPoliticians> . ?s1 <http://dbpedia.org/property/successor> ?s . FILTER FTContains (?s, "stepped down early") . } </sparql_ft></topic>
https://inex.mmci.uni-saarland.de/tracks/lod/
INEX Linked Data Track, CLEF 2012-13
Natural-Language QA over Linked Data
Outline
Probabilistic Databases: Stanford's Trio System (data, uncertainty & lineage); handling uncertain RDF data with URDF (Max Planck Institute / U Antwerp)
Probabilistic & Temporal Databases: sequenced vs. non-sequenced semantics; interval alignment & probabilistic inference
Probabilistic Programming: statistical relational learning; learning "interesting" deduction rules
Summary & Challenges
Probabilistic Databases: A Panacea for All of the Aforementioned Tasks?
Probabilistic databases combine first-order logic and probability theory in an elegant way:
Declarative: queries formulated in SQL/Relational Algebra/Datalog; support for updates, transactions, etc.
Deductive: well-studied resolution algorithms for SQL/Relational Algebra/Datalog (top-down/bottom-up), indexes, automatic query optimization.
Scalable (?): polynomial data complexity (SQL), but #P-complete for the probabilistic inference.
Probabilistic Database
A probabilistic database Dp (compactly) encodes a probability distribution over a finite set of deterministic database instances Di.

Example: a tuple-independent PDB over WorksAt(Sub, Obj) with P(Jeff, Stanford) = 0.6 and P(Jeff, Princeton) = 0.7 encodes four possible instances:
{ (Jeff, Stanford), (Jeff, Princeton) }   0.42
{ (Jeff, Stanford) }   0.18
{ (Jeff, Princeton) }   0.28
{ }   0.12

Query semantics ("marginal probabilities"): run query Q against each instance Di; for each answer tuple t, sum up the probabilities of all instances Di in which t exists.

Special cases:
(I) Tuple-independent PDB: WorksAt(Jeff, Stanford) with p = 0.6; WorksAt(Jeff, Princeton) with p = 0.7.
(II) Block-independent PDB: WorksAt(Jeff, Stanford) with p = 0.6 ∥ WorksAt(Jeff, Princeton) with p = 0.4 (alternatives within one block).
Note: (I) and (II) are not equivalent!
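To make the possible-worlds encoding concrete, here is a minimal Python sketch (not part of the original slides) that enumerates the worlds of the tuple-independent example above and derives the marginal probability of each answer tuple:

```python
from itertools import product

# Tuple-independent PDB: each tuple is present independently with its probability.
pdb = [(("Jeff", "Stanford"), 0.6),
       (("Jeff", "Princeton"), 0.7)]

def possible_worlds(tuples):
    """Yield every deterministic instance Di together with its probability."""
    for bits in product([True, False], repeat=len(tuples)):
        world = {t for (t, _), present in zip(tuples, bits) if present}
        prob = 1.0
        for (_, p), present in zip(tuples, bits):
            prob *= p if present else 1.0 - p
        yield world, prob

# Marginal probability of an answer tuple = sum of the probabilities of the worlds containing it.
marginals, total = {}, 0.0
for world, prob in possible_worlds(pdb):
    total += prob              # the four world probabilities: 0.42, 0.18, 0.28, 0.12
    for t in world:
        marginals[t] = marginals.get(t, 0.0) + prob

print(round(total, 4))         # 1.0
print({t: round(p, 4) for t, p in marginals.items()})
# marginals: 0.6 for (Jeff, Stanford) and 0.7 for (Jeff, Princeton)
```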
Stanford Trio System
1. Alternatives  2. ‘?’ (Maybe) Annotations  3. Confidence values  4. Lineage
Uncertainty-Lineage Databases (ULDBs)
[Widom: CIDR 2005]
Trio’s Data Model
1. Alternatives: uncertainty about value
Saw (witness, color, car)
Amy red, Honda ∥ red, Toyota ∥ orange, Mazda
Three possible instances
Six possible instances
Trio’s Data Model
1. Alternatives2. ‘?’ (Maybe): uncertainty about presence
?
Saw (witness, color, car)
Amy red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty blue, Acura
Trio’s Data Model
1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences: weighted uncertainty
Still six possible instances, each with a probability
?
Saw (witness, color, car)
Amy red, Honda 0.5 ∥ red, Toyota 0.3 ∥ orange, Mazda 0.2
Betty blue, Acura 0.6
So Far: Model is Not Closed
Saw (witness, car): Cathy: Honda ∥ Mazda
Drives (person, car): Jimmy, Toyota ∥ Jimmy, Mazda; Billy, Honda ∥ Frank, Honda; Hank, Honda
Suspects = πperson(Saw ⋈ Drives): Jimmy; Billy ∥ Frank; Hank
Problem: this does not correctly capture the possible instances in the result.
Example with Lineage

Saw (witness, car):
11  Cathy  Honda ∥ Mazda

Drives (person, car):
21  Jimmy, Toyota ∥ Jimmy, Mazda
22  Billy, Honda ∥ Frank, Honda
23  Hank, Honda

Suspects = πperson(Saw ⋈ Drives):
31  Jimmy
32  Billy ∥ Frank
33  Hank

Lineage:
λ(31) = (11,2) ∧ (21,2)
λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
λ(33) = (11,1) ∧ 23

With lineage, the result correctly captures the possible instances.
Operational Semantics
Closure: a representation Dp′ of the query result always exists.
Completeness: any (finite) set of possible instances can be represented.
[Diagram: Dp expands into its possible instances D1, D2, …, Dn; running Q on each instance yields D1′, D2′, …, Dm′, which are again representable as Dp′; the direct implementation computes Dp′ from Dp without enumerating the instances.]
But: data complexity is #P-complete!
Summary on Trio’s Data Model
1. Alternatives  2. ‘?’ (Maybe) Annotations  3. Confidence values  4. Lineage
Uncertainty-Lineage Databases (ULDBs)
Theorem: ULDBs are closed and complete.
Formally studied properties like minimization, equivalence, approximation, and membership based on lineage. [Benjelloun, Das Sarma, Halevy, Widom, Theobald: VLDB J. 2008]
Basic Complexity Issue
Theorem [Valiant 1979]: For a Boolean expression E, computing Pr(E) is #P-complete.
NP = the class of problems of the form "is there a witness?" (e.g., SAT); #P = the class of problems of the form "how many witnesses?" (e.g., #SAT).
The decision problem for 2CNF is in PTIME; the counting problem for 2CNF is already #P-complete.
(We will come back to this later.)
[Suciu & Dalvi: SIGMOD’05 Tutorial on "Foundations of Probabilistic Answers to Queries"]
…back to Information Extraction
bornIn(Barack, Honolulu)
bornIn(Barack, Kenya)
Uncertain RDF (URDF): Facts & Rules
Extensional knowledge (the "facts"):
High-confidence facts: existing knowledge base ("ground truth")
New fact candidates: extracted fact candidates with confidences
Linked Data & integration of various knowledge sources: ontology merging or explicitly linked facts (owl:sameAs, owl:equivalentProperty)
→ a large "probabilistic database" of RDF facts
Intensional knowledge (the "rules"):
Soft rules: deductive grounding & lineage (Datalog/SLD resolution)
Hard rules: consistency constraints (more general FOL rules)
→ propositional & probabilistic inference, at query time!
Soft Rules vs. Hard Rules: (Soft) Deduction Rules vs. (Hard) Consistency Constraints

People may live in more than one place:
livesIn(x,y) ∧ marriedTo(x,z) → livesIn(z,y)   [0.8]
livesIn(x,y) ∧ hasChild(x,z) → livesIn(z,y)   [0.5]
→ Deductive database: Datalog, the core of SQL & Relational Algebra, RDF/S, OWL2-RL, etc.

People are not born in different places/on different dates:
bornIn(x,y) ∧ bornIn(x,z) → y=z
bornOn(x,y) ∧ bornOn(x,z) → y=z
People are not married to more than one person (at the same time, in most countries?):
marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z → disjoint(t1,t2)
→ More general FOL constraints: Datalog plus constraints, X-tuples in PDBs, owl:FunctionalProperty, owl:disjointWith, etc.
URDF Running Example

KB: RDF base facts
worksAt(Jeff, Stanford) [0.9]
type(Jeff, Computer_Scientist) [1.0], type(Surajit, Computer_Scientist) [1.0], type(David, Computer_Scientist) [1.0]
type(Stanford, University) [1.0], type(Princeton, University) [1.0]
graduatedFrom(Surajit, Princeton) [0.7], graduatedFrom(Surajit, Stanford) [0.6], graduatedFrom(David, Princeton) [0.9]
hasAdvisor(Surajit, Jeff) [0.8], hasAdvisor(David, Jeff) [0.7]

Derived facts
gradFrom(Surajit, Stanford) [?], gradFrom(David, Stanford) [?]

Rules
hasAdvisor(x,y) ∧ worksAt(y,z) → graduatedFrom(x,z)   [0.4]
graduatedFrom(x,y) ∧ graduatedFrom(x,z) → y=z
Basic Types of Inference
MAP inference: find the most likely assignment to the query variables y under a given evidence x. Compute arg max_y P(y | x) (NP-complete; MaxSAT).
Marginal/success probabilities: the probability that a query y is true in a random world under a given evidence x. Compute ∑_y P(y | x) (#P-complete already for conjunctive queries).
General Route: Grounding & MaxSAT Solving

Query: graduatedFrom(x, y)

Grounded CNF:
(¬graduatedFrom(Surajit, Stanford) ∨ ¬graduatedFrom(Surajit, Princeton))   [1000]
(¬graduatedFrom(David, Stanford) ∨ ¬graduatedFrom(David, Princeton))   [1000]
(¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))   [0.4]
(¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))   [0.4]
Weighted base facts (unit clauses): worksAt(Jeff, Stanford) [0.9], hasAdvisor(Surajit, Jeff) [0.8], hasAdvisor(David, Jeff) [0.7], graduatedFrom(Surajit, Princeton) [0.7], graduatedFrom(Surajit, Stanford) [0.6], graduatedFrom(David, Princeton) [0.9]

1) Grounding: consider only the facts (and rules) that are relevant for answering the query.
2) Build a propositional formula in CNF, consisting of the grounded soft & hard rules and the weighted base facts.
3) Propositional reasoning: find a truth assignment to the facts such that the total weight of the satisfied clauses is maximized.

MAP inference: compute the "most likely" possible world.
Find arg max_y P(y | x); this resolves to a variant of MaxSAT for propositional formulas.
[Theobald, Sozio, Suchanek, Nakashole: VLDS'12]
URDF: MaxSAT Solving with Soft & Hard Rules

Special case: Horn clauses as soft rules & mutex constraints as hard rules.

S: mutex constraints:
{ graduatedFrom(Surajit, Stanford), graduatedFrom(Surajit, Princeton) }
{ graduatedFrom(David, Stanford), graduatedFrom(David, Princeton) }

C: weighted Horn clauses (CNF):
(¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))   [0.4]
(¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))   [0.4]
worksAt(Jeff, Stanford) [0.9], hasAdvisor(Surajit, Jeff) [0.8], hasAdvisor(David, Jeff) [0.7], graduatedFrom(Surajit, Princeton) [0.7], graduatedFrom(Surajit, Stanford) [0.6], graduatedFrom(David, Princeton) [0.9]

MaxSAT algorithm:
Compute W0 = ∑_{clauses C} w(C) P(C is satisfied);
For each hard constraint S_t {
  For each fact f in S_t {
    Compute W_f+,t = ∑_{clauses C} w(C) P(C is sat. | f = true);
  }
  Compute W_S-,t = ∑_{clauses C} w(C) P(C is sat. | S_t = false);
  Choose the truth assignment to the facts f in S_t that maximizes W_f+,t, W_S-,t;
  Remove satisfied clauses C; t++;
}

Runtime: O(|S||C|); approximation guarantee of 1/2.
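To make the MAP/MaxSAT step tangible, here is a small Python sketch (not from the talk) that grounds the running example into weighted clauses and simply brute-forces the maximum-weight assignment; the actual URDF algorithm instead performs the greedy pass over the mutex constraints sketched above, with the stated 1/2-approximation guarantee. The variable names are abbreviations introduced here.

```python
import itertools

# Weighted clauses over the running example; hard mutex constraints get a large weight.
# Each clause: (weight, [(variable, required_truth_value), ...]).
facts = {"worksAt_J_Stan": 0.9, "adv_S_J": 0.8, "adv_D_J": 0.7,
         "grad_S_Pri": 0.7, "grad_S_Stan": 0.6, "grad_D_Pri": 0.9,
         "grad_D_Stan": 0.0}  # derived fact, no base confidence of its own
clauses = [
    (1000.0, [("grad_S_Stan", False), ("grad_S_Pri", False)]),  # mutex (hard)
    (1000.0, [("grad_D_Stan", False), ("grad_D_Pri", False)]),  # mutex (hard)
    (0.4, [("adv_S_J", False), ("worksAt_J_Stan", False), ("grad_S_Stan", True)]),  # soft rule
    (0.4, [("adv_D_J", False), ("worksAt_J_Stan", False), ("grad_D_Stan", True)]),  # soft rule
] + [(w, [(f, True)]) for f, w in facts.items() if w > 0]        # weighted unit clauses

def satisfied_weight(assignment):
    """Total weight of the clauses satisfied by a complete truth assignment."""
    return sum(w for w, lits in clauses
               if any(assignment[v] == val for v, val in lits))

best = max(itertools.product([False, True], repeat=len(facts)),
           key=lambda bits: satisfied_weight(dict(zip(facts, bits))))
print(dict(zip(facts, best)))
# e.g. keeps graduatedFrom(Surajit, Stanford) but drops graduatedFrom(Surajit, Princeton)
```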
Experiment (I): MAP Inference
URDF: grounding & MaxSAT solving.
|C| = number of literals in the grounded soft rules; |S| = number of literals in the grounded hard rules.
URDF MaxSAT vs. Markov Logic (MAP inference & MC-SAT).
YAGO knowledge base: 2 million entities, 20 million facts.
Query answering: deductive grounding & MaxSAT solving for 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …).
Asymptotic runtime checks via synthetic (random) soft-rule expansions.
Basic Types of Inference
✔ MAP inference: find the most likely assignment to the query variables y under a given evidence x. Compute arg max_y P(y | x) (NP-complete; MaxSAT).
Marginal/success probabilities: the probability that a query y is true in a random world under a given evidence x. Compute ∑_y P(y | x) (#P-complete already for conjunctive queries).
[Yahya, Theobald: RuleML'11; Dylla, Miliaraki, Theobald: ICDE'13]
Deductive Grounding with Lineage (SLD Resolution in Datalog/Prolog)

Query: graduatedFrom(Surajit, y)
Answers: Q1 = graduatedFrom(Surajit, Princeton), Q2 = graduatedFrom(Surajit, Stanford)

[Lineage DAG over the base facts A = graduatedFrom(Surajit, Princeton) [0.7], B = graduatedFrom(Surajit, Stanford) [0.6], C = hasAdvisor(Surajit, Jeff) [0.8], D = worksAt(Jeff, Stanford) [0.9]:
Q1 = A ∧ ¬(B ∨ (C ∧ D)), Q2 = ¬A ∧ (B ∨ (C ∧ D))]

Rules
hasAdvisor(x,y) ∧ worksAt(y,z) → graduatedFrom(x,z)   [0.4]
graduatedFrom(x,y) ∧ graduatedFrom(x,z) → y=z

Base facts
graduatedFrom(Surajit, Princeton) [0.7]
graduatedFrom(Surajit, Stanford) [0.6]
graduatedFrom(David, Princeton) [0.9]
hasAdvisor(Surajit, Jeff) [0.8]
hasAdvisor(David, Jeff) [0.7]
worksAt(Jeff, Stanford) [0.9]
type(Princeton, University) [1.0]
type(Stanford, University) [1.0]
type(Jeff, Computer_Scientist) [1.0]
type(Surajit, Computer_Scientist) [1.0]
type(David, Computer_Scientist) [1.0]
Lineage & Possible Worlds

1) Deductive grounding: dependency graph of the query; trace the lineage of individual query answers.
2) Lineage DAG (not in CNF), consisting of the grounded soft & hard rules and the probabilistic base facts.
3) Probabilistic inference: compute marginals. P(Q): sum up the probabilities of all possible worlds that entail the query answer's lineage. P(Q|H): drop the "impossible" worlds.

Query: graduatedFrom(Surajit, y)
Answers: Q1 = graduatedFrom(Surajit, Princeton), Q2 = graduatedFrom(Surajit, Stanford)
Lineage (with A = graduatedFrom(Surajit, Princeton) [0.7], B = graduatedFrom(Surajit, Stanford) [0.6], C = hasAdvisor(Surajit, Jeff) [0.8], D = worksAt(Jeff, Stanford) [0.9]):
Q1 = A ∧ ¬(B ∨ (C ∧ D)), Q2 = ¬A ∧ (B ∨ (C ∧ D))
P(C ∧ D) = 0.8 × 0.9 = 0.72
P(B ∨ (C ∧ D)) = 1 − (1 − 0.72) × (1 − 0.6) = 0.888
P(Q1) = 0.7 × (1 − 0.888) = 0.0784
P(Q2) = (1 − 0.7) × 0.888 = 0.2664

[Das Sarma, Theobald, Widom: ICDE'08; Dylla, Miliaraki, Theobald: ICDE'13]
Possible Worlds Semantics

Base facts: A: 0.7, B: 0.6, C: 0.8, D: 0.9
Q2 = ¬A ∧ (B ∨ (C ∧ D)); hard rule H: A and (B ∨ (C ∧ D)) are mutually exclusive

A B C D | Q2 | P(W)
1 1 1 1 | 0 | 0.7×0.6×0.8×0.9 = 0.3024
1 1 1 0 | 0 | 0.7×0.6×0.8×0.1 = 0.0336
1 1 0 1 | 0 | 0.0756
1 1 0 0 | 0 | 0.0084
1 0 1 1 | 0 | 0.2016
1 0 1 0 | 0 | 0.0224
1 0 0 1 | 0 | 0.0504
1 0 0 0 | 0 | 0.0056
0 1 1 1 | 1 | 0.3×0.6×0.8×0.9 = 0.1296
0 1 1 0 | 1 | 0.3×0.6×0.8×0.1 = 0.0144
0 1 0 1 | 1 | 0.3×0.6×0.2×0.9 = 0.0324
0 1 0 0 | 1 | 0.3×0.6×0.2×0.1 = 0.0036
0 0 1 1 | 1 | 0.3×0.4×0.8×0.9 = 0.0864
0 0 1 0 | 0 | 0.0096
0 0 0 1 | 0 | 0.0216
0 0 0 0 | 0 | 0.0024
(sum = 1.0)

P(Q1) = 0.0784; P(Q1|H) = 0.0784 / 0.412 = 0.1903
P(Q2) = 0.2664; P(Q2|H) = 0.2664 / 0.412 = 0.6466
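As a sanity check (not on the original slides), the marginals in the table above can be reproduced by a few lines of brute-force Python over the four base facts, using the lineage formulas for Q1 and Q2 as given:

```python
from itertools import product

# A = gradFrom(Surajit, Princeton), B = gradFrom(Surajit, Stanford),
# C = hasAdvisor(Surajit, Jeff),    D = worksAt(Jeff, Stanford)
prob = {"A": 0.7, "B": 0.6, "C": 0.8, "D": 0.9}

def marginal(lineage):
    """Sum the probabilities of all possible worlds that satisfy the lineage formula."""
    total = 0.0
    for bits in product([True, False], repeat=4):
        world = dict(zip("ABCD", bits))
        p_world = 1.0
        for v, present in world.items():
            p_world *= prob[v] if present else 1.0 - prob[v]
        if lineage(world):
            total += p_world
    return total

q2_raw = lambda w: w["B"] or (w["C"] and w["D"])   # lineage before the hard rule
q1 = lambda w: w["A"] and not q2_raw(w)            # gradFrom(Surajit, Princeton)
q2 = lambda w: (not w["A"]) and q2_raw(w)          # gradFrom(Surajit, Stanford)

print(round(marginal(q2_raw), 4))   # 0.888
print(round(marginal(q1), 4))       # 0.0784
print(round(marginal(q2), 4))       # 0.2664
```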
Inference in Probabilistic Databases
Safe query plans [Dalvi, Suciu: VLDB-J'07]: confidences can be propagated along with the relational operators.
Read-once functions [Sen, Deshpande, Getoor: PVLDB'10]: the Boolean formula can be factorized (in polynomial time) into read-once form, where every variable occurs at most once.
Knowledge compilation [Olteanu et al.: ICDT'10, ICDT'11]: the Boolean formula can be decomposed into an ordered binary decision diagram (OBDD), such that inference resolves to independent-and and independent-or operations over the decomposed formula.
Top-k pruning [Ré, Dalvi, Suciu: ICDE'07; Karp, Luby, Madras: J-Alg.'89]: the top-k answers can be returned based on lower and upper bounds, even without knowing their exact marginal probabilities. Multi-simulation: run multiple Markov-Chain-Monte-Carlo (MCMC) simulations in parallel.
Monte Carlo Simulation (I)
[Suciu & Dalvi: SIGMOD'05 Tutorial on "Foundations of Probabilistic Answers to Queries"; Karp, Luby, Madras: J-Alg.'89]

Boolean formula: E = X1X2 ∨ X1X3 ∨ X2X3

Naïve sampling (works for any E, but not in PTIME):
cnt = 0
repeat N times:
  randomly choose X1, X2, X3 ∈ {0,1}
  if E(X1, X2, X3) = 1 then cnt = cnt + 1
P = cnt / N
return P   /* estimate for the true Pr(E) */

Zero/One-Estimator Theorem: if N ≥ (1/Pr(E)) × (4 ln(2/δ)/ε²), then Pr[ |P/Pr(E) − 1| > ε ] < δ.
But N may be very big for small Pr(E).
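A runnable version of the naïve estimator (a sketch added here, not from the tutorial slides), with the variables' truth probabilities made explicit:

```python
import random

def naive_mc(E, probs, N=200_000):
    """Zero/one estimator: sample each Xi independently, count satisfying samples."""
    cnt = 0
    for _ in range(N):
        x = [random.random() < p for p in probs]
        if E(x):
            cnt += 1
    return cnt / N

# E = X1X2 v X1X3 v X2X3; choosing each Xi uniformly from {0,1} means probability 0.5.
E = lambda x: (x[0] and x[1]) or (x[0] and x[2]) or (x[1] and x[2])
print(naive_mc(E, [0.5, 0.5, 0.5]))   # close to the exact value Pr(E) = 0.5
```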
Monte Carlo Simulation (II)
[Suciu & Dalvi: SIGMOD'05 Tutorial on "Foundations of Probabilistic Answers to Queries"; Karp, Luby, Madras: J-Alg.'89]

Boolean formula in DNF: E = C1 ∨ C2 ∨ … ∨ Cm

Importance sampling (only for E in DNF; in PTIME):
cnt = 0; S = Pr(C1) + … + Pr(Cm)
repeat N times:
  randomly choose i ∈ {1, 2, …, m} with probability Pr(Ci)/S
  randomly choose X1, …, Xn ∈ {0,1} s.t. Ci = 1
  if C1 = 0 and C2 = 0 and … and Ci−1 = 0 then cnt = cnt + 1
P = cnt / N
return P   /* estimate for the true Pr(E) */

Theorem: if N ≥ (1/m) × (4 ln(2/δ)/ε²), then Pr[ |P/Pr(E) − 1| > ε ] < δ.
This is better!
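And a sketch of the Karp-Luby importance sampler for DNF formulas (again added here only for illustration); note that the standard estimator rescales the observed success frequency by S = Pr(C1) + … + Pr(Cm):

```python
import random

def karp_luby(clauses, probs, N=200_000):
    """Karp-Luby estimator for Pr(E), E = C1 v ... v Cm in DNF.
    clauses: list of clauses, each a list of (var_index, required_value) literals.
    probs:   independent truth probability of each variable."""
    S = []
    for c in clauses:
        p = 1.0
        for v, val in c:
            p *= probs[v] if val else 1.0 - probs[v]
        S.append(p)
    total = sum(S)
    cnt = 0
    for _ in range(N):
        # pick clause Ci with probability Pr(Ci)/S, then a world in which Ci is true
        i = random.choices(range(len(clauses)), weights=S)[0]
        x = [random.random() < probs[v] for v in range(len(probs))]
        for v, val in clauses[i]:
            x[v] = val
        # count only if no earlier clause is already satisfied (avoids double counting)
        if not any(all(x[v] == val for v, val in clauses[j]) for j in range(i)):
            cnt += 1
    return (cnt / N) * total   # rescale to estimate Pr(E)

# E = X1X2 v X1X3 v X2X3 with (hypothetical) probabilities 0.5 each
clauses = [[(0, True), (1, True)], [(0, True), (2, True)], [(1, True), (2, True)]]
print(karp_luby(clauses, [0.5, 0.5, 0.5]))   # close to the exact value 0.5
```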
Top-k Ranking by Marginal Probabilities

Query: graduatedFrom(Surajit, y)
Answers: Q1 = graduatedFrom(Surajit, Princeton), Q2 = graduatedFrom(Surajit, Stanford)
(A = graduatedFrom(Surajit, Princeton) [0.7], B = graduatedFrom(Surajit, Stanford) [0.6], C = hasAdvisor(Surajit, Jeff) [0.8], D = worksAt(Jeff, Stanford) [0.9])

Datalog/SLD resolution: top-down grounding allows us to compute lower and upper bounds on the marginal probabilities of answer candidates before the rules are fully grounded. Subgoals may represent entire sets of answer candidates.

First-order lineage formulas:
Φ(Q1) = A
Φ(Q2) = B ∨ ∃y gradFrom(Surajit, y)

This lets us prune an entire set of answer candidates represented by Φ.
[Dylla, Miliaraki, Theobald: ICDE'13]
Bounds for First-Order Formulas

Theorem 1: Given a (partially grounded) first-order lineage formula Φ, e.g. Φ(Q2) = B ∨ ∃y gradFrom(S,y):
Lower bound Plow (for all query answers that can be obtained from grounding Φ): substitute ∃y gradFrom(S,y) with false (or true if negated). Plow(Q2) = P(B ∨ false) = P(B) = 0.6.
Upper bound Pup (for all query answers that can be obtained from grounding Φ): substitute ∃y gradFrom(S,y) with true (or false if negated). Pup(Q2) = P(B ∨ true) = P(true) = 1.0.
Proof (sketch): substituting a subformula with false reduces the number of models (possible worlds) that satisfy Φ; substituting it with true increases this number.
[Dylla, Miliaraki, Theobald: ICDE'13]

Convergence of Bounds

Theorem 2: Let Φ1, …, Φn be a series of first-order lineage formulas obtained from grounding Φ via SLD resolution, and let φ be the propositional lineage formula of an answer obtained from this grounding procedure. Then rewriting each Φi according to Theorem 1 into Pi,low and Pi,up creates a monotonic series of lower and upper bounds that converges to P(φ):
0 = P(false) ≤ P(B ∨ false) = 0.6 ≤ P(B ∨ (C ∧ D)) = 0.888 ≤ P(B ∨ true) = P(true) = 1
Proof (sketch, via induction): replacing true by a further-grounded subformula reduces the number of models that satisfy Φ; replacing false by a subformula increases this number.
[Dylla, Miliaraki, Theobald: ICDE'13]
Top-k Pruning ("Fagin's Algorithm")
Maintain two disjoint queues: a Top-k queue sorted by Plow and a Candidates queue sorted by Pup.
At the t-th grounding step, whenever Pt,low(Qk) > Pt,up(Qj) holds for the answers Qk in the Top-k queue and a candidate Qj, drop Qj from the Candidates queue.
[Plot: marginal probability (0 to 1) vs. the number of SLD steps t; a candidate Qj's bounds P1,low/P1,up, P2,low/P2,up, …, Pn,low/Pn,up tighten with each grounding step and are compared against the k-th lower bound.]
[Fagin et al.'01; Balke, Kießling'02; Dylla, Miliaraki, Theobald: ICDE'13]
Top-k Stopping Condition ("Fagin's Algorithm")
Maintain two disjoint queues: a Top-k queue sorted by Plow and a Candidates queue sorted by Pup.
Return the Top-k queue at the t-th grounding step when Pt,low(Qk) > Pt,up(Qj) holds for every Qk in the Top-k queue and every Qj in the Candidates queue: stop and return the top-k (here k = 2) query answers.
[Plot: marginal probability (0 to 1) vs. SLD step t; bounds Pt,low/Pt,up for Q1 and Q2 versus the remaining candidates Qm; the 2nd lower bound exceeds all remaining upper bounds.]
[Fagin et al.'01; Balke, Kießling'02; Dylla, Miliaraki, Theobald: ICDE'13]
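The stopping test itself is simple; here is a small Python sketch of the Fagin-style condition, with hypothetical bounds that are not taken from the paper:

```python
def topk_decided(bounds, k):
    """Fagin-style stopping test: the top-k answers are final as soon as the k-th
    largest lower bound exceeds every upper bound outside the current top-k."""
    ranked = sorted(bounds, key=lambda a: bounds[a][0], reverse=True)
    topk, rest = ranked[:k], ranked[k:]
    if not rest:
        return topk
    kth_lower = bounds[topk[-1]][0]
    if all(bounds[a][1] < kth_lower for a in rest):
        return topk
    return None   # keep grounding: some candidate could still enter the top-k

# Hypothetical answer candidates with (P_low, P_up) bounds after some SLD steps:
bounds = {"Q1": (0.70, 0.80), "Q2": (0.60, 0.95), "Qm": (0.10, 0.55)}
print(topk_decided(bounds, k=2))   # ['Q1', 'Q2']; Qm can be pruned since 0.55 < 0.60
```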
Experiment (II): Computing Marginals
IMDB data with 26 million facts about movies, directors, actors, etc.
4 query patterns, each instantiated to 1,000 queries (showing runtime averages):
Q1: safe, non-repeating hierarchical; Q2: unsafe, repeating hierarchical; Q3: unsafe, head-hierarchical; Q4: general unsafe.
[Chart: average runtime in ms (log scale, 10 to 100,000) per query pattern for Top-10/Top-20/Top-50 pruning, MultiSim Top-10/20/50, Postgres, MayBMS, and Trio.]

Experiment (II): Computing Marginals
[Charts: runtime vs. number of top-k results for a single join query, and the percentage of tuples scanned from the input relations; IMDB data set, 26 million facts.]
Basic Types of Inference
✔ MAP inference: find the most likely assignment to the query variables y under a given evidence x. Compute arg max_y P(y | x) (NP-complete; MaxSAT).
✔ Marginal/success probabilities: the probability that a query y is true in a random world under a given evidence x. Compute ∑_y P(y | x) (#P-complete already for conjunctive queries).
Probabilistic & Temporal Databases

A temporal-probabilistic database DTp (compactly) encodes a probability distribution over a finite set of deterministic database instances Di and a finite time domain T.

Sequenced semantics & snapshot reducibility: built-in semantics that reduces temporal-relational operators to their non-temporal counterparts at each snapshot of the database; coalesce/split tuples with consecutive time intervals based on their lineages. [Dignös, Gamper, Böhlen: SIGMOD'12]

Non-sequenced semantics: queries can freely manipulate timestamps just like regular attributes; a single temporal operator ≤T supports all of Allen's 13 temporal relations; deduplicate tuples with overlapping time intervals based on their lineages. [Dylla, Miliaraki, Theobald: PVLDB'13]

BornIn(Sub, Obj, T, p): DeNiro, Greenwich, [1943, 1944), 0.9; DeNiro, Tribeca, [1998, 1999), 0.6
Wedding(Sub, Obj, T, p): DeNiro, Abbott, [1936, 1940), 0.3; DeNiro, Abbott, [1976, 1977), 0.7
Divorce(Sub, Obj, T, p): DeNiro, Abbott, [1988, 1989), 0.8
Temporal Alignment & Deduplication

Non-sequenced semantics:
MarriedTo(X,Y)[Tb1, tmax) ← Wedding(X,Y)[Tb1, Te1) ∧ ¬Divorce(X,Y)[Tb2, Te2)
MarriedTo(X,Y)[Tb1, Te2) ← Wedding(X,Y)[Tb1, Te1) ∧ Divorce(X,Y)[Tb2, Te2) ∧ Te1 ≤T Tb2

[Timeline diagram over [tmin, tmax): base facts f1 = Wedding(DeNiro, Abbott) starting 1936, f2 = Wedding(DeNiro, Abbott) starting 1976, and f3 = Divorce(DeNiro, Abbott) starting 1988. The deduced MarriedTo facts carry conjunctive lineages such as f1 ∧ f3, f1 ∧ ¬f3, f2 ∧ f3, f2 ∧ ¬f3; the deduplicated facts combine them disjunctively per aligned time interval, e.g. (f1 ∧ f3) ∨ (f1 ∧ ¬f3), (f1 ∧ f3) ∨ (f1 ∧ ¬f3) ∨ (f2 ∧ f3) ∨ (f2 ∧ ¬f3), and (f1 ∧ f3) ∨ (f2 ∧ ¬f3).]
Inference in Temporal-Probabilistic Databases
[Wang, Yahya, Theobald: MUD'10; Dylla, Miliaraki, Theobald: PVLDB'13]

[Timeline diagram: base facts playsFor(Beckham, Real, T1) from 2003 to 2007 with confidences 0.4 and 0.6, and playsFor(Ronaldo, Real, T2) from 2000 to 2007 with confidences 0.2, 0.2, 0.1, 0.4; the derived fact teamMates(Beckham, Ronaldo, T3) holds on the aligned overlap intervals between 2003 and 2007 with confidences 0.08, 0.12, 0.16.]

Deduction rule: playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1, T2, T3) → teamMates(Beckham, Ronaldo, T3)
[Extended diagram: adding playsFor(Zidane, Real, T3) yields the further derived facts teamMates(Beckham, Zidane, T5) and teamMates(Ronaldo, Zidane, T6). teamMates(Beckham, Ronaldo, T4) is independent of these, whereas the derived facts sharing the base fact playsFor(Zidane, Real, T3) are non-independent.]
Inference in Temporal-Probabilistic Databases
[Wang, Yahya, Theobald: MUD'10; Dylla, Miliaraki, Theobald: PVLDB'13]
Derived facts that share base facts (here, the teamMates facts derived from playsFor(Zidane, Real, T3)) are non-independent: this is why we need lineage.
Closed and complete representation model (incl. lineage).
Temporal alignment is linear in the number of input intervals.
Confidence computation per interval remains #P-hard; in general it requires Monte Carlo approximations (Karp-Luby for DNF, MCMC-style sampling), decompositions, or top-k pruning.
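A minimal Python sketch of the interval-alignment step for the teamMates example, assuming independent base facts over half-open year intervals and illustrative (made-up) confidences; the full model additionally splits and coalesces intervals and keeps the lineage of each output interval:

```python
def align(intervals_a, intervals_b):
    """Intersect two interval-annotated fact lists; confidences multiply because
    the base facts are assumed independent (the lineage would be a conjunction)."""
    out = []
    for (a_start, a_end, p_a) in intervals_a:
        for (b_start, b_end, p_b) in intervals_b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:                              # non-empty overlap
                out.append((start, end, round(p_a * p_b, 4)))
    return out

# Hypothetical playsFor histories (half-open year intervals [begin, end)):
beckham = [(2003, 2005, 0.4), (2005, 2007, 0.6)]
ronaldo = [(2002, 2004, 0.2), (2004, 2007, 0.4)]
print(align(beckham, ronaldo))
# [(2003, 2004, 0.08), (2004, 2005, 0.16), (2005, 2007, 0.24)]
```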
Experiment (III): Temporal Alignment & Probabilistic Inference
1,827 base facts with temporal annotations, extracted from free-text biographies from Wikipedia, IMDB.com, and biography.com.
11 handcrafted temporal deduction rules, e.g.:
MarriedTo(X,Y)[Tb1, Te2) ← Wedding(X,Y)[Tb1, Te1) ∧ Divorce(X,Y)[Tb2, Te2) ∧ Te1 ≤T Tb2
21 handcrafted temporal consistency constraints, e.g.:
BornIn(X,Y)[Tb1, Te1) ∧ MarriedTo(X,Y)[Tb2, Te2) → Te1 ≤T Tb2
Statistical Relational Learning & Probabilistic Programming
SRL combines first-order logic and probabilistic inference.
It employs relational data as input, but with a focus also on learning the relations (facts, rules & weights).
Knowledge compilation for probabilistic inference, including recent techniques for "lifted inference".
Markov Logic Networks (U. Washington): grounding of weighted first-order rules over a function-free Herbrand base into an undirected graphical model (→ Markov Random Field).
Probabilistic Programming (ProbLog, KU Leuven): deductive grounding over a set of base facts into a directed graphical model (SLD proofs → Bayesian Network).
Learning Soft Deduction Rules
Inductive learning algorithm based on dynamic programming.
A-priori-style pre-filtering & pruning of low-support join patterns.
Adaptation of confidence and support measures from data mining.
Learning "interesting" rules with constants and type constraints.

Goal: inductively learn the soft rule S: livesIn(x,y) :- bornIn(x,y)
[Venn diagram: G = ground truth for livesIn (only partially known); KB = knowledge base for livesIn (known positive examples); R = facts inferred for livesIn from the body of the rule, bornIn (only partially correct).]

confidence(S) := P(Head | Body) = |Head ∧ Body| / |Body|
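For illustration (with a made-up toy KB, not the census data used in the talk), support and confidence of a candidate rule are just counts over the groundings of its body:

```python
# Toy KB; the entity names are made up for illustration.
born_in  = {("Alice", "Oslo"), ("Bob", "Kiel"), ("Carol", "Lyon"), ("Dave", "Rome")}
lives_in = {("Alice", "Oslo"), ("Bob", "Berlin"), ("Carol", "Lyon")}

# Rule S: livesIn(x, y) :- bornIn(x, y)
body = born_in                          # groundings of the rule body
head_and_body = body & lives_in         # groundings where the head also holds

support = len(head_and_body)
confidence = support / len(body)        # confidence(S) = |Head ∧ Body| / |Body|
print(support, confidence)              # 2 0.5
```

Since the knowledge base contains only known positive examples and the ground truth is only partially known, such counts merely approximate the rule's true confidence.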
Learning "Interesting" Deduction Rules (I)
Plots for the distribution of income versus quarterOfBirth and educationLevel over actual US census data from Oct. 2009 (>1 billion RDF facts).
Divergence from the "overall population" curve shows a strong correlation of income with educationLevel, but not with quarterOfBirth.
[Plots: relative frequency vs. income for the overall population and the four quarterOfBirth groups, and likewise for the educationLevel groups.]
Candidate rule bodies: income(x, y), quarterOfBirth(x, z) and income(x, y), educationLevel(x, z)
Learning "Interesting" Deduction Rules (II)
Divergence measured using Kullback-Leibler or χ² between the "overall population" and the "Nursery school to Grade 4" and "Professional school degree" groups over the discretized income domain (low, medium, high).
[Plot: relative frequency vs. income for the overall population, "Nursery school to Grade 4", and "Professional school degree".]
income(x, y) :- educationLevel(x, z)
income(x, "low") :- educationLevel(x, "Nursery school to Grade 4")
income(x, "medium") :- educationLevel(x, "Professional school degree")
income(x, "high") :- educationLevel(x, "Professional school degree")
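A small sketch of the divergence test, with hypothetical relative frequencies that only illustrate the idea:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions (same support)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical relative frequencies over the discretized income domain (low, medium, high):
overall        = [0.4, 0.4, 0.2]
grade4_only    = [0.7, 0.25, 0.05]   # "Nursery school to Grade 4"
prof_degree    = [0.1, 0.4, 0.5]     # "Professional school degree"
quarter_birth1 = [0.41, 0.39, 0.2]   # almost identical to the overall population

for name, dist in [("grade4", grade4_only), ("prof", prof_degree), ("qob-1", quarter_birth1)]:
    print(name, round(kl(dist, overall), 3))
# High divergence for the educationLevel groups, near zero for quarterOfBirth:
# rules over educationLevel are "interesting", rules over quarterOfBirth are not.
```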
Summary & Challenges (I): Web-Scale Information Extraction

[Diagram: the landscape of extraction systems along two axes, ontological rigor (names & patterns vs. entities & relations) and human effort (open-domain & unsupervised vs. domain-oriented training data/facts).
Names & patterns: <"N. Portman", "honored with", "Academy Award">, <"Jeff Bridges", "expected to win", "Oscar">, <"Bridges", "nominated for", "Academy Award">.
Entities & relations: wonAward: Person × Prize; type(Meryl_Streep, Actor); wonAward(Meryl_Streep, Academy_Award); wonAward(Natalie_Portman, Academy_Award); wonAward(Ethan_Coen, Palme_d'Or).
Systems placed on the map: TextRunner, ReadTheWeb/NELL, Probase, WebTables/FusionTables, Sofie/Prospera, StatSnowball/EntityCube, Freebase, YAGO2, DBpedia 3.8; the open-domain yet ontologically rigorous corner is still marked with a question mark.]
Summary & Challenges (II): RDF is Not Enough!
HMMs, CRFs, and PCFGs (not in this talk) yield much richer output structures than just triples.
Extraction of facts → beliefs, modifiers, modalities, etc., and of intensional knowledge ("rules").
More expressive but canonical representations of natural language: trees, graphs, objects, frames (F-Logic, KL-ONE, CycL, OWL, etc.).
All combined with structured probabilistic inference.
Summary & Challenges (III): Scalable Probabilistic Inference
"Domain-liftable" FO formula: ∀X,Y ∈ People: smokes(X) ∧ friends(X,Y) → smokes(Y)
Exact lifted inference via Weighted First-Order Model Counting (WFOMC): the probability of a query depends only on the size(s) of the domain(s), a weight function for the first-order predicates, and the weighted model count over the FO d-DNNF.
[Van den Broeck'11]: compilation rules and inference algorithms for FO d-DNNFs.
[Jha & Suciu'11]: classes of SQL queries which admit polynomial-size (propositional) d-DNNFs.
Approximate inference via belief propagation, MCMC-style sampling, etc.
Scale-out via distributed grounding & inference: TrinityRDF (MSR), GraphLab2 (MIT).
[Figure: the corresponding FO d-DNNF circuit]
Final Summary
Text is not just unstructured data.
Probabilistic databases combine first-order logic and probability theory in an elegant way.
Natural-Language-Processing people, Database guys, and Machine-Learning folks: it's about time to join forces!
References
Maximilian Dylla, Iris Miliaraki, Martin Theobald: A Temporal-Probabilistic Database Model for Information Extraction. PVLDB 6(14), 2013 (to appear).
Maximilian Dylla, Iris Miliaraki, Martin Theobald: Top-k Query Processing in Probabilistic Databases with Non-Materialized Views. ICDE 2013.
Ndapandula Nakashole, Mauro Sozio, Fabian Suchanek, Martin Theobald: Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules. VLDS 2012: 15-20.
Mohamed Yahya, Martin Theobald: D2R2: Disk-Oriented Deductive Reasoning in a RISC-Style RDF Engine. RuleML America 2011: 81-96.
Timm Meiser, Maximilian Dylla, Martin Theobald: Interactive Reasoning in Uncertain RDF Knowledge Bases. CIKM 2011: 2557-2560.
Ndapandula Nakashole, Martin Theobald, Gerhard Weikum: Scalable Knowledge Harvesting with High Precision and High Recall. WSDM 2011: 227-236.
Maximilian Dylla, Mauro Sozio, Martin Theobald: Resolving Temporal Conflicts in Inconsistent RDF Knowledge Bases. BTW 2011: 474-493.
Yafang Wang, Mohamed Yahya, Martin Theobald: Time-aware Reasoning in Uncertain Knowledge Bases. MUD 2010: 51-65.
Ndapandula Nakashole, Martin Theobald, Gerhard Weikum: Find your Advisor: Robust Knowledge Gathering from the Web. WebDB 2010.
Anish Das Sarma, Martin Theobald, Jennifer Widom: LIVE: A Lineage-Supported Versioned DBMS. SSDBM 2010: 416-433.
Anish Das Sarma, Martin Theobald, Jennifer Widom: Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases. ICDE 2008: 1023-1032.
Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, Martin Theobald, Jennifer Widom: Databases with Uncertainty and Lineage. VLDB J. 17(2): 243-264 (2008).