Gerhard Weikum, Max Planck Institute for Informatics, http://www.mpi-inf.mpg.de/~weikum/

From Information to Knowledge: Harvesting Entities and Relationships From Web Sources

Martin Theobald, Max Planck Institute for Informatics, http://www.mpi-inf.mpg.de/~mtb/

Post on 20-Dec-2015



Goal: Turn Web into Knowledge Base

comprehensive DB of human knowledge
• everything that Wikipedia knows
• everything machine-readable
• capturing entities, classes, relationships

Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009

Approach: Harvesting Facts from Web

Politician                    Political Party
Angela Merkel                 CDU
Karl-Theodor zu Guttenberg    CDU
Christoph Hartmann            FDP

Politician                    Position
Angela Merkel                 Chancellor Germany
Karl-Theodor zu Guttenberg    Minister of Defense Germany
Christoph Hartmann            Minister of Economy Saarland

Company        CEO
Google         Eric Schmidt

Company        AcquiredCompany
Google         YouTube
Yahoo          Overture
Facebook       FriendFeed
Software AG    IDS Scheer

Movie         ReportedRevenue
Avatar        $ 2,718,444,933
The Reader    $ 108,709,522

Actor              Award
Christoph Waltz    Oscar
Sandra Bullock     Oscar
Sandra Bullock     Golden Raspberry

PoliticalParty    Spokesperson
CDU               Philipp Wachholz
Die Grünen        Claudia Roth

YAGO-NAGA, IWP, Cyc, TextRunner, ReadTheWeb

Knowledge as Enabling Technology

• entity recognition & disambiguation
• understanding natural language & speech
• knowledge services & reasoning for semantic apps (e.g. deep QA)
• semantic search: precise answers to advanced queries (by scientists, students, journalists, analysts, etc.)

Indy 500 winners who are still alive?

US president when Barack Obama was born?

Politicians who are also scientists?

Relationship between Angela Merkel, Jim Gray, Dalai Lama?

Enzymes that inhibit HIV? Influenza drugs for teens with high blood pressure?

...

Knowledge Search (1)

http://www.wolframalpha.com

Who was US president when Barack Obama was born?

Who was mayor of Indianapolis when Barack Obama was born?

not enough facts in KB!

Knowledge Search (2)

http://www.google.com/squared/

Indy 500 winners?

Indy 500 winners from Europe?

no types, no inference!

Related Work

[Figure: map of related systems across three communities: ontologies (Semantic Web), information extraction (Statistical Web), and communities (Social Web).]

Systems named in the figure: Kylin/KOG, Cyc, Freebase, Cimple/DBlife, UIMA, DBpedia, YAGO-NAGA, StatSnowball/EntityCube, Avatar, System T, Powerset, START, Answers, SWSE, Hakia, TextRunner, TrueKnowledge, WolframAlpha, Text2Onto, sig.ma, kosmix, KnowItAll, ReadTheWeb, GoogleSquared, IWP, WebTables, WorldWideTables, PSOX, EntityRank, Cazoodle

Outline

What and Why

Framework

Entities and Classes

Relationships

Temporal Knowledge

Wrap-up

Framework: Types of Knowledge

• facts / assertions:
  bornIn (JohnDillinger, Indianapolis), hasWon (JimGray, TuringAward), …
• taxonomic:
  instanceOf (JohnDillinger, bankRobbers), subclassOf (bankRobbers, criminals), …
• lexical / terminology:
  means (“Big Apple”, NewYorkCity), means (“Big Mike”, MichaelStonebraker),
  means (“MS”, Microsoft), means (“MS”, MultipleSclerosis), …
• common-sense properties:
  apples are green, red, juicy, sweet, sour, … but not fast, smart, …
  balls are round, smooth, slippery, … but not square, funny, …
• common-sense axioms:
  ∀x: human(x) ⇒ male(x) ∨ female(x)
  ∀x: (male(x) ⇒ ¬female(x)) ∧ (female(x) ⇒ ¬male(x))
  ∀x: animal(x) ⇒ (hasLegs(x) ⇒ isEven(numberOfLegs(x)))
  …
• procedural: how to fix/install/prepare/remove …
• epistemic / beliefs:
  believes (Ptolemy, shape(Earth, disc)), believes (Copernicus, shape(Earth, sphere)), …

Framework: Information Extraction (IE)

source-centric IE (one source): 1) recall! 2) precision

Surajit obtained his PhD in CS from Stanford University under the supervision of Prof. Jeff Ullman. He later joined HP and worked closely with Umesh Dayal …

instanceOf (Surajit, scientist)
inField (Surajit, computer science)
hasAdvisor (Surajit, Jeff Ullman)
almaMater (Surajit, Stanford U)
workedFor (Surajit, HP)
friendOf (Surajit, Umesh Dayal)
…

yield-centric harvesting (many sources): 1) precision! 2) recall; near-human quality!

hasAdvisor
Student              Advisor
Surajit Chaudhuri    Jeffrey Ullman
Alon Halevy          Jeffrey Ullman
Jim Gray             Mike Harrison
…                    …

almaMater
Student              University
Surajit Chaudhuri    Stanford U
Alon Halevy          Stanford U
Jim Gray             UC Berkeley
…                    …
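As a rough illustration of source-centric extraction from one text, the sketch below pulls two of the facts above out of the example sentence. The two regular expressions are hypothetical, hand-written patterns chosen only for this sentence; a real extractor would rely on proper NLP tooling.

```python
import re

# Hypothetical surface patterns, written only for the example sentence.
ALMA_MATER = re.compile(
    r"(?P<x>[A-Z]\w+) obtained his PhD in \w+ from (?P<y>[A-Z]\w+(?: [A-Z]\w+)*)"
)
HAS_ADVISOR = re.compile(
    r"under the supervision of Prof\. (?P<y>[A-Z]\w+ [A-Z]\w+)"
)

def extract(text):
    """Return (relation, subject, object) triples found in one text."""
    facts = []
    subject = None
    m = ALMA_MATER.search(text)
    if m:
        subject = m.group("x")
        facts.append(("almaMater", subject, m.group("y")))
    m = HAS_ADVISOR.search(text)
    if m and subject:
        # The advisor phrase has no explicit subject; reuse the one found above.
        facts.append(("hasAdvisor", subject, m.group("y")))
    return facts

text = ("Surajit obtained his PhD in CS from Stanford University "
        "under the supervision of Prof. Jeff Ullman.")
facts = extract(text)
```

Patterns this specific give good precision on one source but poor recall elsewhere, which is the trade-off the two harvesting styles weigh differently.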

Framework: Knowledge Representation

• RDF (Resource Description Framework, W3C): subject-property-object (SPO) triples, binary relations; structure, but no (prescriptive) schema
• Relations, frames
• Description logics: OWL, DL-lite
• Higher-order logics, epistemic logics

facts (RDF triples):
1: (JimGray, hasAdvisor, MikeHarrison)
2: (SurajitChaudhuri, hasAdvisor, JeffUllman)
3: (Madonna, marriedTo, GuyRitchie)
4: (NicolasSarkozy, marriedTo, CarlaBruni)

facts about facts:
5: (1, inYear, 1968)
6: (2, inYear, 2006)
7: (3, validFrom, 22-Dec-2000)
8: (3, validUntil, Nov-2008)
9: (4, validFrom, 2-Feb-2008)
10: (2, source, SigmodRecord)

temporal & provenance annotations can refer to reified facts via fact identifiers (approx. equiv. to RDF quadruples: “Color” Sub Prop Obj)

http://www.mpi-inf.mpg.de/yago-naga/
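The reification idea (facts about facts, addressed by fact identifier) can be sketched with a plain dictionary. This is only an illustration of the data model, not YAGO's actual storage format; the helper name `annotations` is made up here.

```python
# Fact id -> (subject, property, object). A subject that is an int
# refers to another fact by its id (a reified fact).
facts = {
    1: ("JimGray", "hasAdvisor", "MikeHarrison"),
    2: ("SurajitChaudhuri", "hasAdvisor", "JeffUllman"),
    3: ("Madonna", "marriedTo", "GuyRitchie"),
    4: ("NicolasSarkozy", "marriedTo", "CarlaBruni"),
    5: (1, "inYear", 1968),
    6: (2, "inYear", 2006),
    7: (3, "validFrom", "22-Dec-2000"),
    8: (3, "validUntil", "Nov-2008"),
    9: (4, "validFrom", "2-Feb-2008"),
    10: (2, "source", "SigmodRecord"),
}

def annotations(fact_id):
    """Collect temporal/provenance annotations attached to one fact."""
    return {p: o for (s, p, o) in facts.values()
            if isinstance(s, int) and s == fact_id}
```

For example, `annotations(3)` gathers the validity interval of the Madonna/GuyRitchie fact.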

KB's: Example YAGO (F. Suchanek et al.: WWW'07)

[Figure: excerpt of the YAGO knowledge graph. Nodes include the entities Max_Planck, Erwin_Planck, Angela Merkel, Kiel, Schleswig-Holstein, Germany, Max_Planck Society, and Nobel Prize, and the classes Entity, Person, Scientist, Physicist, Biologist, Politician, Organization, Location, City, State, and Country. Edge labels: instanceOf, subclass, bornOn, diedOn, bornIn, fatherOf, hasWon, citizenOf, locatedIn, and means (with weights, e.g. 0.9 and 0.1). Dates shown: Apr 23, 1858; Oct 4, 1947; Oct 23, 1944. Name strings: “Max Planck”, “Max Karl Ernst Ludwig Planck”, “Angela Merkel”, “Angela Dorothea Merkel”.]

Accuracy 95%

2 Mio. entities, 20 Mio. facts, 40 Mio. RDF triples (entity1-relation-entity2, subject-predicate-object)

http://www.mpi-inf.mpg.de/yago-naga/

KB‘s: Example DBpedia (Auer, Bizer, et al.: ISWC‘07)

• 3 Mio. entities
• 1 Bio. facts (RDF triples)
• 1.5 Mio. entities mapped to hand-crafted taxonomy of 259 classes with 1200 properties

http://www.dbpedia.org


Entities & Classes

Which entity types (classes, unary predicates) are there?

scientists, doctoral students, computer scientists, …
female humans, male humans, married humans, …

Which subsumptions should hold (subclass/superclass, hyponym/hypernym, inclusion dependencies)?

subclassOf (computer scientists, scientists),
subclassOf (scientists, humans), …

Which individual entities belong to which classes?

instanceOf (Surajit Chaudhuri, computer scientists),
instanceOf (Barbara Liskov, computer scientists),
instanceOf (Barbara Liskov, female humans), …

Which names denote which entities?

means (“Lady Di”, Diana Spencer),
means (“Diana Frances Mountbatten-Windsor”, Diana Spencer), …
means (“Madonna”, Madonna Louise Ciccone),
means (“Madonna”, Madonna (painting by Edvard Munch)), …
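To make the interplay of instanceOf and subclassOf concrete, here is a minimal sketch that derives all classes of an entity by walking up the taxonomy. It assumes, for simplicity, that every class has at most one direct superclass; the edge subclassOf (female humans, humans) is an assumption added for the example.

```python
# Direct superclass of each class (single-parent taxonomy assumed).
subclass_of = {
    "computer scientists": "scientists",
    "scientists": "humans",
    "female humans": "humans",   # assumed edge, not listed above
}

# Direct class memberships of individual entities.
instance_of = {
    "Surajit Chaudhuri": {"computer scientists"},
    "Barbara Liskov": {"computer scientists", "female humans"},
}

def superclasses(cls):
    """All transitive superclasses of cls, nearest first."""
    result = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        result.append(cls)
    return result

def all_classes(entity):
    """Direct classes of an entity plus everything they are subsumed by."""
    classes = set()
    for c in instance_of.get(entity, ()):
        classes.add(c)
        classes.update(superclasses(c))
    return classes
```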

WordNet Thesaurus [Miller/Fellbaum 1998]

http://wordnet.princeton.edu/

3 concepts / classes & their synonyms (synsets)

subclasses (hyponyms), superclasses (hypernyms)

scientist, man of science (a person with advanced knowledge)
  => cosmographer, cosmographist
  => biologist, life scientist
  => chemist
  => cognitive scientist
  => computer scientist
  ...
  => principal investigator, PI
  …
  HAS INSTANCE
  => Bacon, Roger Bacon
  …

but: only few individual entities (instances of classes)

> 100 000 classes and lexical relations; can be cast into
• description logics or
• graph, with weights for relation strengths (derived from co-occurrence statistics)

Tapping on Wikipedia Categories


Mapping: Wikipedia → WordNet [Suchanek: WWW'07, Ponzetto & Strube: AAAI'07]

[Figure: the Wikipedia entity Jim Gray (computer specialist) with its categories (American Computer Scientists, Database Researcher, Fellows of the ACM, People Lost at Sea) and higher-level categories (Computer Scientists by Nation, Databases, ACM, Members of Learned Societies, Engineering Societies), to be linked via instanceOf and subclassOf to WordNet classes (Computer Scientist, Scientist, American, Chemist, Artist, Sailor/Crewman, Missing Person, Database, Fellow (1)/Comrade, Fellow (2)/Colleague, Fellow (3) (of Society), Member (1)/Fellow, Member (2)/Extremity). Question marks indicate ambiguous candidate links.]

name similarity (edit dist., n-gram overlap)?
context similarity (word/phrase level)?
machine learning?

Mapping: Wikipedia → WordNet [Suchanek: WWW'07, Ponzetto & Strube: AAAI'07]

Given: entity e in Wikipedia categories c1, …, ck

Wanted: instanceOf(e,c) and subclassOf(ci,c) for WN class c

Problem: vagueness & ambiguity of names c1, …, ck

Analyzing category names: noun group parser

pre-modifier         head         post-modifier
American             Musicians    of Italian Descent
American Folk        Music        of the 20th Century
American Indy 500    Drivers      on Pole Positions

Head word is key, should be in plural for instanceOf

Mapping Wikipedia Entities to WordNet Classes [Suchanek: WWW'07, Ponzetto & Strube: AAAI'07]

Heuristic Method:
for each ci do
  if head word w of category name ci is plural {
    1) match w against synsets of WordNet classes
    2) choose best fitting class c and set e ∈ c
    3) expand w by pre-modifier and set ci ⊆ w+ ⊆ c
  }

• can also derive features this way
• feed into supervised classifier

tuned conservatively: high precision, reduced recall
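The first step of the heuristic, finding the head word of a category name and checking that it is plural, might look roughly as follows. This is a simplified sketch: the preposition list and the plural test are crude assumptions standing in for a real noun group parser, and the WordNet matching step is left out.

```python
# Assumption: the post-modifier starts at the first of these prepositions.
PREPOSITIONS = {"of", "on", "in", "by", "from", "at", "for"}

def split_category(name):
    """Split a category name into (pre-modifier, head, post-modifier)."""
    words = name.split()
    # Index of the first preposition, or end of the name if there is none.
    cut = next((i for i, w in enumerate(words) if w.lower() in PREPOSITIONS),
               len(words))
    if cut == 0:
        return "", "", name          # no head word before the preposition
    return " ".join(words[:cut - 1]), words[cut - 1], " ".join(words[cut:])

def is_plural(word):
    # Crude stand-in for real morphological analysis.
    w = word.lower()
    return w.endswith("s") and not w.endswith("ss")

def class_candidate(name):
    """Head word to match against WordNet synsets, or None if not plural."""
    _, head, _ = split_category(name)
    return head if is_plural(head) else None
```

Under these assumptions, “American Folk Music of the 20th Century” yields no class candidate because its head word is singular, so the entity is not treated as an instance of that category.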

Learning More Mappings [ Wu & Weld: WWW‘08 ]

Kylin Ontology Generator (KOG): learn classifier for subclassOf across Wikipedia & WordNet using
• YAGO as training data
• advanced ML methods (MLN's, SVM's)
• rich features from various sources
  • category/class name similarity measures
  • category instances and their infobox templates: template names, attribute names (e.g. knownFor)
  • Wikipedia edit history: refinement of categories
  • Hearst patterns: C such as X, X and Y and other C's, …
  • other search-engine statistics: co-occurrence frequencies

> 3 Mio. entities, > 1 Mio. w/ infoboxes, > 500 000 categories

Goal: Comprehensive & Consistent!

[Figure: Wikipedia entities (Jim Gray (computer specialist), Madonna (entertainer), Jeffrey Ullman, Bob Dylan, …) connected to a tangle of Wikipedia categories (e.g. American Computer Scientists, Database Researcher, Fellows of the ACM, Databases, Members of Learned Societies, Bell Labs, Princeton Alumni, Knuth Prize Laureate, American People by Occupation, World Record Holders, American Songwriters, Hall of Fame Inductees, U Michigan Alumni, Guitar Players, Americans of Italian Descent, People by Status, Computer Data, Telecomm. History), infobox attributes (Born, KnownFor, AlmaMater, NotableAwards, DoctoralStudents, Genres, YearsActive, AlsoKnownAs, Website), and classes (Artist, Singer, Italian, American, Musician, AwardWinner, Scientist, Academic, Athlete, Fellow (1), Fellow (2)).]

Clean up the mess:
• graph algorithms?
  • random walk with restart
  • dense subgraphs …
• statistical machine learning?
• logical consistency reasoning?
• gigantic schema integration?
  • ontology merging

Long Tail of Class Instances [Etzioni et al. 2004, Cohen et al. 2008, Mitchell et al. 2010]

State-of-the-Art Approach (e.g. SEAL):
• Start with seeds: a few class instances
• Find lists, tables, text snippets (“for example: …”), … that contain one or more seeds
• Extract candidates: noun phrases from vicinity
• Gather co-occurrence stats (seed&cand, cand&className pairs)
• Rank candidates
  • point-wise mutual information, …
  • random walk (PR-style) on seed-cand graph

But:
• Precision drops for classes with sparse statistics (DB profs, …)
• Harvested items are names, not entities
• Canonicalization (de-duplication) unsolved
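The PMI-based ranking step can be sketched as below. All counts are invented for illustration, and the candidate names are arbitrary; the point is only that a candidate which co-occurs with a seed more often than chance gets a high score.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Point-wise mutual information: log( P(x,y) / (P(x) * P(y)) )."""
    return math.log((count_xy * total) / (count_x * count_y))

# Hypothetical statistics over 1000 snippets; the seed occurs in 50 of them.
total = 1000
seed_count = 50
# candidate -> (own occurrence count, co-occurrence count with the seed)
candidates = {
    "Jeffrey Ullman": (40, 20),
    "Paris": (200, 5),
}

ranked = sorted(
    candidates,
    key=lambda c: pmi(candidates[c][1], seed_count, candidates[c][0], total),
    reverse=True,
)
```

With sparse counts the estimate becomes unstable, which is one reason precision drops for long-tail classes.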

Individual Entity Disambiguation

Names: “Penn”, “U Penn”, “Penn State”, “PSU”

Entities: University of Pennsylvania, Pennsylvania State University, Pennsylvania (US State), Sean Penn, Passenger Service Unit

Names → Entities: ??

• ill-defined with zero context
• known as record linkage for names in record fields
• Wikipedia offers rich candidate mappings: disambiguation pages, re-directs, inter-wiki links, anchor texts of href links

Collective Entity Disambiguation

• Consider a set of names {n1, n2, …} in same context and sets of candidate entities E1 = {e11, e12, …}, E2 = {e21, e22, …}, …
• Define joint objective function (e.g. likelihood for prob. model) that rewards coherence of mappings ni → eij
• Solve optimization problem

[McCallum 2003, Doan 2005, Getoor 2006, Domingos 2007, Chakrabarti 2009, …]

Example names: “Stuart Russell”, “Michael Jordan”

Candidate entities:
Stuart Russell (computer scientist)
Stuart Russell (DJ)
Michael Jordan (computer scientist)
Michael Jordan (NBA)
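A toy version of the joint objective: score every combination of candidate entities by per-name prior plus pairwise coherence, and keep the best. The prior and coherence numbers are invented for illustration, and exhaustive enumeration only works at this tiny scale.

```python
from itertools import combinations, product

candidates = {
    "Stuart Russell": ["Stuart Russell (computer scientist)", "Stuart Russell (DJ)"],
    "Michael Jordan": ["Michael Jordan (computer scientist)", "Michael Jordan (NBA)"],
}

# Hypothetical popularity priors for each candidate entity.
prior = {
    "Stuart Russell (computer scientist)": 0.6,
    "Stuart Russell (DJ)": 0.4,
    "Michael Jordan (computer scientist)": 0.2,
    "Michael Jordan (NBA)": 0.8,
}

# Hypothetical coherence between entity pairs; missing pairs score 0.
coherence = {
    frozenset({"Stuart Russell (computer scientist)",
               "Michael Jordan (computer scientist)"}): 1.0,
}

def score(assignment):
    """Joint objective: sum of priors plus pairwise coherence."""
    total = sum(prior[e] for e in assignment)
    total += sum(coherence.get(frozenset(pair), 0.0)
                 for pair in combinations(assignment, 2))
    return total

best = max(product(*candidates.values()), key=score)
```

Resolved in isolation, “Michael Jordan” would map to the NBA player (higher prior); the coherence term flips it to the computer scientist when “Stuart Russell” appears in the same context.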

Problems and Challenges

Wikipedia categories reloaded
comprehensive & consistent instanceOf and subClassOf across Wikipedia and WordNet (via consistency reasoning?)

Long tail of entities
discover new entities, detect new names for known entities
beyond Wikipedia: domain-specific entity catalogs

Robust disambiguation
near-real-time mapping of names to entities with near-human quality

Tags, tables, topics
tap on other sources: Web2.0, Web tables, directories, etc.


Relationships

Which instances (pairs of individual entities) are there for given binary relations with specific type signatures?

hasAdvisor (JimGray, MikeHarrison)
hasAdvisor (HectorGarcia-Molina, Gio Wiederhold)
hasAdvisor (Susan Davidson, Hector Garcia-Molina)
graduatedAt (JimGray, Berkeley)
graduatedAt (HectorGarcia-Molina, Stanford)
hasWonPrize (JimGray, TuringAward)
bornOn (JohnLennon, 9Oct1940)
diedOn (JohnLennon, 8Dec1980)
marriedTo (JohnLennon, YokoOno)

Which additional & interesting relation types are there between given classes of entities?

competedWith(x,y), nominatedForPrize(x,y), …
divorcedFrom(x,y), affairWith(x,y), …
assassinated(x,y), rescued(x,y), admired(x,y), …

Picking Low-Hanging Fruit (First)

Deterministic Pattern Matching

[Kushmerick 97, Califf & Mooney 99, Gottlob 01, …]

• Regular expressions matching
• Wrapper induction (grammar learning for restricted regular languages)
• Well understood

French Marriage Problem

facts in KB:
married (Hillary, Bill)
married (Carla, Nicolas)
married (Angelina, Brad)

new facts or fact candidates:
married (Cecilia, Nicolas)
married (Carla, Benjamin)
married (Carla, Mick)
married (Michelle, Barack)
married (Yoko, John)
married (Kate, Leonardo)
married (Carla, Sofie)
married (Larry, Google)

1) for recall: pattern-based harvesting
2) for precision: consistency reasoning

Pattern-Based Harvesting (Hearst 92, Brin 98, Agichtein 00, Etzioni 04, …)

Facts & Fact Candidates:
(Hillary, Bill)
(Carla, Nicolas)
(Angelina, Brad)
(Victoria, David)
(Yoko, John)
(Carla, Benjamin)
(Larry, Google)
(Kate, Pete)

Patterns:
X and her husband Y
X and Y on their honeymoon
X and Y and their children
X has been dating with Y
X loves Y
…

• good for recall
• noisy, drifting
• not robust enough for high precision
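One round of the facts-to-patterns-to-candidates loop can be sketched as follows. The five-sentence corpus is made up, and patterns are taken to be the literal text between the two arguments, which is a deliberate simplification of what real harvesters do.

```python
# Hypothetical mini-corpus; each sentence is "X <pattern> Y".
corpus = [
    "Hillary and her husband Bill",
    "Carla and her husband Nicolas",
    "Angelina and her husband Brad",
    "Carla loves Nicolas",
    "Larry loves Google",
]
seeds = {("Hillary", "Bill"), ("Carla", "Nicolas")}

def induce_patterns(sentences, facts):
    """Text between X and Y in sentences that mention a known fact."""
    patterns = set()
    for s in sentences:
        for x, y in facts:
            if s.startswith(x + " ") and s.endswith(" " + y):
                patterns.add(s[len(x):len(s) - len(y)])
    return patterns

def apply_patterns(sentences, patterns):
    """Pairs (X, Y) found wherever a pattern occurs between two strings."""
    found = set()
    for s in sentences:
        for p in patterns:
            if p in s:
                x, _, y = s.partition(p)
                if x and y:
                    found.add((x, y))
    return found

patterns = induce_patterns(corpus, seeds)
new_candidates = apply_patterns(corpus, patterns) - seeds
```

The pattern “loves”, induced from the correct fact (Carla, Nicolas), also produces the spurious candidate (Larry, Google), which is the kind of noise and drift noted above.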

Reasoning about Fact Candidates

Use consistency constraints to prune false candidates

ground atoms:
spouse(Hillary,Bill), spouse(Carla,Nicolas), spouse(Cecilia,Nicolas), spouse(Carla,Ben), spouse(Carla,Mick), spouse(Carla,Sofie)
f(Hillary), f(Carla), f(Cecilia), f(Sofie)
m(Bill), m(Nicolas), m(Ben), m(Mick)

FOL rules (restricted):
spouse(x,y) ∧ diff(y,z) ⇒ ¬spouse(x,z)
spouse(x,y) ∧ diff(w,y) ⇒ ¬spouse(w,y)
spouse(x,y) ⇒ f(x)
spouse(x,y) ⇒ m(y)
spouse(x,y) ⇒ (f(x) ∧ m(y)) ∨ (m(x) ∧ f(y))

Rules reveal inconsistencies → find consistent subset(s) of atoms (“possible world(s)”, “the truth”)

Rules can be weighted (e.g. by fraction of ground atoms that satisfy a rule) → uncertain / probabilistic data → compute prob. distr. of subset of atoms being the truth
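As a small illustration of how such rules expose inconsistencies, the sketch below applies the type rules spouse(x,y) ⇒ f(x) and spouse(x,y) ⇒ m(y) as a filter and then lists candidate pairs that violate the two mutual-exclusion rules. It only detects conflicts; choosing which atom of a conflicting pair to keep is the actual reasoning problem.

```python
candidates = [
    ("Hillary", "Bill"), ("Carla", "Nicolas"), ("Cecilia", "Nicolas"),
    ("Carla", "Ben"), ("Carla", "Mick"), ("Carla", "Sofie"),
]
female = {"Hillary", "Carla", "Cecilia", "Sofie"}
male = {"Bill", "Nicolas", "Ben", "Mick"}

def type_ok(x, y):
    """spouse(x,y) => f(x) and spouse(x,y) => m(y)."""
    return x in female and y in male

def conflicts(a, b):
    """Two distinct spouse atoms that share the first or second argument."""
    return a != b and (a[0] == b[0] or a[1] == b[1])

typed = [c for c in candidates if type_ok(*c)]
conflicting_pairs = [
    (a, b)
    for i, a in enumerate(typed)
    for b in typed[i + 1:]
    if conflicts(a, b)
]
```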

Markov Logic Networks (MLN's) (M. Richardson / P. Domingos 2006)

Map logical constraints & fact candidates into probabilistic graph model: Markov Random Field (MRF)

Rules:
s(x,y) ⇒ m(y)
s(x,y) ⇒ f(x)
s(x,y) ∧ diff(y,z) ⇒ ¬s(x,z)
s(x,y) ∧ diff(w,y) ⇒ ¬s(w,y)
f(x) ⇒ ¬m(x)
m(x) ⇒ ¬f(x)

Fact candidates:
s(Carla,Nicolas), s(Cecilia,Nicolas), s(Carla,Ben), s(Carla,Sofie), …

Grounding (Literal → Boolean Var):
s(Ca,Nic) ⇒ ¬s(Ce,Nic)
s(Ca,Nic) ⇒ ¬s(Ca,Ben)
s(Ca,Nic) ⇒ ¬s(Ca,So)
s(Ca,Ben) ⇒ ¬s(Ca,So)
s(Ca,Nic) ⇒ m(Nic)
s(Ce,Nic) ⇒ m(Nic)
s(Ca,Ben) ⇒ m(Ben)
s(Ca,So) ⇒ m(So)

MRF (Literal → binary RV) with nodes m(Ben), m(Nic), m(So), s(Ca,Nic), s(Ce,Nic), s(Ca,Ben), s(Ca,So)

RVs coupled by MRF edge if they appear in same clause

MRF assumption: P[Xi|X1..Xn] = P[Xi|N(Xi)]

joint distribution has product form over all cliques

Variety of algorithms for joint inference: Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, …
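For reference, the product form is usually written for MLNs as the log-linear distribution below. The symbols are the standard ones from the MLN literature rather than from these slides: $w_i$ is the weight of rule $i$, $n_i(x)$ is the number of groundings of rule $i$ that are true in world $x$, and $Z$ is the normalization constant.

```latex
\[
  P(X = x) \;=\; \frac{1}{Z}\,\exp\Big(\sum_i w_i\, n_i(x)\Big),
  \qquad
  Z \;=\; \sum_{x'} \exp\Big(\sum_i w_i\, n_i(x')\Big)
\]
```

A world that violates fewer (or lower-weight) groundings is more probable; a hard constraint corresponds to the limit of an infinite weight.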

Related Alternative Probabilistic Models

Constrained Conditional Models [D. Roth et al. 2007]
• log-linear classifiers with constraint-violation penalty
• mapped into Integer Linear Programs

Factor Graphs with Imperative Variable Coordination [A. McCallum et al. 2008]
• RV's share “factors” (joint feature functions)
• generalizes MRF, BN, CRF, …
• inference via advanced MCMC
• flexible coupling & constraining of RV's

software tools:
alchemy.cs.washington.edu
code.google.com/p/factorie/
research.microsoft.com/en-us/um/cambridge/projects/infernet/

Reasoning for KB Growth: Direct Route

facts in KB:
married (Hillary, Bill)
married (Carla, Nicolas)
married (Angelina, Brad)

new fact candidates:
married (Cecilia, Nicolas)
married (Carla, Benjamin)
married (Carla, Mick)
married (Carla, Sofie)
married (Larry, Google)

patterns:
X and her husband Y
X and Y and their children
X has been dating with Y
X loves Y

(F. Suchanek et al.: WWW'09)
www.mpi-inf.mpg.de/yago-naga/sofie/

Direct approach:
• facts are true; fact candidates & patterns → hypotheses
• grounded constraints → clauses with hypotheses as vars
• cast into Weighted Max-Sat with weights from pattern stats
• customized approximation algorithm
• unifies: fact cand consistency, pattern goodness, entity disambig.
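A toy Weighted Max-Sat instance shows the shape of the reduction. The weights are invented, all three spouse atoms are treated as hypotheses, and exhaustive search replaces the customized approximation algorithm, so this illustrates the formulation rather than the SOFIE method itself.

```python
from itertools import product

hypotheses = ["s(Carla,Nicolas)", "s(Carla,Benjamin)", "s(Carla,Mick)"]

# Soft unit clauses: each hypothesis with a hypothetical pattern-derived weight.
weights = {
    "s(Carla,Nicolas)": 0.9,
    "s(Carla,Benjamin)": 0.3,
    "s(Carla,Mick)": 0.4,
}

# Mutual-exclusion clauses (not a) or (not b) for every pair of hypotheses.
exclusions = [(a, b) for i, a in enumerate(hypotheses) for b in hypotheses[i + 1:]]
EXCLUSION_WEIGHT = 1.0   # assumed weight of each exclusion clause

def satisfied_weight(world):
    """Total weight of all clauses satisfied by a truth assignment."""
    total = sum(w for h, w in weights.items() if world[h])
    total += sum(EXCLUSION_WEIGHT for a, b in exclusions
                 if not (world[a] and world[b]))
    return total

worlds = (dict(zip(hypotheses, bits))
          for bits in product([False, True], repeat=len(hypotheses)))
best = max(worlds, key=satisfied_weight)
```

Enumerating all 2^n assignments is only feasible for a handful of variables, which is why approximation algorithms are needed at KB scale.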

Facts & Patterns Consistency

constraints to connect facts, fact candidates, patterns (F. Suchanek et al.: WWW'09)

functional dependencies:
spouse(X,Y): X → Y, Y → X

relation properties:
asymmetry, transitivity, acyclicity, …

type constraints, inclusion dependencies:
spouse ⊆ Person × Person
capitalOfCountry ⊆ cityOfCountry

domain-specific constraints:
bornInYear(x) + 10years ≤ graduatedInYear(x)
hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t

pattern-fact duality:
occurs(p,x,y) ∧ expresses(p,R) ⇒ R(x,y)
occurs(p,x,y) ∧ R(x,y) ⇒ expresses(p,R)

name(-in-context)-to-entity mapping:
¬means(n,e1) ∨ ¬means(n,e2) ∨ …

www.mpi-inf.mpg.de/yago-naga/sofie/

Soft Rules vs. Hard Constraints

Enforce FD's (mutual exclusion) as hard constraints:

hasAdvisor(x,y) ∧ diff(y,z) ⇒ ¬hasAdvisor(x,z)

combine with weighted constraints → no longer MaxSat → constrained MaxSat instead

Generalize to other forms of constraints:

hard constraint:
hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t

soft constraint:
firstPaper(x,p) ∧ firstPaper(y,q) ∧ author(p,x) ∧ author(p,y) ∧ inYear(p) > inYear(q) + 5years ⇒ hasAdvisor(x,y)

open issue for arbitrary constraints → rethink reasoning!

Problems and Challenges

High precision & high recall at affordable cost
robust pattern analysis & reasoning
parallel processing, lazy / lifted inference, …

Types and constraints
explore & understand different families of constraints
soft rules & hard constraints, rich DL, beyond CWA

Declarative, self-optimizing workflows
incorporate pattern & reasoning steps into IE queries/programs

Scale, dynamics, life-cycle
grow & maintain KB with near-human-quality over long periods

Open-domain knowledge harvesting
turn names, phrases & table cells into entities & relations


Temporal Knowledge

Which facts for given relations hold at what time point or during which time intervals?

marriedTo (Madonna, Guy) [ 22Dec2000, Dec2008 ]
capitalOf (Berlin, Germany) [ 1990, now ]
capitalOf (Bonn, Germany) [ 1949, 1989 ]
hasWonPrize (JimGray, TuringAward) [ 1998 ]
graduatedAt (HectorGarcia-Molina, Stanford) [ 1979 ]
graduatedAt (SusanDavidson, Princeton) [ Oct 1982 ]
hasAdvisor (SusanDavidson, HectorGarcia-Molina) [ Oct 1982, forever ]

How can we query & reason on entity-relationship facts in a “time-travel” manner, with uncertain/incomplete KB?

US president when Barack Obama was born?
students of Hector Garcia-Molina while he was at Princeton?

French Marriage Problem

facts in KB:
1: married (Hillary, Bill)
2: married (Carla, Nicolas)
3: married (Angelina, Brad)

validFrom (2, 2008)

new fact candidates:
4: married (Cecilia, Nicolas)
5: married (Carla, Benjamin)
6: married (Carla, Mick)
7: divorced (Madonna, Guy)
8: domPartner (Angelina, Brad)

validFrom (4, 1996)
validUntil (4, 2007)
validFrom (5, 2010)
validFrom (6, 2006)
validFrom (7, 2008)

Challenge: Temporal Knowledge

for all people in Wikipedia (100,000's) gather all spouses, incl. divorced & widowed, and corresponding time periods!
>95% accuracy, >95% coverage, in one night

consistency constraints are potentially helpful:
• functional dependencies: husband, time → wife
• inclusion dependencies: marriedPerson ⊆ adultPerson
• age/time/gender restrictions: birthdate + Δ < marriage < divorce

1) recall: gather temporal scopes for base facts
2) precision: reason on mutual consistency

Difficult Dating

(Even More Difficult) Implicit Dating
explicit dates vs. implicit dates relative to other dates

(Even More Difficult) Relative Dating
vague dates, relative dates, narrative text, relative order

TARSQI: Extracting Time Annotations

Hong Kong is poised to hold the first election in more than half <TIMEX3 tid="t3" TYPE="DURATION" VAL="P100Y">a century</TIMEX3> that includes a democracy advocate seeking high office in territory controlled by the Chinese government in Beijing. A pro-democracy politician, Alan Leong, announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations to appear on the ballot to become the territory’s next chief executive. But he acknowledged that he had no chance of beating the Beijing-backed incumbent, Donald Tsang, who is seeking re-election. Under electoral rules imposed by Chinese officials, only 796 people on the election committee – the bulk of them with close ties to mainland China – will be allowed to vote in the <TIMEX3 tid="t5" TYPE="DATE" VAL="20070325">March 25</TIMEX3> election. It will be the first contested election for chief executive since Britain returned Hong Kong to China in <TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>. Mr. Tsang, an able administrator who took office during the early stages of a sharp economic upturn in <TIMEX3 tid="t7" TYPE="DATE" VAL="2005">2005</TIMEX3>, is popular with the general public. Polls consistently indicate that three-fifths of Hong Kong’s people approve of the job he has been doing. It is of course a foregone conclusion – Donald Tsang will be elected and will hold office for <TIMEX3 tid="t9" beginPoint="t0" endPoint="t8" TYPE="DURATION" VAL="P5Y">another five years</TIMEX3>, said Mr. Leong, the former chairman of the Hong Kong Bar Association.

(M. Verhagen et al.: ACL'05)
http://www.timeml.org/site/tarsqi/

extraction errors

Representing Time: AI Perspective

• Instant
  - durationless piece of time
• Period
  - potentially unbounded continuum of instants
• Events
  - time as a sequence of events E
  - precedence and overlap relations on E × E

[Allen 1984, Allen & Hayes 1989, …]

Relations between Time Periods

A Before B B After A

A Meets B B MetBy A

A Overlaps B B OverlappedBy A

A Starts B B StartedBy A

A During B B Contains A

A Finishes B B FinishedBy A

A Equal B
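The thirteen relations can be decided from interval endpoints alone. The sketch below assumes proper intervals (start < end) given as (start, end) pairs and compares endpoints as points on a continuous time line; the function name is made up for this example.

```python
# Inverse names for the six asymmetric base relations listed above.
INVERSE = {
    "Before": "After", "Meets": "MetBy", "Overlaps": "OverlappedBy",
    "Starts": "StartedBy", "During": "Contains", "Finishes": "FinishedBy",
}

def allen_relation(a, b):
    """Name of the relation that holds between intervals a and b."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "Before"
    if a2 == b1:
        return "Meets"
    if a1 == b1 and a2 == b2:
        return "Equal"
    if a1 == b1 and a2 < b2:
        return "Starts"
    if a2 == b2 and a1 > b1:
        return "Finishes"
    if a1 > b1 and a2 < b2:
        return "During"
    if a1 < b1 < a2 < b2:
        return "Overlaps"
    # None of the seven base cases holds, so the inverse of one of them must.
    return INVERSE[allen_relation(b, a)]
```

For instance, `allen_relation((1, 4), (3, 6))` falls through to the "Overlaps" case, and swapping the arguments yields "OverlappedBy".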


Representing Time: DB Perspective

• Time point: smallest time unit of fixed duration/granularity (e.g., a day, a year, a second)
• Interval: finite set of time points
• State relation: fact holds at every time point within interval
  isCapitalOf (Bonn, Germany) [1949, 1989]
• Event relation: fact holds at exactly one time point within interval
  wonCup (United, ChampionsLeague) [1999, 1999]

intervals can also capture uncertainty of time points

Uncertainty and Time

• Point-probabilities for facts and intervals
  playsFor(Beckham, United) [1990, 2005]:0.9
  - fact valid in interval [tb, te] with prob. p
  - fact not valid with prob. 1-p
• Continuous distributions
  playsFor(Beckham, United) [1990, 2005]:Gauss(µ=1996, σ²=1)
• Histograms
  playsFor(Beckham, United) [1990, 1992):0.1 [1992, 2004):0.6 [2004, 2005]:0.2

Possible Worlds in Time

[Figure: time histograms for the base facts playsFor (Beckham, United) (state) and wonCup (United, ChampionsLeague) (event), and for the derived fact hasWon (Beckham, ChampionsLeague) (event), on a time axis from '95 to '02; bin probabilities of the derived fact are computed from those of the base facts.]

• #P-complete per histogram bin
• linear in #bins

Joint Reasoning on Facts & Time

Facts from KB (with confidence weights):
marriedTo(Nicolas, Carla)         0.91
marriedTo(Nicolas, Cecilia)       0.65
divorcedFrom(Nicolas, Cecilia)    0.78
bornIn(Nicolas, Paris)            0.77
bornIn(Cecilia, Boulogne)         0.12
bornIn(Carla, Turin)              0.43
marriedTo(Carla, Ben)             0.18
marriedTo(Carla, Mick)            0.25

Rules:
marriedTo(a,b,T1) ∧ marriedTo(a,c,T2) ∧ different(b,c) ⇒ disjoint(T1,T2)
marriedTo(a,b,T1) ∧ divorcedFrom(a,b,T2) ⇒ before(T1,T2)
marriedTo(a,b,T1) ∧ bornIn(a,c,T2) ⇒ before(T2,T1)
+ more soft rules: hasChild(a,c) ∧ hasChild(b,c) ∧ different(a,b) ⇒ marriedTo(a,b)
+ recursive rules …

[Figure: the facts above laid out as intervals on a common time axis.]

Compute most likely possible world!
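The first rule (marriages of the same person to different spouses must have disjoint time scopes) can be checked mechanically once intervals are attached to facts. The intervals below are hypothetical, marriedTo is treated as symmetric for the check, and the code only reports violations; picking the most likely consistent world would additionally use the confidence weights.

```python
from itertools import combinations

def disjoint(t1, t2):
    """True if closed intervals t1 and t2 share no time point."""
    return t1[1] < t2[0] or t2[1] < t1[0]

# (person, spouse, (from_year, until_year)); time scopes are invented.
married = [
    ("Nicolas", "Cecilia", (1996, 2007)),
    ("Nicolas", "Carla", (2008, 2012)),
    ("Carla", "Ben", (2010, 2011)),
    ("Carla", "Mick", (2006, 2007)),
]

def violations(facts):
    """Pairs of marriages that share one person but overlap in time."""
    out = []
    for (a1, b1, t1), (a2, b2, t2) in combinations(facts, 2):
        shared = {a1, b1} & {a2, b2}
        if len(shared) == 1 and not disjoint(t1, t2):
            out.append(((a1, b1), (a2, b2)))
    return out
```

Here the only conflict is between marriedTo(Nicolas, Carla) and marriedTo(Carla, Ben); a reasoner would resolve it in favor of the higher-weight fact.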

Problems and Challenges

Temporal Querying (Revived)
query language (T-SPARQL?), no schema
confidence weights & ranking

Consistency Reasoning
extended MaxSat, extended Datalog, prob. graph. models, etc. for resolving inconsistencies on uncertain facts & uncertain time

Incomplete and Uncertain Temporal Scopes
incorrect, incomplete, unknown begin/end
vague dating

Gathering Implicit and Relative Time Annotations
biographies & news, relative orderings
aggregate & reconcile observations


KB Building: Where Do We Stand?

Entities & Classes
strong success story, some problems left:
• large taxonomies of classes with individual entities
• long tail calls for new methods
• entity disambiguation remains grand challenge

Relationships
good progress, but many challenges left:
• recall & precision by patterns & reasoning
• efficiency & scalability
• soft rules, hard constraints, richer logics, …
• open-domain discovery of new relation types

Temporal Knowledge
widely open (fertile) research ground:
• uncertain / incomplete temporal scopes of facts
• joint reasoning on ER facts and time scopes

Overall Take-Home

Historic opportunity: revive Cyc vision, make it real & large-scale!
challenging & risky, but high pay-off

Explore & exploit synergies between semantic, statistical, & social Web methods:
statistical evidence + logical consistency!

For DB researchers (theoreticians & normal ones):
• efficiency & scalability
• constraints & reasoning
• killer app for uncertain data management
• knowledge-base life-cycle: growth & maintenance

Thank You !