weikum@mpi-inf.mpg.de · http://www.mpi-inf.mpg.de/~weikum/
Gerhard Weikum
Harvesting, Searching, and Ranking Knowledge from the Web
Joint work with Georgiana Ifrim, Gjergji Kasneci, Thomas Neumann,Maya Ramanath, Fabian Suchanek
2/38
Vision
Opportunity: Turn the Web (and Web 2.0 and Web 3.0 ...) into the world's most comprehensive knowledge base
Approach: 1) harvest and combine
a) hand-crafted knowledge sources (Semantic Web, ontologies)
b) automatic knowledge extraction (Statistical Web, text mining)
c) social communities and human computing (Social Web, Web 2.0)
2) express knowledge queries, search, and rank
3) everything efficient and scalable
3/38
Why Google and Wikipedia Are Not Enough
German universities with world-class computer scientists
German Nobel prize winner who survived both world wars and all of his four children
proteins that inhibit proteases and other human enzymes
Answer "knowledge queries" such as:
connection between Thomas Mann and Goethe
politicians who are also scientists
4/38
Which politicians are also scientists?
Why Google and Wikipedia Are Not Enough
What is lacking? Information is not Knowledge. Knowledge is not Wisdom. Wisdom is not Truth. Truth is not Beauty. Beauty is not Music. Music is the best.
(Frank Zappa)
• extract facts from Web pages
• capture user intention by concepts, entities, relations
5/38
NAGA Example
Query:
  $x isa politician
  $x isa scientist
Results:
  Benjamin Franklin
  Paul Wolfowitz
  Angela Merkel
  …
6/38
Related Work (three overlapping areas)
• semistructured IR & graph search: BANKS, DBexplorer, TopX, XQ-FT, Tijah, SPARQL
• Web entity search & QA: EntityRank, Libra, Powerset, START, Hakia, Answers, SWSE
• information extraction & ontology building: TextRunner, Cyc, Freebase, Cimple/DBlife, UIMA, DBpedia, Avatar
• Yago/Naga (this work, at the intersection)
7/38
Outline
• Motivation
• Information Extraction & Knowledge Harvesting (YAGO)
• Ranking for Search over Entity-Relation Graphs (NAGA)
• Efficient Query Processing (RDF-3X)
• Conclusion
8/38
Information Extraction (IE): Text to Records
Person           BirthDate    BirthPlace   ...
Max Planck       4/23, 1858   Kiel
Albert Einstein  3/14, 1879   Ulm
Mahatma Gandhi   10/2, 1869   Porbandar

Person       ScientificResult
Max Planck   Quantum Theory

Person       Collaborator
Max Planck   Albert Einstein
Max Planck   Niels Bohr

Constant            Value          Dimension
Planck's constant   6.626·10^-34   Js
combine NLP, pattern matching, lexicons, statistical learning
extracted facts often have confidence < 1 → DB with uncertainty (probabilistic DB)
expensive and error-prone
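The pattern-matching step can be sketched in a few lines — a minimal Python illustration (not from the talk; the surface pattern and the confidence value are invented for the example) of turning text into uncertain records:

```python
import re

# Hypothetical surface pattern for birth facts; real extractors combine
# many such patterns with NLP, lexicons, and statistical learning.
PATTERNS = [
    (re.compile(r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) was born"
                r"(?: on (?P<date>[A-Za-z0-9, /]+?))? in (?P<place>[A-Z][a-z]+)"),
     0.8),  # pattern confidence < 1: the extraction stays uncertain
]

def extract_birth_facts(sentence):
    """Yield (person, date, place, confidence) record candidates."""
    for pattern, conf in PATTERNS:
        for m in pattern.finditer(sentence):
            yield (m.group("person"), m.group("date"), m.group("place"), conf)

facts = list(extract_birth_facts("Max Planck was born on April 23, 1858 in Kiel."))
```

The confidence carried with each record is what makes the target a probabilistic DB rather than an ordinary one.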
9/38
High-Quality Knowledge Sources
General-purpose ontologies and thesauri: the WordNet family

scientist, man of science (a person with advanced knowledge)
  => cosmographer, cosmographist
  => biologist, life scientist
  => chemist
  => cognitive scientist
  => computer scientist ...
  => principal investigator, PI …
  HAS INSTANCE => Bacon, Roger Bacon …

200 000 concepts and relations; can be cast into
• description logics, or
• a graph, with weights for relation strengths (derived from co-occurrence statistics)
10/38
{{Infobox_Scientist
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| death_date = [[October 4]], [[1947]]
| death_place = [[Göttingen]], [[Germany]]
| residence = [[Germany]]
| nationality = [[Germany|German]]
| field = [[Physicist]]
| work_institution = [[University of Kiel]]</br> [[Humboldt-Universität zu Berlin]]</br> [[Georg-August-Universität Göttingen]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| doctoral_advisor = [[Philipp von Jolly]]
| doctoral_students = [[Gustav Ludwig Hertz]]</br>…
| known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]]
| prizes = [[Nobel Prize in Physics]] (1918)
…
Exploit Hand-Crafted KnowledgeWikipedia, WordNet, and other lexical sources
11/38
Exploit Hand-Crafted KnowledgeWikipedia, WordNet, and other lexical sources
12/38
YAGO: Yet Another Great Ontology [F. Suchanek, G. Kasneci, G. Weikum: WWW'07]
• Turn Wikipedia into explicit knowledge base (semantic DB);
keep source pages as witnesses
• Exploit hand-crafted categories and infobox templates
• Represent facts as explicit knowledge triples:
relation (entity1, entity2)
(in FOL, compatible with RDF, OWL-lite, XML, etc.)
• Map (and disambiguate) relations into WordNet concept DAG
Examples:
  bornIn (Max_Planck, Kiel)
  isInstanceOf (Kiel, City)
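A minimal sketch (illustrative names and a tiny in-memory list, not YAGO's actual storage layer) of representing facts as relation (entity1, entity2) triples while keeping the source pages as witnesses:

```python
from collections import namedtuple

# A YAGO-style fact: relation(entity1, entity2), with the source page
# kept as a witness. Field names here are illustrative.
Fact = namedtuple("Fact", ["relation", "entity1", "entity2", "witness"])

kb = [
    Fact("bornIn", "Max_Planck", "Kiel", "https://en.wikipedia.org/wiki/Max_Planck"),
    Fact("isInstanceOf", "Kiel", "City", "https://en.wikipedia.org/wiki/Kiel"),
]

def query(kb, relation=None, entity1=None, entity2=None):
    """Return all facts matching the bound arguments (None = wildcard)."""
    return [f for f in kb
            if (relation is None or f.relation == relation)
            and (entity1 is None or f.entity1 == entity1)
            and (entity2 is None or f.entity2 == entity2)]
```

Because each fact is a flat triple plus a witness, the representation maps directly onto RDF and stays compatible with OWL-lite or XML serializations.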
13/38
YAGO Knowledge Base [F. Suchanek et al.: WWW’07]
[Figure: excerpt of the YAGO knowledge graph, with three layers — words, individuals, concepts. Words "Max Planck", "Dr. Planck", "Max Karl Ernst Ludwig Planck" —means→ Max_Planck. Max_Planck —bornOn→ April 23, 1858; —diedOn→ October 4, 1947; —bornIn→ Kiel; —hasWon→ Nobel Prize; —fatherOf→ Erwin_Planck; —instanceOf→ Physicist. Kiel —instanceOf→ City. Subclass edges form the concept hierarchy: Physicist, Biologist ⊂ Scientist ⊂ Person ⊂ Entity; City, Country ⊂ Location ⊂ Entity.]
Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/
Accuracy 95%

             Entities   Facts
KnowItAll    30 000
SUMO         20 000     60 000
WordNet      120 000    80 000
Cyc          300 000    5 Mio.
TextRunner   n/a        8 Mio.
YAGO         1.7 Mio.   15 Mio.
DBpedia      1.9 Mio.   103 Mio.
Freebase     ???        ???
14/38
Wikipedia Harvesting: Difficulties & Solutions
• instanceOf relation: misleading and difficult category names
  ("disputed articles", "particle physics", "American Music of the 20th Century",
  "Nobel laureates in physics", "naturalized citizens of the United States", …)
  → noun group parser: ignore when head word is in singular
• isA relation: mapping categories onto WordNet classes:
  "Nobel laureates in physics" → Nobel_laureates, "people from Kiel" → person
  → map to (singular of) head; exploit synsets and statistics
• entity name ambiguities: "St. Petersburg", "Saint Petersburg", "M31", "NGC224" → means …
  → exploit Wikipedia redirects & disambiguations, WordNet synsets
• type checking for scrutinizing candidates:
  accept a fact candidate only if its arguments have the proper classes,
  e.g. reject marriedTo (Max Planck, quantum physics) — signature Person × Person
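The type-checking step can be sketched as follows (the relation signatures and class memberships below are hypothetical, for illustration only):

```python
# Hypothetical type signatures for relations; a candidate fact is
# accepted only if both arguments carry the required classes.
SIGNATURES = {"marriedTo": ("person", "person"),
              "bornIn": ("person", "location")}

# Hypothetical instanceOf memberships, as harvested from categories.
INSTANCE_OF = {"Max_Planck": {"person", "physicist"},
               "Kiel": {"location", "city"},
               "quantum_physics": {"theory"}}

def type_checks(relation, arg1, arg2):
    """Accept a fact candidate only if its arguments fit the signature."""
    dom, rng = SIGNATURES[relation]
    return dom in INSTANCE_OF.get(arg1, set()) and rng in INSTANCE_OF.get(arg2, set())

type_checks("marriedTo", "Max_Planck", "quantum_physics")  # rejected: not a person
```

The same filter cheaply discards a large share of wrong extraction candidates before any expensive disambiguation runs.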
15/38
Higher-Order Facts in YAGO
capitalOf (Berlin, Germany) — validIn 1990-2008
capitalOf (Bonn, Germany) — validIn 1949-1989

facts about facts, represented by reification as first-order facts:
  e314159: capitalOf (Berlin, Germany)
  validIn (e314159, 1990-2008)

instanceOf (Arnold Schwarzenegger, Actor) — validIn 1987-2008
instanceOf (Arnold Schwarzenegger, Politician) — validIn 2003-2008
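Reification can be sketched in a few lines (the id counter and helper below are illustrative, not YAGO's implementation): a base fact gets an identifier, and facts about that fact use the identifier as an ordinary argument.

```python
# Reification sketch: every fact gets an id; higher-order facts
# (e.g. a validity interval) reference that id as a first-order argument.
facts = {}
next_id = [314159]  # illustrative counter, mimicking the e314159 example

def add_fact(rel, arg1, arg2):
    """Store a fact under a fresh id and return the id."""
    fid = "e%d" % next_id[0]
    next_id[0] += 1
    facts[fid] = (rel, arg1, arg2)
    return fid

base = add_fact("capitalOf", "Berlin", "Germany")
add_fact("validIn", base, "1990-2008")  # a fact about a fact
```

Because the reified fact is just another triple, the whole construction stays within first-order logic and plain RDF-style storage.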
16/38
Ongoing Work: YAGO for Easier IE
YAGO knows (almost) all (interesting) entities
→ leverage for discovering & extracting new facts in NL texts
• can filter out many uninteresting sentences
• can quickly identify relation arguments
• can eliminate many fact candidates by type checking
• can focus on specific properties like time
Examples:
  runsThrough (Seine, Paris) — arguments typed river × city
  "Cologne lies on the banks of the Rhine"
  "People in Cairo like wine from the Rhine valley"
  [Figure: link-grammar / phrase-structure parses of these sentences]
→ IE with a dependency parser is expensive!
  "The city of Paris was founded on an island in the Seine in 300 BC"
  [Figure: isa and locatedIn chain — Paris locatedIn France locatedIn Europe]
17/38
Outline
• Motivation
• Information Extraction & Knowledge Harvesting (YAGO)
• Ranking for Search over Entity-Relation Graphs (NAGA)
• Efficient Query Processing (RDF-3X)
• Conclusion
18/38
NAGA: Graph Search [G. Kasneci et al.: ICDE'08]
Graph-based search on YAGO-style knowledge bases, with built-in ranking based on confidence and informativeness. Query types:
• complex queries (with regular expressions), e.g.:
  $x —isa→ scientist; $x —wonPrize→ $p; $p —inField→ computer science;
  $x —worksAt | graduatedFrom→ $u; $u —isa→ university; $u —locatedIn*→ Germany
• discovery queries, e.g.: $x —isa→ politician; $x —isa→ scientist
• connectedness queries, e.g.: Thomas Mann —*— Goethe (e.g. both isa German novelist)
• queries over reified facts, e.g.: $c —isa→ city; $c —capitalOf→ Germany, validIn 1988
19/38
Search Results Without Ranking
q: Fisher isa scientist . Fisher isa $x

mathematician_109635652 —subClassOf→ scientist_109871938
Alumni_of_Gonville_and_Caius_College,_Cambridge —subClassOf→ alumnus_109165182
"Fisher" —familyNameOf→ Ronald_Fisher
Ronald_Fisher —type→ Alumni_of_Gonville_and_Caius_College,_Cambridge
Ronald_Fisher —type→ 20th_century_mathematicians
"scientist" —means→ scientist_109871938
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = alumnus_109165182
$@Fisher = Irving_Fisher $@scientist = scientist_109871938 $X = social_scientist_109927304
$@Fisher = James_Fisher $@scientist = scientist_109871938 $X = ornithologist_109711173
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = theorist_110008610
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = colleague_109301221
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = organism_100003226 …
20/38
Ranking with Statistical Language Model
q: Fisher isa scientist . Fisher isa $x

Score: 7.184462521168058E-13
  mathematician_109635652 —subClassOf→ scientist_109871938
  "Fisher" —familyNameOf→ Ronald_Fisher
  Ronald_Fisher —type→ 20th_century_mathematicians
  "scientist" —means→ scientist_109871938
  20th_century_mathematicians —subClassOf→ mathematician_109635652
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = mathematician_109635652
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = statistician_109958989
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = president_109787431
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = geneticist_109475749
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = scientist_109871938 …
Online access at http://www.mpi-inf.mpg.de/~kasneci/naga/
statistical language model for result graphs
21/38
Ranking Factors
Confidence: prefer results that are likely to be correct
• certainty of IE
• authenticity and authority of sources
Example:
  bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia)
  vs. livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)

Informativeness: prefer results that are likely important; may prefer results that are likely new to the user
• frequency in answer
• frequency in corpus (e.g. Web)
• frequency in query log
Examples:
  q: isa (Einstein, $y) → isa (Einstein, scientist) preferred over isa (Einstein, vegetarian)
  q: isa ($x, vegetarian) → isa (Einstein, vegetarian) preferred over isa (Al Nobody, vegetarian)

Compactness: prefer results that are tightly connected
• size of answer graph
[Figure: answer graphs connecting Einstein and Bohr — a compact one via the Nobel Prize (both won it) vs. a larger one via Tom Cruise (isa vegetarian, bornIn 1962)]
22/38
NAGA Ranking Model
Following the paradigm of statistical language models (used in speech recognition and modern IR), applied to graphs.

For query q with fact templates q1 … qn (e.g. bornIn ($x, Frankfurt)),
rank result graphs g with facts g1 … gn (e.g. bornIn (Goethe, Frankfurt))
by decreasing likelihoods, using a generative mixture model:

  P[q | g] = ∏_{i=1..n} ( α · P[q_i | g_i] + (1 − α) · P[q_i] )

where P[q_i | g_i] reflects informativeness, P[q_i] is a background model,
and α weights subqueries (e.g. bornIn ($x, Germany) & wonAward ($x, Nobel))
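The mixture-model score can be sketched directly from the formula — a toy implementation with illustrative probabilities, not NAGA's actual code:

```python
import math

def result_likelihood(match_probs, background_probs, alpha=0.8):
    """Mixture-model likelihood of a result graph:
    product over subqueries of alpha*P[q_i|g_i] + (1-alpha)*P[q_i].
    match_probs[i] ~ P[q_i|g_i] (informativeness),
    background_probs[i] ~ P[q_i] (background model)."""
    assert len(match_probs) == len(background_probs)
    return math.prod(alpha * p_match + (1 - alpha) * p_bg
                     for p_match, p_bg in zip(match_probs, background_probs))
```

Result graphs are then sorted by decreasing likelihood; the background term keeps a single poorly matching subquery from zeroing out an otherwise good answer.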
23/38
NAGA Ranking Model: Informativeness
Estimate P[q_i | g_i] for q_i = (x*, r, z) with variable x* (analogously for the other cases):

  P[x | r, z] = P[x, r, z] / P[r, z] = P[x, r, z] / ∑_{x'} P[x', r, z]

Estimate on the knowledge graph, e.g. bornIn ($x, Frankfurt):
  bornIn (Goethe, Frankfurt) vs. bornIn (GW, Frankfurt)

Estimate on the Web (exploit redundancy), e.g. isa (Einstein, $z):
  freq (Einstein, isa, physicist) vs. freq (Einstein, isa, vegetarian)
  [Figure: Albert Einstein —isa→ physicist, —isa→ vegetarian]
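Estimating P[x | r, z] from triple frequencies can be sketched as follows (the counts are invented purely for illustration):

```python
from collections import Counter

# Illustrative triple frequencies, e.g. from Web co-occurrence counts.
freq = Counter({("Goethe", "bornIn", "Frankfurt"): 900,
                ("GW", "bornIn", "Frankfurt"): 100})

def p_subject_given_rel_obj(x, r, z):
    """P(x | r, z) = freq(x, r, z) / sum over x' of freq(x', r, z)."""
    total = sum(c for (x2, r2, z2), c in freq.items() if r2 == r and z2 == z)
    return freq[(x, r, z)] / total if total else 0.0
```

With these toy counts, Goethe dominates the bornIn-Frankfurt distribution, which is exactly the informativeness preference the ranking model encodes.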
24/38
NAGA Example
Query:
  $x isa politician
  $x isa scientist
Results:
  Benjamin Franklin
  Paul Wolfowitz
  Angela Merkel
  …
25/38
User Study for Quality Assessment (1)
Benchmark:
• 55 queries from TREC QA 2005/2006
  Examples: 1) In what country is Luxor? 2) Discoveries of the 20th century?
• 12 queries from work on SphereSearch
  Examples: 1) In which movies did a governor act? 2) First name of politician Rice?
• 18 regular-expression queries by us
  Example: What do Albert Einstein and Niels Bohr have in common?

Competitors: NAGA vs. Google, Yahoo! Answers, BANKS (IIT Bombay), START (MIT)
26/38
User Study for Quality Assessment (2)
Benchmark     | #Q | #A   | Metric | Google | Yahoo! Answers | START  | BANKS scoring | NAGA
TREC QA       | 55 | 1098 | NDCG   | 75.88% | 26.15%         | 75.38% | 87.93%        | 92.75%
              |    |      | P@1    | 67.81% | 17.20%         | 73.23% | 69.54%        | 84.40%
SphereSearch  | 12 | 343  | NDCG   | 38.22% | 17.23%         | 2.87%  | 88.82%        | 91.01%
              |    |      | P@1    | 19.38% | 6.15%          | 2.87%  | 84.28%        | 84.94%
Own           | 18 | 418  | NDCG   | 54.09% | 17.98%         | 13.35% | 85.59%        | 91.33%
              |    |      | P@1    | 27.95% | 6.57%          | 13.57% | 76.54%        | 86.56%
Quality measures:
• Precision@1
• NDCG: normalized discounted cumulative gain, based on ratings
  highly relevant (2), somewhat relevant (1), irrelevant (0),
  with Wilson confidence intervals at α = 0.95
27/38
Outline
• Motivation
• Information Extraction & Knowledge Harvesting (YAGO)
• Ranking for Search over Entity-Relation Graphs (NAGA)
• Efficient Query Processing (RDF-3X)
• Conclusion
28/38
Why RDF? Why a New Engine?
[Figure: example entity-relation graph — Marie Curie bornAs "Maria Sklodowska", bornOn 1867, diedOn 1934, bornIn Warsaw (inCountry Poland), marriedTo Pierre Curie, almaMater U Paris, advisor Henri Becquerel (bornOn 1852, diedOn 1908); Marie and Pierre Curie wonAward Nobel Prize Physics, Marie Curie wonAward Nobel Prize Chemistry]
• RDF triples (subject – property/predicate – value/object):
  (id1, Name, "Marie Curie"), (id1, bornAs, "Maria Sklodowska"), (id1, bornOn, 1867),
  (id1, bornIn, id2), (id2, Name, "Warsaw"), (id2, locatedIn, id3), (id3, Name, "Poland"),
  (id1, marriedTo, id4), (id4, Name, "Pierre Curie"), (id1, wonAward, id5), (id4, wonAward, id5), …
• pay-as-you-go: schema-agnostic or schema later
• RDF triples form a fine-grained (ER) graph
• queries bound to need many star-joins and long chain-joins
• physical design critical, but hardly predictable workload
29/38
SPARQL Query Language
SPJ combinations of triple patterns, e.g.:
  Select ?c Where {
    ?p isa scientist . ?p bornIn ?t . ?p hasWon ?a .
    ?t inCountry ?c . ?a Name NobelPrize }

options for filter predicates, duplicate handling, wildcard joins, etc., e.g.:
  Select Distinct ?c Where {
    ?p ?r1 ?t . ?t ?r2 ?c . ?c isa <country> .
    ?p bornOn ?b . Filter (?b > 1945) }
support for RDFS: types
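A naive evaluator for such conjunctions of triple patterns (a basic graph pattern) can be sketched as follows — a toy Python illustration of the semantics, not how a real SPARQL engine executes:

```python
def match(pattern, triple, binding):
    """Try to extend binding so pattern matches triple; variables start with '?'."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if new.get(p, t) != t:   # variable already bound to something else
                return None
            new[p] = t
        elif p != t:                 # constant mismatch
            return None
    return new

def evaluate(patterns, triples):
    """Naively evaluate a conjunction of triple patterns via nested loops."""
    bindings = [{}]
    for pat in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match(pat, t, b)) is not None]
    return bindings

triples = [("id1", "bornIn", "id2"), ("id2", "inCountry", "id3")]
res = evaluate([("?p", "bornIn", "?t"), ("?t", "inCountry", "?c")], triples)
```

Each shared variable (here ?t) acts as a join condition — which is why realistic SPARQL queries turn into many star-joins and long chain-joins, motivating the engine design that follows.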
30/38
RDF & SPARQL Engines
choice of physical design is crucial

• giant triples table (vertically partitioned) — SESAME / OpenRDF, YARS2 (DERI):
    S    P        O
    id1  Name     Marie Curie
    id1  bornOn   1867
    id1  bornIn   id2
    id2  Name     Warsaw
    id2  Country  id11
    id1  Advisor  id5
    …    …        …
• property tables — Jena (HP Labs), Oracle RDF_MATCH, + physical design wizard!
    bornOn:  S    O       Advisor:  S    O
             id1  1867              id1  id5
             id5  1852              …    …
             …    …
• clustered property tables (+ leftover table):
    Person:  S    Name     bornOn  bornIn  …
             id1  Marie C  1867    id2
             id2  Henri B  1852    id9
             …    …        …       …
    Town:    S    Name     Country
             id2  Warsaw   id11
             …    …        …
• column stores (+ materialized views) — C-Store (MIT), MonetDB (CWI)
31/38
RDF-3X: a RISC-style Engine[T. Neumann, G. Weikum: VLDB 2008]
Design rationale:
• RDF-specific engine (not an RXORDBMS)
• simplify operations
• reduce implementation choices
• optimize for the common case
• eliminate tuning knobs

Key principles:
• mapping dictionary for encoding all literals into ids
• exhaustive indexing of id triples
• index-only store, high compression
• QP mostly merge joins with order-preservation
• very fast DP-based query optimizer
• frequent-paths synopses, property-value histograms
32/38
RDF-3X Indexing
index all collation orders of subject-property-object id triples: SPO, SOP, OSP, OPS, PSO, POS
• directly stored in clustered B+ trees• high compression: indexes < original data• can choose any order for scan & join
additionally index count-aggregated projections in all orders: SP, SO, OS, OP, PS, PO – with counter for each entry
• enables efficient bookkeeping for duplicates• also index projections S, P, O with count-aggregation
also need two mapping indexes: literal → id and id → literal
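Dictionary encoding plus the six collation orders can be sketched as follows — sorted Python lists stand in for RDF-3X's clustered B+ trees, and all names are illustrative:

```python
from itertools import permutations

def build_indexes(triples):
    """Dictionary-encode literals to ids, then index the id triples in all
    six collation orders (SPO, SOP, OSP, OPS, PSO, POS)."""
    dictionary, ids = {}, []
    for t in triples:
        # literal -> id mapping; the inverse mapping is dictionary inverted
        ids.append(tuple(dictionary.setdefault(lit, len(dictionary)) for lit in t))
    indexes = {}
    for order in permutations(range(3)):  # all collation orders of S, P, O
        name = "".join("SPO"[i] for i in order)
        indexes[name] = sorted(tuple(t[i] for i in order) for t in ids)
    return dictionary, indexes

dictionary, indexes = build_indexes([("Marie_Curie", "bornIn", "Warsaw"),
                                     ("Warsaw", "inCountry", "Poland")])
```

Having every collation order materialized is what lets the query processor pick a scan order per triple pattern so that joins become order-preserving merge joins.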
33/38
RDF-3X Query Optimization
Principles:
• optimizing join orders is key (star joins, long join chains)
• should exploit exhaustive indexes and order-preservation
• support merge-joins and hash-joins

Bottom-up dynamic programming for exhaustive plan enumeration (< 100 ms for 20 joins)
Cost model based on selectivity estimation from• histograms for each of the 6 SPO orderings (approx. equi-depth)
• frequent join paths (property sequences) for stars and chains
Example query (chain with selections):
  ?x1 —p1→ ?x2 —p2→ ?x3 —p3→ ?x4 —p4→ ?x5 —p5→ ?x6,
  with attached triple patterns ?x1 —a1→ v1, ?x4 —a4→ v4, ?x6 —a6→ v6
34/38
Experimental Evaluation: Setup
Setup and competitors: 2 GHz dual core, 2 GB RAM, 30 MB/s disk, Linux
• column-store property tables by Abadi et al., using MonetDB
• triples store with SPO, POS, PSO indexes, using PostgreSQL

Datasets:
1) Barton library catalog: 51 Mio. triples (4.1 GB)
2) YAGO knowledge base: 40 Mio. triples (3.1 GB)
3) Librarything social-tagging excerpt: 30 Mio. triples (1.8 GB)

Benchmark queries (7 or 8 per dataset) in the spirit of:
1) counts of French library items (books, music, etc.), with creator, publisher, language, etc.
2) scientists from Poland with a French advisor, where both won awards
3) books tagged with romance, love, mystery, suspense by users who like crime novels and have friends who ...

Select ?t Where {
  ?b hasTitle ?t . ?u romance ?b . ?u love ?b . ?u mystery ?b . ?u suspense ?b .
  ?u crimeNovel ?c . ?u hasFriend ?f . ?f ... }
35/38
Experimental Evaluation: Results
DB sizes [GB]:
             Barton   Yago     LibThing
RDF-3X       2.8      2.7      1.6
MonetDB      1.6-2.0  1.1-2.4  0.7-6.9
PostgreSQL   8.7      7.5      5.7

DB load times [min]:
             Barton  Yago  LibThing
RDF-3X       13      25    20
MonetDB      11      21    4
PostgreSQL   30      25    20

Query run-times [sec], geometric means, warm (cold) cache:
             Barton        Yago         LibThing
RDF-3X       0.4 (5.9)     0.04 (0.7)   0.13 (0.89)
MonetDB      3.8 (26.4)    54.6 (78.2)  4.39 (8.16)
PostgreSQL   64.3 (167.8)  0.56 (10.6)  30.4 (93.9)
36/38
Outline
• Motivation
• Information Extraction & Knowledge Harvesting (YAGO)
• Ranking for Search over Entity-Relation Graphs (NAGA)
• Efficient Query Processing (RDF-3X)
• Conclusion
37/38
Summary & Outlook
lift the world's best information sources (Wikipedia, Web, Web 2.0)
to the level of explicit knowledge (ER-oriented facts)
1) building knowledge graphs: combine semantic & statistical & social IE sources
   (for the scholarly Web, digital libraries, enterprise know-how);
   challenges in consistency vs. uncertainty, long-term evolution
2) heterogeneity & uncertain IE necessitate ranking
   → new ranking models (e.g. statistical LM for graphs)
3) efficiency and scalability challenges for search & ranking (top-k queries) and updates
38/38
Thank You!