weikum@mpi-inf.mpg.de · http://www.mpi-inf.mpg.de/~weikum/
Gerhard Weikum
Harvesting, Searching, and Ranking Knowledge from the Web
Joint work with Georgiana Ifrim, Gjergji Kasneci, Thomas Neumann,Maya Ramanath, Fabian Suchanek
2/38
Vision
Opportunity: Turn the Web (and Web 2.0 and Web 3.0 ...) into the world's most comprehensive knowledge base
Approach: 1) harvest and combine
a) hand-crafted knowledge sources (Semantic Web, ontologies)
b) automatic knowledge extraction (Statistical Web, text mining)
c) social communities and human computing (Social Web, Web 2.0)
2) express knowledge queries, search, and rank
3) everything efficient and scalable
3/38
Why Google and Wikipedia Are Not Enough
German universities with world-class computer scientists
German Nobel prize winner who survived both world wars and all of his four children
proteins that inhibit proteases and other human enzymes
Answer "knowledge queries" such as:
connection between Thomas Mann and Goethe
politicians who are also scientists
4/38
Which politicians are also scientists?
Why Google and Wikipedia Are Not Enough
What is lacking? Information is not Knowledge. Knowledge is not Wisdom. Wisdom is not Truth. Truth is not Beauty. Beauty is not Music. Music is the best.
(Frank Zappa)
• extract facts from Web pages
• capture user intention by concepts, entities, relations
5/38
NAGA Example
Query:
  $x isa politician
  $x isa scientist
Results:
  Benjamin Franklin
  Paul Wolfowitz
  Angela Merkel
  …
6/38
Related Work (three overlapping areas)
• semistructured IR & graph search: BANKS, DBexplorer, TopX, XQ-FT, Tijah, SPARQL
• Web entity search & QA: EntityRank, Libra, Powerset, START, Hakia, Answers, SWSE
• information extraction & ontology building: TextRunner, Cyc, Freebase, Cimple/DBlife, UIMA, DBpedia, Avatar
• Yago/Naga (this work, at the intersection)
7/38
Outline
• Motivation
• Information Extraction & Knowledge Harvesting (YAGO)
• Ranking for Search over Entity-Relation Graphs (NAGA)
• Efficient Query Processing (RDF-3X)
• Conclusion
8/38
Information Extraction (IE): Text to Records
Person           BirthDate    BirthPlace   ...
Max Planck       4/23, 1858   Kiel
Albert Einstein  3/14, 1879   Ulm
Mahatma Gandhi   10/2, 1869   Porbandar

Person       ScientificResult
Max Planck   Quantum Theory

Person       Collaborator
Max Planck   Albert Einstein
Max Planck   Niels Bohr

Constant            Value          Dimension
Planck's constant   6.626·10^-34   Js
combine NLP, pattern matching, lexicons, statistical learning
extracted facts often have confidence < 1 → DB with uncertainty (probabilistic DB)
expensive and error-prone
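The pattern-matching step can be sketched in a few lines — a minimal Python illustration (not from the talk; the surface pattern and the confidence value are invented for the example) of turning text into uncertain records:

```python
import re

# Hypothetical surface pattern for birth facts; real extractors combine
# many such patterns with NLP, lexicons, and statistical learning.
PATTERNS = [
    (re.compile(r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) was born"
                r"(?: on (?P<date>[A-Za-z0-9, /]+?))? in (?P<place>[A-Z][a-z]+)"),
     0.8),  # pattern confidence < 1: the extraction stays uncertain
]

def extract_birth_facts(sentence):
    """Yield (person, date, place, confidence) record candidates."""
    for pattern, conf in PATTERNS:
        for m in pattern.finditer(sentence):
            yield (m.group("person"), m.group("date"), m.group("place"), conf)

facts = list(extract_birth_facts("Max Planck was born on April 23, 1858 in Kiel."))
```

The confidence carried with each record is what makes the target a probabilistic DB rather than an ordinary one.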
9/38
High-Quality Knowledge Sources
General-purpose ontologies and thesauri: the WordNet family

scientist, man of science (a person with advanced knowledge)
  => cosmographer, cosmographist
  => biologist, life scientist
  => chemist
  => cognitive scientist
  => computer scientist ...
  => principal investigator, PI …
  HAS INSTANCE => Bacon, Roger Bacon …

200 000 concepts and relations; can be cast into
• description logics, or
• a graph, with weights for relation strengths (derived from co-occurrence statistics)
10/38
{{Infobox_Scientist
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| death_date = [[October 4]], [[1947]]
| death_place = [[Göttingen]], [[Germany]]
| residence = [[Germany]]
| nationality = [[Germany|German]]
| field = [[Physicist]]
| work_institution = [[University of Kiel]]</br> [[Humboldt-Universität zu Berlin]]</br> [[Georg-August-Universität Göttingen]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| doctoral_advisor = [[Philipp von Jolly]]
| doctoral_students = [[Gustav Ludwig Hertz]]</br>…
| known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]]
| prizes = [[Nobel Prize in Physics]] (1918)
…
Exploit Hand-Crafted KnowledgeWikipedia, WordNet, and other lexical sources
11/38
Exploit Hand-Crafted KnowledgeWikipedia, WordNet, and other lexical sources
12/38
YAGO: Yet Another Great Ontology [F. Suchanek, G. Kasneci, G. Weikum: WWW'07]
• Turn Wikipedia into explicit knowledge base (semantic DB);
keep source pages as witnesses
• Exploit hand-crafted categories and infobox templates
• Represent facts as explicit knowledge triples:
relation (entity1, entity2)
(in FOL, compatible with RDF, OWL-lite, XML, etc.)
• Map (and disambiguate) relations into WordNet concept DAG
Examples:
  bornIn (Max_Planck, Kiel)
  isInstanceOf (Kiel, City)
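A minimal sketch (illustrative names and a tiny in-memory list, not YAGO's actual storage layer) of representing facts as relation (entity1, entity2) triples while keeping the source pages as witnesses:

```python
from collections import namedtuple

# A YAGO-style fact: relation(entity1, entity2), with the source page
# kept as a witness. Field names here are illustrative.
Fact = namedtuple("Fact", ["relation", "entity1", "entity2", "witness"])

kb = [
    Fact("bornIn", "Max_Planck", "Kiel", "https://en.wikipedia.org/wiki/Max_Planck"),
    Fact("isInstanceOf", "Kiel", "City", "https://en.wikipedia.org/wiki/Kiel"),
]

def query(kb, relation=None, entity1=None, entity2=None):
    """Return all facts matching the bound arguments (None = wildcard)."""
    return [f for f in kb
            if (relation is None or f.relation == relation)
            and (entity1 is None or f.entity1 == entity1)
            and (entity2 is None or f.entity2 == entity2)]
```

Because each fact is a flat triple plus a witness, the representation maps directly onto RDF and stays compatible with OWL-lite or XML serializations.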
13/38
YAGO Knowledge Base [F. Suchanek et al.: WWW’07]
[Figure: excerpt of the YAGO knowledge graph, with three layers — words, individuals, concepts. Words "Max Planck", "Dr. Planck", "Max Karl Ernst Ludwig Planck" —means→ Max_Planck. Max_Planck —bornOn→ April 23, 1858; —diedOn→ October 4, 1947; —bornIn→ Kiel; —hasWon→ Nobel Prize; —fatherOf→ Erwin_Planck; —instanceOf→ Physicist. Kiel —instanceOf→ City. Subclass edges form the concept hierarchy: Physicist, Biologist ⊂ Scientist ⊂ Person ⊂ Entity; City, Country ⊂ Location ⊂ Entity.]
Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/
Accuracy 95%

             Entities   Facts
KnowItAll    30 000
SUMO         20 000     60 000
WordNet      120 000    80 000
Cyc          300 000    5 Mio.
TextRunner   n/a        8 Mio.
YAGO         1.7 Mio.   15 Mio.
DBpedia      1.9 Mio.   103 Mio.
Freebase     ???        ???
14/38
Wikipedia Harvesting: Difficulties & Solutions
• instanceOf relation: misleading and difficult category names
  ("disputed articles", "particle physics", "American Music of the 20th Century",
  "Nobel laureates in physics", "naturalized citizens of the United States", …)
  → noun group parser: ignore when head word is in singular
• isA relation: mapping categories onto WordNet classes:
  "Nobel laureates in physics" → Nobel_laureates, "people from Kiel" → person
  → map to (singular of) head; exploit synsets and statistics
• entity name ambiguities: "St. Petersburg", "Saint Petersburg", "M31", "NGC224" → means …
  → exploit Wikipedia redirects & disambiguations, WordNet synsets
• type checking for scrutinizing candidates:
  accept a fact candidate only if its arguments have the proper classes,
  e.g. reject marriedTo (Max Planck, quantum physics) — signature Person × Person
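The type-checking step can be sketched as follows (the relation signatures and class memberships below are hypothetical, for illustration only):

```python
# Hypothetical type signatures for relations; a candidate fact is
# accepted only if both arguments carry the required classes.
SIGNATURES = {"marriedTo": ("person", "person"),
              "bornIn": ("person", "location")}

# Hypothetical instanceOf memberships, as harvested from categories.
INSTANCE_OF = {"Max_Planck": {"person", "physicist"},
               "Kiel": {"location", "city"},
               "quantum_physics": {"theory"}}

def type_checks(relation, arg1, arg2):
    """Accept a fact candidate only if its arguments fit the signature."""
    dom, rng = SIGNATURES[relation]
    return dom in INSTANCE_OF.get(arg1, set()) and rng in INSTANCE_OF.get(arg2, set())

type_checks("marriedTo", "Max_Planck", "quantum_physics")  # rejected: not a person
```

The same filter cheaply discards a large share of wrong extraction candidates before any expensive disambiguation runs.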
15/38
Higher-Order Facts in YAGO
capitalOf (Berlin, Germany) — validIn 1990-2008
capitalOf (Bonn, Germany) — validIn 1949-1989

facts about facts, represented by reification as first-order facts:
  e314159: capitalOf (Berlin, Germany)
  validIn (e314159, 1990-2008)

instanceOf (Arnold Schwarzenegger, Actor) — validIn 1987-2008
instanceOf (Arnold Schwarzenegger, Politician) — validIn 2003-2008
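Reification can be sketched in a few lines (the id counter and helper below are illustrative, not YAGO's implementation): a base fact gets an identifier, and facts about that fact use the identifier as an ordinary argument.

```python
# Reification sketch: every fact gets an id; higher-order facts
# (e.g. a validity interval) reference that id as a first-order argument.
facts = {}
next_id = [314159]  # illustrative counter, mimicking the e314159 example

def add_fact(rel, arg1, arg2):
    """Store a fact under a fresh id and return the id."""
    fid = "e%d" % next_id[0]
    next_id[0] += 1
    facts[fid] = (rel, arg1, arg2)
    return fid

base = add_fact("capitalOf", "Berlin", "Germany")
add_fact("validIn", base, "1990-2008")  # a fact about a fact
```

Because the reified fact is just another triple, the whole construction stays within first-order logic and plain RDF-style storage.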
16/38
Ongoing Work: YAGO for Easier IE
YAGO knows (almost) all (interesting) entities
→ leverage for discovering & extracting new facts in NL texts
• can filter out many uninteresting sentences
• can quickly identify relation arguments
• can eliminate many fact candidates by type checking
• can focus on specific properties like time
Examples:
  runsThrough (Seine, Paris) — arguments typed river × city
  "Cologne lies on the banks of the Rhine"
  "People in Cairo like wine from the Rhine valley"
  [Figure: link-grammar / phrase-structure parses of these sentences]
→ IE with a dependency parser is expensive!
  "The city of Paris was founded on an island in the Seine in 300 BC"
  [Figure: isa and locatedIn chain — Paris locatedIn France locatedIn Europe]
17/38
Outline
• Motivation
• Information Extraction & Knowledge Harvesting (YAGO)
• Ranking for Search over Entity-Relation Graphs (NAGA)
• Efficient Query Processing (RDF-3X)
• Conclusion
18/38
NAGA: Graph Search [G. Kasneci et al.: ICDE'08]
Graph-based search on YAGO-style knowledge bases, with built-in ranking based on confidence and informativeness. Query types:
• complex queries (with regular expressions), e.g.:
  $x —isa→ scientist; $x —wonPrize→ $p; $p —inField→ computer science;
  $x —worksAt | graduatedFrom→ $u; $u —isa→ university; $u —locatedIn*→ Germany
• discovery queries, e.g.: $x —isa→ politician; $x —isa→ scientist
• connectedness queries, e.g.: Thomas Mann —*— Goethe (e.g. both isa German novelist)
• queries over reified facts, e.g.: $c —isa→ city; $c —capitalOf→ Germany, validIn 1988
19/38
Search Results Without Ranking
q: Fisher isa scientist . Fisher isa $x

mathematician_109635652 —subClassOf→ scientist_109871938
Alumni_of_Gonville_and_Caius_College,_Cambridge —subClassOf→ alumnus_109165182
"Fisher" —familyNameOf→ Ronald_Fisher
Ronald_Fisher —type→ Alumni_of_Gonville_and_Caius_College,_Cambridge
Ronald_Fisher —type→ 20th_century_mathematicians
"scientist" —means→ scientist_109871938
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = alumnus_109165182
$@Fisher = Irving_Fisher $@scientist = scientist_109871938 $X = social_scientist_109927304
$@Fisher = James_Fisher $@scientist = scientist_109871938 $X = ornithologist_109711173
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = theorist_110008610
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = colleague_109301221
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = organism_100003226 …
20/38
Ranking with Statistical Language Model
q: Fisher isa scientist . Fisher isa $x

Score: 7.184462521168058E-13
  mathematician_109635652 —subClassOf→ scientist_109871938
  "Fisher" —familyNameOf→ Ronald_Fisher
  Ronald_Fisher —type→ 20th_century_mathematicians
  "scientist" —means→ scientist_109871938
  20th_century_mathematicians —subClassOf→ mathematician_109635652
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = mathematician_109635652
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = statistician_109958989
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = president_109787431
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = geneticist_109475749
$@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = scientist_109871938 …
Online access at http://www.mpi-inf.mpg.de/~kasneci/naga/
statistical language model for result graphs
21/38
Ranking Factors
Confidence: prefer results that are likely to be correct
• certainty of IE
• authenticity and authority of sources
Example:
  bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia)
  vs. livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)

Informativeness: prefer results that are likely important; may prefer results that are likely new to the user
• frequency in answer
• frequency in corpus (e.g. Web)
• frequency in query log
Examples:
  q: isa (Einstein, $y) → isa (Einstein, scientist) preferred over isa (Einstein, vegetarian)
  q: isa ($x, vegetarian) → isa (Einstein, vegetarian) preferred over isa (Al Nobody, vegetarian)

Compactness: prefer results that are tightly connected
• size of answer graph
[Figure: answer graphs connecting Einstein and Bohr — a compact one via the Nobel Prize (both won it) vs. a larger one via Tom Cruise (isa vegetarian, bornIn 1962)]
22/38
NAGA Ranking Model
Following the paradigm of statistical language models (used in speech recognition and modern IR), applied to graphs.

For query q with fact templates q1 … qn (e.g. bornIn ($x, Frankfurt)),
rank result graphs g with facts g1 … gn (e.g. bornIn (Goethe, Frankfurt))
by decreasing likelihoods, using a generative mixture model:

  P[q | g] = ∏_{i=1..n} ( α · P[q_i | g_i] + (1 − α) · P[q_i] )

where P[q_i | g_i] reflects informativeness, P[q_i] is a background model,
and α weights subqueries (e.g. bornIn ($x, Germany) & wonAward ($x, Nobel))
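The mixture-model score can be sketched directly from the formula — a toy implementation with illustrative probabilities, not NAGA's actual code:

```python
import math

def result_likelihood(match_probs, background_probs, alpha=0.8):
    """Mixture-model likelihood of a result graph:
    product over subqueries of alpha*P[q_i|g_i] + (1-alpha)*P[q_i].
    match_probs[i] ~ P[q_i|g_i] (informativeness),
    background_probs[i] ~ P[q_i] (background model)."""
    assert len(match_probs) == len(background_probs)
    return math.prod(alpha * p_match + (1 - alpha) * p_bg
                     for p_match, p_bg in zip(match_probs, background_probs))
```

Result graphs are then sorted by decreasing likelihood; the background term keeps a single poorly matching subquery from zeroing out an otherwise good answer.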
23/38
NAGA Ranking Model: Informativeness
Estimate P[q_i | g_i] for q_i = (x*, r, z) with variable x* (analogously for the other cases):

  P[x | r, z] = P[x, r, z] / P[r, z] = P[x, r, z] / ∑_{x'} P[x', r, z]

Estimate on the knowledge graph, e.g. bornIn ($x, Frankfurt):
  bornIn (Goethe, Frankfurt) vs. bornIn (GW, Frankfurt)

Estimate on the Web (exploit redundancy), e.g. isa (Einstein, $z):
  freq (Einstein, isa, physicist) vs. freq (Einstein, isa, vegetarian)
  [Figure: Albert Einstein —isa→ physicist, —isa→ vegetarian]
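Estimating P[x | r, z] from triple frequencies can be sketched as follows (the counts are invented purely for illustration):

```python
from collections import Counter

# Illustrative triple frequencies, e.g. from Web co-occurrence counts.
freq = Counter({("Goethe", "bornIn", "Frankfurt"): 900,
                ("GW", "bornIn", "Frankfurt"): 100})

def p_subject_given_rel_obj(x, r, z):
    """P(x | r, z) = freq(x, r, z) / sum over x' of freq(x', r, z)."""
    total = sum(c for (x2, r2, z2), c in freq.items() if r2 == r and z2 == z)
    return freq[(x, r, z)] / total if total else 0.0
```

With these toy counts, Goethe dominates the bornIn-Frankfurt distribution, which is exactly the informativeness preference the ranking model encodes.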
24/38
NAGA Example
Query:
  $x isa politician
  $x isa scientist
Results:
  Benjamin Franklin
  Paul Wolfowitz
  Angela Merkel
  …
25/38
User Study for Quality Assessment (1)
Benchmark:
• 55 queries from TREC QA 2005/2006
  Examples: 1) In what country is Luxor? 2) Discoveries of the 20th century?
• 12 queries from work on SphereSearch
  Examples: 1) In which movies did a governor act? 2) First name of politician Rice?
• 18 regular-expression queries by us
  Example: What do Albert Einstein and Niels Bohr have in common?

Competitors: NAGA vs. Google, Yahoo! Answers, BANKS (IIT Bombay), START (MIT)
26/38
User Study for Quality Assessment (2)
Benchmark     | #Q | #A   | Metric | Google | Yahoo! Answers | START  | BANKS scoring | NAGA
TREC QA       | 55 | 1098 | NDCG   | 75.88% | 26.15%         | 75.38% | 87.93%        | 92.75%
              |    |      | P@1    | 67.81% | 17.20%         | 73.23% | 69.54%        | 84.40%
SphereSearch  | 12 | 343  | NDCG   | 38.22% | 17.23%         | 2.87%  | 88.82%        | 91.01%
              |    |      | P@1    | 19.38% | 6.15%          | 2.87%  | 84.28%        | 84.94%
Own           | 18 | 418  | NDCG   | 54.09% | 17.98%         | 13.35% | 85.59%        | 91.33%
              |    |      | P@1    | 27.95% | 6.57%          | 13.57% | 76.54%        | 86.56%
Quality measures:
• Precision@1
• NDCG: normalized discounted cumulative gain, based on ratings
  highly relevant (2), somewhat relevant (1), irrelevant (0),
  with Wilson confidence intervals at α = 0.95
27/38
Outline
• Motivation
• Information Extraction & Knowledge Harvesting (YAGO)
• Ranking for Search over Entity-Relation Graphs (NAGA)
• Efficient Query Processing (RDF-3X)
• Conclusion
28/38
Why RDF? Why a New Engine?
[Figure: example entity-relation graph — Marie Curie bornAs "Maria Sklodowska", bornOn 1867, diedOn 1934, bornIn Warsaw (inCountry Poland), marriedTo Pierre Curie, almaMater U Paris, advisor Henri Becquerel (bornOn 1852, diedOn 1908); Marie and Pierre Curie wonAward Nobel Prize Physics, Marie Curie wonAward Nobel Prize Chemistry]
• RDF triples (subject – property/predicate – value/object):
  (id1, Name, "Marie Curie"), (id1, bornAs, "Maria Sklodowska"), (id1, bornOn, 1867),
  (id1, bornIn, id2), (id2, Name, "Warsaw"), (id2, locatedIn, id3), (id3, Name, "Poland"),
  (id1, marriedTo, id4), (id4, Name, "Pierre Curie"), (id1, wonAward, id5), (id4, wonAward, id5), …
• pay-as-you-go: schema-agnostic or schema later
• RDF triples form a fine-grained (ER) graph
• queries bound to need many star-joins and long chain-joins
• physical design critical, but hardly predictable workload
29/38
SPARQL Query Language
SPJ combinations of triple patterns, e.g.:
  Select ?c Where {
    ?p isa scientist . ?p bornIn ?t . ?p hasWon ?a .
    ?t inCountry ?c . ?a Name NobelPrize }

options for filter predicates, duplicate handling, wildcard joins, etc., e.g.:
  Select Distinct ?c Where {
    ?p ?r1 ?t . ?t ?r2 ?c . ?c isa <country> .
    ?p bornOn ?b . Filter (?b > 1945) }
support for RDFS: types
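A naive evaluator for such conjunctions of triple patterns (a basic graph pattern) can be sketched as follows — a toy Python illustration of the semantics, not how a real SPARQL engine executes:

```python
def match(pattern, triple, binding):
    """Try to extend binding so pattern matches triple; variables start with '?'."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if new.get(p, t) != t:   # variable already bound to something else
                return None
            new[p] = t
        elif p != t:                 # constant mismatch
            return None
    return new

def evaluate(patterns, triples):
    """Naively evaluate a conjunction of triple patterns via nested loops."""
    bindings = [{}]
    for pat in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match(pat, t, b)) is not None]
    return bindings

triples = [("id1", "bornIn", "id2"), ("id2", "inCountry", "id3")]
res = evaluate([("?p", "bornIn", "?t"), ("?t", "inCountry", "?c")], triples)
```

Each shared variable (here ?t) acts as a join condition — which is why realistic SPARQL queries turn into many star-joins and long chain-joins, motivating the engine design that follows.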
30/38
RDF & SPARQL Engines
choice of physical design is crucial

• giant triples table (vertically partitioned) — SESAME / OpenRDF, YARS2 (DERI):
    S    P        O
    id1  Name     Marie Curie
    id1  bornOn   1867
    id1  bornIn   id2
    id2  Name     Warsaw
    id2  Country  id11
    id1  Advisor  id5
    …    …        …
• property tables — Jena (HP Labs), Oracle RDF_MATCH, + physical design wizard!
    bornOn:  S    O       Advisor:  S    O
             id1  1867              id1  id5
             id5  1852              …    …
             …    …
• clustered property tables (+ leftover table):
    Person:  S    Name     bornOn  bornIn  …
             id1  Marie C  1867    id2
             id2  Henri B  1852    id9
             …    …        …       …
    Town:    S    Name     Country
             id2  Warsaw   id11
             …    …        …
• column stores (+ materialized views) — C-Store (MIT), MonetDB (CWI)
31/38
RDF-3X: a RISC-style Engine[T. Neumann, G. Weikum: VLDB 2008]
Design rationale:
• RDF-specific engine (not an RXORDBMS)
• simplify operations
• reduce implementation choices
• optimize for the common case
• eliminate tuning knobs

Key principles:
• mapping dictionary for encoding all literals into ids
• exhaustive indexing of id triples
• index-only store, high compression
• QP mostly merge joins with order-preservation
• very fast DP-based query optimizer
• frequent-paths synopses, property-value histograms
32/38
RDF-3X Indexing
index all collation orders of subject-property-object id triples: SPO, SOP, OSP, OPS, PSO, POS
• directly stored in clustered B+ trees• high compression: indexes < original data• can choose any order for scan & join
additionally index count-aggregated projections in all orders: SP, SO, OS, OP, PS, PO – with counter for each entry
• enables efficient bookkeeping for duplicates• also index projections S, P, O with count-aggregation
also need two mapping indexes: literal → id and id → literal
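Dictionary encoding plus the six collation orders can be sketched as follows — sorted Python lists stand in for RDF-3X's clustered B+ trees, and all names are illustrative:

```python
from itertools import permutations

def build_indexes(triples):
    """Dictionary-encode literals to ids, then index the id triples in all
    six collation orders (SPO, SOP, OSP, OPS, PSO, POS)."""
    dictionary, ids = {}, []
    for t in triples:
        # literal -> id mapping; the inverse mapping is dictionary inverted
        ids.append(tuple(dictionary.setdefault(lit, len(dictionary)) for lit in t))
    indexes = {}
    for order in permutations(range(3)):  # all collation orders of S, P, O
        name = "".join("SPO"[i] for i in order)
        indexes[name] = sorted(tuple(t[i] for i in order) for t in ids)
    return dictionary, indexes

dictionary, indexes = build_indexes([("Marie_Curie", "bornIn", "Warsaw"),
                                     ("Warsaw", "inCountry", "Poland")])
```

Having every collation order materialized is what lets the query processor pick a scan order per triple pattern so that joins become order-preserving merge joins.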
33/38
RDF-3X Query Optimization
Principles:
• optimizing join orders is key (star joins, long join chains)
• should exploit exhaustive indexes and order-preservation
• support merge-joins and hash-joins

Bottom-up dynamic programming for exhaustive plan enumeration (< 100 ms for 20 joins)
Cost model based on selectivity estimation from• histograms for each of the 6 SPO orderings (approx. equi-depth)
• frequent join paths (property sequences) for stars and chains
Example query (chain with selections):
  ?x1 —p1→ ?x2 —p2→ ?x3 —p3→ ?x4 —p4→ ?x5 —p5→ ?x6,
  with attached triple patterns ?x1 —a1→ v1, ?x4 —a4→ v4, ?x6 —a6→ v6
34/38
Experimental Evaluation: Setup
Setup and competitors: 2 GHz dual core, 2 GB RAM, 30 MB/s disk, Linux
• column-store property tables by Abadi et al., using MonetDB
• triples store with SPO, POS, PSO indexes, using PostgreSQL

Datasets:
1) Barton library catalog: 51 Mio. triples (4.1 GB)
2) YAGO knowledge base: 40 Mio. triples (3.1 GB)
3) Librarything social-tagging excerpt: 30 Mio. triples (1.8 GB)

Benchmark queries (7 or 8 per dataset) in the spirit of:
1) counts of French library items (books, music, etc.), with creator, publisher, language, etc.
2) scientists from Poland with a French advisor, where both won awards
3) books tagged with romance, love, mystery, suspense by users who like crime novels and have friends who ...

Select ?t Where {
  ?b hasTitle ?t . ?u romance ?b . ?u love ?b . ?u mystery ?b . ?u suspense ?b .
  ?u crimeNovel ?c . ?u hasFriend ?f . ?f ... }
35/38
Experimental Evaluation: Results
DB sizes [GB]:
             Barton   Yago     LibThing
RDF-3X       2.8      2.7      1.6
MonetDB      1.6-2.0  1.1-2.4  0.7-6.9
PostgreSQL   8.7      7.5      5.7

DB load times [min]:
             Barton  Yago  LibThing
RDF-3X       13      25    20
MonetDB      11      21    4
PostgreSQL   30      25    20

Query run-times [sec], geometric means, warm (cold) cache:
             Barton        Yago         LibThing
RDF-3X       0.4 (5.9)     0.04 (0.7)   0.13 (0.89)
MonetDB      3.8 (26.4)    54.6 (78.2)  4.39 (8.16)
PostgreSQL   64.3 (167.8)  0.56 (10.6)  30.4 (93.9)
36/38
Outline
• Motivation
• Information Extraction & Knowledge Harvesting (YAGO)
• Ranking for Search over Entity-Relation Graphs (NAGA)
• Efficient Query Processing (RDF-3X)
• Conclusion
37/38
Summary & Outlook
lift the world's best information sources (Wikipedia, Web, Web 2.0)
to the level of explicit knowledge (ER-oriented facts)
1) building knowledge graphs: combine semantic & statistical & social IE sources
   (for the scholarly Web, digital libraries, enterprise know-how);
   challenges in consistency vs. uncertainty, long-term evolution
2) heterogeneity & uncertain IE necessitate ranking
   → new ranking models (e.g. statistical LM for graphs)
3) efficiency and scalability challenges for search & ranking (top-k queries) and updates
38/38
Thank You!