AKSW, Universität Leipzig
BOA: Extracting Multilingual Natural-Language Patterns for RDF Predicates
Daniel Gerber, Axel-Cyrille Ngonga Ngomo
Bootstrapping the Data Web
EKAW - http://boa.aksw.org - 10.10.2012
Motivation
๏ Most knowledge bases are extracted from (semi-)structured data
๏ Only 15-20 % of information is available as structured data
๏ Semantic Web ⬌ Document Web
๏ How can we extract data from the document-oriented web?
Idea I
dbpedia:Barack_Obama dbpedia-owl:birthPlace dbpedia:Honolulu,_Hawaii
dbpedia:Barack_Obama dbpedia-owl:party dbpedia:Democratic_Party
dbpedia:Barack_Obama dbpedia-owl:spouse dbpedia:Michelle_Obama
Idea II
Barack Obama was born in Honolulu, Hawaii.
Barack Hussein Obama is a politician of the Democratic Party.
Obama married Michelle Robinson in 1992.
was born in
is a politician of the
married
Idea III
is a politician of the
was born in
married
Joseph Martin "Joschka" Fischer (born 1948-04-12) is a politician of the German Green Party.
Dietrich's only child, Maria Elisabeth Sieber, was born in Berlin on 13 December 1924.
Jackie Bouvier Kennedy Onassis who married John F. Kennedy was tied to the Auchinclosses via her sister's marriage into the Auchincloss family.
The BOA approach
[Figure: BOA architecture, steps 1-8. Labels: Data Web, Web, Corpus Extraction Module (Crawler, Cleaner, Indexer), Corpora, Surface forms, Patterns, SPARQL, Search & Filter, Filter, Feature Extraction, Neural Network, Generation]
Pattern Search
(1) Start with a set of entity pairs (s, o) connected through a predicate p
(2) Find all sentences which contain the labels of both s and o
(3) Replace the labels with variables (?D?, ?R?)
BOA pattern:
“?D? with his wife ?R?”
BOA pattern mapping for dbpedia-owl:spouse:
“?D? with his wife ?R?”
“?D? and his wife ?R?”
“?D? and her husband ?R?”
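The three search steps can be sketched in a few lines of Python. This is only an illustration under simple assumptions: labels are matched as plain substrings, the pattern is whatever text lies between the two labels, and the name find_patterns and the sample sentences are hypothetical, not part of BOA.

```python
from collections import Counter

def find_patterns(pairs, sentences):
    """Count candidate patterns for label pairs (subject_label, object_label)."""
    patterns = Counter()
    for subj_label, obj_label in pairs:
        for sentence in sentences:
            s_pos = sentence.find(subj_label)
            o_pos = sentence.find(obj_label)
            if s_pos < 0 or o_pos < 0:
                continue  # step (2): both labels must occur in the sentence
            # Order the two mentions by position; ?D? stands for the subject
            # (domain) label and ?R? for the object (range) label.
            if s_pos < o_pos:
                first, second = (s_pos, subj_label, "?D?"), (o_pos, obj_label, "?R?")
            else:
                first, second = (o_pos, obj_label, "?R?"), (s_pos, subj_label, "?D?")
            between = sentence[first[0] + len(first[1]):second[0]].strip()
            if not between:
                continue  # labels adjacent or overlapping: nothing to learn
            # step (3): replace the labels with variables
            patterns[f"{first[2]} {between} {second[2]}"] += 1
    return patterns

# Hypothetical input data for one predicate (dbpedia-owl:spouse).
PAIRS = [("Barack Obama", "Michelle Obama")]
SENTENCES = [
    "Barack Obama and his wife Michelle Obama attended.",
    "Michelle Obama and her husband Barack Obama arrived.",
    "Barack Obama gave a speech.",
]
patterns = find_patterns(PAIRS, SENTENCES)
```

A real system would also have to deal with surface forms (several labels per entity) and with labels that occur more than once in a sentence; both are left out here.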
Feature Extraction - Language Independent
Support: pattern should be used across several triples
subsidiary ↣ “?Company was acquired by ?Company”
๏ Google - DoubleClick: 2
๏ General Motors - Opel: 1
๏ Cablevision - Rainbow Media: 4
Specificity: pattern should not be used by many pattern mappings
๏ subsidiary: “?R? is a part of ?D?”
๏ foundationOrg: “?R? is a part of ?D?”
Typicity: pattern should be used to connect entities of the correct type
๏ Hypercom_ORG was_O acquired_O by_O Verifone_ORG ._O
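To make the three intuitions concrete, here is a minimal Python sketch. The scoring functions are simplified stand-ins chosen for illustration (a pair count, an IDF-style logarithm, a type-match ratio); they are assumptions and not the formulas used in BOA. All function names are hypothetical.

```python
import math

def support(pair_counts):
    """Return (number of distinct entity pairs, highest per-pair count)."""
    return len(pair_counts), max(pair_counts.values(), default=0)

def specificity(pattern, mappings):
    """IDF-style score: lower when many pattern mappings share the pattern."""
    containing = sum(1 for patterns in mappings.values() if pattern in patterns)
    if containing == 0:
        return 0.0
    return math.log(len(mappings) / containing)

def typicity(occurrences, domain_type, range_type):
    """Share of occurrences whose entity tags match the expected types."""
    if not occurrences:
        return 0.0
    hits = sum(1 for d, r in occurrences if d == domain_type and r == range_type)
    return hits / len(occurrences)

# Counts taken from the slide for "?Company was acquired by ?Company".
pair_counts = {
    ("Google", "DoubleClick"): 2,
    ("General Motors", "Opel"): 1,
    ("Cablevision", "Rainbow Media"): 4,
}
# The same pattern appears in both mappings, so it is not specific.
mappings = {
    "subsidiary": {"?R? is a part of ?D?"},
    "foundationOrg": {"?R? is a part of ?D?"},
}
# Hypothetical NER tags of the entities found around a pattern.
occurrences = [("ORG", "ORG"), ("ORG", "PER")]

support_score = support(pair_counts)
specificity_score = specificity("?R? is a part of ?D?", mappings)
typicity_score = typicity(occurrences, "ORG", "ORG")
```

With these inputs the shared pattern "?R? is a part of ?D?" gets a specificity of zero, which matches the intuition on the slide that a pattern used by several mappings says little about any one of them.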
Feature Extraction - Language Dependent
Intrinsic Information Content Metric (IICM)
๏ dbpedia:subsidiary rdfs:label “subsidiary”@en
๏ Label compared via WordNet with the pattern “?D? was acquired by ?R?”
ReVerb
๏ Open Information Extraction
๏ Patterns need to abide by a POS tag sequence
๏ Logistic Regression Classifier
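The POS tag sequence check can be illustrated with a small Python sketch. It is loosely modeled on the syntactic constraint published for ReVerb (V | V P | V W* P) but heavily simplified, and it does not call ReVerb itself. It assumes the Penn Treebank tags are already given; the tag grouping and the function names are assumptions made for this example.

```python
import re

def pos_class(tag):
    """Map a Penn Treebank tag to a coarse class: V, W, P or X (other)."""
    if tag.startswith("VB"):
        return "V"  # verbs
    if tag in {"IN", "RP", "TO"}:
        return "P"  # prepositions, particles, infinitive marker
    if tag.startswith(("NN", "JJ", "RB", "PRP")) or tag == "DT":
        return "W"  # nouns, adjectives, adverbs, pronouns, determiners
    return "X"

# One or more chunks of: verbs, optionally followed by words and a preposition.
RELATION_SHAPE = re.compile(r"(?:V+(?:W*P)?)+")

def looks_like_relation(tags):
    """True if the whole tag sequence fits the simplified relation shape."""
    shape = "".join(pos_class(tag) for tag in tags)
    return RELATION_SHAPE.fullmatch(shape) is not None

# "was acquired by" -> VBD VBN IN
acquired_ok = looks_like_relation(["VBD", "VBN", "IN"])
# "with his wife" -> IN PRP$ NN (no verb, so it does not fit)
wife_ok = looks_like_relation(["IN", "PRP$", "NN"])
```

A pattern that fails this check is not discarded in this sketch; the boolean would simply be one more input feature for the classifier.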
BOA Neural Network
[Figure: feed-forward network with an input layer (values in [0,1]; example features: ReVerb, Specificity, IICM, Typicity), a hidden layer, and an output layer (value in [0,1])]
๏ 200 patterns are manually classified as good (1) or bad (0)
๏ up to 18 features, depending on language
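As an illustration of how such a scorer could be trained, the following Python sketch implements a tiny feed-forward network with one hidden layer and plain gradient descent. The layer sizes, learning rate, number of epochs and the four training examples are all made up for this example; nothing here reflects the actual BOA configuration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNetwork:
    """One hidden layer, sigmoid activations, a single output in [0, 1]."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight vector carries its bias as the last entry.
        self.hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
                       for _ in range(n_hidden)]
        self.output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, features):
        hidden_out = [
            sigmoid(sum(w * x for w, x in zip(unit, features)) + unit[-1])
            for unit in self.hidden
        ]
        score = sigmoid(
            sum(w * h for w, h in zip(self.output, hidden_out)) + self.output[-1]
        )
        return hidden_out, score

    def train_step(self, features, label, rate=0.5):
        hidden_out, score = self.forward(features)
        # Gradient of the squared error through the output sigmoid.
        delta_out = (score - label) * score * (1.0 - score)
        for j, h in enumerate(hidden_out):
            # Uses the output weight from before this step's update.
            delta_h = delta_out * self.output[j] * h * (1.0 - h)
            for i, x in enumerate(features):
                self.hidden[j][i] -= rate * delta_h * x
            self.hidden[j][-1] -= rate * delta_h
            self.output[j] -= rate * delta_out * h
        self.output[-1] -= rate * delta_out

def train(network, examples, epochs):
    for _ in range(epochs):
        for features, label in examples:
            network.train_step(features, label)

# Hypothetical feature vectors [reverb, specificity, iicm, typicity],
# labelled good (1) or bad (0) as on the slide.
EXAMPLES = [
    ([0.9, 0.8, 0.7, 0.9], 1),
    ([0.8, 0.9, 0.6, 0.8], 1),
    ([0.1, 0.2, 0.1, 0.3], 0),
    ([0.2, 0.1, 0.3, 0.1], 0),
]
net = TinyNetwork(n_inputs=4, n_hidden=3)
train(net, EXAMPLES, epochs=3000)
```

After training, net.forward(features)[1] returns a score in [0, 1] that can be used to rank patterns.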
RDF Generation
Sentence: Pacheco arrived with his wife Leyla Rodriguez Stahl and several...
NER tags: Pacheco_PER arrived_O with_O his_O wife_O Leyla_PER Rodriguez_PER Stahl_PER and_O
Pattern: ?D? with his wife ?R?
Resulting RDF (the slide labels four elements of this graph NEW):
dbpedia:Abel_Pacheco dbpedia-owl:spouse boa:Leyla_Rodriguez_Stahl
dbpedia:Abel_Pacheco rdfs:label “Abel Pacheco”@en
dbpedia:Abel_Pacheco rdf:type dbpedia-owl:Person
boa:Leyla_Rodriguez_Stahl rdfs:label “Leyla Rodriguez Stahl”@en
boa:Leyla_Rodriguez_Stahl rdf:type dbpedia-owl:Person
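The generation step can be sketched in Python as follows. The sketch assumes the sentence is already NER-tagged in the word_TAG form shown on the slide, looks for the pattern body, and picks the nearest entity of the expected type on each side. Mapping the extracted labels to URIs (for example "Pacheco" to dbpedia:Abel_Pacheco via surface forms) is left out, and all function names are hypothetical.

```python
def parse_tagged(tagged):
    """'Pacheco_PER arrived_O' -> [('Pacheco', 'PER'), ('arrived', 'O')]"""
    return [tuple(token.rsplit("_", 1)) for token in tagged.split()]

def entity_runs(tokens):
    """Group consecutive tokens with the same non-O tag: (start, end, text, tag)."""
    runs, i = [], 0
    while i < len(tokens):
        tag = tokens[i][1]
        if tag == "O":
            i += 1
            continue
        j = i
        while j < len(tokens) and tokens[j][1] == tag:
            j += 1
        runs.append((i, j, " ".join(word for word, _ in tokens[i:j]), tag))
        i = j
    return runs

def extract(tagged, pattern_body, predicate, domain_type, range_type):
    """Return (subject_label, predicate, object_label) for each pattern match."""
    tokens = parse_tagged(tagged)
    words = [word for word, _ in tokens]
    body = pattern_body.split()
    runs = entity_runs(tokens)
    triples = []
    for start in range(len(words) - len(body) + 1):
        if words[start:start + len(body)] != body:
            continue
        end = start + len(body)
        # Nearest entity of the right type to the left (?D?) and right (?R?).
        left = [r for r in runs if r[1] <= start and r[3] == domain_type]
        right = [r for r in runs if r[0] >= end and r[3] == range_type]
        if left and right:
            triples.append((left[-1][2], predicate, right[0][2]))
    return triples

TAGGED = ("Pacheco_PER arrived_O with_O his_O wife_O "
          "Leyla_PER Rodriguez_PER Stahl_PER and_O")
triples = extract(TAGGED, "with his wife", "dbpedia-owl:spouse", "PER", "PER")
```

Note that "arrived" sits between the subject and the pattern body in this sentence, which is why the sketch searches for the nearest typed entity instead of requiring the entity to be directly adjacent.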
Evaluation I
                          en-wiki            en-news  de-wiki            de-news
Language                  English            English  German             German
Topic                     general knowledge  news     general knowledge  news
# of sentences            58M                214.2M   24.6M              112.8M
# of tokens per sentence  21.4               22.1     17.4               18.3
Evaluation II
                       en-wiki  en-news  de-wiki  de-news
# of pattern mappings  125      44       66       19
# of patterns          9551     586      7366     109
# of new triples       78944    22883    10138    883
# of known triples     1829     798      655      42
# of found triples     80773    3081     10793    925
Precision Top-100      92 %     70 %     91 %     74 %
Conclusion
๏ No manually created seed patterns needed
๏ > 90 % top-100 precision on the German and English Wikipedia corpora
๏ High recall through surface forms
๏ Output easily integrable into the LOD Cloud
๏ Library of natural-language representations of formal relations, Demo
BOA Graphical User Interface
http://boa.aksw.org
Thank you! Questions?
Daniel Gerber
Augustusplatz 10, Room P616
04109 Leipzig, Germany
SIMBA@AKSW
http://bis.informatik.uni-leipzig.de/DanielGerber
http://boa.aksw.org
http://code.google.com/p/boa