Using the Web of Data for Information Extraction
DESCRIPTION
Talk at Insiders Technologies, 21.01.2010. It covers publishing RDF data with D2R Server, linking the data to obtain Linked Data, querying the data with SPARQL via SQUIN, and finally annotating text with this data using RDFa in Epiphany.
TRANSCRIPT
Benjamin Adrian, http://www.dfki.uni-kl.de/~adrian
Insiders, January 2010
Using the Web of Data for
Information Extraction
sparql, rdf, rdfa, D2R server, scoobie, epiphany, squin, Linked Data, OBIE
Are you still surfing ...
… or overloaded?
A simple question ...
What are the cities of the universities in Rhineland-Palatinate, and what is the unemployment rate of these cities?
A simple question ... as a SPARQL query (http://www.w3.org/TR/rdf-sparql-query/):
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX eurostat: <http://www4.wiwiss.fu-berlin.de/eurostat/resource/eurostat/>
PREFIX dbpedia: <http://dbpedia.org/ontology/>
PREFIX dbpedia_cat: <http://dbpedia.org/resource/Category:>
SELECT ?dbpcity ?cityName ?ur WHERE {
  ?uni skos:subject dbpedia_cat:Universities_and_colleges_in_Rhineland-Palatinate ;
       dbpedia:city ?dbpcity .
  ?dbpcity owl:sameAs ?statcity .
  ?statcity rdfs:label ?cityName ;
            eurostat:unemployment_rate_total ?ur .
}
… and its answer.
dbpcity                                cityName    ur
http://dbpedia.org/resource/Koblenz    Koblenz     8.8
http://dbpedia.org/resource/Trier      Trier       7.3
Query Engine: SQUIN - Query the Web of Linked Data, http://squin.sourceforge.net/
Data Sources: http://wiki.dbpedia.org, http://epp.eurostat.ec.europa.eu, http://www4.wiwiss.fu-berlin.de/eurostat/
So much data out there, too much?
What data do you have?
Are you still surfing ...
Agenda
In order to use the Web of Data for information extraction, you have to understand its basics.
● RDF on one slide
● Publish data in RDF with D2R Server
● Publish RDF as Linked Data
● Query Linked Data with SPARQL and Squin
● Use RDF for information extraction
● Bring Linked Data to text via RDFa
Wouldn't this be nice.
[Diagram, built up over four slides. Elements: Data, Text, Extraction Pipeline, Extraction Results, User-defined Filter, annotated text. Arrow labels: populate, enrich, annotate.]
RDF on one slide
Found at:
* From: http://sig.ma/entity/ddcb76b935e91940e5508a460619a2ac.rdf
@prefix dblp_author: <http://dblp.l3s.de/d2r/page/authors/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix acm: <http://acm.rkbexplorer.com/description/> .
dblp_author:Michael_Gillmann
    foaf:name "Michael Gillmann" ;
    rdfs:seeAlso <http://www.bibsonomy.org/uri/author/Michael+Gillmann> ;
    rdf:type foaf:Agent ;
    owl:sameAs acm:person-197117-81d3fccbfd0249fc33c0d00f03a30af4 ;
    foaf:isMakerOf <http://dblp.l3s.de/d2r/resource/publications/dblp_pub:conf/icdar/SchulzEGAAD09> .
<http://dblp.l3s.de/d2r/resource/publications/dblp_pub:conf/icdar/SchulzEGAAD09>
    dc:creator dblp_author:Michael_Gillmann ;
    dc:creator dblp_author:Markus_Ebbecke ;
    dc:title "Seizing the Treasure: Transferring Knowledge in Invoice Analysis" .
The slide is repeated with successive highlights: Vocabularies, URLs / URIs, Subjects, Predicates, Objects.
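As a minimal sketch of the subject / predicate / object structure highlighted above, the statements about the publication can be held as plain Python tuples. The constant names and the helper `describe` are illustrative assumptions and do not belong to any RDF library.

```python
# Namespace URIs and the publication URI are copied from the slide.
DBLP_AUTHOR = "http://dblp.l3s.de/d2r/page/authors/"
DC = "http://purl.org/dc/terms/"
PUB = "http://dblp.l3s.de/d2r/resource/publications/dblp_pub:conf/icdar/SchulzEGAAD09"

# Each RDF statement is one (subject, predicate, object) triple.
triples = [
    (PUB, DC + "creator", DBLP_AUTHOR + "Michael_Gillmann"),
    (PUB, DC + "creator", DBLP_AUTHOR + "Markus_Ebbecke"),
    (PUB, DC + "title",
     "Seizing the Treasure: Transferring Knowledge in Invoice Analysis"),
]


def describe(subject, graph):
    """Group the predicate/object pairs of one subject (hypothetical helper)."""
    description = {}
    for s, p, o in graph:
        if s == subject:
            description.setdefault(p, []).append(o)
    return description


creators = describe(PUB, triples)[DC + "creator"]
print(len(creators), "creators found")
```

Because every statement has the same three-part shape, data from different sources can simply be appended to the same list.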
RDF data is graph data.
Publishing relational data in RDF
D2R Server - Publishing Relational Databases on the Semantic Web
http://www4.wiwiss.fu-berlin.de/bizer/d2r-server/
Two small command line calls:
./generate-mapping -o mydatabase.n3 -b http://projects.dfki.uni-kl.de/mydatabase/ jdbc:mysql://localhost:3306/mydatabase
./d2r-server -p 80 -b http://projects.dfki.uni-kl.de/mydatabase/ mydatabase.n3
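Conceptually, a database-to-RDF mapping turns each table row into a resource URI plus one triple per column. The following is a simplified sketch of that idea only. It is not how D2R Server is implemented, and the table name `customer`, its columns, and the `resource/` and `vocab/` path segments are hypothetical.

```python
BASE = "http://projects.dfki.uni-kl.de/mydatabase/"  # base URI from the slide


def row_to_triples(table, key_column, row):
    """Map one relational row (a dict) to (subject, predicate, object) triples."""
    subject = f"{BASE}resource/{table}/{row[key_column]}"
    triples = []
    for column, value in row.items():
        if column == key_column or value is None:
            # The key is already part of the URI; NULL values yield no triple.
            continue
        predicate = f"{BASE}vocab/{table}_{column}"
        triples.append((subject, predicate, value))
    return triples


# Hypothetical row of a hypothetical "customer" table.
example = row_to_triples("customer", "id", {"id": 7, "name": "ACME", "city": None})
for triple in example:
    print(triple)
```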
Linked Data: Linking RDF data from different sources
Customer DB, Employees DB, Project DB, DBpedia
How to interlink these datasets?
Linked Data Principles (TimBL, 2006)
1. Use URIs as names for things (e.g., http://dbpedia.org/resource/Berlin)
2. Use HTTP URIs so that people can look up those names
3. Provide useful information in RDF when someone looks up a URI
4. Include links to other URIs to enable discovery of more information
Example:
<http://dbpedia.org/resource/Berlin>
    owl:sameAs opencyc:en/CityOfBerlinGermany ;
    owl:sameAs opencyc:en/Berlin_StateGermany ;
    owl:sameAs <http://sws.geonames.org/2950159/> ;
    owl:sameAs <http://www4.wiwiss.fu-berlin.de/eurostat/resource/regions/Berlin> ;
    owl:sameAs freebase:http://dbpedia.org/resource/Berlin .
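To illustrate why such links matter, the sketch below treats owl:sameAs as symmetric and transitive and collects all URIs that name the same thing. It uses two of the links from the example above. The function name `equivalents` is an assumption made for this illustration.

```python
from collections import defaultdict, deque

BERLIN = "http://dbpedia.org/resource/Berlin"
GEONAMES = "http://sws.geonames.org/2950159/"
EUROSTAT = "http://www4.wiwiss.fu-berlin.de/eurostat/resource/regions/Berlin"

# (subject, object) pairs of owl:sameAs statements, taken from the example.
same_as_links = [(BERLIN, GEONAMES), (BERLIN, EUROSTAT)]


def equivalents(uri, links):
    """Return every URI reachable from `uri` over owl:sameAs links,
    reading the links as symmetric and transitive."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, queue = {uri}, deque([uri])
    while queue:
        current = queue.popleft()
        for other in neighbours[current] - seen:
            seen.add(other)
            queue.append(other)
    return seen


aliases = equivalents(GEONAMES, same_as_links)
print(sorted(aliases))
```

Starting from the GeoNames URI, the Eurostat URI is reached through DBpedia, although no direct link between the two is stated.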
SPARQL: Querying RDF data
SPARQL - the RDF query language.
In contrast to SQL, its data model is not set-oriented but graph-oriented.
Some examples (assuming PREFIX foaf: <http://xmlns.com/foaf/0.1/>):
Resulting in tuples:
SELECT ?interest ?friend WHERE {
  <http://www.w3.org/People/Berners-Lee/card#i> foaf:knows ?friend .
  ?friend foaf:interest ?interest .
}
Resulting in a graph:
CONSTRUCT { ?friend foaf:interest ?interest } WHERE {
  <http://www.w3.org/People/Berners-Lee/card#i> foaf:knows ?friend .
  ?friend foaf:interest ?interest .
}
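The difference between the two result forms can be sketched with a tiny pattern matcher over an in-memory list of triples. This is a simplified illustration of basic graph pattern matching, not a SPARQL engine. The `ex:` resources are invented toy data.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`; None on mismatch."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):           # variable
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:               # constant must match exactly
            return None
    return extended


def evaluate(patterns, graph):
    """Join the matches of all triple patterns (a basic graph pattern)."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in graph:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Invented toy graph.
graph = [
    ("ex:me", "foaf:knows", "ex:alice"),
    ("ex:me", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:interest", "ex:rdf"),
]
where = [("ex:me", "foaf:knows", "?friend"),
         ("?friend", "foaf:interest", "?interest")]

solutions = evaluate(where, graph)
# SELECT projects the bindings to tuples ...
select_rows = [(b["?interest"], b["?friend"]) for b in solutions]
# ... while CONSTRUCT instantiates a triple template, yielding a graph.
construct_graph = [(b["?friend"], "foaf:interest", b["?interest"]) for b in solutions]
print(select_rows)
print(construct_graph)
```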
SPARQL: Query Linked Data from different sources
Customer DB, Employees DB, Project DB, DBpedia
How to access these datasets with a single SPARQL query?
Squin: Query the Web of Linked Data
http://squin.sourceforge.net/
Squin follows a Link Traversal approach over HTTP URIs.
[Diagram. Elements: SQUIN; four D2R Server boxes; Customer DB, Employees DB, Project DB, DBpedia.]
Remember:
SELECT DISTINCT ?c ?cityName ?ur WHERE {
  ?u skos:subject dbpedia_cat:Universities_and_colleges_in_Rhineland-Palatinate ;
     dbpedia:city ?c .
  ?c owl:sameAs [ rdfs:label ?cityName ;
                  eurostat:unemployment_rate_total ?ur ]
}
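The link traversal idea can be illustrated with a toy "web" held in a dictionary: looking up a URI returns the triples published there, and URIs found in those triples are looked up in turn. This is a strongly simplified sketch and not SQUIN's implementation. Real link traversal engines interleave lookups with query evaluation. All URIs and values below are invented.

```python
# Toy web: dereferencing a URI returns the triples published at that URI.
TOY_WEB = {
    "ex:uni": [("ex:uni", "ex:city", "ex:cityA")],
    "ex:cityA": [("ex:cityA", "owl:sameAs", "stat:cityA")],
    "stat:cityA": [("stat:cityA", "ex:unemploymentRate", "9.9")],
}


def dereference(uri):
    """Stand-in for an HTTP lookup that asks for RDF; unknown names yield nothing."""
    return TOY_WEB.get(uri, [])


def traverse(seed_uris, max_lookups=100):
    """Collect triples by following URIs found in already retrieved data."""
    graph, seen, frontier = set(), set(), list(seed_uris)
    while frontier and len(seen) < max_lookups:
        uri = frontier.pop(0)
        if uri in seen:
            continue
        seen.add(uri)
        for triple in dereference(uri):
            graph.add(triple)
            subject, _, obj = triple
            # Literals also land here; looking them up simply returns nothing.
            frontier.extend(term for term in (subject, obj) if term not in seen)
    return graph


collected = traverse(["ex:uni"])
print(len(collected), "triples discovered from a single seed URI")
```

Starting from one university URI, the unemployment figure is found two hops away, across the owl:sameAs link between the two datasets.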
Using RDF and Linked Data for Information Extraction
[Diagram. Elements: User, Linked Data, Text, Extraction Pipeline, Query, Result Graph. Arrow labels: "asks question about", "answers to".]
What data do we have?
Classes: foaf:Document, foaf:Person
Instances: .../SchulzEGAAD09, .../Markus_Ebbecke
Datatype Properties: dc:title, foaf:name, foaf:firstName, foaf:surName
Object Properties: dc:creator, foaf:knows
Literals: "Markus", "Ebbecke", "Seizing the Treasure: Transferring Knowledge in Invoice Analysis"
Example RDF data:
<http://dblp.l3s.de/d2r/resource/publications/dblp_pub:conf/icdar/SchulzEGAAD09>
    rdf:type foaf:Document ;
    dc:creator dblp_author:Markus_Ebbecke ;
    dc:title "Seizing the Treasure: Transferring Knowledge in Invoice Analysis" .
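A rough sketch of how such an inventory could be derived from triples is given below. The `Literal` marker type, the function `inventory`, and the abbreviated name `dblp_pub:SchulzEGAAD09` are assumptions made for this illustration. It is not taken from SCOOBIE.

```python
from collections import namedtuple

Literal = namedtuple("Literal", "value")  # marks objects that are literals
RDF_TYPE = "rdf:type"


def inventory(triples):
    """Sort the terms of a set of triples into the five groups named above."""
    found = {"classes": set(), "instances": set(),
             "datatype_properties": set(), "object_properties": set(),
             "literals": set()}
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            found["instances"].add(subject)      # typed subject
            found["classes"].add(obj)
        elif isinstance(obj, Literal):
            found["datatype_properties"].add(predicate)
            found["literals"].add(obj.value)
        else:
            found["object_properties"].add(predicate)
    return found


PUB = "dblp_pub:SchulzEGAAD09"  # abbreviated publication URI (assumption)
data = [
    (PUB, RDF_TYPE, "foaf:Document"),
    (PUB, "dc:creator", "dblp_author:Markus_Ebbecke"),
    (PUB, "dc:title",
     Literal("Seizing the Treasure: Transferring Knowledge in Invoice Analysis")),
]
result = inventory(data)
for group, members in result.items():
    print(group, sorted(members))
```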
SCOOBIE: Domain Adaption
[Diagram. Phases: Data Preprocessing & Learning (offline), Information Extraction (online). Elements: Structured Data, Vocabulary Data, Instance Data, Text Corpus Data, Patterns and Gazetteers, Data.]
SCOOBIE: Eco System
[Diagram. Stages: Pre-process, Train, Extract. Elements: Training Corpus, Patterns + Gazetteers, Text Corpus, Ontology, Instances, Domain Knowledge Models, Session Data, Tasks, Index, Models.]
SCOOBIE: OBIE Pipeline
● Normalization: Text Extraction, Language Detection
● Segmentation: Tokenization, Sentence Extraction, POS-Tagging
● Symbolization: Named Entity Recognition, Structured Entity Recognition, Noun Phrase Chunking, Symbol Recognition
● Instantiation: Instance Recognition, Instance Disambiguation, Chunk Classification
● Contextualization: Fact Extraction, Fact Selection
● Population: Query Answering
Used Machine Learning Models
● Regex matching statistics (Structured Entity Recognition)
● Gazetteer matching statistics (Named Entity Recognition)
● CRF-based Noun Phrase Chunker
● K-Nearest-Neighbor chunk classifier (Chunk Classification)
● Spreading Activation-based fact ranking (Fact Selection)
● TF/IDF-based instance re-ranking (Instance Disambiguation)
Learning types named on the slide: Supervised Learning, Semi-Supervised Learning, Unsupervised or Instance-based Learning
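As an illustration of the gazetteer idea, the sketch below builds a lookup table from label literals in RDF data and finds whole-word mentions in text. It is a minimal sketch under the assumption that labels are matched exactly. The function names are hypothetical and the code is not taken from SCOOBIE.

```python
import re


def build_gazetteer(triples, label_predicates=("foaf:name", "rdfs:label")):
    """Map each label literal to the URI of the instance it names."""
    return {obj: subj for subj, pred, obj in triples if pred in label_predicates}


def find_mentions(text, gazetteer):
    """Return (start, end, uri) for every whole-word occurrence of a label."""
    mentions = []
    for label, uri in gazetteer.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, text):
            mentions.append((match.start(), match.end(), uri))
    return sorted(mentions)


# Label taken from the RDF example earlier in the talk.
triples = [("dblp_author:Michael_Gillmann", "foaf:name", "Michael Gillmann")]
text = "The paper was co-authored by Michael Gillmann."
mentions = find_mentions(text, build_gazetteer(triples))
for start, end, uri in mentions:
    print(text[start:end], "->", uri)
```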
Used Machine Learning: Conditional Random Field
CRFs are sequence taggers:
Train it with:
  Bill CAPITALIZED noun
  slept LOWERCASE non-noun
  here LOWERCASE non-noun
Test it with:
  He CAPITALIZED
  visited LOWERCASE
  London CAPITALIZED
CRF results:
  noun
  non-noun
  non-noun
MALLET - MAchine Learning for LanguagE Toolkit
http://mallet.cs.umass.edu/
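The one-token-per-line input shown above can be produced with a few lines of Python. The sketch below only generates the feature lines. It does not call MALLET, and the helper name `feature_line` is an assumption.

```python
def feature_line(token, label=None):
    """Format one token as 'token SHAPE [label]', as in the lines above."""
    shape = "CAPITALIZED" if token[:1].isupper() else "LOWERCASE"
    parts = [token, shape]
    if label is not None:
        parts.append(label)  # labels are present in training data only
    return " ".join(parts)


training = [feature_line(token, label) for token, label in
            [("Bill", "noun"), ("slept", "non-noun"), ("here", "non-noun")]]
testing = [feature_line(token) for token in ["He", "visited", "London"]]

print("\n".join(training))
print("\n".join(testing))
```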
Bringing Linked Data to Text
Annotate plain text or HTML with RDF data.
I'm working at DFKI.
RDFa offers an HTML extension:
I'm working at <span about="dbpedia:DFKI" property="rdfs:label">DFKI</span>
Now let's generate RDFa automatically ...
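A minimal sketch of automatic RDFa generation, assuming a ready-made mapping from labels to resource names, could look as follows. The function `annotate` is hypothetical and is not Epiphany's implementation. It only reproduces the span markup shown above.

```python
import html
import re


def annotate(text, entities):
    """Wrap each known label in an RDFa span; `entities` maps label -> resource."""
    if not entities:
        return html.escape(text, quote=False)
    # Longest labels first, so longer names win over their substrings.
    labels = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(r"\b" + re.escape(label) + r"\b" for label in labels))
    parts, last = [], 0
    for match in pattern.finditer(text):
        label = match.group(0)
        parts.append(html.escape(text[last:match.start()], quote=False))
        parts.append('<span about="{}" property="rdfs:label">{}</span>'.format(
            html.escape(entities[label], quote=True),
            html.escape(label, quote=False)))
        last = match.end()
    parts.append(html.escape(text[last:], quote=False))
    return "".join(parts)


annotated = annotate("I'm working at DFKI.", {"DFKI": "dbpedia:DFKI"})
print(annotated)
```

Text outside the annotations is HTML-escaped, so the result can be embedded into a page as is.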
Do you remember?
[Diagram. Elements: Data, Text, Extraction Pipeline, Extraction Results, User-defined Filter, annotated text. Arrow labels: populate, enrich, annotate.]
RDF Epiphany
Epiphany takes the original webpage … and SCOOBIE initialized with an RDF Linked Data set … It extracts RDF information from text and annotates it as RDFa … Clicking on RDFa annotations opens further information from the Linked Data set.
RDF Epiphany
At a glance
● Epiphany is a free web service.
● Epiphany uses SCOOBIE.
● Epiphany can be initialized with any RDF Linked Data set.
● Epiphany generates an RDF document about a web page.
● Epiphany annotates RDF as RDFa in the web page.
http://projects.dfki.uni-kl.de/epiphany/
Summary
[Diagram. Elements: SQUIN; four D2R Server boxes; Customer DB, Employees DB, Project DB, DBpedia; Text, Extraction Pipeline, Extraction Results, User-defined Filter, annotated text. Arrow labels: populate, enrich, annotate.]
Outlook
[Diagram. Elements: SQUIN; four D2R Server boxes; Customer DB, Employees DB, Project DB, DBpedia; Extraction Pipeline, Extraction Results, User-defined Filter, annotated E-Mail. Arrow labels: populate, enrich, annotate.]
Thank you!