Knowledge Graph 101: from the perspective of engineers
A Brief Introduction to
Knowledge Graph
Google: Network of ‘things’
Improved search and subject indexing
The key for ‘Smart Data’
Things, not strings!
Web of Documents
About:
•United States
•Barack Obama
•Presidential Election (Past)
•Some relevance to currently held
•Democrats & Republicans
•Winner & Loser
•Chicago
•Etc.
About:
•Location, Event, Places, Persons,
Groups, Abstract concepts (winning,
losing)
Web of Documents
People can parse the web of documents and
extract information from it
The web of documents
Analogy: a global file system
Designed for: human consumption
Primary objects: documents
Links between: documents (or sub-parts of documents)
Semantics: implicit
The web of documents: Issues
The Web of Documents is primarily about data, but the connection is implicit
Integration & querying are hard, e.g. "Show me all the news stories by US Presidents coming from Chicago"
Semantic Web
•We need to help machines understand the web, so that machines can
help us understand things.
•If machines have access to data about things (i.e. knowledge),
they can do a better job when processing documents
Web of Data (Linked Data)
[Figure: datasets A, B, C, ..., each containing things; the things are connected within and across datasets by typed links]
Linked Data…
… is about creating a global database of linked things
… refers to a set of best practices for publishing and interlinking data on the Web
… is a method of publishing data [on the Web] so that it can be interlinked and become more useful
The Web of Linked Data
Analogy: a global database
Designed for: machines first, humans later
Primary objects: things (or descriptions of things)
Links between: things
Semantics: explicit
Semantic Web Standard Stack
Semantic Technologies : URIs
Like URLs, but not just for Web pages: for things (cars, people, places, organisations, coursework, etc.)
“A Uniform Resource Identifier (URI) provides a simple
and extensible means for identifying a resource.” -- RFC
3986
Many different schemes – http://, ftp://, mailto:
Examples: http://ecust.edu.cn/ontologies/foaf/whf/me.rdf
http://dbpedia.org/resource/China
HTTP
Data access mechanism between web
browsers (client) and servers
HTTP messages consist of requests from
clients to servers and responses from servers
to clients
HTTP request methods: GET,
POST, etc.
Semantic Technologies: RDF
Data format to describe things and their
interrelations
is based on triples
Subject, predicate, object
<The sky> <has the colour> <blue>
Web of Data: RDF, Tables, Microdata
[Figure: Linked Open Data cloud diagram, http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png]
30 billion SPO triples (RDF) and growing
Knowledge bases shown: YAGO, Cyc, TextRunner/ReVerb, WikiTaxonomy/WikiNet, SUMO, ConceptNet 5, BabelNet, ReadTheWeb
YAGO
• 10M entities in 350K classes
• 120M facts for 100 relations
• 100 languages
• 95% accuracy
DBpedia
• 4M entities in 250 classes
• 500M facts for 6000 properties
• live updates
Freebase
• 25M entities in 2000 topics
• 100M facts for 4000 properties
• powers Google knowledge graph
Ennio_Morricone type composer
Ennio_Morricone type GrammyAwardWinner
composer subclassOf musician
Ennio_Morricone bornIn Rome
Rome locatedIn Italy
Ennio_Morricone created Ecstasy_of_Gold
Ennio_Morricone wroteMusicFor The_Good,_the_Bad_,and_the_Ugly
Sergio_Leone directed The_Good,_the_Bad_,and_the_Ugly
Linked RDF Triples on the Web
Linked URIs across sources:
rdf.freebase.com/ns/en.rome
data.nytimes.com/51688803696189142301
geonames.org/3169070/roma (N 41° 54' 10'' E 12° 29' 2'')
dbpedia.org/resource/Rome
dbpedia.org/resource/Ennio_Morricone
imdb.com/name/nm0910607/
imdb.com/title/tt0361748/
yago/wordnet:Actor109765278
yago/wikicategory:ItalianComposer
yago/wordnet:Artist109812338
500 million links
Linked Open Data cloud stats
[Figures: triples distribution and links distribution; see http://lod-cloud.net/state/]
Embedding (RDF) Microdata in HTML Pages
May 2, 2011
Maestro Morricone will perform
on the stage of the Smetana Hall
to conduct the Czech National
Symphony Orchestra and Choir.
The concert will feature both
Classical compositions and
soundtracks such as
the Ecstasy of Gold.
Two concerts are on the programme, for
July 14th and 15th.
<html … May 2, 2011
<div typeof="event:music">
  <span id="Maestro_Morricone">
    Maestro Morricone
    <a rel="sameAs" resource="dbpedia/Ennio_Morricone"/>
  </span>…
  <span property="event:location">Smetana Hall</span>
  …
  <span property="rdf:type" resource="yago:performance">The concert</span> will feature
  …
  <span property="event:date" content="14-07-2011"></span>
  July 1
</div>
Supported by RDFa
and microformats
like schema.org
Web Data Commons
Use Case: Question Answering
This town is known as "Sin City" & its
downtown is "Glitter Gulch"
This American city has two airports
named after a war hero and a WW II battle
[Figure: question classification & decomposition on top of knowledge back-ends]
D. Ferrucci et al.: Building Watson. AI Magazine, Fall 2010.
IBM Journal of R&D 56(3/4), 2012: This is Watson.
Q: Sin City ?
movie, graphic novel, nickname for city, …
A: Vegas ? Strip ?
Vega (star), Suzanne Vega, Vincent Vega, Las Vegas, …
comic strip, striptease, Las Vegas Strip, …
Moon Shots in Anderson Cancer Center
Dynamic Semantic Publishing in BBC
Looking Inside the Data
Model and Query Language
of Knowledge Graph
RDF is the first layer of the
semantic web standards
Introduction to RDF
RDF stands for Resource Description Framework:
Resource: pages, images, videos, ...
everything that can have a URI
Description: attributes, features, and
relations of the resources
Framework: model, languages and
syntaxes for these descriptions
Introduction to RDF
RDF model
In RDF, knowledge always comes in threes.
RDF is a triple model i.e. every piece of
knowledge is broken down into
( subject , predicate , object )
Example of RDF
"doc.html has author Haofen and has theme Music" breaks down into two statements:
doc.html has author Haofen
doc.html has theme Music
Written as (subject, predicate, object) triples:
( doc.html, author, Haofen )
( doc.html, theme, Music )
A triple is the RDF atom.
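A minimal sketch of the triple model in Python, assuming plain strings stand in for URIs. The `Triple` type and the `objects` helper are illustrative names, not part of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str


# The two statements from the slide; plain strings stand in for URIs.
triples = [
    Triple("doc.html", "author", "Haofen"),
    Triple("doc.html", "theme", "Music"),
]


def objects(data, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [t.obj for t in data if t.subject == subject and t.predicate == predicate]


print(objects(triples, "doc.html", "author"))
```

Every fact is one tuple, so adding knowledge about doc.html means appending another triple rather than changing a record layout.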
RDF is also a graph model
to link the descriptions of resources
RDF triples can be seen as arcs
of a graph (vertex, edge, vertex)
(doc.html, author, Haofen)
(doc.html, theme, Music)
[Figure: doc.html --author--> Haofen; doc.html --theme--> Music]
In RDF, resources and properties are
identified by URIs
http://mydomain.org/mypath/myresource
(http://ex.org/rr/doc.html, http://ex.org/schema#author, http://ex.org/~haofen#me)
(http://ex.org/rr/doc.html, http://ex.org/schema#theme, Music)
In RDF, values of properties can also
be literals, i.e. strings of characters
(doc.html, author, Haofen)
(doc.html, theme, "Music")
(http://ex.org/rr/doc.html, http://ex.org/schema#author, http://ex.org/~haofen#me)
(http://ex.org/rr/doc.html, http://ex.org/schema#theme, "Music")
In RDF, literal values of properties
can also be typed with XML Schema datatypes
doc.html has one author Haofen
and has 192 pages
(http://ex.org/rr/doc.html, http://ex.org/schema#author, http://ex.org/~haofen#me)
(http://ex.org/rr/doc.html, http://ex.org/schema#nbPages, "192"^^xsd:integer)
RDF Blank Nodes
RDF allows blank nodes.
A resource may be anonymous
i.e. not identified by a URI, and noted _:xyz
E.g. there exists a report about Music
(_:x, rdf:type, http://ex.org/schema#Report)
(_:x, http://ex.org/schema#theme, "Music")
RDF is Data Model, Not
Serialisation Format
RDF Serialisation Formats : RDF/XML, Turtle, N-Triples
– RDF/XML
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:ID="me">
    <foaf:name>Haofen Wang</foaf:name>
    <foaf:title>Dr</foaf:title>
    <foaf:based_near rdf:resource="http://dbpedia.org/resource/Leeds"/>
  </foaf:Person>
</rdf:RDF>
– Turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dt: <http://ecust.edu.cn/ontologies/foaf/whf/me.rdf#> .
dt:me
    rdf:type foaf:Person ;
    foaf:name "Haofen Wang" ;
    foaf:title "Dr" .
– N-Triples
<http://ecust.edu.cn/ontologies/foaf/whf/me.rdf#me> <http://xmlns.com/foaf/0.1/name> "Haofen Wang" .
<http://ecust.edu.cn/ontologies/foaf/whf/me.rdf#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
Open-world assumption
As opposed to the closed-world assumption of classical systems
In short: the absence of a triple is not significant
(doc.html, author, Haofen)
doesn't mean doc.html has exactly one author;
it means doc.html has at least one author
RDF – Distribute by Cells!
Needs to reference both schema
and entities
Most flexible: can distribute data
in any way at all!
Each table cell is stored on its own, keyed by entity and column:
SP1 | Family | Orchidaceae
SP1 | Duration | Perennial
SP1 | Status | Endangered
Uniform Resource Identifier (URI) as common reference
<http://www.usda.gov/classification/plants/species.owl#Orchidaceae>
<http://www.usda.gov/classification/plants/taxaonomy.owl#Family>
Distribute by cells!? Each cell becomes a triple of URIs (subject, predicate, object):
<SP1> <Family> <Orchidaceae>
Resource Description Framework (RDF)
3 Triples with Same Subject
<SP1> <Family> <Orchidaceae>
<SP1> <Duration> <Perennial>
<SP1> <Status> <Endangered>
Integrate Automatically
Triples that share the subject <SP1> merge into a single node, wherever they come from
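The automatic integration above can be sketched in a few lines of Python. This is only an illustration: the split of the three cells across two sources is an assumption, and `integrate` is a hypothetical helper, not a triple-store API.

```python
from collections import defaultdict

# Two hypothetical sources that describe the same entity <SP1>.
source_a = [("<SP1>", "<Family>", "<Orchidaceae>")]
source_b = [
    ("<SP1>", "<Duration>", "<Perennial>"),
    ("<SP1>", "<Status>", "<Endangered>"),
]


def integrate(*sources):
    """Group triples from all sources by subject, then by predicate."""
    merged = defaultdict(lambda: defaultdict(set))
    for source in sources:
        for subject, predicate, obj in source:
            merged[subject][predicate].add(obj)
    # Convert to plain dicts so the result is easy to print and compare.
    return {s: dict(props) for s, props in merged.items()}


print(integrate(source_a, source_b))
```

No schema mapping is needed here because both sources use the same identifier for the subject; in practice that shared URI is what makes the merge possible.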
SPARQL
SPARQL Protocol and RDF Query Language
Query language for RDF, based on the RDF data model
Possible to write complex joins of disparate
datasets
Implemented by all major RDF databases
See more: http://www.w3.org/TR/rdf-sparql-query/
Structure of a SPARQL Query
SPARQL query
SELECT ...
FROM ...
WHERE { ... }
SELECT clause
to identify the values to
be returned
FROM clause
to identify the data
sources to query
WHERE clause
the triple/graph pattern to
be matched against the
triples/graphs of RDF
WHERE clause
a conjunction of triple patterns:
{ ?x rdf:type ex:Person .
  ?x ex:name ?name }
PREFIX
to declare the schema
used in the query
example persons and their names
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex: <http://ex.org/schema#>
SELECT ?person ?name
WHERE {
  ?person rdf:type ex:Person .
  ?person ex:name ?name .
}
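To make "a conjunction of triple patterns" concrete, here is a small Python sketch of how such a WHERE clause could be evaluated by matching each pattern against the data and joining on shared variables. It is a toy evaluator under simplifying assumptions (prefixed names as plain strings, no FILTER or OPTIONAL), not how a real SPARQL engine is implemented; `match_pattern`, `select` and the sample data are hypothetical.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def select(patterns, data):
    """Evaluate a conjunction of triple patterns; return all variable bindings."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in data
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


DATA = [
    ("ex:whf", "rdf:type", "ex:Person"),
    ("ex:whf", "ex:name", "haofen"),
    ("ex:doc", "rdf:type", "ex:Report"),
]
QUERY = [("?person", "rdf:type", "ex:Person"), ("?person", "ex:name", "?name")]

rows = select(QUERY, DATA)
print(rows)
```

The shared variable `?person` is what joins the two patterns: only subjects that satisfy both survive.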
example of result
<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#" >
<head>
<variable name="person"/>
<variable name="name"/>
</head>
<results ordered="false" distinct="false">
<result>
<binding name="person">
<uri>http://ex.org/schema#whf</uri>
</binding>
<binding name="name">
<literal>haofen</literal>
</binding>
</result>
<result> ...
FILTER
to add constraints to the
graph pattern (e.g., numerical like X>17 )
example persons at least 18 years old
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex: <http://ex.org/schema#>
SELECT ?person ?name
WHERE {
  ?person rdf:type ex:Person .
  ?person ex:name ?name .
  ?person ex:age ?age .
  FILTER (?age > 17)
}
FILTER can use many
operators, functions (e.g.,
regular expressions), and
even users' extensions
OPTIONAL
to make the matching of
a part of the pattern
optional
example retrieve the age if available
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex: <http://ex.org/schema#>
SELECT ?person ?name ?age
WHERE {
  ?person rdf:type ex:Person .
  ?person ex:name ?name .
  OPTIONAL { ?person ex:age ?age }
}
UNION
to give alternative
patterns in a query
example explicit or implicit adults
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex: <http://ex.org/schema#>
SELECT ?name
WHERE {
  ?person ex:name ?name .
  {
    { ?person rdf:type ex:Adult }
    UNION
    { ?person ex:age ?age
      FILTER (?age > 17) }
  }
}
Solution sequence modifiers
ORDER BY to sort
LIMIT maximum number of results
OFFSET rank of the first result
example results 21 to 40 ordered by name
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex: <http://ex.org/schema#>
SELECT ?person ?name
WHERE {
  ?person rdf:type ex:Person .
  ?person ex:name ?name .
}
ORDER BY ?name
LIMIT 20
OFFSET 20
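The effect of ORDER BY, LIMIT and OFFSET corresponds to sorting followed by a slice. A rough Python equivalent, assuming 100 made-up names (`page` is a hypothetical helper):

```python
def page(rows, limit, offset):
    """Sort the rows, skip `offset` of them, then keep at most `limit`."""
    ordered = sorted(rows)
    return ordered[offset:offset + limit]


# Made-up data: name001 ... name100.
names = [f"name{i:03d}" for i in range(1, 101)]

result = page(names, limit=20, offset=20)
print(result[0], result[-1])  # first and last row of "results 21 to 40"
```

OFFSET 20 skips the first 20 rows, which is why the page starts at result 21.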
Negation is tricky and errors can easily be
made.
Does this find persons who do not know "Java"?
PREFIX ex: <http://ex.org/schema#>
SELECT ?name
WHERE {
  ?person ex:name ?name .
  ?person ex:knows ?x
  FILTER ( ?x != "Java" )
}
NO! It also returns persons who know "Java", as long as they know something else too!
haofen ex:knows "Java"
haofen ex:knows "C++"
haofen is an answer...
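The pitfall can be reproduced outside SPARQL. The Python sketch below contrasts what the FILTER query computes with what was intended; the extra people `li` and `wang` are made-up data. In SPARQL 1.1 the intended meaning is usually written with FILTER NOT EXISTS or MINUS.

```python
# (person, skill) pairs standing in for ex:knows triples; li and wang are made up.
knows = [
    ("haofen", "Java"),
    ("haofen", "C++"),
    ("li", "C++"),
    ("wang", "Java"),
]

# What FILTER (?x != "Java") computes: persons with at least one skill that is not Java.
filter_style = sorted({person for person, skill in knows if skill != "Java"})

# What was intended: persons for whom no "Java" triple exists at all.
java_people = {person for person, skill in knows if skill == "Java"}
not_exists_style = sorted({person for person, _ in knows} - java_people)

print(filter_style)      # includes haofen, who does know Java
print(not_exists_style)  # only persons with no Java triple
```

The filter is applied per matched triple, not per person, which is why haofen slips through on the "C++" row.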
ASK
to check just if there is at
least one answer ; result
is "true" or "false"
example is there a person older than 17 ?
PREFIX ex: <http://ex.org/schema#>
ASK
{
?person ex:age ?age
FILTER (?age > 17)
}
SPARQL protocol
sending queries and their
results across the web
example with HTTP binding
GET /sparql/?query=<encoded query> HTTP/1.1
Host: www.ecust.edu.cn
User-agent: my-sparql-client/0.1
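A sketch of how a client might build the request line above in Python, using only the standard library. The endpoint URL reuses the host and path from the slide and is not known to be a live service, so the code only constructs the URL and does not send it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Host and path taken from the slide; assumed for illustration only.
ENDPOINT = "http://www.ecust.edu.cn/sparql/"
QUERY = "ASK { ?person <http://ex.org/schema#age> ?age FILTER (?age > 17) }"


def build_request_url(endpoint, query):
    """Percent-encode the query and pass it in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Encoding matters because SPARQL queries contain spaces, braces, `?` and `#`, none of which can appear raw in a query string.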
# prefix declarations
PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# result clause
SELECT *
# dataset definition
FROM <http://dbpedia.org>
# query pattern
WHERE {
  ?s rdf:type dbp-ont:Person .
  ?s rdf:type dbp-ont:Astronaut .
  ?s dbp-ont:status "Retired"@en .
  ?s dbp-ont:birthDate ?date
}
ORDER BY ?date
LIMIT 10
SELECT query: find 10 of these, ordered by date (ORDER BY)
Someone who is a Person & Astronaut & Retired.
Note: ORDER BY ?date sorts ascending, so the earliest-born come first;
ORDER BY DESC(?date) would return the youngest first.
Comparison with RDB
One-to-Many Relational Model
Equivalent Semantic Model - Easy
<triple 32: "person2" "type" "person">
<triple 33: "person2" "first-name" "Rose">
<triple 34: "person2" "middle-initial" "Elizabeth">
<triple 35: "person2" "last-name" "Fitzgerald">
<triple 36: "person2" "suffix" "none">
<triple 37: "person2" "alma-mater" "Sacred-Heart-Convent">
<triple 38: "person2" "birth-year" "1890">
<triple 39: "person2" "death-year" "1995">
<triple 40: "person2" "sex" "female">
<triple 41: "person2" "spouse" "person1">
<triple 58: "person2" "has-child" "person17">
<triple 56: "person2" "has-child" "person15">
<triple 54: "person2" "has-child" "person13">
<triple 52: "person2" "has-child" "person11">
<triple 50: "person2" "has-child" "person9">
<triple 48: "person2" "has-child" "person7">
<triple 46: "person2" "has-child" "person6">
<triple 44: "person2" "has-child" "person4">
<triple 42: "person2" "has-child" "person3">
<triple 60: "person2" "profession" "home-maker">
Semantic Model – Explicit Relationships
Relationship Model (Information Given)
ZJU located_in Hangzhou
Hangzhou located_in China
located_in type transitiveProperty
Information Inferred
ZJU located_in China
Question: In which country is ZJU located?
Answer: In China
Where are the relationships?
Relationships are explicit in the model and directly
available to applications!
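The inference step can be sketched as a transitive closure in Python. This is a naive fixed-point loop for illustration; real reasoners use more efficient strategies, and `infer_transitive` is a hypothetical name.

```python
def infer_transitive(facts):
    """Return the facts plus everything implied by transitivity."""
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # (a located_in b) and (b located_in d) imply (a located_in d).
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


located_in = {("ZJU", "Hangzhou"), ("Hangzhou", "China")}
inferred = infer_transitive(located_in) - located_in
print(inferred)
```

Adding an intermediate level later, for example Hangzhou located_in Zhejiang and Zhejiang located_in China, needs no change to this code: the same loop still derives ZJU located_in China.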
Relational Model – Implicit Relationships
Company Table
ID | Company Name
IDC | ZJU
City Table
City | Country
Hangzhou | China
Company_City Table
City_ID | CO_ID
Hangzhou | IDC
Question: In which country is ZJU located?
Answer: In China
Develop a Query
Select Country
From Company Table, City Table, Company_City Table
Where Company Name = "ZJU" and ID = CO_ID and City = City_ID
Where are the relationships?
Relationships are in documents, SQL
code and collective memories; they are not
available to applications!
Data Definition Statements? Applications do not use them, they are not descriptive, and their scope
is a single database
Data Dictionary? Data Registry? They are for
human, not computer, use
When Changes Needed w/ Semantic Model
Changes are Easy to Make
Relationship Model (Information Given)
ZJU located_in Hangzhou
Hangzhou located_in Zhejiang (new data)
Zhejiang located_in China (new data)
located_in type transitiveProperty
Information Inferred
Hangzhou located_in China
ZJU located_in China
Question: In which country is ZJU located?
Answer: In China
When Changes Needed w/ Relational Model
Before the change:
Company Table
ID | Company Name
IDC | ZJU
City Table
City | Country
Hangzhou | China
Company_City Table
City_ID | CO_ID
Hangzhou | IDC
After the change (a State Table and a City_State Table replace the City Table):
Company Table
ID | Company Name
IDC | ZJU
State Table
State Name | ID | Country
Zhejiang | ZJ | China
Company_City Table
City_ID | CO_ID
Hangzhou | IDC
City_State Table
City_ID | State_ID
Hangzhou | ZJ
Question: In which country is ZJU located?
Using the same query:
Select Country
From Company Table, City Table, Company_City Table
Where Company Name = "ZJU" and ID = CO_ID and City = City_ID
Get No Answer!? The query doesn't work any more!
Changes should be avoided at ALL costs
“Smart” Data vs. Dumb Data
Depends on where the “smart” is
Today: Dumb Data (e.g., RDB) + Smart Application Code (SQL code)
Tomorrow: Smart Data (RDF/OWL ontology) + Uniform Inference Engine
Triple Database as Data Warehouse
one-to-many relations are directly encoded
without the indirection of tables
Add new predicates (attributes) or class
hierarchy without changing any schema
Never think about what to index because all
the predicates are indexed
Ideal as data repository (warehouse) for
heterogeneous data sources
It’s a large-scale graph database
Ad hoc query is easy without schema
Successful Cases of KG
in Enterprise
Biodiversity Repository
Challenges for Biodiversity Repository
Very diverse subjects, even just for flora
Sample Fauna Species Data on RDB Table (credited to Universiti Malaysia Sabah)
One very wide table, shown here in eight column groups. Empty cells were lost in extraction, so sample values are listed per row without column alignment.
Group 1
Field labels: Field rule code | Record ID | Version | Version status | Record status | Name for list view | Primary level | Secondary level | Tertiary level | Borneensis no. | Suffix No. | Web release
Column names: attrul | id | version | verstat | recstat | brief | maxcls | subcls | mincls | entryno | subno | webflg
Sample values:
MA 25 1 1 1 Echinosorex gymnurus Mammals BOR-00000-03065 Yes
MA 27 1 1 1 Hylomys suillus Mammals BOR-00000-03067 Yes
MA 29 1 1 1 Suncus murinus Mammals BOR-00000-03069 No
MA 33 1 1 1 Tupaia glis Mammals BOR-00000-03073 No
Group 2
Field labels: Registration no. | Old Borneensis no | Registration date | Collection date | Collector's name | Country | State | District | Village or nearest village | Specific locality
Column names: Regno | OldRegno | Regdate | collectiondate | Collector | country | State | District | Village | locality
Sample values:
MA0000005 9/15/2004 Henry Benard Malaysia Sabah Lahad Datu Tabin Forest Reserve, Lahad Datu
MA0000007 9/15/2004 S.Yasuma Malaysia
MA0000009 9/15/2004 S.Yasuma Malaysia
MA0000013 9/15/2004 21/5/1999 Arifin Ag. Ali Malaysia Sabah Tawau Lembangan Maliau Basin
Group 3
Field labels: Latitude | Longitude | Altitude(Sign) | Altitude | Habitat type | Substrate | Ecological data | Method of capture/collection | Specimen preparation | Specimen part | Sex | Total Length
Column names: Latitude | Longitude | Altitude-kbn | Altitude | Habita | Substrate | Ecological | capture | preparation | Specimenpart | Sex | Total Length
Sample values:
Female 625 mm
Male
Group 4
Field labels: Tail length | Weight | Head-body length | Hind foot length | Forearm length | Ear | Other measurement | Identification date | Identifier | Identification note | Phylum | Phylum(ID)
Column names: meamethod | meavalue | HB length | hindfoot | forearm | Ear | Othemeasure | Identdate | Identifier | Identnote | phylum | phylum-id
Sample values:
225 mm 400 mm 65 mm 30 mm CHORDATA
207.0 mm 180.0 g 208.0 mm 46.0 mm 16.0 mm CHORDATA
Group 5
Field labels: Subphylum | Subphylum(ID) | Superclass | Superclass(ID) | Class | Class(ID) | Subclass | Subclass(ID) | Superorder | Superorder(ID) | Order | Order(ID) | Suborder | Suborder(ID)
Column names: subphylum | subphylum-id | superclass | superclass-id | Class | Class-id | subclass | subclass-id | superorder | superorder-id | order | order-id | suborder | suborder-id
Sample values:
INSECTIVORA
Insectivora
VERTERBRATA MAMMALIA Insectivora
VERTERBRATA MAMMALIA Scandentia
Group 6
Field labels: Superfamily | Superfamily(ID) | Family | Family(ID) | Subfamily | Subfamily(ID) | Genus | Genus(ID) | Species | Species(ID) | Subspecies | Author | Common name (English)
Column names: superfamily | superfamily-id | family | family-id | subfamily | subfamily-id | genus | genus-id | species | species-id | subspecies | author | English
Sample values:
ERINACEIDAE Hylominae ID:MA00000385 Echinosorex gymnurus
Erinaceidae Hylomys suillus Lesser Gymnure
Soricidae ID:MA00000428 Suncus murinus House Shrew
Tupaiidae Tupaiinea Tupaia glis ID:MA00000483 Common Treeshrew
Group 7
Field labels: Common name (local language) | Type status | Conservation Status | Distribution | Preservation method | Jar no. | Room no. | Compactor no. | Bay no. | Shelves no. | Container/Box/Jar no.
Column names: locallang | Typestatus | consst | distribution | Preservation method | Jarno | roomno | compactor | bayno | shelvesno | Containerno
Sample values:
Wet room Wet(Eg-01)
Tikus babi Dry room Dry(Hs-01)
Cencurut Rumah Wet Room Wet(Sm-01)
Tupai Moncong Besar Dry specimen Dry Room Dry(Tg-01)
Group 8
Field labels: Loaned ID | Loaned to (Name & address) | E-mail | Telephone | Fax | Country (Borrower) | Date loaned | Due date | Date returned | Remarks | Multimedia link | Release flag | Release level | Regn status
Column names: loanedID | loanedto | loanmail | phone | fax | countryloan | Loaned | duedate | Returned | remarks | medialink | opnflg | opnlvl | matst
Sample values:
Malaysia 10000 20
Malaysia 10000 20
Malaysia 10000 20
Malaysia 10000 20
Horrendous table schema
More than 70% of table cells contain null values
Need to call in experts to update schema
A Sample Fauna Species Data (cont’d)
Challenges for Biodiversity
Repository
Many islands of biodiversity information
Some estimate that only 10% of the information is known
(collected)
We don’t even know what else is to come
A mammoth data integration problem,
let alone integrated understanding &
knowledge discovery
Try to design a schema for collection
data tables and data warehouses !!
Perfect Application of
Semantic Database
Life Science Knowledge Base
Challenges for Life Science – Diversity
Very diverse subjects
How to relate all the information cohesively?
Challenges for Life Science – Taxonomy
Different disciplines use different taxonomies even
for the same thing
Physiologist, Geneticist, Pharmacologist, Biochemist, Virologist
Designed for humans (90%+), not for computers
Challenges for Life Science – Knowledge Representation
RDF Class Hierarchy Maps
Taxonomy
NCI ontology – a comprehensive biomedical
taxonomy, containing 1,200,000 concepts
mapped to 2,900,000 terms with 5,000,000
relationships, e.g., Medicine
Medical_Specialties Radiology
Radiology_Therapeutic
Radiology_Bone
Radiology_Dental
Pediatric_Radiology
Nuclear_Medicine Medical_Radiation_Physics
Diagnostic_Radiology_Ionizing_and_Nonionizing_
Radiology_Thorax_Chest
Radiology_Soft_Tissue
Radiology_Head_Neck
Interventional_Radiology
Looking for Alzheimer
Disease Targets
Signal transduction pathways
are considered to be rich in
“druggable” targets - proteins
that might respond to chemical
therapy
CA1 Pyramidal Neurons are
known to be particularly
damaged in Alzheimer’s disease.
Can we find candidate genes
known to be involved in signal
transduction and active in
Pyramidal Neurons?
A SPARQL Query Spanning 4 Sources
SPARQL makes ad hoc queries over
multiple data sources (in RDF) easy
Ad hoc Tracking & Capturing of
Component Properties & Processes
NASA Space Shuttle Launch
Maintenance
Encode the complete maintenance rules &
process (millions of them) of all components
(inter-dependent) in a knowledgebase
Provide process guidance, monitoring,
validation, QA and QC for space shuttle
launch maintenance
Statoil Exploration
Siemens Energy Service
A General Pipeline to Publish
and Explore Knowledge Graph
Architecture scenarios
Motivation: Music!
[Figure: pipeline architecture with four layers.
Data acquisition: Metadata, Streaming providers, Downloads, Musical Content, Other content; accessed through Physical Wrapper, D2R Transf., LD Wrapper, RDFa and RDF/XML.
Integrated Dataset: Vocabulary Mapping, Interlinking, Cleansing.
LD Dataset Access: SPARQL Endpoint, Publishing.
Application: Analysis & Mining Module, Visualization Module.]
Large KBs You Need to Know
DBpedia
DBpedia is a crowd-sourced community effort
to extract structured information
from Wikipedia and make this information
available on the Web. DBpedia allows
you to ask sophisticated queries against
Wikipedia, and to link the different data sets
on the Web to Wikipedia data.
http://dbpedia.org/
DBpedia
The DBpedia Ontology is a
shallow, cross-domain
ontology, which has been
manually created based
on the most commonly used
infoboxes within Wikipedia.
The ontology currently covers
685 classes which form
a subsumption hierarchy
and are described by 2,795
different properties. http://dbpedia.org/
DBpedia
The DBpedia data set uses a large multi-
domain ontology which has been derived from
Wikipedia. The English version of the DBpedia
2014 data set currently describes 4.58 million
“things” with 583 million “facts”.
http://dbpedia.org/
YAGO
YAGO (Yet Another Great Ontology) is
a knowledge base developed at the Max
Planck Institute for Computer
Science in Saarbrücken. It is automatically
extracted from Wikipedia and other sources.
YAGO
YAGO2s (stable release) is a huge semantic
knowledge base, derived
from Wikipedia, WordNet and GeoNames.
Currently, YAGO2s has knowledge of more
than 10 million entities (like persons,
organizations, cities, etc.) and contains more
than 120 million facts about these entities.
http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/
YAGO Demo
https://gate.d5.mpi-inf.mpg.de/webyagospotlx/Browser
https://gate.d5.mpi-inf.mpg.de/webyagospotlx/WebInterface
Freebase
A community-curated database of well-known
people, places, and things.
It is an online collection of structured
data harvested from many sources, including
individual, user-submitted wiki contributions.
http://www.freebase.com/
Freebase
NELL
NELL (Never-Ending Language Learner) can
extract facts from text found in hundreds of
millions of web pages and improve its reading
competence, so that tomorrow it can extract
more facts from the web, more accurately.
http://rtw.ml.cmu.edu/rtw/
NELL
NELL has accumulated over 50 million candidate
beliefs by reading the web, and it is considering
these at different levels of confidence. NELL has
high confidence in 2,180,254 of these beliefs.
Entity Linking
Public Toolkits and Web Services for
Entity Linking
Wikipedia Miner
TagMe
DBpedia Spotlight
Illinois Wikifier
AIDA
(OpenCalais)
Wikipedia Miner [Milne & Witten 2008b]
Open source
(Public) web service
– Java
– Hadoop preprocessing pipeline
Lexical matching + machine learning
Target KB: Wikipedia
See http://wikipedia-miner.cms.waikato.ac.nz
TagMe [Ferragina & Scaiella 2010]
Web service only (demo + API)
Approach similar to Wikipedia Miner
– Voting for disambiguation
– based on all possible bindings
heuristics to select best target
Designed for short texts
Target KB: Wikipedia
See http://tagme.di.unipi.it/
Illinois Wikifier [Ratinov et al. 2011]
Local install + online demo
– uses Illinois NER system
Disambiguation as weighted sum of features
– Textual similarity
– Global coherence based on link structure
Target KB: Wikipedia
See http://cogcomp.cs.illinois.edu/page/software_view/33
Demo:
http://cogcomp.cs.illinois.edu/demo/wikify/?id=25
DBpedia Spotlight [Mendes et al., 2011]
Open source
Public web service
Disambiguation in local context
– vector-space model using bag-of-words and cosine
similarity
– (actually, Lucene)
Target KB: DBpedia
See http://spotlight.dbpedia.org
Demo: http://dbpedia-spotlight.github.io/demo/
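The bag-of-words and cosine-similarity idea can be sketched in plain Python. This is only an illustration of the technique named on the slide, not DBpedia Spotlight's actual code (which, per the slide, relies on Lucene); the candidate descriptions and the `cosine` helper are made up.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[word] * vb[word] for word in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Made-up context descriptions for two candidate entities of the mention "Vegas".
candidates = {
    "Las_Vegas": "city in nevada known for casinos and the strip",
    "Vega_(star)": "bright star in the constellation lyra",
}
context = "the casinos on the strip in this nevada city"

best = max(candidates, key=lambda name: cosine(context, candidates[name]))
print(best)
```

The mention is linked to whichever candidate's description shares the most vocabulary with the surrounding text.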
AIDA [Yosef et al. 2011]
Open source
– uses Stanford NER system
(Public) web service, API
Links to YAGO2
Disambiguation in 3 variants
– PriorOnly: link to most common target
– Local: disambiguate individual links with local features
– CocktailParty: collective disambiguation maximizing
coherence using iterative graph-based approach
Target KB: YAGO2
See http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/
Demo: https://gate.d5.mpi-inf.mpg.de/webaida/
OpenCalais
Only on public content
– does not keep a copy of content
– keeps a copy of the metadata it extracts
Free for up to 50,000 documents per day
Early adopters:
– CBS Interactive / CNET, Huffington Post, Al Jazeera,
The White House
– more than 30,000 developers and 50 publishers
Target KB: Calais
See http://www.opencalais.com/
Demo: http://viewer.opencalais.com/
Knowledge Acquisition
from Unstructured Texts
OpenIE/TextRunner
Learns syntactic patterns to extract instances of any relation
from text in any domain
Completely unsupervised, no need for seeds
• Input: corpus C
• Output: a set of extracted relations
• Method: a parser phase on a portion of C, then pattern generation
from the parsed documents; each extraction is a tuple t = <e1, r, e2>
ReVerb
Automatically identifies and extracts binary
relationships from English sentences.
Designed for Web-scale information extraction
Considers all verbal phrases as potential relations
and all noun phrases as arguments
Target relations do not need to be specified in advance
Input: raw text
Output: (argument1, relation phrase, argument2) triples
For example:
• Input: Bananas are an excellent source of potassium.
• Output: (bananas, be source of, potassium)
https://github.com/knowitall/reverb
Ollie
Automatically identifies and extracts binary relationships
from English sentences. Designed for Web-scale
information extraction, where target relations are not
specified in advance.
Ollie also captures context that modifies a binary relation.
Presently Ollie handles attribution (He said/she believes)
and enabling conditions (if X then).
https://github.com/knowitall/ollie
Enabling condition:
Sentence: If I slept past noon, I'd be late for work.
Extraction: (I, 'd be late for, work) [enabler=If I slept past noon]
Attribution:
Sentence: Some people say Barack Obama was not born in the United States.
Extraction: (Barack Obama, was not born in, the United States) [attrib=Some people say]
Relational noun:
Some relations are expressed without verbs. Ollie can
capture these as well as verb-mediated relations.
Sentence: Microsoft co-founder Bill Gates spoke at a conference on Monday.
Extraction: (Bill Gates, be co-founder of, Microsoft)
N-ary extractions:
Sentence: I learned that the 2012 Sasquatch music festival is scheduled for May 25th until May 28th.
Extraction: (the 2012 Sasquatch music festival, is scheduled for, May 25th)
Extraction: (the 2012 Sasquatch music festival, is scheduled until, May 28th)
N-ary: (the 2012 Sasquatch music festival, is scheduled, [for May 25th, to May 28th])
SRLIE
Automatically identifies n-ary extractions from English
sentences.
Designed for Web-scale information extraction, where
target relations are not specified in advance.
Builds extractions from Semantic Role Labelling.
https://github.com/knowitall/srlie
Chunked Extractors
https://github.com/knowitall/chunkedextractor
a collection of three extractors:
• ReVerb -- an extractor for verb-mediated relations
• Sally sells sea shells
• Relnoun -- an extractor for noun-mediated relations
• United States president Barack Obama
• Nesty -- an extractor for nested relations
• Some people say that we never landed on the moon
Comparison of open IE systems:
TextRunner: learns syntactic patterns (binary relationships)
ReVerb: considers verbal phrases as relations and noun phrases as arguments (binary relationships)
Ollie: extracts relations that are expressed without verbs, handles attribution (binary relationships)
SRLIE: extracts n-ary relations
SOFIE:
Extract ontological facts from natural language documents and
link the facts into an ontology.
Uses logical reasoning on the existing knowledge and on the
new knowledge in order to disambiguate words to their most
probable meaning
Unites pattern matching, word sense disambiguation and
ontological reasoning in one unified model
• Input :target relations and type signature for
involved entities
http://www.mpi-inf.mpg.de/yago-naga/sofie/
Extending a KB faces 3+ challenges (F. Suchanek et al.: WWW'09)
Existing KB:
type(Reagan, president)
spouse(Reagan, Davis)
spouse(Elvis, Priscilla)
New text: "Hermione is married to Ron"
Problem: If we want to extend a KB, we face (at least) 3 challenges
1. Understand which relations are expressed by patterns
"x is married to y" expresses spouse(x, y)
2. Disambiguate entities
"Hermione is married to Ron": "Ron" = RonaldReagan?
3. Resolve inconsistencies
spouse(Hermione, Reagan) & spouse(Reagan, Davis) ?
PROSPERA
N-gram item-set patterns generalize narrow
syntactic patterns to boost recall (different from
SOFIE)
Reasoning with a large KB (YAGO) constrains
extractions to boost precision
• Input: target relations and type signatures for
involved entities
http://www.mpi-inf.mpg.de/yago-naga/sofie/
Graph Database (with
Reasoning Supports)
Current Graph databases (selected)
Open source
– Bigdata
– Sesame
– Jena
– Neo4j
Commercial edition
– Virtuoso
– BigOwlim
– AllegroGraph
Bigdata
High-performance
Supporting the RDF data
model and RDR.
Embedded database or over a
client/server REST API.
High-availability and dynamic
sharding.
Blueprints and Sesame APIs.
High-level query with SPARQL
http://www.bigdata.com/
Sesame
A Java framework for processing RDF data.
An easy-to-use API that can be connected to RDF storage
solutions.
SPARQL endpoints
Two out-of-the-box RDF databases (the in-memory
store and the native store)
Supports all mainstream RDF file formats
http://rdf4j.org/
Jena
A free and open source
Java framework for
building Semantic
Web and Linked
Data applications
Originally developed by HP
Laboratories
In-memory or persistent
storage
http://jena.apache.org/
Neo4j
http://neo4j.com
A Graph database + Lucene index
Property Graph
Full ACID
(atomicity, consistency, isolation, durability)
High Availability (with Enterprise Edition)
32 billion nodes, 32 billion relationships,
64 billion properties
Embedded server
REST API
Neo4j
Good for
– Highly connected data
– Recommendations
– Path finding
– A*
– Data-first schema
http://neo4j.com
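To make "property graph" and "path finding" concrete, here is a minimal sketch in plain Python. It does not use the Neo4j API. The node and relationship data are invented for illustration, and the search is plain breadth-first search on an unweighted graph rather than A*.

```python
from collections import deque

# A property graph: nodes and relationships both carry key/value properties.
nodes = {
    1: {"label": "Person", "name": "Alice"},
    2: {"label": "Person", "name": "Bob"},
    3: {"label": "Person", "name": "Carol"},
    4: {"label": "Movie", "title": "Graphs"},
}
# (start node, relationship type, end node, properties)
relationships = [
    (1, "KNOWS", 2, {"since": 2010}),
    (2, "KNOWS", 3, {"since": 2012}),
    (3, "LIKES", 4, {"rating": 5}),
]


def neighbours(node_id):
    """Yield nodes reachable over one relationship, ignoring direction."""
    for start, _type, end, _props in relationships:
        if start == node_id:
            yield end
        elif end == node_id:
            yield start


def shortest_path(source, target):
    """Breadth-first search; returns the list of node ids, or None if unreachable."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for nxt in neighbours(current):
            if nxt not in previous:
                previous[nxt] = current
                queue.append(nxt)
    return None


path = shortest_path(1, 4)
print([nodes[n].get("name", nodes[n].get("title")) for n in path])
```

In a graph database the traversal follows stored relationships directly, which is why highly connected data and recommendation queries are listed as good fits.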
Virtuoso
Smart Data & Virtualization & Integration
Scalable & High-Performance Data Management
Web-scale identity & Security
Standards Compliance
http://virtuoso.openlinksw.com/
Virtuoso
Unique hybrid
server
architecture
http://virtuoso.openlinksw.com/
BigOwlim
The world’s leading RDF
Triplestore and graph database
The only triplestore that can perform
semantic inferencing at scale
Allowing users to create new
semantic facts from existing facts
Handling massive loads, queries
and inferencing in real time
http://www.ontotext.com/owlim
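"Creating new semantic facts from existing facts" can be illustrated with a tiny forward-chaining loop. This is a sketch of the general idea only, not how BigOwlim is implemented. It applies just two RDFS-style rules, and the ex: names are made up for the example.

```python
def infer(triples):
    """Apply two RDFS-style rules until no new triple appears (a fixpoint)."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            if p not in ("rdfs:subClassOf", "rdf:type"):
                continue
            for s2, p2, o2 in facts:
                # (A subClassOf B), (B subClassOf C) => (A subClassOf C)
                # (x type B),       (B subClassOf C) => (x type C)
                if s2 == o and p2 == "rdfs:subClassOf":
                    new.add((s, p, o2))
        if new <= facts:
            return facts
        facts |= new


explicit = {
    ("ex:Obama", "rdf:type", "ex:President"),
    ("ex:President", "rdfs:subClassOf", "ex:Politician"),
    ("ex:Politician", "rdfs:subClassOf", "ex:Person"),
}
for triple in sorted(infer(explicit) - explicit):
    print(triple)
```

The loop derives three facts that were never stated, including that ex:Obama is an ex:Person. A production reasoner does the same kind of closure, but incrementally and over billions of triples.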
AllegroGraph
http://www.franz.com/agraph/allegrograph
A modern, high-performance, persistent graph database
All clients are based on the REST protocol – Java Sesame, Java Jena, Python, etc.
AllegroGraph
AllegroGraph is designed for maximum loading speed
and query speed, and for high-performance storage
http://www.franz.com/agraph/allegrograph
Knowledge Integration
Falcon-AO
Ontology matching (classes, properties and instances)
LMO: Linguistic matching
– Lexical comparison (string similarity, SS): edit distance
– Statistical analysis (document similarity, DS): VSM over a virtual document for each entity, built from its labels, names and comments as well as those of its neighbors
– Linguistic similarity = 0.8 * DS + 0.2 * SS
GMO: Graph matching
– The similarity of two entities from two ontologies is accumulated from the similarities of the statements (triples) in which the two entities play the same role (subject, predicate, object)
– The similarity of two statements is accumulated from the similarities of the entities that play the same role in the two statements being compared
– Input: a set of matched entities. Output: additional matched entities
http://ws.nju.edu.cn/falcon-ao
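The LMO combination above can be sketched as follows. This is a simplified illustration, not Falcon-AO's code. The normalisation of edit distance to [0, 1] and the use of raw term counts (instead of a weighted vector space model) are assumptions, and the "virtual documents" are passed in as plain strings.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def string_similarity(a, b):
    """SS: edit distance normalised to [0, 1] (the normalisation is an assumption)."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def document_similarity(doc_a, doc_b):
    """DS: cosine similarity of raw term-count vectors (no TF-IDF weighting)."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def linguistic_similarity(name_a, doc_a, name_b, doc_b):
    """Combine the two scores as on the slide: 0.8 * DS + 0.2 * SS."""
    return (0.8 * document_similarity(doc_a, doc_b)
            + 0.2 * string_similarity(name_a, name_b))


score = linguistic_similarity(
    "Author", "author person who writes a book",
    "Writer", "writer person who writes a book or article",
)
print(round(score, 3))
```

The example shows why DS gets the larger weight: "Author" and "Writer" are far apart as strings, but their descriptions share most of their terms.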
BLOOMS: Ontology Alignment for Linked Open Data
Ontology alignment (classes)
Construction of the BLOOMS forest
Comparison of BLOOMS forests
– Given two forests TC, TD, for any Ts ∈ TC, Tt ∈ TD
– If Ts = Tt, then C owl:equivalentClass D
– If overlap(Ts,Tt) ≤ overlap(Tt,Ts), then
C rdfs:subClassOf D, else D rdfs:subClassOf C
http://wiki.knoesis.org/index.php/BLOOMS
PARIS: Probabilistic Alignment of Relations, Instances, and Schema
Ontology alignment (classes, relations, instances)
Probabilistic Model
http://webdam.inria.fr/paris/
PARIS
Functionality
PARIS
Equality of Instances
PARIS
Equality of Classes
– If all the instances of one class are instances of the other,
then the former is subsumed by the latter (it is a subclass)
Equality of Relations
– If every pair of one relation is a pair of another relation, then
the first is a sub-property of the second
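The two inclusion rules above can be turned into a graded score by measuring how much of one set is contained in the other. The sketch below is a crisp simplification for illustration, not the PARIS probabilistic model. It assumes instances have already been matched across the two ontologies, and all names are invented.

```python
def containment(sub, sup):
    """Fraction of `sub` that also appears in `sup`; 1.0 means full inclusion."""
    sub, sup = set(sub), set(sup)
    return len(sub & sup) / len(sub) if sub else 0.0


# Classes are modelled as sets of instances.
singers = {"Elvis", "Madonna"}
persons = {"Elvis", "Madonna", "Reagan"}

# Relations are modelled as sets of (subject, object) pairs.
married_to = {("Elvis", "Priscilla")}
knows = {("Elvis", "Priscilla"), ("Reagan", "Davis")}

print(containment(singers, persons))   # every singer is a person
print(containment(persons, singers))   # but not every person is a singer
print(containment(married_to, knows))  # married_to behaves like a sub-property of knows
```

A score of 1.0 in one direction only indicates subsumption. A score of 1.0 in both directions would indicate that the two classes (or relations) are equivalent.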
Silk: Discovering and Maintaining Links on the Web of Data
Discovering relationships between instances
Components:
– Link Discovery Engine
• Link Specification Language
• Computes links between data sources based on a
declarative specification of the conditions
– Generated Links Evaluation
• Fine-tune the linking specification
– A protocol for maintaining data links
• Allows data sources to exchange both linksets and detailed change
information, and enables continuous link recomputation.
http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/
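The idea of computing links from a declarative specification can be sketched as below. The SPEC dictionary is a made-up stand-in and is not Silk's Link Specification Language. The similarity measure (Python's difflib ratio), the threshold, and the sample URIs are also assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical declarative link specification: which properties to compare,
# how similar they must be, and which link type to generate.
SPEC = {
    "link_type": "owl:sameAs",
    "source_property": "label",
    "target_property": "name",
    "threshold": 0.9,
}

SOURCE = {
    "ex:a/Berlin": {"label": "Berlin"},
    "ex:a/Paris": {"label": "Paris"},
}
TARGET = {
    "ex:b/berlin": {"name": "berlin"},
    "ex:b/Rome": {"name": "Rome"},
}


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def discover_links(source, target, spec):
    """Compare every source/target pair and emit a link when the spec is met."""
    links = []
    for s_uri, s_props in source.items():
        for t_uri, t_props in target.items():
            score = similarity(s_props[spec["source_property"]],
                               t_props[spec["target_property"]])
            if score >= spec["threshold"]:
                links.append((s_uri, spec["link_type"], t_uri))
    return links


for link in discover_links(SOURCE, TARGET, SPEC):
    print(link)
```

Keeping the conditions in a specification rather than in code is what makes fine-tuning possible: the threshold or the compared properties can be changed and the links recomputed without touching the engine.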
Silk
It enables the user to manage different sets of data sources and
linking tasks.
It offers a graphical editor which enables the user to easily create
and edit link specifications.
It allows the user to quickly evaluate the links.
It allows the user to create and edit a set of reference links used
to evaluate the current link specification.
Comparison
System      Class   Property   Instance
Falcon-AO   √       √          √
BLOOMS      √
PARIS       √       √          √
Silk                           √
Knowledge Exploration
Gruff http://franz.com/agraph/gruff/
Interactive Relational Data Navigation
http://www.sindicetech.com/pivotbrowser.html
Exhibit – SIMILE Widgets
http://www.simile-widgets.org/exhibit/
Open Source One-stop
Solution
Linked Media Framework and Marmotta
LMF is built on top of three Apache projects:
Apache Marmotta provides the Linked Data Platform
capabilities
Apache Stanbol is the extraction and enhancement
framework used
Apache Solr provides indexing capabilities
The glue that LMF implements combines the best
of these three projects to provide advanced linked
media capabilities, such as semantic search or
semantic enrichment.
An Enterprise Knowledge Graph
[Diagram. Sources: relational DB (tables), external domain data, unstructured/semi-structured content, customer data. Processing: align; enrichment and encoding via domain ontology. Knowledge graph: data graphs, references, key concepts, relations. Applications: Search++, recommendations, vertical applications, explorative interfaces.]
Publishing Legacy Data as Linked Data
Google Refine (RDF Extension)
Apache Stanbol
References
Fabien Gandon. RDF in a nutshell.
Fabien Gandon. SPARQL in a nutshell.
Fabien Gandon. WWW 2014 tutorial on
Semantic Web.
We adapted the above slides to introduce RDF
and SPARQL.
Thank you!
Any questions?