Afraz Jaffri, Hugh Glaser, Ian Millard Electronics and Computer Science University of Southampton


Page 1

Afraz Jaffri, Hugh Glaser, Ian Millard
Electronics and Computer Science

University of Southampton

Page 2

SSWS07 - Vilamoura, Portugal

URI Identity Management for Semantic Web Data Integration and Linkage

1. Linked Data
2. URI Multiplicity
3. The Problem of Coreference
4. URI Identity Management Approaches
5. The Problem with owl:sameAs
6. The Consistent Reference Service (CRS)
7. CRS Architecture
8. A CRS Application: The RKB Explorer
9. Summary and Future Work

Page 3

• DBpedia has URIs for approximately 2 million entities
• Linked datasets contain many overlapping entities
• A single entity can have a number of URIs
• Entities are linked using owl:sameAs

Example

<http://dbpedia.org/resource/Berlin> <owl:sameAs> <http://sws.geonames.org/2950159>

Page 4

http://www.rkbexplorer.com

• Contains URIs for more than 10 million entities
• Data relating to people, projects, papers and institutions
• A single entity has a number of URIs (even within the same repository)
• Entities are linked using CRSs

DBLP

Page 5

URIs for ‘Spain’:
http://dbpedia.org/resource/Spain
http://www4.wiwiss.fu-berlin.de/factbook/resource/Spain
http://sws.geonames.org/2510769
http://www4.wiwiss.fu-berlin.de/eurostat/resource/countries/Espa%C3%B1a

URIs for ‘Hugh Glaser’:
http://acm.rkbexplorer.com/rdf/resource-P112732
http://citeseer.rkbexplorer.com/rdf/resource-CSP109020
http://citeseer.rkbexplorer.com/rdf/resource-CSP109013
http://citeseer.rkbexplorer.com/rdf/resource-CSP109011
http://citeseer.rkbexplorer.com/rdf/resource-CSP109002
http://dblp.rkbexplorer.com/rdf/resource-27de9959
http://europa.eu/People/#person-0ff816fa
http://resist.ecs.soton.ac.uk/wiki/User:hugh_glaser
http://www.ecs.soton.ac.uk/info/#person-00021

Page 6

Tom Anderson – http://www4.wiwiss.fu-berlin.de/dblp/resource/person/109074

is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/dac/MorettiHNCKABDF01>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/ftcs/SaeedLA91>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/ftrtft/LemosSA92>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/hybrid/AndersonLFS92>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/iccbss/AndersonFRR03>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/iciap/TruccoARI05>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/icnp/ElySWSA01>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/ifip/AndersonRR04>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/sc/BorchersASW95>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/seaai/AndersonH98>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/srds/Anderson86>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/words/AndersonFRR05>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/bell/LiuBFSRA04>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/cj/LemosSA92>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/dt/Anderson01>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/dt/Anderson03>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/dt/ZorianASTI96>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/software/LemosSA95>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/ton/SavageWKA01>
is dc:creator of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/journals/tse/AndersonBHM85>
is dblp:editor of <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/sigcomm/2006>

Vice President O-in Design Automation Inc. USA
Professor, University of Newcastle
Professor, Heriot Watt University
University of Washington
University of California, Berkeley
Tom Andersen - University of Denmark
Lucent Technologies, Illinois

Page 7

• The problem of coreference has existed for many years

• Physical Libraries disambiguate authors through Date of Birth

• Digital Libraries still have the problem of author disambiguation

• Problems caused by variations in naming schemes, e.g. ‘Glaser, H.’, ‘H. Glaser’, ‘Glaser, Hugh’, ‘H. Glazer’
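The naming variations can be made concrete with a small sketch. The function name `name_key` and the normalisation rule (surname plus first initial) are illustrative assumptions, not a method taken from the slides. The sketch shows that naive normalisation unifies three of the four variants but cannot recover the misspelt ‘Glazer’.

```python
import re


def name_key(name: str) -> tuple[str, str]:
    """Reduce an author name to (surname, first initial), lower-cased.

    Handles only 'Surname, Given' and 'Given Surname' forms; this is a
    deliberately naive rule used for illustration.
    """
    name = name.strip()
    if "," in name:
        # 'Glaser, H.' style: surname comes first
        surname, given = (part.strip() for part in name.split(",", 1))
    else:
        # 'H. Glaser' style: surname is the last token
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    # Keep only the first letter of the given name(s)
    initial = re.sub(r"[^a-z]", "", given.lower())[:1]
    return surname.lower(), initial


variants = ["Glaser, H.", "H. Glaser", "Glaser, Hugh", "H. Glazer"]
keys = {variant: name_key(variant) for variant in variants}
```

Spelling errors of this kind need approximate string matching or extra evidence (co-authors, affiliations), which is part of why author disambiguation remains hard.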

Page 8

• Coreference Problem referred to as ‘Record Linkage’

• Matching entities between records similar to matching entities between datasets

• Database linkage is easier due to imposed schema

• Formal theory of Record Linkage proposed by Fellegi & Sunter (1969)

• Uses coded agreements between each field (property) to give the probability of record (instance) equivalence

• Can be adapted for use on the Semantic Web
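As a rough illustration of the Fellegi & Sunter idea, the sketch below scores a pair of records by summing a log-likelihood weight per compared field. The field names and the m/u probabilities are hypothetical values chosen for the example; in practice they would be estimated from data, and the total would be compared against upper and lower thresholds to decide between link, possible link and non-link.

```python
import math


def field_weight(agrees: bool, m: float, u: float) -> float:
    """Log2 likelihood-ratio weight for one compared field.

    m = P(field agrees | records refer to the same entity)
    u = P(field agrees | records refer to different entities)
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_score(agreement: dict[str, bool],
                params: dict[str, tuple[float, float]]) -> float:
    """Sum the per-field weights for one pair of records."""
    return sum(field_weight(agreement[f], *params[f]) for f in agreement)


# Hypothetical (m, u) values per property; not taken from the slides.
PARAMS = {
    "name": (0.95, 0.01),
    "email": (0.90, 0.001),
    "affiliation": (0.80, 0.10),
}

all_agree = match_score(
    {"name": True, "email": True, "affiliation": True}, PARAMS)
name_only = match_score(
    {"name": True, "email": False, "affiliation": False}, PARAMS)
```

On the Semantic Web the fields would be RDF properties of the two URIs being compared, with the added difficulty that different datasets need not share a schema.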

Page 9

• Coreference on the Semantic Web is defined as the situation where two or more URIs are used for a single non-information resource

• URI usage can change with context

• Non-Information resources are hard to define precisely

Examples

‘Hugh Glaser’ at Southampton vs. ‘Hugh Glaser’ at Imperial

‘Harry Potter and the Order of the Phoenix’ in Hardback vs. Softback ISBN: 978-0747561071, 978-0747551003

Page 10

• Use a centralised naming authority to issue URIs for every entity in the world

• Let everyone create their own URIs and link them to ‘official’ URIs (using owl:sameAs)

• Let everyone create their own URIs and register them at a centralised repository

• Let everyone create their own URIs and let them be managed by many decentralised repositories

• In all of the above encourage reuse and linking as far as possible

Page 11

• owl:sameAs was designed for a specific purpose
• Resources linked with owl:sameAs have the same identity, i.e. the subject and object are exactly the same resource

• owl:sameAs has been misused for Linking Open Data

• Linking can occur between two very different resources, e.g. Tom Anderson

• Reasoning with LOD will have unintended consequences

Page 12

<rdf:Description rdf:about="#URI-1">
  <vcard:FN>Hugh Glaser</vcard:FN>
  <vcard:EMAIL>[email protected]</vcard:EMAIL>
  <vcard:ROLE>Reader</vcard:ROLE>
</rdf:Description>

<rdf:Description rdf:about="#URI-2">
  <vcard:FN>Hugh Glaser</vcard:FN>
  <vcard:EMAIL>[email protected]</vcard:EMAIL>
  <vcard:ROLE>Lecturer</vcard:ROLE>
</rdf:Description>

Assert <URI-1> <owl:sameAs> <URI-2>

SELECT ?x WHERE {<URI-1> vcard:EMAIL ?x}

Returns [email protected] [email protected]

Which email belongs to which role?

Using owl:sameAs means that both URIs become indistinguishable, even though they may refer to different entities according to the context in which they are used.
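The loss of context can be simulated in a few lines. This is a simplified sketch: it mimics the effect of owl:sameAs by rewriting one identifier to the other, which is not how a reasoner works internally but gives the same query answers here. The two email addresses are hypothetical placeholders, since the addresses on the slide are not recoverable.

```python
# Triples for the two descriptions. The addresses are hypothetical placeholders.
TRIPLES = [
    ("URI-1", "vcard:FN", "Hugh Glaser"),
    ("URI-1", "vcard:EMAIL", "reader@example.org"),
    ("URI-1", "vcard:ROLE", "Reader"),
    ("URI-2", "vcard:FN", "Hugh Glaser"),
    ("URI-2", "vcard:EMAIL", "lecturer@example.org"),
    ("URI-2", "vcard:ROLE", "Lecturer"),
]


def merge_same_as(triples, a, b):
    """Rewrite every occurrence of b as a, mimicking 'a owl:sameAs b'."""
    return {(a if s == b else s, p, a if o == b else o) for s, p, o in triples}


def values(triples, subject, predicate):
    """Sorted objects for a subject/predicate pair (a one-pattern SELECT)."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


merged = merge_same_as(TRIPLES, "URI-1", "URI-2")
emails = values(merged, "URI-1", "vcard:EMAIL")  # both addresses
roles = values(merged, "URI-1", "vcard:ROLE")    # both roles
```

After the merge, nothing in the data records which address went with which role; that pairing existed only through the two separate URIs.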

Page 13

• Data (Knowledge) providers publish data (knowledge)
• Resources from one provider cannot be guaranteed to be the same as resources from another provider
• Knowledge will be published and made dereferenceable at the domain that the publisher has control over

• URIs will be constructed from the domain name of the publisher’s site

• An intermediate service groups URIs of resources that may be the same

• This knowledge is made available upon dereferencing the URI of a resource

Page 14

• Can be seen as a conventional Knowledge Base
• Contains knowledge about the URIs in a repository
• URIs referring to the same resource are grouped together in ‘Bundles’
• A Bundle has properties:
  • coref:hasEquivalentReference – The URIs in a bundle are grouped together using this predicate
  • coref:hasCanonicalReference – One URI in a bundle can be made the canonical representation, i.e. the preferred URI
  • coref:updatedOn – The date of the last update to the bundle

Page 15

@prefix coref: <http://www.resist.ecs.soton.ac.uk/ontology/coref#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<http://citeseer.rkbexplorer.com/crs/coref#bundle1> a coref:Bundle ;
    coref:hasCanonicalReference <http://citeseer.rkbexplorer.com/rdf/resource-CSP109002> ;
    coref:hasEquivalentReference <http://citeseer.rkbexplorer.com/rdf/resource-CSP109011> ,
        <http://citeseer.rkbexplorer.com/rdf/resource-CSP109020> ,
        <http://citeseer.rkbexplorer.com/rdf/resource-CSP109013> ,
        <http://citeseer.rkbexplorer.com/rdf/resource-CSP109002> .
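One way an application might hold such a bundle in memory is sketched below. The `Bundle` class, `build_index` and `canonicalise` are hypothetical names introduced for illustration and are not part of the CRS itself; the URIs are the ones from the bundle on this slide.

```python
from dataclasses import dataclass, field


@dataclass
class Bundle:
    """In-memory stand-in for a coref:Bundle (illustrative only)."""
    canonical: str                                      # coref:hasCanonicalReference
    equivalents: set[str] = field(default_factory=set)  # coref:hasEquivalentReference

    def __post_init__(self):
        # The canonical URI is itself a member of the bundle.
        self.equivalents.add(self.canonical)


BASE = "http://citeseer.rkbexplorer.com/rdf/resource-"
bundle1 = Bundle(
    canonical=BASE + "CSP109002",
    equivalents={BASE + "CSP109011", BASE + "CSP109020",
                 BASE + "CSP109013", BASE + "CSP109002"},
)


def build_index(bundles):
    """Map every member URI to the bundle that contains it."""
    return {uri: bundle for bundle in bundles for uri in bundle.equivalents}


def canonicalise(uri, index):
    """Return the preferred URI for uri, or uri itself if it is unbundled."""
    bundle = index.get(uri)
    return bundle.canonical if bundle else uri


index = build_index([bundle1])
preferred = canonicalise(BASE + "CSP109020", index)
```

Unlike owl:sameAs, the original URIs stay distinct; the application decides when to treat bundle members as interchangeable.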

Page 16

[Diagram labels: Application; RESOLVE; RETRIEVE; RDF; KB; CRS. Non-Information Resource: http://southampton.rkbexplorer.com/id/person-00021. Information Resources (Text/Html, RDF/XML): http://southampton.rkbexplorer.com/data/person-00021 and http://southampton.rkbexplorer.com/description/person-00021.]
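The resolve step can be sketched as a redirect from the non-information resource URI to one of the two information resources. The mapping used here (HTML requests go to /description/, everything else to /data/) is an assumption based on the common Linked Data redirect pattern and is not confirmed by the slide; `redirect_target` is a hypothetical helper, and the Accept handling is deliberately naive (it ignores q-values).

```python
def redirect_target(id_uri: str, accept: str) -> str:
    """Pick the information resource to redirect to (e.g. via HTTP 303).

    Assumed mapping, not confirmed by the source:
      text/html       -> /description/
      anything else   -> /data/  (RDF/XML)
    """
    if "/id/" not in id_uri:
        raise ValueError("expected a non-information resource URI containing /id/")
    wants_html = "text/html" in accept.lower()
    segment = "/description/" if wants_html else "/data/"
    # Replace only the first /id/ path segment.
    return id_uri.replace("/id/", segment, 1)


PERSON = "http://southampton.rkbexplorer.com/id/person-00021"
rdf_location = redirect_target(PERSON, "application/rdf+xml")
html_location = redirect_target(PERSON, "text/html")
```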

Page 17

• Finding all equivalences (bundles) is up to the application

• A separate activity from coreferencing a single data source

• Services such as Sindice can perform this function for free

• To perform the equivalence closure just follow the crs:hasCRS links

• Scalability is ensured by not including all possible bundles in every CRS
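The equivalence closure can be sketched as a breadth-first walk across several CRSs. Each CRS is modelled here as a plain dictionary from a URI to the members of its bundle, standing in for dereferencing the crs:hasCRS links; the identifiers and bundle contents are hypothetical.

```python
from collections import deque


def make_crs(*bundles):
    """Build a toy CRS: every URI maps to the full set of its bundle."""
    index = {}
    for bundle in bundles:
        for uri in bundle:
            index[uri] = set(bundle)
    return index


def equivalence_closure(start, services):
    """Collect every URI reachable from start through any CRS bundle."""
    seen = {start}
    queue = deque([start])
    while queue:
        uri = queue.popleft()
        for crs in services:
            for other in crs.get(uri, ()):
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
    return seen


# Hypothetical CRSs, one per knowledge base.
CRS_A = make_crs({"a:p1", "b:p7"})
CRS_B = make_crs({"b:p7", "c:p3"}, {"b:p9", "c:p4"})

closure = equivalence_closure("a:p1", [CRS_A, CRS_B])
```

No single CRS holds the full set: CRS_A knows nothing about c:p3, yet the application reaches it by chaining through b:p7, which is how each CRS can stay small.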

Page 18

• The Resilience Knowledge Base Explorer displays communities of practice for people, projects and publications from the RKB

• Uses multiple CRSs to disambiguate people and publications

• One CRS per knowledge base ensures scalability
• Multiple SPARQL queries
• Look yourself up!
• www.rkbexplorer.com/explorer

Page 19

• Equivalence Mining is a difficult task that requires multiple algorithms

• Adding policies to determine the trust level of a CRS

• Establishing the authority of a CRS over a KB
• Establishing performance metrics
• Collaborating with LOD community for wide-scale deployment
• Formalising the linking methodology

Page 20

• Coreference exists in many disciplines and will exist on the Semantic Web

• The equivalence of non-information resources depends on context

• The semantics of owl:sameAs do not fit with the current usage in Linked Data

• The CRS is a solution that is being deployed on a large knowledge-based infrastructure

• It's my knowledge, so let me name it!

Page 21

Questions?
