a distributional structured semantic space for querying rdf graph data

54
Copyright 2009 Digital Enterprise Research Institute. All rights reserved. Digital Enterprise Research Institute www.deri.ie A Distributional Structured Semantic Space for Querying RDF Graph Data André Freitas, Edward Curry, João G. Oliveira, Seán O’Riain

Upload: andre-freitas

Post on 10-May-2015

1.609 views

Category:

Education


2 download

DESCRIPTION

The vision of creating a Linked Data Web brings together the challenge of allowing queries across highly heterogeneous and distributed datasets. In order to query Linked Data on the Web today, end users need to be aware of which datasets potentially contain the data and also which data model describes these datasets. The process of allowing users to expressively query relationships in RDF while abstracting them from the underlying data model represents a fundamental problem for Web-scale Linked Data consumption. This article introduces a distributional structured semantic space which enables data model independent natural language queries over RDF data. The center of the approach relies on the use of a distributional semantic model to address the level of semantic interpretation demanded to build the data model independent approach. The article analyzes the geometric aspects of the proposed space, providing its description as a distributional structured vector space, which is built upon the Generalized Vector Space Model (GVSM). The final semantic space proved to be flexible and precise under real-world query conditions achieving mean reciprocal rank = 0.516, avg. precision = 0.482 and avg. recall = 0.491.

TRANSCRIPT

Page 1: A distributional structured semantic space for querying rdf graph data

Copyright 2009 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute www.deri.ie

A Distributional Structured Semantic Space for Querying RDF Graph Data

André Freitas, Edward Curry, João G. Oliveira, Seán O’Riain

Page 2: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Outline

Motivation & Problem Space Description of the Proposed Approach Evaluation Conclusions

Page 3: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Motivation & Problem Space

Linked Data: Heterogeneous and distributed data environment.

Traditional Database and Information Retrieval approaches for querying/searching Linked Data provide limited solutions for casual users.

Querying and searching Linked Data remains a fundamental challenge.

?

Page 4: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Query/Search Spectrum

Adapted from Kauffman et al (2009)

Page 5: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Semantic Gap

From which university did the wife of Barack Obama graduate?

Data model independent (DMI) queries

Semantic Matching

Page 6: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Semantic Matching Problem

From which university did the wife of Barack Obama graduate?

Page 7: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Semantic Matching Problem

From which university did the wife of Barack Obama graduate?

Semantic matching

Page 8: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Semantic Matching Requirements

Expressivity: Supporting expressive natural language queries. Path, conjunctions, disjunctions, aggregations, conditions.

Flexibility & Accuracy: Accurate and flexible semantic matching approach.

Maintainability: Easily transportable across datasets from different domains. Can support Open domain (e.g. Wikipedia) and doman-specific

(e.g. Financial) datasets without customization. Performance:

Suitable for realtime querying (low query execution time). Scalability:

Scalable to a large number of datasets (Organization-scale, Web-scale).

Page 9: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Research Questions

Q1: How to provide a query mechanism which allows users to expressively query linked datasets without a previous understanding of the vocabularies behind the data?

Q2: Which semantic model could support this query mechanism?

Page 10: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Proposed Approach

Page 11: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Proposed Approach (Outline)

Treo: Query approach based on dynamic Linked Data navigation. Performance limitations.

Treo T-Space: Existing IR approaches: traditional Vector Space Models

(VSMs) were not able to: (i) capture the structure of graphs and (ii) support a precise semantic matching behavior.

A VSM supporting these two requirements was formulated: T-Space.

Ranking function based on semantic relatedness.

Page 12: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Proposed Approach (Outline)

Page 13: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

First Approach (Treo)

Natural language queries. Ranked query results. Two phase search process combining entity search

with spreading activation search. Use of Wikipedia-based semantic relatedness to

semantically match query terms to dataset terms. Limited query execution performance.

Requirements: Expressivity, Flexibility & Accuracy.

Page 14: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Semantic Relatedness

Computation of a measure of “semantic proximity” between two terms.

Allows a semantic approximate matching between query terms and dataset terms.

Most existing approaches use WordNet-based

solutions for approximate semantic matching (limited solution).

Use of Wikipedia-based semantic relatedness measures (addresses the limitations of WordNet-based measures).

Page 15: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Query Approach Rationale (Treo)

Page 16: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Query Approach Rationale (Treo)

RDF

Page 17: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Query Approach Rationale (Treo)

Wikipedia Link Measure

Page 18: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Query Approach Rationale (Treo)

Page 19: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Query Approach Rationale (Treo)

RDF

Page 20: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Query Approach Rationale (Treo)

Page 21: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Query Approach Rationale (Treo)

Page 22: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Query Approach Rationale (Treo)

Page 23: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Query Approach Rationale (Treo)

Final Query- Data Matching:

Querying Linked Data using Semantic Relatedness: A Vocabulary Independent Approach. In Proceedings of the 16th International Conference on Applications of Natural Language to Information Systems (NLDB) 2011.

Currently limited to path queries

Page 24: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Treo (Irish): Direction, path

Page 25: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

From which university did the wife of Barack Obama graduate?

Page 26: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Second Approach: Treo T- Space

Motivation: Performance problem of the original approach. Transportability across different domains.

Rationale: Rephrasing of the proposed approach as a vector

space model. Addressing the limitations of the vector space

model (BoW: lack of structure) to represent the structure present in the data.

Keeping the semantic matching behavior.

Requirements: Performance & Scalability, Maintainability.

Page 27: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

T- Space

A Vector Space Model for building RDF indexes that preserve the graph structure and which allows a semantic matching/search behavior.

Page 28: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Generalization (Treo T- Space)

Approach: Ranked results. Natural language queries. Two phase search process combining entity search

with spreading activation search. Use of distributional semantics to semantically

match query terms to dataset terms. Semantic manifold (‘structured vector space’)

model: T-Space.

Page 29: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Distributional Semantic Model

Assumption: the context surrounding a given word in a text provides important information about its meaning.

Simplified semantic model. Explicit Semantic Analysis (ESA) is the primary

distributional model used in this work.

A wife is a female partner in a marriage. The term "wife" seems to be a close term to bride, the latter is a female participant in a wedding ceremony, while a wife is a married woman during her marriage. ...

Page 30: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

T- Space

instances

classes

relations Distributional semantic model:

Semantic statistical knowledge extracted

from large Web corpora

Embedding the structure of a graph into a semantic vector space

Page 31: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Querying the T- Space

Page 32: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Querying the T- Space

Page 33: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Querying the T- Space

Page 34: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Querying the T- Space

A Multidimensional Semantic Space for Data Model Independent Queries over RDF Data. In Proceedings of the 5th IEEE International Conference on Semantic Computing (ICSC), 2011.

Querying Heterogeneous Datasets on the Linked Data Web: Challenges, Approaches and Trends. IEEE Internet Computing, Special Issue on Internet-Scale Data, 2012.

Page 35: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Ranking/Filtering Results

How to filter non-related results. Analysis of semantic relatedness as a ranking

function. Creation of a methodology for the determination of

a semantic relatedness threshold. Semantic differential analysis.

A Distributional Approach for Terminological Semantic Search on the Linked Data Web. 27th ACM Applied Computing Symposium, Semantic Web and Its Applications Track, 2012.

Requirements: Flexibility & Accuracy.

Page 36: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

VSM & Formalization

Based on the Generalized Vector Space Model. Simplification and clarification of T-Space

properties. Tensors are used to support domain specific

meaning. Tensors can be used to connect concepts across

datasets and T-Spaces

Page 37: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

VSM & Formalization

1. The distributional space can be transformed to the keyword space.

Requirements: Flexibility & Accuracy

Page 38: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

VSM & Formalization

2. The vector field defines an additional structure over the vector space model.

Requirements: Expressivity

Page 39: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Customized Distributional Model

Distributional models customizable for different datasets.

Page 40: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Customized Distributional Model

Transformation from different distributional models

DBPedia Financial Dataset

Wikipedia Financial Corpus

Requirements: Maintainability A Distributional Structured Semantic Space for Querying RDF Graph Data. International Journal of Semantic Computing (IJSC), 2012.

Page 41: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Evaluation (Treo T- Space)

Page 42: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Quality of Results

Q1: How does the approach work for addressing the vocabulary problem?

Q2: How does the approach work for addressing natural language queries? Precision, recall, mean reciprocal rank, % of

answered queries.

Page 43: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Quality of Results

Full DBPedia QuerySet (50 queries)

Avg. Precision Avg. Recall

MRR % of queries answered

0.482 0.491 0.516 58%

QALD DBPedia Training Set. 50 natural language queries. DBpedia 3.6.

Partial DBPedia QuerySet (38 queries)

Avg. Precision Avg. Recall

MRR % of queries answered

0.634 0. 645 0.679 76%

Page 44: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Error Distribution

Q4: What is the error associated with each component of the search approach? Error distribution measurements.

Page 45: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Error Distribution

Conclusion: Robustness of the distributional semantic model for open domain queries.

Conclusion: Major need for improvement of the pre/post processing phases.

Page 46: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Evaluating Terminology- level Semantic Matching

Q1: How does the approach work for addressing the vocabulary problem?

Q5: Does the approach improve the semantic matching of the queries?

Quantitative evaluation: P@5, P@10, MRR, % of the

queries answered, comparative evaluation using string matching and WordNet-based query expansion.

Page 47: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Evaluating Terminology- level Semantic Matching

Avg. Precision@5

Avg. Precision@10

MRR % of queries answered

0.732 0.646 0.646 92.25%

Approach % of queries answered

ESA 92.25% String matching 45.77%

String matching + WordNet QE 52.48%

Page 48: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Comparative Analysis

Q3: What is the best distributional semantic model for the vocabulary problem? Preliminary comparative analysis between

different distributional semantic models.

Page 49: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Comparative Analysis (Treo vs Treo T- Space)

ESA (Full Query Set)

Avg. Precision Avg. Recall MRR % of queries answered

0.482 0.491 0.516 58%

Wikipedia Link Measure (WLM) vs Explicit Semantic Analysis (ESA)

WLM (Full Query Set)

Avg. Precision Avg. Recall MRR % of queries answered

0.395 0. 451 0.489 56%

Improvement

% Avg. Precision % Avg. Recall % MRR % of queries answered

18% 8.2% 5.2% 3.5%

Page 50: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Comparative Analysis

Conclusion: From the two tested approaches, ESA provides a better semantic model.

Page 51: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Conclusions

The T-Space model provides a principled way to build data model intependent queries over RDF graphs.

The distributional semantic model supports a flexible matching between query terms and dataset terms in a semantic best-effort scenario.

The ESA semantic model provides a better distributional model compared to WLM.

Improvements are needed on the pre/post processing phase of the approach.

Page 52: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Associated Article

André Freitas, Edward Curry, João Gabriel Oliveira, Sean O'Riain, A Distributional Structured Semantic Space for Querying RDF Graph Data. International Journal of Semantic Computing (IJSC),2012.

http://www.worldscinet.com/ijsc/05/0504/S1793351X1100133X.html

http://andrefreitas.org/papers/preprint_distributional_structured_space.pdf

Page 53: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

Related Publications

André Freitas, Edward Curry, João Gabriel Oliveira, Sean O'Riain, Querying Heterogeneous Datasets on the Linked Data Web: Challenges, Approaches and Trends. IEEE Internet Computing, Special Issue on Internet-Scale Data, 2012 (Article).

André Freitas, Edward Curry, João Gabriel Oliveira, Sean O'Riain, A Distributional Structured Semantic Space for Querying

RDF Graph Data. International Journal of Semantic Computing (IJSC), 2012 (Article). André Freitas, Sean O'Riain, Edward Curry, A Distributional Approach for Terminological Semantic Search on the Linked Data

Web. 27th ACM Applied Computing Symposium, Semantic Web and Its Applications Track, 2012 (Conference Full Paper). André Freitas, João Gabriel Oliveira, Edward Curry, Sean O'Riain, A Multidimensional Semantic Space for Data Model

Independent Queries over RDF Data. In Proceedings of the 5th International Conference on Semantic Computing (ICSC), 2011. (Conference Full Paper).

André Freitas, João Gabriel Oliveira, Sean O'Riain, Edward Curry, João Carlos Pereira da Silva, Querying Linked Data using

Semantic Relatedness: A Vocabulary Independent Approach. In Proceedings of the 16th International Conference on Applications of Natural Language to Information Systems (NLDB) 2011. (Conference Full Paper).

André Freitas, João Gabriel Oliveira, Sean O'Riain, Edward Curry, João Carlos Pereira da Silva, Treo: Combining Entity-Search,

Spreading Activation and Semantic Relatedness for Querying Linked Data, In 1st Workshop on Question Answering over Linked Data (QALD-1) Workshop at 8th Extended Semantic Web Conference (ESWC), 2011 (Workshop Full Paper) .

André Freitas, João Gabriel Oliveira, Sean O'Riain, Edward Curry, João Carlos Pereira da Silva, Treo: Best-Effort Natural

Language Queries over Linked Data, In Proceedings of the 16th International Conference on Applications of Natural Language to Information Systems (NLDB), 2011 (Poster in Proceedings).

Page 54: A distributional structured semantic space for querying rdf graph data

Digital Enterprise Research Institute www.deri.ie

http://andrefreitas.org