

Page 1: Methods for accessing permanent information and their evaluation

Gianmaria Silvello
Information Management Systems Research Group
Department of Information Engineering, University of Padua

Workshop: Postdoctoral Research in Informatics
08 July 2015, Centro congressi “A. Luciani”, Padova

http://www.dei.unipd.it/~silvello/

Page 2: Main Aspects of Interest

Slide footer: Gianmaria Silvello, Methods for Efficient Access to Semi-Structured Data

[Figure: RDF class diagram. Area labels: Data Model; Data Access and Sharing; Evaluation.]

- Classes shown: Quality Parameter, Measure, User, Measurement, Run, Concept, Statistic, DescriptiveStatistic, Resource, Namespace, IdentifiableResource, Track, Evaluation Activity
- Properties shown: ims:expressAssessment, ims:isMeasuredBy, ims:isAssignedTo, ims:isEvaluatedBy, ims:evaluates, ims:isAssociatedTo, ims:describes, ims:isComposedBy, ims:submittedTo, ims:consistsOf, ims:isPartOf, rdfs:subClassOf

Page 3: Structure, Access, Query, Evaluation

Kinds of data: Structured, Semi-Structured, Unstructured

- Data: Relational database
- Search/Access Paradigm: Exact-Match, Best-Match, Hybrid
- Query: XPath/XQuery, SPARQL, Natural Language Keywords
- Evaluation:
  - Efficiency: Time, Space
  - Effectiveness: Precision, Recall, Accuracy
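As a small illustration of two of the effectiveness measures listed above, the following sketch computes set-based precision and recall for a single query. The document identifiers and relevance judgments are made-up values, not data from the talk.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 relevant documents exist.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Two of the four retrieved documents are relevant, and two of the three relevant documents were found.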

Page 4: The use of semi-structured data (XML)

- XML is widespread in many sectors of everyday life

- Cultural heritage data: libraries, archives and museums

- Health data: protein sequences, pharmaceutical research

- Geographical data

- Linguistics data: Treebank, part-of-speech tagging and annotation

- Heterogeneous scientific datasets

Page 5: NESTOR: Set-Based Approach to Access Hierarchical Data

[Figure: the same hierarchy drawn three ways: as a Tree, in the Nested Sets Model, and in the Inverse Nested Sets Model.]

M. Agosti, N. Ferro, and G. Silvello (2011). Handling Hierarchically Structured Resources Addressing Interoperability Issues in Digital Libraries. Studies in Computational Intelligence vol. 375, pp. 17–49. Springer

XML:

<a1>
  text
  <a2>
    <a4>text</a4>
    <a5>text</a5>
    <a6>text</a6>
  </a2>
  <a3>text</a3>
</a1>

N. Ferro and G. Silvello (2013). NESTOR: A Formal Model for Digital Archives. Information Processing & Management (IP&M), 49(6):1206–1240, Elsevier
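The set-based idea can be sketched on the hierarchy of the XML example above. This is a simplified reading, not the NESTOR formalization itself: it assumes that in the Nested Sets Model each node maps to a set holding itself and its descendants, and that in the Inverse Nested Sets Model each node maps to a set holding itself and its ancestors. Function names are illustrative.

```python
# Hierarchy of the slide's XML example, as parent -> children.
tree = {
    "a1": ["a2", "a3"],
    "a2": ["a4", "a5", "a6"],
    "a3": [], "a4": [], "a5": [], "a6": [],
}


def nested_sets(tree, root):
    """Map each node to the set of itself plus all of its descendants."""
    sets = {}

    def visit(node):
        members = {node}
        for child in tree[node]:
            members |= visit(child)
        sets[node] = members
        return members

    visit(root)
    return sets


def inverse_nested_sets(tree, root):
    """Map each node to the set of itself plus all of its ancestors."""
    sets = {}

    def visit(node, ancestors):
        sets[node] = ancestors | {node}
        for child in tree[node]:
            visit(child, sets[node])

    visit(root, set())
    return sets


nsm = nested_sets(tree, "a1")
insm = inverse_nested_sets(tree, "a1")

# With sets, hierarchy queries become proper-subset tests.
descendants_of_a2 = {n for n, s in nsm.items() if s < nsm["a2"]}
ancestors_of_a4 = {n for n, s in insm.items() if s < insm["a4"]}
print(sorted(descendants_of_a2), sorted(ancestors_of_a4))
```

In this reading the parent-child relation of the tree is replaced by set inclusion, so no explicit edges are needed to answer descendant or ancestor queries.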

Page 6: Efficient Implementations of NESTOR

- NESTOR Model: Nested Sets Model (NSM), Inverse Nested Sets Model (INSM)
- Data Structures: Direct Data Structure (DDS), Inverse Data Structure (IDS), Hybrid Data Structure (HDS)
- Set-Wise and Element-Wise Query Primitives
- XPath Axes: Descendants, Ancestors, Children, Parent

Structure      DDS      IDS      HDS
Descendants    O(m)     O(1)     O(m)
Ancestors      O(1)     O(m)     O(m)
Parent         O(1)     O(m)     O(1)
Children       O(1)     O(1)     O(1)

Content        DDS      IDS      HDS
Descendants    O(1)     O(m+n)   O(m+n)
Ancestors      O(m+n)   O(1)     O(m+n)
Parent         O(n)     O(m+n)   O(n)
Children       O(n)     O(n)     O(1)
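For readers unfamiliar with the four axes in the tables, the sketch below evaluates them on the XML example from the NESTOR slide using only Python's standard `xml.etree.ElementTree`. It does not implement DDS, IDS or HDS and says nothing about their costs; the helper names are illustrative.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<a1>text<a2><a4>text</a4><a5>text</a5><a6>text</a6></a2><a3>text</a3></a1>"
)

# ElementTree keeps no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in doc.iter() for child in parent}


def children(node):
    return [c.tag for c in node]


def descendants(node):
    # iter() walks the subtree in document order, starting with node itself.
    return [d.tag for d in node.iter() if d is not node]


def parent(node):
    p = parent_of.get(node)
    return p.tag if p is not None else None


def ancestors(node):
    found = []
    while node in parent_of:
        node = parent_of[node]
        found.append(node.tag)
    return found


a2 = doc.find("a2")
a4 = doc.find("a2/a4")
print(children(a2), descendants(doc), parent(a4), ancestors(a4))
```

Here the ancestor and parent axes need an extra lookup table, which hints at why a structure tuned for one direction of traversal can be slow in the other.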

Page 7: Efficiency Evaluation: Space and Time

25 XPath-Based Queries over Wikipedia from INEX (à la TPC benchmark)

[Figure: INEX Evaluation Time. Execution times in msec (log scale, 10^0 to 10^7) for the 25 structure query templates. Systems compared: DDS, IDS, HDS, Xalan, Jaxen, JXPath, BaseX.]

[Figure: Descendants Element-Wise Primitive - Collaborative Knowledge. Panels: Descendants, Union Descendants, Intersection Descendants.]

Page 8: Efficiency Evaluation: Space and Time

[Figures: Average Index Building Time (Wikipedia XML files); Average Occupied Memory (Wikipedia XML file).]

Page 9: The Other Side of Evaluation: Effectiveness

Structured query: SELECT name FROM hotel WHERE city='Padua'

Information need: "I need a comfortable accommodation in Padua, Italy"

- User-oriented evaluation

- From structured queries to information needs

- From set of results to ranked lists ordered by relevance

Example results: Sheraton Hotel, Methis Hotel, Best Western Premier, Toscanelli

Page 10: Effectiveness-Oriented Evaluation

- Evaluation is a demanding activity carried out in international evaluation campaigns to share the effort and compare the experiments

- We designed a visual analytics tool that eases evaluation work and reduces the effort required to carry out performance, failure and what-if analyses

M. Angelini, N. Ferro, G. Santucci, and G. Silvello (2014). A Visual Tool for Information Retrieval Performance Evaluation and Failure Analysis, Journal of Visual Languages and Computing, 25(4):394–413, Elsevier.

Page 11: Effectiveness-Oriented Evaluation (continued)

- The most common effectiveness measures evaluate the utility achieved by the user

- We propose the Twist measure to evaluate the effort required of the user

N. Ferro, G. Silvello, H. Keskustalo, A. Pirkola and K. Järvelin (2015). The Twist Measure for IR Evaluation: Taking User’s Effort into Account, Journal of the Association for Information Science and Technology, John Wiley & Sons, Inc. (in print).

[Figure: four example runs from TREC 10, 2001, Web. Each panel plots CRP and DCG against rank (0 to 1000).]

Panel               Run          Topic   Twist    nDCG     Run type
(a) Huge effort     jscbtawtl4   539     0.1899   0.5855   Full-scale like run
(b) High effort     hum01t       544     0.3654   0.7693   Typical run - type B
(c) Medium effort   jscbtawtl4   544     0.6435   0.8641   Typical run - type A
(d) Low effort      pir1wt2      504     0.8755   0.8195   Excellent run - type B

[Figure: "Effort vs Gain: TREC 10, 2001, Web", plotting nDCG against Twist for the four runs above.]
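The nDCG values reported for these runs come from a gain-oriented measure. A minimal sketch of DCG and nDCG follows, assuming the logarithmic discount of Järvelin and Kekäläinen with base 2 and, as a simplification, an ideal ranking obtained by re-sorting the run's own gains. The gain values are made up. The Twist measure is not sketched because its definition is not given in these slides.

```python
import math


def dcg(gains, base=2):
    """Discounted cumulated gain of a ranked list of gain values.

    Ranks up to `base` are left undiscounted; later ranks are divided
    by log_base(rank).
    """
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain / max(1.0, math.log(rank, base))
    return total


def ndcg(gains, base=2):
    """DCG normalized by the DCG of the best possible ordering of `gains`."""
    ideal = dcg(sorted(gains, reverse=True), base)
    return dcg(gains, base) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance of the top four documents of one run.
run = [3, 0, 2, 1]
score = ndcg(run)
print(round(score, 4))
```

A run that already ranks documents by decreasing gain scores 1.0, and any other ordering of the same documents scores lower.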

Page 12: Experimental Evaluation and Reproducibility

- Experimental evaluation enables repeatability, reproducibility and generalization of the experiments

N. Ferro and G. Silvello (2015). Rank-Biased Precision Reloaded: Reproducibility and Generalization. Proc. of the 37th European Conference on Information Retrieval (ECIR 2015), LNCS 9022, pp. 768–780. Springer.

- Experiments and findings are connected to scientific papers describing them: actionable papers


Page 13: Actionable Papers

<a href="http://direct.dei.unipd.it/017c333a-4b7c-4267-926d-f15fe3554efd">

<img src="http://direct.dei.unipd.it/017c333a-4b7c-4267-926d-f15fe3554efd/177bcef2-00a0-4f59-b781-f285610f1c6f

Page 14: Data Citation

G. Silvello (2015). A Methodology for Citing Linked Open Data Subsets, D-Lib Magazine, 21(1/2).

P. Buneman and G. Silvello (2010). A Rule-Based Citation System for Structured and Evolving Datasets. Bulletin of the Technical Committee on Data Engineering, 3(3):33–41

XML excerpt (IUPHAR-DB):

<iuphar>
  <name>IUPHAR-DB</name>
  <citation>Rule0</citation>
  [...]
  <gpcr>
    <name>G protein-coupled receptors</name>
    <citation>Rule1</citation>
    [...]
    <family>
      <id>29</id>
      <name>Glucagon receptor family</name>
      <citation>Rule2</citation>
      <receptor>
        <id>247</id>
        <name>GHRH</name>
        [...]
        <agonists> <ligand> [...] </ligand> </agonists>
        [...]
      </receptor>
      [...]
    </family>
    [...]
  </gpcr>
  <ionchannels> [...] </ionchannels>
</iuphar>

Rules:

- The first rule interpreted by the system: iuphar[name=$.d,url=$.u, version=$.v]
- The second rule interpreted by the system: iuphar[]/gpcr[name=$.n]
- The third rule interpreted by the system: iuphar[]/gpcr[]/family[name=$.f,id=$.i]/contributors[]/contributor[name=$?c]
- Instantiation of the variables: {database=$d, version=$v, contributors=$c, db-family=$n, family=$f, idFamily=$i}

The rules are recursively processed by the system and then transformed into a conjunction of XPaths.

The interpretation of the XPaths generates the citation.

The citation that gets generated (example):

{ database=IUPHAR-DB: the IUPHAR database || url=http://www.iuphar-db.org/ || version=15 || dbFamily=G protein-coupled receptors || family=Glucagon receptor family || idFamily=29 || contributor={Laurence J. Miller;;Daniel J. Drucker;;[...];;Rebecca Hills;;}}
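The step from rules to paths to a citation can be sketched as follows. This is an illustration under stated assumptions, not the system described in the cited paper: the rules are translated by hand into ElementTree path expressions instead of being parsed, the XML is a hypothetical miniature of the excerpt in which the position of `<url>`, `<version>` and `<contributors>` is assumed from the names used in the rules, and `RULES`, `generate_citation` and `render` are made-up names.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the IUPHAR-DB document (structure partly assumed).
xml_doc = """
<iuphar>
  <name>IUPHAR-DB</name><url>http://www.iuphar-db.org/</url><version>15</version>
  <gpcr>
    <name>G protein-coupled receptors</name>
    <family>
      <id>29</id><name>Glucagon receptor family</name>
      <contributors>
        <contributor><name>Laurence J. Miller</name></contributor>
        <contributor><name>Daniel J. Drucker</name></contributor>
      </contributors>
    </family>
  </gpcr>
</iuphar>
"""

# Citation field -> (path relative to <iuphar>, keep only the first match?).
RULES = {
    "database": ("name", True),
    "url": ("url", True),
    "version": ("version", True),
    "dbFamily": ("gpcr/name", True),
    "family": ("gpcr/family/name", True),
    "idFamily": ("gpcr/family/id", True),
    "contributor": ("gpcr/family/contributors/contributor/name", False),
}


def generate_citation(root, rules):
    """Evaluate every path and bind its result to the citation field."""
    fields = {}
    for field, (path, single) in rules.items():
        values = [node.text.strip() for node in root.findall(path)]
        if single:
            fields[field] = values[0] if values else None
        else:
            fields[field] = values
    return fields


def render(fields):
    """Serialize the fields in the '||'-separated style of the slide."""
    parts = []
    for key, value in fields.items():
        if isinstance(value, list):
            value = "{" + ";;".join(value) + "}"
        parts.append(f"{key}={value}")
    return "{ " + " || ".join(parts) + " }"


root = ET.fromstring(xml_doc)
citation = generate_citation(root, RULES)
print(render(citation))
```

The miniature holds a single family, so the family paths are unambiguous; citing one family among many would need a path that selects it, for example by its id.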

Human-readable reference:

John Doe and Marco Rossi, "SystemA performances at CLEF 2009", 08 July 2014, <http://example.org/CLEF2009-systemA>

Machine-readable reference:

Subject                Property     Object           Name
ex:cit-sysA-CLEF2009   dc:creator   "John Doe"       <http://example.org/CLEF2009-systemA>
ex:cit-sysA-CLEF2009   dc:creator   "Marco Rossi"    <http://example.org/CLEF2009-systemA>
ex:cit-sysA-CLEF2009   dc:date      "2014-07-08"     <http://example.org/CLEF2009-systemA>
ex:cit-sysA-CLEF2009   dc:title     "SystemA ..."    <http://example.org/CLEF2009-systemA>

Machine-readable citation meta-graph:

Subject   Property               Object   Name
ex:n1     schema:is-related-to   ex:n2    ex:cit-sysA-CLEF2009
ex:n1     schema:is-related-to   ex:n3    ex:cit-sysA-CLEF2009
ex:n2     schema:is-related-to   ex:n4    ex:cit-sysA-CLEF2009
ex:n2     schema:is-related-to   ex:n5    ex:cit-sysA-CLEF2009

Original cited LOD subset:

Subject       Property          Object        Name
ex:systemA    ex:produce        ex:expA       ex:n1
ex:expA       ex:measure        ex:measureA   ex:n2
ex:expA       ex:submitted-to   ex:CLEF2009   ex:n3
ex:measureA   ex:name           "precision"   ex:n4
ex:measureA   ex:value          "0.7"         ex:n5

Copyright © 2015 Gianmaria Silvello

Slide panels: Hierarchical Data (XML) Citation; Graph Data (RDF) Citation
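One way to read the tables above is as quads, that is, triples tagged with a graph name. The sketch below stores the slide's rows as plain tuples and resolves a citation back to the triples it covers. The `resolve` function and the resolution strategy (collect every graph name mentioned in the meta-graph of the citation, then select the triples stored under those names) are assumptions made for illustration, not the procedure of the cited paper.

```python
# (subject, property, object, graph name), copied from the slide's tables.
cited_subset = [
    ("ex:systemA", "ex:produce", "ex:expA", "ex:n1"),
    ("ex:expA", "ex:measure", "ex:measureA", "ex:n2"),
    ("ex:expA", "ex:submitted-to", "ex:CLEF2009", "ex:n3"),
    ("ex:measureA", "ex:name", '"precision"', "ex:n4"),
    ("ex:measureA", "ex:value", '"0.7"', "ex:n5"),
]

meta_graph = [
    ("ex:n1", "schema:is-related-to", "ex:n2", "ex:cit-sysA-CLEF2009"),
    ("ex:n1", "schema:is-related-to", "ex:n3", "ex:cit-sysA-CLEF2009"),
    ("ex:n2", "schema:is-related-to", "ex:n4", "ex:cit-sysA-CLEF2009"),
    ("ex:n2", "schema:is-related-to", "ex:n5", "ex:cit-sysA-CLEF2009"),
]


def resolve(citation, meta_graph, cited_subset):
    """Return the triples of the data subset that a citation refers to."""
    # Graph names that the citation's meta-graph mentions.
    names = set()
    for subject, _, obj, graph in meta_graph:
        if graph == citation:
            names.update((subject, obj))
    # Triples stored under any of those graph names.
    return [(s, p, o) for s, p, o, graph in cited_subset if graph in names]


triples = resolve("ex:cit-sysA-CLEF2009", meta_graph, cited_subset)
for triple in triples:
    print(*triple)
```

Because each cited triple lives in its own named graph, the citation can point at an arbitrary subset of the dataset without copying the data.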

Page 15: Future Directions

- Data citation methods for evolving datasets

- Effectiveness-oriented evaluation of keyword-based systems over structured data

- Data modeling, sharing and enriching via the Linked (Open) Data paradigm

Page 16: Selected Publications

- N. Ferro, G. Silvello, H. Keskustalo, A. Pirkola and K. Järvelin (2015). The Twist Measure for IR Evaluation: Taking User’s Effort into Account, Journal of the Association for Information Science and Technology, in print.

- G. Silvello (2015). A Methodology for Citing Linked Open Data Subsets, D-Lib Magazine, 21(1/2). DOI: 10.1045/january2015-silvello

- M. Angelini, N. Ferro, G. Santucci, and G. Silvello (2014). A Visual Tool for Information Retrieval Performance Evaluation and Failure Analysis, Journal of Visual Languages and Computing, 25(4):394–413.

- E. Di Buccio, G. Di Nunzio, and G. Silvello (2014). A Linked Open Data Approach for Geolinguistics Applications, International Journal of Metadata, Semantics and Ontologies, 9(1):29–41

- N. Ferro and G. Silvello (2013). NESTOR: A Formal Model for Digital Archives. Information Processing & Management (IP&M), 49(6):1206–1240.

- E. Di Buccio, G. Di Nunzio and G. Silvello (2013). A Curated and Evolving Linguistic Linked Dataset. Semantic Web, 4(3):265–270, P. Hitzler and K. Janowicz eds.

- P. Buneman and G. Silvello (2010). A Rule-Based Citation System for Structured and Evolving Datasets. IEEE Bulletin of the Technical Committee on Data Engineering, 2010, 3(3):33–41.

Page 17: Selected Publications (continued)

- N. Ferro and G. Silvello (2015). Rank-Biased Precision Reloaded: Reproducibility and Generalization. In Proc. of the 37th European Conference on Information Retrieval (ECIR 2015), LNCS 9022, pp. 768–780. Springer.

- M. Angelini, N. Ferro, G. Santucci and G. Silvello (2015). Tutorial: Visual Analytics for Information Retrieval Evaluation (VAIRË 2015). In Proc. of the 37th European Conference on Information Retrieval (ECIR 2015), LNCS 9022, pp. 809–812. Springer.

- N. Ferro and G. Silvello (2014). CLEF 15th Birthday: What Can We Learn From Ad Hoc Retrieval? In Proc. of the Information Access Evaluation. Multilinguality, Multimodality, and Interaction - 5th International Conference of the Cross-Language Evaluation Forum (CLEF 2014), LNCS 8685, pp. 32–44. Springer.

- M. Angelini, N. Ferro, G. Santucci and G. Silvello (2014). A Visual Interactive Environment for Making Sense of Experimental Data. 36th European Conference on Information Retrieval (ECIR 2014), Lecture Notes in Computer Science 8416, pp. 767–770, Springer

- E. Di Buccio, G. M. Di Nunzio and G. Silvello (2013). A Geolinguistic Web Application Based on Linked Open Data. Proc. of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’13) pp. 1101-1102. ACM, New York, NY, USA

- N. Ferro and G. Silvello (2013). Formal Models for Digital Archives: NESTOR and the 5S. In: Proc. of the Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries (TPDL 2013), LNCS 8092, pp. 192–203. Springer

Page 18: Selected Publications (continued)

- M. Angelini, N. Ferro, G. Granato, G. Santucci and G. Silvello (2012). Information retrieval failure analysis: Visual analytics as a support for interactive “what-if” investigation. In: IEEE Conference on Visual Analytics Science and Technology, VAST 2012, pp. 204-206. IEEE Computer Society, USA

- M. Angelini, N. Ferro, G. Santucci, and G. Silvello (2012). Visual Interactive Failure Analysis. In Proc. of the Fourth Information Interaction in Context Symposium (IIiX 2012). ACM Press, New York, USA

- M. Agosti, N. Ferro, and G. Silvello (2011). Handling Hierarchically Structured Resources Addressing Interoperability Issues in Digital Libraries. In Learning Structure and Schemas from Documents. Studies in Computational Intelligence vol. 375, pp. 17–49.

- N. Ferro and G. Silvello (2009). The NESTOR Framework: How to Handle Hierarchical Data Structures. In Proc. of the 13th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2009), LNCS 5741, pp. 215–226. Springer-Verlag.

- M. Agosti, N. Ferro and G. Silvello (2009). Access and Exchange of Hierarchically Structured Resources on the Web with the NESTOR Framework. In Proc. of the IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, 2009, pp. 659–662. IEEE Computer Society.