Quality views: capturing and exploiting the user perspective on information quality


Page 1: Quality views: capturing and exploiting the user perspective on information quality


Quality views: capturing and exploiting the user perspective on information quality

Paolo Missier, Suzanne Embury, Mark Greenwood
School of Computer Science, University of Manchester

Alun Preece, Binling Jin
Department of Computing Science, University of Aberdeen

www.qurator.org

Describing the Quality of Curated e-Science Information Resources

Page 2: Quality views: capturing and exploiting the user perspective on information quality


Outline

• Information and information quality (IQ) in e-science

• Quality views: a quality lens on data

• Semantic model for IQ

• Architectural framework for quality views

• State of the project and current research

Page 3: Quality views: capturing and exploiting the user perspective on information quality


Information and quality in e-science

• Scientists are increasingly required to place more of their data in the public domain

• Scientists use other scientists' experimental results as part of their own work

[Diagram: an e-science experiment combining lab experiments and in silico experiments (eg workflow-based), with data flowing to and from public BioDBs. Callout: “Can I trust this data? What evidence do I have that it is suitable for my experiment?”]

• Variations in the quality of the data being shared

• Scientists have no control over the quality of public data

• Lack of awareness of quality: it is difficult to measure and assess
  – No standards!

Page 4: Quality views: capturing and exploiting the user perspective on information quality


A concrete scenario

Qualitative proteomics: identification of proteins in a cell sample

[Diagram: a wet lab produces candidate data for matching (peptide peak lists); an information service (“dry lab”) runs them through steps 1..n of a match algorithm against reference DBs (MSDB, NCBI, SwissProt/Uniprot), producing a hit list {ID, score, p-value, …}.]

False negatives: incompleteness of reference DBs, pessimistic matching

False positives: optimistic matching

Page 5: Quality views: capturing and exploiting the user perspective on information quality


The complete in silico workflow

1: identify proteins; 2: analyze their functions

What is the quality of this processor’s output?

Is the processor introducing noise in the flow?

GO = Gene Ontology: reference controlled vocabulary for describing protein function (and more)

How can a user rapidly test this and other hypotheses on quality?

Page 6: Quality views: capturing and exploiting the user perspective on information quality


The users’ perception of quality

Scientists often have only a blurry notion of their quality requirements for the data

A “one size fits all” approach to quality does not work
– Scientists tend to apply personal acceptability criteria to data

– Driven mostly by prior personal and peers’ experience

– Based on the expected use of the data
  • What levels of false positives / negatives are acceptable?

It is difficult for users to implement quality criteria and test them on the data

Page 7: Quality views: capturing and exploiting the user perspective on information quality


Quality views: making quality explicit

Our goals:

• To support groups of users within a (scientific) community in understanding information quality on specific data domains

• To foster reuse of quality definitions within the community

Approach:

• Provide a conceptual model and architectural framework to capture user preferences on data quality

• Let users populate the framework with custom definitions for indicators and personal decision criteria

– The framework allows users to rapidly test quality preferences and observe their effect on the data

– Semi-automated integration in the data processing environment

Quality views: a specification of quality preferences and of how they apply to the data

Page 8: Quality views: capturing and exploiting the user perspective on information quality


Basic elements of information quality

1 - Quality dimensions:

A basic set of generic definitions for well-known non-functional properties of the data
• Ex. Accuracy: describes “how close the observed value is to the actual value”

2 - Quality evidence:

• Any measurable quantities that can be used to express formal quality criteria

• Evidence is not by itself a measure of quality

Ex. “Hit ratio in protein identification”

3 - Quality assertions:
Decision procedures for data acceptability, based on available evidence

Page 9: Quality views: capturing and exploiting the user perspective on information quality


The nature of quality evidence

Direct evidence: indicators that represent some quality property
– Algorithms may exist to determine the biological plausibility of an experiment’s outcome
– may be costly, not always available, and possibly inconclusive

Indirect evidence: inexpensive indicators that correlate with other, more expensive indicators
– Eg some function of “hit ratio” and “sequence coverage”
– Need experimental evidence of the correlation

Goals:
– design suitable functions to collect / compute evidence
– associate evidence with data (data quality annotation)

Page 10: Quality views: capturing and exploiting the user perspective on information quality


Generic (e-science) evidence

• recency: how recently the experiment was performed, or its results published
  – Evidence: submission and publication dates

• submitter reputation: is the lab well known for its accuracy in carrying out this type of experiment?
  – Metric: lab ranking (subjective)

• publication prestige: are the experiment results presented in high-profile journal publications?
  – Metric: Impact Factor and more (official)

Collecting data provenance is the key to providing most of these types of evidence
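
As an illustration only (not part of the original slides), the sketch below shows how provenance metadata might be turned into such generic evidence values; the record fields (submission_date, lab_ranking, impact_factor) and the recency formula are assumptions made for the example.

```python
from datetime import date
from typing import TypedDict

class ProvenanceRecord(TypedDict):
    # Hypothetical provenance fields attached to a public DB submission
    submission_date: date
    lab_ranking: float    # subjective reputation score in [0, 1]
    impact_factor: float  # Impact Factor of the associated publication

def generic_evidence(prov: ProvenanceRecord, today: date) -> dict:
    """Derive generic e-science evidence values from provenance metadata."""
    age_days = (today - prov["submission_date"]).days
    return {
        "recency": max(0.0, 1.0 - age_days / 3650.0),  # linear decay over ~10 years
        "submitter_reputation": prov["lab_ranking"],
        "publication_prestige": prov["impact_factor"],
    }

# Example usage with made-up values
record: ProvenanceRecord = {"submission_date": date(2004, 6, 1),
                            "lab_ranking": 0.8, "impact_factor": 6.2}
print(generic_evidence(record, date(2006, 3, 18)))
```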

Page 11: Quality views: capturing and exploiting the user perspective on information quality


Semantic model for Information Quality

The key IQ concepts are captured using an ontology:

• Provides shareable, formal definitions for
  – QualityProperties (“dimensions”)
  – Quality Evidence
  – Quality Assertions
  – DataAnalysisTools: describe how indicators are computed

• The ontology is implemented in OWL DL
  – Expressive operators for defining concepts and their relationships
  – Support for subsumption reasoning

Page 12: Quality views: capturing and exploiting the user perspective on information quality


Top-level taxonomy of quality dimensions

[Figure: taxonomy spanning generic dimensions through domain-specific, user-oriented, concrete qualities]

Wang and Strong, Beyond Accuracy: What Data Quality Means to Data Consumers, Journal of Management Information Systems, 1996

Page 13: Quality views: capturing and exploiting the user perspective on information quality


Main taxonomies and properties

Class restrictions:
  MassCoverage ⊑ ∃ is-evidence-for . ImprintHitEntry
  PIScoreClassifier ⊑ ∃ assertion-based-on-evidence . HitScore
  PIScoreClassifier ⊑ ∃ assertion-based-on-evidence . MassCoverage

Properties:
  assertion-based-on-evidence: QualityAssertion → QualityEvidence
  is-evidence-for: QualityEvidence → DataEntity
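
Purely as an illustration (not from the slides), the following sketch shows how the first restriction and the is-evidence-for property could be encoded with rdflib; the namespace URI is hypothetical.

```python
from rdflib import Graph, Namespace, BNode, RDF, RDFS, OWL

# Hypothetical namespace for the Qurator IQ ontology
Q = Namespace("http://www.qurator.org/ontology/iq#")

g = Graph()
g.bind("iq", Q)

# MassCoverage ⊑ ∃ is-evidence-for . ImprintHitEntry
restriction = BNode()
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, Q["is-evidence-for"]))
g.add((restriction, OWL.someValuesFrom, Q.ImprintHitEntry))
g.add((Q.MassCoverage, RDFS.subClassOf, restriction))

# The object property linking quality evidence to the data it describes
g.add((Q["is-evidence-for"], RDF.type, OWL.ObjectProperty))
g.add((Q["is-evidence-for"], RDFS.domain, Q.QualityEvidence))
g.add((Q["is-evidence-for"], RDFS.range, Q.DataEntity))

print(g.serialize(format="turtle"))
```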

Page 14: Quality views: capturing and exploiting the user perspective on information quality


Associating evidence with data

• Annotation functions compute quality evidence values for datasets and associate them with the data
  – Defined in the DataAnalysisTool taxonomy as part of the ontology
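
A minimal sketch (not the Qurator implementation) of what an annotation function might look like in the proteomics scenario: it computes a MassCoverage evidence value for each hit-list entry and attaches it as an annotation; the field names are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class HitEntry:
    protein_id: str
    score: float         # match score from the identification algorithm
    matched_mass: float  # protein mass covered by matched peptides
    total_mass: float    # total protein mass
    annotations: dict = field(default_factory=dict)  # evidence values live here

def annotate_mass_coverage(hits: list[HitEntry]) -> list[HitEntry]:
    """Annotation function: compute the MassCoverage evidence for each entry
    and associate it with the data item."""
    for hit in hits:
        hit.annotations["MassCoverage"] = hit.matched_mass / hit.total_mass
    return hits

# Example usage with made-up values
hits = annotate_mass_coverage([HitEntry("P12345", score=0.9,
                                         matched_mass=30.0, total_mass=50.0)])
print(hits[0].annotations)  # {'MassCoverage': 0.6}
```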

Page 15: Quality views: capturing and exploiting the user perspective on information quality


Quality assertions

Defined as ranking or classification functions f(D, I)

Input:
• dataset D
• vector I = [I1, I2, …, In] of indicator values

Possible outputs:
• a classification {(d, ci)} for each d ∈ D
• a ranking {(d, ri)} for each d ∈ D

The classification scheme C = {c1, …, ck} and the ranking interval [r, R] are themselves defined in the ontology

Assertions formalize the user’s bias on evidence as computable decision models on that evidence

Example:

PIScoreClassifier partitions the input dataset into three classes {low, avg, high} based on a function of [HitScore, MassCoverage]
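
As a concrete sketch (not the project's actual decision model), a quality assertion in the spirit of PIScoreClassifier could look like the following; the combined score formula, thresholds, and field names are made up for illustration.

```python
def pi_score_classifier(hits: list[dict]) -> dict[str, str]:
    """Quality assertion sketch: classify each hit entry as low / avg / high,
    using a hypothetical function of the HitScore and MassCoverage evidence."""
    classes = {}
    for hit in hits:
        # Illustrative combined score; the real decision model would be
        # chosen (and re-tuned) by the scientist.
        score = 0.7 * hit["HitScore"] + 0.3 * hit["MassCoverage"]
        if score < 0.3:
            cls = "low"
        elif score < 0.7:
            cls = "avg"
        else:
            cls = "high"
        classes[hit["protein_id"]] = cls
    return classes

# Example: two annotated hit-list entries (values are made up)
hits = [
    {"protein_id": "P12345", "HitScore": 0.9, "MassCoverage": 0.6},
    {"protein_id": "Q99999", "HitScore": 0.2, "MassCoverage": 0.1},
]
print(pi_score_classifier(hits))  # {'P12345': 'high', 'Q99999': 'low'}
```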

Page 16: Quality views: capturing and exploiting the user perspective on information quality


Quality views in practice

Quality views are declarative specifications for:

• desired data classification models and evidence
  – I = [I1, I2, …, In]
  – class_i(d), rank_i(d) for all d ∈ D

• condition-action pairs, eg:
  • If <condition on class(d), rank(d), I> then <action>
  • Where <action> depends on the data processing environment

– Filter out d

– Highlight d in a viewer

– Send d to a designated process or repository

– …

• Quality views are based on a small set of formal operators
• They are expressed using an XML syntax
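
To make the condition-action idea concrete, here is a small procedural sketch (not the project's XML syntax or its formal operators) of how a quality view's condition-action pairs might be applied to an annotated, classified dataset; all names and actions are illustrative.

```python
from typing import Callable

# A condition-action pair: the condition inspects a data item's class,
# rank and indicator values; the action says what to do with the item.
ConditionAction = tuple[Callable[[dict], bool], str]

def apply_quality_view(items: list[dict], rules: list[ConditionAction]) -> dict[str, list[dict]]:
    """Route each data item according to the first matching condition-action pair.
    Items matched by no rule are kept as-is ('accept')."""
    outcome: dict[str, list[dict]] = {"accept": [], "filter": [], "highlight": []}
    for d in items:
        for condition, action in rules:
            if condition(d):
                outcome[action].append(d)
                break
        else:
            outcome["accept"].append(d)
    return outcome

# Illustrative rules over the classification produced by a quality assertion
rules = [
    (lambda d: d["class"] == "low", "filter"),     # drop low-quality hits
    (lambda d: d["class"] == "avg", "highlight"),  # flag borderline hits in a viewer
]

items = [
    {"protein_id": "P12345", "class": "high"},
    {"protein_id": "Q99999", "class": "low"},
    {"protein_id": "O11111", "class": "avg"},
]
print({k: [d["protein_id"] for d in v]
       for k, v in apply_quality_view(items, rules).items()})
# {'accept': ['P12345'], 'filter': ['Q99999'], 'highlight': ['O11111']}
```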

Page 17: Quality views: capturing and exploiting the user perspective on information quality


Execution model for Quality views

• QVs can be embedded within specific data management host environments for runtime execution
  – For static data: a query processor
  – For dynamic data: a workflow engine

[Diagram: a QV compiler translates a declarative (XML) QV into an executable QV embedded in the host environment; the host environment applies it to dataset D, invoking the Qurator quality framework (quality assertion services), and outputs D’ together with a quality view on D’.]
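
A highly simplified sketch (purely illustrative, not the Qurator compiler) of this execution model: a declarative QV is "compiled" into a callable step that a host environment can apply to a dataset D to obtain D'; rule and action names are assumptions carried over from the previous sketch.

```python
from typing import Callable

def compile_quality_view(rules: list[tuple[Callable[[dict], bool], str]]
                         ) -> Callable[[list[dict]], list[dict]]:
    """'Compile' a declarative list of condition-action pairs into an executable
    step that a host environment (query processor or workflow engine) can run."""
    def executable_qv(dataset: list[dict]) -> list[dict]:
        d_prime = []
        for d in dataset:
            action = next((a for cond, a in rules if cond(d)), "accept")
            if action != "filter":  # filtered items are removed from D'
                d_prime.append({**d, "qv_action": action})
        return d_prime
    return executable_qv

# Host environment usage: D -> embedded executable QV -> D'
step = compile_quality_view([(lambda d: d["class"] == "low", "filter")])
D = [{"id": 1, "class": "high"}, {"id": 2, "class": "low"}]
print(step(D))  # [{'id': 1, 'class': 'high', 'qv_action': 'accept'}]
```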

Page 18: Quality views: capturing and exploiting the user perspective on information quality


User model

Implementing rapid testing of quality hypotheses:

[Diagram: an iterative cycle supported by the IQ ontology, the quality assertion services, and (XML) bindings: compose quality view → compile and deploy → execute on test data → assess view results → (update assertion models) → re-deploy.]

Page 19: Quality views: capturing and exploiting the user perspective on information quality


The Qurator quality framework

Page 20: Quality views: capturing and exploiting the user perspective on information quality


Compiled quality workflow

Page 21: Quality views: capturing and exploiting the user perspective on information quality


Embedded quality workflow

Page 22: Quality views: capturing and exploiting the user perspective on information quality


Example effect of QV: noise reduction

Page 23: Quality views: capturing and exploiting the user perspective on information quality


Summary

• A conceptual model and architecture for capturing the user’s perception of information quality

– Formal, semantic model makes concepts:
  • Shareable
  • Reusable
  • Machine-processable

• Quality views are user-defined and compiled to data processing environments (possibly multiple)

• The Qurator framework supports a runtime model for QVs

• Current work:
  – Formal semantics for QVs
  – Exploiting semantic technology to support the QV specification task
  – Addressing more real use cases

Main paradigm: let scientists experiment with quality concepts in an easy and intuitive way by observing the effect of their personal bias
