Metadata Quality Assurance Framework at QQML2016 conference - full version

Metadata Quality Assurance Framework. Péter Király <[email protected]>, Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen, Germany. QQML2016, 8th International Conference on Qualitative and Quantitative Methods in Libraries, 2016-05-24, London


Page 1: Metadata Quality Assurance Framework at QQML2016 conference - full version

Metadata Quality Assurance Framework

Péter Király <[email protected]>
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen, Germany

QQML2016, 8th International Conference on Qualitative and Quantitative Methods in Libraries
2016-05-24, London

Page 2: Metadata Quality Assurance Framework at QQML2016 conference - full version

The problem

There are „good” and „bad” metadata records

Page 3: Metadata Quality Assurance Framework at QQML2016 conference - full version

Typical issues – non-informative field

Title is not informative

Non-informative: „photograph, framed”, „group photograph”, „photograph”

vs

Informative: „Photograph of Sir Dugald Clerk”, „Photograph of "Puffing Billy"”

Page 4: Metadata Quality Assurance Framework at QQML2016 conference - full version

Typical issues – Copy & paste cataloging

Keeping placeholders / templates

Page 5: Metadata Quality Assurance Framework at QQML2016 conference - full version

Typical issues – Field overuse

What is the meaning of the field? (overuse)

TextGrid OAI-PMH response

Page 6: Metadata Quality Assurance Framework at QQML2016 conference - full version

Why is data quality important?

„Fitness for purpose” (QA principle)

no metadata → no access to data → no data usage

More explanation: Data on the Web Best Practices, W3C Working Draft 19 May 2016, https://www.w3.org/TR/dwbp/

Page 7: Metadata Quality Assurance Framework at QQML2016 conference - full version

Europeana Data Quality Committee

Online collaboration: use case documents, problem catalog, tickets, discussion forum, #EuropeanaDataQuality

Bi-weekly teleconference

Bi-yearly face-to-face meeting

Topics: usage scenarios, metadata profiles, schema modification, measuring, event model, proposals for data providers

Page 8: Metadata Quality Assurance Framework at QQML2016 conference - full version

Research hypothesis

Hypothesis: by measuring structural elements we can predict metadata record quality

Page 9: Metadata Quality Assurance Framework at QQML2016 conference - full version

What is it good for?

improve the metadata
improve services: good data → functions
improve metadata schema & documentation
propagate „good practice”

Domains:
cultural heritage sector
research data management and archiving

Page 10: Metadata Quality Assurance Framework at QQML2016 conference - full version

Research hypothesis

Proposed solution: Metadata Quality Assurance Framework

Page 11: Metadata Quality Assurance Framework at QQML2016 conference - full version

What to measure?

Page 12: Metadata Quality Assurance Framework at QQML2016 conference - full version

Measurements

Schema-independent structural features: existence, cardinality, uniqueness, length, dictionary entry, data type conformance

Use case scenarios („fit for purpose”): requirements of the most important functions

Problem catalog: known metadata problems

Page 13: Metadata Quality Assurance Framework at QQML2016 conference - full version

Discovery scenarios and their metadata requirements

Europeana’s most important functions

1. Basic retrieval with high precision and recall
2. Cross-language recall
3. Entity-based facets
4. Date-based facets
5. Improved language facets
6. Browse by subjects and resource types
7. Browse by agents
8. Browse/Search by Event
9. Entity-based knowledge cards and pages
10. Categorised similar items
11. Spatial search, browse, and map display
12. Entity-based autocompletion
13. Diversification of results
14. Hierarchical search and facets

Credit: the document was initiated by Timothy Hill, Europeana’s search engineer

Page 14: Metadata Quality Assurance Framework at QQML2016 conference - full version

Discovery scenarios and their metadata requirements – Entity-based facets

Scenario: As a user I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc.

Metadata analysis: In each case the underlying requirement is that the relevant EDM fields for objects be populated by identifying URIs rather than free text. These URIs need to be related, at a minimum, to a label for each of the supported languages.

Measurement rules:
The relevant field values should be resolvable URIs
Each URI should have labels in multiple languages
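
The first measurement rule can be approximated offline with a purely syntactic check. The following Python sketch only tests whether a field value looks like an absolute HTTP(S) URI; it does not dereference the URI and does not look up labels. The function names `looks_like_uri` and `uri_ratio` are hypothetical and not part of the framework shown in the slides.

```python
from urllib.parse import urlparse


def looks_like_uri(value: str) -> bool:
    """Return True if the value is syntactically an absolute http(s) URI.

    Assumption: only http/https identifiers count. Resolvability and
    multilingual labels are NOT checked here.
    """
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def uri_ratio(values: list[str]) -> float:
    """Share of field values that are URIs rather than free text."""
    if not values:
        return 0.0
    return sum(looks_like_uri(v) for v in values) / len(values)
```

A ratio close to 1 for an agent or subject field would suggest that entity-based facets are feasible for that collection; checking resolvability and labels would need network access and is left out of this sketch.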

Page 15: Metadata Quality Assurance Framework at QQML2016 conference - full version

Discovery scenarios and their metadata requirements – Date-based facets

Scenario: I want to be able to filter my results by a variety of timespans, e.g. date of creation, date of publication, date as subject.

Metadata analysis: Dates should be fully and consistently normalised to follow the XSD date-time data types. Dates expressed in styles like “490 avant J.C” that are inherently language dependent should be avoided as they’re very difficult to normalise (e.g. this should be represented as “-0490”^^xsd:gYear).

Measurement rules:
Field values should conform to XSD date-time data types
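
A rough way to apply this rule is to match each value against the lexical form of a few XSD date types. The Python sketch below is an illustration under simplifying assumptions: it covers only `xsd:gYear`, `xsd:gYearMonth` and `xsd:date`, ignores time zone suffixes, and does not check the number of days per month. The names `XSD_PATTERNS` and `xsd_date_type` are hypothetical.

```python
import re

# Simplified lexical patterns (assumption: no time zone suffix,
# no per-month day validation).
XSD_PATTERNS = {
    "xsd:gYear": re.compile(r"-?[0-9]{4,}"),
    "xsd:gYearMonth": re.compile(r"-?[0-9]{4,}-(0[1-9]|1[0-2])"),
    "xsd:date": re.compile(
        r"-?[0-9]{4,}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])"
    ),
}


def xsd_date_type(value: str):
    """Return the name of the first matching XSD type, or None."""
    for name, pattern in XSD_PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return None
```

With these patterns the normalised form "-0490" is recognised as a year, while the language-dependent "490 avant J.C" is reported as non-conforming.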

Page 16: Metadata Quality Assurance Framework at QQML2016 conference - full version

Problem catalog

Catalog of known metadata problems in Europeana

Title contents same as description contents
Systematic use of the same title
Bad string: "empty" (and variants)
Shelfmarks and other identifiers in fields
Creator not an agent name
Absurd geographical location
Subject field used as description field
Unicode U+FFFD (�)
Very short description field
...

Credit: the document was initiated by Timothy Hill, Europeana’s search engineer

Page 17: Metadata Quality Assurance Framework at QQML2016 conference - full version

Problem catalog

Description: Title contents same as description contents
Example: /2023702/35D943DF60D779EC9EF31F5DF...
Motivation: Distorts search weightings
Checking method: Field comparison
Notes: Record display: creator concatenated onto title
Metadata scenario: Basic Retrieval
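
Several catalog entries can be checked with simple field comparisons and string tests. The Python sketch below illustrates four of them: same title and description, the bad string "empty", the U+FFFD character, and a very short description. It assumes a record is a plain dictionary with optional `title` and `description` strings. The problem names and the length threshold are hypothetical choices, not values taken from the Europeana problem catalog.

```python
REPLACEMENT_CHAR = "\ufffd"  # Unicode U+FFFD


def detect_problems(record: dict, min_description_length: int = 10) -> list[str]:
    """Return the names of the catalog problems found in one record.

    Assumptions: the record has optional "title" and "description"
    strings; the length threshold is an arbitrary illustration.
    """
    title = (record.get("title") or "").strip()
    description = (record.get("description") or "").strip()
    problems = []
    # Field comparison, ignoring case differences.
    if title and title.casefold() == description.casefold():
        problems.append("title_same_as_description")
    if any(v.casefold() == "empty" for v in (title, description)):
        problems.append("bad_string_empty")
    if REPLACEMENT_CHAR in title + description:
        problems.append("unicode_replacement_character")
    if description and len(description) < min_description_length:
        problems.append("very_short_description")
    return problems
```

Returning a list of problem names per record keeps the result easy to aggregate per collection.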

Page 18: Metadata Quality Assurance Framework at QQML2016 conference - full version

How to define measurements?

Page 19: Metadata Quality Assurance Framework at QQML2016 conference - full version

Problem catalog – proposed basis of implementation

Shapes Constraint Language (SHACL), https://www.w3.org/TR/shacl/

A language for describing and constraining the contents of RDF graphs. It provides a high-level vocabulary to identify predicates and their associated cardinalities, datatypes and other constraints.

sh:equals, sh:notEquals
sh:hasValue
sh:in
sh:lessThan, sh:lessThanOrEquals
sh:minCount, sh:maxCount
sh:minLength, sh:maxLength
sh:pattern

Page 20: Metadata Quality Assurance Framework at QQML2016 conference - full version

Early measurement results and their visualization

Page 21: Metadata Quality Assurance Framework at QQML2016 conference - full version

overall view, collection view, record view

Completeness – 40 measurements
Field cardinality – 27 measurements
Uniqueness – 6 measurements
Language specification – 20 measurements
Problem catalog – 3 measurements
etc.

links

measurements

aggregated numbers

Page 22: Metadata Quality Assurance Framework at QQML2016 conference - full version

Completeness

What is the ratio of populated fields in records?
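
As a minimal illustration of this ratio, the Python sketch below counts how many of the expected fields have at least one non-empty value. It assumes a record maps field names to lists of values. The function name `completeness` and the field names in the usage note are hypothetical examples.

```python
def completeness(record: dict, schema_fields: list[str]) -> float:
    """Ratio of schema fields that have at least one non-empty value.

    Assumption: the record maps field names to lists of values.
    """
    if not schema_fields:
        return 0.0
    populated = 0
    for field in schema_fields:
        values = record.get(field) or []
        # A field counts only if some value is not blank.
        if any(str(v).strip() for v in values):
            populated += 1
    return populated / len(schema_fields)
```

For a record with a title and a type but an empty description and no alternative title, `completeness(record, ["dc:title", "dcterms:alternative", "dc:description", "dc:type"])` gives 0.5.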

Page 23: Metadata Quality Assurance Framework at QQML2016 conference - full version

Field frequency / main

Page 24: Metadata Quality Assurance Framework at QQML2016 conference - full version

Field frequency / main

Alternative title is a rare field

Page 25: Metadata Quality Assurance Framework at QQML2016 conference - full version

Field frequency per collections / all

no record has alternative title

every record has alternative title

Page 26: Metadata Quality Assurance Framework at QQML2016 conference - full version

Field frequency per collections / remove no-instances

Page 27: Metadata Quality Assurance Framework at QQML2016 conference - full version

Field frequency per collections / display only complete collections

Page 28: Metadata Quality Assurance Framework at QQML2016 conference - full version

Cardinality

How many field instances are in the records?
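
Counting field instances per record and summarising the counts could look like the Python sketch below. It assumes each record maps field names to lists of values and that at least one record is supplied. The function names and the sample data in the usage note are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def field_cardinalities(records: list[dict], field: str) -> list[int]:
    """Number of instances of `field` in each record (0 if absent)."""
    return [len(r.get(field) or []) for r in records]


def cardinality_summary(records: list[dict], field: str) -> dict:
    """Aggregate statistics; expects a non-empty list of records."""
    counts = field_cardinalities(records, field)
    return {
        "total_instances": sum(counts),
        "records": len(counts),
        "median": median(counts),
        "mean": mean(counts),
        "max": max(counts),
        # Maps a cardinality to the number of records having it.
        "histogram": dict(Counter(counts)),
    }
```

For three made-up records with 0, 0 and 3 subjects, the summary reports a median of 0, a mean of 1 and a maximum of 3, which shows how a few heavily populated records can pull the mean above the median.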

Page 29: Metadata Quality Assurance Framework at QQML2016 conference - full version

Field cardinality – overview

more field instances than records

number of records

Page 30: Metadata Quality Assurance Framework at QQML2016 conference - full version

Field cardinality – overview

dc:type

Page 31: Metadata Quality Assurance Framework at QQML2016 conference - full version

Field cardinality – histogram

128 subjects in one record

median is 0, mean is close to 1

link to interesting records

Page 32: Metadata Quality Assurance Framework at QQML2016 conference - full version

Field cardinality – an outlier

Page 33: Metadata Quality Assurance Framework at QQML2016 conference - full version

Multilinguality

Do we know the language of a field value?

Page 34: Metadata Quality Assurance Framework at QQML2016 conference - full version

Multilinguality

@resource is a URI

@ = language notation in RDF

no language specification
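
The three categories on this slide (a URI, a value with a language tag, a value without one) can be counted per field. The Python sketch below assumes a JSON-like shape for field instances, described in the docstring. That shape, the `@value` key and the function names are assumptions made for illustration and may differ from the actual Europeana record format.

```python
from collections import Counter


def language_category(instance) -> str:
    """Classify one field instance by how its language is specified.

    Assumed shapes (hypothetical, for illustration only):
      {"@resource": "http://..."}       -> a URI, no literal text
      {"@lang": "en", "@value": "..."}  -> literal with a language tag
      {"@value": "..."} or a plain str  -> literal without language
    """
    if isinstance(instance, dict):
        if "@resource" in instance:
            return "resource"
        if instance.get("@lang"):
            return instance["@lang"]
    return "no language"


def language_frequency(instances) -> Counter:
    """Count field instances per language category."""
    return Counter(language_category(i) for i in instances)
```

The resulting counts are the kind of input a language frequency bar chart or treemap needs.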

Page 35: Metadata Quality Assurance Framework at QQML2016 conference - full version

Language frequency / barchart

Page 36: Metadata Quality Assurance Framework at QQML2016 conference - full version

Language frequency / barchart

same language, different encodings

Page 37: Metadata Quality Assurance Framework at QQML2016 conference - full version

Language frequency / Treemap

has language specification

has no language specification

Page 38: Metadata Quality Assurance Framework at QQML2016 conference - full version

Language frequency / Treemap with resources

has no language specification

has language specification

is a URI

Page 39: Metadata Quality Assurance Framework at QQML2016 conference - full version

Language frequency / Treemap + interaction + table

hide/display categories

table-like format

Page 40: Metadata Quality Assurance Framework at QQML2016 conference - full version

Uniqueness (entropy)

How unique are the terms in a field?

Page 41: Metadata Quality Assurance Framework at QQML2016 conference - full version

Entropy – term uniqueness / main

1 means a unique term
0.0000x means a very frequent term

These are cumulative numbers: $\mathrm{entropy}_{\mathrm{cumulative}} = \mathrm{term}_1 + \dots + \mathrm{term}_n$

Page 42: Metadata Quality Assurance Framework at QQML2016 conference - full version

Entropy – term uniqueness / collection

max is exceptional (=1425 * mean)

unique records

non-unique or less unique records

Page 43: Metadata Quality Assurance Framework at QQML2016 conference - full version

Entropy – term uniqueness / refining the picture

bulk of records are close to zero

although 25% are between 0.05 and 1.25

Page 44: Metadata Quality Assurance Framework at QQML2016 conference - full version

Entropy – term uniqueness / field value

Russian text transliterated into the Latin writing system, not in Cyrillic

Page 45: Metadata Quality Assurance Framework at QQML2016 conference - full version

Entropy – term uniqueness / terms

explanation of uniqueness score

TF-IDF values come from Apache Solr

term frequency: 1
document freq.: 2
uniqueness score: 0.5
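
The numbers in this example are consistent with a score of term frequency divided by document frequency (1 / 2 = 0.5). The Python sketch below works under that assumption and adds up the per-term scores into a cumulative value for a field. How Solr derives the figures is not shown in the slides, and the function names are hypothetical.

```python
def term_uniqueness(term_frequency: int, document_frequency: int) -> float:
    """Uniqueness of one term as tf / df (an assumed reading of the slide)."""
    if document_frequency <= 0:
        raise ValueError("document_frequency must be positive")
    return term_frequency / document_frequency


def cumulative_uniqueness(terms: dict[str, tuple[int, int]]) -> float:
    """Sum the per-term scores of one field value.

    `terms` maps each term to (term_frequency, document_frequency).
    """
    return sum(term_uniqueness(tf, df) for tf, df in terms.values())
```

Under this reading, a term that occurs in only one document scores 1, and a term shared by many documents scores close to 0.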

Page 46: Metadata Quality Assurance Framework at QQML2016 conference - full version

Problem catalog

Does the record have any specific issues?

Page 47: Metadata Quality Assurance Framework at QQML2016 conference - full version

Problem catalog – Long subject

a record with 265 „long” subject headings

Page 48: Metadata Quality Assurance Framework at QQML2016 conference - full version

Problem catalog – Long subject – example (not so long...)

Conclusion: we have to refine the definition of „long”

Page 49: Metadata Quality Assurance Framework at QQML2016 conference - full version

Problem catalog – same title and description

there is one title and one description, and they are the same

... and we have 9 such records

Page 50: Metadata Quality Assurance Framework at QQML2016 conference - full version

Problem catalog – same title and description – example

Page 51: Metadata Quality Assurance Framework at QQML2016 conference - full version

Completeness sub-dimensions

Are the sub-dimensions (field groups supporting specific functionalities) complete?
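
One way to picture a sub-dimension is as a named group of fields with its own completeness score. In the Python sketch below, the mapping `SUB_DIMENSIONS`, its functionality names and its field lists are invented for illustration. The real field groups used by the framework are not given in the slides.

```python
# Hypothetical mapping from a functionality to the fields it depends on.
SUB_DIMENSIONS = {
    "searchability": ["dc:title", "dc:description", "dc:subject"],
    "multilinguality": ["dc:title", "dc:language"],
    "browsing": ["dc:type", "dc:date", "dc:subject"],
}


def sub_dimension_scores(record: dict, sub_dimensions: dict = SUB_DIMENSIONS) -> dict:
    """Per functionality: share of its supporting fields present in the record.

    Assumption: the record maps field names to lists of values, and an
    empty list counts as missing.
    """
    scores = {}
    for name, fields in sub_dimensions.items():
        present = sum(1 for f in fields if record.get(f))
        scores[name] = present / len(fields)
    return scores
```

The per-field present or missing flags behind these scores are what a functionality matrix in a record view would display.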

Page 52: Metadata Quality Assurance Framework at QQML2016 conference - full version

Record view – functionality matrix

existing

missing

functionalities

Page 53: Metadata Quality Assurance Framework at QQML2016 conference - full version

miscellaneous

Page 54: Metadata Quality Assurance Framework at QQML2016 conference - full version

Other elements of the record view

Page 55: Metadata Quality Assurance Framework at QQML2016 conference - full version

Further steps

Incorporating into Europeana’s ingestion tool
Process usage statistics (logs, Google Analytics)
Human evaluation of metadata quality
Measuring timeliness (changes of scores over time)
Machine learning based classification & clustering
Incorporating into research data management tool
Cooperation with other projects

Page 56: Metadata Quality Assurance Framework at QQML2016 conference - full version

Project principles

Scalable, ready for big data
Loose coupling to metadata schemas
Transparency: open source, open data (CC0)
Release early, release often
Getting real [1]
Collaboration and communication

[1] https://gettingreal.37signals.com/

Page 57: Metadata Quality Assurance Framework at QQML2016 conference - full version

Architectural overview

Diagram components: OAI-PMH client (PHP); Apache Spark (Java); Analysis with Spark (Scala); Analysis with R; Web interface (PHP, d3.js)

Storage: Hadoop File System, Apache Solr, Apache Cassandra

Data exchanged between components: JSON files, CSV files, image files

Legend: recent workflow, planned workflow

Page 58: Metadata Quality Assurance Framework at QQML2016 conference - full version

Follow me

Europeana Data Quality Committee http://pro.europeana.eu/europeana-tech/data-quality-committee

research plan and blog http://pkiraly.github.io

site http://144.76.218.178/europeana-qa/

source code https://github.com/pkiraly/europeana-qa-spark https://github.com/pkiraly/europeana-qa-r

@kiru, https://www.linkedin.com/in/peterkiraly