cbs cedar presentation

CEDAR From fragment to fabric - Dutch census data in a web of global cultural and historic information http://cedar-project.nl/ Ashkan Ashkpour Albert Meroño Peñuela

Upload: albert-merono-penuela

Post on 13-Jul-2015


TRANSCRIPT

Page 1: CBS CEDAR Presentation

CEDAR

From fragment to fabric - Dutch census data in a web of global cultural and historic information

http://cedar-project.nl/

Ashkan Ashkpour

Albert Meroño Peñuela

Page 2: CBS CEDAR Presentation

Affiliations

Page 3: CBS CEDAR Presentation

Double purpose of Linked Census Data

Improve information retrieval for the general public (incl. lay experts, students, researchers)

Create new data sources and possibly new research practices in social history, historic demography, general history, …

Create immediate access to digitized Dutch census data

Semantic modeling of Dutch census data

Further enriching of statistical information with context

Page 4: CBS CEDAR Presentation

The Historical Census Use Case

2011-2015

Page 5: CBS CEDAR Presentation

Historical Censuses

Page 6: CBS CEDAR Presentation

Historical Censuses

Page 7: CBS CEDAR Presentation

End of the door-to-door censuses

Page 8: CBS CEDAR Presentation

• Historical statistical data: a rich source of social, economic and demographic information

• A relatively untapped source of information: most research focuses on a specific year or a subset instead of time series

• “... specific information about a nation’s population characteristics and needs at a given time in history, providing invaluable snapshots of the state of a nation”

Page 9: CBS CEDAR Presentation

Digitization Efforts – 1996

Cooperation between CBS and NIWI

Page 10: CBS CEDAR Presentation

Conversion Process

Page 11: CBS CEDAR Presentation

Conversion Process

Page 12: CBS CEDAR Presentation

Conversion Process

Page 13: CBS CEDAR Presentation

VT 1869 Plaatselijke indeling NB

Page 14: CBS CEDAR Presentation

The census dataset (1795-1971)

3 types of censuses [Population/Occupation/Housing]

17 census years

Only the aggregated form is left

2,288 census tables

33,283 annotations

17 million characters

Page 15: CBS CEDAR Presentation

Conversion Process

Page 16: CBS CEDAR Presentation

Structural Heterogeneity

How can we maintain the same structure and information?

1-to-1 representation

Page 17: CBS CEDAR Presentation

Using the Layout

Page 18: CBS CEDAR Presentation

Going to RDF

Model: RDF Data Cube, for multi-dimensional statistical data

Supervised conversion

Need to define the layout structures per table
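The layout-driven conversion described on this slide can be illustrated with a minimal sketch. It is not the CEDAR converter: the style names (`ColHeader`, `RowHeader`, `Data`), the toy table and the function `cells_to_records` are assumptions made up for illustration. The idea shown is only that, once each cell is marked with its role, every data cell can be exported together with the headers that give it meaning.

```python
from typing import Dict, List, Tuple

# Hypothetical style markers; the styling vocabulary used in CEDAR may differ.
COL_HEADER, ROW_HEADER, DATA = "ColHeader", "RowHeader", "Data"

Cell = Tuple[str, str]  # (text, style)


def cells_to_records(grid: List[List[Cell]]) -> List[Dict[str, object]]:
    """Turn every data cell into one record that keeps its row and column headers."""
    records = []
    for r, row in enumerate(grid):
        for c, (text, style) in enumerate(row):
            if style != DATA:
                continue
            # Row headers: every row-header cell found in the same row.
            row_headers = [t for t, s in row if s == ROW_HEADER]
            # Column headers: every column-header cell above, in the same column.
            col_headers = [
                grid[i][c][0] for i in range(r) if grid[i][c][1] == COL_HEADER
            ]
            records.append({
                "cell": (r, c),
                "row_headers": row_headers,
                "col_headers": col_headers,
                "value": int(text),
            })
    return records


# Toy table, invented for illustration only.
toy_table = [
    [("", COL_HEADER), ("Men", COL_HEADER), ("Women", COL_HEADER)],
    [("Teachers", ROW_HEADER), ("12", DATA), ("7", DATA)],
    [("Bakers", ROW_HEADER), ("30", DATA), ("2", DATA)],
]

records = cells_to_records(toy_table)
for record in records:
    print(record)
```

Keeping the cell coordinates in each record is what makes a 1-to-1 representation possible: the original table can be rebuilt from the records.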

Page 19: CBS CEDAR Presentation

Going to RDF

Styling of 2,288 tables

Training and conversion at DANS

Thanks to Michael and Jetske

Page 20: CBS CEDAR Presentation
Page 21: CBS CEDAR Presentation

RDF Statistics

310,585,567 total triples

389,132 hierarchical row headers

17,960,911 data cells

61,110 column headers

3,609 row properties

3,150 titles

1,581,546 row headers

274,404 metadata cells

See http://lod.cedar-project.nl/cedar/data.html

Page 22: CBS CEDAR Presentation

Everything in one system: what does this mean?

No separate files

Insight into the number of variables

Availability of variables (preliminary analyses)

Straightforward harmonizations

Systematic data check

Visualizations

Other debugging purposes

Page 23: CBS CEDAR Presentation

Examples on raw data

Page 24: CBS CEDAR Presentation

Examples on raw data

Number of teachers

Number of married women

Page 25: CBS CEDAR Presentation

Three tier model

Raw data is filled

Annotations

Harmonization layer

Page 26: CBS CEDAR Presentation

Enriching the Harmonization layer

Cleaning and correcting

Standardizing variables and values

Mappings

Connecting to existing classification systems:

HISCO (historical occupations)

Amsterdam Code (historical municipalities)

SDMX (demographic variables)

Creating variables and bottom-up classification systems (religious denominations, housing types, occupations, age ranges, …)

Key: bringing all these practices together
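Standardizing values and mapping them to an existing code list can be sketched as a lookup table. This is a minimal sketch under stated assumptions: the raw Dutch labels and the names `SEX_MAPPING`, `normalize` and `harmonize` are invented for illustration, and only the target codes `sdmx-code:sex-M` and `sdmx-code:sex-F` come from the SDMX code list used elsewhere in the deck.

```python
from typing import Dict, Optional

# Hypothetical raw labels as they might appear in different census tables.
# The target codes are SDMX sex codes.
SEX_MAPPING: Dict[str, str] = {
    "m": "sdmx-code:sex-M",
    "mannen": "sdmx-code:sex-M",
    "mannelijk": "sdmx-code:sex-M",
    "v": "sdmx-code:sex-F",
    "vrouwen": "sdmx-code:sex-F",
    "vrouwelijk": "sdmx-code:sex-F",
}


def normalize(label: str) -> str:
    """Clean a raw label: trim whitespace, lowercase, drop trailing punctuation."""
    return label.strip().lower().rstrip(".:;")


def harmonize(label: str, mapping: Dict[str, str]) -> Optional[str]:
    """Return the standard code for a raw label, or None when no mapping exists."""
    return mapping.get(normalize(label))


print(harmonize(" Mannen. ", SEX_MAPPING))
print(harmonize("Vrouwelijk", SEX_MAPPING))
print(harmonize("onbekend", SEX_MAPPING))
```

Returning `None` for unmapped labels keeps the gaps visible, so they can be handled by a human or by a bottom-up classification step.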

Page 27: CBS CEDAR Presentation

CEDAR goal: cross-query the Dutch historical censuses on the Web

?

1795 1830 1889 1930 1971

(aka integrating ~3K disparate tables)
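What a cross-census query could look like is sketched below as a small SPARQL query builder. Several things are assumptions: the `cedar` namespace IRI is a placeholder, the function `build_query` is hypothetical, and the property names are borrowed from the observation example later in the deck. The `qb` and SDMX namespace IRIs are the published ones. The sketch only builds the query text; it does not contact any endpoint.

```python
from typing import Dict

# qb and SDMX namespaces are the published ones. The cedar namespace is a
# placeholder, because the slides do not spell out the real IRI.
PREFIXES: Dict[str, str] = {
    "qb": "http://purl.org/linked-data/cube#",
    "sdmx-dimension": "http://purl.org/linked-data/sdmx/2009/dimension#",
    "sdmx-code": "http://purl.org/linked-data/sdmx/2009/code#",
    "cedar": "http://example.org/cedar#",
}


def build_query(filters: Dict[str, str], group_by: str,
                measure: str = "cedar:population") -> str:
    """Build a SPARQL query that sums a measure over matching observations."""
    header = "\n".join(f"PREFIX {p}: <{iri}>" for p, iri in PREFIXES.items())
    lines = ["  ?obs a qb:Observation ;"]
    # Each filter fixes one dimension to one code.
    for prop, value in filters.items():
        lines.append(f"       {prop} {value} ;")
    # The grouping dimension stays free and becomes the result key.
    lines.append(f"       {group_by} ?group ;")
    lines.append(f"       {measure} ?value .")
    body = "\n".join(lines)
    return (
        f"{header}\n"
        "SELECT ?group (SUM(?value) AS ?total)\n"
        f"WHERE {{\n{body}\n}}\n"
        "GROUP BY ?group"
    )


query = build_query(
    {"sdmx-dimension:sex": "sdmx-code:sex-F"},
    group_by="sdmx-dimension:refArea",
)
print(query)
```

Because all harmonized tables share the same dimensions, one query of this shape can span every census year instead of one query per table.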

Page 28: CBS CEDAR Presentation

• Web publishable

• Machine processable

• Dynamic schema

• Easily link with other datasets

Page 29: CBS CEDAR Presentation

Why Semantic Web Technology?

To W3C

Web publishable

Web exchangeable

Human & machine readable

Provide interesting links

To us

Finer granularity level (cell level)

Statistical comparability by leveraging semantic descriptions

Provenance

Harmonization through linkage to other datasets

Page 30: CBS CEDAR Presentation

Towards 5-star Census Data

Page 31: CBS CEDAR Presentation

Towards 5-star Census Data

>2 years ago

2 years ago

Page 32: CBS CEDAR Presentation

“There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that they can be linked to related data sets and concepts.”

Page 33: CBS CEDAR Presentation

RDF Data Cube vocabulary (QB)

Page 34: CBS CEDAR Presentation

RDF Data Cube vocabulary (QB)

Page 35: CBS CEDAR Presentation

RDF Data Cube vocabulary (QB)

• SDMX compatible

• Defines cubes as a set of observations that consist of dimensions, measures and attributes

• Dimensions: time period, region, sex (qb:DimensionProperty)

• Measure: population life expectancy (qb:MeasureProperty)

• Attribute: unit of measure = years, metadata status = measured (qb:AttributeProperty)

Observation: “the measured life expectancy of males in Newport in the period 2004-2006 is 76.7 years”
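The split into dimensions, measure and attributes can be made concrete with a small data structure. This is an illustrative sketch only: the `Observation` class and its field names are assumptions, not part of the QB vocabulary, and the values are those of the Newport example above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Observation:
    """One cube observation: dimensions locate it, the measure is the value,
    attributes qualify how the value should be read."""
    dimensions: Dict[str, str]
    measure_name: str
    measure_value: float
    attributes: Dict[str, str] = field(default_factory=dict)

    def describe(self) -> str:
        """Render the observation as a single readable line."""
        where = ", ".join(f"{k}={v}" for k, v in self.dimensions.items())
        unit = self.attributes.get("unit of measure", "")
        return f"{self.measure_name} [{where}] = {self.measure_value} {unit}".strip()


obs = Observation(
    dimensions={"time period": "2004-2006", "region": "Newport", "sex": "male"},
    measure_name="life expectancy",
    measure_value=76.7,
    attributes={"unit of measure": "years", "metadata status": "measured"},
)
print(obs.describe())
```

Two observations are comparable when they share the same dimension and measure properties, which is why reusing those properties across datasets matters.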

Page 36: CBS CEDAR Presentation

CEDAR Integrator

https://github.com/CEDAR-project/Integrator

http://lod.cedar-project.nl/cedar/data.html

Page 37: CBS CEDAR Presentation

http://lod.cedar-project.nl/cedar/stats.html

Page 38: CBS CEDAR Presentation

http://lod.cedar-project.nl/maps/

Page 39: CBS CEDAR Presentation

Dimension Reusability

cedar:BRT_1889_02_T1-S0-K17-h a qb:Observation ;
    cedar:population "12"^^xsd:integer ;
    maritalstatus:maritalStatus maritalstatus:single ;
    cedarterms:occupationPosition cedarterms:job-D ;
    sdmx-dimension:sex sdmx-code:sex-F ;
    cedarterms:occupation hisco:88030 ;
    sdmx-dimension:refArea gg:11150 ;
    cedarterms:belief hreligion:118 ;
    cedarterms:houseType cedar:Klooster ;
    prov:wasDerivedFrom cedar:BRT_1889_08_T1-S0-K17 ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-K17-activity .
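How much of this observation relies on shared vocabularies can be counted with a short sketch. The predicate/object pairs are copied from the slide. Treating every prefix that starts with `cedar` as project-local and everything else as reused is an assumption made here for illustration, as are the helper names `by_vocabulary` and `reused_terms`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Predicate/object pairs copied from the observation on the slide
# (provenance statements left out).
PAIRS: List[Tuple[str, str]] = [
    ("cedar:population", '"12"'),
    ("maritalstatus:maritalStatus", "maritalstatus:single"),
    ("cedarterms:occupationPosition", "cedarterms:job-D"),
    ("sdmx-dimension:sex", "sdmx-code:sex-F"),
    ("cedarterms:occupation", "hisco:88030"),
    ("sdmx-dimension:refArea", "gg:11150"),
    ("cedarterms:belief", "hreligion:118"),
    ("cedarterms:houseType", "cedar:Klooster"),
]


def prefix_of(term: str) -> str:
    """Return the namespace prefix of a prefixed name such as 'hisco:88030'."""
    return term.split(":", 1)[0]


def by_vocabulary(pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group the predicates by the vocabulary prefix they come from."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for predicate, _ in pairs:
        groups[prefix_of(predicate)].append(predicate)
    return dict(groups)


def reused_terms(pairs: List[Tuple[str, str]]) -> List[str]:
    """List predicates and codes whose prefix is not project-local.

    Assumption: prefixes starting with 'cedar' are local, all others are reused.
    """
    terms = set()
    for predicate, obj in pairs:
        for term in (predicate, obj):
            if ":" in term and not prefix_of(term).startswith("cedar"):
                terms.add(term)
    return sorted(terms)


groups = by_vocabulary(PAIRS)
reused = reused_terms(PAIRS)
print(groups)
print(reused)
```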

Page 40: CBS CEDAR Presentation

Dimension Reusability


Page 41: CBS CEDAR Presentation

Dimension Reusability


Page 42: CBS CEDAR Presentation

LSD Dimensions

http://lsd-dimensions.org/

https://github.com/albertmeronyo/LSD-Dimensions

Hourly JSON-LD dumps

Page 43: CBS CEDAR Presentation

What if dimensions aren’t out there?

Need to build them

Input: flat lists of non-standard values

Output: standard concept scheme

Knowledge intensive problem

https://github.com/CEDAR-project/TabCluster
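Going from a flat list of non-standard values to candidate concepts can be sketched with plain string similarity. This is not the TabCluster algorithm, whose details the slides do not give. It is a swapped-in, greedy clustering based on `difflib.SequenceMatcher`, with an arbitrary threshold of 0.8 and invented Dutch occupation labels.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_labels(labels: List[str], threshold: float = 0.8) -> List[List[str]]:
    """Greedily group labels: a label joins the first cluster whose
    representative (its first member) is similar enough, else starts a new one."""
    clusters: List[List[str]] = []
    for raw in labels:
        label = raw.strip().lower()
        for cluster in clusters:
            if similarity(label, cluster[0]) >= threshold:
                cluster.append(label)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([label])
    return clusters


# Invented example values (teacher and baker variants).
raw_values = ["onderwijzer", "bakker", "Onderwijzers", "bakkers", "onderwijzeres"]
clusters = cluster_labels(raw_values)
for cluster in clusters:
    print(cluster)
```

String similarity only catches spelling variants. Deciding that two differently named occupations are the same concept still needs domain knowledge, which is why the problem is called knowledge intensive.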

Page 44: CBS CEDAR Presentation

Concept Drift

Census classification of occupations as of 1859

• Root node is void

• Depth 1: occupation groups

• Leaves: actual occupations

Page 45: CBS CEDAR Presentation

Concept Drift

Census classification of occupations as of 1889

• Root node is void

• Depth 1: occupation groups

• Leaves: actual occupations

Page 46: CBS CEDAR Presentation

Concept Drift

Census classification of occupations as of 1899

• Root node is void

• Depth 1: occupation groups

• Leaves: actual occupations
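Comparing two versions of such a classification tree can be sketched as follows. The sketch uses a deliberately crude proxy: a depth-1 occupation group counts as changed when its set of leaves differs between versions. That is a simplification chosen for illustration, not the definition of concept drift used in the project, and the toy trees and the function `group_status` are invented.

```python
from typing import Dict, Set

# occupation group (depth 1) -> set of actual occupations (leaves)
Tree = Dict[str, Set[str]]


def group_status(old: Tree, new: Tree) -> Dict[str, str]:
    """Label each group as 'added', 'removed', 'changed' or 'stable'."""
    status: Dict[str, str] = {}
    for group in sorted(set(old) | set(new)):
        if group not in old:
            status[group] = "added"
        elif group not in new:
            status[group] = "removed"
        elif old[group] != new[group]:
            status[group] = "changed"
        else:
            status[group] = "stable"
    return status


# Toy versions, invented for illustration only.
version_a: Tree = {
    "education": {"teacher"},
    "food": {"baker", "butcher"},
}
version_b: Tree = {
    "education": {"teacher", "governess"},
    "food": {"baker", "butcher"},
    "transport": {"railway worker"},
}

status = group_status(version_a, version_b)
print(status)
```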

Page 47: CBS CEDAR Presentation

Concept Drift

RQ: Can we use past knowledge to predict when and where concept drift will happen in an ontology?

Theoretical framework: [1]

Data: a number of ontology versions

Method: supervised learning [2]

Features: structural, membership, usage [3]

Results: f-measures of 0.84, 0.93, 0.79

https://github.com/albertmeronyo/ConceptDrift

[1] Shenghui Wang, Stefan Schlobach, Michel Klein. “What is Concept Drift and How to Identify It?”. EKAW 2010.

[2] Pesquita C, Couto FM (2012) Predicting the Extension of Biomedical Ontologies. PLoS Comput Biol 8(9): e1002630.

[3] Ljiljana Stojanovic. “Methods and Tools for Ontology Evolution” (2004).
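For readers unfamiliar with the metric behind the reported results, the f-measure is the harmonic mean of precision and recall. The sketch below computes it for a binary "will this concept drift?" prediction. The labels are invented toy data and do not reproduce the 0.84, 0.93 or 0.79 figures.

```python
from typing import Sequence


def f_measure(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 score: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        # No correct positive prediction: precision or recall is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: True means the concept drifted in the next ontology version.
actual = [True, True, True, False, False, True]
predicted = [True, True, False, False, True, True]

score = f_measure(actual, predicted)
print(round(score, 2))
```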

Page 48: CBS CEDAR Presentation

Compatibility? Remixability?

Reusability?

Sarven Capadisli, Albert Meroño-Peñuela, Sören Auer, Reinhard Riedl. “Semantic Similarity and Correlation of Linked Statistical Data Analysis”. 2nd Int. Workshop on Semantic Statistics (SemStats), ISWC 2014.

Page 49: CBS CEDAR Presentation

Summary

RDF Data Cube: publishing and integrating multi-dimensional data in the Semantic Web

Dutch historical censuses (increasingly) published and queryable online

Discoverability, reusability and remixability of dimensions are important

Bottom-up concept scheme generation is only semi-automatable

Concept drift (or concept stability) can be predicted accurately if enough historical data is available

Semantic representations can provide insight into statistical correlation

Page 50: CBS CEDAR Presentation

CEDAR Integrator

https://github.com/CEDAR-project/Integrator

LSD Dimensions

http://lsd-dimensions.org/

TabCluster

https://github.com/CEDAR-project/TabCluster

Concept Drift

https://github.com/albertmeronyo/ConceptDrift

Semantic Correlation

http://csarven.ca/sense-of-lsd-analysis

http://www.cedar-project.nl/

Thank you