cbs cedar presentation

From fragment to fabric - Dutch census data in a web of global cultural and historic information

http://cedar-project.nl/

Ashkan Ashkpour

Albert Meroño Peñuela

Affiliations

Double purpose of Linked Census Data

Improve information retrieval for the general public (incl. lay experts, students, researchers)

Create new data sources and possibly new research practices in social history, historic demography, general history, …

Create immediate access to digitized Dutch census data

Semantic modeling of Dutch census data

Further enriching of statistical information with context

The Historical Census Use Case

2011-2015

Historical Censuses

End of the door-to-door censuses

• Source of historical statistical data, providing a rich source of social, economic and demographic data

• A relatively untapped source of information / Most research focuses on a specific year or a subset instead of time series

• “..specific information about a nation's population characteristics and needs at a given time in history, providing invaluable snapshots of the state of a nation”

Digitization Efforts – 1996

Cooperation between CBS and NIWI

Conversion Process

VT 1869 Plaatselijke indeling NB

The census dataset (1795-

3 Types of Censuses

[Population/Occupation/Housing]

17 census years

Only the aggregated form is left

2288 census tables

33283 annotations

17 million characters

Conversion Process

Structural Heterogeneity

How can we maintain the same structure and information?

1 on 1 representation

Using the Layout

Going to RDF

Model: RDF Data Cube > multi-dimensional statistical data

Supervised conversion

Need to define the layout structures per table
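To illustrate what defining the layout structure per table could amount to, here is a minimal sketch in which every cell carries a style tag and the tags decide how the cell is interpreted. The style names (`Title`, `ColHeader`, `RowHeader`, `Data`), the sample labels and the `extract_cells` helper are assumptions made for this example, not the conversion tool's actual conventions.

```python
# Each cell is (text, style). Style names and values are hypothetical.
table = [
    [("", "Title"), ("Mannen", "ColHeader"), ("Vrouwen", "ColHeader")],
    [("Amsterdam", "RowHeader"), ("10", "Data"), ("12", "Data")],
]

def extract_cells(table):
    """Pair every data cell with its row header and column header."""
    col_headers = {}   # column index -> header text
    records = []
    for row in table:
        row_header = None
        for col, (text, style) in enumerate(row):
            if style == "ColHeader":
                col_headers[col] = text
            elif style == "RowHeader":
                row_header = text
            elif style == "Data":
                records.append({
                    "row": row_header,
                    "column": col_headers.get(col),
                    "value": int(text),
                })
    return records

records = extract_cells(table)
```

Each resulting record keeps a one-to-one link to a source cell, which is the property the conversion aims to preserve.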

Going to RDF

Styling of 2,288 tables

Training and conversion at DANS

Thanks to Michael and Jetske

RDF Statistics

310,585,567 total triples

389,132 hierarchical row headers

17,960,911 data cells

61,110 column headers

3,609 row properties

3,150 titles

1,581,546 row headers

274,404 metadata cells

See http://lod.cedar-project.nl/cedar/data.html

Everything in one system: What does this mean?

No separate files

Insights in # of variables

Availability of variables (preliminary analyses)

Straightforward harmonizations

Systematic data check

Visualizations

Other debugging purposes

Examples on raw data

Number of teachers

Number of married women

Three tier model

Raw data is filled

Annotations

Harmonization layer

Enriching the Harmonization layer

Cleaning and correcting

Standardizing variables and values..

Mappings

Connecting to existing classification systems:

HISCO (historical occupations)

Amsterdam Code (historical municipalities)

SDMX (demographic variables)

Creating variables, bottom-up classification systems (religious denominations, housing types, occupations, age ranges, ...)
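The mapping step can be pictured as a lookup from raw source labels to codes in an existing classification. The sketch below is illustrative only: the raw labels, the `RAW_TO_STANDARD` table and both helper functions are hypothetical, and only the `sdmx-code:sex-*` code style is taken from the observation example shown later in this deck.

```python
from typing import Optional

# Hypothetical lookup from raw census labels to standard codes.
# The raw labels on the left are invented for illustration.
RAW_TO_STANDARD = {
    ("sex", "mannen"): "sdmx-code:sex-M",
    ("sex", "m."): "sdmx-code:sex-M",
    ("sex", "vrouwen"): "sdmx-code:sex-F",
    ("sex", "v."): "sdmx-code:sex-F",
}

def harmonize(variable: str, raw_value: str) -> Optional[str]:
    """Return the standard code for a raw label, or None if unmapped."""
    key = (variable, raw_value.strip().lower())
    return RAW_TO_STANDARD.get(key)

def unmapped(variable, raw_values):
    """List raw values that still need a manual mapping decision."""
    return [v for v in raw_values if harmonize(variable, v) is None]
```

Values returned by `unmapped` are the ones that would need a bottom-up classification rather than a link to an existing scheme.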

Key: bringing all these practices together

CEDAR goal: cross-query the Dutch historical censuses on the Web

1795 1830 1889 1930 1971

(aka integrating ~3K disparate tables)

• Web publishable

• Machine processable

• Dynamic schema

• Easily link with other datasets

Why Semantic Web Technology?

To W3C

Web publishable

Web exchangeable

Human & machine readable

Provide interesting links

Finer granularity level (cell level)

Statistical comparability by leveraging semantic descriptions

Provenance

Harmonization through linkage to other datasets

Towards 5-star Census Data

>2 years ago

2 years ago

“There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that they can be linked to related data sets and concepts.”

RDF Data Cube vocabulary (QB)

• SDMX compatible

• Defines cubes as a set of observations that consist of dimensions, measures and attributes

• Dimensions: time period, region, sex (qb:DimensionProperty)

• Measure: population life expectancy (qb:MeasureProperty)

• Attribute: unit of measure = years, metadata status = measured (qb:AttributeProperty)

Observation: “the measured life expectancy of males in Newport in the period 2004-2006 is 76.7 years”
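As a rough illustration of how one observation combines dimensions, a measure and attributes, the sketch below models the life-expectancy example as a small Python structure and prints it in a Turtle-like form. The `Observation` class and every `ex:` name are assumptions made for this example; only `qb:Observation`, `sdmx-dimension:sex` and the `sdmx-code:sex-*` codes come from existing vocabularies.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """One cell of a cube: dimension values plus one measured value."""
    uri: str
    dimensions: dict              # dimension property -> value
    measure: tuple                # (measure property, literal)
    attributes: dict = field(default_factory=dict)

    def to_turtle(self) -> str:
        """Serialize as one subject with ';'-separated predicate-object pairs."""
        lines = [f"{self.uri} a qb:Observation"]
        for prop, value in self.dimensions.items():
            lines.append(f"    {prop} {value}")
        prop, literal = self.measure
        lines.append(f"    {prop} {literal}")
        for prop, value in self.attributes.items():
            lines.append(f"    {prop} {value}")
        return " ;\n".join(lines) + " ."

# Hypothetical ex: names stand in for a real dataset's properties.
obs = Observation(
    uri="ex:obs1",
    dimensions={
        "ex:refPeriod": '"2004-2006"',
        "ex:refArea": "ex:Newport",
        "sdmx-dimension:sex": "sdmx-code:sex-M",
    },
    measure=("ex:lifeExpectancy", '"76.7"^^xsd:decimal'),
    attributes={"ex:unitOfMeasure": "ex:years"},
)
print(obs.to_turtle())
```

The dimensions identify the cell, the measure carries the number, and the attribute qualifies how the number should be read.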

CEDAR Integrator

https://github.com/CEDAR-project/Integrator

http://lod.cedar-project.nl/cedar/data.html

http://lod.cedar-project.nl/cedar/stats.html

http://lod.cedar-project.nl/maps/

Dimension Reusability

cedar:BRT_1889_02_T1-S0-K17-h a qb:Observation ;
    # measure
    cedar:population "12"^^xsd:integer ;
    # dimensions, reusing existing vocabularies where available
    maritalstatus:maritalStatus maritalstatus:single ;
    cedarterms:occupationPosition cedarterms:job-D ;
    sdmx-dimension:sex sdmx-code:sex-F ;
    cedarterms:occupation hisco:88030 ;
    sdmx-dimension:refArea gg:11150 ;
    cedarterms:belief hreligion:118 ;
    cedarterms:houseType cedar:Klooster ;
    # provenance
    prov:wasDerivedFrom cedar:BRT_1889_08_T1-S0-K17 ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-K17-activity .
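Because observations from different tables reuse the same dimension properties and codes, one filter can select across all of them. The sketch below shows this idea on plain Python dictionaries; the `select` and `total_population` helpers, the table names and all counts other than the 12 from the observation above are invented for illustration, and a real deployment would more likely run a SPARQL query against the published data.

```python
def select(observations, constraints):
    """Return observations whose dimensions match every constraint."""
    return [
        obs for obs in observations
        if all(obs.get(dim) == val for dim, val in constraints.items())
    ]

def total_population(observations, constraints):
    """Sum the population measure over the matching observations."""
    return sum(obs["cedar:population"] for obs in select(observations, constraints))

# Invented sample: two source tables sharing the same dimension codes.
observations = [
    {"table": "tableA", "sdmx-dimension:sex": "sdmx-code:sex-F",
     "sdmx-dimension:refArea": "gg:11150", "cedar:population": 12},
    {"table": "tableB", "sdmx-dimension:sex": "sdmx-code:sex-F",
     "sdmx-dimension:refArea": "gg:11150", "cedar:population": 30},
    {"table": "tableB", "sdmx-dimension:sex": "sdmx-code:sex-M",
     "sdmx-dimension:refArea": "gg:11150", "cedar:population": 25},
]

women = total_population(observations, {"sdmx-dimension:sex": "sdmx-code:sex-F"})
```

The filter never mentions a table name: the shared dimension codes are what make the tables comparable.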

LSD Dimensions

http://lsd-dimensions.org/

https://github.com/albertmeronyo/LSD-Dimensions

Hourly JSON-LD dumps

What if dimensions aren’t out there?

Need to build them

Input: flat lists of non-standard values

Output: standard concept scheme

Knowledge intensive problem

https://github.com/CEDAR-project/TabCluster
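One simple way to picture the step from a flat list of non-standard values to a concept scheme is to group spelling variants by string similarity and let an expert review the groups. The sketch below does this with Python's standard `difflib`; it is not TabCluster's algorithm, and the sample labels, the 0.8 threshold and the function names are assumptions made for this example.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_values(values, threshold=0.8):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for value in values:
        for cluster in clusters:
            if similarity(value, cluster[0]) >= threshold:
                cluster.append(value)
                break
        else:
            clusters.append([value])
    return clusters

def to_concept_scheme(clusters):
    """Use the first member as preferred label, the rest as alternative labels."""
    return [{"prefLabel": c[0], "altLabels": c[1:]} for c in clusters]

# Invented sample values.
raw = ["Roomsch Katholiek", "Roomsch-Katholiek", "Roomsch Katholieken",
       "Israeliet", "Israelieten"]
scheme = to_concept_scheme(cluster_values(raw))
```

String similarity only catches spelling variants; deciding that two differently named values denote the same concept still needs domain knowledge, which is why the problem is knowledge intensive.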

Concept Drift

Census classification of occupations as for

• Root node is void

• Depth 1: occupation groups

• Leaves: actual occupations

Concept Drift

RQ: Can we use past knowledge to predict when and where concept drift will happen in an ontology?

Theoretical framework: [1]

Data: a number of ontology versions

Method: supervised learning [2]

Features: structural, membership, usage [3]

Results: f-measures of 0.84, 0.93, 0.79

https://github.com/albertmeronyo/ConceptDrift

[1] Shenghui Wang, Stefan Schlobach, Michael Klein. “What is Concept Drift and How to Identify It?”. EKAW 2010.

[2] Pesquita C, Couto FM (2012) Predicting the Extension of Biomedical Ontologies. PLoS Comput Biol 8(9): e1002630.

[3] Ljiljiana Stojanovic. “Methods and Tools for Ontology Evolution” (2004).
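Two pieces of this setup are easy to sketch: a structural feature computed for a concept across two ontology versions, and the F-measure used to report results. The code below is a minimal illustration under assumptions, not the authors' implementation: each ontology version is represented as a hypothetical dictionary from concept to its set of children, and membership and usage features are left out.

```python
def structural_features(concept, old_version, new_version):
    """Compare a concept's direct children between two ontology versions."""
    old_children = old_version.get(concept, set())
    new_children = new_version.get(concept, set())
    return {
        "children_before": len(old_children),
        "children_after": len(new_children),
        "added": len(new_children - old_children),
        "removed": len(old_children - new_children),
    }

def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Invented example: an occupation group gains one leaf and loses another.
v1 = {"group-1": {"smid", "bakker"}}
v2 = {"group-1": {"smid", "machinist"}}
features = structural_features("group-1", v1, v2)
```

Feature vectors like `features`, labeled with whether the concept changed in the next version, are the kind of input a supervised classifier would be trained on.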

Concept Drift

Compatibility? Remixability?

Reusability?

Sarven Capadisli, Albert Meroño-Peñuela, Sören Auer, Reinhard Riedl. “Semantic Similarity and Correlation of Linked Statistical Data Analysis”. 2nd Int. Workshop on Semantic Statistics (SemStats) ISWC 2014.

Summary

RDF Data Cube: publishing and integrating multi-dimensional data in the Semantic Web

Dutch historical censuses (increasingly) published and queryable online

Discoverability, reusability and remixability of dimensions are important

Bottom-up concept scheme generation only semi-automatable

Concept drift (or concept stability) can be predicted accurately if enough historical data is available

Semantic representations can provide insight into statistical correlation

CEDAR Integrator

https://github.com/CEDAR-project/Integrator

LSD Dimensions

http://lsd-dimensions.org/

TabCluster

https://github.com/CEDAR-project/TabCluster

Concept Drift

https://github.com/albertmeronyo/ConceptDrift

Semantic Correlation

http://csarven.ca/sense-of-lsd-analysis

http://www.cedar-project.nl/

Thank you
