CEDAR From fragment to fabric - Dutch census data in a web of global cultural and historic information http://cedar-project.nl/ Ashkan Ashkpour Albert Meroño-Peñuela Christophe Gueret

Upload: prelida-project

Post on 01-Jul-2015


DESCRIPTION

By Ashkan Ashkpour, Albert Meroño-Peñuela and Christophe Gueret (http://cedar-project.nl/), presented at the 3rd PRELIDA Consolidation and Dissemination Workshop, Riva, Italy, October 17, 2014. More information about the workshop at: prelida.eu

TRANSCRIPT

Page 1: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

CEDAR

From fragment to fabric - Dutch census data in a web of global cultural and historic information

http://cedar-project.nl/

Ashkan Ashkpour

Albert Meroño-Peñuela

Christophe Gueret

Page 2: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Affiliations

Page 3: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

The Jargon Slide

- Dimensions - Variables

- Codes - Values

- Concept Scheme - Classification System

- Integrator - Mapper / debugger

- QB - DataCube

Page 4: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Ashkan:

Background: Economics and Informatics

PhD Student at: the International Institute of Social History, E-humanities Group, Erasmus University, and affiliated with DANS

PhD Topic: Theory and Practice of Data Harmonization

Last Paper: The Aggregated Dutch Historical Censuses: Harmonization and RDF (Historical Methods)

Page 5: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Presentation Outline

Why Census Project

History of the Census

Current State

Point of take-off

Census data in RDF

Harmonization

Page 6: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Double purpose of Linked Census Data

Improve information retrieval for the general public (incl. lay experts, students, researchers)

Create new data sources and possibly new research practices in social history, historic demography, general history, …

Create immediate access to digitized Dutch census data

Semantic modeling of Dutch census data

Further enriching of statistical information with context

Page 7: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

The Census

(…) the breach between the demographer and the census was complete… Is the role of censuses for historical demographers … over? The census seems to have become less en vogue as a source of demographic research.

Thanks to Jan Kok

Page 8: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

The Census Use Case

2011-2015

“The potential of the census for socio-economic historians and historical demographers is far from exhausted”

Thanks to Jan Kok

Page 9: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

• Source of historical statistical data, providing a rich source of social, economic and demographic data

• A relatively untapped source of information / Most research focuses on a specific year or a subset instead of time series

• Although sometimes lagging behind social reality, it contains specific information about a nation's population characteristics and needs at a given time in history, providing invaluable snapshots of the state of a nation

Page 10: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

The Historical Census Use Case

2011-2015

Page 11: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Door 2 Door

Page 12: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Historical Censuses

Page 13: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Historical Censuses

Page 14: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Historical Censuses

Page 15: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

End of the door-to-door censuses

Page 16: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Background Last Traditional Census

Source: NIWI/CBS - Luuk Schreven

Page 17: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Digitization Efforts - 1997

Cooperation between CBS and NIWI

Page 18: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Conversion Process

Page 19: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Conversion Process

Page 20: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Conversion Process

Page 21: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

VT 1869 Plaatselijke indeling NB (Volkstelling 1869, local division)

Page 22: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

The census dataset (1795-1971)

3 types of censuses [Population/Occupation/Housing]

17 census years

Only the aggregated form is left

~3000 census tables

33,283 annotations

17 million characters

Page 23: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Excel

Page 24: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Conversion Process

Page 25: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Structural Heterogeneity

How can we maintain the same structure and information?

One-to-one representation

Page 26: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Layout

Page 27: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Going to RDF

Model: RDF Data Cube > multidimensional statistical data

Supervised conversion

Need to define the layout structures per table
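
A minimal sketch of what a supervised, layout-driven conversion could look like, assuming the layout of a table is given as "which row holds the column headers, which column holds the row headers". The function names, the `EX` namespace and the toy table are all hypothetical and the cell values are made up; only the `qb:` namespace URI comes from the RDF Data Cube vocabulary. This is not the conversion tool used in the project.

```python
# Sketch only: a hand-written layout tells the converter where headers sit.
QB = "http://purl.org/linked-data/cube#"   # RDF Data Cube namespace
EX = "http://example.org/census/"          # placeholder namespace (assumption)

def table_to_observations(table, header_row=0, header_col=0):
    """Turn every non-empty data cell into one observation with two dimensions."""
    col_headers = table[header_row]
    observations = []
    for r, row in enumerate(table):
        if r == header_row:
            continue
        for c, cell in enumerate(row):
            if c == header_col or cell in ("", None):
                continue
            observations.append({
                "id": f"{EX}obs/r{r}c{c}",
                "row_dimension": row[header_col],
                "col_dimension": col_headers[c],
                "value": int(cell),
            })
    return observations

def to_triples(obs):
    """Render one observation as plain (subject, predicate, object) tuples."""
    return [
        (obs["id"], "rdf:type", QB + "Observation"),
        (obs["id"], EX + "rowHeader", obs["row_dimension"]),
        (obs["id"], EX + "columnHeader", obs["col_dimension"]),
        (obs["id"], EX + "value", obs["value"]),
    ]

# Toy table with invented numbers; the empty cell is skipped.
table = [
    ["Municipality", "Men", "Women"],
    ["A", "10", "12"],
    ["B", "7", ""],
]
observations = table_to_observations(table)
```

Keeping the row and column position in the observation identifier is one way to preserve a one-to-one link back to the original cell.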

Page 28: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Going to RDF

Styling of 2,288 tables

Training and conversion at DANS

Thanks to Michael and Jetske

Page 29: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information
Page 30: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

RDF Statistics

310,585,567 total triples

389,132 hierarchical row headers

17,960,911 data cells

61,110 column headers

3,609 row properties

3,150 titles

1,581,546 row headers

274,404 metadata cells

See http://lod.cedar-project.nl/cedar/data.html

Page 31: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Everything in one system: what does this mean?

No separate files

Insight into the number of variables

Availability of variables (preliminary analyses)

Straightforward harmonizations

Systematic data check

Visualizations

Etc..

Page 32: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Examples on raw data

Page 33: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Concept Drift

Identified 3 subsystems

Page 34: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Straightforward harmonization

Number of teachers (HISCO 13490)
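
As a rough illustration of such a straightforward harmonization, the sketch below sums all observations that carry one harmonized occupation code, per census year. The record layout and the counts are invented for the example; only the code 13490 comes from the slide.

```python
from collections import defaultdict

# Made-up harmonized observations; the counts are illustrative, not census figures.
observations = [
    {"year": 1889, "hisco": "13490", "count": 120},
    {"year": 1889, "hisco": "13490", "count": 80},
    {"year": 1899, "hisco": "13490", "count": 230},
    {"year": 1899, "hisco": "other", "count": 999},
]

def time_series(observations, hisco_code):
    """Total count per year for one harmonized occupation code."""
    totals = defaultdict(int)
    for obs in observations:
        if obs["hisco"] == hisco_code:
            totals[obs["year"]] += obs["count"]
    return dict(sorted(totals.items()))

teachers = time_series(observations, "13490")
```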

Page 35: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Straightforward harmonization

Number of married women

Page 36: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Straightforward harmonization

Number of inhabitants of Amsterdam

Page 37: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Helps to detect causes of singularities

No more teachers since 1909?

Harmonization rule is wrong?

Source data error? Incomplete source data?

Error during RDF Data Cube conversion?

Adding totals (i.e. counting people several times)?
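
A small sketch of how such singularities might be flagged automatically before looking for their cause. It assumes a time series stored as a year-to-count mapping; the function name, the threshold `max_ratio` and the sample numbers are hypothetical.

```python
def find_singularities(series, expected_years, max_ratio=5.0):
    """Flag years that are missing, zero, or jump sharply against the last value."""
    findings = []
    previous = None
    for year in sorted(expected_years):
        value = series.get(year)
        if value is None:
            findings.append((year, "missing"))
        elif value == 0:
            findings.append((year, "zero"))
        elif previous and (value / previous > max_ratio
                           or previous / value > max_ratio):
            findings.append((year, "jump"))
        if value:
            # Only non-zero values serve as the baseline for the next year.
            previous = value
    return findings

# Invented series in which the counts vanish from 1909 onwards.
series = {1889: 200, 1899: 230, 1909: 0}
flags = find_singularities(series, [1889, 1899, 1909, 1920])
```

A flag only says where to look; whether the cause is the harmonization rule, the source data, the conversion, or double counting still needs manual inspection.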

Page 38: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Three tier model

Raw data is filled

Annotations

Harmonization layer

Page 39: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information
Page 40: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Enriching the Harmonization layer

Cleaning and correcting

Standardizing variables, values, ...

Enriching of the data

Mappings

Connecting to existing classification systems

HISCO

Amsterdam Code

Creating variables, bottom-up classification systems

Key: bringing all these practices together

Page 41: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information
Page 42: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

HARMONIZING CENSUS

Harmonizing includes

• Geographical context

• Content variables

• Number itself: correction, estimation, smoothing (esp. between geographical levels), imputing, interpolation

Standardization: Amsterdam code for municipalities, GIS references

Systemizing in other ways: Demographical, Housing, Occupational
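
Of the operations on the number itself, interpolation is the easiest to sketch. The example below assumes plain linear interpolation between the two surrounding census years, which is only one possible choice; the function name and the counts are made up.

```python
def interpolate(known, year):
    """Linearly interpolate a count for `year` from the two surrounding census years."""
    if year in known:
        return float(known[year])
    earlier = [y for y in known if y < year]
    later = [y for y in known if y > year]
    if not earlier or not later:
        raise ValueError("year lies outside the known range; no extrapolation")
    y0, y1 = max(earlier), min(later)
    fraction = (year - y0) / (y1 - y0)
    return known[y0] + fraction * (known[y1] - known[y0])

# Invented counts for two census years.
known = {1879: 1000, 1889: 1200}
estimate = interpolate(known, 1884)
```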

Page 43: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

No single solution

Ambiguity

Page 44: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

[Figure: level of detail (geography, features) compared between years Y1 and Y2]

Each decade puts different questions central.

Each decade uses a different approach.

Through the census we learn about life in the past (data tables),

and equally about the administration (structure of the data tables).

We learn what was seen as a priority to be recorded at a certain time (gender information).

Page 45: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Geographical Referencing

CBS Code

Amsterdamse Code

Wageningse Code

NIDI Code

Change:

Horizontal and vertical changes, i.e. different classifications exist, and the classifications themselves also change

Page 46: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Occupations

[Figure: occupation categories O1 to O7 shown for Year t1, Year t2, Year t3 and Year t4]

Thanks to Andrea Scharnhorst

Page 47: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Occupations

HISCO

HISCAM

HISCLASS

Bottom Up classifications

Page 48: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

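Harmonization Examples

1889

Klasse Naam I: Aardewerk, diamant, glas, kalk, steenen, enz. (earthenware, diamond, glass, lime, stones, etc.)

1899

Klasse Naam I: Fabricage van aardewerk, glas, kalk en steenen (manufacture of earthenware, glass, lime and stones)

Klasse Naam II: Bewerking van diamant en andere edelsteenen en fijne gesteenten (processing of diamond and other gemstones and fine stones)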
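
One way to make the two years comparable is to record that the 1889 class was split and to add the 1899 parts back together. The sketch assumes that 1899 classes I and II together cover exactly what 1889 class I covered, which the class names suggest but do not prove; the mapping name and the counts are hypothetical.

```python
# 1889 class -> the 1899 classes it appears to have been split into.
SPLIT_1889_TO_1899 = {
    "I": ["I", "II"],
}

def comparable_total(counts_1899, class_1889):
    """Sum the 1899 classes that together stand for one 1889 class."""
    return sum(counts_1899[c] for c in SPLIT_1889_TO_1899[class_1889])

# Invented counts per 1899 class.
counts_1899 = {"I": 300, "II": 50}
total_for_1889_class_I = comparable_total(counts_1899, "I")
```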

Page 49: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Example Age-ranges

Examples of estimating new variables:

1869 Age groups: 11-15, 16-18, 19-25

1879 Age groups: 11-13, 14-18, 19-21, 22-25

1889 Age groups: 11-13, 14-15, 16-18, 19-21, 22-25

1869-1889: 11-13, 14-15, 16-18, 19-21, 22-25

1869-1889: 11-18, 19-21, 22-25
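
The age groups above can be compared mechanically. The sketch below finds the coarsest grouping that all three years share without any estimation, by intersecting the upper bounds of the groups. It assumes every grouping is contiguous and starts at the same age; the function names are hypothetical.

```python
def upper_bounds(groups):
    """Upper age limit of every group in one grouping."""
    return {hi for _, hi in groups}

def coarsest_common_groups(*groupings):
    """Age groups whose boundaries exist in every grouping."""
    common = set.intersection(*(upper_bounds(g) for g in groupings))
    start = min(lo for lo, _ in groupings[0])
    result = []
    for hi in sorted(common):
        result.append((start, hi))
        start = hi + 1
    return result

g1869 = [(11, 15), (16, 18), (19, 25)]
g1879 = [(11, 13), (14, 18), (19, 21), (22, 25)]
g1889 = [(11, 13), (14, 15), (16, 18), (19, 21), (22, 25)]
common = coarsest_common_groups(g1869, g1879, g1889)
```

Here the shared groups are 11-18 and 19-25. Any finer series, such as 19-21 and 22-25 for 1869, means splitting a published count, so those values have to be estimated.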

Page 50: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Building Classification Systems

Page 51: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Bottom-up example: reuse?

1859-1920 (population census):

3000+ housing types classified manually

Classification

Standardized Variables

Standardized Values

Page 52: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Creating the Golden Classification

• Goal: two-sided

• Extract all values using a query (raw data)

• Build a flat list

• Group by shared function

• Standardized variables and values

• Incremental / Bottom up
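
The "flat list" step could look roughly like this, assuming the raw labels have already been pulled out by a query and arrive as plain strings. The function name and sample labels are hypothetical; grouping by shared function remains manual work.

```python
from collections import Counter

def flat_list(raw_values):
    """Distinct raw labels with their frequencies, after whitespace clean-up."""
    normalised = (" ".join(v.split()) for v in raw_values if v and v.strip())
    return Counter(normalised).most_common()

# Invented query result with duplicates and stray whitespace.
labels = flat_list(["Klooster", " Klooster ", "Kostschool", ""])
```

Sorting by frequency puts the most common labels first, which fits an incremental approach: classify the frequent values, then work down the list.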

Page 53: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Example: Klooster 2-72

• Klooster St. Magdelena -> 2-72

• St. Vincentius Klooster -> 2-72

• St. Oda-kloost. -> 2-72

• Fraterhuis St. Franciscus Xaverius -> 2-72

• St. Paulus Abdij -> 2-72

• Klooster der Zusters van het arme kindje Jezus en kostschool (St. Joseph) -> 2-72, 2-62

Page 54: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Example:

“Klooster der Broeders van de Onbevlekte Ontvangenis, Kostschool en Weeshuis”

2-43

2-72

2-62

Instituut St Jozeph - ?

Inst Clarenbeek - ?

Woltersum - ?
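
A keyword-based first pass can suggest codes for labels like these and leave the rest for manual classification. The codes 2-72, 2-62 and 2-43 appear on the slides; which keyword triggers which code is an assumption made for this sketch (in particular, that 2-43 belongs to "weeshuis"), and `RULES` and `classify` are hypothetical names.

```python
# Keyword -> code rules (assumed for illustration, not the project's rule set).
RULES = {
    "klooster": "2-72",
    "kloost.": "2-72",
    "abdij": "2-72",
    "fraterhuis": "2-72",
    "kostschool": "2-62",
    "weeshuis": "2-43",
}

def classify(label):
    """Return every code whose keyword occurs in the label; empty means 'ask a human'."""
    text = label.lower()
    codes = []
    for keyword, code in RULES.items():
        if keyword in text and code not in codes:
            codes.append(code)
    return codes

single = classify("St. Vincentius Klooster")
multiple = classify(
    "Klooster der Broeders van de Onbevlekte Ontvangenis, Kostschool en Weeshuis"
)
unknown = classify("Woltersum")
```

A label can receive several codes, and a label with no keyword match comes back empty, which corresponds to the question marks above.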

Page 55: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

link

20 Major classes

40 Minor classes

Page 56: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

No single solution

• Cleaning

• Enriching

• Standardizing / Restructuring, etc.

• HISCO

• Mapping

• Amsterdam Code / CBS Code

• GIS referencing

...

Page 57: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Reuse - for historical statistical data

SDMX

But Housing, Religions, other demographic variables ?

HISCO (occupations) , A’dam Code (municipalities)

Straightforward harmonizations: Sex, Marital Status, Positions etc..

Page 58: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information
Page 59: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Challenge

Page 60: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

We have started standardizing variables and values across the years

Straightforward harmonization covers a lot

Focus on creating census-specific variables such as housing, religions, age, occupations, etc.

Focus on the most used variables

The more we put in, the more specific our queries can be

Page 61: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

Remarks

Many versions

Visualizations as detectors of errors

Machine readable formats invite playfulness!

Human creative work is irreplaceable.

Saving original structures

Allow different harmonizations on the same data

Correction and harmonization is an ongoing process

Integrated workflow (from harmonized value to the original images)

Page 62: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information

RDF good for

Open data publishing in the Web

Machine processable

Dynamic schemas

Easy links

Easy round tripping to Excel, CSV, Access

Time Series Variables Selection

Harmonized dataset

Web interfaces, data exploration, visualizations

Page 63: CEDAR: From Fragment to Fabric - Dutch Census Data in a Web of Global Cultural and Historic Information