cedar: from fragment to fabric - dutch census data in a web of global cultural and historic...
by Ashkan Ashkpour, Albert Meroño-Peñuela, Christophe Gueret (http://cedar-project.nl/), presented at the 3rd PRELIDA Consolidation and Dissemination Workshop, Riva, Italy, October 17, 2014. More information about the workshop at: prelida.eu
CEDAR
From fragment to fabric - Dutch census
data in a web of global cultural and
historic information
http://cedar-project.nl/
Ashkan Ashkpour
Albert Meroño-Peñuela
Christophe Gueret
Affiliations
The Jargon Slide
- Dimensions - Variables
- Codes - Values
- Concept Scheme - Classification
System
- Integrator - Mapper / debugger
- QB - DataCube
Ashkan:
Background: Economics and Informatics
PhD Student at: the International Institute of Social
History, E-humanities Group, Erasmus
University, and affiliated with DANS
PhD Topic: Theory and Practice of Data
Harmonization
Last Paper: The Aggregated Dutch Historical
Censuses: Harmonization and RDF (Historical
Methods)
Presentation Outline
Why Census Project
History of the Census
Current State
Point of take-off
Census data in RDF
Harmonization
Double purpose of Linked Census Data
Improve information retrieval for the general public (incl. lay experts, students, researchers)
Create new data sources and possibly new research practices in social history, historic demography, general history, …
Create immediate access to digitized Dutch census data
Semantic modeling of Dutch census data
Further enriching of statistical information with context
The Census
(…) the breach between the demographer and
the census was complete… Is the role of
censuses for historical demographers …
over? The census seems to have become
less en vogue as a source of demographic
research.
Thanks to Jan Kok
The Census Use Case
2011-2015
“The potential of the census for socio-economic
historians and historical demographers is far
from exhausted”
Thanks to Jan Kok
• Source of historical statistical data, providing a
rich source of social, economic and
demographic data
• A relatively untapped source of
information / Most research focuses on a
specific year or a subset instead of time series
• Although sometimes lagging behind social
reality, it contains specific information about a
nation's population characteristics and needs at
a given time in history, providing invaluable
snapshots of the state of a nation
The Historical Census Use Case
2011-2015
Door 2 Door
Historical Censuses
End of the door-to-door
censuses
Background Last Traditional
Census
Source: NIWI/CBS - Luuk Schreven
Digitization Efforts - 1997
Cooperation between CBS and NIWI
Conversion Process
VT 1869 Plaatselijke indeling NB
The census dataset (1795-1971)
3 Types of Censuses
[Population/Occupation/Housing]
17 census years
Only the aggregated form is left
~3000 census tables
33283 annotations
17 million characters
Excel
Conversion Process
Structural Heterogeneity
How can we maintain the same structure and information?
One-to-one representation
Layout
Going to RDF
Model: RDF Data Cube > multidimensional
statistical data
Supervised conversion
Need to define the layout structures per table
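A minimal sketch of what such a supervised conversion step could look like, assuming a tiny table whose layout (which row holds the column headers, which column holds the row headers) has been marked by hand. It emits one observation per data cell, typed as an RDF Data Cube qb:Observation. The table contents, the `layout` dictionary, the `example.org` namespace and its property names are all hypothetical and not the project's actual pipeline.

```python
# Sketch only: the table, its layout marks and the example.org URIs are
# hypothetical; qb:Observation is the RDF Data Cube observation class.
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/census/"

table = [
    ["", "Mannen", "Vrouwen"],   # row 0: column headers
    ["Amsterdam", 120, 130],     # column 0: row headers
    ["Utrecht", 40, 45],
]
layout = {"header_row": 0, "header_col": 0}  # supplied by a human annotator


def table_to_observations(table, layout, table_id):
    """Turn every data cell into triples describing one observation."""
    triples = []
    col_headers = table[layout["header_row"]]
    for r, row in enumerate(table):
        if r == layout["header_row"]:
            continue  # header rows carry no data cells
        row_header = row[layout["header_col"]]
        for c, value in enumerate(row):
            if c == layout["header_col"]:
                continue  # header columns carry no data cells
            obs = f"{EX}{table_id}/r{r}c{c}"
            triples += [
                (obs, RDF_TYPE, QB + "Observation"),
                (obs, EX + "rowHeader", row_header),
                (obs, EX + "columnHeader", col_headers[c]),
                (obs, EX + "population", value),
            ]
    return triples


triples = table_to_observations(table, layout, "T1")
print(len(triples))
```

Keeping the layout description separate from the table means the same conversion function can be reused for every table once its structure has been marked.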
Going to RDF
Styling of 2,288 tables
Training and conversion at DANS
Thanks to Michael and Jetske
RDF Statistics
310,585,567 total triples
389,132 hierarchical row headers
17,960,911 data cells
61,110 column headers
3,609 row properties
3,150 titles
1,581,546 row headers
274,404 metadata cells
See http://lod.cedar-project.nl/cedar/data.html
Everything in one system: what does this
mean?
No separate files
Insight into the number of variables
Availability of variables (preliminary analyses)
Straightforward harmonizations
Systematic data check
Visualizations
Etc..
Examples on raw data
Concept Drift
Identified 3 subsystems
Straight-forward harmonization
Number of teachers (HISCO 13490)
Straight-forward harmonization
Number of married women
Straight-forward harmonization
Number of inhabitants of Amsterdam
Helps to detect causes of singularities
No more teachers since 1909?
Harmonization rule is wrong?
Source data error? Incomplete source data?
Error during RDF Data Cube conversion?
Adding totals (i.e. counting people several times)?
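The kind of check behind these questions can be sketched as follows, assuming a harmonized series stored as a year-to-count dictionary. The function flags census years where the series is missing or drops to zero after having been positive. The figures are invented for illustration, and the function name is hypothetical.

```python
def find_singularities(series, census_years):
    """Return (year, reason) pairs where a harmonized series looks suspicious."""
    flagged = []
    seen_positive = False
    for year in census_years:
        value = series.get(year)
        if value is None:
            # No value at all for a census year after the series started
            if seen_positive:
                flagged.append((year, "missing"))
        elif value == 0 and seen_positive:
            flagged.append((year, "zero after positive"))
        elif value > 0:
            seen_positive = True
    return flagged


# Invented counts, for illustration only
years = [1889, 1899, 1909, 1920, 1930]
teachers = {1889: 100, 1899: 120, 1909: 150, 1920: 0}
print(find_singularities(teachers, years))
# [(1920, 'zero after positive'), (1930, 'missing')]
```

A flag only says where to look. Deciding whether the cause is the harmonization rule, the source, or the conversion still needs a human.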
Three tier model
Raw data is filled
Annotations
Harmonization layer
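One way to picture the three tiers is the sketch below, assuming the raw cells are kept untouched, annotations are attached by cell identifier, and harmonization rules live in their own layer. Every name, label and value here is hypothetical.

```python
# Tier 1: raw cells, never modified (hypothetical content)
raw = {
    "cell1": {"year": 1889, "label": "Mannen", "value": 120},
    "cell2": {"year": 1899, "label": "M.", "value": 135},
}
# Tier 2: annotations attached to raw cells
annotations = {"cell2": ["label abbreviated in source"]}
# Tier 3: one harmonization among possibly many
rules_v1 = {"Mannen": "male", "M.": "male"}


def harmonize(raw, annotations, rules):
    """Build a harmonized view; each row keeps a pointer to its raw cell."""
    view = []
    for cell_id, cell in sorted(raw.items()):
        standard = rules.get(cell["label"])
        if standard is None:
            continue  # no rule yet: the cell stays in the raw tier only
        view.append({
            "source": cell_id,
            "year": cell["year"],
            "sex": standard,
            "value": cell["value"],
            "notes": annotations.get(cell_id, []),
        })
    return view


view = harmonize(raw, annotations, rules_v1)
print(view)
```

Because the rules are a separate object, a second rule set can be applied to the same raw cells without touching them.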
Enriching the Harmonization layer
Cleaning and correcting
Standardizing variables, values..
Enriching of the data
Mappings
Connecting to existing classifications systems
HISCO
Amsterdam Code
Creating variables, bottom up classification
systems
Key: bringing all these practices together
HARMONIZING CENSUS
Harmonizing includes
• Geographical context
• Content variables
• Number itself
correction
estimation
smoothing (esp. between geographical levels)
imputing
interpolation
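As an illustration of the last item, here is a minimal sketch of linear interpolation between two census years. The slides do not say which interpolation method is used, so the linear form is an assumption, and the population figures are invented.

```python
def interpolate(year, known):
    """Linearly interpolate a count for `year` from the nearest known years."""
    if year in known:
        return float(known[year])
    earlier = [y for y in known if y < year]
    later = [y for y in known if y > year]
    if not earlier or not later:
        raise ValueError("year lies outside the known range")
    y0, y1 = max(earlier), min(later)
    fraction = (year - y0) / (y1 - y0)
    return known[y0] + fraction * (known[y1] - known[y0])


# Invented figures, for illustration only
population = {1879: 1000, 1899: 1400}
print(interpolate(1889, population))  # 1200.0
```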
Standardization
Amsterdam code for municipalities
GIS references
Systemizing in other ways
Demographical
Housing
Occupational
No 1 Solution
Ambiguity
Level of details
Geography
Features
Y1
Y2
Each decade puts different questions central.
Each decade uses a different approach.
We learn about life in the past through the census (data tables),
and we learn equally about the administration (structure of the data tables).
We learn what was seen as a priority to record at a certain time (gender information).
Geographical Referencing
CBS Code
Amsterdamse Code
Wageningse Code
NIDI Code
Change:
Horizontal and vertical changes, i.e. different
classifications are used, and the classifications themselves also change
[Figure: occupation classes O1-O7 regrouped across Year t1, Year t2, Year t3 and Year t4]
Occupations
Thanks to Andrea Scharnhorst
Occupations
HISCO
HISCAM
HISCLASS
Bottom Up classifications
Harmonization Examples
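1889 Klasse Naam I
Aardewerk, diamant, glas, kalk, steenen, enz. (earthenware, diamond, glass, lime, stone, etc.)
1899 Klasse Naam I
Fabricage van aardewerk, glas, kalk en steenen (manufacture of earthenware, glass, lime and stone)
Klasse Naam II
Bewerking van diamant en andere edelsteenen en fijne gesteenten (working of diamond and other precious stones and fine stones)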
Example Age-ranges
Examples of estimating new variables:
1869 Age groups: 11-15, 16-18, 19-25
1879 Age groups: 11-13, 14-18, 19-21, 22-25
1889 Age groups: 11-13, 14-15, 16-18, 19-21, 22-25
1869-1889: 11-13, 14-15, 16-18, 19-21, 22-25
1869-1889: 11-18, 19-21, 22-25
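The age groups above can be compared with a small sketch that finds the groups obtainable in every census year by summation alone, assuming all schemes start at the same lower bound. For these three years it yields 11-18 and 19-25. The finer harmonized series on the slide (for example 19-21 and 22-25 in 1869) cannot be reached by summing and require estimation, which this sketch does not attempt. The counts are invented and the function names are hypothetical.

```python
def common_groups(schemes):
    """Coarsest age groups obtainable in every scheme by summing only."""
    # An upper bound is usable only if every scheme has a group ending there
    shared_ends = set.intersection(*({hi for _, hi in s} for s in schemes))
    start = min(lo for lo, _ in schemes[0])
    groups = []
    for end in sorted(shared_ends):
        groups.append((start, end))
        start = end + 1
    return groups


def regroup(counts, groups):
    """Sum the counts of source groups that fall inside each target group."""
    return {
        (lo, hi): sum(n for (a, b), n in counts.items() if lo <= a and b <= hi)
        for lo, hi in groups
    }


schemes = {
    1869: [(11, 15), (16, 18), (19, 25)],
    1879: [(11, 13), (14, 18), (19, 21), (22, 25)],
    1889: [(11, 13), (14, 15), (16, 18), (19, 21), (22, 25)],
}
groups = common_groups(list(schemes.values()))
print(groups)  # [(11, 18), (19, 25)]

# Invented counts for 1879, for illustration only
counts_1879 = {(11, 13): 30, (14, 18): 50, (19, 21): 20, (22, 25): 25}
print(regroup(counts_1879, groups))  # {(11, 18): 80, (19, 25): 45}
```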
Building Classification Systems
Bottom-up example: reuse?
1859-1920 (population census):
3000+ housing types classified manually
Classification
Standardized Variables
Standardized Values
Creating the Golden Classification
• Goal: two-sided
• Extract all values using a query (raw data)
• Build a flat list
• Grouped by shared function
• Standardized variables and values
• Incremental / Bottom up
Example : Klooster 2-72
• Klooster St. Magdelena 2-72
• St. Vincentius Klooster 2-72
• St. Oda-kloost. 2-72
• Fraterhuis St. Franciscus Xaverius 2-72
• St. Paulus Abdij 2-72
• Klooster der Zusters van het arme kindje Jezus en kostschool (St. Joseph) 2-72 2-62
Example:
“Klooster der Broeders van de Onbevlekte Ontvangenis,
Kostschool en Weeshuis”
2-43
2-72
2-62
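A keyword-based first pass over such labels could be sketched as below. The codes 2-72, 2-62 and 2-43 come from the examples above, but the pairing of keywords with codes is an assumption inferred from those examples, and the rule table and function name are hypothetical. On the slides this classification was done manually.

```python
# Assumed keyword-to-code pairing, inferred from the examples on the slides
KEYWORD_CODES = [
    (("kloost", "abdij", "fraterhuis"), "2-72"),
    (("kostschool",), "2-62"),
    (("weeshuis",), "2-43"),
]


def classify(label):
    """Return every code whose keywords occur in the label (may be empty)."""
    text = label.lower()
    codes = []
    for keywords, code in KEYWORD_CODES:
        if any(keyword in text for keyword in keywords):
            codes.append(code)
    return codes


print(classify("St. Oda-kloost."))
print(classify("Klooster der Broeders van de Onbevlekte Ontvangenis, "
               "Kostschool en Weeshuis"))
print(classify("Woltersum"))  # no keyword matches: left for manual review
```

Labels that match several rules receive several codes, and labels that match none come back empty so they can be queued for manual classification.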
Insituut St Jozeph - ?
Inst Clarenbeek- ?
Woltersum ?
link
20 Major classes
40 Minor classes
No single solution
• Cleaning
• Enriching
• Standardizing / Restructuring
etc.
• HISCO
• Mapping
• Amsterdam Code / CBS Code
• GIS referencing
....
Reuse - for historical Statistical
Data
SDMX
But Housing, Religions, other demographic variables ?
HISCO (occupations) , A’dam Code (municipalities)
Straightforward harmonizations: Sex, Marital Status, Positions etc..
Challenge
We have started by standardizing variables and values across the years
Straightforward harmonization covers a lot
Focus on creating census-specific variables such as housing, religion, age, occupations, etc.
Focus on the most used variables
The more we put in, the more specific our queries can be
Remarks
Many versions
Visualizations as detectors of errors
Machine readable formats invite playfulness!
Human creative work is irreplaceable.
Saving original structures
Allow different harmonizations on the same data
Correction and harmonization is an ongoing process
Integrated workflow (from harmonized value to the original images)
RDF good for
Open data publishing on the Web
Machine processable
Dynamic schemas
Easy links
Easy round tripping to Excel, CSV, Access
Time Series Variables Selection
Harmonized dataset
Web interfaces, data exploration, visualizations