cbs cedar presentation
CEDAR
From fragment to fabric - Dutch census data in a web of global cultural and historic information
http://cedar-project.nl/
Ashkan Ashkpour
Albert Meroño Peñuela
Affiliations
Double purpose of Linked Census Data
Improve information retrieval for the general public (incl. lay experts, students, researchers)
Create new data sources and possibly new research practices in social history, historic demography, general history, …
Create immediate access to digitized Dutch census data
Semantic modeling of Dutch census data
Further enriching of statistical information with context
The Historical Census Use Case
2011-2015
Historical Censuses
End of the door-to-door censuses
• Source of historical statistical data, providing a
rich source of social, economic and
demographic data
• A relatively untapped source of information: most
research focuses on a specific year or a subset
instead of time series
• “... specific information about a nation's population
characteristics and needs at a given time in
history, providing invaluable snapshots of the
state of a nation”
Digitization Efforts – 1996
Cooperation between CBS and NIWI
Conversion Process
VT 1869 Plaatselijke indeling NB
The census dataset (1795-1971)
3 Types of Censuses
[Population/Occupation/Housing]
17 census years
Only the aggregated form is left
2288 census tables
33283 annotations
17 million characters
Conversion Process
Structural Heterogeneity
How can we maintain the same structure
and information
1 on 1 representation
Using the Layout
Going to RDF
Model: RDF Data Cube (multi-dimensional statistical data)
Supervised conversion
Need to define the layout structures per table
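To illustrate why the layout has to be defined per table, here is a minimal sketch in which every cell carries a layout tag (column header, row header, or data). The tag names, the sample table, and the function `cells_to_observations` are assumptions made for this example; they are not the CEDAR converter or its actual styles.

```python
from typing import Dict, List, Tuple

# Hypothetical layout tags; the styles used in the project may differ.
COL_HEADER, ROW_HEADER, DATA = "col_header", "row_header", "data"

Cell = Tuple[str, str]  # (layout tag, cell text)


def cells_to_observations(table: List[List[Cell]]) -> List[Dict[str, str]]:
    """Turn a tagged table into one record per data cell."""
    # Collect column headers by column index; a later header row
    # overwrites an earlier one for the same column.
    col_headers: Dict[int, str] = {}
    for row in table:
        for col, (tag, text) in enumerate(row):
            if tag == COL_HEADER:
                col_headers[col] = text

    observations = []
    for row in table:
        # Row headers on this row describe every data cell on it.
        row_labels = [text for tag, text in row if tag == ROW_HEADER]
        for col, (tag, text) in enumerate(row):
            if tag == DATA:
                observations.append({
                    "row": " / ".join(row_labels),
                    "column": col_headers.get(col, ""),
                    "value": text,
                })
    return observations


# Made-up sample table (the numbers are not census data).
sample = [
    [(COL_HEADER, ""), (COL_HEADER, "Men"), (COL_HEADER, "Women")],
    [(ROW_HEADER, "Teachers"), (DATA, "12"), (DATA, "7")],
]
observations = cells_to_observations(sample)
```

Once the tags are in place the same function works for any table, which is the point of styling each table before conversion.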
Going to RDF
Styling of 2,288 tables
Training and conversion at DANS
Thanks to Michael and Jetske
RDF Statistics
310,585,567 total triples
389,132 hierarchical row headers
17,960,911 data cells
61,110 column headers
3,609 row properties
3,150 titles
1,581,546 row headers
274,404 metadata cells
See http://lod.cedar-project.nl/cedar/data.html
Everything in one system: what does this mean?
No separate files
Insight into the number of variables
Availability of variables (preliminary analyses)
Straightforward harmonizations
Systematic data check
Visualizations
Other debugging purposes
Examples on raw data
Number of teachers
Number of married women
Three tier model
Raw data is filled
Annotations
Harmonization layer
Enriching the Harmonization layer
Cleaning and correcting
Standardizing variables and values
Mappings
Connecting to existing classification systems:
HISCO (historical occupations)
Amsterdam Code (historical municipalities)
SDMX (demographic variables)
Creating variables, bottom-up classification systems (religious denominations, housing types, occupations, age ranges, ...)
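The mapping step can be sketched as a lookup from normalized raw labels to codes in a standard scheme, with unmatched labels set aside for manual review. This is only an illustration under stated assumptions: the labels, the codes in the `scheme:` namespace, and the functions `normalize` and `harmonize` are placeholders, not CEDAR mappings or real HISCO entries.

```python
from typing import Dict, List, Tuple


def normalize(label: str) -> str:
    """Lower-case and collapse whitespace so spelling variants line up."""
    return " ".join(label.lower().split())


def harmonize(
    raw_labels: List[str], mapping: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Split raw labels into mapped (label -> code) and unmapped ones."""
    mapped: Dict[str, str] = {}
    unmapped: List[str] = []
    for label in raw_labels:
        code = mapping.get(normalize(label))
        if code is None:
            unmapped.append(label)  # left for an expert to resolve
        else:
            mapped[label] = code
    return mapped, unmapped


# Placeholder mapping table; keys are already normalized.
mapping = {"occupation a": "scheme:code-1", "occupation b": "scheme:code-2"}
mapped, unmapped = harmonize(
    ["Occupation  A", "occupation b", "Occupation C"], mapping
)
```

Keeping the unmapped labels explicit makes it visible how much of the raw data still needs a bottom-up classification.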
Key: bringing all these practices together
CEDAR goal: cross-query the Dutch historical censuses on the Web
(aka integrating ~3K disparate tables)
[Figure: census years 1795, 1830, 1889, 1930, 1971]
• Web publishable
• Machine processable
• Dynamic schema
• Easily link with other datasets
Why Semantic Web Technology?
To W3C
Web publishable
Web exchangeable
Human & machine readable
Provide interesting links
To us
Finer granularity level (cell level)
Statistical comparability by leveraging semantic descriptions
Provenance
Harmonization through linkage to other datasets
Towards 5-star Census Data
>2 years ago
2 years ago
“There are many situations where it
would be useful to be able to publish
multi-dimensional data, such as
statistics, on the web in such a way
that they can be linked to related data
sets and concepts.”
RDF Data Cube vocabulary (QB)
• SDMX compatible
• Defines cubes as a set of observations that consist
of dimensions, measures and attributes
• Dimensions: time period, region, sex (qb:DimensionProperty)
• Measure: population life expectancy (qb:MeasureProperty)
• Attribute: unit of measure = years, metadata status =
measured (qb:AttributeProperty)
Observation: “the measured life expectancy of males in
Newport in the period 2004-2006 is 76.7 years”
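The cube structure above can be sketched as a small data type: an observation carries dimension values, one measured value, and attributes. The class `Observation` and the function `describe` are assumptions made for illustration; they mirror the QB terms but are not an RDF library or part of the vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Observation:
    dimensions: Dict[str, str]   # what the value is about (qb:DimensionProperty)
    measure: str                 # what is measured (qb:MeasureProperty)
    value: float                 # the measured value itself
    attributes: Dict[str, str] = field(default_factory=dict)  # qb:AttributeProperty


def describe(obs: Observation) -> str:
    """Render an observation as a one-line summary."""
    dims = ", ".join(f"{name}={val}" for name, val in obs.dimensions.items())
    unit = obs.attributes.get("unit of measure", "")
    return f"{obs.measure} [{dims}] = {obs.value} {unit}".strip()


# The life-expectancy example from the slide.
obs = Observation(
    dimensions={"time period": "2004-2006", "region": "Newport", "sex": "male"},
    measure="life expectancy",
    value=76.7,
    attributes={"unit of measure": "years", "metadata status": "measured"},
)
summary = describe(obs)
```

Dimensions identify the observation, the measure says what the number is, and attributes qualify how to read it.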
CEDAR Integrator
https://github.com/CEDAR-project/Integrator
http://lod.cedar-project.nl/cedar/data.html
http://lod.cedar-project.nl/cedar/stats.html
http://lod.cedar-project.nl/maps/
Dimension Reusability
cedar:BRT_1889_02_T1-S0-K17-h a qb:Observation ;
    cedar:population "12"^^xsd:integer ;
    maritalstatus:maritalStatus maritalstatus:single ;
    cedarterms:occupationPosition cedarterms:job-D ;
    sdmx-dimension:sex sdmx-code:sex-F ;
    cedarterms:occupation hisco:88030 ;
    sdmx-dimension:refArea gg:11150 ;
    cedarterms:belief hreligion:118 ;
    cedarterms:houseType cedar:Klooster ;
    prov:wasDerivedFrom cedar:BRT_1889_08_T1-S0-K17 ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-K17-activity .
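Shared dimensions are what make cross-querying possible: once every observation uses the same property and code for, say, sex or area, one filter works across tables. A minimal sketch using plain dictionaries follows; the first record copies values from the observation above, the second record is made up, and `total_population` is a hypothetical helper rather than part of the CEDAR Integrator.

```python
from typing import Dict, List

Observation = Dict[str, object]

observations: List[Observation] = [
    {   # values copied from the observation shown above
        "sdmx-dimension:sex": "sdmx-code:sex-F",
        "sdmx-dimension:refArea": "gg:11150",
        "cedarterms:occupation": "hisco:88030",
        "cedar:population": 12,
    },
    {   # made-up second observation, for illustration only
        "sdmx-dimension:sex": "sdmx-code:sex-M",
        "sdmx-dimension:refArea": "gg:11150",
        "cedarterms:occupation": "hisco:88030",
        "cedar:population": 5,
    },
]


def total_population(obs_list: List[Observation], filters: Dict[str, str]) -> int:
    """Sum cedar:population over observations matching every filter."""
    total = 0
    for obs in obs_list:
        if all(obs.get(dim) == code for dim, code in filters.items()):
            total += int(obs["cedar:population"])
    return total


in_area = total_population(observations, {"sdmx-dimension:refArea": "gg:11150"})
women_in_area = total_population(
    observations,
    {"sdmx-dimension:refArea": "gg:11150", "sdmx-dimension:sex": "sdmx-code:sex-F"},
)
```

In the published data the same question would be asked in SPARQL; the dictionary version only shows why agreeing on dimension URIs matters.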
LSD Dimensions
http://lsd-dimensions.org/
https://github.com/albertmeronyo/LSD-Dimensions
Hourly JSON-LD dumps
What if dimensions aren’t out there?
Need to build them
Input: flat lists of non-standard values
Output: standard concept scheme
Knowledge intensive problem
https://github.com/CEDAR-project/TabCluster
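The slides do not describe how TabCluster works, so the following is a swapped-in, deliberately simple technique: greedy grouping of a flat list of labels by string similarity, using Python's standard `difflib`. The labels, the threshold, and the function names are assumptions for illustration only.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_labels(labels: List[str], threshold: float = 0.8) -> List[List[str]]:
    """Put each label in the first cluster whose first member is similar enough."""
    clusters: List[List[str]] = []
    for label in labels:
        for cluster in clusters:
            if similarity(label, cluster[0]) >= threshold:
                cluster.append(label)
                break
        else:
            # No existing cluster matched: start a new candidate concept.
            clusters.append([label])
    return clusters


# Made-up spelling variants of two denominations.
labels = ["Roman Catholic", "Lutheran", "Roman Catholics", "roman-catholic", "Lutherans"]
clusters = cluster_labels(labels)
```

String similarity only catches spelling variants. Synonyms and historical renamings still need expert knowledge, which is why the problem is knowledge intensive.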
Concept Drift
Census classification of occupations as of 1859
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
Concept Drift
Census classification of occupations as of 1889
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
Census classification of occupations as of 1899
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
Concept Drift
RQ: Can we use past knowledge to predict when and where
concept drift will happen in an ontology?
Theoretical framework: [1]
Data: a number of ontology versions
Method: supervised learning [2]
Features: structural, membership, usage [3]
Results: f-measures of 0.84, 0.93, 0.79
https://github.com/albertmeronyo/ConceptDrift
[1] Shenghui Wang, Stefan Schlobach, Michael Klein. “What is Concept Drift and How to Identify It?”. EKAW 2010.
[2] Pesquita C, Couto FM (2012) Predicting the Extension of Biomedical Ontologies. PLoS Comput Biol 8(9): e1002630.
[3] Ljiljana Stojanovic. “Methods and Tools for Ontology Evolution” (2004).
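For readers unfamiliar with the metric behind the reported results: the F-measure combines precision and recall of a classifier into one score. The sketch below computes it for boolean labels (True meaning "this concept drifts"). The slides do not say which variant was used, so the balanced default (beta = 1) is an assumption, and the example labels are made up.

```python
from typing import Sequence


def f_measure(actual: Sequence[bool], predicted: Sequence[bool], beta: float = 1.0) -> float:
    """F-beta score of predicted labels against actual labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)        # true positives
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)    # false positives
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)    # false negatives
    if tp == 0:
        return 0.0  # no correct positive prediction: score is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up labels for five concepts.
actual = [True, True, True, False, False]
predicted = [True, True, False, True, False]
score = f_measure(actual, predicted)
```

Here two of three predicted drifts are correct and two of three actual drifts are found, so precision and recall are both 2/3 and so is the balanced F-measure.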
Concept Drift
Compatibility? Remixability? Reusability?
Sarven Capadisli, Albert Meroño-Peñuela, Sören Auer, Reinhard Riedl. “Semantic
Similarity and Correlation of Linked Statistical Data Analysis”. 2nd Int. Workshop on
Semantic Statistics (SemStats) ISWC 2014.
Summary
RDF Data Cube: publishing and integrating multi-dimensional data in the Semantic Web
Dutch historical censuses (increasingly) published and queryable online
Discoverability, reusability and remixability of dimensions are important
Bottom-up concept scheme generation only semi-automatable
Concept drift (or concept stability) can be predicted accurately if enough historical data is available
Semantic representations can provide insight into statistical correlation
CEDAR Integrator
https://github.com/CEDAR-project/Integrator
LSD Dimensions
http://lsd-dimensions.org/
TabCluster
https://github.com/CEDAR-project/TabCluster
Concept Drift
https://github.com/albertmeronyo/ConceptDrift
Semantic Correlation
http://csarven.ca/sense-of-lsd-analysis
http://www.cedar-project.nl/
Thank you