cbs cedar presentation

Post on 13-Jul-2015


CEDAR

From fragment to fabric - Dutch census data in a web of global cultural and historic information

http://cedar-project.nl/

Ashkan Ashkpour

Albert Meroño Peñuela

Affiliations

Double purpose of Linked Census Data

Improve information retrieval for the general public (incl. lay experts, students, researchers)

Create new data sources and possibly new research practices in social history, historic demography, general history, …

Create immediate access to digitized Dutch census data

Semantic modeling of Dutch census data

Further enriching of statistical information with context

The Historical Census Use Case

2011-2015

Historical Censuses


End of the door-to-door censuses

• Source of historical statistical data, providing a rich source of social, economic and demographic data

• A relatively untapped source of information: most research focuses on a specific year or a subset instead of time series

• “..specific information about a nation's population characteristics and needs at a given time in history, providing invaluable snapshots of the state of a nation”

Digitization Efforts – 1996

Cooperation between CBS and NIWI

Conversion Process


VT 1869 Plaatselijke indeling NB

The census dataset (1795-1971)

3 Types of Censuses

[Population/Occupation/Housing]

17 census years

Only the aggregated form is left

2288 census tables

33283 annotations

17 million characters


Structural Heterogeneity

How can we maintain the same structure and information?

1-to-1 representation

Using the Layout

Going to RDF

Model: RDF Data Cube > multi-dimensional statistical data

Supervised conversion

Need to define the layout structures per table
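A minimal sketch of what a per-table layout definition buys: once someone has declared how many header rows and header columns a table has, every data cell can be emitted together with the headers that give it meaning. The function name `flatten_table`, its parameters, and the miniature table are hypothetical illustrations, not the project's actual conversion tooling.

```python
from typing import Dict, List, Union

Cell = Union[str, int]


def flatten_table(rows: List[List[Cell]], header_rows: int,
                  header_cols: int) -> List[Dict[str, object]]:
    """Emit one record per data cell, keeping its row and column header paths."""
    records = []
    for r in range(header_rows, len(rows)):
        # Row headers are the leading cells of the current row.
        row_path = [str(v) for v in rows[r][:header_cols]]
        for c in range(header_cols, len(rows[r])):
            # Column headers are read top-down from the header rows.
            col_path = [str(rows[h][c]) for h in range(header_rows)]
            records.append({
                "row": r,
                "col": c,
                "row_headers": row_path,
                "col_headers": col_path,
                "value": rows[r][c],
            })
    return records


# Hypothetical miniature table: two header rows, one header column.
table = [
    ["Municipality", "Men", "Men", "Women", "Women"],
    ["", "Unmarried", "Married", "Unmarried", "Married"],
    ["A", 10, 20, 12, 21],
    ["B", 5, 7, 6, 7],
]

# The layout (2 header rows, 1 header column) is the supervised part.
cells = flatten_table(table, header_rows=2, header_cols=1)
for cell in cells:
    print(cell["row_headers"], cell["col_headers"], cell["value"])
```

Because each record keeps its original row and column position, the 1-to-1 correspondence with the source table is preserved.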


Styling of 2,288 tables

Training and conversion at DANS

Thanks to Michael and Jetske

RDF Statistics

310,585,567 total triples

389,132 hierarchical row headers

17,960,911 data cells

61,110 column headers

3,609 row properties

3,150 titles

1,581,546 row headers

274,404 metadata cells

See http://lod.cedar-project.nl/cedar/data.html

Everything in one system: what does this mean?

No separate files

Insight into the number of variables

Availability of variables (preliminary analyses)

Straightforward harmonizations

Systematic data check

Visualizations

Other debugging purposes

Examples on raw data


Number of teachers

Number of married women

Three tier model

Raw data is filled

Annotations

Harmonization layer

Enriching the Harmonization layer

Cleaning and correcting

Standardizing variables and values

Mappings

Connecting to existing classification systems:

HISCO (historical occupations)

Amsterdam Code (historical municipalities)

SDMX (demographic variables)

Creating variables and bottom-up classification systems (religious denominations, housing types, occupations, age ranges, ...)

Key: bringing all these practices together

CEDAR goal: cross-query the Dutch historical censuses on the Web (aka integrating ~3K disparate tables)

1795 1830 1889 1930 1971

• Web publishable

• Machine processable

• Dynamic schema

• Easily link with other datasets

Why Semantic Web Technology?

To W3C

Web publishable

Web exchangeable

Human & machine readable

Provide interesting links

To us

Finer granularity level (cell level)

Statistical comparability by leveraging semantic descriptions

Provenance

Harmonization through linkage to other datasets

Towards 5-star Census Data


>2 years ago

2 years ago

“There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that they can be linked to related data sets and concepts.”

RDF Data Cube vocabulary (QB)

• SDMX compatible

• Defines cubes as a set of observations that consist of dimensions, measures and attributes

• Dimensions: time period, region, sex (qb:DimensionProperty)

• Measure: population life expectancy (qb:MeasureProperty)

• Attribute: unit of measure = years, metadata status = measured (qb:AttributeProperty)

Observation: “the measured life expectancy of males in Newport in the period 2004-2006 is 76.7 years”
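The life-expectancy observation can be sketched as a small data structure that separates dimensions, measures and attributes and renders them as a Turtle-style property list. This is only an illustration under assumptions: the `Observation` class, the `ex:` property names and the literal encodings are hypothetical, and prefix declarations are omitted, so the output is a fragment rather than a complete RDF document.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Observation:
    """One cube observation: dimensions locate it, measures carry the value,
    attributes qualify how the value should be read."""
    uri: str
    dimensions: Dict[str, str] = field(default_factory=dict)
    measures: Dict[str, str] = field(default_factory=dict)
    attributes: Dict[str, str] = field(default_factory=dict)

    def to_turtle(self) -> str:
        # Type statement first, then every property/value pair in order.
        pairs = [("a", "qb:Observation")]
        for group in (self.dimensions, self.measures, self.attributes):
            pairs.extend(group.items())
        body = " ;\n".join(f"    {p} {o}" for p, o in pairs)
        return f"{self.uri}\n{body} ."


# Hypothetical property names (ex:...) standing in for real vocabulary terms.
obs = Observation(
    uri="ex:obs1",
    dimensions={
        "ex:refPeriod": '"2004-2006"',
        "ex:refArea": '"Newport"',
        "ex:sex": '"male"',
    },
    measures={"ex:lifeExpectancy": '"76.7"^^xsd:decimal'},
    attributes={"ex:unitMeasure": '"years"'},
)

print(obs.to_turtle())
```

In a real cube the dimension values would normally be URIs from shared code lists rather than plain strings, which is what makes observations from different datasets comparable.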

CEDAR Integrator

https://github.com/CEDAR-project/Integrator

http://lod.cedar-project.nl/cedar/data.html

http://lod.cedar-project.nl/cedar/stats.html

http://lod.cedar-project.nl/maps/

Dimension Reusability

cedar:BRT_1889_02_T1-S0-K17-h a qb:Observation ;
    cedar:population "12"^^xsd:integer ;
    maritalstatus:maritalStatus maritalstatus:single ;
    cedarterms:occupationPosition cedarterms:job-D ;
    sdmx-dimension:sex sdmx-code:sex-F ;
    cedarterms:occupation hisco:88030 ;
    sdmx-dimension:refArea gg:11150 ;
    cedarterms:belief hreligion:118 ;
    cedarterms:houseType cedar:Klooster ;
    prov:wasDerivedFrom cedar:BRT_1889_08_T1-S0-K17 ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-K17-activity .
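A rough sketch of why reused dimensions matter: when observations from different tables describe sex and area with the same identifiers, they can be grouped with a single key. The function `total_by` and the sample records are hypothetical; only the dimension names, the codes `sdmx-code:sex-F` and `gg:11150`, and the population value 12 come from the observation shown here. Summing across tables is purely illustrative, since real tables may overlap and double count.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Observation = Dict[str, object]


def total_by(observations: Iterable[Observation], dims: List[str],
             measure: str = "cedar:population") -> Dict[Tuple[object, ...], int]:
    """Sum a measure over all observations that share the same dimension values."""
    totals: Dict[Tuple[object, ...], int] = defaultdict(int)
    for obs in observations:
        # Observations lacking one of the requested dimensions are skipped.
        if all(d in obs for d in dims):
            key = tuple(obs[d] for d in dims)
            totals[key] += int(obs[measure])
    return dict(totals)


# Hypothetical observations originating from two different source tables.
observations = [
    {"table": "T1", "sdmx-dimension:sex": "sdmx-code:sex-F",
     "sdmx-dimension:refArea": "gg:11150", "cedar:population": 12},
    {"table": "T2", "sdmx-dimension:sex": "sdmx-code:sex-F",
     "sdmx-dimension:refArea": "gg:11150", "cedar:population": 30},
    {"table": "T2", "sdmx-dimension:sex": "sdmx-code:sex-M",
     "sdmx-dimension:refArea": "gg:11150", "cedar:population": 25},
]

totals = total_by(observations,
                  ["sdmx-dimension:refArea", "sdmx-dimension:sex"])
for key, value in totals.items():
    print(key, value)
```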


LSD Dimensions

http://lsd-dimensions.org/

https://github.com/albertmeronyo/LSD-Dimensions

Hourly JSON-LD dumps

What if dimensions aren’t out there?

Need to build them

Input: flat lists of non-standard values

Output: standard concept scheme

Knowledge intensive problem

https://github.com/CEDAR-project/TabCluster
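To make the input/output contrast concrete, here is a minimal sketch that groups a flat list of spelling variants into candidate concepts using plain string similarity from Python's standard library. It is not the TabCluster algorithm; the function names, the 0.8 threshold and the sample labels are assumptions chosen for illustration, and a domain expert would still need to review the resulting groups.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def normalize(label: str) -> str:
    """Lower-case, drop full stops and collapse whitespace."""
    return " ".join(label.lower().replace(".", " ").split())


def cluster_labels(labels: List[str],
                   threshold: float = 0.8) -> Dict[str, List[str]]:
    """Greedily attach each label to the first cluster whose representative
    is similar enough, otherwise start a new cluster."""
    clusters: Dict[str, List[str]] = {}
    for label in labels:
        norm = normalize(label)
        for representative in clusters:
            score = SequenceMatcher(None, norm, representative).ratio()
            if score >= threshold:
                clusters[representative].append(label)
                break
        else:
            # No existing cluster matched: this label founds a new concept.
            clusters[norm] = [label]
    return clusters


# Hypothetical flat list of non-standard values.
raw = ["Roman Catholic", "roman catholic.", "Roman Catholics",
       "Lutheran", "Lutherans", "Unknown"]

scheme = cluster_labels(raw)
for concept, variants in scheme.items():
    print(concept, "<-", variants)
```

String similarity only catches spelling variation; it cannot tell that two differently named values denote the same concept, which is one reason the problem stays knowledge intensive.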

Concept Drift

Census classification of occupations as of 1859

• Root node is void

• Depth 1: occupation groups

• Leaves: actual occupations

Concept Drift

Census classification of occupations as of 1889

• Root node is void

• Depth 1: occupation groups

• Leaves: actual occupations

Census classification of occupations as of 1899

• Root node is void

• Depth 1: occupation groups

• Leaves: actual occupations

Concept Drift

RQ: Can we use past knowledge to predict when and where concept drift will happen in an ontology?

Theoretical framework: [1]

Data: a number of ontology versions

Method: supervised learning [2]

Features: structural, membership, usage [3]

Results: f-measures of 0.84, 0.93, 0.79

https://github.com/albertmeronyo/ConceptDrift

[1] Shenghui Wang, Stefan Schlobach, Michel Klein. “What is Concept Drift and How to Identify It?”. EKAW 2010.

[2] Pesquita C, Couto FM (2012) Predicting the Extension of Biomedical Ontologies. PLoS Comput Biol 8(9): e1002630.

[3] Ljiljana Stojanovic. “Methods and Tools for Ontology Evolution” (2004).
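As a rough illustration of the setup, the sketch below derives one training row per concept from two consecutive versions of a classification tree (one structural feature plus a drift label) and scores a set of predictions with the f-measure. Everything here is assumed for illustration: the tree encoding, the function names, the toy occupation groups and the naive "always drifts" baseline. The actual work uses richer structural, membership and usage features and a trained classifier.

```python
from typing import Dict, List, Set, Tuple

Tree = Dict[str, Set[str]]  # concept -> set of its direct children


def training_rows(old: Tree, new: Tree) -> List[Tuple[str, int, bool]]:
    """For each concept in the old version: (name, number of children,
    whether its children changed in the new version)."""
    rows = []
    for concept in sorted(old):
        children = old[concept]
        drifted = new.get(concept) != children
        rows.append((concept, len(children), drifted))
    return rows


def f_measure(actual: List[bool], predicted: List[bool]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy versions of an occupation classification.
v1: Tree = {"trade": {"baker", "butcher"}, "industry": {"weaver"},
            "farming": {"farmer"}}
v2: Tree = {"trade": {"baker"}, "industry": {"weaver", "butcher"},
            "farming": {"farmer"}}

rows = training_rows(v1, v2)
actual = [drifted for _, _, drifted in rows]
predicted = [True] * len(rows)  # naive baseline: predict drift everywhere
score = f_measure(actual, predicted)
print(rows)
print(round(score, 2))
```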

Concept Drift

Compatibility? Remixability?

Reusability?

Sarven Capadisli, Albert Meroño-Peñuela, Sören Auer, Reinhard Riedl. “Semantic Similarity and Correlation of Linked Statistical Data Analysis”. 2nd Int. Workshop on Semantic Statistics (SemStats) ISWC 2014.

Summary

RDF Data Cube: publishing and integrating multi-dimensional data in the Semantic Web

Dutch historical censuses (increasingly) published and queryable online

Discoverability, reusability and remixability of dimensions are important

Bottom-up concept scheme generation only semi-automatable

Concept drift (or concept stability) can be predicted accurately if enough historical data is available

Semantic representations can provide insight into statistical correlation

CEDAR Integrator

https://github.com/CEDAR-project/Integrator

LSD Dimensions

http://lsd-dimensions.org/

TabCluster

https://github.com/CEDAR-project/TabCluster

Concept Drift

https://github.com/albertmeronyo/ConceptDrift

Semantic Correlation

http://csarven.ca/sense-of-lsd-analysis

http://www.cedar-project.nl/

Thank you
