canadian health census to lod


Upload: syed-ahmad-chan-bukhari-phd

Post on 28-Jan-2018

TRANSCRIPT

Page 1: Canadian health census to lod

Exploiting the Canadian Health Census data as LOD

Ahmad Chan, 3449014

Topics in Web Science, CS3773

Winter Term 2013

Page 2: Canadian health census to lod

Motivation

●Government-collected statistical data (census data) contains important information.

●It can be exploited for needs assessment, to shape new policies, and for accountability.

●Governments around the world are increasingly releasing their information publicly.

●Inspired by www.data.gov, www.data.gov.uk, and http://opendata.ie/

Page 8: Canadian health census to lod

Problem Statement

●Available government data is unstructured and redundant, and is published as text files, Excel sheets, and similar formats.

●Analyzing the data and extracting comparative information is quite challenging.

●Valuable information for critical decision making can be derived from health census data.

●Instantly consumable datasets are needed to encourage data use.

●Interoperability, scalability, and usability cannot be achieved with conventional data formats.

Page 9: Canadian health census to lod

Detailed Goals for Project

●Acquire and refine the public health census data

●Transform it into RDF (Resource Description Framework), the flexible and interoperable format recommended by the W3C

●Integrate publicly available, well-known semantic vocabularies

●Tune the RDFized data according to LOD standards

●Provide a graphical front end for querying (SPARQL endpoint)

●Configure the linked open data explorer

●Hook it up with the LOD cloud

Page 10: Canadian health census to lod

What is Open Government Data (OGD), actually?

Page 11: Canadian health census to lod

Some concepts and definitions

Open data: data which you can use more or less freely. It is generally available on the web and uses non-proprietary formats such as XML and CSV.

Linked data: data which contains links to other datasets. Generally these links use URIs which can be resolved to discover more facts.

RDF (Resource Description Framework): a useful data structure for creating interoperable data. It has a number of file formats for exchanging this data; the most common is RDF/XML.

Linked Open Data (LOD): a common term for open data that is also linked; it is usually going to be in RDF too. The key thing is not to get put off by the linking. Add links when they provide value to your data and will help people using your data (yourself included) do more with it.

Page 12: Canadian health census to lod

Some concepts and definitions


Page 13: Canadian health census to lod

Methodology


Page 14: Canadian health census to lod

Data Acquisition Resources

Dataset | Detail | Source
Breastfeeding Practices | Breastfeeding practices, by age group of mothers, recent mothers aged 15 to 49, Canada and provinces | http://www.data.gc.ca
Breast Cancer Survival | Five-year survival estimates for breast cancer cases, by age group and sex, population aged 15 to 99 | http://www.statcan.gc.ca/
Treatable Diseases Death | Deaths due to medically treatable diseases, by selected causes of death, selected age groups and sex | http://www.statcan.gc.ca/
Smoking Practices | Changes in smoking between 1994/1995 and 2010/2011, household population aged 12 and over who reported on smoking every 2 years | http://www.data.gc.ca
Family Doctor Satisfaction | Patient satisfaction with most recent family doctor or other physician care received in past 12 months | http://www.statcan.gc.ca/
Kids Physical Activities | Children's participation in physical activities, in hours per week, by sex, household population aged 6 to 11 | http://www.statcan.gc.ca/
Health Indicator | Health indicator profile, annual estimates, by age group and sex, Canada, provinces | http://www.data.gc.ca

Page 15: Canadian health census to lod

Data Manipulation

●Data prescreening: manual cleaning to obtain quality data

●Deep data cleaning: deleting and merging columns

●Initial transformation: from unstructured to relational

●Tool used: Google Refine (desktop version)
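
The cleaning itself was done interactively in Google Refine. As a rough sketch of the same idea in code, the Python snippet below drops an unneeded column, discards incomplete rows, and loads what remains into a relational table. The column names, the sample rows, and the `observation` table are hypothetical and only illustrate the unstructured-to-relational step.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: column names and values are invented for illustration.
RAW_CSV = """Geography,Age group,Sex,Year,Value,Footnote
New Brunswick,20 to 24 years,Female,2010,1200,a
New Brunswick,25 to 29 years,Female,2010,,b
Canada,20 to 24 years,Female,2010,45000,a
"""


def clean_rows(text):
    """Drop the footnote column, skip rows with no value, strip whitespace."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        value = record["Value"].strip()
        if not value:  # prescreening: discard incomplete observations
            continue
        rows.append((
            record["Geography"].strip(),
            record["Age group"].strip(),
            record["Sex"].strip(),
            int(record["Year"]),
            float(value),
        ))
    return rows


def load_into_table(rows):
    """Create an in-memory relational table and insert the cleaned rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observation ("
        "location TEXT, age_group TEXT, gender TEXT, year INTEGER, value REAL)"
    )
    conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


rows = clean_rows(RAW_CSV)
conn = load_into_table(rows)
```

Once the data sits in a table like this, each row can be mapped to RDF in a later step.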


Page 16: Canadian health census to lod

RDF Foundry

●Transformation of the relational database to RDF

●Choosing the appropriate vocabularies

●Defining your own vocabularies where needed

●Programmatic mapping (D2R does not map according to your requirements)

●I tried D2R, Triplify, and Virtuoso (all three well-known tools); all have limitations
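
To make the idea of programmatic mapping concrete, here is a minimal Python sketch that turns one relational row into N-Triples lines. It is not the project's actual mapping code. The property names (`foaf:gender`, `foaf:age`, `dcterms:Location`, `dbpedia:year`, `dbpedia:statisticValue`) are the ones that appear in the sample queries of this deck, while the base URI, the table name, and the row layout are assumptions made for illustration.

```python
# Namespaces as listed in this deck's vocabulary table.
FOAF = "http://xmlns.com/foaf/0.1/"
DBPEDIA = "http://dbpedia.org/ontology/"
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical base URI for minted resources; not the project's real one.
BASE = "http://example.org/hc2lod/resource/"


def literal(value):
    """Render a value as a plain N-Triples string literal with basic escaping."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def row_to_ntriples(table, row_id, row):
    """Map one relational row (a dict) to a list of N-Triples lines."""
    subject = "<" + BASE + table + "/" + str(row_id) + ">"
    mapping = [
        (FOAF + "gender", row["gender"]),
        (FOAF + "age", row["age_group"]),
        (DCTERMS + "Location", row["location"]),
        (DBPEDIA + "year", row["year"]),
        (DBPEDIA + "statisticValue", row["value"]),
    ]
    return [f"{subject} <{pred}> {literal(obj)} ." for pred, obj in mapping]


example_row = {
    "gender": "Female",
    "age_group": "20 to 24 years",
    "location": "New Brunswick",
    "year": 2010,
    "value": 1200,
}
triples = row_to_ntriples("smoking", 1, example_row)
```

Writing the mapping by hand like this gives full control over which vocabulary term each column uses, which is the kind of control the generic tools did not offer here.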


Page 17: Canadian health census to lod

Semantic Vocabularies Used

Ontology/Vocabulary | Prefix | Namespace
FOAF: Friend Of A Friend | foaf | http://xmlns.com/foaf/0.1/
DBpedia Ontology | dbpedia | http://dbpedia.org/ontology/
Dublin Core | dc | http://purl.org/dc/elements/1.1/
Dublin Core Terms | dcterms | http://purl.org/dc/terms/
SIOC Ontology | sioc | http://rdfs.org/sioc/ns#
SKOS Ontology | skos | http://www.w3.org/2004/02/skos/core#
Time Ontology | time | http://www.w3.org/TR/owl-time/
Relationship Ontology | rel | http://vocab.org/relationship/
Biography Ontology | bio | http://vocab.org/bio/0.1/
Hc2lod Ontology | hc2lod | http://cbakerlab.unbsj.ca/ontologies/hc2lod.owl
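
A query that uses prefixed names such as `foaf:gender` needs matching `PREFIX` declarations unless the query front end injects them. The following Python sketch builds such a header from a prefix map and expands a prefixed name to a full URI. It covers only a subset of the vocabularies listed above, and the helper names `sparql_prefix_header` and `expand` are made up for this example.

```python
# Subset of the prefix-to-namespace pairs listed in the table above.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbpedia": "http://dbpedia.org/ontology/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "sioc": "http://rdfs.org/sioc/ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def sparql_prefix_header(prefixes):
    """Return one 'PREFIX p: <ns>' line per entry, sorted by prefix."""
    return "\n".join(f"PREFIX {p}: <{ns}>" for p, ns in sorted(prefixes.items()))


def expand(qname, prefixes):
    """Expand a prefixed name such as 'foaf:gender' to a full URI."""
    prefix, _, local = qname.partition(":")
    return prefixes[prefix] + local


header = sparql_prefix_header(PREFIXES)
gender_uri = expand("foaf:gender", PREFIXES)
```

Prepending `header` to a query body yields a self-contained query that does not rely on the front end's predefined prefixes.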


Page 18: Canadian health census to lod

RDBMS to RDF Mapping (a view)


Page 19: Canadian health census to lod

Exposing & Integration

●At this stage, I configured Pubby and Snorql

●Pubby is a well-known LOD explorer

●Snorql provides the SPARQL endpoint front end for querying

●I uploaded the data files to CKAN, a registry of LOD datasets, after getting permission from the LOD cloud admins

●Set up a GUI dashboard
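
Besides the browser front end, a SPARQL endpoint can usually be queried over plain HTTP by sending the query text in a `query` parameter, as defined by the SPARQL Protocol. The Python sketch below only builds such a request URL and makes no network call. The endpoint address is a placeholder, since the exact URL of the endpoint behind Snorql is not stated here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint address (an assumption): Snorql normally sits in front
# of a SPARQL endpoint, but this project's endpoint URL is not given here.
ENDPOINT = "http://example.org/hc2lod/sparql"


def build_query_url(endpoint, query):
    """Return a SPARQL Protocol GET URL carrying the query in 'query'."""
    return endpoint + "?" + urlencode({"query": query})


query = 'SELECT ?s WHERE { ?s <http://xmlns.com/foaf/0.1/gender> "Female" } LIMIT 5'
url = build_query_url(ENDPOINT, query)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

The resulting URL could then be fetched with any HTTP client, typically with an `Accept` header asking for a SPARQL results format.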


Page 20: Canadian health census to lod

Some Sample queries

●SPARQL Query 1: Show the years and the number of breast cancer patients reported as survivors among females aged 40 to 49 years.

SELECT DISTINCT ?year ?value
WHERE {
  ?patient foaf:gender "Female" .
  ?patient dbpedia:unitCost "Number of cases" .
  ?patient dbpedia:statisticValue ?value .
  ?patient dbpedia:year ?year .
  ?patient foaf:age "40 to 49 years" .
  ?patient rdf:type akt:person-being-visited .
}
ORDER BY DESC(?value)

Page 21: Canadian health census to lod

Some Sample queries

●SPARQL Query 2: Give the total number of breastfeeding mothers from New Brunswick aged 20 to 24.

SELECT (COUNT(*) AS ?total)
WHERE {
  ?person dcterms:Location "New Brunswick" .
  ?person rdf:type bio:immediatelyPrecedingEvent .
  ?person foaf:age "20 to 24 years" .
}

Page 22: Canadian health census to lod

Some Sample queries

●SPARQL Query 3: Show the number of deaths reported due to gallbladder and prostate cancer among male patients across Canada during 2001 to 2003.

SELECT DISTINCT ?death ?cancerType
WHERE {
  ?person foaf:gender "Male" .
  ?person dbpedia:part ?cancerType .
  FILTER (?cancerType = "Gallbladder" || ?cancerType = "Prostate")
  ?person dbpedia:statisticValue ?death .
  ?person dbpedia:year "2001 to 2003" .
  ?person rdf:type akt:Knowledge-Lifecycle .
}
LIMIT 50

Page 23: Canadian health census to lod

Some Sample queries

●SPARQL Query 4: Display the age group of female individuals from New Brunswick with the highest reported smoking value.

SELECT DISTINCT ?AgeGroup
WHERE {
  ?person rdf:type dbpedia:Activity .
  ?person foaf:gender "Female" .
  ?person dcterms:Location "New Brunswick" .
  {
    SELECT ?statval
    WHERE {
      ?person rdf:type dbpedia:Activity .
      ?person foaf:gender "Female" .
      ?person dcterms:Location "New Brunswick" .
      ?person dbpedia:statisticValue ?statval .
    }
    ORDER BY DESC(?statval)
    LIMIT 1
  }
  ?person dbpedia:statisticValue ?statval .
  ?person foaf:age ?AgeGroup .
}
LIMIT 10
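
Query 4 works in two steps: the inner SELECT finds the single highest statistic value among the matching records, and the outer pattern returns the age group of every record that carries that value. The Python sketch below mirrors that logic over a small in-memory list. The records and their values are invented purely for illustration and are not taken from the census data.

```python
# Hypothetical observations; the values are invented for illustration only.
observations = [
    {"gender": "Female", "location": "New Brunswick", "age": "12 to 19 years", "value": 11.0},
    {"gender": "Female", "location": "New Brunswick", "age": "20 to 24 years", "value": 31.5},
    {"gender": "Female", "location": "New Brunswick", "age": "25 to 34 years", "value": 27.2},
    {"gender": "Male", "location": "New Brunswick", "age": "20 to 24 years", "value": 35.0},
    {"gender": "Female", "location": "Ontario", "age": "25 to 34 years", "value": 40.1},
]


def top_age_groups(rows, gender, location):
    """Return the age group(s) whose value equals the maximum for the filter."""
    matching = [r for r in rows if r["gender"] == gender and r["location"] == location]
    if not matching:
        return []
    # Inner step: like the subquery with ORDER BY DESC(?statval) LIMIT 1.
    best = max(r["value"] for r in matching)
    # Outer step: like SELECT DISTINCT ?AgeGroup joined on the same ?statval.
    return sorted({r["age"] for r in matching if r["value"] == best})


result = top_age_groups(observations, "Female", "New Brunswick")
```

Note that the comparison here is numeric; in the SPARQL version the ordering depends on how the statistic values are typed in the RDF data.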


Page 24: Canadian health census to lod

Demo Screenshots


Page 25: Canadian health census to lod

Tools used

vocabulary publishing platform for the Web of Data

●Snorql

●Pubby

●Joseki

●Jena

●JSP

Page 26: Canadian health census to lod

Tips, Tricks and Resources

●Finding an appropriate existing RDF vocabulary: http://ws.nju.edu.cn/falcons/objectsearch/index.jsp (Falcons semantic search engine)

●http://lov.okfn.org/dataset/lov/ (Linked Open Vocabularies)

●http://datacatalogs.org/ (worldwide open data sets)

●http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSRDF (easy tool for LOD and open data)

Page 27: Canadian health census to lod

Conclusions

●The goal was to transform the raw health census data to LOD and link it with the LOD cloud.

●Demo page is available: http://cbakerlab.unbsj.ca:8080/hc2lod/index1.jsp

●SPARQL endpoint: http://cbakerlab.unbsj.ca:8080/hc2lod/snorql/

●CH2LOD explorer: http://cbakerlab.unbsj.ca:8080/hc2lod/

●CKAN data hub of LOD: http://datahub.io/dataset/ch2lod

Page 28: Canadian health census to lod

Next Steps / Future Work

●Extend with more datasets from the health domain

●Try to define LOD quality metrics

●Integrate a visualization tool with the SPARQL endpoint

●Add an additional layer for LODD

Page 29: Canadian health census to lod

Critical Commentary

●Availability of open data

●Most available health census data is redundant and incomplete

●No LOD logical schema builder is available

●There are no hard-and-fast criteria for measuring data quality (provenance issue)

●Lack of well-known vocabularies that match your domain

Page 30: Canadian health census to lod

Interesting Facts


Facts derived from Health census data

Page 34: Canadian health census to lod

References

1. Improving access to government through better use of the web (2009). URL http://www.w3.org/TR/egov-improving/

2. C. Bizer, R. Cyganiak, T. Heath, How to publish linked data on the web. Retrieved February 11, 2013 from http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/

3. S. Auer, S. Dietzold, J. Lehmann, S. Hellmann, D. Aumueller, Triplify: light-weight linked data publication from relational databases. In: WWW '09: Proceedings of the 18th International Conference on World Wide Web, ACM, New York, NY, USA (2009): pp. 621–630.

4. C. Bizer, A. Seaborne, D2RQ: treating non-RDF databases as virtual RDF graphs (2004).

5. O. Erling, I. Mikhailov, RDF support in the Virtuoso DBMS. Networked Knowledge - Networked Media (2009): pp. 7–24.

6. J. Hendler, J. Holm, C. Musialek, G. Thomas, US Government Linked Open Data: Semantic.data.gov, Intelligent Systems (2012), 27 (3): pp. 25–31.

7. F. Zhichun, P. Christen, M. Boot, Automatic Cleaning and Linking of Historical Census Data Using Household Information, IEEE 11th International Conference on Data Mining Workshops (ICDMW) (2011): pp. 413–420.

8. J. D. Fernández, M. A. Martínez-Prieto, C. Gutiérrez, Publishing open statistical data: the Spanish census, Proceedings of the 12th Annual International Digital Government Research Conference: Digital Government Innovation in Challenging Times (2011): pp. 20–25.

Page 35: Canadian health census to lod

Thanks

Any questions?