Open Data Repositories and Big Data ARD Prasad DRTC, Indian Statistical Institute Bangalore


Post on 19-Dec-2015



Open Data Repositories and Big Data

ARD Prasad
DRTC, Indian Statistical Institute, Bangalore


Our Interest

We are working on (semantics of Big Data)
• Structured Data
• Hosting data repositories
• Metadata of Big Data
• Ontology
• Linked Open Data (LOD)

Presently NOT working on
• Unstructured Data (especially on social networks)
• Data Analytics


Semantic Web

The Semantic Web can be realized only when the web provides answers to queries rather than web pages


Scope

The Semantic Web is a Web of Data ... The collection of Semantic Web technologies (RDF, OWL, SKOS, SPARQL, etc.) provides an environment where applications can query that data, draw inferences using vocabularies, etc.

Source: W3C, Linked Data (http://www.w3.org/standards/semanticweb/data)
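As a rough illustration of what "querying that data" can mean, the sketch below keeps a few subject-predicate-object statements in plain Python and answers a single pattern query, loosely imitating what a SPARQL engine does over RDF. The `ex:` identifiers and the `match` helper are invented for this example; a real application would use an RDF store and SPARQL.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# All identifiers (ex:...) are hypothetical, for illustration only.
TRIPLES = [
    ("ex:DRTC", "ex:partOf", "ex:ISIBangalore"),
    ("ex:ISIBangalore", "ex:locatedIn", "ex:Bangalore"),
    ("ex:Bangalore", "ex:locatedIn", "ex:India"),
]


def match(pattern, triples=TRIPLES):
    """Return variable bindings for one triple pattern.

    Terms starting with '?' are variables; any other term must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.get(term, value) != value:
                    break  # same variable would need two different values
                binding[term] = value
            elif term != value:
                break  # constant term does not match this triple
        else:
            results.append(binding)
    return results


# "What is located where?" The result is a list of answers, not web pages.
answers = match(("?place", "ex:locatedIn", "?where"))
```

Each answer is a dictionary of variable bindings, which is roughly the shape of a SPARQL SELECT result row.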


Metadata

Of
• Books/Documents (Dublin Core etc.)
• Products
• Events
• Processes
• Organizations
• Individuals (FOAF)
• Keywords (Ontology)
• Preservation (PREMIS)
• DATA
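To make the document-metadata case concrete, here is a minimal sketch of a Dublin Core description serialized as XML with Python's standard library. The namespace URI is the standard Dublin Core element set; the wrapping `record` element, the `dublin_core_record` helper, and the `subject` value are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC)


def dublin_core_record(fields):
    """Build an XML record with one dc:* element per (name, value) pair.

    The <record> wrapper is a hypothetical container, not part of Dublin Core.
    """
    record = ET.Element("record")
    for name, value in fields:
        ET.SubElement(record, f"{{{DC}}}{name}").text = value
    return record


record = dublin_core_record([
    ("title", "Open Data Repositories and Big Data"),
    ("creator", "ARD Prasad"),
    ("subject", "Open data"),  # illustrative keyword, not taken from a catalogue
])
xml_text = ET.tostring(record, encoding="unicode")
```

The same pattern would apply to other schemas on the slide (FOAF, PREMIS) by switching the namespace and element names.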


Looking Back

Open Access to Information (OAI)

• A fairly successful movement, which resulted in
  – Open Access Repositories (> 2000)
  – Open Access Journals (> 5000)
• Partially bridging the digital divide in Social, Physical, Natural Sciences and Humanities


Nature of Publications

Many publications use data. The actual article may not include the complete data used:
• For lack of space
• The author might have overlooked the data
• The author deliberately did not present the data, so that others cannot verify it


For Example

Some suspect that Sigmund Freud's data concerns fictitious persons, not merely fictitious names


If data is available ...

• Others may draw conclusions that contradict those of the author
• Others may deal with other facets of the data
• Data transparency supplements the objectivity and self-corrective characteristics of Science

If “Case history of patients” is openly available, it will contribute significantly to medical research


Digital Divide
• Social Sciences do not require laboratory infrastructure
• However, physical and natural sciences do require expensive infrastructure
• If experimental data is available to scientists who do not have infrastructure, it will significantly reduce the digital divide in Physical and Natural Sciences

ODA is a step toward transparency and quality in science


For Example

• Human Genome data
• Data from Accelerator Labs (CERN)
• Recent controversy about particles moving faster than light
• Not surprisingly, astronomy data was openly available even before the OA movement


Features of Open Data Repositories

• Metadata: specify who is the owner, creator, etc.
• Licence: license the data, waiving your rights, to facilitate bulk download of open data
• Technology Tools: automate data extraction
• Ontology: index data


Licences

Creative Commons licences (apart from CCZero), GPL, BSD, etc. are NOT quite appropriate as open data licences


Open Data Licences

• Open Data Commons Public Domain Dedication and Licence (PDDL)
  – Dedicate to the Public Domain (all rights waived)
• Open Data Commons Attribution License
  – Attribution for data(bases)
• Open Data Commons Open Database License (ODbL)
  – Attribution-ShareAlike for data(bases)
• Creative Commons CCZero
  – Dedicate to the Public Domain (all rights waived)
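The four licences above differ only in two obligations, attribution and share-alike, so a repository could record them as simple flags. The sketch below encodes exactly the summary given on this slide; the short identifiers, field names, and the `obligations` helper are assumptions for illustration, and none of this is legal advice.

```python
# Obligations as summarised on the slide (illustrative encoding only).
LICENCES = {
    "PDDL": {
        "name": "Open Data Commons Public Domain Dedication and Licence",
        "attribution": False,
        "share_alike": False,
    },
    "ODC-By": {
        "name": "Open Data Commons Attribution License",
        "attribution": True,
        "share_alike": False,
    },
    "ODbL": {
        "name": "Open Data Commons Open Database License",
        "attribution": True,
        "share_alike": True,
    },
    "CC0": {
        "name": "Creative Commons CCZero",
        "attribution": False,
        "share_alike": False,
    },
}


def obligations(licence_id):
    """List what a re-user must do under the given licence."""
    licence = LICENCES[licence_id]
    duties = []
    if licence["attribution"]:
        duties.append("attribute the source")
    if licence["share_alike"]:
        duties.append("share derived databases under the same licence")
    return duties or ["none (public domain dedication)"]
```

For example, `obligations("ODbL")` lists both duties, while `obligations("CC0")` reports that nothing is required.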


Amazon Web Services (AWS)

Public Data Sets on AWS
• Annotated Human Genome Data provided by ENSEMBL
  – The Ensembl project produces genome databases for human as well as almost 50 other species, and makes this information freely available.
• Various US Census Databases from The US Census Bureau
  – Demographic data
  – US Censuses
  – Summary information about Business and Industry
  – Economic Household Profile Data
• UniGene provided by the National Center for Biotechnology Information


Data Repositories by Governments

Many countries host their data on data.gov portals in various formats such as RDF, XLS, JSON, CSV, etc.

Ex:
• www.data.gov.in (India)
• www.data.gov (USA)
• www.data.gov.au (Australia)
• www.data.gov.uk (UK)
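Because the same dataset is often offered in several of these formats, converting between them is a routine step. The sketch below turns CSV text into JSON using only the Python standard library. The sample rows and column names are placeholders, not data from any of the portals listed.

```python
import csv
import io
import json

# Placeholder sample; real portal files have their own columns and values.
SAMPLE_CSV = """region,year,value
A,2011,10
B,2011,20
"""


def csv_to_json(csv_text):
    """Convert CSV text (first row = header) into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


json_text = csv_to_json(SAMPLE_CSV)
```

Note that `csv.DictReader` reads every field as a string; any typing of numeric columns would be a separate step.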


Registry of Data Repositories

Popular data registries: Databib and re3data.org

Databib connects to 978 data repositories and databases (agriculture, geosciences, social sciences, biological sciences)

re3data.org currently lists 634 research data repositories from different disciplines; 586 of these are described in detail using the re3data.org schema.

In future, Databib and re3data.org are likely to merge into one service.

Note: A registry entry provides only the URL of the data repository and a brief description of it. One has to visit the repository and download the data manually. Again, there is no protocol to expose the metadata of data providers


Digital Curation

• Collecting verifiable digital assets
• Providing digital asset search and retrieval
• Certification of the trustworthiness and integrity of the collection content
• Semantic and ontological continuity and comparability of the collection content
• Use of open standards (formats) for long-term preservation and future proofing by migration of data
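One common way to support the integrity point above is a fixity check: record a cryptographic digest when a file is ingested and recompute it later. The sketch below uses SHA-256 from Python's `hashlib`; the function names and the sample bytes are assumptions for illustration, not a description of any particular repository's workflow.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, recorded_digest: str) -> bool:
    """True if the data still matches the digest recorded at ingest time."""
    return sha256_of(data) == recorded_digest


# Hypothetical dataset content and the digest stored alongside it at ingest.
original = b"station,temp\nA,21.5\n"
digest_at_ingest = sha256_of(original)

unchanged_ok = verify(original, digest_at_ingest)
tampered_ok = verify(original + b"B,99.9\n", digest_at_ingest)
```

Here `unchanged_ok` is true and `tampered_ok` is false, since any change to the bytes changes the digest.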


Technology

• Data repositories are much larger than OA repositories
• Cloud Computing is a good solution (AWS uses)
• Hadoop
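Hadoop is built around the MapReduce programming model. The sketch below imitates that model (map, shuffle, reduce) in a single Python process with a word count. It is only a conceptual illustration: Hadoop's own API is different and runs distributed across many machines, and all function names here are invented for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["open data", "big data", "open access"])
```

In a real cluster the map and reduce functions run in parallel on different nodes, and the shuffle moves data between them over the network.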


Resource Description in terms of Metadata and Ontology

RDF: Resource Description Framework

SKOS: Simple Knowledge Organization System

OWL: Web Ontology Language

SPARQL: SPARQL Protocol and RDF Query Language
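As a small example of the kind of inference a vocabulary enables, the sketch below follows direct "broader" links between concepts to collect all ancestors, in the spirit of SKOS's `skos:broaderTransitive`. The concept names and the `BROADER` mapping are a made-up scheme for illustration, written in plain Python rather than RDF.

```python
# Hypothetical concept scheme: each concept maps to its direct broader concept.
BROADER = {
    "Genomics": "Biological sciences",
    "Biological sciences": "Natural sciences",
    "Astronomy": "Natural sciences",
}


def broader_transitive(concept, broader=BROADER):
    """Follow broader links upward and return all ancestors, nearest first."""
    ancestors = []
    seen = {concept}
    while concept in broader:
        concept = broader[concept]
        if concept in seen:
            break  # guard against accidental cycles in the scheme
        seen.add(concept)
        ancestors.append(concept)
    return ancestors


genomics_ancestors = broader_transitive("Genomics")
```

A search for "Natural sciences" could then also return datasets indexed under "Genomics", without that link being stated explicitly for each dataset.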


NoSQL DBMS

Key/Value Based: Redis, MemcacheDB, etc.

Column Based: Cassandra, HBase, etc.

Document Based: MongoDB, Couchbase, etc.

Graph Based: AllegroGraph, Neo4J, etc.
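The four families differ mainly in how a record is shaped and addressed. The sketch below models one made-up dataset description in each style using plain Python structures. It does not use any of the products named above, and the keys, field names, and values are all hypothetical.

```python
# Key/value: opaque values looked up by a single flat key.
key_value = {
    "dataset:42:title": "Sample dataset",
    "dataset:42:format": "CSV",
}

# Column based: row key -> column family -> column -> value.
column_family = {
    "dataset:42": {
        "info": {"title": "Sample dataset"},
        "files": {"format": "CSV"},
    },
}

# Document based: one self-contained nested document per record.
document = {"_id": 42, "title": "Sample dataset", "files": [{"format": "CSV"}]}

# Graph based: nodes connected by labelled edges.
graph_edges = [
    ("dataset:42", "hasTitle", "Sample dataset"),
    ("dataset:42", "hasFormat", "CSV"),
]


def title_from_each():
    """Read the same title back out of each of the four shapes."""
    return [
        key_value["dataset:42:title"],
        column_family["dataset:42"]["info"]["title"],
        document["title"],
        next(o for s, p, o in graph_edges
             if s == "dataset:42" and p == "hasTitle"),
    ]
```

The graph shape is the closest to RDF triples, which is one reason graph stores are often used for ontologies and Linked Open Data.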


DBpedia Data Set

• Multi-domain ontology derived from Wikipedia
• 3.77 million “things” (entities - Entitypedia)
• 400 million “facts”
• Uses YAGO (Yet Another Great Ontology)
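DBpedia can be queried with SPARQL over HTTP. The sketch below only builds the request URL for its public endpoint, using the standard `query` parameter of the SPARQL Protocol, and does not send anything. The query itself (the types of the resource for India) and the `build_request_url` helper are illustrative choices, and the result format would still need to be negotiated when the request is actually made.

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"  # public DBpedia SPARQL endpoint

# Illustrative query: list a few types asserted for one DBpedia resource.
QUERY = """
SELECT ?type WHERE {
  <http://dbpedia.org/resource/India> a ?type .
} LIMIT 5
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as a GET URL (SPARQL Protocol 'query' parameter)."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(QUERY)
```

Fetching `url` with an HTTP client would return the matching "facts" for that "thing", subject to the endpoint's availability and limits.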


Entitypedia

• Multilingual controlled vocabulary
• Entity matching
• Data quality and type checking
• Entity type specific services
• Semantic or faceted search and navigation on entities
• Summarization of entities and concepts


DRTC Projects

Living Knowledge (European Commission funded project on semantic web, based on SRR's Analytico-Synthetic Classification; completed)

ITPAR: India-Trento Program for Advanced Research (Govts. of India & Italy; work on DERA; ongoing)

AgInfra: Agriculture data (European Commission funded; ongoing)


Immediate Plan

An international workshop on Big Data with ICSU/CODATA


Thank You