On-Demand Service-Based Big Data Integration: Optimized for Research Collaboration

Pradeeban Kathiravelu 1,2, Yiru Chen 3, Ashish Sharma 4, Helena Galhardas 1, Peter Van Roy 2, Luís Veiga 1

The 3rd International Workshop on Data Management and Analytics for Medicine and Healthcare (DMAH), in conjunction with the 43rd International Conference on Very Large Data Bases. Munich, Germany. September 1, 2017.

1 INESC-ID / Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
2 Université catholique de Louvain, Louvain-la-Neuve, Belgium
3 Peking University, Beijing, China
4 Department of Biomedical Informatics, Emory University, Atlanta, USA




Introduction

● Scale and diversity of big data are rising.
  – Geographically distributed data of exabytes.
  – Structured, semi-structured, unstructured, or ill-formed data.
● Integration of data is crucial for data science.
● Sharing of integrated data and results.
  – Mandatory for reproducible research.


Challenges in Medical Research for Big Data Integration

● Multiple types of data.
  – Imaging, clinical, and genomic.
● Numerous data sources.
  – No shared messaging protocol.

● Do we really need to integrate all the data?


A Story of Medical Data Researchers...


● Jim is interested in the effects of a medicine for treating brain tumors in patients of certain age groups.


Observation - 1

● Various sources.
  – Service-based data access through APIs.
    ● Thanks to specifications such as HL7 FHIR.
● The researchers possess domain knowledge.
● Integrate On-Demand.
  – Avoid eager loading of binary data or its textual metadata.
  – Use the researcher query as an input in loading data.
● Scalable storage in-house.
  – Potential to load, integrate, index, and query unstructured data.
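
The idea of using the researcher query as the input to loading can be illustrated with a minimal Python sketch. Everything named here (`Record`, `FakeSource`, `load_on_demand`, the sample records) is hypothetical and stands in for a real service API; the point is only that the filter is applied at the source, so records outside the query are never transferred.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator


@dataclass(frozen=True)
class Record:
    """Hypothetical metadata entry exposed by a data source API."""
    uid: str
    modality: str
    patient_age: int


class FakeSource:
    """In-memory stand-in for a remote service; counts transferred records."""

    def __init__(self, records: Iterable[Record]) -> None:
        self._records = list(records)
        self.transferred = 0

    def search(self, predicate: Callable[[Record], bool]) -> Iterator[Record]:
        # A real service would evaluate the filter server-side.
        for record in self._records:
            if predicate(record):
                self.transferred += 1
                yield record


def load_on_demand(source: FakeSource,
                   predicate: Callable[[Record], bool]) -> Dict[str, Record]:
    """Load only the metadata that matches the researcher query."""
    return {record.uid: record for record in source.search(predicate)}


source = FakeSource([
    Record("s1", "MR", 34),
    Record("s2", "CT", 61),
    Record("s3", "MR", 67),
    Record("s4", "MR", 72),
])

# Researcher query: MR series of patients aged 60 or older.
loaded = load_on_demand(source, lambda r: r.modality == "MR" and r.patient_age >= 60)
print(sorted(loaded))      # ['s3', 's4']
print(source.transferred)  # 2
```

An eager loader would have transferred all four records before any query ran; here only the two matching ones leave the source.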


● Paula has overlapping research interests with Jim.


Observation - 2

● Load data only once per organization.
  – Bandwidth and storage efficiency.


● Sharing the research data with researchers, beyond organization boundaries.


Observation - 3

● Do not duplicate data!
  – We "own" our interest; not the data.
● Point to the data in the data sources.
  – Pointers to data like Dropbox Shared Links work well.
    ● Avoids outdated duplicate data.
    ● Easy to maintain.
● APIs
  – Access the list of research data sets.
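
A rough sketch of sharing a pointer instead of the data follows. The `DatasetPointer` type, the `share` and `resolve` helpers, the example URL, and the in-memory `remote_store` are all assumptions made for illustration, not part of the system described in the slides. The pointer carries only a source location and identifiers, and the receiver resolves it against the source at access time, so it always sees the current data.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List


@dataclass
class DatasetPointer:
    """Hypothetical pointer: where the data lives and which items are meant."""
    source: str
    series_uids: List[str]


def share(pointer: DatasetPointer) -> str:
    """Serialize the pointer; this small string is all that gets shared."""
    return json.dumps(asdict(pointer), sort_keys=True)


def resolve(serialized: str,
            fetch: Callable[[str, str], bytes]) -> Dict[str, bytes]:
    """Resolve a shared pointer against the data source at access time."""
    pointer = DatasetPointer(**json.loads(serialized))
    return {uid: fetch(pointer.source, uid) for uid in pointer.series_uids}


# In-memory stand-in for the remote data source.
remote_store = {"series-1": b"v1 pixels", "series-2": b"v1 pixels"}


def fetch_from_store(source: str, uid: str) -> bytes:
    return remote_store[uid]


link = share(DatasetPointer("https://archive.example.org", ["series-1", "series-2"]))

# The source is updated after the pointer was shared.
remote_store["series-1"] = b"v2 pixels"

resolved = resolve(link, fetch_from_store)
print(resolved["series-1"])  # b'v2 pixels'
```

Because nothing was copied when the pointer was shared, there is no stale duplicate to maintain: the update at the source is visible to every holder of the pointer.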


Problems

● How to...
  – Load data from several service-based big data sources.
    ● Avoid duplicate downloads and near duplicate data.
  – Integrate disparate data and persist for future accesses.
  – Share pointers to data internally and externally.


Óbidos: On-demand Big Data Integration, Distribution, and Orchestration System

● Researcher query → Narrow down the search space.
● Define subsets of data that are of interest.
  – Exploiting the well-defined hierarchical structure of medical data.
    ● Medical Images (DICOM)
    ● Clinical data
    ● ..
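
One way to picture a subset defined over a hierarchy is as a set of path prefixes. The sketch below assumes a collection → patient → study → series hierarchy for imaging data; the level names, the sample identifiers, and the `is_of_interest` helper are illustrative assumptions, not the representation Óbidos itself uses.

```python
from typing import Set, Tuple

# A path through the assumed hierarchy: (collection, patient, study, series).
Path = Tuple[str, ...]


def is_of_interest(selection: Set[Path], path: Path) -> bool:
    """An item is of interest if any ancestor level (or the item itself) is selected."""
    return any(path[:depth] in selection for depth in range(1, len(path) + 1))


# Interest: one whole collection, plus a single study from another collection.
selection = {
    ("BrainTumor",),
    ("Lung", "patient-7", "study-2"),
}

in_collection = is_of_interest(selection, ("BrainTumor", "patient-1", "study-1", "series-3"))
in_study = is_of_interest(selection, ("Lung", "patient-7", "study-2", "series-1"))
outside = is_of_interest(selection, ("Lung", "patient-7", "study-9", "series-1"))
print(in_collection, in_study, outside)  # True True False
```

Selecting at a coarse level (a whole collection) or a fine level (one study) uses the same structure, which keeps the description of an interest small even when it covers many series.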


Óbidos Approach

● Hybrid of virtual and materialized data integration approaches.
  – Lazy load of metadata: Load the matching subset of metadata.
  – Store integrated data and query results → scalable storage.
● Track already loaded data.
  – Near duplicate detection.
  – Download only updates (changesets).
● Efficient SQL queries on NoSQL storage.
● Share pointers to the datasets rather than the dataset itself.
● Generic design; implementation for medical research data.
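
Tracking what is already loaded and fetching only the changes can be sketched as a comparison between a remote listing and a local index. The version numbers, the `changeset` and `sync` helpers, and the sample identifiers are hypothetical; near duplicate detection is not covered by this sketch.

```python
from typing import Callable, Dict, List


def changeset(remote: Dict[str, int], loaded: Dict[str, int]) -> Dict[str, int]:
    """Items that are new, or whose remote version differs from the loaded one."""
    return {uid: version for uid, version in remote.items()
            if loaded.get(uid) != version}


def sync(remote: Dict[str, int], loaded: Dict[str, int],
         download: Callable[[str], None]) -> List[str]:
    """Download only the changeset and record it in the local index."""
    pending = changeset(remote, loaded)
    for uid, version in pending.items():
        download(uid)
        loaded[uid] = version
    return sorted(pending)


remote_listing = {"s1": 1, "s2": 2, "s3": 1}   # uid -> version at the source
local_index = {"s1": 1, "s2": 1}               # uid -> version already loaded
downloaded: List[str] = []

first = sync(remote_listing, local_index, downloaded.append)
second = sync(remote_listing, local_index, downloaded.append)
print(first)   # ['s2', 's3']
print(second)  # []
```

The second call downloads nothing, which is the behaviour a shared, organization-wide index is meant to give a later researcher with an overlapping interest.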


Óbidos Architecture


Evaluation

● Evaluation Data:
  – Clinical data and DICOM imaging collections of TCIA.
● Benchmark Óbidos against eager and lazy ETL.
  – Performance of loading and querying data.
● Óbidos (inter- and intra-organization) against binary data sharing.
  – Space/bandwidth efficiency of data sharing.


Workload Characterization

[Figure: Various Entries in Evaluated Collections]


Data load time

Change in total data volume (Same query and same interest)

● Observation:
  – Load time increases for eager and lazy ETL with total volume.
  – Load time for Óbidos remains constant.
    ● Total volume of data is irrelevant for Óbidos.


Data load time

Change in studies of interest (Same query and constant total data volume)

● Observation:
  – Load time for eager and lazy ETL remains constant.
  – Load time increases for Óbidos with the interest.
    ● Converges to the load time of lazy ETL.


Query completion time for the integrated data repository

● Observation:
  – We assume the corresponding data is already loaded.
    ● Thus, lazy and eager ETL perform similarly.
  – Indexed scalable NoSQL architecture of Óbidos → Better performance.


Efficiency in Sharing Medical Research Data

● Observation:
  – Intra-organization: a constant-size UID is sufficient.
  – Inter-organization: Óbidos pointers grow with the number of series.
  – Traditional binary data sharing: shared data size = volume of the image series.
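
The three cases can be compared with simple arithmetic. The figures below are purely illustrative and are not measurements from the evaluation: the sketch assumes each identifier takes at most 64 bytes (the maximum length of a DICOM UID) and uses made-up series volumes.

```python
from typing import List

UID_BYTES = 64  # assumed upper bound per identifier (DICOM UIDs are at most 64 characters)


def intra_org_bytes(n_series: int) -> int:
    """Within an organization, one constant-size UID identifies the shared set."""
    return UID_BYTES


def inter_org_bytes(n_series: int) -> int:
    """Across organizations, the pointer list grows with the number of series."""
    return n_series * UID_BYTES


def binary_bytes(series_volumes: List[int]) -> int:
    """Traditional sharing ships the image series themselves."""
    return sum(series_volumes)


volumes = [50_000_000, 50_000_000, 50_000_000]  # three hypothetical 50 MB series
print(intra_org_bytes(len(volumes)))  # 64
print(inter_org_bytes(len(volumes)))  # 192
print(binary_bytes(volumes))          # 150000000
```

Under these assumptions the pointer-based cases scale with the count of series (or not at all), while binary sharing scales with the image volume.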


Conclusion

● Óbidos offers on-demand service-based big data integration.
  – Fast and resource-efficient data analysis.
  – SQL queries over NoSQL data store for the integrated data.
  – Efficient data sharing without duplicating actual data.
● Future Work
  – Consume data from repositories of domains beyond medical data.
    ● EUDAT
  – Óbidos distributed virtual data warehouses.
    ● Leverage the proximity of the organizations in data integration and sharing.


Acknowledgements

● NCI QIN grant (1U01CA187013, Resources for Development and Validation Of Radiomic Analyses and Adaptive Therapy).

● Google Summer of Code (2014, 2015, and 2016).
● The Cancer Imaging Archive (TCIA).
● Tyk and API Umbrella Teams.


Thank you! Questions?