bioshare: maelstrom research tools for data harmonization and co-analysis - isabel fortier -...

Maelstrom Research tools for harmonization and co-analysis

Isabel Fortier and Dany Doiron

July 28, 2015

Presentation summary

Why harmonize/co-analyse?
What is Maelstrom Research?
Link with BioSHaRE
Step-by-step harmonization/co-analysis approach and tools
• Define research question and objectives
• Assemble information and select studies
• Define variables and evaluate harmonization potential
• Process common format data
• Estimate quality of harmonized dataset
• Disseminate and preserve final harmonization products
• Co-analyse data

Why harmonize and co-analyse?

• Obtain the numbers and statistical power required to investigate, for example, gene-environment interactions and less common events.
• Better understand similarities and differences across studies or jurisdictions.
• Extend the scientific impact of individual studies (optimize return on the resources invested).

Increasing number of research programs harmonizing data (2000-2014)

Maelstrom Research

International research program created in 2012; enabled by partnerships established since 2004

• Funded by research and infrastructure grants and service contracts
• Head office in Montreal (RI-MUHC; 14 staff members), with core activities in the Netherlands, the United Kingdom and Canada

Maelstrom team:
• Generates methods and software to support data cataloguing, harmonization and analysis
• Conducts methodological research
• Creates web-based catalogues and harmonization platforms

Maelstrom Research: Link with BioSHaRE

BioSHaRE is one of the main founding partners of Maelstrom Research.

Maelstrom and BioSHaRE worked together to develop key tools to support data harmonization and co-analysis. Maelstrom will build on the tools and resources developed and ensure their continuity.

Key international data harmonization/federation projects using Maelstrom Tools

CPTP - Canadian Partnership for Tomorrow Project

InterConnect – a global initiative on diabetes gene-environment interaction

IALSA – Integrative Analysis of Longitudinal Studies on Aging

BBMRI-LPC – Biobanking and Biomolecular Resources Research Infrastructure, Large Prospective Cohorts

[Timeline figure: project periods shown on an axis from 2011 to 2018]

BioSHaRE – Biobank Standardization and Harmonization for Research Excellence in the European Union

SPIRIT - Sino-Quebec Perinatal Initiative in Research and Information Technology

Maelstrom Research Tools

DataSHaPER: a methodological approach to support data documentation, harmonization and integration (covered in this session)

Opal: database software for studies to manage and harmonize data (covered in V. Ferretti’s session)

Mica: an application to create web portals for individual studies or networks of studies (covered in V. Ferretti’s session)

DataSHIELD: a statistical method and software to perform pooled data analysis without sharing individual-level data, i.e. federated analyses (covered in P. Burton’s session)

DataSHaPER basic principles

Understand input data
Study designs; what data was collected and how; quality of study-specific data

Ensure rigour
Systematic harmonization process and quality control

Be transparent
Document how the harmonized data are created, to permit reproducibility and long-term usage

Retrospective data harmonization steps

Step 0: Define research question
Step 1: Assemble knowledge and select studies (output: study metadata catalogues)
Step 2: Define targeted variables and assess harmonization potential (output: set of variables for harmonization)
Step 3: Process data (inputs: Study A, Study B and Study C datasets, algorithm/statistical model; output: harmonized dataset(s))
Step 4: Estimate quality of harmonized dataset (input: harmonized dataset(s); output: valid harmonized dataset(s))
Step 5: Generate harmonized output (valid harmonized dataset(s) plus related documentation)
Step 6: Co-analyse data (across the valid harmonized datasets)

Supporting elements shown on the diagram: data access procedures; data infrastructure

Step 1: Assemble information and select studies

Study metadata catalogues

Step 2: Define variables and evaluate harmonization potential

• Define the list of core variables to be generated (DataSchema variables)
• Evaluate whether study-specific datasets can be used to generate each DataSchema variable

DataSchema variable, harmonization status by dataset:
Dataset A: status Complete
Dataset B: status Complete
Dataset C: status Incomplete (X)

Harmonization status

Harmonize only if quality is ensured
Not all studies can create all variables…

[Figure: Variable X across Study 1 to Study 5, contrasting quantity and precision]

Quantity = number of studies
Precision = good content equivalence


Example: How do we harmonize “highest level of education” attained?

Different educational systems in different countries/jurisdictions

Different assessment items asked by each study


Different educational systems in different jurisdictions

Dutch qualifications

1. Pre-vocational secondary education
2. Senior general secondary education
3. Junior general secondary education for adults
4. Senior general secondary education for adults
5. Vocational education, assistant level
6. Vocational education, basic level
7. Vocational education, professional training diploma
8. Vocational education, middle management training diploma
9. Vocational education, specialist training diploma
10. Bachelor
11. Master
12. Higher professional education, post-graduate course
13. University post-graduate professional degrees (e.g. degrees in medicine, dentistry, pharmacy)
14. Doctorate, PhD

British qualifications

Norwegian qualifications


Different assessment items asked by each study

FINRISK
NCDS
LifeLines
KORA


What categories/educational levels are equivalent (i.e. what can be harmonized)?

ISCED level 0: Early childhood education
ISCED level 1: Primary education
ISCED level 2: Lower secondary education
ISCED level 3: Upper secondary education
ISCED level 4: Post-secondary non-tertiary education
ISCED level 5: Short-cycle tertiary education
ISCED level 6: Bachelor’s or equivalent level
ISCED level 7: Master’s or equivalent level
ISCED level 8: Doctoral or equivalent level

Variable to be harmonized: Highest level of education attained

0  No education/Primary education (ISCED 0-1)
1  Secondary education (ISCED 2-3)
2  Higher education (including professional education, college and university) (ISCED 4-8)
9  Missing
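As a minimal sketch of the category collapse above, the Python function below maps an ISCED level to the three-category harmonized variable. It assumes each study's own qualification codes have already been classified to an ISCED level, a study-specific step that is not shown here; the function name `harmonize_education` is hypothetical.

```python
from typing import Optional

MISSING = 9  # code used for "Missing" in the harmonized variable


def harmonize_education(isced_level: Optional[int]) -> int:
    """Collapse an ISCED level (0-8) into the harmonized education variable."""
    if isced_level is None or not 0 <= isced_level <= 8:
        return MISSING  # unknown or out-of-range input is treated as missing
    if isced_level <= 1:
        return 0  # No education/Primary education (ISCED 0-1)
    if isced_level <= 3:
        return 1  # Secondary education (ISCED 2-3)
    return 2  # Higher education (ISCED 4-8)


print([harmonize_education(level) for level in (0, 3, 6, None)])
# prints [0, 1, 2, 9]
```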

Example: Alcohol quantity

DataSchema variable: Number of red wine drinks per week
Study variable: Number of red wine drinks

Study A
Period: week (7 days)
Unit: drinks/week
status: Complete

Study B
Period: weekdays (Sunday to Thursday); weekend days (Friday to Saturday)
Unit: per weekdays; per weekend days
status: Complete

Study C
Period: weekday day (working day); weekend day (non-working day)
Unit: per day
Comment: The number of drinks of alcohol is asked in separate questions for working days and non-working days, without specifying the number of days of each period.
status: ?
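The three study formats in this example can be sketched in Python as follows. This is an illustration under stated assumptions, not the processing scripts used by the studies: the function names are hypothetical, Study B's two answers are assumed to be totals over the Sunday-to-Thursday and Friday-to-Saturday periods, and Study C is left underived because the number of working and non-working days per week is not specified.

```python
from typing import Optional


def weekly_from_study_a(drinks_per_week: float) -> float:
    """Study A already reports drinks per 7-day week: direct mapping."""
    return drinks_per_week


def weekly_from_study_b(weekday_total: float, weekend_total: float) -> float:
    """Study B: the two periods together cover all 7 days, so the totals add up."""
    return weekday_total + weekend_total


def weekly_from_study_c(per_working_day: float,
                        per_non_working_day: float) -> Optional[float]:
    """Study C: per-day values, but the number of days in each period is unknown."""
    return None  # cannot be derived without an extra assumption


print(weekly_from_study_a(4), weekly_from_study_b(3, 2), weekly_from_study_c(1, 2))
# prints 4 5 None
```

Any choice for Study C, such as assuming five working days, would be an additional assumption that needs to be documented alongside the harmonized variable.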

Example: Mother breast cancer

Harmonized variable: Occurrence of breast cancer at any point during the life of the participant's biological mother.

Case 1
Question: Mother ever had cancer (Yes, No)
Cancer sites listed: Breast; Bladder; Brain; Bronchus and lung

Case 2
Question: Mother ever had (check, yes only)
Cancer sites listed: Breast; Bladder; Brain; Bronchus and lung
Impossible to differentiate ‘No’ from ‘missing’
status: ?

Example of factors to take into account when evaluating the harmonization potential

• Specific wording of the questions and categories
• Timing (e.g. when the information is collected)
• Procedures to collect information (e.g. measured or reported by the participant)
• Skip patterns (e.g. is the question addressed to the same population?)
• Response options (e.g. multiple/single responses)
• Information essential to the interpretation of the variable to be created (e.g. for biochemical measure «x», fasting status must be known)

Step 3: Process data

• Explore study data and assess its quality
• Where relevant, clean the data
• Develop statistical models or algorithms to transform study-specific data into a common variable format

Variable: Ever had sigmoidoscopy or colonoscopy

Case 1
Study variable: Ever had sigmoidoscopy or colonoscopy
Rule/script: Direct mapping from source variable

Case 2
Study variables: Ever had sigmoidoscopy; Ever had colonoscopy
Rule/script:
If HS_SIG_EVER = 1 OR HS_COL_EVER = 1 --> code to 1
If HS_SIG_EVER = 0 AND HS_COL_EVER = 0 --> code to 0
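The Case 2 rule can be written as a short Python function, given here as a minimal sketch. The variable names HS_SIG_EVER and HS_COL_EVER come from the rule; the function name and the choice to return `None` (missing) for combinations the rule does not cover, such as one 0 and one missing value, are assumptions.

```python
from typing import Optional


def ever_sig_or_col(hs_sig_ever: Optional[int],
                    hs_col_ever: Optional[int]) -> Optional[int]:
    """Derive 'ever had sigmoidoscopy or colonoscopy' from two source variables."""
    if hs_sig_ever == 1 or hs_col_ever == 1:
        return 1  # either procedure reported
    if hs_sig_ever == 0 and hs_col_ever == 0:
        return 0  # both procedures explicitly denied
    return None  # not covered by the rule: treated as missing (assumption)


print(ever_sig_or_col(1, 0), ever_sig_or_col(0, 0), ever_sig_or_col(0, None))
# prints 1 0 None
```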

Step 3: Process data

Algorithmic transformation
Continuous or categorical variables (or both) with different but combinable ranges or categories (e.g. education level)

Simple calibration model
Continuous metrics with a known calibration model (e.g. weight in kg or pounds)

Standardization model
Continuous constructs measured using different scales, with no calibration method or bridging items (e.g. two independent memory scales)

Latent variable model
Continuous constructs measured using different scales, with no calibration method but with bridging items (e.g. two memory scales with some common items)

Multiple imputation models
Continuous or categorical constructs measured using overlapping scales that allow missing values to be imputed (e.g. activities of daily living)
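Two of the simpler approaches listed above can be illustrated in Python. The sketch below shows a calibration with a known conversion (pounds to kilograms) and a within-study z-score as one possible standardization; the function names are hypothetical, and z-scoring is only one of several ways to put independent scales on a common metric.

```python
from statistics import mean, pstdev

LB_TO_KG = 0.45359237  # definition of the international pound in kilograms


def pounds_to_kg(weight_lb: float) -> float:
    """Simple calibration: known conversion between two units."""
    return weight_lb * LB_TO_KG


def standardize(scores: list) -> list:
    """Standardization: express each score in within-study standard deviations."""
    centre = mean(scores)
    spread = pstdev(scores)
    if spread == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(score - centre) / spread for score in scores]


print(round(pounds_to_kg(150), 2))
print([round(z, 3) for z in standardize([10, 20, 30])])
```

Standardizing within each study removes differences in scale but also removes true differences in level between study populations, which is worth keeping in mind when interpreting co-analysis results.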

Step 4: Estimate quality of harmonized dataset

Evaluate quality and adjust where required

Statistical output differs across studies: differences in populations? Differences in questionnaire wording? Or study-specific data quality?

ALC_RED_WINE_WEEK   Study 1       Study 2       Study 3      Study 4         Study 5
Min.                0             0             0            0               0
1st Qu.             0             2             0            1               0
Median              1             5             2            2               1
Mean                2.62          8.026         3.027        3.525           2.302
3rd Qu.             4             9             4            5               3
Max.                90            720           80           50              42
NA's                180 (2.3%)    672 (3.4%)    2950 (3%)    1842 (22.6%)    108 (0.6%)
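A simple automated screen over these summary statistics might look like the Python sketch below. The maxima and missing-value percentages are taken from the ALC_RED_WINE_WEEK summary; the two thresholds are hypothetical values chosen for illustration, and a flag only marks a study for closer review rather than establishing an error.

```python
# Summary values copied from the ALC_RED_WINE_WEEK table.
summary = {
    "Study 1": {"max": 90, "na_pct": 2.3},
    "Study 2": {"max": 720, "na_pct": 3.4},
    "Study 3": {"max": 80, "na_pct": 3.0},
    "Study 4": {"max": 50, "na_pct": 22.6},
    "Study 5": {"max": 42, "na_pct": 0.6},
}

MAX_PLAUSIBLE_DRINKS_PER_WEEK = 200  # hypothetical plausibility limit
MAX_NA_PCT = 10.0                    # hypothetical missingness limit


def flag_issues(stats: dict) -> list:
    """Return the list of quality flags raised by one study's summary."""
    flags = []
    if stats["max"] > MAX_PLAUSIBLE_DRINKS_PER_WEEK:
        flags.append("implausible maximum")
    if stats["na_pct"] > MAX_NA_PCT:
        flags.append("high missingness")
    return flags


flagged = {study: flag_issues(stats) for study, stats in summary.items()
           if flag_issues(stats)}
print(flagged)
# prints {'Study 2': ['implausible maximum'], 'Study 4': ['high missingness']}
```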

Step 5: Disseminate and preserve final harmonization products

Harmonized datasets and related documentation (information permitting proper understanding of the data process and quality)

[Figure: harmonized datasets of different sizes]
200 harmonized variables
50 harmonized variables
100 harmonized variables
30 harmonized variables

Step 6: Co-analyse data

Pooled analysis
• Data pooled in a central location and analyzed

Summary data meta-analysis
• Study-specific data analyses done locally, followed by a meta-analysis combining the study-level estimates

Federated analysis
• Analyses done centrally, but the individual-level participant data remain on local servers

[Figure: local procedures (Study 1) and central procedures, with study-level estimates]

Pooled analysis

Strength: high level of flexibility in statistical analyses, since data is centrally stored
Limitation: important governance, ethical and legal challenges to physically pooling data





Summary data meta-analysis

Strength: few ethics and data access requirements, since only study-level estimates are accessed
Limitation: not flexible. Analyses are limited to the summary statistics produced by each study; new questions or parameters require new study-specific estimates
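To make the combination step concrete, the Python sketch below pools study-level estimates with a fixed-effect inverse-variance meta-analysis. This is one common choice shown for illustration, not necessarily the model used in the projects described; the function name and the example numbers are hypothetical.

```python
from math import sqrt


def fixed_effect_meta(estimates: list, std_errors: list) -> tuple:
    """Pool study-level estimates, weighting each by 1 / standard error squared."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Illustrative numbers only: three studies each report an estimate and its standard error.
pooled, pooled_se = fixed_effect_meta([0.20, 0.35, 0.10], [0.10, 0.15, 0.08])
print(round(pooled, 3), round(pooled_se, 3))
```

Only the estimates and standard errors leave each study, which is why any new question requires each study to run a new local analysis.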


Statistical models and coefficients are shared between local and central servers, but data remains local



Federated analysis

Strength: studies retain complete control of their data
Limitation: time and effort to set up the infrastructure to support analyses
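The idea behind federated analysis can be shown with a toy Python example in which each study returns only non-identifying aggregates and the central site combines them into a pooled mean. This is a conceptual sketch with hypothetical function names and made-up values; it does not use or represent the API of any federated analysis software.

```python
def local_summary(values: list) -> dict:
    """Runs on a study's own server; only these aggregates leave the site."""
    return {"n": len(values), "total": sum(values)}


def central_mean(summaries: list) -> float:
    """Runs centrally; combines aggregates without seeing individual records."""
    n = sum(s["n"] for s in summaries)
    return sum(s["total"] for s in summaries) / n


# Illustrative values only: individual-level data stays inside each call.
study_1 = local_summary([1, 2, 3])
study_2 = local_summary([5])
print(central_mean([study_1, study_2]))
# prints 2.75
```

The result equals the mean that would be obtained from physically pooled data, while each study keeps its records on its own server.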

Acknowledgements

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n°261433 (Biobank Standardisation and Harmonisation for Research Excellence in the European Union - BioSHaRE-EU), the Ministère de l’Économie Innovation et Exportation du Québec, the Canadian Partnership Against Cancer, the National Institute on Aging, and the Research Institute of the McGill University Health Centre
