bioshare: maelstrom research tools for data harmonization and co-analysis - isabel fortier -...
TRANSCRIPT
Maelstrom Research tools for harmonization and co-analysis
Isabel Fortier and Dany Doiron
July 28, 2015
Presentation summary
Why harmonize/co-analyse?
What is Maelstrom Research?
Link with BioSHaRE
Step-by-step harmonization/co-analysis approach and tools
Define research question and objectives
Assemble information and select studies
Define variables and evaluate harmonization potential
Process common format data
Estimate quality of harmonized dataset
Disseminate and preserve final harmonization products
Co-analyse data
Why harmonize and co-analyse?
Obtain numbers and statistical power required to investigate, for example, gene-environment interactions and less common events.
Better understand similarities and differences across studies or jurisdictions.
Extend the scientific impact of individual studies (optimize return on the resources invested).
Increasing number of research programs harmonizing data (2000-2014)
Maelstrom Research
International research program created in 2012; enabled by partnerships established since 2004
Funded by research and infrastructure grants and services contracts
Head office in Montreal (RI-MUHC; 14 staff members), but core activities in The Netherlands, United Kingdom and Canada
Maelstrom team:
Generate methods and software to support data cataloguing, harmonization and analysis
Conduct methodological research
Create web-based catalogues and harmonization platforms
Maelstrom Research : Link with BioSHaRE
BioSHaRE, one of the main founding partners of Maelstrom Research
Maelstrom and BioSHaRE worked together to develop key tools supporting data harmonization and co-analysis; Maelstrom will help leverage these tools and resources and ensure their continuity
Key international data harmonization/federation projects using Maelstrom Tools
CPTP - Canadian Partnership for Tomorrow Project
InterConnect – a global initiative on diabetes gene-environment interaction
IALSA – Integrative Analysis of Longitudinal Studies on Aging
BBMRI-LPC – Biobanking and Biomolecular Resources Research Infrastructure – Large Prospective Cohorts
[Timeline axis: 2011 to 2018]
BioSHaRE – Biobank Standardization and Harmonization for Research Excellence in the European Union
SPIRIT - Sino-Quebec Perinatal Initiative in Research and Information Technology
Maelstrom Research Tools
Covered in V. Ferretti’s session
A statistical method and software to perform pooled data analysis without sharing individual-level data (federated analyses)
Mica: An application to create web portals for individual studies or networks of studies
A database software for studies to manage and harmonize data
A methodological approach to support data documentation, harmonization and integration
Covered in V. Ferretti’s session
Covered in this session
Covered in P. Burton’s session
DataSHaPER basic principles
Understand input data: study designs; what data were collected and how; quality of study-specific data
Ensure rigour: systematic harmonization process and quality control
Be transparent: document how the harmonized data are created to permit reproducibility and long-term use
Retrospective data harmonization steps
Step 0: Define research question
Step 1: Assemble knowledge and select studies (study metadata catalogues)
Step 2: Define targeted variables and assess harmonization potential (set of variables for harmonization)
Step 3: Process data (Study A, Study B and Study C datasets -> algorithm/statistical model -> harmonized dataset(s))
Data access procedures; data infrastructure
Retrospective data harmonization steps, Cont’d
Step 4: Estimate quality of harmonized dataset (harmonized dataset(s) -> valid harmonized dataset(s))
Retrospective data harmonization steps
Step 5: Generate harmonized output (valid harmonized dataset(s) + related documentation)
Step 6: Co-analyse data (valid harmonized datasets)
Step 1: Assemble information and select studies
Study metadata catalogues
Step 2: Define variables and evaluate harmonization potential
Define the list of core variables to be generated (DataSchema variables)
Evaluate if study-specific datasets could be used to generate the DataSchema variable
DataSchema variable: harmonization status
Dataset A: status: Complete
Dataset B: status: Complete
Dataset C: status: Incomplete
Should harmonize only if quality is ensured
Not all studies can create all variables…
[Figure: Variable X across Study 1 to Study 5, shown twice, with QUANTITY and PRECISION labels]
Quantity = Number of studies
Precision = Good content equivalence
Step 2: Define variables and evaluate harmonization potential
Example: How do we harmonize “highest level of education” attained?
Different educational systems in different countries/jurisdictions
Different assessment items asked by each study
Step 2: Define variables and evaluate harmonization potential
Different educational systems in different jurisdictions
Dutch qualifications
1. Pre vocational secondary education
2. Senior general secondary education
3. Junior general secondary education for adults
4. Senior general secondary education for adults
5. Vocational education, assistant level
6. Vocational education, basic level
7. Vocational education, professional training diploma
8. Vocational education, middle management training diploma
9. Vocational education, specialist training diploma
10. Bachelor
11. Master
12. Higher professional education, post graduate course
13. University post graduate professional degrees (e.g. degrees in medicine, dentistry, pharmacy)
14. Doctorate, PhD
British qualifications Norwegian qualifications
Step 2: Define variables and evaluate harmonization potential
Different assessment items asked by each study
FINRISK
NCDS
LifeLines
Kora
Step 2: Define variables and evaluate harmonization potential
What categories/educational levels are equivalent (i.e. what can be harmonized)?
ISCED level 0 Early childhood education
ISCED level 1 Primary education
ISCED level 2 Lower secondary education
ISCED level 3 Upper secondary education
ISCED level 4 Post-secondary non-tertiary education
ISCED level 5 Short-cycle tertiary education
ISCED level 6 Bachelor’s or equivalent level
ISCED level 7 Master’s or equivalent level
ISCED level 8 Doctoral or equivalent level
Variable to be harmonized: Highest level of education attained
0 No education/Primary education (ISCED 0-1)
1 Secondary education (ISCED 2-3)
2 Higher education (including professional education, college and university) (ISCED 4-8)
9 Missing
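The coding scheme above can be sketched as a small recoding function. This is only an illustration: it assumes each study-specific qualification has already been mapped to an ISCED level, and the function name and the handling of out-of-range values are hypothetical choices, not part of the presentation.

```python
from typing import Optional

MISSING = 9  # code used for missing values in the harmonized variable


def harmonize_education(isced_level: Optional[int]) -> int:
    """Map an ISCED level (0-8) to the three-category harmonized variable."""
    if isced_level is None:
        return MISSING
    if 0 <= isced_level <= 1:
        return 0  # No education/Primary education
    if 2 <= isced_level <= 3:
        return 1  # Secondary education
    if 4 <= isced_level <= 8:
        return 2  # Higher education
    return MISSING  # assumption: out-of-range values are treated as missing


print([harmonize_education(level) for level in (0, 3, 6, None)])
```

The harder part in practice is the earlier step: deciding which national qualification corresponds to which ISCED level.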
Example: Alcohol quantity
Variable: Number of red wine drinks
DataSchema variable: Number of red wine drinks per week
Study A: Period: Week (7 days). Unit: Drinks/week. status: Complete
Study B: Period: Week days (Sunday to Thursday), week end days (Friday to Saturday). Unit: Per weekdays, per weekend days. status: Complete
Study C: Period: Weekday (working day), week end day (non working day). Unit: Per day. status: ?
Comment: The number of drinks of alcohol is asked in separate questions for working days and non working days, without specifying the number of days of each period.
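A rough sketch of how the three collection formats could be turned into drinks per week. The function names are hypothetical, and the sketch assumes that Study B records totals for each period rather than daily averages. For Study C the weekly total cannot be derived without an explicit assumption about the number of working days, which is why its status is open.

```python
from typing import Optional


def study_a_weekly(drinks_per_week: float) -> float:
    # Study A already records drinks per 7-day week: direct mapping.
    return drinks_per_week


def study_b_weekly(weekday_total: float, weekend_total: float) -> float:
    # Assumption: Study B records totals for the weekday period (Sunday to
    # Thursday) and the weekend period (Friday to Saturday), which together
    # cover the 7-day week.
    return weekday_total + weekend_total


def study_c_weekly(per_working_day: float, per_non_working_day: float,
                   working_days: Optional[int] = None) -> Optional[float]:
    # Study C records drinks per day without the number of working days,
    # so a weekly total needs an explicit, documented assumption.
    if working_days is None:
        return None  # harmonization status stays undetermined
    if not 0 <= working_days <= 7:
        raise ValueError("working_days must be between 0 and 7")
    return per_working_day * working_days + per_non_working_day * (7 - working_days)


print(study_b_weekly(3, 4))
print(study_c_weekly(1, 2))
print(study_c_weekly(1, 2, working_days=5))
```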
Example: Mother breast cancer
Harmonized variable: Occurrence of breast cancer at any point during the life of the participant's biological mother.
Case 1: Mother ever had cancer (Yes, No): Breast, Bladder, Brain, Bronchus and lung
Case 2: Mother ever had (check, yes only): Breast, Bladder, Brain, Bronchus and lung
Case 2 status: ? (impossible to differentiate ‘No’ from ‘missing’)
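The difference between the two cases can be sketched as follows. The function names and input encodings are hypothetical; the point is only that a check-box item can yield a reliable "yes" but not a reliable "no".

```python
from typing import Optional


def case1_mother_breast_cancer(answer: Optional[str]) -> Optional[int]:
    # Case 1: explicit Yes/No item, so both values can be mapped directly.
    mapping = {"yes": 1, "no": 0}
    return mapping.get(answer.strip().lower()) if answer else None


def case2_mother_breast_cancer(box_checked: bool) -> Optional[int]:
    # Case 2: a check box records "yes" only. An unchecked box may mean
    # "no" or "not answered", so it is left undetermined (None).
    return 1 if box_checked else None


print(case1_mother_breast_cancer("No"), case2_mother_breast_cancer(False))
```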
Example of factors to take into account when evaluating the harmonization potential
Specific wording of the questions and categories
Timing (e.g. when the information is collected)
Procedures to collect information (e.g. measured or reported by the participant)
Skip patterns (e.g. is the question addressed to the same population?)
Response options (e.g. multiple/single responses)
Information essential to the interpretation of the variable to be created (e.g. biochemical measure «x» requires knowing the fasting status)
Step 3: Process data
Explore study data and assess its quality
Where relevant ensure data cleaning
Develop statistical models or algorithms to transform study-specific data into common variable format
Variable: Ever had sigmoidoscopy or colonoscopy
Study variables
Case 1: Ever had sigmoidoscopy or colonoscopy
Case 2: Ever had sigmoidoscopy; Ever had colonoscopy
Rule/script
Case 1: Direct mapping from source variable
Case 2: If HS_SIG_EVER = 1 OR HS_COL_EVER = 1 --> code to 1
        If HS_SIG_EVER = 0 AND HS_COL_EVER = 0 --> code to 0
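The Case 2 rule can be written as a short function. This is a sketch rather than the script used by the studies; the slide does not say what to do with other combinations (for example one 0 and one missing), so leaving them missing is an assumption.

```python
from typing import Optional


def ever_sig_or_col(hs_sig_ever: Optional[int],
                    hs_col_ever: Optional[int]) -> Optional[int]:
    # Rule from the slide: either procedure reported -> 1
    if hs_sig_ever == 1 or hs_col_ever == 1:
        return 1
    # Rule from the slide: both procedures explicitly not reported -> 0
    if hs_sig_ever == 0 and hs_col_ever == 0:
        return 0
    # Assumption: any other combination is left missing.
    return None


print(ever_sig_or_col(1, 0), ever_sig_or_col(0, 0), ever_sig_or_col(0, None))
```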
Step 3: Process data
Algorithmic transformation: Continuous and categorical variables or both with different but combinable ranges or categories (e.g. education level)
Simple calibration model: Continuous metrics with known calibration model (e.g. weight in kg or pounds)
Standardization model: Continuous constructs measured using different scales, with no calibration method or bridging items (e.g. two independent memory scales)
Latent variable model: Continuous constructs measured using different scales, with no calibration method but with bridging items (e.g. two memory scales, with some common items)
Multiple imputation models: Continuous or categorical constructs measured using overlapping scales that allow missing values to be imputed (e.g. activities of daily living)
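Two of the simpler processing models can be sketched in a few lines. The example values are made up, and within-study z-scores are only one possible way to standardize; latent variable and multiple imputation models need dedicated statistical software and are not shown.

```python
from statistics import mean, stdev
from typing import List

LB_TO_KG = 0.45359237  # exact definition of the international pound


def pounds_to_kg(weight_lb: float) -> float:
    # Simple calibration: known conversion between two units.
    return weight_lb * LB_TO_KG


def standardize(scores: List[float]) -> List[float]:
    # Standardization: z-scores within one study,
    # (x - mean) / sample standard deviation.
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]


print(round(pounds_to_kg(150), 2))
print(standardize([10, 12, 14]))
```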
Step 4: Estimate quality of harmonized dataset
Evaluate quality and adjust where required
Statistical output differs across studies: differences in populations? Differences in questionnaire wording? Or study-specific data quality?
ALC_RED_WINE_WEEK   Study 1       Study 2       Study 3      Study 4        Study 5
Min.                0             0             0            0              0
1st Qu.             0             2             0            1              0
Median              1             5             2            2              1
Mean                2.62          8.026         3.027        3.525          2.302
3rd Qu.             4             9             4            5              3
Max.                90            720           80           50             42
NA's                180 (2.3%)    672 (3.4%)    2950 (3%)    1842 (22.6%)   108 (0.6%)
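A minimal sketch of an automated check on such summaries, using the maxima and missing-value percentages from the ALC_RED_WINE_WEEK summary above. Both thresholds are hypothetical; in practice they would be set by people who know the variable and the studies.

```python
# Summary values copied from the ALC_RED_WINE_WEEK summary.
summaries = {
    "Study 1": {"max": 90, "na_pct": 2.3},
    "Study 2": {"max": 720, "na_pct": 3.4},
    "Study 3": {"max": 80, "na_pct": 3.0},
    "Study 4": {"max": 50, "na_pct": 22.6},
    "Study 5": {"max": 42, "na_pct": 0.6},
}

MAX_PLAUSIBLE = 200   # hypothetical plausibility limit, drinks per week
MAX_NA_PCT = 10.0     # hypothetical tolerance for missing values (%)


def flag_studies(study_summaries):
    """Return {study: [reasons]} for studies that need a closer look."""
    flags = {}
    for study, s in study_summaries.items():
        reasons = []
        if s["max"] > MAX_PLAUSIBLE:
            reasons.append("implausible maximum")
        if s["na_pct"] > MAX_NA_PCT:
            reasons.append("high proportion of missing values")
        if reasons:
            flags[study] = reasons
    return flags


print(flag_studies(summaries))
```

With these thresholds the check flags Study 2 (maximum of 720) and Study 4 (22.6% missing). A flag is a prompt to go back to the study, not a verdict on the data.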
Step 5: Disseminate and preserve final harmonization products
Harmonized datasets and related documentation (information permitting proper understanding of the data process and quality)
[Figure: harmonized datasets with 200, 50, 100 and 30 harmonized variables]
Step 6: Co-analyse data
Pooled analysis: Data pooled in a central location and analyzed
Summary data meta-analysis: Study-specific data analyses done locally, followed by a meta-analysis combining the study-level estimates
Federated analysis: Analyses done centrally, but the individual-level participant data remain on local servers
Step 6: Co-analyse data
Local procedures: Study 1
Local procedures: Study 2
Central procedures
Pooled analysis
Strength: high level of flexibility in statistical analyses since data is centrally stored
Limitation: important governance, ethical and legal challenges to physically pooling data
Step 6: Co-analyse data
Local procedures: Study 1
Local procedures: Study 2
Central procedures
Summary data meta-analysis
Strength: few ethics and data access requirements since only study-level estimates are accessed
Limitation: not flexible; analyses are limited to summary statistics produced by each study. New questions/parameters require new study-specific estimates
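One common way to combine study-level estimates is inverse-variance (fixed-effect) weighting, sketched below. The slides do not say which meta-analysis model is used, so this is only an illustration, and the estimates and standard errors are made up.

```python
import math
from typing import List, Tuple


def fixed_effect_meta(estimates: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Combine (estimate, standard error) pairs by inverse-variance weighting."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * est for w, (est, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Made-up study-level estimates (e.g. regression coefficients) and standard errors.
study_estimates = [(0.20, 0.10), (0.35, 0.20)]
pooled, pooled_se = fixed_effect_meta(study_estimates)
print(round(pooled, 3), round(pooled_se, 3))
```

Only the two numbers per study cross institutional boundaries, which is why the access requirements are light and also why any new question needs a new round of local analyses.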
Step 6: Co-analyse data
Statistical models and coefficients are shared between local and central servers, but data remains local
Local procedures: Study 1
Local procedures: Study 2
Federated analysis
Strength: Studies retain complete control of their data
Limitation: time and effort to set up the infrastructure to support analyses
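The idea of keeping data local can be illustrated with a toy example in which each study returns only aggregates and the central step combines them. This is not the protocol of any particular federated analysis software; the function names and data are hypothetical, and real systems add disclosure controls and iterate between servers to fit models.

```python
from typing import Dict, List


def local_summary(values: List[float]) -> Dict[str, float]:
    # Runs on each study's own server: only aggregates leave the site.
    return {"n": len(values), "total": sum(values)}


def central_mean(summaries: List[Dict[str, float]]) -> float:
    # Runs centrally: combines aggregates, never sees individual records.
    n = sum(s["n"] for s in summaries)
    return sum(s["total"] for s in summaries) / n


study_1 = [1.0, 2.0, 3.0]  # made-up individual-level data, kept local
study_2 = [4.0, 6.0]
print(central_mean([local_summary(study_1), local_summary(study_2)]))
```

The result equals the mean that would be obtained from physically pooled data, without any individual-level record leaving its study.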
Acknowledgements
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n°261433 (Biobank Standardisation and Harmonisation for Research Excellence in the European Union - BioSHaRE-EU), the Ministère de l’Économie Innovation et Exportation du Québec, the Canadian Partnership Against Cancer, the National Institute on Aging, and the Research Institute of the McGill University Health Centre