The NIDDK Repositories Adding value to shared Session 5... · The NIDDK Repositories – Adding value…

Download The NIDDK Repositories Adding value to shared Session 5... · The NIDDK Repositories – Adding value…

Post on 26-Aug-2018




0 download


  • The NIDDK Repositories Adding value to shared resources

    Rebekah Rasooly NIDDK, NIH

    May, 2013

  • Central repository components

    Contract funded since 2003.

    Biosample repository (Fisher): archival storage of biological specimens Database repository (RTI): maintain archival datasets, respond to queries about data and stored samples Genetics repository (Rutgers Univ.): create immortalized cell lines, DNA extraction

  • Samples and data stored from >50 major multi-site clinical studies in diabetes, digestive, kidney, liver, and urologic diseases

    Each study collects according to its own protocols

    43 datasets available for sharing

    23 GWAS datasets available for sharing through dbGAP

    DNA and/or biosamples available for sharing from 28 studies

    The NIDDK Central Repositories holdings:

    Biosample Repository Genetics Repository Affiliated Repositories

    Total samples 7,384,858 113,057 560,191

  • Types of studies Diabetes and Obesity Studies DCCT/EDIC (The Type 1 Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications) DPP (Diabetes Prevention Program) DPPOS (The Diabetes Prevention Program Outcome Study) LookAHEAD (Action for Health in Diabetes) HEALTHY (Middle-School Based Primary Prevention Trial of Type 2 Diabetes) TrialNet - TN01 (NATURAL HISTORY STUDY OF THE DEVELOPMENT OF TYPE 1 DIABETES) TEDDY (The Environmental Determinants of Diabetes in the Young)

    Kidney Studies AASK Trial (The African American Study of Kidney Disease and Hypertension Study) CRIC (Chronic Renal Insufficiency Cohort Study) MDRD (The Modification of Diet in Renal Disease)

    Liver Disease Studies A2ALL (The Adult-to-Adult Living Donor Liver Transplantation Cohort Study) HALT-C (The Hepatitis C Antiviral Long-term Treatment against Cirrhosis) VIRAHEP-C (The Study of Viral Resistance to Antiviral Therapy of Chronic Hepatitis C)

    Urology Studies MTOPS (The Medical Therapy of Prostatic Symptoms) SISTEr (The Stress Incontinence Surgical Treatment Efficacy Trial)

  • NIDDK has custodianship of all samples and data transferred to the Repositories and no IP protections are attached

    The Steering Committee of each study or study group has control of the samples and data during a proprietary period (2 years after the end of the study or study increment)

    Expensive resources have to be useful: the Repositories sharing policies

  • Remove all identifiers, except some elements of dates (Limited data set)

    Collect all forms, MOPs, key papers and analytic datasets from those papers

    Reconcile sample list with phenotypic data

    Carry out Dataset Integrity Check (DSIC) process

    Curating studies

  • Curating studies collect all forms, etc.

  • Curating studies perform Dataset Integrity Check

    verify that published results from the study can be reproduced using the archived datasets

    perform a small number of analyses to duplicate published results intent is to provide confidence that the dataset distributed by the

    NIDDK repository is a true copy of the study data does not attempt to resolve minor or inconsequential discrepancies

    with published results

  • using the CaBIG Common Biorepository Model (CBM) of 30

    variables an additional 140 variables that are domain- or study-specific

    Curation for common variables

    Index Variable_Name Description CBM (curated for

    all studies)







    Study_Specific* Other_Common**

    1 Ethnicity Ethnicity x x x x

    2 Gender Gender x x x x

    3 Race Race x x x x

    4 ace_arb Use of antihypertensives (ACE inhibitors, ARBs) x x x

    5 acr Albumin to creatinine ratio x

    6 add_dx prior addisons disease x

    7 aer Albumin Excretion Rate x

    8 age Age x x x x

    9 age_transplant Age at transplant x x

    10 ageatonset Age at IDDM onset x

    11 agegroup Age group x

    12 aki Acute kidney injury (aka ARF) x x

    13 alcohol Frequent alcohol use x x

    14 assign Treatment group x

    15 beckqaire Severe anxiety(BECK) x

    'What's in the NIDDK CDR?'--public query tools for the NIDDK central data repository. Pan H, Ardini MA, Bakalov V, et al., 2013, Database (Oxford).

  • Studies are searchable

  • No analytic dataset impossible to recreate results in papers

    No data dictionary for variables used in analysis impossible to recreate results in paper

    Errors in calculation

    Poor or incomplete linkage of sample lists to phenotypic data

    Sample labeling issues, including:

    Labels applied incorrectly cannot be read by barcode scanner

    Duplicate ids

    Empty or nearly empty vials

    Incorrectly preserved samples

    Curation issues

  • Using the Repository requests for data and samples

    Requests for Repository materials


    requests for


    requests for

    genetic samples

    total number of unique

    samples data requests

    2004 0 7 1936 0

    2005 15 7 3658 4

    2006 47 12 9391 5

    2007 49 6 6979 15

    2008 50 24 29271 16

    2009 64 45 48561 33

    2010 98 34 64195 29

    2011 149 14 44110 58

    2012 109 16 73113 94

    2013 55 6 10638 14

  • Using the data and samples

    91 publications by researchers who gained access to data and samples through the NIDDK Repository, including:

    Papers based on the GWAS data sets in dbGAP

    A paper that re-examined the data from a study of dialysis intensity (HEMO) and suggested a re-interpretation the major study conclusion (Argyropoulos, C et al., 2009, J. Am. Soc. Nephrol., 20, 2034-2043).

    A paper that re-analyzed the IBD Genetics GWAS data to identify additional loci (Elding, H et al., 2011, Am J Hum Genet. 2011 Dec 9;89(6):798-805)

    Publications on novel analytic methods or markers in NIDDK Repository-supplied samples

  • dbGAP the NIDDK Repository: two different curatorial approaches

    NIDDK Repository dbGAP

    Manual Curation Automated curation

    Elements of dates accepted No elements of dates accepted


    Linkage to samples No linkage to samples

    Expensive - ~$1M/year for the

    Data Repository

    Minimal costs the NLM is

    bearing the costs of acquiring


    Low volume High volume

  • dbgap NIDDK Repository

    Year approved #

    downloaded pct

    downloaded approved pct


    2010 54 28 52% 29 100%

    2011 91 52 57% 58 100%

    2012 137 83 61% 94 100%

    2013 53 17 32% 14 100%

    dbGAP the NIDDK Repository: two different curatorial approaches

  • Curation is expensive More familiarity = more sophisticated use of data If investigators are not obliged to share their data,

    they can get by with poor documentation and processing/storage errors

    Lessons learned and cautionary notes

  • Project Officers Beena Akolkar Paul Eggers Bob Karp Contracting Specialist Rich Bailey Repository Specialists Sharon Kay Mobley Kris Moen

    NIDDK Repository Staff

    RTI Data Repository Phil Cooley, PI Helen Pan, Sylvia Tan ThermoFisher, Biosample Repository Heather Higgins, PI Rutgers Univ., Genetics Repository Jay Tischfield, PI


View more >