Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects Richard H. Scheuermann, Ph.D. Department of Pathology Division of Biomedical Informatics U.T. Southwestern Medical Center

Upload: tomai

Post on 23-Feb-2016


TRANSCRIPT

Page 1: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Richard H. Scheuermann, Ph.D.

Department of Pathology

Division of Biomedical Informatics

U.T. Southwestern Medical Center

Page 2: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

NIAID Bioinformatics Resource Centers

www.pathogenportal.net

Page 3: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Influenza Research Database

www.fludb.org

Page 4: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

NIAID Genome Sequencing Centers

Page 5: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Metadata Inconsistencies

• Each project was providing different types of metadata

• No consistent nomenclature being used

• Impossible to perform reliable comparative genomics analysis

Page 6: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Dengue Clinical Metadata

Page 7: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Complex Query Interface

Page 8: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Additional Clinical Characteristics

Page 9: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

GSC-BRC Metadata Standards Working Group

• NIAID assembled a group of representatives from their three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) programs

• Develop metadata standards for pathogen isolate sequencing projects

Page 10: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

GSC-BRC Metadata Working Groups

Page 11: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Metadata Standards Process

• Divide into pathogen subgroups: viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide:
  – definitions
  – synonyms
  – allowed value sets, preferably using controlled vocabularies
  – expected syntax
  – examples
  – data categories
  – data providers
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble metadata fields into a semantic network
• Harmonize semantic network with the Ontology for Biomedical Investigations (OBI)
• Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
• Establish policies and procedures for metadata submission workflows and GenBank linkage
• Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
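The per-field attributes listed in the process above (definition, synonyms, allowed values, expected syntax, examples, data category, data provider) map naturally onto a small record type. The following is a minimal sketch only, not the working group's actual schema: the class name `FieldSpec`, the two example fields, their value sets, and the date pattern are all hypothetical and chosen purely for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class FieldSpec:
    """One metadata field with the attributes named on the slide (hypothetical layout)."""
    name: str
    definition: str
    synonyms: List[str] = field(default_factory=list)
    allowed_values: Optional[Set[str]] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None               # expected syntax as a regular expression
    examples: List[str] = field(default_factory=list)
    category: str = ""                         # data category
    provider: str = ""                         # data provider

    def validate(self, value: str) -> List[str]:
        """Return a list of problems; an empty list means the value is acceptable."""
        problems = []
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{self.name}: {value!r} is not an allowed value")
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            problems.append(f"{self.name}: {value!r} does not match the expected syntax")
        return problems


# Hypothetical example fields; names and vocabularies are assumptions.
collection_date = FieldSpec(
    name="collection_date",
    definition="Date on which the specimen was collected",
    synonyms=["sampling date"],
    syntax=r"\d{4}-\d{2}-\d{2}",
    examples=["2011-03-15"],
    category="specimen",
    provider="specimen collector",
)
host_sex = FieldSpec(
    name="host_sex",
    definition="Sex of the host organism the specimen came from",
    allowed_values={"male", "female", "unknown"},
    examples=["female"],
    category="specimen source",
    provider="specimen collector",
)

for spec, value in [(collection_date, "2011-03-15"),
                    (collection_date, "March 2011"),
                    (host_sex, "F")]:
    print(spec.name, repr(value), spec.validate(value))
```

Keeping the allowed values and syntax alongside the definition means a submission spreadsheet could be checked against the same record that documents the field.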

Page 12: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Core Sample Metadata

30 Core Sample Metadata Fields

Page 13: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Core Project Metadata

16 Core Project Metadata Fields

Page 14: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Metadata Standards Process

• Divide into pathogen subgroups: viruses, bacteria, eukaryotic pathogens and vectors
• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
• Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
• For each data field, provide:
  – definitions
  – synonyms
  – allowed value sets, preferably using controlled vocabularies
  – expected syntax
  – examples
  – data categories
  – data providers
• Merge subgroup core elements into a common set of core metadata fields and attributes
• Assemble metadata fields into a semantic network (Scheuermann)
• Harmonize semantic network with the Ontology for Biomedical Investigations (OBI) (Stoeckert, Zheng)
• Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
• Establish policies and procedures for metadata submission workflows and GenBank linkage
• Develop data submission spreadsheets to be used for all white paper and BRC-associated projects

Page 15: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Specimen Isolation

[Figure: semantic network for specimen isolation. Nodes: specimen isolation procedure X, specimen isolation procedure type, specimen X, specimen type, isolation protocol, organism, organism part, environmental material, environment, equipment, person, name, affiliation, ID, specimen source role, specimen capture role, specimen collector role, temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location, hypothesis, IRB/IACUC approval, Comments. Relations: has_input, has_output, plays, has_specification, has_part, denotes, located_in, instance_of, has_quality, is_about, has_affiliation, has_authorization. Nodes carry field annotations v2 to v19 and b18 to b30.]
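A semantic network like the Specimen Isolation diagram on this slide can be held as a list of subject-predicate-object triples and queried directly. The sketch below is illustrative only: the node and relation names come from the slide, but the specific edges are one plausible reading of the flattened figure, not a verified transcription, and the helper names `objects_of` and `subjects_of` are hypothetical.

```python
# Each triple is (subject, predicate, object).
# ASSUMPTION: edge directions are inferred from the slide labels, not confirmed.
TRIPLES = [
    ("specimen isolation procedure X", "has_input", "organism"),
    ("specimen isolation procedure X", "has_output", "specimen X"),
    ("specimen isolation procedure X", "has_specification", "isolation protocol"),
    ("specimen isolation procedure X", "located_in", "temporal-spatial region"),
    ("organism", "plays", "specimen source role"),
    ("person", "plays", "specimen collector role"),
    ("temporal-spatial region", "has_part", "spatial region"),
    ("temporal-spatial region", "has_part", "temporal interval"),
    ("GPS location", "denotes", "spatial region"),
    ("date/time", "denotes", "temporal interval"),
    ("spatial region", "located_in", "geographic location"),
    ("ID", "denotes", "specimen X"),
]


def objects_of(subject, predicate):
    """All objects linked from `subject` by `predicate`, in insertion order."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects_of(predicate, obj):
    """All subjects linked to `obj` by `predicate`, in insertion order."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


# Which parts make up the temporal-spatial region of the procedure?
print(objects_of("temporal-spatial region", "has_part"))
# Which data fields denote the specimen?
print(subjects_of("denotes", "specimen X"))
```

Data fields such as ID, GPS location, and date/time sit at the edge of the network as things that denote an entity, which is how the metadata fields attach to the model.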

Page 16: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Metadata Processes

[Figure: overview of metadata processes within an Investigation, in four stages: Specimen Isolation, Material Processing, Sequencing Assay, Data Processing. Nodes: specimen source (organism or environmental), specimen collector, specimen isolation process, isolation protocol, specimen, microorganism, sample processing, enriched NA sample, microorganism genomic NA, input sample, reagents, technician, equipment (each with type, ID, qualities), temporal-spatial region, sequencing assay, primary data, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), sequence data, genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID. Relations: has_input, has_output, has_specification, has_part, is_about, denotes, located_in, has_quality, instance_of.]

Page 17: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Core-Project

Page 18: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Core-Specimen

Page 19: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Generic Assay

[Figure: generic assay pattern. Nodes: assay X, assay type, assay protocol, run ID, objectives, sample material X (sample ID, sample type, target role), input sample material X, analyte X, quality x, material X (lot #, reagent type, reagent role), person X (name, species, technician role), equipment X (serial #, equipment type, signal detection role), primary data, temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location. Relations: has_input, has_output, has_specification, has_part, has_quality, plays, denotes, instance_of, located_in, is_about.]

Page 20: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Generic Material Transformation

[Figure: generic material transformation pattern. Nodes: material transformation X, material transformation type, material transformation protocol, run ID, objectives, sample material X (sample ID, sample type, target role, quality x), material X (lot #, reagent type, reagent role), person X (name, species, technician role), equipment X (serial #, equipment type, signal detection role), output material X (sample ID, material type, quality x), temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location. Relations: has_input, has_output, has_specification, has_part, has_quality, plays, denotes, instance_of, located_in.]

Page 21: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Generic Data Transformation

[Figure: generic data transformation pattern. Nodes: data transformation X, data transformation type, run ID, input data, output data, material X, algorithm, software, person X (name, data analyst role), temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location. Relations: has_input, has_output, has_specification, has_part, is_about, plays, denotes, instance_of, located_in.]

Page 22: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Generic Material (IC)

[Figure: generic material pattern. Nodes: material X (ID, material type, quality x), material Y and material Z as parts of material X, quality y, and two temporal-spatial regions, each with a spatial region, temporal interval, GPS location, date/time and geographic location. Relations: has_part, has_quality, denotes, instance_of, located_in.]

Page 23: Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects

Conclusions

• Utility of semantic representation
  – Identified gaps in data field list (e.g. temporal components)
  – Identified gaps in ontology data standards (use case-driven standard development)
  – Identified commonalities in data structures (reusable)
  – Support for semantic queries and inferential analysis in future
• Two flavors of MIBBI
  – Distinguish between minimum information to reproduce an experiment and the minimum information to structure in a database for query and analysis
• OBI-based framework is re-usable
  – Sequencing => “omics”
• Practical issues about implementation strategies
  – Challenge of using ontologies for preferred value sets
    • Can be large
    • May not directly match common language
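One common way to handle the last point, ontology labels that do not match common language, is a synonym lookup that maps what submitters type onto a preferred term. The sketch below illustrates that idea only and is not something described in the presentation: the identifiers `EX:0001` and `EX:0002`, the labels, and the synonyms are all made-up placeholders, not real ontology terms.

```python
# HYPOTHETICAL preferred value set: IDs, labels, and synonyms are placeholders.
PREFERRED_TERMS = {
    "EX:0001": {
        "label": "nasopharyngeal swab specimen",
        "synonyms": ["NP swab", "nasopharyngeal swab"],
    },
    "EX:0002": {
        "label": "whole blood specimen",
        "synonyms": ["whole blood", "blood"],
    },
}


def build_lookup(terms):
    """Map every label and synonym (case-insensitive) to its preferred term ID."""
    lookup = {}
    for term_id, entry in terms.items():
        for text in [entry["label"], *entry["synonyms"]]:
            lookup[text.casefold()] = term_id
    return lookup


def normalize(value, lookup):
    """Return the preferred term ID for a submitted value, or None if unknown."""
    return lookup.get(value.strip().casefold())


LOOKUP = build_lookup(PREFERRED_TERMS)

for submitted in [" NP Swab ", "Blood", "saliva"]:
    term_id = normalize(submitted, LOOKUP)
    label = PREFERRED_TERMS[term_id]["label"] if term_id else "(no match; needs curation)"
    print(repr(submitted), "->", term_id, label)
```

Unmatched values return `None` rather than a guess, so they can be routed to a curator instead of silently entering the database.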