BRC2011 Session #5 – Data Standards and Metadata
Session Chair: Richard Scheuermann (ViPR & IRD)
Session #5 - Outline
Motivation
Opportunities, Challenges and Talking Points
    minimum information checklists
    ontology-based value sets
    use cases for metadata
    SOPs for data & metadata acquisition
Ontology for Biomedical Investigations - Bjoern Peters
Infectious Disease Ontology and extensions - Lindsay Cowell
GSCID-BRC Metadata Working Group efforts
Open discussion
Why Data Standards
Interoperability - the ability to exchange information between people, organizations, machines
Comparability - the ability to ascertain the equivalence of data from different sources
Data Quality - assess the completeness, accuracy and precision of the data
Dependability – ensures that you get what you expect from a database query
Accurate Statistical Analysis
Inference
What Data Standards
Minimum Information Sets - what needs to be described
Structured Vocabulary/Ontology - how to describe them
    Term strings - unique identifiers
    Definitions - what terms mean
    Syntax - how terms are used
    Semantics - how the components relate to each other
Session #5 – Challenges
Status of relevant data standards
    Few data standards have been widely adopted by the infectious diseases community
    Some standards are being developed without engagement of all relevant stakeholders
    If we drive standards development, how do we get broad adoption?
Adoption of data standards by data providers
    Even if vocabulary standards are available, how do we get the broader community to use them?
    How do we educate them to use the data standards accurately?
    How do we keep the barrier low for getting required metadata in a standard format?
Technical challenges
    Usability is constrained by the spreadsheet interface
    Ontology-based controlled vocabularies are sometimes too large for a spreadsheet-like interface or drop-down lists
    While web-based GUI smart forms are good for a single submission, it is difficult to design them to scale
Need for quality control and curation
    If data standards are not enforced, mapping to standards may be required
    Problems with homonyms (Turkey vs turkey) and synonyms (Puerto Rico and PR)
    Not all tasks in metadata collection lend themselves to automation
    Data entry quality control mechanisms are especially limited because of spreadsheet functionality
    Could be 1-2 FTEs; not budgeted
Compliance with HIPAA and other privacy regulations
    PATRIC does not anticipate working with identifying data, but GSCIDs and investigators could be delayed by compliance issues
Special cases
    Metadata for genomes for NCBI bulk submission and non-unique taxon IDs
    Metadata for growth conditions to be used with transcript datasets
    Metadata for metagenomes to correlate genomes and proteins with useful info about sites and conditions
How do we effectively exploit standardized data and metadata?
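The homonym and synonym problem noted under quality control (Turkey vs turkey, Puerto Rico and PR) can be illustrated with a small sketch. It assumes that the metadata field a value was entered in is enough to disambiguate a homonym, and that synonyms are resolved through a per-field lookup table. The table contents and the `normalize` helper are hypothetical; a real pipeline would presumably draw its mappings from a curated controlled vocabulary.

```python
# Hypothetical per-field lookup tables (not from any BRC or GSCID resource).
# Keys are lower-cased raw entries; values are the preferred terms.
SYNONYMS = {
    "country": {
        "pr": "Puerto Rico",
        "puerto rico": "Puerto Rico",
        "turkey": "Turkey",   # the country
    },
    "host": {
        "turkey": "turkey",   # the bird
    },
}


def normalize(field_name, raw):
    """Return (value, needs_curation) for one spreadsheet cell.

    The field name selects the lookup table, so the same string can map
    to different preferred terms in different fields. Values that are
    not recognized are passed through and flagged for a curator.
    """
    key = " ".join(raw.strip().lower().split())
    table = SYNONYMS.get(field_name, {})
    if key in table:
        return table[key], False
    return raw.strip(), True


print(normalize("country", "PR"))
print(normalize("host", "Turkey"))
print(normalize("host", "emu"))
```

Unrecognized entries are flagged rather than rejected, which reflects the point above that not every task in metadata collection lends itself to automation.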
Session #5 – Opportunities
Existing relevant ontologies are in decent shape - GO, IDO, OBI
Ontology for Biomedical Investigations (OBI) can provide a common framework for describing and exchanging datasets
GSCID-BRC Metadata Working Group
    Leverage and harmonize with MIGS/MIMS
We have the opportunity to establish policies for metadata collection, exchange, and release that would be broadly applicable. We are in the position to drive standards adoption
The BRCs support many pathogens that infect the same host(s) … can we exploit this fact to create specialized views and tools for interacting with the host resources from both pathogen and host perspectives?
Ontology-driven integration (GMOD, Population biology)
Small sequencing centers
    Offer community a standard metadata template for isolates
    Bring your own data and metadata to PATRIC for annotation, analysis, long term metadata storage and dissemination
Develop additional metadata standards and collect, store, and share additional metadata
More efficient encoding of things like alignments
Presentations
Ontology for Biomedical Investigations (OBI) – Bjoern Peters
Infectious Disease Ontology (IDO) and extensions – Lindsay Cowell
GSCID-BRC Metadata Working Group
GSCID-BRC Metadata Working Group
Working group established to define common metadata standard for pathogen isolate sequencing projects
Collaboration between BRCs, GSCIDs and NIAID
Process
    Collect spreadsheets, metadata examples, previous submissions from sequencing projects
    Core metadata fields collected by virus, bacteria and eukaryote subgroups
    For each metadata field, propose:
        preferred term
        definition
        synonyms
        allowed values based on controlled vocabularies
        preferred syntax
        responsible provider
        data category
        examples
    Merge recommendations from subgroups into a common core metadata set using an OBI-based semantic framework
    Develop recommendations for project-specific and pathogen-specific metadata fields
    Harmonize with other relevant standards (MIGS/MIMS, IDO)
    Establish policies and procedures for metadata submission workflows and GenBank linkage
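The per-field proposal listed above (preferred term, definition, synonyms, allowed values, preferred syntax, responsible provider, data category, examples) maps naturally onto a small record type. The following is a minimal sketch, not the working group's actual schema: the class name, the `problems` method, the choice of a regular expression for "preferred syntax", and the example entry are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class MetadataField:
    """One proposed metadata field; attribute names follow the slide."""
    preferred_term: str
    definition: str
    synonyms: list = field(default_factory=list)
    allowed_values: list = field(default_factory=list)  # empty = free text
    preferred_syntax: str = ""  # assumed to be a regex; empty = unchecked
    responsible_provider: str = ""
    data_category: str = ""
    examples: list = field(default_factory=list)

    def problems(self, value):
        """Return the reasons a submitted value breaks this field's rules."""
        found = []
        if self.allowed_values and value not in self.allowed_values:
            found.append(f"{value!r} is not an allowed value")
        if self.preferred_syntax and not re.fullmatch(self.preferred_syntax, value):
            found.append(f"{value!r} does not match {self.preferred_syntax}")
        return found


# Invented entry for illustration only; not taken from the working group sheets.
collection_date = MetadataField(
    preferred_term="specimen collection date",
    definition="Date on which the specimen was collected from its source.",
    synonyms=["collection date", "date collected"],
    preferred_syntax=r"\d{4}-\d{2}-\d{2}",
    responsible_provider="specimen collector",
    data_category="Specimen Isolation",
    examples=["2011-03-14"],
)

print(collection_date.problems("2011-03-14"))
print(collection_date.problems("14 March 2011"))
```

Keeping the validation rules next to the definition means a spreadsheet template or a web form could, in principle, be generated from the same records.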
[Figure: OBI-based semantic network for a pathogen isolate sequencing project.
 Nodes: specimen source (organism or environmental), specimen collector, specimen isolation process, isolation protocol, specimen, microorganism, microorganism genomic NA, sample processing, enriched NA sample, input sample, reagents, technician, equipment, sequencing assay, primary data, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), sequence data, genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID, type, ID, qualities, temporal-spatial region.
 Relations: has_input, has_output, has_specification, has_part, is_about, denotes, located_in, has_quality, instance_of.
 Legend: node styles distinguish independent continuant, dependent continuant, occurrent and temporal-spatial region; relations are shown in italics.]
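One way to read the network diagram is as a set of subject-relation-object statements. The sketch below encodes part of it that way. The node and relation names come from the diagram, but which nodes each edge joins is an assumption reconstructed from the labels, and the two helper functions are hypothetical.

```python
# Each statement is (subject, relation, object). The edge endpoints are an
# assumed reading of the diagram, not a verified transcription of it.
TRIPLES = [
    ("specimen isolation process", "has_specification", "isolation protocol"),
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_output", "specimen"),
    ("sample processing", "has_input", "specimen"),
    ("sample processing", "has_output", "enriched NA sample"),
    ("sequencing assay", "has_input", "enriched NA sample"),
    ("sequencing assay", "has_output", "primary data"),
    ("assembly", "has_input", "primary data"),
    ("assembly", "has_output", "sequence data"),
    ("data archiving process", "has_input", "sequence data"),
    ("data archiving process", "has_output", "sequence data record"),
    ("GenBank ID", "denotes", "sequence data record"),
]


def objects(subject, relation):
    """Everything that `subject` points to through `relation`."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


def subjects(relation, obj):
    """Everything that points to `obj` through `relation`."""
    return [s for s, r, o in TRIPLES if r == relation and o == obj]


print(objects("sequencing assay", "has_output"))
print(subjects("denotes", "sequence data record"))
```

A flat list of triples is only a stand-in here; the same statements could be loaded into an RDF store if ontology identifiers were attached to each node.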
Network Overview
[Figure: the semantic network from the preceding slide, shown without the legend and divided into regions labeled Specimen Isolation, Material Processing, Sequencing Assay, Data Processing and Investigation.]
[Figure: expanded semantic network, with nodes annotated by the matching rows of the virus metadata spreadsheet (vX = row X in virus sheet).
 Nodes: organism, environmental material, equipment, person, sample material, material, specimen, microorganism, microorganism genomic NA, enriched NA sample, cDNA sample, species/strain, organism ID, common name, name, affiliation, ID, amount, age/gender/symptom, GenBank ID; roles (specimen source, specimen capture, specimen collector, template, reagent, sequencing tech., signal detection); processes (specimen isolation, NA enrichment, cDNA synthesis, sequencing assay, data transformations for image processing, assembly, variant detection, serotype marker detection and gene detection, data archiving); specifications (isolation protocol, NA enrichment protocol, cDNA synthesis protocol, sequencing protocol, algorithm, software, data transfer protocol); data (primary data, sequence data, genotype/serotype/gene data, sequence data record); temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location).
 Relations: has_input, has_output, plays, has_specification, has_part, is_about, denotes, located_in, has_quality, instance_of, has_affiliation.
 Spreadsheet rows shown: v2, v3-4, v5-6, v7, v8, v10, v11, v12, v13, v15, v16, v22, v23, v24, v25, v27, v29, v30, v31, v32, v40, v42, v43, v44, v45, v46.
 Legend: node styles distinguish independent continuant, dependent continuant, occurrent and temporal-spatial region; relations are shown in italics.]
Metadata Categories
Investigation
Specimen Isolation
Specimen Processing
Sample Shipment
Pathogen Detection & Isolation
Sequencing Sample Preparation
Sequencing Assay
Data Transformation
Specimen Isolation
[Figure: network diagram of the specimen isolation step.
 Nodes: organism, organism part, environmental material, equipment, person, specimen source role, specimen capture role, specimen collector role, species/strain, organism ID, common name, name, affiliation, ID, age/gender/symptom, specimen X, specimen type, microorganism, specimen isolation procedure X, specimen isolation procedure type, isolation protocol, hypothesis, IRB/IACUC approval, Comments (marked "????" on the slide), temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location).
 Relations: has_input, has_output, plays, has_specification, has_part, is_about, denotes, located_in, has_quality, instance_of, has_affiliation, has_authorization.
 Spreadsheet rows shown (vX = row X in virus sheet): v2, v3-4, v5-6, v7, v8, v9, v10, v11, v12, v13, v15, v16, v17, v18, v19, v27.
 Legend: node styles distinguish independent continuant, dependent continuant, occurrent and temporal-spatial region; relations are shown in italics.]
Specimen Processing
[Figure: network diagram of the specimen processing step.
 Nodes: specimen X, microorganism X, species/strain, aliquoting process X, aliquoting protocol, specimen X aliquot Y, sample set assembly process X, sample set assembly protocol, sample set X (with example members specimen A aliquot B, specimen M aliquot N, specimen T aliquot U), ID, specimen type, amount, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location).
 Relations: has_input, has_output, has_part, has_specification, denotes, located_in, has_quality, instance_of.
 Spreadsheet rows shown: v15, v16, v20, v22, v23, v24, v27.]
Sample Shipment
[Figure: network diagram of the sample shipment step.
 Nodes: sample set X, sample set X in transit, sample set X at GSC, sample X at GSC, sample shipment process X, sample shipment protocol, sample receipt process X, sample receipt protocol, ID, sample set type, sample type, amount, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location).
 Relations: has_input, has_output, has_part, has_specification, denotes, located_in, has_quality, instance_of.
 Spreadsheet rows shown: v21, v23, v24, v25.]
Pathogen Detection & Isolation
[Figure: network diagram of the pathogen detection and isolation steps.
 Nodes: specimen X, microorganism X, species/strain, pathogen detection process X, pathogen detection method, pathogen detection protocol, data about pathogen presence, pathogen isolation process X, pathogen isolation method, pathogen isolate X, ID, specimen type, pathogen type, amount, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location).
 Relations: has_input, has_output, has_part, has_specification, is_about, denotes, located_in, has_quality, instance_of.
 Spreadsheet rows shown: v15, v16, v26, v27, v28, v34.]
Sequencing Sample Preparation
[Figure: network diagram of the sequencing sample preparation steps.
 Nodes: specimen X, microorganism X, microorganism genomic NA, species/strain, aliquoting process X, aliquoting protocol, specimen aliquot X, NA enrichment process X, NA enrichment protocol, enriched NA sample X, cDNA synthesis process X, cDNA synthesis protocol, cDNA sample X, ID, specimen type, amount, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location).
 Relations: has_input, has_output, has_part, has_specification, denotes, located_in, has_quality, instance_of.
 Spreadsheet rows shown: v15, v16, v27, v33, v35, v36, v37, v38, v39.]
Sequencing Assay
[Figure: network diagram of the sequencing assay.
 Nodes: sequencing assay X, sequencing assay type, sequencing protocol, run ID, objectives (coverage, genome type targeted), primary data; sample material X (sample ID, sample type, template role); material X (lot #, reagent type, reagent role); person X (name, species, sequencing tech. role); equipment X (serial #, equipment type, signal detection role); temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location).
 Relations: has_input, has_output, has_part, has_specification, plays, denotes, located_in, instance_of.
 Spreadsheet rows shown: v14, v40, v41.]
Data Transformations
[Figure: network diagram of the data transformation steps.
 Nodes: primary data, data transformations (image processing, assembly X, variant detection, serotype marker detection, gene detection), sequence data, genotype data, serotype data, gene data, microorganism X, microorganism genomic NA, species/strain, algorithm, software, person X (name, species, bioinformatics tech. role), run ID, data archiving process, data transfer protocol, sequence data record, GenBank ID, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location).
 Relations: has_input, has_output, has_part, part_of, has_specification, is_about, plays, denotes, located_in, instance_of.
 Spreadsheet rows shown: v29, v30, v31, v32, v42, v43, v44, v45, v46, v47.]
Investigation
[Figure: network diagram of the investigation as a whole.
 Nodes: investigation, study design, documenting, study design execution, objective specification, data transformation, information content entity, specimen creation, specimen preparation for assay, sequencing assay.
 Relations: has_part, has_specified_input.
 Legend: node styles distinguish independent continuant, dependent continuant, occurrent and temporal-spatial region; relations are shown in italics.]
Generic Assay
[Figure: generic network template for an assay.
 Nodes: assay X, assay type, assay protocol, run ID, objectives, primary data, analyte X, quality x, input sample material X; sample material X (sample ID, sample type, target role); material X (lot #, reagent type, reagent role); person X (name, species, technician role); equipment X (serial #, equipment type, signal detection role); temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location).
 Relations: has_input, has_output, has_part, has_specification, has_quality, is_about, plays, denotes, located_in, instance_of.]
Generic Material Transformation
[Figure: generic network template for a material transformation.
 Nodes: material transformation X, material transformation type, material transformation protocol, run ID, objectives, output material X (sample ID, material type, quality x); sample material X (sample ID, sample type, target role, quality x); material X (lot #, reagent type, reagent role); person X (name, species, technician role); equipment X (serial #, equipment type, signal detection role); temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location).
 Relations: has_input, has_output, has_part, has_specification, has_quality, plays, denotes, located_in, instance_of.]
Generic Data Transformation
[Figure: generic network template for a data transformation.
 Nodes: data transformation X, data transformation type, input data, output data, material X, algorithm, software, person X (name, data analyst role), run ID, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location).
 Relations: has_input, has_output, has_part, has_specification, is_about, plays, denotes, located_in, instance_of.]
Generic Material (IC)
[Figure: generic network template for a material.
 Nodes: material X, ID, material type, quality x, material Y, material Z, quality y, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location).
 Relations: has_quality, has_part, denotes, instance_of, located_in.]
Discussion Points
MIBBI may not be sufficient
    Doesn't distinguish between the minimum information to reproduce an experiment and the minimum information to structure in a database
    Lacks a semantic framework
OBI-based framework is re-usable
    Sequencing => “omics”
Challenge of using ontologies for preferred value sets
    Can be large
    May not directly match common language
Value of defining the semantic framework
    Appropriate relations are retained
    How can we take advantage of the framework for semantic query and inferential analysis?
Practical issues about implementation strategies
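As one possible answer to the question about semantic query and inferential analysis, the retained has_input and has_output relations allow provenance to be inferred rather than stored. The sketch below assumes a simplified process chain loosely based on the slides. The `PROCESSES` table and the `derived_from` function are hypothetical names, and the chain itself is an illustration rather than the working group's model.

```python
# process name -> (has_input list, has_output list); an assumed, simplified chain
PROCESSES = {
    "specimen isolation process": (["specimen source"], ["specimen"]),
    "NA enrichment process": (["specimen"], ["enriched NA sample"]),
    "sequencing assay": (["enriched NA sample", "reagents"], ["primary data"]),
    "assembly": (["primary data"], ["sequence data"]),
    "data archiving process": (["sequence data"], ["sequence data record"]),
}


def derived_from(entity):
    """Infer every upstream entity by chaining has_output back to has_input."""
    upstream = set()
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for inputs, outputs in PROCESSES.values():
            if current in outputs:
                for item in inputs:
                    if item not in upstream:
                        upstream.add(item)
                        frontier.append(item)
    return upstream


# Nothing in the table links the record to the specimen source directly;
# the link is inferred by walking the chain of processes.
print(sorted(derived_from("sequence data record")))
```

The same traversal, run over real records, would let a query such as "all sequence records derived from specimens collected at a given site" be answered without a dedicated column for it.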