Managing Big Scientific Data
Capturing, Integrating and Presenting Mouse Data at MGI

Cynthia Smith
Canberra, April 2010
Mouse Genome Informatics
www.informatics.jax.org

Achondroplasia

Homozygous achondroplasia mouse mutant and control

• short domed skull
• short-limbed dwarfism
• malocclusion
• bulging abdomen as adults
• respiratory problems
• shortened lifespan

Mouse Genome Informatics (MGI) program goal

…to facilitate the use of the mouse as a model for heritable human diseases and normal human biology.

…to accomplish MGI’s mission, we provide integrated access to the genetics, genomics, and biology of the laboratory mouse.

Hermansky-Pudlak syndrome

Information content spans from sequence to phenotype/disease

• sequence
• natural variation
• gene function
• genome location
• orthologies
• strain genealogy
• expression
• tumors

MGI Data Content, a few numbers (April 2010)

Genes (including uncloned mutants)                      36,691
Genes w/ nucleotide sequence                            29,108
Genes annotated to GO                                   25,620
Total mouse GO annotations                              223,558
Mouse/human orthologs                                   17,846
Mouse/rat orthologs                                     16,776
Phenotypic alleles in mice                              24,007
Genes with mutant alleles in mice                       12,427
Mutant alleles in cell lines only                       541,172
Total phenotype annotations (Mammalian Phenotype, MP)   182,139
QTL                                                     4,436
Human diseases w/ one or more mouse model               1,005
Gene Expression Assays                                  37,584
Integrated mouse nucleotide sequences + ESTs            >9,994,000
refSNPs                                                 >10,089,000
References                                              153,161

…plus strains, expression and phenotype images, tumor records, etc.

Integration in MGI

• Identify objects.
• Resolve discrepancies.

Integration is key to knowledge discovery

Integration is hard…not just a matter of combining data sources…

• Data from multiple sources can be of differing quality

• The same data can enter the system via various paths

• Naming conventions may or may not be to standards

• Some data sources don’t maintain unique accession numbers (or allow them to change)

• Periodic updates from data sources can cause problems
  • if objects have disappeared… (or reappear)
  • if objects have split in two
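The update problems listed above can be sketched as a comparison between two releases of a data source. This is only an illustration, not MGI's load code: the function name, the accession IDs, the `retired` set (IDs seen in some older release) and the `successors` map (which retired ID was replaced by which new IDs) are all hypothetical, and a real source may not publish successor information at all.

```python
def classify_update(previous, current, retired, successors):
    """Compare two releases of a source, given as sets of accession IDs.

    retired    -- hypothetical set of IDs seen in some older release
    successors -- hypothetical map from an old ID to the IDs replacing it
    """
    gone = previous - current
    # An object that vanished and has more than one successor has split.
    split = {acc for acc in gone if len(successors.get(acc, ())) > 1}
    added = current - previous
    return {
        "disappeared": sorted(gone - split),
        "split": sorted(split),
        "reappeared": sorted(added & retired),
        "new": sorted(added - retired),
    }


report = classify_update(
    previous={"A1", "A2", "A3"},
    current={"A1", "A4", "A5", "A7", "A9"},
    retired={"A9"},
    successors={"A3": ["A4", "A5"]},
)
print(report)
```

Each category in the report would need its own handling, for example curator review for splits and reappearances.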

Annotation Pipeline

• Data Acquisition: Literature & Loads
• Object Identity: New Gene, Strain or Sequence?
• Standardizations: Controlled Vocabularies
• Data Associations: Evidence & Citation
• Integration with other bioinformatics resources: Co-curation of shared objects and concepts

Data integration is hard

• “Bucketizing” establishes types of correspondence between objects in the input sets.

• Allows immediate incorporation of 1:1 corresponding data.

• Sorts conflicting data into bins that allow prioritization for curator resolution.
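As a rough sketch of the bucketizing idea, the code below groups objects from two input sets by how they correspond. It is an assumption-laden illustration, not MGI's implementation: the function name, the bucket labels ("1:1", "1:n", "n:1", "n:m", "1:0", "0:1") and the IDs are hypothetical, and the correspondence pairs are assumed to come from some earlier comparison such as shared IDs or overlapping coordinates.

```python
from collections import defaultdict


def bucketize(ids_a, ids_b, pairs):
    """Group objects of two input sets into buckets by correspondence type."""
    a_to_b, b_to_a = defaultdict(set), defaultdict(set)
    for a, b in pairs:
        a_to_b[a].add(b)
        b_to_a[b].add(a)

    buckets = defaultdict(list)
    seen_a = set()
    for start in sorted(ids_a):
        if start in seen_a:
            continue
        # Collect everything connected to `start` through correspondences.
        comp_a, comp_b, frontier = {start}, set(), [start]
        while frontier:
            a = frontier.pop()
            for b in a_to_b.get(a, ()):
                comp_b.add(b)
                for other in b_to_a[b]:
                    if other not in comp_a:
                        comp_a.add(other)
                        frontier.append(other)
        seen_a |= comp_a
        if not comp_b:
            label = "1:0"
        elif len(comp_a) == 1:
            label = "1:1" if len(comp_b) == 1 else "1:n"
        else:
            label = "n:1" if len(comp_b) == 1 else "n:m"
        buckets[label].append((sorted(comp_a), sorted(comp_b)))

    # Objects in the second set with no partner at all.
    for b in sorted(ids_b):
        if not b_to_a.get(b):
            buckets["0:1"].append(([], [b]))
    return dict(buckets)


result = bucketize(
    {"a1", "a2", "a3", "a4"},
    {"b1", "b2", "b3", "b4", "b5"},
    [("a1", "b1"), ("a2", "b2"), ("a2", "b3"), ("a3", "b4")],
)
```

Under this scheme the "1:1" bucket could be loaded immediately, while the other buckets would be queued for curator review in whatever priority order suits the project.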

VEGA annotated three distinct genes instead of multiple transcripts for a single gene (Mvk)

chr5:114705285-114721583

Why resolve and integrate data?

1. Allows you to find all the data

Example: I want all the sequences from GenBank that are from C57BL/6

There are >100 different versions of this strain name in GenBank files, e.g. B6, BL/6, C57BL76J, 57BL/6J, Black-6, JB6, C57Black/6, black six, etc.

Example: You find several papers describing different phenotypes of knockouts of the Fgfr2 gene. The knockout alleles are just called Fgfr2-/-. Help!

There are 14 different targeted alleles of Fgfr2 (knockout/knockin). Each has a unique symbol and MGI-ID and different phenotype annotations, and they are models of different human diseases. All are associated with their respective references.

MGI has curated these data. You can ask these questions!
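A minimal sketch of how strain-name variants might be resolved to one controlled name. The variant list below uses only a few of the spellings quoted on the slide, the normalization rule (trim, lowercase, collapse whitespace) is an assumption, and the function names are hypothetical. A curated table of this kind, not string similarity, does the real work.

```python
import re

CANONICAL = "C57BL/6"
# A handful of the variant spellings quoted above; the real list is >100 long.
VARIANTS = ["C57BL/6", "B6", "BL/6", "Black-6", "C57Black/6", "black six"]


def normalize_key(name):
    """Trim, lowercase and collapse whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", name.strip().lower())


STRAIN_TABLE = {normalize_key(v): CANONICAL for v in VARIANTS}


def resolve_strain(name, table=STRAIN_TABLE):
    """Return the controlled strain name, or None if the name is unknown."""
    return table.get(normalize_key(name))


print(resolve_strain("  Black   Six "))
print(resolve_strain("BALB/c"))
```

Returning None for an unknown name, rather than guessing, leaves the decision to a curator.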

Why resolve and integrate data?

2. Allows you to discriminate ambiguous data

Example: I want information for mouse gene Tap

Which gene? There are 5 genes published as Tap. Each of these genes has Tap as a synonym.

Chr 15  Ly6a, lymphocyte antigen 6 complex, locus A
Chr 19  Nxf1, nuclear RNA export factor 1 homolog (S. cerevisiae)
Chr 11  Sec14l2, SEC14-like 2 (S. cerevisiae)
Chr 17  Tap1, transporter 1, ATP-binding cassette, sub-family B (MDR/TAP)
Chr 5   Uso1, USO1 homolog, vesicle docking protein (yeast)

P.S. Gene Gnas has 20 synonyms
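The Tap example can be sketched as a synonym index that maps one name to every gene it may refer to. The gene symbols and chromosomes come from the slide; the index structure, the function names and the case-insensitive matching are assumptions made for illustration.

```python
from collections import defaultdict

# (official symbol, chromosome, synonyms) taken from the list above.
GENES = [
    ("Ly6a", "15", ["Tap"]),
    ("Nxf1", "19", ["Tap"]),
    ("Sec14l2", "11", ["Tap"]),
    ("Tap1", "17", ["Tap"]),
    ("Uso1", "5", ["Tap"]),
]


def build_index(genes):
    """Map every symbol and synonym (lowercased) to the official symbols."""
    index = defaultdict(set)
    for symbol, _chromosome, synonyms in genes:
        for name in [symbol, *synonyms]:
            index[name.lower()].add(symbol)
    return index


def lookup(name, index):
    """Return all official symbols a name could refer to, sorted."""
    return sorted(index.get(name.lower(), ()))


INDEX = build_index(GENES)
print(lookup("Tap", INDEX))   # five candidate genes
print(lookup("Tap1", INDEX))  # a single, unambiguous gene
```

A lookup that returns more than one symbol signals that the user, or a curator, has to choose.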

Why resolve and integrate data?

3. In addition to resolving object identification issues, integration allows you to ask complex questions that span data sets and data types from different sources.

Example: What genes on Chromosome 11 have mutant alleles that display phenotypes of hydrocephaly and hypertension?

Example: Provide me with a list of RefSeq IDs where the gene corresponding to the sequence shows expression in embryos at 13.5-15 days and is involved in the biological process (GO) of apoptosis.
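A toy sketch of the first question: once genome location and phenotype annotations hang off the same gene objects, the query is a simple join. All gene and allele names below are invented placeholders, not MGI data, and the in-memory tables stand in for what would really be database queries.

```python
# Hypothetical integrated data: gene -> chromosome, and allele annotations.
GENE_CHROMOSOME = {"GeneA": "11", "GeneB": "11", "GeneC": "5"}
ALLELE_PHENOTYPES = [
    ("GeneA", "alleleA1", "hydrocephaly"),
    ("GeneA", "alleleA2", "hypertension"),
    ("GeneB", "alleleB1", "hydrocephaly"),
    ("GeneC", "alleleC1", "hydrocephaly"),
    ("GeneC", "alleleC1", "hypertension"),
]


def genes_with_phenotypes(chromosome, wanted):
    """Genes on `chromosome` whose mutant alleles show every wanted phenotype."""
    observed = {}
    for gene, _allele, phenotype in ALLELE_PHENOTYPES:
        observed.setdefault(gene, set()).add(phenotype)
    return sorted(
        gene
        for gene, chrom in GENE_CHROMOSOME.items()
        if chrom == chromosome and set(wanted) <= observed.get(gene, set())
    )


print(genes_with_phenotypes("11", ["hydrocephaly", "hypertension"]))
```

The join only works because both tables refer to the same resolved gene identity, which is the point of the slide.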

Integration requires consistent semantics

Controlled vocabularies/nomenclatures
• Strains
• Genes
• Alleles (phenotypic or variant)
• Classes of genetic markers
• Types of mutations
• Types of assays
• Developmental stages
• Tissues
• Clone libraries
• ES cell lines

… organized as lists or simple hierarchies

Ldb1 (LIM domain binding 1) gene expression in CD-1 mice

[Screenshot of an expression record, with callouts: Assay Type, Gene nomenclature, Strain, Age, Results]

Semantics plus relationship data

Ontologies/structured vocabularies
• Gene Ontology (GO)
  • Molecular function
  • Biological process
  • Cellular component
• Mouse Anatomy (MA)
  • Embryonic
  • Adult
• Mammalian Phenotype (MP)
• Sequence Ontology (SO)

… organized as directed acyclic graphs (DAGs)
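To illustrate why the DAG structure matters, the sketch below finds every gene annotated to a term or to any of its descendants, so a query on a general term also returns genes annotated to more specific ones. The term IDs ("T:1" etc.), the gene names and the function names are hypothetical placeholders, not real ontology terms.

```python
# Hypothetical ontology: each term lists its parents (a term may have several).
PARENTS = {
    "T:1": [],
    "T:2": ["T:1"],
    "T:3": ["T:1"],
    "T:4": ["T:2", "T:3"],
}
# Hypothetical direct annotations: gene -> set of terms.
ANNOTATIONS = {"GeneA": {"T:4"}, "GeneB": {"T:2"}}


def ancestors(term, parents=PARENTS):
    """All terms reachable by following parent links upward."""
    found = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, ()))
    return found


def genes_annotated_to(term, annotations=ANNOTATIONS, parents=PARENTS):
    """Genes annotated to `term` directly or through a more specific term."""
    return sorted(
        gene
        for gene, terms in annotations.items()
        if any(t == term or term in ancestors(t, parents) for t in terms)
    )


print(genes_annotated_to("T:1"))
```

Because "T:4" has two parents, GeneA is found under both "T:2" and "T:3", which a simple tree or flat list could not express.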

Mammalian Phenotype Ontology

• Structured as a DAG
• Over 7324 terms covering physiological systems, behavior, development and survival
• Available in browser and in OBO file formats from MGI ftp and OBO Foundry sites
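For readers unfamiliar with the OBO flat-file format, here is a minimal reader for `[Term]` stanzas that handles only the `id`, `name` and `is_a` tags. The sample text uses made-up "EX:" identifiers rather than real MP terms, and the parser is a simplification: it treats everything after "!" as a comment and ignores all other tags and stanza types.

```python
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0000001
name: hypothetical root phenotype

[Term]
id: EX:0000002
name: hypothetical child phenotype
is_a: EX:0000001 ! hypothetical root phenotype

[Typedef]
id: part_of
name: part of
"""


def parse_obo_terms(text):
    """Return {term id: {"id", "name", "is_a"}} for each [Term] stanza."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop trailing "! comment"
        if not line:
            continue
        if line.startswith("["):
            # Only [Term] stanzas are kept; others (e.g. [Typedef]) are skipped.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if current is None:
            continue  # header lines or a skipped stanza
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            current["id"] = value
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
    return terms


terms = parse_obo_terms(SAMPLE_OBO)
print(terms["EX:0000002"]["is_a"])
```

A term with more than one `is_a` line has several parents, which is what makes the vocabulary a DAG rather than a tree.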

Annotating Gene Products using GO

Gene Product: P05147
GO Term: GO:0047519
Evidence: IDA
Reference: PMID:2976880

P05147 GO:0047519 IDA PMID:2976880
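The four-part annotation shown on the slide can be modeled as a small record. The sketch below parses the whitespace-separated line from the slide; the class name, field names and the basic validity checks are assumptions, and real annotation files carry more columns than these four.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    gene_product: str
    go_term: str
    evidence: str
    reference: str


def parse_annotation(line):
    """Parse 'product GO:term evidence reference' into an Annotation."""
    gene_product, go_term, evidence, reference = line.split()
    if not go_term.startswith("GO:"):
        raise ValueError(f"not a GO identifier: {go_term}")
    if not reference.startswith("PMID:"):
        raise ValueError(f"not a PubMed reference: {reference}")
    return Annotation(gene_product, go_term, evidence, reference)


annotation = parse_annotation("P05147 GO:0047519 IDA PMID:2976880")
print(annotation)
```

Keeping evidence code and reference on every annotation is what lets a user trace each assertion back to its source.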

Data sources
• Primary literature
• Centers: mutagenesis, gene trap, etc.
• Data loads: GenBank, SNPs, clone collections, UniProt, RIKEN, IKMC, etc.
• Electronic submissions (individual labs)

Processing, QC, and curation
• Gather data from multiple sources
• Factor out common objects
• Assemble integrated objects
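The three processing steps can be sketched as a merge keyed on a shared object identifier. This is a simplified illustration under stated assumptions: the record layout, source names, identifiers and the "first value wins, later disagreements are logged" policy are all hypothetical, not a description of MGI's loads.

```python
from collections import defaultdict


def integrate(records):
    """Merge per-source records into one object per shared identifier.

    Returns (merged objects, conflicts). A conflict is recorded when two
    sources disagree on an attribute; the first value seen is kept.
    """
    merged = defaultdict(lambda: {"sources": set(), "attributes": {}})
    conflicts = []
    for record in records:                       # gather
        obj = merged[record["object_id"]]        # factor out common objects
        obj["sources"].add(record["source"])
        for key, value in record["attributes"].items():  # assemble
            kept = obj["attributes"].setdefault(key, value)
            if kept != value:
                conflicts.append((record["object_id"], key, kept, value))
    return dict(merged), conflicts


records = [
    {"source": "SourceA", "object_id": "OBJ:1",
     "attributes": {"symbol": "GeneA", "chromosome": "11"}},
    {"source": "SourceB", "object_id": "OBJ:1",
     "attributes": {"symbol": "GeneA", "chromosome": "12"}},
    {"source": "SourceB", "object_id": "OBJ:2",
     "attributes": {"symbol": "GeneB"}},
]
merged, conflicts = integrate(records)
print(conflicts)
```

The conflict list is the part that would go to QC and curation rather than being resolved automatically.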

Data sourcing for MGI

• Data from major providers (e.g. Ensembl, UniProt) and from data project centers (e.g. gene trap, ENU mutagenesis centers) are generally reliably formatted, though the data may still have QC issues. Occasional changes in format can be frustrating.

• Data from individual research labs vary greatly in file format and adherence to nomenclature, and are usually handled on a case-by-case basis.

• Scientific literature largely reflects individual labs and must be treated as using non-standard nomenclatures, but awareness is improving!

Data sourcing for MGI (…wishes)

• more user contributions
  • pre-publication nomenclature assignments
  • data submissions (data can be held private until publication)

• journal permissions for images (have some)

• in progress: collaborations on raw phenotype data exchange with European and Japanese mouse mutagenesis and knockout groups

Building a mouse phenotyping data resource

• Large scale ENU mutagenesis programs worldwide - continuing

• Large scale gene trap programs (International Gene Trap Consortium) www.genetrap.org - gene trap cell lines loaded, with Lexicon

• International Mouse Knockout Consortium
  • KOMP – Knockout Mouse Project (USA) www.knockoutmouse.org
  • EUCOMM – European Conditional Mouse Mutagenesis www.eucomm.org
  • NorCOMM – North American Conditional Mouse Mutagenesis http://norcomm.phenogenomics.ca
  • Texas Institute for Genomic Medicine Knockouts www.tigm.org

• Collaborative Cross www.complextrait.org

• Literature and lab submissions

• New recombinase (cre, flp, etc.) and reporter database is online and is being populated

BREADTH: Large scale screen for potential phenotypic outliers

DEPTH: Phenotypic description of mutant genotype(s)

SUMMARY

Integration in MGI

• is accomplished through a combination of automatic and semi-automatic loads and QC processing, followed by manual curation.

• requires applying semantic consistency using standard nomenclatures, ontologies and structured vocabularies.

• provides users with the ability to find data that would otherwise be ambiguous or not found at all.

• allows complex questions spanning different data sets and data areas to be asked.

SUMMARY

Data Sourcing in MGI

• includes data from major genome resources and mouse centers, as well as individual lab submissions and curated information from scientific literature.

• requires QC processing for format consistency and, for some individual labs, case-by-case assistance.

• for new large-scale phenotyping activities, involves integrating data through common curation with the MP ontology and connecting with raw data (international collaboration).

• involves continuing work with the community and journals to allow easier data access.

Bar Harbor, Maine

MGI is funded by:
NHGRI grants HG000330, HG002273, HG003622
NICHD grant HD033745
NCI grant CA089713

www.informatics.jax.org