Computer Science Ph. D. Seminar Gene Ontology (GO) Based Search for Protein Structure Similarity Clustering Metrics Ph.D. Candidate Steve Johnson Committee Members Dr. Debasis Mitra , Dr. Philip Bernhard , Dr. Walter Bond, Dr. Julia Grimwade Date: September 12, 2011


Computer Science Ph. D. Seminar

Gene Ontology (GO) Based Search for Protein Structure Similarity Clustering Metrics

Ph.D. Candidate: Steve Johnson

Committee Members: Dr. Debasis Mitra, Dr. Philip Bernhard, Dr. Walter Bond, Dr. Julia Grimwade

Date: September 12, 2011


Gene Ontology (GO) Based Search for Protein Structure Similarity Clustering Metrics

• GO Background

• GO Subontologies

• GO Annotations

• GO Relationships

• GO Tools

• GO Research

• Research Direction


Gene Ontology Background

The Gene Ontology (GO), http://www.geneontology.org/, provides a consistent vocabulary for describing the attributes of proteins, specifically molecular function, biological process and the cellular component where the protein is found.


Gene Ontology Background: GO Consortium

• GO terms
  o A set of integer IDs (i.e., GO terms) is assigned to members of the GO Consortium

• GO Consortium members
  o provide annotations
  o attend all meetings
  o receive funding for supported databases


Gene Ontology Project Facts

• Started in 1998

• Primary Goals
  o Structured vocabulary
  o Use to annotate genes and gene products

• 3 Model Organism Databases
  o FlyBase (Drosophila)
  o Saccharomyces Genome Database (SGD)
  o Mouse Genome Informatics (MGI) project


Gene Subontologies

Three Ontology Structure

• Biological Process

• Molecular Function

• Cellular component


Gene Subontologies: Biological Process

A biological process is a series of steps, that is, an ordered sequence of molecular functions.

Examples of biological processes include the following.

• Metabolic Process
• Photosynthetic Process
• Biosynthetic Process


Gene Subontologies: Molecular Function

Molecular function describes the purpose of the gene product. Unlike a biological process, it refers to a single function.

Examples of molecular functions include the following.

• Binding Activity
• Transport Activity
• Receptor Activity


Gene Subontologies: Cellular Component

Cellular component identifies the location of the gene product within the structure of the cell. Examples of cellular components include the following.

• Organelle Part
• Cell Body Membrane
• Apical Complex


GO Annotations: GO Annotation Terms

Example

• Term: Glucose Biosynthetic Process

• ID: GO:0006094

• Definition: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol.
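
The example term above has three fields: a name, an ID, and a definition. GO IDs follow a fixed pattern, the prefix `GO:` followed by a zero-padded seven-digit integer. A minimal Python sketch of such a record and an ID check might look like this; the `GOTerm` class and the `parse_go_id` helper are illustrative names, not part of any GO tool.

```python
import re
from dataclasses import dataclass

# GO accessions look like "GO:0006094": the prefix plus seven digits.
GO_ID_PATTERN = re.compile(r"^GO:(\d{7})$")


@dataclass(frozen=True)
class GOTerm:
    """Illustrative record holding the three fields shown on the slide."""
    go_id: str
    name: str
    definition: str


def parse_go_id(go_id: str) -> int:
    """Return the integer part of a GO accession, or raise ValueError."""
    match = GO_ID_PATTERN.match(go_id)
    if match is None:
        raise ValueError(f"not a GO accession: {go_id!r}")
    return int(match.group(1))


term = GOTerm(
    go_id="GO:0006094",
    name="Glucose Biosynthetic Process",
    definition=(
        "The formation of glucose from noncarbohydrate precursors, "
        "such as pyruvate, amino acids and glycerol."
    ),
)
print(term.name, parse_go_id(term.go_id))
```

Keeping the ID as a string and validating it on demand avoids losing the zero padding, which is part of the accession.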


GO Annotations: GO Annotation Term Statistics

As of September 2009

• Molecular Function: 8,637 terms
• Biological Process: 17,069 terms
• Cellular Component: 2,432 terms
• Total: 28,138 terms


GO Annotations: GO Annotation Methods

• Electronic Annotation
• Manual Annotation
• All annotations
  o Source
  o Supportive evidence


GO Annotations: GO Annotation Methods

Manual Annotation

• Primary source is published literature

• Curators perform sequence similarity analyses to transfer annotations between highly similar gene products (BLAST, protein domain analysis)


GO Annotations: GO Annotation Methods

Electronic Annotation

• Database entries
  o Manual mapping of GO terms to concepts external to GO (‘translation tables’)
  o Proteins then electronically annotated with the relevant GO term(s)

• Automatic sequence similarity analyses to transfer annotations between highly similar gene products
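
The translation-table step of electronic annotation can be sketched as a dictionary lookup: each external concept maps to one or more GO IDs, and every protein carrying that concept inherits the mapped terms. In the Python sketch below, the keyword strings, the protein name, and the function name are hypothetical; `IEA` is the GO evidence code for annotations inferred electronically.

```python
# Hypothetical translation table: external concept -> GO term IDs.
# The keyword strings are made up for illustration; the GO IDs are
# the ones that appear elsewhere in these slides.
TRANSLATION_TABLE = {
    "KW:ethanol-degradation": ["GO:0006068"],
    "KW:mitochondrion": ["GO:0005739"],
}


def annotate_electronically(protein_concepts, table):
    """Map each protein's external concepts to (GO ID, evidence code) pairs."""
    annotations = {}
    for protein, concepts in protein_concepts.items():
        go_ids = []
        for concept in concepts:
            for go_id in table.get(concept, []):
                if go_id not in go_ids:  # keep first-seen order, no duplicates
                    go_ids.append(go_id)
        # IEA = "Inferred from Electronic Annotation"
        annotations[protein] = [(go_id, "IEA") for go_id in go_ids]
    return annotations


# Hypothetical database entry with two mapped concepts and one unmapped one.
entries = {
    "EXAMPLE_PROTEIN": [
        "KW:ethanol-degradation",
        "KW:mitochondrion",
        "KW:unmapped-concept",
    ],
}
print(annotate_electronically(entries, TRANSLATION_TABLE))
```

Concepts with no entry in the table are skipped silently, which mirrors the fact that only manually mapped concepts produce annotations.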


GO Annotations: GO Annotation Example

1A71: Liver Alcohol Dehydrogenase

• Cellular Component: Mitochondria (GO:0005739)

• Biological Process: Ethanol Catabolic Process (GO:0006068)

• Molecular Function: Oxidoreductase Activity


GO Annotations: Sample Annotations

GO Consortium members provide gene annotation data based on information obtained from research-quality articles.

The information extracted from the articles is described as “Annotation Sets”.

• Sample Annotation Sets


GO Annotations: File Format

The Gene Ontology website presents the annotation data in graphical form. GO is part of the Open Biomedical Ontologies (OBO), http://obo.sourceforge.net/.

• Current Species/Database Annotations

• Annotation File Format (GAF 2.0)
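
A GAF 2.0 file is tab-delimited text with 17 columns per annotation, and lines beginning with `!` are comments. The Python sketch below reads such a file into dictionaries. The column keys are shorthand chosen here for the GAF 2.0 fields, and the sample record is fabricated purely to exercise the parser; it is not a real annotation.

```python
import io

# Shorthand keys for the 17 GAF 2.0 columns, in file order.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by", "annotation_extension",
    "gene_product_form_id",
]


def read_gaf(handle):
    """Parse GAF 2.0 lines into dicts, skipping '!' comments and blank lines."""
    rows = []
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Pad short lines so every column key is present.
        fields += [""] * (len(GAF_COLUMNS) - len(fields))
        rows.append(dict(zip(GAF_COLUMNS, fields)))
    return rows


# Fabricated record (placeholder accession, symbol, and reference).
sample_record = "\t".join([
    "UniProtKB", "P00000", "EXAMPLE", "", "GO:0006068", "PMID:0000000",
    "IDA", "", "P", "Example protein", "", "protein", "taxon:9606",
    "20110912", "UniProt", "", "",
])
sample_file = io.StringIO("!gaf-version: 2.0\n" + sample_record + "\n")

for row in read_gaf(sample_file):
    print(row["db_object_id"], row["go_id"], row["evidence_code"], row["aspect"])
```

The `aspect` column holds a single letter for the subontology: `P` (biological process), `F` (molecular function), or `C` (cellular component).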


GO Annotations: Evidence Code Categories

The annotation file includes evidence information, which serves as a source for validating each annotation.

•Experimental Evidence Codes

•Computational Analysis Evidence Codes

•Author Statement Evidence Codes

•Curator Statement Evidence Codes
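
The four categories above can be represented as a lookup from evidence code to category. The Python sketch below lists a selection of commonly used codes per category and is not meant to be exhaustive; the function name is illustrative. Note that `IEA` (Inferred from Electronic Annotation) is grouped separately from these four categories in the GO documentation, so the lookup returns `None` for it.

```python
# A selection of GO evidence codes grouped by the categories on the slide.
EVIDENCE_CATEGORIES = {
    "Experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "Computational Analysis": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "Author Statement": {"TAS", "NAS"},
    "Curator Statement": {"IC", "ND"},
}


def evidence_category(code):
    """Return the category name for an evidence code, or None if unlisted."""
    normalized = code.strip().upper()
    for category, codes in EVIDENCE_CATEGORIES.items():
        if normalized in codes:
            return category
    return None


for code in ("IDA", "ISS", "TAS", "IC", "IEA"):
    print(code, "->", evidence_category(code))
```

A lookup like this makes it easy to filter an annotation set, for example keeping only experimentally supported annotations.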


GO Annotations: GO Slims

GO Slims are subsets of the GO that provide a broader classification of terms: specific terms are grouped under a small number of high-level terms.

GO Slim Application Example
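
To illustrate how a slim gives a broader classification, the Python sketch below maps specific terms to the slim terms found among their ancestors and counts annotations per slim term. The parent relationships are a simplified toy hierarchy built from term names used in these slides, and the slim itself is a made-up two-term subset; neither is taken from an actual GO Slim file.

```python
from collections import Counter

# Toy hierarchy: term -> direct parents (simplified for illustration).
PARENTS = {
    "glucose biosynthetic process": ["biosynthetic process"],
    "ethanol catabolic process": ["catabolic process"],
    "biosynthetic process": ["metabolic process"],
    "catabolic process": ["metabolic process"],
    "metabolic process": [],
}

# Hypothetical slim: the broad terms we want to report against.
SLIM = {"biosynthetic process", "catabolic process"}


def ancestors_and_self(term, parents):
    """Collect a term and everything reachable by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


def slim_counts(annotated_terms, parents, slim):
    """Count how many annotations fall under each slim term."""
    counts = Counter()
    for term in annotated_terms:
        counts.update(ancestors_and_self(term, parents) & slim)
    return counts


annotations = [
    "glucose biosynthetic process",
    "ethanol catabolic process",
    "glucose biosynthetic process",
]
print(slim_counts(annotations, PARENTS, SLIM))
```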


GO Relationships

A graph structure is used to establish relationships among the terms of the molecular function, biological process, and cellular component subontologies.

Primary Ontology Relations

•is a

•part of

•regulates
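
The graph can be sketched as a list of typed edges, with a traversal that follows only the relation types of interest. In the Python sketch below, the edges are a simplified toy example using term names from these slides, and the function name is illustrative.

```python
# Toy edge list: (child term, relation, parent term). Simplified for
# illustration; not extracted from the ontology itself.
EDGES = [
    ("glucose biosynthetic process", "is_a", "biosynthetic process"),
    ("biosynthetic process", "is_a", "metabolic process"),
    ("regulation of glucose biosynthetic process", "regulates",
     "glucose biosynthetic process"),
    ("mitochondrial membrane", "part_of", "mitochondrion"),
]


def reachable(term, edges, relations):
    """Return all terms reachable from `term` via the given relation types."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child, relation, parent in edges:
            if child == current and relation in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(reachable("glucose biosynthetic process", EDGES, {"is_a"}))
print(reachable("mitochondrial membrane", EDGES, {"is_a", "part_of"}))
print(reachable("regulation of glucose biosynthetic process", EDGES,
                {"is_a", "part_of"}))
```

Restricting the traversal by relation type matters: in this toy graph the regulation term reaches nothing through `is_a` or `part_of` alone, because its only outgoing edge is `regulates`.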


Gene Ontology Background: GO Mappings to EC Numbers

Enzyme Commission (EC) numbers are used to specify categories of enzymes based on the chemical reactions they catalyze. The UniProtKB-GOA EC2GO mapping provides GO molecular function IDs for each classification.

• EC 1 - Oxidoreductases

• EC 2 - Transferases

• EC 3 - Hydrolases

• EC 4 - Lyases

• EC 5 - Isomerases

• EC 6 - Ligases
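
The first digit of an EC number selects one of the six classes listed above. A minimal Python sketch of that lookup follows; the function name and the accepted input spellings (`1.1.1.1`, `EC 1.1.1.1`, `EC:1.1.1.1`) are choices made here for illustration.

```python
# Top-level EC classes as listed on the slide.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def ec_class(ec_number):
    """Return the top-level enzyme class for an EC number string."""
    cleaned = ec_number.strip()
    if cleaned.upper().startswith("EC"):
        cleaned = cleaned[2:].lstrip(": ")
    first_field = cleaned.split(".")[0]
    try:
        return EC_CLASSES[int(first_field)]
    except (ValueError, KeyError):
        raise ValueError(f"unrecognized EC number: {ec_number!r}") from None


# Alcohol dehydrogenase is EC 1.1.1.1.
print(ec_class("EC 1.1.1.1"))
```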


GO Tools

• AmiGO
• OBO-Edit
• QuickGO
• GOanna
• agriGO


Gene Ontology Database

• MySQL
• Querying GO MySQL
  o SQL
  o Perl
  o GHOUL
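
As a rough illustration of querying the database with SQL, the Python sketch below runs a query against an in-memory SQLite table that stands in for the MySQL database, so it runs without a server. The `term` table and its `acc`, `name`, and `term_type` columns are modeled on the GO database schema but should be treated as assumptions here, and the three rows are simply terms mentioned in these slides.

```python
import sqlite3

# In-memory stand-in for the GO MySQL database (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE term ("
    "id INTEGER PRIMARY KEY, name TEXT, term_type TEXT, acc TEXT)"
)
conn.executemany(
    "INSERT INTO term (name, term_type, acc) VALUES (?, ?, ?)",
    [
        ("glucose biosynthetic process", "biological_process", "GO:0006094"),
        ("ethanol catabolic process", "biological_process", "GO:0006068"),
        ("mitochondrion", "cellular_component", "GO:0005739"),
    ],
)


def terms_of_type(connection, term_type):
    """Return (accession, name) pairs for one subontology, sorted by accession."""
    cursor = connection.execute(
        "SELECT acc, name FROM term WHERE term_type = ? ORDER BY acc",
        (term_type,),
    )
    return cursor.fetchall()


for acc, name in terms_of_type(conn, "biological_process"):
    print(acc, name)
```

Using a parameterized query (`?`) rather than string formatting keeps the same code safe to reuse with user-supplied term types.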


Gene Ontology Interesting Research

• GO Annotation Consistency
• Automated Annotation

• BioCreAtIvE
• CLUGO
• Similarity Prediction Method

• Automated Protein Function Predictions
• Search for Genes with Similar Function
• Semantic Similarity


Dissertation Research Hypothesis

There exist protein alignment metrics/algorithms that can be used as clustering indexes for proteins with matching GO molecular function IDs.
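
One way to make the hypothesis concrete is to ask whether a given alignment metric places proteins that share a GO molecular function ID closer together than proteins that do not. The Python sketch below computes the difference between the mean between-group and mean within-group distance. It is only an illustration under stated assumptions, not the evaluation method of this research: the protein names, the distance values, and the `MF_A`/`MF_B` labels are all hypothetical placeholders.

```python
from itertools import combinations
from statistics import mean


def separation_score(distances, go_function):
    """Mean distance between different-function pairs minus same-function pairs.

    distances:   dict mapping frozenset({a, b}) -> alignment distance
    go_function: dict mapping protein -> GO molecular function label
    A larger positive score means the metric groups matching proteins better.
    """
    within, between = [], []
    for a, b in combinations(sorted(go_function), 2):
        d = distances[frozenset((a, b))]
        if go_function[a] == go_function[b]:
            within.append(d)
        else:
            between.append(d)
    if not within or not between:
        raise ValueError("need both same-function and different-function pairs")
    return mean(between) - mean(within)


# Hypothetical proteins, labels, and pairwise distances.
labels = {"p1": "MF_A", "p2": "MF_A", "p3": "MF_B", "p4": "MF_B"}
pairwise = {
    frozenset(("p1", "p2")): 0.5,
    frozenset(("p3", "p4")): 1.0,
    frozenset(("p1", "p3")): 3.0,
    frozenset(("p1", "p4")): 4.0,
    frozenset(("p2", "p3")): 3.5,
    frozenset(("p2", "p4")): 3.5,
}
print(separation_score(pairwise, labels))
```

Comparing this score across candidate alignment metrics would give a simple first ranking; established cluster-validity indexes could replace it in a fuller study.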


Gene Ontology References

Evelyn B Camon, Daniel G Barrell, Emily C Dimmer, Vivian Lee, Michele Magrane, John Maslen, David Binns and Rolf Apweiler; An evaluation of GO annotation retrieval for BioCreAtIvE and GOA. BMC Bioinformatics 2005. 6 (Supplement 1): S17.

Mary E. Dolan, Li Ni, Evelyn Camon and Judith A. Blake; A procedure for assessing GO annotation consistency. Bioinformatics 2005. 21 (Supplement 1): i136 – i143.

In-Yee Lee, Jan-Ming Ho, Ming-Syan Chen; CLUGO: A Clustering Algorithm for Automated Functional Annotations Based on Gene Ontology. Proceedings of the 5th IEEE International Conference on Data Mining (ICDM '05): i136 – i143.

Gene Ontology Consortium; The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Research, 2009.

Evelyn Camon, Michele Magrane, Daniel Barrell, Vivian Lee, Emily Dimmer, John Maslen, David Binns, Nicola Harte, Rodrigo Lopez and Rolf Apweiler; The Gene Ontology Annotation (GOA) Database: sharing knowledge in UniProt with Gene Ontology. Nucleic Acids Research, 2004 (32).


Gene Ontology Consortium; The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research, 2004 (32).

Seth Carbon, Amelia Ireland, Christopher J. Mungall, ShengQiang Shu, Brad Marshall, Suzanna Lewis; AmiGO: online access to ontology and annotation data. Bioinformatics Application Note. 22 (2), 2009: 288 – 289.