Using Ontology to Classify Members of a Protein Family. Robert Stevens, BioHealth Informatics Group, School of Computer Science, University of Manchester


Post on 21-May-2015


DESCRIPTION

Invited talk at Cambridge Chemistry Department

TRANSCRIPT

Page 1: Using Ontology to Classify Members of a Protein Family

Using Ontology to Classify Members of a Protein Family

Robert Stevens

BioHealth Informatics Group

School of Computer Science

University of Manchester

Page 2: Using Ontology to Classify Members of a Protein Family

Introduction

• Developing an automated system for extracting and classifying proteins from newly sequenced genomes

• Building an OWL ontology that defines class membership

• Describing protein instances in OWL

• Classifying against the ontology

• Describing the protein family complement of a genome

• As good as human classification, but added value

• Only possible through inter-disciplinary research

Page 3: Using Ontology to Classify Members of a Protein Family

Acknowledgements (it takes all sorts)

Katy Wolstencroft (Bioinformatics)

Daniele Turi (Instance Store)

Phil Lord (myGrid)

Lydia Tabernero (Protein Scientist)

Matt Horridge, Nick Drummond et al (Protégé OWL)

Andy Brass and Robert Stevens (Bioinformatics)

Page 4: Using Ontology to Classify Members of a Protein Family

Protein Classification

• Proteins divided into broad functional classes: “Protein Families”

• Families sub-divided to give family classifications

• Class membership can be determined by “protein features”, such as domains, etc.

• Resources exist for feature detection via primary sequence – but not class membership

• Current Limitation of Automated Tools

• Needs human knowledge to recognise class membership

Page 5: Using Ontology to Classify Members of a Protein Family

Finding Domains on a Sequence

A search of the linear sequence of protein tyrosine phosphatase type K – identified 9 functional domains

>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).

MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHVSAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNPGTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYIAIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV…

……..

Page 6: Using Ontology to Classify Members of a Protein Family

Why Classify?

• Classification and curation of a genome is the first step in understanding the processes and functions happening in an organism

• Classification enables comparative genomic studies - what is already known in other organisms

• The similarities and differences between processes and functions in related organisms often provide the greatest insight into the biology

• In silico characterisation is the current bottleneck

Page 7: Using Ontology to Classify Members of a Protein Family

The Protein Phosphatases

• Large superfamily of proteins – involved in the removal of phosphate groups from molecules

• Important proteins in almost all cellular processes

• Involved in diseases – diabetes and cancer

• Human phosphatases well characterised

Page 8: Using Ontology to Classify Members of a Protein Family

Phosphatase Classification

• Diagnostic phosphatase domains/motifs – sufficient for membership of the protein phosphatase superfamily

• Any protein having a phosphatase domain is a member of the phosphatase super-family

• Other motifs determine a protein’s place within the family

• Usually needs a human to recognise that the features detected imply class membership

• Can these be captured in an ontology?

Page 9: Using Ontology to Classify Members of a Protein Family

Ontologies

• Describing and defining the classes of objects represented in information

• Defining the characteristics of objects

• The characteristics by which an object can be recognised as belonging to a class

• In a form understandable by a computer

• … and, of course, humans.

Page 10: Using Ontology to Classify Members of a Protein Family

Web Ontology Language (OWL)

• W3C recommendation for ontologies for the Semantic Web

• OWL-DL maps to a decidable fragment of first-order logic

• Classes, properties and instances

• Boolean operators, plus existential and universal quantification

• Rich class expressions used in restrictions on properties – hasDomain some (ImmunoglobulinDomain or FibronectinDomain)
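The existential ("some") and universal ("only") restrictions named on this slide can be illustrated with a minimal Python sketch. It assumes a protein is represented simply as a list of domain-class names and evaluates the restrictions under a closed-world reading, which is a simplification: OWL itself uses open-world semantics. The function names and the two example proteins are hypothetical.

```python
from typing import Iterable, Set


def has_some(domains: Iterable[str], fillers: Set[str]) -> bool:
    """Existential restriction: at least one domain belongs to the filler union."""
    return any(d in fillers for d in domains)


def has_only(domains: Iterable[str], fillers: Set[str]) -> bool:
    """Universal restriction: every domain belongs to the filler union.

    Vacuously true for a protein with no domains, as in OWL.
    """
    return all(d in fillers for d in domains)


# hasDomain some (ImmunoglobulinDomain or FibronectinDomain)
IG_OR_FN = {"ImmunoglobulinDomain", "FibronectinDomain"}

# Hypothetical proteins, described only by their domain lists.
protein_a = ["ImmunoglobulinDomain", "TransmembraneDomain"]
protein_b = ["TransmembraneDomain"]

print(has_some(protein_a, IG_OR_FN))  # protein_a has an immunoglobulin domain
print(has_some(protein_b, IG_OR_FN))  # protein_b has neither filler
```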

Page 11: Using Ontology to Classify Members of a Protein Family

OWL represents classes of instances

[Figure: diagram of classes A, B and C]

Page 12: Using Ontology to Classify Members of a Protein Family

Necessity and Sufficiency

• An R2A phosphatase must have a fibronectin domain

• Having a fibronectin domain does not a phosphatase make

• Necessity – what must a class instance have?

• All phosphatase enzymes have a catalytic domain

• Sufficiency – how is an instance recognised to be a member of a class?

• Any protein that has a phosphatase catalytic domain is a phosphatase enzyme
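The difference between the two kinds of condition can be sketched in a few lines of Python. This is only an illustration: the domain name `PhosphataseCatalyticDomain` and both function names are assumptions, not identifiers from the ontology.

```python
from typing import FrozenSet


def sufficient_for_phosphatase(domains: FrozenSet[str]) -> bool:
    """Sufficient condition: a catalytic domain is enough to conclude 'phosphatase'."""
    return "PhosphataseCatalyticDomain" in domains


def violates_r2a_necessary_condition(domains: FrozenSet[str]) -> bool:
    """Necessary condition: without a fibronectin domain a protein cannot be R2A.

    Passing this check rules nothing in; it only fails to rule R2A out.
    """
    return "FibronectinDomain" not in domains


fibronectin_only = frozenset({"FibronectinDomain"})
catalytic_only = frozenset({"PhosphataseCatalyticDomain"})

# A fibronectin domain alone "does not a phosphatase make".
print(sufficient_for_phosphatase(fibronectin_only))
# A catalytic domain makes a phosphatase, but this protein cannot be an R2A.
print(sufficient_for_phosphatase(catalytic_only),
      violates_r2a_necessary_condition(catalytic_only))
```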

Page 13: Using Ontology to Classify Members of a Protein Family

Definition of Tyrosine Phosphatase

Class: TyrosineReceptorProteinPhosphatase

EquivalentTo: Protein that

- contains at least 1 ProteinTyrosinePhosphataseDomain and

- contains exactly 1 TransmembraneDomain

Page 14: Using Ontology to Classify Members of a Protein Family

…there are known knowns; there are things we know we know. We also know there are

known unknowns; that is to say we know there are some things we do not know. But

there are also unknown unknowns -- the ones we don't know we don't know.

Page 15: Using Ontology to Classify Members of a Protein Family

Definition of Tyrosine Phosphatase: What we Know we Know

Class: TyrosineReceptorProteinPhosphatase

EquivalentTo: Protein that

- contains at least 1 ProteinTyrosinePhosphataseDomain and

- contains exactly 1 TransmembraneDomain

Page 16: Using Ontology to Classify Members of a Protein Family

Definition for R2A Phosphatase

Class: R2A

EquivalentTo: Protein that

- contains 2 ProteinTyrosinePhosphataseDomain and

- contains 1 TransmembraneDomain and

- contains 4 FibronectinDomain and

- contains 1 ImmunoglobulinDomain and

- contains 1 MAMDomain and

- contains 1 Cadherin-LikeDomain and

- contains only (ProteinTyrosinePhosphataseDomain or TransmembraneDomain or FibronectinDomain or ImmunoglobulinDomain or Cadherin-LikeDomain or MAMDomain)
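The role of the final "contains only" clause (the closure) can be seen in a small Python sketch. It assumes the definition is read as exact domain counts plus a ban on any other domain type, and checks a plain list of domain names against it. The names `R2A_EXACT`, `matches_r2a` and `ZincFingerDomain` are hypothetical, and this is not how an OWL reasoner works internally.

```python
from collections import Counter
from typing import Iterable

# Exact cardinalities taken from the R2A definition on this slide.
R2A_EXACT = {
    "ProteinTyrosinePhosphataseDomain": 2,
    "TransmembraneDomain": 1,
    "FibronectinDomain": 4,
    "ImmunoglobulinDomain": 1,
    "MAMDomain": 1,
    "Cadherin-LikeDomain": 1,
}


def matches_r2a(domains: Iterable[str]) -> bool:
    """Check the cardinality restrictions, then the closure ('contains only')."""
    counts = Counter(domains)
    # Every listed domain must occur exactly the stated number of times.
    if any(counts[name] != n for name, n in R2A_EXACT.items()):
        return False
    # Closure: no domain type outside the listed ones may be present.
    return set(counts) <= set(R2A_EXACT)


# A domain list that satisfies the definition exactly.
r2a_like = [name for name, n in R2A_EXACT.items() for _ in range(n)]

print(matches_r2a(r2a_like))
print(matches_r2a(r2a_like + ["ZincFingerDomain"]))  # rejected by the closure
```

Without the closure check, the second protein would still satisfy every cardinality restriction; the closure is what excludes proteins carrying an additional, unlisted domain.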

Page 17: Using Ontology to Classify Members of a Protein Family

Automatic Reasoning

• An OWL-DL ontology is mapped to its DL form as a collection of axioms

• An automatic reasoner checks for satisfiability – throws out the inconsistent and infers subsumption

• Defined classes (where there are necessary and sufficient restrictions) enable a reasoner to infer subclass axioms

• Also infers to which class an object belongs

• Based on the facts we know about it
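To give a feel for inferred subsumption, here is a deliberately simplified Python sketch. It is not a description logic reasoner: instead of a tableau procedure it assumes each defined class is just a set of (min, max) cardinality constraints per domain type and tests interval containment. The type aliases and the `subsumes` function are hypothetical.

```python
import math
from typing import Dict, Tuple

Constraint = Tuple[float, float]  # (min, max) number of domains of one type
Definition = Dict[str, Constraint]


def subsumes(general: Definition, specific: Definition) -> bool:
    """True if every constraint of `general` is implied by `specific`.

    A domain type that `specific` does not mention is unconstrained: (0, inf).
    """
    for domain, (g_min, g_max) in general.items():
        s_min, s_max = specific.get(domain, (0, math.inf))
        if s_min < g_min or s_max > g_max:
            return False
    return True


TYROSINE_RECEPTOR: Definition = {
    "ProteinTyrosinePhosphataseDomain": (1, math.inf),  # at least 1
    "TransmembraneDomain": (1, 1),                      # exactly 1
}

R2A: Definition = {
    "ProteinTyrosinePhosphataseDomain": (2, 2),
    "TransmembraneDomain": (1, 1),
    "FibronectinDomain": (4, 4),
    "ImmunoglobulinDomain": (1, 1),
    "MAMDomain": (1, 1),
    "Cadherin-LikeDomain": (1, 1),
}

# R2A is inferred to be a kind of tyrosine receptor phosphatase, not vice versa.
print(subsumes(TYROSINE_RECEPTOR, R2A), subsumes(R2A, TYROSINE_RECEPTOR))
```

Nobody asserted that R2A is a subclass of the tyrosine receptor phosphatase class; the relationship follows from the two definitions alone, which is the point of using defined classes.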

Page 18: Using Ontology to Classify Members of a Protein Family

Incremental Addition of Protein Functional Domains

[Figure: protein functional domains added incrementally. Domain labels: Phosphatase catalytic, Cadherin-like, Immunoglobulin, MAM domain, Cellular retinaldehyde, Adhesion recognition, Transmembrane, Fibronectin III, Glycosylation]

Page 19: Using Ontology to Classify Members of a Protein Family

Building the Ontology

• Classifications already made by biologists – based on protein functionality;

• Protein domain composition and other details in the literature;

• Some 50 classes of phosphatase, 30 protein domains and one relationship;

• “Value partition” of protein domains (covering and disjoint);

• Defines range of contains property;

• Literature contains knowledge of how to recognise members of each class of phosphatase.

Page 20: Using Ontology to Classify Members of a Protein Family

Classification of the Classical Tyrosine Phosphatases

Page 21: Using Ontology to Classify Members of a Protein Family

What is the Ontology Telling Us?

• Each class of phosphatase defined in terms of domain composition

• We know the characteristics by which an individual protein can be recognised to be a member of a particular class of phosphatase

• We have this knowledge in a computational form

• If we had protein instances described in terms of the ontology, we could classify those individual proteins

• A catalogue of phosphatases

Page 22: Using Ontology to Classify Members of a Protein Family

Description of an Instance of a Protein

• Instance: P21592

TypeOf: Protein that

Fact: hasDomain 2 ProteinTyrosinePhosphataseDomain and

Fact: hasDomain 1 TransmembraneDomain and

Fact: hasDomain 4 FibronectinDomain and

Fact: hasDomain 1 ImmunoglobulinDomain and

Fact: hasDomain 1 MAMDomain and

Fact: hasDomain 1 Cadherin-LikeDomain

Page 23: Using Ontology to Classify Members of a Protein Family

Instance: P21592

TypeOf: Protein that

Fact: hasDomain 2 ProteinTyrosinePhosphataseDomain and

Fact: hasDomain 1 TransmembraneDomain and

Fact: hasDomain 4 FibronectinDomain and

Fact: hasDomain 1 ImmunoglobulinDomain and

Fact: hasDomain 1 MAMDomain and

Fact: hasDomain 1 Cadherin-LikeDomain

Tyrosine Phosphatase

(containsDomain some TransmembraneDomain) and

(containsDomain at least 1 ProteinTyrosinePhosphataseDomain)

R2A Phosphatase

(containsDomain some MAMDomain) and

(containsDomain some ProteinTyrosineCatalyticDomain or ImmunoglobulinDomain) and

(containsDomain some FibronectinDomain or FibronectinTypeIIIFoldDomain) and

(containsDomain exactly 2 ProteinTyrosinePhosphataseDomain)
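How the instance on this slide ends up in both classes can be sketched in Python. The two class definitions are transcribed from the slide as plain predicates over domain counts; this stands in for the reasoner and is only an assumption-laden illustration, not the Instance Store's method. The helper names `some`, `DEFINITIONS` and `classify` are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List

# Domain counts for instance P21592, as listed on the slide.
P21592 = Counter({
    "ProteinTyrosinePhosphataseDomain": 2,
    "TransmembraneDomain": 1,
    "FibronectinDomain": 4,
    "ImmunoglobulinDomain": 1,
    "MAMDomain": 1,
    "Cadherin-LikeDomain": 1,
})


def some(counts: Counter, *names: str) -> bool:
    """'containsDomain some (A or B ...)': at least one of the named domains."""
    return any(counts[name] >= 1 for name in names)


DEFINITIONS: Dict[str, Callable[[Counter], bool]] = {
    "Tyrosine Phosphatase": lambda c: (
        some(c, "TransmembraneDomain")
        and c["ProteinTyrosinePhosphataseDomain"] >= 1
    ),
    "R2A Phosphatase": lambda c: (
        some(c, "MAMDomain")
        and some(c, "ProteinTyrosineCatalyticDomain", "ImmunoglobulinDomain")
        and some(c, "FibronectinDomain", "FibronectinTypeIIIFoldDomain")
        and c["ProteinTyrosinePhosphataseDomain"] == 2
    ),
}


def classify(counts: Counter) -> List[str]:
    """Return, in alphabetical order, every class whose definition is satisfied."""
    return sorted(name for name, test in DEFINITIONS.items() if test(counts))


print(classify(P21592))
```

The instance satisfies both definitions; a real reasoner would additionally report the most specific class by using the inferred class hierarchy.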

Page 24: Using Ontology to Classify Members of a Protein Family

Classifying Proteins

>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).

MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHVSAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNPGTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYIAIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV………..

[Figure: pipeline diagram with stages labelled InterPro, Instance Store, Reasoner, Translate, Codify]
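The step from detected features to an instance description might look roughly like the Python sketch below. Everything about the input is an assumption: it takes a plain list of feature labels rather than real InterPro output, and the label-to-class mapping `FEATURE_TO_DOMAIN`, the function `describe` and the accession `X1` are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List

# Hypothetical mapping from detected feature labels to ontology domain classes.
FEATURE_TO_DOMAIN = {
    "tyrosine phosphatase catalytic": "ProteinTyrosinePhosphataseDomain",
    "transmembrane": "TransmembraneDomain",
    "fibronectin III": "FibronectinDomain",
}


def describe(accession: str, hits: Iterable[str]) -> List[str]:
    """Count the recognised features and emit slide-style 'Fact' lines.

    Features with no mapping are skipped rather than guessed at.
    """
    counts = Counter(FEATURE_TO_DOMAIN[h] for h in hits if h in FEATURE_TO_DOMAIN)
    lines = [f"Instance: {accession}", "TypeOf: Protein"]
    lines += [f"Fact: hasDomain {n} {domain}" for domain, n in sorted(counts.items())]
    return lines


# Hypothetical hits for a made-up accession.
for line in describe("X1", ["transmembrane", "fibronectin III", "fibronectin III", "unknown"]):
    print(line)
```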

Page 25: Using Ontology to Classify Members of a Protein Family

So Far…..

• Human phosphatases have been classified using the system

• The ontology classification performed as well as expert classification

• The ontology system refined classification

- DUSC contains a zinc finger domain: characterised and conserved, but not in the classification

- DUSA contains a disintegrin domain: previously uncharacterised, evolutionarily conserved

• A new kind of phosphatase?

Page 26: Using Ontology to Classify Members of a Protein Family

Aspergillus fumigatus

• Phosphatase complement very different from human: >100 human, <50 A. fumigatus

• Whole subfamilies ‘missing’

- Different fungi-specific phosphorylation pathways?

- No requirement for tissue-specific variations?

• Novel serine/threonine phosphatase with homeobox

- Conserved in Aspergillus and closely related species, but not in any other

• Again, a new phosphatase?

Page 27: Using Ontology to Classify Members of a Protein Family

Scaling

• Over 700 protein families

• Some 14,000 described sequence features

• Hundreds of thousands of types of protein

• Mass classification, then what?

Page 28: Using Ontology to Classify Members of a Protein Family

Generic Technique

• Feature detection

• Categories defined in terms of those features

• Produce catalogue of what you currently know

• Highlight cases that don’t match current knowledge
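The four steps above can be sketched generically in Python. This assumes the simplest possible reading: a category is a set of required features, an item belongs to every category whose features it has, and anything matching no category is flagged for a human to look at. The names `build_catalogue`, `CATEGORIES` and `PROTEINS`, and the toy data, are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def build_catalogue(
    proteins: Dict[str, Set[str]],
    categories: Dict[str, Set[str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Assign each protein to every category whose required features it has.

    Returns the catalogue plus the proteins that matched no category.
    """
    catalogue: Dict[str, List[str]] = {name: [] for name in categories}
    unmatched: List[str] = []
    for protein, features in proteins.items():
        hits = [cat for cat, required in categories.items() if required <= features]
        for cat in hits:
            catalogue[cat].append(protein)
        if not hits:
            unmatched.append(protein)
    return catalogue, unmatched


# Toy categories and feature sets, invented for illustration.
CATEGORIES = {
    "Phosphatase": {"PhosphataseCatalyticDomain"},
    "ReceptorPhosphatase": {"PhosphataseCatalyticDomain", "TransmembraneDomain"},
}
PROTEINS = {
    "p1": {"PhosphataseCatalyticDomain", "TransmembraneDomain"},
    "p2": {"PhosphataseCatalyticDomain"},
    "p3": {"HomeoboxDomain"},
}

catalogue, unmatched = build_catalogue(PROTEINS, CATEGORIES)
print(catalogue)
print(unmatched)  # cases that do not match current knowledge
```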

Page 29: Using Ontology to Classify Members of a Protein Family

Conclusions

• Using ontology allows automated classification to reach the standard of human expert annotation

• Reasoning capabilities allow interpretation of domain organisation

• Capturing human knowledge in computational form

• Systematic survey produces interesting biological questions

• Discovering the unexpected

• Allows fast, efficient comparative genomics studies

• A combination of CS and bioinformatics to do biology