using ontology to classify members of a protein family

Post on 21-May-2015


DESCRIPTION

Invited talk at Cambridge Chemistry Department

TRANSCRIPT

Using Ontology to Classify Members of a Protein Family

Robert Stevens

BioHealth Informatics Group

School of Computer Science

University of Manchester

Robert.stevens@manchester.ac.uk

Introduction

• Developing an automated system for extracting and classifying proteins from newly sequenced genomes

• Building an OWL ontology that defines class membership

• Describing protein instances in OWL

• Classifying against the ontology

• Describing the protein family complement of a genome

• As good as human classification, but added value

• Only possible through inter-disciplinary research

Acknowledgements (it takes all sorts)

Katy Wolstencroft (Bioinformatics)

Daniele Turi (Instance Store)

Phil Lord (myGrid)

Lydia Tabernero (Protein Scientist)

Matt Horridge, Nick Drummond et al (Protégé OWL)

Andy Brass and Robert Stevens (Bioinformatics)

Protein Classification

• Proteins divided into broad functional classes: “Protein Families”

• Families sub-divided to give family classifications

• Class membership can be determined by “protein features”, such as domains, etc.

• Resources exist for feature detection via primary sequence – but not class membership

• Current Limitation of Automated Tools

• Needs human knowledge to recognise class membership

Finding Domains on a Sequence

A search of the linear sequence of protein tyrosine phosphatase type K – identified 9 functional domains

>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).

MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHVSAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNPGTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYIAIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV…

……..

Why Classify?

• Classification and curation of a genome is the first step in understanding the processes and functions happening in an organism

• Classification enables comparative genomic studies - what is already known in other organisms

• The similarities and differences between processes and functions in related organisms often provide the greatest insight into the biology

• In silico characterisation is the current bottleneck

The Protein Phosphatases

• large superfamily of proteins – involved in the removal of phosphate groups from molecules

• Important proteins in almost all cellular processes

• Involved in diseases – diabetes and cancer

• Human phosphatases well characterised

Phosphatase Classification

• Diagnostic phosphatase domains/motifs – sufficient for membership of the protein phosphatase superfamily

• Any protein having a phosphatase domain is a member of the phosphatase super-family

• Other motifs determine a protein’s place within the family

• Usually needs a human to recognise that the detected features imply class membership

• Can these be captured in an ontology?

Ontologies

• Describing and defining the classes of objects represented in information

• Defining the characteristics of objects

• The characteristics by which an object can be recognised as belonging to a class

• In a form understandable by a computer

• … and, of course, humans.

Web Ontology Language (OWL)

• W3C recommendation for ontologies for the Semantic Web

• OWL-DL maps to a decidable fragment of first order logic

• Classes, properties and instances

• Boolean operators, plus existential and universal quantification

• Rich class expressions used in restrictions on properties – hasDomain some (ImmunoglobulinDomain or FibronectinDomain)
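The restriction on the slide, hasDomain some (ImmunoglobulinDomain or FibronectinDomain), can be read as a test over the domains found on a protein. The following is a minimal sketch in plain Python, not OWL itself: the function names `has_some` and `has_only` are assumptions made for illustration, and the check is closed-world (it only looks at the domains listed), whereas an OWL reasoner works under the open-world assumption.

```python
from typing import Iterable, Set


def has_some(domains: Iterable[str], fillers: Set[str]) -> bool:
    """Existential restriction: at least one domain falls in the filler union."""
    return any(domain in fillers for domain in domains)


def has_only(domains: Iterable[str], fillers: Set[str]) -> bool:
    """Universal restriction: every domain falls in the filler union.

    Like its OWL counterpart, this holds vacuously when there are no domains.
    """
    return all(domain in fillers for domain in domains)


# The union of classes used as the filler on the slide.
fillers = {"ImmunoglobulinDomain", "FibronectinDomain"}

# An illustrative protein described by its list of domains.
protein = ["TransmembraneDomain", "FibronectinDomain"]

print(has_some(protein, fillers))  # True: a FibronectinDomain is present
print(has_only(protein, fillers))  # False: TransmembraneDomain is outside the union
```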

OWL represents classes of instances

[Figure: classes labelled A, B and C]

Necessity and Sufficiency

• An R2A phosphatase must have a fibronectin domain

• Having a fibronectin domain does not a phosphatase make

• Necessity – what must a class instance have?

• Any protein that has a phosphatase catalytic domain is a phosphatase enzyme

• All phosphatase enzymes have a catalytic domain

• Sufficiency – how is an instance recognised to be a member of a class?
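The difference between the two kinds of condition can be sketched as two separate tests. This is an illustrative sketch only: the domain names and function names are assumptions, and each protein is reduced to a plain set of domain names.

```python
from typing import Set


def is_phosphatase(domains: Set[str]) -> bool:
    """Necessary and sufficient: a phosphatase catalytic domain defines the class."""
    return "PhosphataseCatalyticDomain" in domains


def meets_r2a_necessary_condition(domains: Set[str]) -> bool:
    """Necessary only: every R2A has a fibronectin domain, but so do other proteins."""
    return "FibronectinDomain" in domains


# A hypothetical non-phosphatase protein that happens to carry a fibronectin domain.
adhesion_protein = {"FibronectinDomain", "ImmunoglobulinDomain"}

print(meets_r2a_necessary_condition(adhesion_protein))  # True
print(is_phosphatase(adhesion_protein))                 # False
```

Passing a necessary condition does not place the protein in the class; only a sufficient condition licenses that inference.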

Definition of Tyrosine Phosphatase

Class: TyrosineReceptorProteinPhosphatase

EquivalentTo: Protein that

- contains at least 1 ProteinTyrosinePhosphataseDomain and

- contains exactly 1 TransmembraneDomain

…there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know.

Definition of Tyrosine Phosphatase: What we Know we Know

Class: TyrosineReceptorProteinPhosphatase

EquivalentTo: Protein that

- contains at least 1 ProteinTyrosinePhosphataseDomain and

- contains exactly 1 TransmembraneDomain

Definition for R2A Phosphatase

Class: R2A

EquivalentTo: Protein that

- contains 2 ProteinTyrosinePhosphataseDomain and

- contains 1 TransmembraneDomain and

- contains 4 FibronectinDomain and

- contains 1 ImmunoglobulinDomain and

- contains 1 MAMDomain and

- contains 1 Cadherin-LikeDomain and

- contains only (ProteinTyrosinePhosphataseDomain or TransmembraneDomain or FibronectinDomain or ImmunoglobulinDomain or Cadherin-LikeDomain or MAMDomain)
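The R2A definition combines exact domain counts with a closure ("contains only") restriction. A minimal sketch of both checks in plain Python follows. It assumes the domain names used in the counted restrictions of the definition, treats a protein as a flat list of domain names, and is a closed-world approximation rather than OWL reasoning; the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Exact counts taken from the counted restrictions in the R2A definition.
R2A_REQUIRED: Dict[str, int] = {
    "ProteinTyrosinePhosphataseDomain": 2,
    "TransmembraneDomain": 1,
    "FibronectinDomain": 4,
    "ImmunoglobulinDomain": 1,
    "MAMDomain": 1,
    "Cadherin-LikeDomain": 1,
}


def satisfies_counts(counts: Counter, required: Dict[str, int]) -> bool:
    """Each required domain occurs exactly the stated number of times."""
    return all(counts[name] == number for name, number in required.items())


def satisfies_closure(counts: Counter, required: Dict[str, int]) -> bool:
    """No domain outside the listed ones is present ("contains only")."""
    return set(counts) <= set(required)


def is_r2a(domains: Iterable[str]) -> bool:
    counts = Counter(domains)
    return satisfies_counts(counts, R2A_REQUIRED) and satisfies_closure(counts, R2A_REQUIRED)


def expand(required: Dict[str, int]) -> List[str]:
    """Build a domain list that matches the definition exactly (for the demo)."""
    return [name for name, number in required.items() for _ in range(number)]


candidate = expand(R2A_REQUIRED)
print(is_r2a(candidate))                         # True
print(is_r2a(candidate + ["ZincFingerDomain"]))  # False: fails the closure restriction
```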

Automatic Reasoning

• An OWL-DL ontology is mapped to its DL form as a collection of axioms

• An automatic reasoner checks for satisfiability – throws out the inconsistent and infers subsumption

• Defined classes (where there are necessary and sufficient restrictions) enable a reasoner to infer subclass axioms

• Also infers to which class an object belongs

• Based on the facts we know about it
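To give a feel for how subclass axioms can be inferred from definitions alone, the sketch below reduces each defined class to a (minimum, maximum) count per domain and tests whether one definition implies another. This is a deliberately simplified stand-in for a description logic reasoner, not how one is implemented: the interval encoding and the name `is_subsumed_by` are assumptions, closure restrictions are ignored, and unsatisfiable definitions are not detected.

```python
import math
from typing import Dict, Tuple

Interval = Tuple[float, float]    # (minimum count, maximum count)
Definition = Dict[str, Interval]  # domain name -> allowed count interval

# A domain that a definition does not mention may occur any number of times.
UNCONSTRAINED: Interval = (0, math.inf)


def is_subsumed_by(sub: Definition, sup: Definition) -> bool:
    """True if every protein satisfying `sub` must also satisfy `sup`."""
    for domain, (sup_min, sup_max) in sup.items():
        sub_min, sub_max = sub.get(domain, UNCONSTRAINED)
        # The narrower class must allow no count that the broader class forbids.
        if sub_min < sup_min or sub_max > sup_max:
            return False
    return True


RECEPTOR_TYROSINE: Definition = {
    "ProteinTyrosinePhosphataseDomain": (1, math.inf),  # at least 1
    "TransmembraneDomain": (1, 1),                      # exactly 1
}

R2A: Definition = {
    "ProteinTyrosinePhosphataseDomain": (2, 2),
    "TransmembraneDomain": (1, 1),
    "FibronectinDomain": (4, 4),
    "ImmunoglobulinDomain": (1, 1),
    "MAMDomain": (1, 1),
    "Cadherin-LikeDomain": (1, 1),
}

print(is_subsumed_by(R2A, RECEPTOR_TYROSINE))  # True: R2A is inferred to be a subclass
print(is_subsumed_by(RECEPTOR_TYROSINE, R2A))  # False
```

Nobody asserted that R2A is a kind of receptor tyrosine phosphatase; the relationship falls out of the two definitions.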

Incremental Addition of Protein Functional Domains

[Figure labels: Phosphatase catalytic, Cadherin-like, Immunoglobulin, MAM domain, Cellular retinaldehyde, Adhesion recognition, Transmembrane, Fibronectin III, Glycosylation]

Building the Ontology

• Classifications already made by biologists – based on protein functionality;

• Protein domain composition and other details in the literature;

• Some 50 classes of phosphatase, 30 protein domains and one relationship;

• “Value partition” of protein domains (covering and disjoint);

• Defines range of contains property;

• Literature contains knowledge of how to recognise members of each class of phosphatase.

Classification of the Classical Tyrosine Phosphatases

What is the Ontology Telling Us?

• Each class of phosphatase defined in terms of domain composition

• We know the characteristics by which an individual protein can be recognised to be a member of a particular class of phosphatase

• We have this knowledge in a computational form

• If we had protein instances described in terms of the ontology, we could classify those individual proteins

• A catalogue of phosphatases

Description of an Instance of a Protein

Instance: P21592

TypeOf: Protein that

Fact: hasDomain 2 ProteinTyrosinePhosphataseDomain and

Fact: hasDomain 1 TransmembraneDomain and

Fact: hasDomain 4 FibronectinDomain and

Fact: hasDomain 1 ImmunoglobulinDomain and

Fact: hasDomain 1 MAMDomain and

Fact: hasDomain 1 Cadherin-LikeDomain


Tyrosine Phosphatase

(containsDomain some TransmembraneDomain) and

(containsDomain at least 1 ProteinTyrosinePhosphataseDomain)

R2A Phosphatase

(containsDomain some MAMDomain) and

(containsDomain some ProteinTyrosineCatalyticDomain or ImmunoglobulinDomain) and

(containsDomain some FibronectinDomain or FibronectinTypeIIIFoldDomain) and

(containsDomain exactly 2 ProteinTyrosinePhosphataseDomain)
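Checking the instance P21592 against the two class descriptions above can be sketched as follows. This is plain Python standing in for the reasoner, under stated assumptions: each "some A or B" is read as "some (A or B)", the instance is reduced to domain counts, the check is closed-world, and the names `CLASS_DEFINITIONS` and `classify` are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List

# Domain counts for the instance P21592, as given in its description.
P21592 = Counter({
    "ProteinTyrosinePhosphataseDomain": 2,
    "TransmembraneDomain": 1,
    "FibronectinDomain": 4,
    "ImmunoglobulinDomain": 1,
    "MAMDomain": 1,
    "Cadherin-LikeDomain": 1,
})

# Each class is a predicate over domain counts; a Counter returns 0 for absent domains.
CLASS_DEFINITIONS: Dict[str, Callable[[Counter], bool]] = {
    "TyrosinePhosphatase": lambda c: (
        c["TransmembraneDomain"] >= 1
        and c["ProteinTyrosinePhosphataseDomain"] >= 1
    ),
    "R2APhosphatase": lambda c: (
        c["MAMDomain"] >= 1
        and c["ProteinTyrosineCatalyticDomain"] + c["ImmunoglobulinDomain"] >= 1
        and c["FibronectinDomain"] + c["FibronectinTypeIIIFoldDomain"] >= 1
        and c["ProteinTyrosinePhosphataseDomain"] == 2
    ),
}


def classify(counts: Counter) -> List[str]:
    """Return every class whose definition the instance satisfies."""
    return sorted(name for name, test in CLASS_DEFINITIONS.items() if test(counts))


print(classify(P21592))  # ['R2APhosphatase', 'TyrosinePhosphatase']
```

The instance is recognised as a member of both the general class and the more specific one, which is the behaviour expected when the specific class is a subclass of the general one.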

Classifying Proteins

>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).

MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHVSAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNPGTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYIAIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV………..

[Figure: classification pipeline with the labels InterPro, Instance Store, Reasoner, Translate and Codify]

So Far…

• Human phosphatases have been classified using the system

• The ontology classification performed as well as expert classification

• The ontology system refined the classification

- DUSC contains a zinc finger domain: characterised and conserved – but not in the classification

- DUSA contains a disintegrin domain: previously uncharacterised – evolutionarily conserved

• A new kind of phosphatase?

Aspergillus fumigatus

• Phosphatase complement very different from human

>100 human, <50 A. fumigatus

• Whole subfamilies ‘missing’

Different fungi-specific phosphorylation pathways?

No requirement for tissue-specific variations?

• Novel serine/threonine phosphatase with homeobox: conserved in Aspergillus and closely related species, but not in any other

Again, a new phosphatase?
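Once two genomes have been classified against the same ontology, spotting 'missing' subfamilies amounts to a set difference over the classes that have members. The sketch below assumes each catalogue is a mapping from class name to the proteins classified under it. The class names and protein identifiers are placeholders invented for the example, not results from the talk.

```python
from typing import Dict, List, Set


def classes_with_members(catalogue: Dict[str, List[str]]) -> Set[str]:
    """Classes that have at least one classified protein."""
    return {name for name, members in catalogue.items() if members}


def missing_classes(reference: Dict[str, List[str]],
                    other: Dict[str, List[str]]) -> Set[str]:
    """Classes populated in `reference` but empty or absent in `other`."""
    return classes_with_members(reference) - classes_with_members(other)


# Placeholder catalogues: the names carry no biological meaning.
genome_a = {"ClassX": ["a1", "a2"], "ClassY": ["a3"], "ClassZ": ["a4"]}
genome_b = {"ClassX": ["b1"], "ClassY": []}

print(sorted(missing_classes(genome_a, genome_b)))  # ['ClassY', 'ClassZ']
```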

Scaling

• Over 700 protein families

• Some 14,000 described sequence features

• Hundreds of thousands of types of protein

• Mass classification, then what?

Generic Technique

• Feature detection

• Categories defined in terms of those features

• Produce catalogue of what you currently know

• Highlight cases that don’t match current knowledge
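The four steps above can be sketched as a single pass over proteins whose features have already been detected. This is an illustration under assumptions: feature detection is taken as given (each protein arrives as a list of domain names), categories are predicates over domain counts, and the protein identifiers, the single category and the name `build_catalogue` are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Category = Callable[[Counter], bool]


def build_catalogue(
    proteins: Dict[str, List[str]],
    categories: Dict[str, Category],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Sort proteins into categories and collect those that match none."""
    catalogue: Dict[str, List[str]] = defaultdict(list)
    unmatched: List[str] = []
    for protein_id, domains in proteins.items():
        counts = Counter(domains)
        hits = [name for name, test in categories.items() if test(counts)]
        for name in hits:
            catalogue[name].append(protein_id)
        if not hits:
            # Does not fit current knowledge: worth a closer look by an expert.
            unmatched.append(protein_id)
    return dict(catalogue), unmatched


categories = {
    "ReceptorTyrosinePhosphatase": lambda c: (
        c["ProteinTyrosinePhosphataseDomain"] >= 1 and c["TransmembraneDomain"] == 1
    ),
}

# Hypothetical inputs standing in for the output of a feature-detection step.
proteins = {
    "protein_1": ["ProteinTyrosinePhosphataseDomain", "TransmembraneDomain"],
    "protein_2": ["ProteinTyrosinePhosphataseDomain", "ZincFingerDomain"],
}

catalogue, unmatched = build_catalogue(proteins, categories)
print(catalogue)  # {'ReceptorTyrosinePhosphatase': ['protein_1']}
print(unmatched)  # ['protein_2']
```

The unmatched list is where the interesting cases surface: proteins with known features in a combination that no current category describes.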

Conclusions

• Using ontology allows automated classification to reach the standard of human expert annotation

• Reasoning capabilities allow interpretation of domain organisation

• Capturing human knowledge in computational form

• Systematic survey produces interesting biological questions

• Discovering the unexpected

• Allows fast, efficient comparative genomics studies

• A combination of CS and bioinformatics to do biology
