
Making protein function and subcellular localization predictions: challenges and opportunities

Fiona Brinkman

Department of Molecular Biology and Biochemistry (Associate, Faculty of Health Sciences and School of Computing Sciences)

Simon Fraser University Greater Vancouver, BC, Canada

April 2014

•  Improving seq similarity/orthology-based predictions: a keystone of many predictors

•  Improving pathway/network-based analysis to identify protein functions

•  Future challenges and opportunities (using protein localization as an example of what is to come)

What we MUST do to move AFP forward…

One-to-one orthologs are, in particular, more functionally similar to each other than other orthologs or paralogs are, when sharing >80% sequence identity

Functional similarity measured by GO annotation similarity (13 species). Altenhoff AM et al. PLoS Comput Biol. 2012


Reciprocal Best BLAST Hit (RBBH) FAIL

•  If the true ortholog is missing… (gene loss, or incomplete genome)
•  One of the orthologous genes diverges faster…

[Figure: species tree (Ingroup1, Ingroup2, Outgroup) and gene trees; panels labelled "Usual Divergence" and "Paralog / RBBH / Paralog"]
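The reciprocal best hit rule that this slide critiques can be sketched in a few lines. This is only an illustration, assuming the BLAST results have already been reduced to (query, subject, bit score) tuples; the function names, gene names and scores are hypothetical and not taken from any BLAST tool.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy data (hypothetical): the true ortholog of geneA1 is absent from
# genome B, so its best remaining hit is the paralog geneB2.
a_vs_b = [("geneA1", "geneB2", 180.0), ("geneA2", "geneB2", 150.0)]
b_vs_a = [("geneB2", "geneA1", 180.0), ("geneB2", "geneA2", 150.0)]
rbbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbbh_pairs)
```

In this toy case the rule still reports a pair, even though the two genes would be paralogs, which is the failure mode the slide describes.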

Ortholuge

Uses phyletic ratios to differentiate Supporting Species Divergence (SSD) orthologs vs proteins more divergent than expected (non-SSD)

Ratio1 = distance(ingroup1, ingroup2) / distance(ingroup1, outgroup)
Ratio2 = distance(ingroup1, ingroup2) / distance(ingroup2, outgroup)

[Figure: Ortholuge analysis comparing Burkholderia cepacia & B. cenocepacia (outgroup: B. pseudomallei), with SSD and non-SSD labelled]

Whiteside et al 2013 PMID 23203876
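The two phyletic ratios can be computed directly from three pairwise distances. The sketch below assumes the distances have already been estimated by some other means; the function name and the example values are hypothetical, and the cutoffs Ortholuge uses to separate SSD from non-SSD are not covered here.

```python
def phyletic_ratios(d_in1_in2, d_in1_out, d_in2_out):
    """Return (Ratio1, Ratio2) for one candidate ortholog pair.

    d_in1_in2: distance between the ingroup1 and ingroup2 genes
    d_in1_out: distance between the ingroup1 gene and the outgroup gene
    d_in2_out: distance between the ingroup2 gene and the outgroup gene
    """
    if d_in1_out <= 0 or d_in2_out <= 0:
        raise ValueError("ingroup-outgroup distances must be positive")
    ratio1 = d_in1_in2 / d_in1_out
    ratio2 = d_in1_in2 / d_in2_out
    return ratio1, ratio2


# Hypothetical distances for a single gene triplet.
ratio1, ratio2 = phyletic_ratios(0.05, 0.20, 0.25)
print(ratio1, ratio2)
```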

[Figure: Predicted Orthologs in 600 Pairs of Bacterial Species. Bar chart of proportion (0 to 1) for SSD orthologs vs non-SSD across KEGG Orthology, Pfam Domains, Tigrfam Annotations and Subcellular Localizations; * p-value < 0.05 for all four]

[Figure: bar chart of the proportion (0 to 0.7) of SSD orthologs vs non-SSD with one or more homologs (based on BLAST hits); * p-value < 0.05]

Non-SSD “Orthologs” more likely:

-  Functionally dissimilar
-  Have one or more homologs

A Database of Ortholuge Evaluations

OrtholugeDB (tinyurl.com/ortholugeDB)

•  Provides pre-computed ortholog predictions for >1400 bacteria and archaea (update coming next month!), with further Ortholuge assessments
•  Covers all genes in fully sequenced bacterial and archaeal genomes
•  Facilitates visualization and evaluation of ortholog predictions

Similar issue with initial metagenomics sequence functional evaluation

1.  Simulated reads from Pseudomonas aeruginosa PAO1
2.  Created databases at different levels of clade exclusion
    •  E.g. for species clade exclusion, removed all Pseudomonas aeruginosa genomes from the database
3.  Used RAPSearch2 and MEGAN5 to assign functional categories to the simulated reads
4.  Calculated the proportion of reads assigned to each functional category relative to how many reads were expected
    •  E.g.:

Category              Expected # assigned   Actual # assigned   Relative proportion
Membrane Transport    567                   583                 1.02822
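The relative proportion in step 4 is simply the actual count divided by the expected count. A minimal sketch, using the one row given on the slide; the function and variable names are hypothetical.

```python
def relative_proportion(actual, expected):
    """Ratio of reads actually assigned to a category vs reads expected."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return actual / expected


# (expected, actual) counts per functional category; values from the slide.
counts = {"Membrane Transport": (567, 583)}
ratios = {
    category: relative_proportion(actual, expected)
    for category, (expected, actual) in counts.items()
}
print(round(ratios["Membrane Transport"], 5))  # 1.02822
```

A ratio close to 1 means the category is predicted about as often as expected; a ratio notably above 1 indicates overprediction.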

Most functional categories are predicted well but some are overpredicted (ratio notably >1)

[Figure: ratio of assigned relative to expected (axis 0 to 2.5) per functional category, by level of clade exclusion: None, Species, Family, Class]

E.g. Endocrine system: 3 problematic orthology groups, all with high numbers of proteins (one has 3538 when the median is 54!)

The relative proportions of functional categories stay relatively consistent as clade exclusion level increases

[Figure: stacked bars of the proportion of reads assigned (0% to 100%) at each clade exclusion level (None, Species, Family, Class). Legend: Xenobiotics Biodegradation and Metabolism; Transcription; Signal Transduction; Replication and Repair; Infectious Diseases; Nucleotide Metabolism; Neurodegenerative Diseases; Metabolism of Other Amino Acids; Metabolism of Cofactors and Vitamins; Membrane Transport; …]

Improving pathway-based analysis

Issue: Biomolecular pathway classifications can bias analyses of pathways found to be upregulated or downregulated by transcriptome (or other omics-level) analysis

What you identify depends on how everything is classified…

Need better “signatures” of pathways…

Dealing with PART of the issue…

Distribution of the number of associated pathways for human genes in KEGG.

[Figure: pie chart with slices for genes in 1, 2, 3, 4, 5, 6 and 7-45 pathways]

Membership of a gene in multiple pathways is the norm, not the exception…

Foroushani et al, 2014 PMCID: PMC3883547

Not all genes are equal…

[Figure legend: maroon = pathway member; white = no membership]

Not all genes are equivalent signatures of a given pathway


Example: Treated vs Untreated Mouse Severe Inflammation: Gene Expression Dataset

Individual Gene ORA
-  Antigen processing and presentation
-  Graft-versus-host disease
-  Natural killer cell mediated cytotoxicity
-  Viral myocarditis
-  Allograft rejection
-  Cell adhesion molecules (CAMs)
-  Chemokine signaling pathway
-  Type I diabetes mellitus
-  Toll-like receptor signaling pathway
-  Cytokine-cytokine receptor interaction

Standard Over-Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA) treat all genes in a given pathway as equal indicators that that pathway is significant.
→ Emphasizes generalist genes/pathways


Pathway Signatures using SIGORA: Identifying genes/gene pairs uniquely associated with a single pathway

SIGORA identifies statistically significant enrichment of Pathway Signatures in a gene list of interest.
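The idea of a pathway signature can be illustrated with a simplified sketch. It covers only single-gene signatures (SIGORA also uses gene pairs) and substitutes a plain hypergeometric upper-tail test for SIGORA's own statistics, so it should be read as an assumption-laden toy rather than the published method; all names and the toy pathways are hypothetical.

```python
from collections import defaultdict
from math import comb


def single_gene_signatures(pathway_genes):
    """Return, per pathway, the genes that belong to that pathway only."""
    membership = defaultdict(set)
    for pathway, genes in pathway_genes.items():
        for gene in genes:
            membership[gene].add(pathway)
    return {
        pathway: {g for g in genes if len(membership[g]) == 1}
        for pathway, genes in pathway_genes.items()
    }


def hypergeom_upper_tail(k, n_signature, n_universe, n_query):
    """P(X >= k) when n_query genes are drawn from n_universe genes,
    n_signature of which are signatures of the pathway."""
    total = comb(n_universe, n_query)
    upper = min(n_signature, n_query)
    favourable = sum(
        comb(n_signature, i) * comb(n_universe - n_signature, n_query - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Toy pathways (hypothetical): g3 is a "generalist" shared by both.
pathways = {"P1": {"g1", "g2", "g3"}, "P2": {"g3", "g4"}}
signatures = single_gene_signatures(pathways)
query = {"g1", "g2", "g3"}
hits_p1 = len(query & signatures["P1"])
p_value = hypergeom_upper_tail(hits_p1, len(signatures["P1"]), 4, len(query))
print(signatures, hits_p1, p_value)
```

The shared gene g3 contributes to neither pathway's signature, which is how this kind of approach avoids rewarding generalist genes.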


Example: Treated vs Untreated Mouse Severe Inflammation: Gene Expression Dataset

SIGORA avoids many biologically less plausible results seen by other methods that over-emphasize generalist genes/pathways.

For example, 6/8 up-regulated genes in the “Type I diabetes mellitus” pathway are also in the "Antigen processing and presentation" pathway.

Individual Gene ORA                          SIGORA
Antigen processing and presentation          Antigen processing and presentation
Graft-versus-host disease                    Natural killer cell mediated cytotoxicity
Natural killer cell mediated cytotoxicity    Complement and coagulation cascades
Viral myocarditis                            Toll-like receptor signaling pathway
Allograft rejection                          Cytokine-cytokine receptor interaction
Cell adhesion molecules (CAMs)               Leukocyte transendothelial migration
Chemokine signaling pathway                  Cell adhesion molecules (CAMs)
Type I diabetes mellitus                     Cytosolic DNA-sensing pathway
Toll-like receptor signaling pathway         Chemokine signaling pathway
Cytokine-cytokine receptor interaction

Future challenges and opportunities

(using bacterial protein localization as an example of what is to come)

(Gardy & Brinkman 2006 Nature Reviews Microbiology 4:741)

Bacterial protein subcellular localization prediction

•  Aids genome annotation and prediction of protein function
•  Used to identify cell surface/secreted targets for drugs and diagnostics, as well as potential vaccine components
•  Many pathogen-associated virulence factors predicted as secreted

(Gardy & Brinkman 2006 Nature Reviews Microbiology 4:741)

PSORTb: bacterial protein subcellular localization (SCL) prediction software

-  Signal peptides: non-cytoplasmic
-  Amino acid composition/patterns: all localizations
   -  Support Vector Machines trained with amino acid compositions or frequent subsequences
-  Transmembrane helices: cytoplasmic membrane
   -  HMMTOP
-  PROSITE motifs with 100% precision: all localizations
-  Outer membrane motifs: outer membrane
   -  Identified by association-rule mining
-  Homology to proteins of experimentally known localization: all localizations
   -  “SCL-BLAST” against proteins of known localization
   -  E=10e-10 and length restriction for precision

Integration with a Bayesian Network

Yu et al (2010) Bioinformatics 26:1608
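The filtering idea behind the “SCL-BLAST” homology module can be sketched as below. This is not PSORTb's code: the slide gives the E-value cutoff only as "E=10e-10" and mentions a length restriction without numbers, so `min_length_ratio` is a hypothetical placeholder, as are the `Hit` fields and the function name.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject_id: str
    evalue: float
    query_length: int
    subject_length: int
    localization: str  # experimentally determined localization of the subject


def scl_blast_call(hits, max_evalue=10e-10, min_length_ratio=0.8):
    """Return the localization of the best acceptable hit, or None.

    max_evalue copies the slide's notation; min_length_ratio is an
    assumed value standing in for the unspecified length restriction.
    """
    accepted = []
    for hit in hits:
        shorter = min(hit.query_length, hit.subject_length)
        longer = max(hit.query_length, hit.subject_length)
        if hit.evalue <= max_evalue and shorter / longer >= min_length_ratio:
            accepted.append(hit)
    if not accepted:
        return None  # no confident homology-based call
    return min(accepted, key=lambda h: h.evalue).localization


# Hypothetical hits: the first is rejected on length, the second passes.
hits = [
    Hit("protX", 1e-50, 300, 900, "Extracellular"),
    Hit("protY", 1e-30, 300, 310, "Outer membrane"),
]
call = scl_blast_call(hits)
print(call)
```

Returning `None` rather than the nearest weak match keeps precision high at the cost of recall, which is the trade-off the slide's "for precision" note points to.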

PSORTb: version 3

Main localizations predicted

Bacteria and Archaea predictions

Sub-category localization predictions
•  Type III secretion apparatus
•  Pili/fimbria
•  Host-associated SCL
•  Flagellum
•  Spore
•  Gas vesicle

Software              Precision (%)   Recall (%)
Gram-negative
  PSORTb v3.0         96.8            88.0
  PSORTb v2.0         95.7            81.5
Gram-positive
  PSORTb v3.0         97.0            93.2
  PSORTb v2.0         96.7            89.3
Archaea
  PSORTb v3.0         95.0            93.3

PSORTb v3.0: high precision, improved sensitivity/recall and genome prediction coverage

[Figure: bar chart (0 to 100) comparing PSORTb v.2.0 and PSORTb v.3.0 for Gram-negative and Gram-positive bacteria; panels: five-fold cross validation and genome prediction coverage]

A computational predictor more accurate than related high-throughput lab methods

Challenge: Organismal diversity

•  Classic Gram-positive bacteria, monoderms: thick peptidoglycan, no outer membrane
•  Classic Gram-negative bacteria, diderms: thin peptidoglycan + outer membrane

…but can have Gram-negatives with no outer membrane (e.g. Mycoplasma) or a different outer membrane (Synergistetes, Sphingomonas), or Gram-positives (thick peptidoglycan) with a different outer membrane (Deinococcus: 6 layers in cell envelope!), or “acid fast” with an asymmetric lipid-containing thick cell wall (Mycobacteria)

Plus bacterial organelles and other substructures (e.g. magnetosome of Magnetospirillum)...

Solution:
-  For whole genome (deduced-proteome) analysis, detect key protein markers of a particular cell type (e.g. Omp85, essential for the classic Gram-negative membrane)
-  For single protein analysis, learn from the above analysis, plus literature curation, the most likely cell type for a given phylum

…then make predictions assuming that cell “type”

Reproduced under Fair Use

Challenge: Temporal, contextual diversity

Proteins can be associated with multiple subcellular localizations

e.g. cell division proteins, autotransporters, “protein A dependent on protein B”

Solution: Note all possible localizations, since temporal, contextual predictions are non-trivial; there is not enough knowledge for most proteins

Kjærgaard K et al. J. Bacteriol. 2000;182:4789-4796

Challenge: Metagenomics

High demand for PSORTb to be able to analyze metagenomic sequences… under development

Need taxonomy data to aid predictions

(then enable appropriate cell type analysis)

Through over a decade of curating for, making and evaluating predictors of protein localization, genomic islands, etc.

What makes a great predictor? (besides it being right) ☺

Bioinformatics Predictor’s Code of Conduct

-  Never force predictions: always have a prediction option/category of “unknown”

Inspired by the classic “Data Provider’s Code of Conduct” in Stein (2002) Nature 417, 119-120
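One way to avoid forced predictions is to return “unknown” whenever no category scores high enough. The sketch below is a generic illustration of that rule; the function name, the score scale and the `min_score` value are assumptions, not the behaviour of any particular predictor.

```python
def call_localization(scores, min_score=0.75):
    """Return the top-scoring label, or "unknown" if nothing is confident.

    scores: mapping of localization label -> score in [0, 1]
    min_score: hypothetical confidence cutoff
    """
    if not scores:
        return "unknown"
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score >= min_score else "unknown"


# Hypothetical score profiles for two proteins.
confident = call_localization({"cytoplasmic": 0.92, "periplasmic": 0.05})
ambiguous = call_localization({"cytoplasmic": 0.40, "periplasmic": 0.38})
print(confident, ambiguous)
```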

Example of forced predictions: PSORT I prediction method

Nakai & Kanehisa, Proteins: Structure, Function, Genetics (1991) Overall Accuracy = 69%

What’s wrong here? No secreted/extracellular localization!

-  Never force predictions: always have an “unknown” option/category
-  Ensure open source: enable viewing of prediction method details
-  Predictor should be easily trainable with different datasets (if applicable; so others can robustly evaluate accuracy)
-  Have the ability to run locally or over the web (with an API is preferred)
-  Provide access to old versions (at minimum when transitioning to a new version)
-  Encourage continuing curation from the literature/lab experiments! Incorporate some curation efforts into predictor funding applications


Bioinformatics Predictor’s Code of Conduct: evaluation

-  Evaluate precision and recall (and accuracy measure combos thereof) with x-fold cross validation and/or new datasets (like CAFA!)
-  ID errors, biases and provide guidance to users re issues to watch for
   -  bias in training and/or testing datasets (“homology reduction”, “clade exclusion” may help)
   -  errors in the “gold standard” lab-based measure
   -  contextual/temporal changes in proteins, impacting prediction (e.g. function changes when another protein/compound is present)
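Precision and recall for a predictor that is allowed to answer “unknown” can be computed as sketched below. This follows one common convention, assumed here rather than taken from the slides: precision counts only the predictions actually made, while recall counts all proteins with a known answer. The names and the toy labels are hypothetical.

```python
def precision_recall(predicted, actual, unknown="unknown"):
    """Precision over predictions made; recall over all known proteins."""
    made = [(p, a) for p, a in zip(predicted, actual) if p != unknown]
    correct = sum(1 for p, a in made if p == a)
    precision = correct / len(made) if made else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical test set: one abstention, one wrong call, two correct calls.
predicted = ["cytoplasmic", "unknown", "outer membrane", "extracellular"]
actual = ["cytoplasmic", "periplasmic", "outer membrane", "cytoplasmic"]
precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

Under this convention, abstaining lowers recall but not precision, so the two numbers together show how much coverage is traded for reliability.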

What we MUST do:

Guide users to not just blindly use a predictor and its default output.

Curators, experimentalists, and automated function predictor developers must coordinate efforts more
•  Experimentalists working on what they think best…
•  Curators curating what they prioritize…
•  Function predictors optimizing prediction using existing data…

Function predictors/bioinformaticists need to get in the driver's seat more for research


Brinkman Lab Kayaking Trip, Summer 2013

(Next up, Archery Tag!)

Amir Foroushani, Matthew Laird, David Lynn, Raymond Lo

Mike Peabody, Thea Van Rossum, Matthew Whiteside, Nancy Yu
