Making protein function and subcellular localization predictions – challenges and opportunities
Fiona Brinkman
Department of Molecular Biology and Biochemistry (Associate, Faculty of Health Sciences and School of Computing Sciences)
Simon Fraser University, Greater Vancouver, BC, Canada
April 2014
• Improving sequence similarity/orthology-based predictions – a keystone of many predictors
• Improving pathway/network-based analysis to identify protein functions
• Future challenges and opportunities (using protein localization as an example of what is to come)
What we MUST do to move AFP forward…
One-to-one orthologs are, in particular, more functionally similar to each other than other orthologs or paralogs are, at >80% sequence identity
Functional similarity measured by GO annotation similarity (13 species). Altenhoff AM et al. PLoS Comput Biol. 2012
Reciprocal Best BLAST Hit (RBBH) can fail:
- If the true ortholog is missing (gene loss, or incomplete genome), the RBBH can be a paralog
- If one of the orthologous genes diverges faster than the usual divergence, the RBBH can be a paralog
[Figure: species tree and gene trees for Ingroup1, Ingroup2 and Outgroup illustrating both cases]
Ortholuge: uses phyletic ratios to differentiate Supporting Species Divergence (SSD) orthologs from proteins more divergent than expected (non-SSD)
Ratio1 = distance(ingroup1, ingroup2) / distance(ingroup1, outgroup)
Ratio2 = distance(ingroup1, ingroup2) / distance(ingroup2, outgroup)
[Figure: SSD and non-SSD orthologs in an Ortholuge analysis comparing Burkholderia cepacia and B. cenocepacia (outgroup: B. pseudomallei)]
Whiteside et al 2013 PMID 23203876
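The two ratios above can be sketched in a few lines of Python. This is a minimal illustration, not the Ortholuge implementation: the function names, the example distances and the fixed cutoff of 1.0 are all assumptions made here. The real tool derives its thresholds from the observed ratio distributions rather than from a fixed number.

```python
def ortholuge_ratios(d_in1_in2, d_in1_out, d_in2_out):
    """Return (ratio1, ratio2) from three pairwise phylogenetic distances."""
    if d_in1_out <= 0 or d_in2_out <= 0:
        raise ValueError("ingroup-outgroup distances must be positive")
    ratio1 = d_in1_in2 / d_in1_out  # ingroup1-ingroup2 over ingroup1-outgroup
    ratio2 = d_in1_in2 / d_in2_out  # ingroup1-ingroup2 over ingroup2-outgroup
    return ratio1, ratio2


def classify_pair(ratio1, ratio2, cutoff=1.0):
    """Label a predicted ortholog pair as 'SSD' or 'non-SSD'.

    The cutoff is a hypothetical placeholder: a pair is treated as
    supporting species divergence only if both ratios fall below it.
    """
    return "SSD" if ratio1 < cutoff and ratio2 < cutoff else "non-SSD"


# Hypothetical distances: the ingroup genes are closer to each other
# than either is to the outgroup gene, as expected for true orthologs.
r1, r2 = ortholuge_ratios(d_in1_in2=0.2, d_in1_out=0.5, d_in2_out=0.4)
print(r1, r2, classify_pair(r1, r2))
```

A pair whose ingroup genes are further apart than expected from the species tree gives ratios above the cutoff and is flagged as non-SSD for closer inspection.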
[Figure: Predicted orthologs in 600 pairs of bacterial species. Proportion (0 to 1) for SSD orthologs vs. non-SSD, by KEGG Orthology, Pfam domains, TIGRFAM annotations and subcellular localizations. * p-value < 0.05]
[Figure: Proportion (0 to 0.7) of SSD orthologs vs. non-SSD with one or more homologs (based on BLAST hits). * p-value < 0.05]
Non-SSD “orthologs” are more likely to:
- Be functionally dissimilar
- Have one or more homologs
A Database of Ortholuge Evaluations: OrtholugeDB (tinyurl.com/ortholugeDB)
• Provides pre-computed ortholog predictions for >1400 bacteria and archaea (update coming next month!), with further Ortholuge assessments
• Covers all genes in fully sequenced bacterial and archaeal genomes
• Facilitates visualization and evaluation of ortholog predictions
Similar issue with initial metagenomics sequence functional evaluation
1. Simulated reads from Pseudomonas aeruginosa PAO1
2. Created databases at different levels of clade exclusion
   • E.g. for species clade exclusion, removed all Pseudomonas aeruginosa genomes from the database
3. Used RAPSearch2 and MEGAN5 to assign functional categories to the simulated reads
4. Calculated proportion of reads assigned to each functional category relative to how many reads were expected
   • E.g.:

Category | Expected # assigned | Actual # assigned | Relative proportion
Membrane Transport | 567 | 583 | 1.02822
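Step 4 is a simple ratio, sketched below with the one row given in the table. The function names are illustrative, and the 1.5 threshold used to flag overprediction is a hypothetical value chosen here, not one stated in the talk.

```python
def relative_proportion(expected, actual):
    """Ratio of reads actually assigned to a category vs. reads expected."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return actual / expected


def is_overpredicted(ratio, threshold=1.5):
    """Flag categories whose ratio is notably above 1 (threshold is an assumption)."""
    return ratio > threshold


# The single row from the table: (expected, actual) read counts.
counts = {"Membrane Transport": (567, 583)}

for category, (expected, actual) in counts.items():
    ratio = relative_proportion(expected, actual)
    print(f"{category}: {ratio:.5f} overpredicted={is_overpredicted(ratio)}")
```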
Most functional categories are predicted well but some are overpredicted (ratio notably >1)
[Figure: Ratio of assigned relative to expected (0 to 2.5) per functional category, by level of clade exclusion: None, Species, Family, Class]
E.g. Endocrine system: 3 problematic orthology groups, all with high numbers of proteins (one has 3538 when the median is 54!)
The relative proportions of functional categories stay relatively consistent as clade exclusion level increases
[Figure: Proportion of reads assigned (0% to 100%) by clade exclusion level (None, Species, Family, Class). Legend: Xenobiotics Biodegradation and Metabolism; Transcription; Signal Transduction; Replication and Repair; Infectious Diseases; Nucleotide Metabolism; Neurodegenerative Diseases; Metabolism of Other Amino Acids; Metabolism of Cofactors and Vitamins; Membrane Transport; …]
Improving pathway-based analysis
Issue: Biomolecular pathway classifications can bias analyses of pathways found to be upregulated or downregulated by transcriptome (or other omics-level) analysis.
What you identify depends on how everything is classified…
Need better “signatures” of pathways…
Dealing with PART of the issue…
Distribution of the number of associated pathways for human genes in KEGG.
[Figure: chart of human genes grouped by number of associated pathways: 1, 2, 3, 4, 5, 6, 7-45]
Membership of a gene in multiple pathways is the norm, not the exception…
Foroushani et al, 2014 PMCID: PMC3883547
Not all genes are equal…
[Figure legend: maroon = pathway member; white = no membership]
Not all genes are equivalent signatures of a given pathway
Foroushani et al, 2014 PMCID: PMC3883547
Example: Treated vs Untreated Mouse Severe Inflammation – Gene Expression Dataset
Standard Over-Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA) treat all genes in a given pathway as equal indicators that the pathway is significant.
→ Emphasizes generalist genes/pathways
Pathways identified by Individual Gene ORA:
• Antigen processing and presentation
• Graft-versus-host disease
• Natural killer cell mediated cytotoxicity
• Viral myocarditis
• Allograft rejection
• Cell adhesion molecules (CAMs)
• Chemokine signaling pathway
• Type I diabetes mellitus
• Toll-like receptor signaling pathway
• Cytokine-cytokine receptor interaction
Foroushani et al, 2014 PMCID: PMC3883547
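For readers unfamiliar with ORA, the usual test is a one-sided hypergeometric test on the overlap between a gene list and a pathway. The sketch below assumes that formulation and uses only the Python standard library. The function name and the example counts are hypothetical and are not taken from the mouse dataset.

```python
from math import comb


def ora_p_value(universe_size, pathway_size, list_size, overlap):
    """One-sided hypergeometric p-value: P(X >= overlap).

    universe_size: number of genes considered in total
    pathway_size:  number of those genes annotated to the pathway
    list_size:     number of genes in the list of interest
    overlap:       genes in the list that are also in the pathway
    """
    max_overlap = min(pathway_size, list_size)
    total = comb(universe_size, list_size)
    # Sum the probability of every overlap at least as large as observed.
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 8 of 100 listed genes fall in a 50-gene pathway,
# out of a universe of 20000 genes.
print(ora_p_value(universe_size=20000, pathway_size=50, list_size=100, overlap=8))
```

Every gene in the pathway counts equally in this test, which is why genes shared by many pathways can make several related pathways look significant at once.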
Pathway Signatures using SIGORA: Identifying genes/gene pairs uniquely associated with a single pathway
SIGORA identifies statistically significant enrichment of Pathway Signatures in a gene list of interest.
Foroushani et al, 2014 PMCID: PMC3883547
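The idea of a pathway signature can be illustrated with a toy example. This is a sketch of the concept only, not the SIGORA code: the pathway and gene names are made up, and leaving already-unique genes out of the pair search is a simplification chosen here.

```python
from itertools import combinations


def pathway_signatures(pathways):
    """Find genes and gene pairs that point to exactly one pathway.

    pathways: dict mapping pathway name -> set of gene identifiers.
    Returns (single_gene_signatures, gene_pair_signatures), each a dict
    mapping the gene or (gene, gene) tuple to its unique pathway.
    """
    membership = {}
    for name, genes in pathways.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(name)

    # A gene found in only one pathway is a signature on its own.
    singles = {g: next(iter(p)) for g, p in membership.items() if len(p) == 1}

    # Two multi-pathway genes are a signature if they co-occur in one pathway only.
    pairs = {}
    shared_genes = sorted(g for g in membership if g not in singles)
    for a, b in combinations(shared_genes, 2):
        common = membership[a] & membership[b]
        if len(common) == 1:
            pairs[(a, b)] = next(iter(common))
    return singles, pairs


# Hypothetical pathways P1-P3 and genes a-d.
toy = {"P1": {"a", "b", "c"}, "P2": {"b", "d"}, "P3": {"c", "d"}}
singles, pairs = pathway_signatures(toy)
print(singles)
print(pairs)
```

Enrichment would then be tested on these signatures rather than on every individual gene, so that a gene belonging to many pathways does not count as evidence for all of them.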
Example: Treated vs Untreated Mouse Severe Inflammation – Gene Expression Dataset
SIGORA avoids many biologically less plausible results seen by other methods that over-emphasize generalist genes/pathways.
For example, 6/8 up-regulated genes in “Type I diabetes mellitus” pathway are also in the "Antigen processing and presentation" pathway.
Individual Gene ORA | SIGORA
Antigen processing and presentation | Antigen processing and presentation
Graft-versus-host disease | Natural killer cell mediated cytotoxicity
Natural killer cell mediated cytotoxicity | Complement and coagulation cascades
Viral myocarditis | Toll-like receptor signaling pathway
Allograft rejection | Cytokine-cytokine receptor interaction
Cell adhesion molecules (CAMs) | Leukocyte transendothelial migration
Chemokine signaling pathway | Cell adhesion molecules (CAMs)
Type I diabetes mellitus | Cytosolic DNA-sensing pathway
Toll-like receptor signaling pathway | Chemokine signaling pathway
Cytokine-cytokine receptor interaction |
Future challenges and opportunities
(using bacterial protein localization as an example of what is to come)
(Gardy & Brinkman 2006 Nature Reviews Microbiology 4:741)
Bacterial protein subcellular localization prediction
• Aids genome annotation and prediction of protein function
• Used to identify cell surface/secreted targets for drugs and diagnostics, as well as potential vaccine components
• Many pathogen-associated virulence factors predicted as secreted
(Gardy & Brinkman 2006 Nature Reviews Microbiology 4:741)
PSORTb: bacterial protein subcellular localization (SCL) prediction software
Signal peptides: Non-cytoplasmic
Amino acid composition/patterns: All localizations
- Support Vector Machines trained with amino acid compositions or frequent subsequences
Transmembrane helices: Cytoplasmic membrane
- HMMTOP
PROSITE motifs with 100% precision: All localizations
Outer membrane motifs: Outer membrane
- Identified by association-rule mining
Homology to proteins of experimentally known localization: All localizations
- “SCL-BLAST” against proteins of known localization
- E=10e-10 and length restriction for precision
Integration with a Bayesian Network
Yu et al (2010) Bioinformatics 26:1608
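To give a feel for how evidence from several modules might be combined, here is a deliberately simplified sketch using a naive Bayes calculation, which is the simplest kind of Bayesian network. It is not PSORTb's network. The module names, priors and likelihoods are invented for illustration, and the modules are assumed to be independent given the localization.

```python
def combine_modules(prior, likelihoods, observations):
    """Posterior over localizations given independent module outputs.

    prior:        dict localization -> P(localization)
    likelihoods:  dict module -> dict localization -> P(module fires | localization)
    observations: dict module -> True/False (did the module fire?)
    """
    scores = {}
    for loc, p in prior.items():
        for module, fired in observations.items():
            p_fire = likelihoods[module][loc]
            p *= p_fire if fired else (1.0 - p_fire)
        scores[loc] = p
    total = sum(scores.values())
    if total == 0:
        raise ValueError("observations are impossible under the given likelihoods")
    return {loc: s / total for loc, s in scores.items()}


# All numbers below are hypothetical.
prior = {"cytoplasmic": 0.5, "outer membrane": 0.5}
likelihoods = {
    "signal_peptide": {"cytoplasmic": 0.05, "outer membrane": 0.90},
    "om_motif": {"cytoplasmic": 0.01, "outer membrane": 0.60},
}
posterior = combine_modules(
    prior, likelihoods, {"signal_peptide": True, "om_motif": True}
)
print(posterior)
```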
PSORTb: version 3
Main localizations predicted
Bacteria and Archaea predictions
Sub-category localization predictions:
• Type III secretion apparatus
• Pili/fimbria
• Host-associated SCL
• Flagellum
• Spore
• Gas vesicle

Group | Software | Precision (%) | Recall (%)
Gram-negative | PSORTb v3.0 | 96.8 | 88.0
Gram-negative | PSORTb v2.0 | 95.7 | 81.5
Gram-positive | PSORTb v3.0 | 97.0 | 93.2
Gram-positive | PSORTb v2.0 | 96.7 | 89.3
Archaea | PSORTb v3.0 | 95.0 | 93.3
PSORTb v3.0: high precision, improved sensitivity/recall and genome prediction coverage
[Figure: bar chart (0 to 100) comparing PSORTb v.2.0 and PSORTb v.3.0 for Gram-negative and Gram-positive bacteria: five-fold cross validation and genome prediction coverage]
A computational predictor more accurate than related high-throughput lab methods
Challenge: Organismal diversity
Classic Gram-positive bacteria, monoderms: thick peptidoglycan, no outer membrane
Classic Gram-negative bacteria, diderms: thin peptidoglycan + outer membrane
…but there can be:
- Gram negatives with no outer membrane (e.g. Mycoplasma) or a different outer membrane (Synergistetes, Sphingomonas)
- Gram positives (thick peptidoglycan) with a different outer membrane (Deinococcus: 6 layers in cell envelope!)
- “Acid-fast” bacteria with an asymmetric lipid-containing thick cell wall (Mycobacteria)
- Bacterial organelles and other substructures (e.g. magnetosome of Magnetospirillum)...
Solution:
- For whole genome (deduced-proteome) analysis, detect key protein markers of a particular cell type (e.g. Omp85, essential for classic Gram-negative membrane)
- For single protein analysis, learn from the above analysis, plus literature curation, the most likely cell type for a given phylum
…then make predictions assuming that cell “type”
Reproduced under Fair Use
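The marker-based approach for whole-genome analysis could look roughly like the sketch below. Only the Omp85 entry comes from the slide. The table structure, function name and the "unknown" fallback are assumptions, and a real marker table would need careful curation.

```python
# Marker protein -> cell type it indicates. Only Omp85 is taken from the
# slide; any further entries would have to come from curated literature.
CELL_TYPE_MARKERS = {
    "Omp85": "classic Gram-negative (diderm)",
}


def infer_cell_type(proteome_annotations, markers=CELL_TYPE_MARKERS):
    """Guess the cell type from marker proteins found in a deduced proteome.

    proteome_annotations: set of protein/family names detected in the genome.
    Returns the cell type if the detected markers agree on exactly one,
    otherwise 'unknown' (no marker found, or conflicting markers).
    """
    found = {
        cell_type
        for marker, cell_type in markers.items()
        if marker in proteome_annotations
    }
    if len(found) == 1:
        return found.pop()
    return "unknown"


print(infer_cell_type({"Omp85", "FtsZ"}))
print(infer_cell_type({"FtsZ"}))
```

Localization predictions would then be run with the site set appropriate to the inferred cell type, and left unassigned when the type is unknown.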
Challenge: Temporal, contextual diversity
Proteins can be associated with multiple subcellular localizations
e.g. cell division proteins, autotransporters, “protein A dependent on protein B”
Solution: Note all possible localizations, since temporal, contextual predictions are non-trivial (not enough knowledge for most)
Kjærgaard K et al. J. Bacteriol. 2000;182:4789-4796
Challenge: Metagenomics
High demand for PSORTb to be able to analyze metagenomic sequences… under development
Need taxonomy data to aid predictions
(then enable appropriate cell type analysis)
Through over a decade of curating for, making and evaluating predictors of protein localization, genomic islands, etc.
What makes a great predictor? (besides it being right) ☺
Bioinformatics Predictor’s Code of Conduct
- Never force predictions: always have a prediction option/category of “unknown”
Inspired by the classic “Data Provider’s Code of Conduct” in Stein (2002) Nature 417, 119-120
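In code, not forcing a prediction usually comes down to a score cutoff below which the predictor declines to answer. The sketch below shows that pattern. The function name, score scale and the 0.75 cutoff are illustrative assumptions, not the settings of any particular tool.

```python
def call_localization(scores, min_score=0.75):
    """Return the best-scoring localization, or 'unknown' if nothing is convincing.

    scores:    dict localization -> confidence score in [0, 1]
    min_score: hypothetical cutoff below which no prediction is made
    """
    if not scores:
        return "unknown"
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else "unknown"


# Hypothetical scores for two proteins.
print(call_localization({"cytoplasmic": 0.92, "extracellular": 0.08}))
print(call_localization({"cytoplasmic": 0.40, "extracellular": 0.35}))
```

Raising the cutoff trades recall for precision, which is why both measures need to be reported together.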
Example of forced predictions: PSORT I prediction method
Nakai & Kanehisa, Proteins: Structure, Function, Genetics (1991) Overall Accuracy = 69%
What’s wrong here?
No secreted/extracellular localization!
Bioinformatics Predictor’s Code of Conduct
- Never force predictions: always have an “unknown” option/category
- Ensure open source: enable viewing of prediction method details
- Predictor should be easily trainable with different datasets (if applicable; so others can robustly evaluate accuracy)
- Have the ability to run locally or over the web (with an API is preferred)
- Provide access to old versions (at minimum when transitioning to a new version)
- Encourage continuing curation from the literature/lab experiments! Incorporate some curation efforts into predictor funding applications
Inspired by the classic “Data Provider’s Code of Conduct” in Stein (2002) Nature 417, 119-120
Bioinformatics Predictor’s Code of Conduct - evaluation
- Evaluate precision and recall (and accuracy measure combos thereof) with x-fold cross validation and/or new datasets (like CAFA!)
- Identify errors and biases, and provide guidance to users regarding issues to watch for:
  - bias in training and/or testing datasets (“homology reduction”, “clade exclusion” may help)
  - errors in “gold standard” lab-based measure
  - contextual/temporal changes in proteins, impacting prediction (e.g. function changes when another protein/compound is present)
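Precision and recall for a predictor that is allowed to answer “unknown” can be computed as sketched below. This follows one common convention, assumed here: precision is measured over the predictions actually made, and recall over all proteins in the test set. The function name and the toy labels are illustrative.

```python
def precision_recall(true_labels, predicted_labels, unknown="unknown"):
    """Return (precision, recall, f1) for predictions that may be 'unknown'.

    precision = correct predictions / predictions made (non-unknown)
    recall    = correct predictions / all proteins in the test set
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    made = [(t, p) for t, p in zip(true_labels, predicted_labels) if p != unknown]
    correct = sum(1 for t, p in made if t == p)
    precision = correct / len(made) if made else 0.0
    recall = correct / len(true_labels) if true_labels else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy test set: four proteins, one left unpredicted, one predicted wrongly.
truth = ["cytoplasmic", "cytoplasmic", "extracellular", "extracellular"]
preds = ["cytoplasmic", "unknown", "extracellular", "cytoplasmic"]
print(precision_recall(truth, preds))
```

In a cross-validation setting the same function would be applied to the held-out fold each time, ideally after homology reduction so that close relatives of test proteins are not in the training fold.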
Bioinformatics Predictor’s Code of Conduct
What we MUST do: Guide users to not just blindly use a predictor and its default output.
Curators, experimentalists, and automated function predictor developers must coordinate efforts more:
• Experimentalists working on what they think best…
• Curators curating what they prioritize…
• Function predictors optimizing prediction using existing data…
Function predictors/bioinformaticists need to get in the driver’s seat more for research
Brinkman Lab Kayaking Trip, Summer 2013
(Next up, Archery Tag!)
Amir Foroushani, Matthew Laird, David Lynn, Raymond Lo,
Mike Peabody, Thea Van Rossum, Matthew Whiteside, Nancy Yu