This may be the author’s version of a work that was submitted/accepted for publication in the following source:

Broadbent, James, Sampson, Dayle, Broszczak, Daniel, Upton, Zee, & Huygens, Flavia (2015) Choose wisely: Network, ontology and annotation resources for the analysis of Staphylococcus aureus omics data. International Journal of Medical Microbiology, 305(3), pp. 339-347.

This file was downloaded from: https://eprints.qut.edu.au/83576/

© Consult author(s) regarding copyright matters

This work is covered by copyright. Unless the document is being made available under a Creative Commons Licence, you must assume that re-use is limited to personal use and that permission from the copyright owner must be obtained for all other uses. If the document is available under a Creative Commons License (or other specified license) then refer to the Licence for details of permitted re-use. It is a condition of access that users recognise and abide by the legal requirements associated with these rights. If you believe that this work infringes copyright please provide details by email to [email protected]

License: Creative Commons: Attribution-Noncommercial-No Derivative Works 2.5

Notice: Please note that this document may not be the Version of Record (i.e. published version) of the work. Author manuscript versions (as Submitted for peer review or as Accepted for publication after peer review) can be identified by an absence of publisher branding and/or typeset appearance. If there is any doubt, please refer to the published source.

https://doi.org/10.1016/j.ijmm.2015.02.001




Choose wisely: Network, ontology and annotation resources for the analysis of Staphylococcus aureus omics data

Short Title: Information resources for Staphylococcus aureus omics analysis

Broadbent JA*1,3, Sampson DL1,3, Broszczak DA1,3, Upton Z1,3, Huygens F2,3

1 Injury Prevention and Trauma Management, 2 Chronic Disease and Aging, Institute of Health and Biomedical Innovation, Queensland University of Technology, Brisbane, Australia
3 School of Biomedical Sciences, Faculty of Health, Queensland University of Technology, Brisbane, Australia

*Corresponding author
Dr James A Broadbent
Institute of Health and Biomedical Innovation
Queensland University of Technology, Kelvin Grove Campus
60 Musk Avenue
Brisbane 4059, QLD, Australia
P: +61 7 3138 6201
F: +61 7 3138 6030
E: [email protected]

KEYWORDS
Staphylococcus aureus, MRSA, Omics, Molecular Network, Gene Ontology, Gene Ontology Annotation, Systems Biology, Non-model Organism, Bioinformatics


ABSTRACT
Staphylococcus aureus (S. aureus) is a prominent human and livestock pathogen investigated widely using omic technologies. Critically, owing to limited availability, low visibility or scattering of resources, robust network and statistical contextualisation of the resulting data is generally under-represented. Here, we present novel meta-analyses of freely accessible molecular network and gene ontology annotation information resources for S. aureus omics data interpretation. Furthermore, by applying the gene ontology annotation resources to publicly available data, we demonstrate their value and ability (or lack thereof) to summarise and statistically interpret the emergent properties of gene expression and protein abundance changes. This analysis provides simple metrics for network selection and demonstrates both the availability of gene ontology annotations and the impact that annotation selection can have on the contextualisation of bacterial omics data.


INTRODUCTION
The rapid evolution of Staphylococcus aureus (S. aureus) combined with over-use/misuse of antibiotics has seen this organism transition from a treatable inconvenience to a major threat to global health and economic security (Editorial, 2013). Infections caused by this organism range from mild to moderate skin and soft tissue infections to severe invasive infections of the bone, lungs and heart. These infections lead to mortality in 20 to 50% of cases (Miro et al., 2005) and underpin a US$86 billion burden on American patients (Filice et al., 2010; Klein et al., 2013) and $830 million to $9.7 billion on the American healthcare system per annum (Klein et al., 2007). Whilst once confined to healthcare settings, drug-resistant S. aureus is now widespread in the community (Nimmo et al., 2013; Nimmo et al., 2006). In addition to drug resistance, these community-associated strains display signs of enhanced virulence when compared to healthcare-associated strains (Gordon and Lowy, 2008; Otto, 2010). Given this phenomenon and the ever-increasing health and financial burden of this organism, S. aureus has been the subject of numerous investigations that seek to understand its biochemistry (Basell et al., 2014; Hessling et al., 2013; Michalik et al., 2012), interpret the mode of action of antimicrobial agents (Overton et al., 2011), or identify new antimicrobial targets (Cherkasov et al., 2011). As S. aureus elicits disease through a combination of pathogenic, virulence and antibiotic resistance factors (traits predetermined by a genetic program and executed by metabolic and functional biochemistry), this organism has been widely investigated using omics technologies. Proteomics and transcriptomics are cornerstone technologies in systems biology, allowing the investigation of function through complex biomolecular interactions and dynamics.

The vast amount of data produced through these approaches necessitates the ability to contextualise broad changes in gene expression or protein abundance. Such contextualisation was for some time limited to the assignment of molecules to specific metabolic pathways or molecular classes. However, more recent advances have integrated robust statistics (Huang da et al., 2009), ontology resources (Ashburner et al., 2000; Ogata et al., 1999) and molecular networks (Overton et al., 2011; Solis et al., 2014) to provide a more comprehensive interpretation of global molecular perturbations. These resource types have for some time represented a core toolkit for discipline-standard interpretation of omics data obtained from model organism investigations. While their uptake in microbiological investigations (and non-model organisms more generally) is not yet routine, robust statistical and network contextualisation of S. aureus omics data has recently been reported in the literature (Conlon et al., 2013; Marbach et al., 2012; Overton et al., 2011; Solis et al., 2014). These investigations highlight both popular and novel resources that may enhance omics investigations. However, the presence of analogous S. aureus information resources, originating from various unique sources, calls into question the quality and value that each provides in terms of its ability to robustly contextualise data. This limits the ability of researchers to select the resources that can meaningfully and efficiently exploit the results of their omics experiments.

Here we have amassed novel and previously reported network and ontology resources used for the contextualisation of S. aureus omics data. Through meta-analyses and performance evaluation using publicly available micro-array and proteome data sets, we have quantified the ability of these resources to interpret broad changes in gene expression and protein abundance. This analysis forms a guide to the optimal interpretation of S. aureus data based on the empirical measurement of performance. The use of these optimal resources can be expected to enhance data exploitation and thereby lead to new discoveries and improved hypothesis formation and experimental design. Such advancements will lead to a greater understanding of this important pathogen, thereby providing necessary information for improved disease prevention, diagnosis, treatment and management. Indeed, the application of robust information and statistical resources will enable more accurate and reliable data generation that will enhance our understanding of the pathogenic and antibiotic resistance mechanisms that S. aureus utilises in causing severe and life-threatening human disease.


MATERIALS AND METHODS
Network graph collection, unification and analysis
Network graphs were acquired from literature and database sources. Literature-sourced networks included the Cherkasov experimental protein-protein interaction network (Cherkasov et al., 2011), the Marbach inferred gene regulatory network (Marbach et al., 2012) and the Overton inferred protein-protein interaction network (Overton et al., 2011). Database-sourced networks were acquired from STRING v9.05 (String-db.org), the Kyoto Encyclopedia of Genes and Genomes (KEGG; http://www.genome.jp/kegg/), the Virulence Factor Database (VFDB; http://www.mgc.ac.cn/VFs/; release 3) and RegPrecise v3.1 (regprecise.lbl.gov/RegPrecise/). The Marbach and STRING networks were obtained in formats compatible with Cytoscape import and downstream analysis, while the remaining networks required some level of construction and/or ordered locus name (OLN) unification.

De novo construction of networks was performed for the KEGG, VFDB and RegPrecise datasets. The KEGG and VFDB networks were constructed by associating OLNs and their respective classification terms obtained from the KEGG BRITE functional hierarchy and virulence factor classification, respectively. RegPrecise gene regulatory network data were available only for the related strain N315. Consequently, N315 OLNs were mapped to MU50 OLNs using the homology search at the J. Craig Venter Institute - Comprehensive Microbial Resource (JCVI-CMR). The new OLNs were then used to construct a network connecting MU50 OLNs based on the RegPrecise gene regulatory networks for S. aureus.

Additional network unification was required for the Cherkasov and Overton networks, which were both originally constructed using the related strain MRSA-252 (Cherkasov et al., 2011; Overton et al., 2011). For the Cherkasov protein-protein interaction network, RefSeq identifiers were mapped to OLNs using the ID Mapping tool at the UniProtKB (http://www.uniprot.org/uniprot/). The corresponding MU50 OLNs were then obtained using the homology search at the JCVI-CMR. The Overton network MRSA-252 OLNs were mapped to MU50 OLNs at the JCVI-CMR.

Following unification, networks were assessed for their genome coverage (including the proportion of OLNs not found in other networks), virulence factor coverage and gene regulatory network coverage. All networks were imported into Cytoscape v2.8.2 to visualize network topology and key properties (Shannon et al., 2003). Ontology analysis was performed in Cytoscape using the BiNGO application as described in the following sections. Detailed network information is provided in supplementary information (Supplementary Table S1).
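As a rough illustration of the coverage metrics described above, the following Python sketch computes genome coverage and the OLNs unique to each network once all networks have been unified to a common set of OLNs. The function name, the toy locus names and the five-gene "genome" are assumptions made purely for illustration; they are not the study's data or scripts.

```python
def coverage_summary(networks, genome_olns):
    """Return genome coverage (%) and network-unique OLNs for each network.

    networks: dict mapping a network name to an iterable of OLNs (assumed
    to be already unified to one strain's locus names).
    """
    genome = set(genome_olns)
    summary = {}
    for name, olns in networks.items():
        present = set(olns) & genome
        # OLNs that appear in any of the other networks
        others = set().union(
            *(set(o) for n, o in networks.items() if n != name)
        )
        summary[name] = {
            "coverage_pct": 100.0 * len(present) / len(genome),
            "unique_olns": sorted(present - others),
        }
    return summary


# Hypothetical toy data (not taken from the study)
genome = ["SAV0001", "SAV0002", "SAV0003", "SAV0004", "SAV0005"]
networks = {
    "net_a": ["SAV0001", "SAV0002", "SAV0003"],
    "net_b": ["SAV0002", "SAV0004"],
}
result = coverage_summary(networks, genome)
for name, stats in result.items():
    print(name, stats)
```

The same set arithmetic could be applied to virulence factor or regulon gene lists by substituting the relevant reference set for the genome.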

Information resource collection, unification and meta-analysis
Information resources were acquired, extracted or created from various sources. The gene ontology (GO) was acquired from the Gene Ontology project (http://www.geneontology.org/; 04/04/2013), while the MU50 UniProt gene ontology annotations for cellular component, molecular function and biological process were acquired from the UniProt-GOA (www.ebi.ac.uk/GOA; 04/04/2013) by filtering for S. aureus MU50 taxonomy (Tax ID: 158878). The Affymetrix gene ontology annotations for cellular component, molecular function and biological process were extracted from the Affymetrix platform (GPL1339) data table obtained from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) using an in-house R script. Gene ontology annotations for cellular component, molecular function and biological process were extracted from the JCVI-CMR using the genome search tool. Annotations against The Institute for Genomic Research (TIGR) roles ontology were also extracted from this source. TIGR and KEGG ontologies were then constructed using the TIGR roles and KEGG BRITE functional hierarchy classifications, respectively.

Ontology annotations mapped against the GO were quality controlled using CateGOrizer (http://www.animalgenome.org/bioinfo/tools/catego/) to identify obsolete or alternate GO terms. These were then mapped to current or dominant terms, respectively. TIGR role terms with low value (e.g. unknown function) were not used in subsequent meta-analysis. Term and gene-term overlap between gene ontology annotations was assessed using Venn Diagram Generator (http://www.pangloss.com/seidel/Protocols/venn.cgi). Quantitative Venn diagrams were created using Google Chart. Summary information regarding gene ontologies and gene ontology annotations is supplied as supplementary information (Supplementary Tables S2 and S3).

Omics data and analysis
Publicly available micro-array data sets were used to evaluate the performance of gene ontology annotation resources. In this regard, six experiments were acquired from the NCBI GEO under experiment accession numbers GDS1666, GDS2105, GDS2812, GDS2814, GDS2983 and GDS3136. To facilitate ontology analysis, MU50 strain protein accession numbers were manually filtered from the Affymetrix platform (GPL1339) data table obtained from the GEO. The resulting 2629 open reading frames (ORFs) were then applied to each data file as input gene names for further analysis. Expression data for these 2629 ORFs were then log2 transformed and analysed using permutation statistics in the SAM (Statistical Analysis of Micro-arrays; Stanford University) v4.00 R add-on (Tusher et al., 2001). The parameters used included two class unpaired test, median centre normalisation and 1000 permutations. All other parameters were set to default. The resulting data were then filtered to obtain genes with a fold change of ≥ 2 at a false discovery rate (FDR) of < 5%. Up and down regulated genes were written to a new worksheet and accession numbers converted to OLNs using an in-house identifier retrieval system. Experiments with multiple time points were split into multiple unpaired binary experiments comparing baseline measures to any subsequent gene expression measures. This resulted in nine SAM analyses and 27 sets of differentially regulated genes (an up regulated, down regulated and all regulated gene set per SAM analysis).
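The filtering and gene-set splitting step can be sketched as follows. This is a minimal illustration only, not the SAM add-on or the in-house scripts: the record layout, the function name and the assumption that fold change is expressed as a ratio (so that down-regulation corresponds to a ratio of 0.5 or less) are all hypothetical choices made for this example.

```python
def split_regulated(records, min_fold=2.0, max_fdr=0.05):
    """Split (gene, ratio, fdr) records into up, down and all regulated sets.

    Assumes 'ratio' is treatment/baseline on a linear scale and 'fdr' is a
    fraction between 0 and 1. Both conventions are assumptions.
    """
    up, down = [], []
    for gene, ratio, fdr in records:
        if fdr >= max_fdr:
            continue  # fails the FDR < 5% criterion
        if ratio >= min_fold:
            up.append(gene)
        elif ratio <= 1.0 / min_fold:
            down.append(gene)
    return {"up": up, "down": down, "all": up + down}


# Hypothetical records (gene, fold-change ratio, FDR)
records = [
    ("orf_a", 4.0, 0.01),
    ("orf_b", 0.25, 0.02),
    ("orf_c", 3.0, 0.20),   # large change but fails the FDR threshold
    ("orf_d", 1.5, 0.001),  # significant but below the fold-change threshold
]
gene_sets = split_regulated(records)

# Nine binary analyses with three gene sets each give the 27 sets reported.
n_sets = 9 * len(gene_sets)
print(gene_sets, n_sets)
```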
Proteome data for four comparative experiments were retrieved from the supplementary information of three original research articles (Fuchs et al., 2010; Overton et al., 2011; Rivera et al., 2012) and arranged into lists of up-regulated, down-regulated and all-regulated proteins without additional quantitative data processing. The protein accession numbers were mapped to the MU50 strain using the AureoWiki PanGenome database (www.protecs.uni-greifswald.de/aureowiki).

Performance evaluation of ontology resources
Lists of differentially expressed genes were imported into Cytoscape v2.8.2 to create 27 orphan node networks. These networks were analysed using 13 gene ontology annotation resources in combination with their relevant gene ontologies (as described in Supplementary Table S3), resulting in 351 analyses. Over-representation analysis was performed using the Biological Network Gene Ontology (BiNGO) v2.44 Cytoscape application (Maere et al., 2005) in batch mode by means of hypergeometric statistics combined with Benjamini and Hochberg FDR correction. The whole annotation was used as the reference set and the corrected p-value for significant over-representation was purposefully set to > 0.05 in order to observe the broad performance of each gene ontology annotation. This performance was visualised by plotting the proportion of a given cluster (total gene collection annotated against an ontology term) against the -log10(p-value) as a measure of ontology term over-representation occurring at random. The frequency with which each annotation was able to identify emergent properties of the datasets was determined and graphed in GraphPad Prism. Differences between the performances of the annotations were determined using one-way ANOVA statistics with Tukey’s post-test, or non-parametric equivalents where data showed a non-Gaussian distribution, as seen in the transcriptome data.
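For readers unfamiliar with the statistics named above, the following standard-library Python sketch shows a hypergeometric over-representation test followed by Benjamini and Hochberg adjustment. It is not BiNGO's implementation; the function names and the example counts are assumptions introduced for illustration.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k): probability of seeing at least k term-annotated genes
    in a set of n genes drawn from N reference genes, K of which carry
    the term."""
    upper = min(n, K)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical example: 3 of 5 submitted genes carry a term that is
# annotated to 10 of the 100 genes in the reference annotation.
p_raw = hypergeom_tail(k=3, n=5, K=10, N=100)
p_adj = benjamini_hochberg([p_raw, 0.04, 0.03, 0.20])
print(p_raw, p_adj)
```

In this sketch the reference set size N corresponds to the whole annotation, mirroring the reference set choice described above; genes absent from the annotation would not count towards n or N.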
The frequency with which differentially expressed genes were not annotated in a given gene ontology annotation (and therefore did not contribute to the hypergeometric statistical analysis) was also extracted from the BiNGO output files as a measure of annotation performance. Proteome data were analysed with BiNGO in batch mode using statistical tests identical to those described above. The frequency of significant over-representation of ontology terms was determined for all protein lists, as was the concordance of the significant term identifications that resulted from analysis using different annotations against the GO.

RESULTS AND DISCUSSION


S. aureus network resources have variable representation of genomic, regulatory network and virulence factor coverage
Network graphs are a powerful means to represent complex relationships between individual components relevant to an entity. These diagrams are used in systems biology to represent interaction, classification and/or functional relationships between molecules in a biological system, and can be extended to the interactions across multiple systems (e.g. the interaction between bacterial and mammalian cells). These graphs can be information-rich, with the ability to display multiple layers of data in a relatively simple two-dimensional format, allowing for the delivery of both the quantitative and qualitative information common to omics and multi-omics experiments. In addition, network graphs provide the ability to interrogate omics data within the context of molecular interactions and thereby allow new biological inferences and discoveries to be made.

S. aureus network graphs are uncommon in the literature, although their use has increased over the last five years (Marbach et al., 2012; Overton et al., 2011; Solis et al., 2014). This low representation in the literature may be due to several factors, including: a lack of protein interaction data or ability to retrieve such data; the visibility of network resources; format incompatibility; exclusion from popular commercial databases; or rigid strain-specific annotations. Despite these factors, a suite of S. aureus network resources is available that has the potential to add substantial value to the display, and particularly analysis, of S. aureus omics data, although their completeness and key parameters have not been previously evaluated.

S. aureus network graphs were constructed or gathered from multiple open source repositories and analysed for genomic, regulatory network and virulence factor coverage (Figure 1). The analyses revealed wide variation in these key properties between different networks. In terms of genome coverage, the array of networks covers between 3 and 82% of known genes (Figure 1, large pie charts; Supplementary Table S1). Further analysis of this genomic coverage demonstrated that each network graph contains unique genes not mapped elsewhere. The highest frequency of these unique genes was found in the Overton protein-protein interaction network (Figure 1B). Through ontology analysis, the unique genes of the Overton network were found to be enriched within 13 cellular components, 52 biological processes and five molecular functions (p < 0.05), thereby demonstrating the superiority of this network in representing the S. aureus genome and annotated biochemistry.

In addition to genomic coverage, gene regulatory networks were evaluated across the available network graphs. The association of omics data with known gene regulatory networks provides an important means to assess the activation of genetic response elements (Michalik et al., 2012). Like genomic coverage, the completeness of these regulatory networks varies across the available network graphs (Figure 1, upper small pie charts; Supplementary Table S1). Importantly, given current regulatory network knowledge, the complete known S. aureus regulons can be represented in a single network graph (RegPrecise), while other available graphs provide between 6 and 91% coverage.

The third key parameter assessed across these network graphs was the coverage of virulence factors. Genes responsible for virulence factors provide organisms with the ability to rapidly colonise, migrate and replicate in the host whilst evading and/or suppressing the host’s immune response. As such, networks that possess a complete repertoire of these virulence factors will provide the greatest potential to assess the perturbation of these critical molecules.
In this respect, the information found at the VFDB was sufficient to construct a comprehensive network graph of virulence factors, while their coverage varies between 2 and 68% across the remaining network graphs (Figure 1, lower small pie charts; Supplementary Table S1).

Other key information can be found in the available network graphs. For instance, the KEGG Pathway Maps were readily constructed into a molecular classification network of metabolic pathways. This network places 931 OLNs into a modular graph where the interaction of any single gene across multiple processes is captured in full. As such, this network provides the ability to visualise and examine omics data in the context of well characterised metabolic pathways, which is not possible using any of the other networks assessed in this study.

Several of the classification network graphs described in this section have also appeared in S. aureus publications in the form of Voronoi tree maps (Becher et al., 2009; Hessling et al., 2013; Michalik et al., 2012). Voronoi tree maps are space filling diagrams where each tile represents a single gene and each cluster of tiles represents a molecular classification. These graphs are constructed using gene ontology hierarchical structure and have been optimised for the simultaneous visualisation of molecular abundance and classification of omics data (Bernhardt et al., 2009). Their ready compatibility with a variety of gene ontologies provides the opportunity to visualise multiple aspects of the same dataset with relative ease. Whereas these graphs are optimised for intuitive display of abundance and classification information, network graphs are optimised for the analysis of relational information in the context of molecular classification and abundance data. As such, network graphs offer a higher data-ink ratio, but can suffer in terms of their carriage of information due to information density when compared to Voronoi tree maps. The power of network graphs, however, lies in their ability to integrate expression data with novel interaction networks in order to make new systems-level discoveries (Overton et al., 2011). In this regard, these two approaches may be seen as complementary: Voronoi tree maps provide optimal visual display, whereas network graphs provide optimal relational information and systems biology discovery capability.

Overall, the suite of freely available network graphs presented herein contains valuable information, now quantified in terms of genome coverage (and unique genes), regulon coverage, virulence factor coverage, metabolic pathways and inferred interactions.
Used in combination, these information resources make it possible to visually contextualise and display almost the entire S. aureus genome, including the complete suite of virulence factors and known gene regulatory network participants. Furthermore, when combined with quantitative omics data, these graphs can also facilitate the discovery of new interactions, gene functions and drug targets (Cherkasov et al., 2011; Overton et al., 2011). Using this analysis as a guide, investigators can now select the graph or graphs that will enable the comprehensive discovery, display and exploitation of their valuable omics data.

Popular S. aureus gene ontology annotation resources have variable genome coverage and low gene-term association concordance
Information science describes an ontology as a controlled vocabulary of definitions or terms that comprehensively delineates the concepts relevant to an entity whilst capturing the hierarchical relationship between the terms (Ashburner et al., 2000). In the context of systems biology, ontologies commonly describe the molecular functions, cellular components and biological processes that are associated with gene products. They can also be developed to describe any class of molecule and/or biological information, such as signalling pathway involvement, metabolite class or disease association. Ontologies are essential in systems biology investigations as they make use of prior knowledge to give context to the broad gene expression or protein abundance changes quantified in omics experiments. Several ontologies have been developed that define concepts relevant to S. aureus, including the GO, KEGG, RegPrecise and the J. Craig Venter Institute Comprehensive Microbial Resource Roles (JCVI-CMR; also known as the TIGR annotation; recently deprecated at the JCVI-CMR; Supplementary Table S2).
These ontologies have varying degrees of resolution, and therefore contextualisation, as demonstrated through their term frequency and hierarchical structure (Supplementary Table S2), which can lead to challenges when visualising and communicating results from the use of ontologies with deep hierarchical structure. Critically, the annotation of genes against these ontologies has not been previously quantified and as such their potential to contextualise omics data has been unclear.

Gene ontology annotations provide the link between individual gene products and their ontological terms and act as a bridge between quantitative and qualitative information. It is a substantial undertaking to annotate an organism’s complement of gene products against an ontology of thousands of terms (three ontologies in the case of GO); as such, the GO project has previously placed much focus on annotating model organisms in the first instance. This has meant that many species, including S. aureus, were only partially annotated by the discipline-leading consortium. More recent efforts, however, have seen the annotation of 46 S. aureus strains with an average of 69.44 ± 0.03% genome coverage by the UniProt GO project (15/05/2014) (Barrell et al., 2009). In addition to UniProt, several other organisations have curated gene ontology annotations for this important pathogen (Supplementary Table S3), although their individual worth in terms of providing the most comprehensive and valuable data interpretation has not been previously examined.

KEGG (Becher et al., 2009; Chung et al., 2013; Ham et al., 2010), TIGR (Chaffin et al., 2012; Chang et al., 2006; Song et al., 2012) and three GO-associated annotations (Affymetrix (Marbach et al., 2012), UniProt (Enany et al., 2012; Ham et al., 2010; Lee et al., 2009; Solis et al., 2014) and JCVI-CMR (Peterson et al., 2001)) have been reported in the literature or are available for the analysis of S. aureus omics data. These annotations cover biological process, molecular function and cellular component descriptions and vary greatly in their genomic coverage (Figure 2; Supplementary Table S3). Biological process and related annotations, for instance, cover between 22 and 60% of the S. aureus genome, with each of five analogous annotations containing genes not annotated elsewhere (Figure 2A). The J. Craig Venter Institute biological process GO annotation contained the highest frequency of these unique genes. Ontology analysis of the 284 genes using the JCVI-CMR GO annotation showed limited enrichment under any specific GO term, suggesting that the inclusion of these additional genes is not due to a term-centric annotation effort.
When specifically considering the genome coverage of GO-associated annotations, there is a common trend for high overlap between Affymetrix and UniProt annotations, as compared to annotations from the JCVI-CMR (Figure 2B-D; Supplementary Figure S1). By further focussing on the genes common to all three annotations (UniProt, Affymetrix, JCVI-CMR), the gene-term assignments from each of the annotators were revealed to have very low concordance (Figure 2E-G; Supplementary Figure S1). These data suggest differences in the annotation strategies of the three annotators. Furthermore, they suggest that selection of a particular annotation will have a substantial influence on the interpretation of data, even though the annotations describe the same organism, share genome coverage and are built on the same ontology. This paradigm is critical given the use of a variety of annotations in the literature and is best tested through the evaluation of empirical data.

The UniProt biological process gene ontology annotation out-performs most related popular information resources when interpreting publicly available gene expression data
Gene ontology and annotation resources were evaluated to explore their value in interpreting experimental gene expression data in order to identify significant emergent properties. This was done using six micro-array gene expression data sets acquired from the NCBI GEO database. Expression analysis using permutation statistics resulted in the identification of nine sets of differentially expressed genes (Figure 3A) and 27 sets of genes when classing data as all regulated genes, up regulated genes and down regulated genes. Gene sets were then subjected to ontology over-representation analysis using the ontologies and annotations described in the materials and methods and highlighted in Table S3 of the supplementary information.
Performance of the resources was evaluated in terms of: the ability to identify significantly over-represented terms; the number of genes excluded from Hypergeometric analysis due to no gene-term annotation; and, the consistency of analogous resources to produce significant results when analysing the same data set.
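The evaluation criteria above can be illustrated with a minimal Python sketch of over-representation analysis. It is not the BiNGO implementation used in the study. The function names, the toy annotation and the use of all annotated genes as the reference set are assumptions made for illustration, and the Benjamini-Hochberg procedure is assumed as the multiple hypothesis correction.

```python
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which are annotated to the term of interest."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def over_representation(selected, annotation, alpha=0.05):
    """annotation maps a term to the set of genes annotated to it.
    Assumption: the reference set is every gene with at least one term.
    Returns (significant terms with adjusted p-values, unannotated genes)."""
    background = set().union(*annotation.values())
    tested = selected & background
    missed = selected - background  # excluded: no gene-term annotation
    N, n = len(background), len(tested)
    terms = sorted(annotation)
    pvals = [
        hypergeom_tail(len(tested & annotation[t]), len(annotation[t]), n, N)
        for t in terms
    ]
    adjusted = benjamini_hochberg(pvals)
    significant = {t: q for t, q in zip(terms, adjusted) if q < alpha}
    return significant, missed


# Hypothetical toy data: two terms over a 20-gene reference set.
toy_annotation = {
    "term_A": {f"g{i}" for i in range(1, 6)},   # 5 genes
    "term_B": {f"g{i}" for i in range(6, 21)},  # 15 genes
}
toy_selected = {"g1", "g2", "g3", "g4", "g5", "g6", "unannotated_1"}
significant, missed = over_representation(toy_selected, toy_annotation)
print(significant, missed)
```

The second return value corresponds to the genes excluded from the Hypergeometric analysis, so the same call yields two of the three performance measures listed above for a single annotation.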

Graphs of the proportion of observed genes (as a percent of the cluster of total genes relevant to a term) versus the probability of random over-representation were produced to summarise the performance of multiple gene ontology annotations (Supplementary Figure S2). In this format Hypergeometric statistics combined with multiple hypothesis correction tended to show a distinct relationship between cluster representation and the probability of random identification, although instances of 100% representation and p-value > 0.05 were observed (Supplementary Figure S2).
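A short worked case shows how full representation can coincide with a non-significant result. The numbers are hypothetical and chosen only for illustration: a reference set of N genes, a selected set of n genes, and a term annotated to a single gene (K = 1) that is present in the selected set (k = 1).

```latex
P(X \ge 1) \;=\; 1 - \frac{\binom{N-1}{n}}{\binom{N}{n}}
           \;=\; 1 - \frac{N-n}{N}
           \;=\; \frac{n}{N},
\qquad
\text{e.g. } N = 2600,\; n = 150:\quad \frac{150}{2600} \approx 0.058 > 0.05 .
```

Under these assumed values the term shows 100% representation, yet the uncorrected p-value already exceeds 0.05, and multiple hypothesis correction can only increase it.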

These identifications resulted from clusters that contained between one and two genes. More broadly, this phenomenon demonstrates that while annotations may have a similar frequency of gene-term associations and genomic coverage, such as that provided by Affymetrix and UniProt, it is their annotation quality that is principal in the identification of statistically-relevant ontology terms. This quality can be assessed by examining the ability of an annotation to consistently resolve co-ordinated biochemistry. The ability of the various ontology annotations to identify over-represented terms was evaluated using the 27 sets of differentially regulated genes. In terms of resolving biochemistry, the UniProt biological process annotation identified significantly more over-represented terms than the JCVI, TIGR and KEGG annotations across the nine gene sets (p-value < 0.05 - 0.001; Figure 3D). The TIGR annotation demonstrated the best performance as measured by consistently identifying > 0 significant terms; however, as the TIGR ontology contains the smallest number of terms and just two levels of information hierarchy, the resolution of biological information could be anticipated to be low. In this regard the UniProt annotation, given its ability to consistently identify significantly more ontology terms, should be considered as the first option in S. aureus omics data contextualisation. The agreement between the results of over-representation analysis when using either the UniProt, Affymetrix or JCVI-CMR biological process GO annotations was evaluated (Figure 3E). In this experiment, 27 sets of genes were subjected to ontology analysis and FDR correction that provided 22 gene sets with > 0 over-represented ontology terms and 66 results files. When comparing the three sets of results for each of the 22 experiments it became clear that the use of each annotation, while made against the same ontology and for the same organism, led to very distinctive term identifications. 
Indeed, of the 22 gene sets, the analysis of only four gene sets led to universal agreement in the identification of over-represented terms. Critically, even in these four experiments the frequency of agreement was low (16 - 32%). In the same analysis, 16 experiments led to the identification of between 3 and 100% of terms common to two ontology analyses and 20 experiments resulted in significant terms that were identified in only one of the three ontology analyses. These results confirmed the hypothesis that GO annotation selection is highly influential in the interpretation of omics data and that annotation selection must be informed by empirical performance in order to generate statistically meaningful information. In addition to identifying over-represented ontology terms, this analysis quantified the performance of ontology annotations with regard to the number of missed or unannotated genes that were found within nine sets of all differentially expressed genes (Supplementary Figure S5). This analysis identified that the greatest frequency of missed differentially expressed genes resulted from analyses using the JCVI-CMR or KEGG annotations, followed by the Affymetrix and UniProt annotations. Analysis using the TIGR annotation led to the lowest frequency of missed genes; however, this may not translate to valuable information, as noted above. Interestingly, the KEGG annotation, while having a high proportion of missed genes, had the second best performance in terms of identifying co-ordinated biochemical events when compared to the other four biological process-related annotations. This result speaks to the quality of the KEGG annotation due to its basis in well characterised metabolic pathways. As may be expected, the frequency of missed differentially regulated genes was shown to correlate with annotation genome coverage (r2 = 0.97; p-value = 0.0016; Supplementary Figure S5). 
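The agreement comparison described above can be sketched as a simple set calculation. This is an illustrative sketch only: the function name and the toy term identifiers are hypothetical, and expressing agreement as a percentage of the union of significant terms is an assumption, since the exact denominator used in the study is not restated here.

```python
def concordance(term_sets):
    """term_sets maps an annotation name to its set of significant terms
    for one gene set. Returns the percentage of all significant terms
    (the union) identified by exactly 1, 2, ... annotations."""
    union = set().union(*term_sets.values())
    if not union:
        return {}
    counts = {k: 0 for k in range(1, len(term_sets) + 1)}
    for term in union:
        hits = sum(term in terms for terms in term_sets.values())
        counts[hits] += 1
    return {k: 100.0 * v / len(union) for k, v in counts.items()}


# Hypothetical significant terms for one gene set under three annotations.
toy_results = {
    "UniProt": {"t1", "t2", "t3", "t4"},
    "Affymetrix": {"t1", "t2", "t5"},
    "JCVI-CMR": {"t1", "t6"},
}
shares = concordance(toy_results)
print(shares)
```

In this toy case only one of six terms is shared by all three annotations, which is the kind of low universal agreement reported for the four fully concordant experiments.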
In addition to the resources investigated herein, researchers may also consider the use of alternative annotation and statistical tools. The DAVID bioinformatics server, for instance, can offer the opportunity to annotate and statistically examine gene sets using a simple point-and-click interface (Conlon et al., 2013; Ham et al., 2010). This resource offers support for a range of S. aureus strains and the ability to perform simultaneous enrichment analysis against 40 functional annotation categories. Importantly, such a large number of annotations can result in an overwhelming amount of data that requires the extraction and summarising of important features, followed by re-integration with network or Voronoi tree map graphs in order to convey the meaning of molecular
changes. Implementation of the resources as demonstrated herein permits simple ontology analysis and display within the context of both abundance data and relational information using a single interface. Such an approach is readily and freely available, and when coupled with high quality gene ontology annotation resources, can add substantial value to the analysis of omics data sets.

Cellular component and molecular function gene ontology annotations have limited ability to resolve coordinated biochemistry in S. aureus gene expression data

Nine differentially regulated gene sets were analysed for cellular component and molecular function ontology over-representation. In terms of cellular component analyses, the Affymetrix and UniProt annotations demonstrated an equivalent ability to identify significant terms (Figure 3C). Both had superior performance with respect to the JCVI-CMR annotation, in that they identified > 0 significant terms (Figure 3C). In this regard, the JCVI-CMR cellular component annotation was shown to have substantially lower genome coverage, which is likely to be responsible for this outcome. All annotations exhibited equivalent performance in the identification of over-represented molecular function ontology terms within the micro-array data sets, often resulting in the identification of no over-represented terms (Figure 3B). Importantly, the molecular functions and cellular components of individual genes will not necessarily be co-ordinated in response to an external S. aureus perturbation, as compared to biological processes, and this may contribute to the perceived poor performance of these annotations in this analysis. The poor performance of cellular component GO annotations may indicate a necessity to engage other tools for the classification of molecules. Indeed, there have been reports of using LocateP, PSORT, InterPro, DOLOP, Auger and SecretomeP for S. 
aureus data analysis (Enany et al., 2012; Hempel et al., 2011; Lee et al., 2009; Solis et al., 2010; Ventura et al., 2010; Ziebandt et al., 2010). These tools use physicochemical, sequence and taxonomic statistical models to determine cellular localisation, whereas GO annotations use a combination of experimental and in silico methods for molecular classification. Critically, the GO annotations used in this investigation show low genomic coverage, even though current knowledge of molecular localisation may be higher across the S. aureus genome. For example, the reported number of membrane proteins in the literature for SACOL is 580 (Otto et al., 2014), whereas the UniProt GOA reports 417 for the same strain. This disparity is not uncommon in microbial organisms and speaks to the themes in a 2009 Nature Reviews Microbiology editorial regarding the sociology of microbial genome annotation (Welch and Welch, 2009). Given time and effort the cellular component annotation will reach adequate genomic coverage and resolution in order to provide a valuable localisation capability.

Over-representation analysis using different gene ontology annotations results in similar significant term frequency but low concordance in the analysis of proteome data

Twelve protein sets from S. aureus proteome experiments were analysed for enrichment of molecular function, cellular component and biological process definitions using gene ontology over-representation analysis. With regard to the absolute frequency of significant terms, there was no difference between any of the annotators for biological process and molecular function analysis (Supplementary Figure S3A-C). In terms of the cellular component biochemistry, both the Affymetrix and UniProt annotations performed significantly better than JCVI. 
In this respect the performance results of the proteome molecular function analysis reflected the results of the transcriptome analysis as did the cellular component result, while the biological process results showed no difference in performance between the annotations. This difference may be attributed to the means of proteome data acquisition, which can result in differences in data completeness. In this regard, transcriptome profiling using micro-array technology will result in a quantitative measurement of every gene contained within the array platform while, generally speaking, mass spectrometry data acquisition uses algorithms that result in the most abundant proteins being identified and quantified
at the expense of low-abundance species. The property of being highly abundant intrinsically lends itself to functional characterisation due to being more readily observable or through involvement in known integral biochemistry that requires high abundance. As such, highly abundant proteins are more likely to be annotated regardless of annotator, therefore resulting in a similar frequency of over-represented biological process terms. Critically, however, the analysis of proteome data again demonstrates that the choice of GO annotation will lead to the detection of different emergent properties within the same data set, as demonstrated through the low concordance of significant terms using Affymetrix, UniProt and JCVI annotations (Supplementary Figure S3D-F), although this agreement appears to be better in proteome data analysis than transcriptome analysis. Ultimately, as proteome observations become more complete using data-independent acquisition approaches, the annotation performance seen for micro-array data will become more relevant to proteome data, thereby indicating that the UniProt biological process annotation, with its combination of high genome coverage, high resolution ontology and superior identification of emergent properties in transcriptome data, offers the most robust option for identifying emergent properties in S. aureus omics data.

Concluding remarks

Many high quality analyses of the S. aureus transcriptome and proteome have been reported that involve data categorization and visualisation; however, robust relational and statistical contextualisation of S. aureus omics data has remained generally under-represented. Critically, different ontology annotations will perform differently in different biological settings, certainly when seeking to only classify genes/proteins and in the interpretability of the ontological definitions. 
However, when identifying the emergent properties of a biological system through over-representation analysis, particular annotations will routinely perform significantly better due to their basis in highly resolved biological properties. Without undertaking this type of analysis, changes in a system cannot be differentiated from random occurrences with certainty. By measuring and reporting the characteristics and the performance of the available networks, ontologies and annotations we provide the best empirical evidence for using one resource over another, which can now be used to select resources that are most likely to benefit a given analysis. Critically, these resources are readily implemented in an open source, point-and-click interface, and are fully customisable and adaptable to a variety of annotated organisms. In this regard, it is hoped that this paper marks a definitive and universal transition in the approach to omics data analysis for S. aureus (as well as other non-model organisms), where data are displayed, interrogated and interpreted in order to maximise novel and valuable discoveries that will support the continued fight against this important pathogen.

ACKNOWLEDGMENTS

This work was funded by the Institute of Health and Biomedical Innovation 2013 Early Career Researcher Grant Scheme and the Wound Management Innovation Cooperative Research Centre. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors acknowledge and thank Daniel Marbach, James Costello and Gustavo Stolovitzky for their advice and discussions regarding the production of custom ontologies and ontology annotations and their use in the BiNGO application. The authors further thank Liz Leddy for proofing, editing and critical discussions of the manuscript.

REFERENCES

Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G., 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-29. Barrell, D., Dimmer, E., Huntley, R.P., Binns, D., O'Donovan, C., Apweiler, R., 2009. The GOA database in 2009--an integrated Gene Ontology Annotation resource. Nucleic Acids Res 37, D396-403. Basell, K., Otto, A., Junker, S., Zuhlke, D., Rappen, G.M., Schmidt, S., Hentschker, C., Macek, B., Ohlsen, K., Hecker, M., Becher, D., 2014. The phosphoproteome and its physiological dynamics in Staphylococcus aureus. Int J Med Microbiol 304, 121-132. Becher, D., Hempel, K., Sievers, S., Zuhlke, D., Pane-Farre, J., Otto, A., Fuchs, S., Albrecht, D., Bernhardt, J., Engelmann, S., Volker, U., van Dijl, J.M., Hecker, M., 2009. A proteomic view of an important human pathogen--towards the quantification of the entire Staphylococcus aureus proteome. PLoS One 4, e8176. Bernhardt, J., Funke, S., Hecker, M., Siebourg, J., 2009. Visualizing Gene Expression Data via Voronoi Treemaps. 
2009 Sixth International Symposium on Voronoi Diagrams, 233-241. Chaffin, D.O., Taylor, D., Skerrett, S.J., Rubens, C.E., 2012. Changes in the Staphylococcus aureus transcriptome during early adaptation to the lung. PLoS One 7, e41329. Chang, W., Small, D.A., Toghrol, F., Bentley, W.E., 2006. Global transcriptome analysis of Staphylococcus aureus response to hydrogen peroxide. J Bacteriol 188, 1648-1659. Cherkasov, A., Hsing, M., Zoraghi, R., Foster, L.J., See, R.H., Stoynov, N., Jiang, J., Kaur, S., Lian, T., Jackson, L., Gong, H., Swayze, R., Amandoron, E., Hormozdiari, F., Dao, P., Sahinalp, C., Santos-Filho, O., Axerio-Cilies, P., Byler, K., McMaster, W.R., Brunham, R.C., Finlay, B.B., Reiner, N.E., 2011. Mapping the protein interaction network in methicillin-resistant Staphylococcus aureus. J Proteome Res 10, 1139-1150. Chung, P.Y., Chung, L.Y., Navaratnam, P., 2013. Identification, by gene expression profiling analysis, of novel gene targets in Staphylococcus aureus treated with betulinaldehyde. Res Microbiol 164, 319-326. Conlon, B.P., Nakayasu, E.S., Fleck, L.E., LaFleur, M.D., Isabella, V.M., Coleman, K., Leonard, S.N., Smith, R.D., Adkins, J.N., Lewis, K., 2013. Activated ClpP kills persisters and eradicates a chronic biofilm infection. Nature 503, 365-370. Editorial, 2013. The antibiotic alarm. Nature 495, 141. Enany, S., Yoshida, Y., Magdeldin, S., Zhang, Y., Bo, X., Yamamoto, T., 2012. Extensive proteomic profiling of the secretome of European community acquired methicillin resistant Staphylococcus aureus clone. Peptides 37, 128-137. Filice, G.A., Nyman, J.A., Lexau, C., Lees, C.H., Bockstedt, L.A., Como-Sabetti, K., Lesher, L.J., Lynfield, R., 2010. Excess costs and utilization associated with methicillin resistance for patients with Staphylococcus aureus infection. Infect Control Hosp Epidemiol 31, 365-373. Fuchs, S., Mehlan, H., Kusch, H., Teumer, A., Zuhlke, D., Berth, M., Wolf, C., Dandekar, T., Hecker, M., Engelmann, S., Bernhardt, J., 2010. 
Protecs, a comprehensive and powerful storage and analysis
system for OMICS data, applied for profiling the anaerobiosis response of Staphylococcus aureus COL. Proteomics 10, 2982-3000. Gordon, R.J., Lowy, F.D., 2008. Pathogenesis of methicillin-resistant Staphylococcus aureus infection. Clin Infect Dis 46 Suppl 5, S350-359. Ham, J.S., Lee, S.G., Jeong, S.G., Oh, M.H., Kim, D.H., Lee, T., Lee, B.Y., Yoon, S.H., Kim, H., 2010. Powerful usage of phylogenetically diverse Staphylococcus aureus control strains for detecting multidrug resistance genes in transcriptomics studies. Mol Cells 30, 71-76. Hempel, K., Herbst, F.A., Moche, M., Hecker, M., Becher, D., 2011. Quantitative proteomic view on secreted, cell surface-associated, and cytoplasmic proteins of the methicillin-resistant human pathogen Staphylococcus aureus under iron-limited conditions. J Proteome Res 10, 1657-1666. Hessling, B., Bonn, F., Otto, A., Herbst, F.A., Rappen, G.M., Bernhardt, J., Hecker, M., Becher, D., 2013. Global proteome analysis of vancomycin stress in Staphylococcus aureus. Int J Med Microbiol 303, 624-634. Huang da, W., Sherman, B.T., Zheng, X., Yang, J., Imamichi, T., Stephens, R., Lempicki, R.A., 2009. Extracting biological meaning from large gene lists with DAVID. Curr Protoc Bioinformatics Chapter 13, Unit 13 11. Klein, E., Smith, D.L., Laxminarayan, R., 2007. Hospitalizations and deaths caused by methicillin-resistant Staphylococcus aureus, United States, 1999-2005. Emerg Infect Dis 13, 1840-1846. Klein, E.Y., Sun, L., Smith, D.L., Laxminarayan, R., 2013. The changing epidemiology of methicillin-resistant Staphylococcus aureus in the United States: a national observational study. Am J Epidemiol 177, 666-674. Lee, E.Y., Choi, D.Y., Kim, D.K., Kim, J.W., Park, J.O., Kim, S., Kim, S.H., Desiderio, D.M., Kim, Y.K., Kim, K.P., Gho, Y.S., 2009. Gram-positive bacteria produce membrane vesicles: proteomics-based characterization of Staphylococcus aureus-derived membrane vesicles. Proteomics 9, 5425-5436. Maere, S., Heymans, K., Kuiper, M., 2005. 
BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21, 3448-3449. Marbach, D., Costello, J.C., Kuffner, R., Vega, N.M., Prill, R.J., Camacho, D.M., Allison, K.R., Kellis, M., Collins, J.J., Stolovitzky, G., 2012. Wisdom of crowds for robust gene network inference. Nat Methods 9, 796-804. Michalik, S., Bernhardt, J., Otto, A., Moche, M., Becher, D., Meyer, H., Lalk, M., Schurmann, C., Schluter, R., Kock, H., Gerth, U., Hecker, M., 2012. Life and death of proteins: a case study of glucose-starved Staphylococcus aureus. Mol Cell Proteomics 11, 558-570. Miro, J.M., Anguera, I., Cabell, C.H., Chen, A.Y., Stafford, J.A., Corey, G.R., Olaison, L., Eykyn, S., Hoen, B., Abrutyn, E., Raoult, D., Bayer, A., Fowler, V.G., Jr., 2005. Staphylococcus aureus native valve infective endocarditis: report of 566 episodes from the International Collaboration on Endocarditis Merged Database. Clin Infect Dis 41, 507-514. Nimmo, G.R., Bergh, H., Nakos, J., Whiley, D., Marquess, J., Huygens, F., Paterson, D.L., 2013. Replacement of healthcare-associated MRSA by community-associated MRSA in Queensland: confirmation by genotyping. J Infect 67, 439-447. Nimmo, G.R., Coombs, G.W., Pearson, J.C., O'Brien, F.G., Christiansen, K.J., Turnidge, J.D., Gosbell, I.B., Collignon, P., McLaws, M.L., 2006. Methicillin-resistant Staphylococcus aureus in the Australian community: an evolving epidemic. Med J Aust 184, 384-388. Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., Kanehisa, M., 1999. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 27, 29-34. Otto, A., van Dijl, J.M., Hecker, M., Becher, D., 2014. The Staphylococcus aureus proteome. Int J Med Microbiol 304, 110-120. Otto, M., 2010. Basis of virulence in community-associated methicillin-resistant Staphylococcus aureus. Annu Rev Microbiol 64, 143-162. 
Overton, I.M., Graham, S., Gould, K.A., Hinds, J., Botting, C.H., Shirran, S., Barton, G.J., Coote, P.J., 2011. Global network analysis of drug tolerance, mode of action and virulence in methicillin-resistant S. aureus. BMC Syst Biol 5, 68.
Peterson, J.D., Umayam, L.A., Dickinson, T., Hickey, E.K., White, O., 2001. The Comprehensive Microbial Resource. Nucleic Acids Res 29, 123-125. Rivera, F.E., Miller, H.K., Kolar, S.L., Stevens, S.M., Jr., Shaw, L.N., 2012. The impact of CodY on virulence determinant production in community-associated methicillin-resistant Staphylococcus aureus. Proteomics 12, 263-268. Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., Ideker, T., 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13, 2498-2504. Solis, N., Larsen, M.R., Cordwell, S.J., 2010. Improved accuracy of cell surface shaving proteomics in Staphylococcus aureus using a false-positive control. Proteomics 10, 2037-2049. Solis, N., Parker, B.L., Kwong, S.M., Robinson, G., Firth, N., Cordwell, S.J., 2014. Staphylococcus aureus Surface Proteins Involved in Adaptation to Oxacillin Identified Using a Novel Cell Shaving Approach. J Proteome Res. Song, Y., Lunde, C.S., Benton, B.M., Wilkinson, B.J., 2012. Further insights into the mode of action of the lipoglycopeptide telavancin through global gene expression studies. Antimicrob Agents Chemother 56, 3157-3164. Tusher, V.G., Tibshirani, R., Chu, G., 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 98, 5116-5121. Ventura, C.L., Malachowa, N., Hammer, C.H., Nardone, G.A., Robinson, M.A., Kobayashi, S.D., DeLeo, F.R., 2010. Identification of a novel Staphylococcus aureus two-component leukotoxin using cell surface proteomics. PLoS One 5, e11634. Welch, R., Welch, L., 2009. If you build it, they might come. Nat Rev Micro 7, 90-90. Ziebandt, A.K., Kusch, H., Degner, M., Jaglitz, S., Sibbald, M.J., Arends, J.P., Chlebowicz, M.A., Albrecht, D., Pantucek, R., Doskar, J., Ziebuhr, W., Broker, B.M., Hecker, M., van Dijl, J.M., Engelmann, S., 2010. 
Proteomics uncovers extreme heterogeneity in the Staphylococcus aureus exoproteome due to genomic plasticity and variant gene regulation. Proteomics 10, 1634-1644.

FIGURES

Figure 1 | Depiction of available S. aureus network graphs and their respective chromosomal OLN coverage, regulatory network coverage and virulence factor coverage for rapid visual comparison. Network visualisation and properties for (A) the Cherkasov protein-protein interaction network, (B)
Overton protein-protein interaction network, (C) Marbach gene regulatory network, (D) RegPrecise gene regulatory network, (E) KEGG BRITE metabolic functional hierarchy network, (F) String 9.05 protein-protein interaction network and (G) VFDB virulence factor classification network. Teal nodes in the network graphs correspond to proteins involved in gene regulatory networks. Nodes with red borders correspond to virulence factors and black nodes signify non-OLN nodes. Dark blue nodes correspond to genes with functions not described in the RegPrecise or Virulence Factor Databases. Pie charts display the chromosomal OLN coverage of each of the networks in dark blue (with proteins unique to each network in blue-grey), regulatory network coverage in teal and virulence factor coverage in red. Further details can be found in the supplementary information.

Figure 2 | Genome coverage of biological process gene ontology annotations and the concordance of GO term assignment across Affymetrix, UniProt and JCVI resources. (A) Genome coverage across biological process-related ontology annotations including the frequency of genes uniquely annotated in each gene ontology annotation. The genome overlap of genes annotated under GO molecular function (B), cellular component (C) and biological process (D) for Affymetrix (dark blue), UniProt (teal) and the JCVI-CMR (red) ontology annotations. The region showing the genes in common is designated by a white asterisk. The overlap of GO term assignment for the common genes across Affymetrix, UniProt and the JCVI-CMR for (E) molecular function, (F) cellular component and (G) biological process ontology annotations is shown in the lower Venn diagrams. Absolute values are detailed in the supplementary information (Supplementary Figure S1).

Figure 3 | Performance of gene ontology annotation resources in the interpretation of publicly available micro-array data sets. Nine data sets were analysed with SAM to reveal unique target gene sets with differential expression represented by a scale of red-white-blue, with red bars representing up-regulated genes and blue bars representing down-regulated genes (A). Regulated gene sets were analysed using available ontology and ontology annotation resources to identify over-represented ontology terms within the data. Analysis across the twenty-seven gene sets demonstrated no difference in performance for molecular function annotation resources (B) or cellular component resources (C). In terms of biological processes, the UniProt annotation was able to identify a higher frequency of over-represented ontology terms, compared to KEGG, TIGR, Affymetrix and JCVI-CMR (D). GO biological process annotations from Affymetrix, UniProt and JCVI-CMR infrequently identified the same over-represented gene ontologies when interpreting identical sets of down-regulated, up-regulated and all regulated genes across the nine micro-array experiments (E). In each experiment over-represented terms were identified by only one annotation (dark blue), two annotations (teal) or all three annotations (red).
