
S. J. Lycett i

Interaction Network Integration using Bayesian data fusion methods

S. J. Lycett

31 August 2007

MRes Bioinformatics

Supervisors: Dr. A. Wipat, Dr. J. Hallinan

Word Count = 25060


Abstract

High-throughput, small scale and expert curated interaction data sets can be combined to provide a fuller picture of the relationships between genes and their products, reduce experimental noise, and enhance weak interactions present in multiple data sources. Data integration is problematic because it is difficult to assess the data quality from diverse experiment types in a consistent manner. Previous studies have performed the integration by scoring the data sets against an expert-curated Gold Standard data set; however, these approaches risk biasing the resulting functional interactions towards those in the Gold Standard. In this work, a novel method for data integration based on Bayesian Data Fusion was developed. Multiple reference data sources were used to score experimental interaction data before a final integration was performed. Reference data derived from the Kyoto Encyclopedia of Genes and Genomes, Gene Ontology annotations and the Munich Information Center for Protein Sequences were used together with Saccharomyces cerevisiae data sets from BioGRID to create an integrated functional network. The utility of this integrated functional network for prediction of gene function was confirmed for genes on known pathways. Functional predictions for uncharacterised genes / proteins were made, and a comparison with previous studies showed that the integrated functional network developed here was able to predict the function of more uncharacterised genes / proteins at a higher level of detail than previously described methods.

Acknowledgements

I would like to thank my supervisors, Neil Wipat and Jennifer Hallinan for the technical discussions and helpful advice.

I am most grateful to Malcolm Farrow for advice on Bayesian probability measures, Matthew Pocock for Java programming advice, and Phil Lord for general advice and support.

Many thanks to my father, Dr. J. E. Lycett for personal support throughout the course.

Declaration

I declare that this dissertation represents my own work except where otherwise stated.

S. J. Lycett 31 August 2007


Contents

1 Introduction
  1.1 Aims and Objectives
  1.2 Scope
  1.3 Project Management
  1.4 Outline

2 Background
  2.1 Interactions
  2.2 Interaction Experiments and Databases
  2.3 Representation of Networks
  2.4 Methods for the creation of Integrated Functional Networks
  2.5 Motivation for the study

3 Methods and Data
  3.1 Data Sources
  3.2 Computational Methods
  3.3 Evaluation Methods

4 Results and Discussion
  4.1 Introduction
  4.2 Interaction Types
  4.3 Network Integration
  4.4 Properties of the Integrated Network

5 Conclusions

6 Future Work

7 References
  7.1 Literature
  7.2 URLs

8 Appendix
  8.1 BioGRID Networks by Study and Small Scale Studies
  8.2 Taverna Workflows used to create KEGG Networks
  8.3 Lee Log Likelihood Score
  8.4 Algorithm Implementation


1 Introduction

Cells can be viewed as complex systems of interacting parts (Kitano 2002). To understand the processes which occur within a cell, it is important to know which genes and gene products are present and how they interact (Cuisk 2005). Thus networks capturing the relationships between genes or their products are important tools for Systems Biology (Barabasi 2004). The development of high-throughput experimental techniques has meant not only that the genome of an organism can be sequenced, but also that the expression levels of thousands of genes can be measured at once and the interactions between thousands of proteins can be found in parallel. Integrating the results of diverse experiments yields a fuller picture of the role of genes and their products within the cell (Searls 2005). In particular, integrating diverse interaction data means that a functional network capturing the functional relationships between genes or their proteins can be created (von Mering 2002, Lee 2004). Since the interactions between genes or their products help to define their function, the function of an uncharacterised protein can be inferred from its characterised interaction partners (Oliver 2000).

Baker’s yeast, Saccharomyces cerevisiae, is a model organism often used to study processes within a eukaryotic cell, as it is easily grown in a laboratory but displays many of the essential cellular processes present in higher eukaryotes such as humans (Ideker 2001). In this project data from many Saccharomyces cerevisiae experiments will be integrated to obtain a set of interactions covering the entire genome, and predictions about the genes with unknown function will be made. Data integration is problematic because it is not clear how to measure and compare the accuracy of individual experiments, since diverse experimental methods are used (Bork 2004, Hart 2006). Hence a major part of this project is the development of a new method for data integration.

1.1 Aims and Objectives

The aim of this project is to:

Develop a principled method for creating integrated functional networks from a diverse range of data sources including high-throughput experimental results and expert-curated interaction data.

The specific objectives of this project are:

1. Investigate differences between functional links from data from different experiment types

2. Develop a method for assessing the quality of diverse experimental data sets

3. Develop a data fusion method for integrating the data

4. Create an integrated functional network for Saccharomyces cerevisiae

5. Evaluate the integrated functional network and make functional predictions


1.2 Scope

Experimental data and expert-curated interactions for Saccharomyces cerevisiae deposited in publicly available databases were used for the investigation and final integration. The integration method was developed from mathematical concepts informed by data analysis. An implementation of the method in a high level programming language enabled a functional integrated network to be created. Methods to evaluate the functional integrated network were also implemented in the high level programming language, and functional predictions for un-annotated genes were made.

1.3 Project Management

The work was divided into three main phases: initial data analysis; integration method development; and data integration and functional network evaluation. The time-scales for the phases of the work and the final writing phase are shown in Figure 1-1.

Figure 1-1: Project Plan

The phases of the work were broken down as follows:

Initial data analysis
- Collect experimental data and expert-curated data (reference data) from the databases
- Develop and implement a method to convert the data into a suitable format
- Develop and implement methods to measure similarity and differences between the information content of functional links and networks from different data sources

Integration method development
- Develop and implement a method to select a set of reference data sources
- Develop and implement a method to combine experimental data sets with a reference data source
- Develop and implement a method to perform the final integration using multiple reference data sources

Data integration and functional network evaluation
- Integrate experimental data with the reference data using the method developed
- Implement a standard method to compare the integrated network to a manually curated interaction network
- Implement a method to extract functional predictions from the integrated network and compare the predictions with those produced by existing methods


1.4 Outline

The background to this project is described in chapter 2, covering the basic concepts used: interactions; types of experimental data and databases; representation of networks; and methods for creating integrated functional networks. In section 2.5 the detailed problem addressed by this project is described. Chapter 3 contains a description of the data sets; the basic computational methods; and evaluation methods used in the project.

The three phases of the project (initial data analysis; integration method development; and integration and evaluation) form the three major subsections of chapter 4. In each subsection the results are presented and discussed. The conclusions of the project are presented in chapter 5 and suggestions for future work are in chapter 6. References and URLs are in section 7. Finally the Appendix in section 8 contains additional detail on the BioGRID data used, how the Kyoto Encyclopedia of Genes and Genomes (KEGG) reference networks were created, differences between the log-likelihood ratio score used here and that used by another author, and listings of the main integration algorithms developed using MATLAB.


2 Background

Cataloguing the entire set of genes in a genome is an important step in understanding and characterising an organism. However, an understanding of how the organism functions can be gained by knowledge of what those genes do and how their products interact in a range of circumstances (Ideker 2001, Tyers & Mann 2003). Rather than focus on the activities of individual genes or their products, Systems Biology aims to understand organisms as a whole by examining the relationships between all the constituent genes and proteins (Ideker 2001, Kitano 2002b). There are many ways in which the relationships between the constituent parts and processes within a cell can be revealed; consequently a diverse range of experimental techniques has been developed. Importantly for Systems Biology, high throughput techniques have been developed to measure interactions between thousands of genes or their products (Ideker 2001). Combining the results from different studies offers a way to improve the understanding of the interactions within a cell, but integrating the wealth of heterogeneous data from diverse sources is a major challenge in bioinformatics (Searls 2005).

Currently the research in Systems Biology has two aspects: finding the structure of a system as a network of interacting parts (for example identifying genes which form a functional module); and deducing the dynamic interactions between genes, RNA and proteins (for example modelling how the expression levels of a set of genes change over time) (Bruggeman & Westerhoff 2007). The main aim of this project is to develop a method of compiling the network of interactions between genes or their products to enable the function of previously uncharacterised genes within the cell to be understood.

2.1 Interactions

An interaction between genes or their products can have many forms. The most direct form of interaction is when proteins bind physically to each other, for example two globin proteins binding to two globin proteins to make haemoglobin. For one protein to bind physically to another, the three-dimensional structures of the proteins must be compatible, and binding sites have highly specific shapes so that only those particular types of proteins can bind together. Since the DNA sequence of a gene determines the amino acid sequence of the protein, and the amino acid sequence largely determines the three-dimensional structure of the protein, if two proteins have a physical interaction then their genes can be thought of as interacting at an abstract level. If the binding of one protein to another is a part of a particular biological process or function, then they have a functional interaction. Additionally if the two proteins do not bind directly to each other, but nevertheless are part of the same multiple protein complex, and that complex has a particular biological function, then those proteins have a functional interaction (Gavin 2002). For example, the proteins that make up ribosomes can be considered to be functionally linked to each other.

Another type of physical interaction between proteins in a cell is when one protein catalyses the addition or subtraction of a small molecule to a second protein, as in a phosphorylation. The second modified protein might then go on to cause a modification to a third protein. If the first protein was not present, or somehow disabled, then the activity of the third protein would be affected. Since all three proteins are needed in this simple pathway, they are functionally linked.

In addition to a protein interacting with another protein, a protein could bind directly to DNA to enhance or suppress the transcription of some nearby target genes. If the activator or inhibitor protein was not present, the activity of the target genes would be affected. Consequently, there is a functional interaction between the activator or inhibitor protein, and the target genes. Finally, two genes are functionally related if the product of the first gene could wholly or partially substitute for the product of the second gene in a biological process or function, if the second product was not present or mutated.

The functional interactions mentioned above fall into two main types: physical interactions between proteins; and genetic interactions. Since the interaction mechanisms are quite different between the various functional interaction types, different experimental techniques are needed to detect the interactions (Bork 2004).

2.2 Interaction Experiments and Databases

2.2.1 Databases

Automated or semi-automated high-throughput experimental techniques enable the detection of interactions between thousands of genes in parallel (e.g. see Spellman 1998, Gavin 2002, Tyers & Mann 2003, Cuisk 2005 and section 2.2.2). Clearly, the detailed results of high-throughput experiments cannot all be reported in a scientific paper. Fortunately, as with gene sequencing studies, the results of many high-throughput studies are submitted to public databases as part of the publication process. Additionally, the results from small studies (where interactions from only a few genes were measured) can be extracted from the literature and included in a database by the curators even if the results were not directly submitted. Typically a database stores the names of the genes involved in the interaction (systematic, common or own internal identifiers), a reference to the original paper in which the results were published and the experiment type. Thus a network of interacting genes / proteins can be created from the interaction data by collecting all the interactions between genes / proteins from a particular study or experiment type.

Public databases containing collections of experimental interaction data include:
- [BIND] Biomolecular Interaction Network Database (Bader 2001)
- [BioGRID] BioGRID (Stark 2006)
- [DIP] Database of Interacting Proteins (Xenarios 2002)
- [IntAct] IntAct (Hermjakob 2004)
- [MIPS] Munich Information Center for Protein Sequences, MPact: Representation of Interaction Data at MIPS (Guldener 2006)

As well as databases containing the individual experimental results, some databases contain classification schemes derived by human experts from sequence, structure, interaction data and other data. As these databases contain high quality information, and attempt to capture particular aspects of the current state of knowledge of a biological system, they are often used as sources of reference data, so called ‘Gold Standards’. These reference databases include the Kyoto Encyclopedia of Genes and Genomes ([KEGG], Kanehisa & Goto 2000, Kanehisa 2006), the Gene Ontology Project ([GO], Ashburner 2000) and the classifications at the Munich Information Center for Protein Sequences ([MIPS], Mews 1999, Guldener 2006).

[MIPS] http://mips.gsf.de/genre/proj/yeast/index.jsp

[GO] http://www.geneontology.org/

[KEGG] http://www.genome.ad.jp/kegg/

[MIPS] MPact: http://mips.gsf.de/genre/proj/mpact/

[IntAct] http://www.ebi.ac.uk/intact/site/index.jsf

[DIP] http://dip.doe-mbi.ucla.edu/dip/Download.cgi

[BioGRID] http://www.thebiogrid.org/

[BIND] http://bond.unleashedinformatics.com/


2.2.1.1 Kyoto Encyclopedia of Genes and Genomes

The Kyoto Encyclopedia of Genes and Genomes [KEGG] is a well known database containing information about genes, enzymes and metabolites on known pathways (Kanehisa & Goto 2000, Kanehisa 2006). Reference functional interaction data can be deduced from the data in KEGG, for example Lee et al. (Lee 2004) and Yamanishi et al. (Yamanishi 2004) have used the KEGG PATHWAYS database to create a reference network in which genes on the same pathway are linked and no genes which are on different pathways are linked.
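The construction of such a reference network can be illustrated with a minimal sketch. The pathway names and gene memberships below are hypothetical placeholders rather than data taken from KEGG, and the function name `reference_links` is an assumption made for illustration; it is not the implementation used by the cited authors or in this project. Genes that share at least one pathway are linked, and gene pairs that never share a pathway are left unlinked.

```python
from itertools import combinations

# Hypothetical pathway membership; real data would be obtained from KEGG.
pathways = {
    "pathway_1": ["geneA", "geneB", "geneC"],
    "pathway_2": ["geneC", "geneD"],
}


def reference_links(pathway_members):
    """Return the set of undirected gene pairs that share a pathway."""
    links = set()
    for genes in pathway_members.values():
        # Sorting makes each pair appear in one canonical order.
        for first, second in combinations(sorted(set(genes)), 2):
            links.add((first, second))
    return links


links = reference_links(pathways)
print(sorted(links))
```

In this toy example geneA and geneD are never on the same pathway, so no link is created between them.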

2.2.1.2 Gene Ontology Project

The Gene Ontology project [GO] provides a controlled vocabulary and relationship structure between the terms to describe biological processes, cellular components, and molecular functions occurring within cells (Ashburner 2000). Except for the three root terms (biological process, cellular component, and molecular function), each GO term in the vocabulary consists of a unique identifier, name, description, and a list of immediate parent terms. GO terms with many ancestors refer to highly specific processes, components or functions.

Genes with known function in the Saccharomyces Genome Database [SGD] have been annotated with GO terms, and a list of gene-GO term associations can be downloaded from [GO]. Hence networks of genes can be created, in which the genes are linked if they share a GO term annotation, or if the term annotating one gene is a more specific child of the term annotating the other. For example Kiemer et al. (Kiemer 2007) and Myers et al. (Myers 2006) generate networks relating genes via their biological process GO terms.

2.2.1.3 Munich Information Center for Protein Sequences

In addition to the high-throughput experimental data collected at MIPS [MIPS], this database also contains curated data sets describing gene and protein enzyme classification, functional assignments, complex membership and cellular localisation (Mews 1999, Guldener 2006). The MIPS Functional Catalogue (FunCat) contains 28 functional hierarchies, and genes are manually assigned to the categories by the MIPS team (Ruepp 2004). Some groups, such as Antonov et al. (Antonov 2006) and Deng et al. (Deng 2003), use MIPS FunCat as a ‘Gold Standard’, assuming that if the genes are in the same functional category then they (or their products) interact. MIPS Complexes is also a popular choice for a gold standard reference data set (e.g. Jansen 2003, Kiemer 2007, Lu 2005). It contains 66 complex types, each of between 1 and 266 proteins.

2.2.2 Experiment Types

There are a wide variety of experimental techniques used to measure interactions between genes or proteins, and each database has a slightly different scheme for recording the techniques used to generate the interaction data. However, the common types of experiment and the types of interaction measured are described in the subsections below. A glossary of terms used to describe experiments can also be found in the Saccharomyces Genome Database [SGD Glos].

[SGD Glos] http://www.yeastgenome.org/help/glossary.html

[SGD] http://www.yeastgenome.org/

2.2.2.1 Physical Interaction Experiments

Physical interaction experiments measure protein-protein interactions, and typically involve a ‘target’ protein physically binding to a ‘bait’ protein. The resulting complex is separated (e.g. by precipitation) or detected (e.g. by some type of physical change due to a tag) from the remaining un-bound proteins, and target proteins identified if necessary (e.g. by mass spectrometry, gel bands or other means). Physical interaction techniques include:

Co-immunoprecipitation

The basic principle of co-immunoprecipitation is to use an antibody to bind to a bait protein. If a target protein then binds to the bait protein attached to the antibody, the resulting complex precipitates from solution. A range of techniques, including mass spectrometry, specific RNA binding, or western blotting, can subsequently identify the proteins in the complex. Co-immunoprecipitation techniques are also known as Affinity Capture techniques.

Tandem Affinity Purification

Tandem Affinity Purification is an important type of co-immunoprecipitation technique. Instead of using specific antibodies to bind to the bait proteins, bait proteins are tagged with an immunoglobulin G binding domain, to allow the proteins to bind to immunoglobulin beads contained in an affinity column, and hence be precipitated. If the tagged protein forms part of a complex, then the complex is precipitated. The proteins within the complex are separated by gel electrophoresis and identified by mass spectrometry. Tandem Affinity Purification is more suitable for the precipitation of, and therefore detection of, multi-protein complexes than co-immunoprecipitation. Both Gavin and co-workers (Gavin 2002) and Ho and co-workers (Ho 2002) used Tandem Affinity Purification on over 1000 yeast proteins to identify multi-protein complexes.

Yeast Two Hybrid

Genes that code for ‘bait’ proteins are fused to a DNA binding domain and inserted into one strain of yeast. ‘Target’ proteins are fused to a DNA activation domain for a reporter gene and inserted into another strain of yeast. When the two strains of yeast are crossed, if the bait protein binds to the target protein in the resulting offspring, then the DNA binding and activation domains will also bind. The combined DNA binding and activation domains regulate a reporter gene, and the reporter gene causes an easily observable change in the phenotype of the offspring (for a review see e.g. Mukherjee 2001). The Yeast Two Hybrid (Y2H) experimental system has been used by several groups (e.g. Ito 2001, Tong 2002, Uetz 2000) to perform large scale protein-protein interaction studies on S. cerevisiae. Although Y2H is a very useful high throughput technique, it is prone to systematic false positives and negatives when considering functional interactions. The systematic errors occur because the target and bait proteins are expressed together in the nucleus of the hybrid yeast. However if the proteins do not naturally occur in the same cellular compartment, or at the same time in the cell cycle, then they may not interact in vivo, even though they might bind in the Y2H experiment. Similarly if the proteins do not naturally occur in the nucleus, they might not bind in the experiment even though they do interact in vivo.

Fluorescence Resonance Energy Transfer

Fluorescence Resonance Energy Transfer (FRET) is another type of physical interaction experiment that does not involve physical separation of the bound protein complex. Instead proteins are tagged with fluorescent markers, which emit a characteristic wavelength (under excitation conditions) when the proteins bind to each other.


2.2.2.2 Genetic Interaction Experiments

Genetic interactions are inferred by measuring the effect of pairs of gene mutations on the viability of the organism. For example if mutants with only Gene A or Gene B disabled are still viable, but mutants with both Gene A and Gene B disabled die, then Gene A and Gene B are assumed to be functionally linked via the so-called genetic interaction of their products. Tong and co-workers (Tong 2001) created separate strains of S. cerevisiae, each one with a deletion mutation in a different gene. Of the 6200 genes mutated, 4700 were not lethal. The strains containing non-lethal mutations were crossed with each other in a double mutation synthetic lethal screen experiment. If the double mutant offspring was non-viable, then it was inferred that the original genes and their products (before mutation) were functionally related. Other variants of the synthetic lethal screen include synthetic growth effect and synthetic rescue. A synthetic rescue experiment is the opposite of a synthetic lethal experiment, in that cells with two mutations survive (one mutation cancels out the other) but cells with only one mutation die.

2.2.2.3 mRNA expression levels

Messenger RNA is produced as a result of gene transcription, and as a precursor to protein synthesis. Measuring the amount of mRNA present in a cell is indicative of the activity of the gene and the amount of protein being produced. Messenger RNA in a sample can be hybridised to a DNA microarray slide, which contains an array of different DNA probe molecules complementary to the mRNAs of interest. Microarray slides contain thousands of DNA probes, so the expression levels of all the genes in S. cerevisiae can be measured at once. By measuring the difference in expression levels in the set of genes in the sample under particular conditions compared to control conditions, or the change in expression levels over time (e.g. in a cell cycle), correlations in the activity pattern of the genes can be found (e.g. see Spellman 1998). Interactions between genes are inferred from microarray data by making two assumptions. Firstly it is assumed that a correlation in the activity pattern implies that the genes are co-regulated. Secondly it is assumed that co-regulated genes are functionally related. Microarray studies are extremely useful and convenient for studying the expression of whole genomes, but the interactions inferred from them can have a large false positive rate. In addition to any false positives caused by the technology (e.g. hybridisation problems, weak signals) and data processing (e.g. parameters in correlation and clustering analysis used), false positives occur because the above assumptions do not necessarily hold.
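As an illustration of how links might be inferred from expression data, the sketch below computes the Pearson correlation between pairs of expression profiles and links the genes whose correlation exceeds a threshold. The expression values, the threshold of 0.9 and the function names are assumptions chosen for illustration only; published microarray analyses typically involve additional normalisation and clustering steps that are not shown here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_links(profiles, threshold):
    """Link every pair of genes whose profiles correlate at or above threshold."""
    genes = sorted(profiles)
    return {(g1, g2)
            for i, g1 in enumerate(genes)
            for g2 in genes[i + 1:]
            if pearson(profiles[g1], profiles[g2]) >= threshold}


# Hypothetical expression levels measured at four time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 1.0, 3.5, 1.5],
}
THRESHOLD = 0.9  # assumed cut-off, not a value from this work

coexpressed = coexpression_links(profiles, THRESHOLD)
print(sorted(coexpressed))
```

Only geneA and geneB are linked in this example. Whether that link reflects a true functional relationship depends on the two assumptions described above.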

2.2.2.4 Localisation Experiments

The location of proteins within a cell (at a particular time instant) can be revealed by tagging proteins with a green fluorescent marker protein and imaging the cell. Alternatively, the protein can be tagged with a short protein sequence (epitope), and an antibody containing a green fluorescent marker can be introduced into the cell to bind to the epitope attached to the protein. The co-location of proteins within the cell may indicate that they have a shared function, while proteins located in different cellular compartments are less likely to share the same function. For an example of a localisation technique, see the study of yeast protein localisation by Huh and co-workers (Huh 2003).


2.3 Representation of Networks

To gain an understanding of how the thousands of genes and proteins in an organism interact with each other, the interactions need to be represented in a form amenable to computational processing.

A set of interacting genes or gene products can be represented as a network (graph), in which the nodes (vertices) are the genes (or their products) and the links (edges) between the nodes represent interactions. For simplicity, a single node in the network represents a gene and its product(s), if any. Also, a link between nodes is interpreted as genes interacting via their products either as protein-protein or protein-DNA interactions. A very simple example of a network is displayed in Figure 2-2 where gene A is linked to genes B and C, and gene C is linked to gene D.

Figure 2-2: Example of a very simple network (left) and its Adjacency matrix (right).

Networks also have a convenient mathematical / computational representation as an adjacency matrix (A), so called because it captures the nearest neighbour information in the graph. The adjacency matrix of the simple network is also shown in Figure 2-2. If a direct link between the ith node (e.g. gene A) and the jth node (e.g. gene B) exists, then the value of the element in the ith row and jth column of the array, A( i, j ), equals 1. If there is no direct link between the nodes, then the value of A( i, j ) equals 0. In some biological experiments only the presence or absence of an interaction is measured, so links either exist or do not exist, and the adjacency matrix consists of ones and zeros only. However for other types of experiments a weight (expressing the degree of confidence) or probability (p) can be assigned to an interaction, and the link can be represented as the weighted adjacency matrix element A( i, j ) = p.

Note that the network as shown in Figure 2-2 has undirected links, because it has been assumed that if gene A interacts with gene B (via the products of the genes) then gene B must interact with gene A. Consequently, the adjacency matrix of this network is symmetric (about the diagonal). Protein – protein interaction networks are assumed to contain undirected links and therefore have symmetrical adjacency matrices. However, if the links between nodes were directed, as in a genetic regulatory network where gene / protein A could inhibit gene B, but not vice versa, then the adjacency matrix would be asymmetrical – i.e. the yellow elements in Figure 2-2 would not be the same as their orange counterparts. In a functional network the links are assumed to be undirected, since they are generally composed of multiple data sources, some of which will usually be undirected.
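The adjacency matrix representation can be sketched in a few lines of code. The example below builds the matrix for the simple network described above (gene A linked to genes B and C, and gene C linked to gene D) and checks that it is symmetric. The function names `adjacency_matrix` and `is_symmetric` are hypothetical helpers introduced for illustration; they are not the MATLAB routines used in this project.

```python
# Nodes and undirected links of the simple example network.
nodes = ["A", "B", "C", "D"]
links = [("A", "B"), ("A", "C"), ("C", "D")]


def adjacency_matrix(nodes, links):
    """Build a symmetric 0/1 adjacency matrix as a list of rows."""
    index = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for first, second in links:
        i, j = index[first], index[second]
        matrix[i][j] = 1
        matrix[j][i] = 1  # undirected link: mirror across the diagonal
    return matrix


def is_symmetric(matrix):
    """True if A(i, j) equals A(j, i) for every pair of nodes."""
    size = len(matrix)
    return all(matrix[i][j] == matrix[j][i]
               for i in range(size) for j in range(size))


A = adjacency_matrix(nodes, links)
for name, row in zip(nodes, A):
    print(name, row)
print("symmetric:", is_symmetric(A))
```

For a directed network, the mirrored assignment would be omitted, and the resulting matrix would in general no longer be symmetric.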

Figure 2-2 content: network with nodes A, B, C and D and links A-B, A-C and C-D; adjacency matrix:

     A  B  C  D
A    0  1  1  0
B    1  0  0  0
C    1  0  0  1
D    0  0  1  0

Representing the set of interacting genes or their products as a network enables the overall structure of the system to be examined mathematically / computationally. Important topological properties used to characterise networks include the degree distribution and clustering coefficients (Barabasi 2004). The degree distribution is the distribution of the number of links per node over all the nodes in the network. The shape of the degree distribution can help to characterise the network. For example, biological networks are thought to contain a few highly connected nodes, known as hubs (Jeong 2000). These hubs often represent proteins essential to the viability of the organism (Jeong 2001, Przulj 2004). The clustering coefficient measures the density of links in the neighbourhood of a node. For example nodes representing proteins which form a complex would have high clustering coefficients, because each node would have neighbours that were linked to each other (Przulj 2004). Biological networks also tend to have high clustering coefficients compared with equivalently-sized, randomly connected networks.
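Both properties are straightforward to compute. The sketch below assumes the commonly used definition of the clustering coefficient of a node with k neighbours, C = 2e / (k(k - 1)), where e is the number of links that exist between those neighbours; nodes with fewer than two neighbours are assigned a coefficient of zero by convention. The small network and the function names are hypothetical and are used for illustration only.

```python
def neighbours(links):
    """Map each node to the set of nodes it is directly linked to."""
    result = {}
    for first, second in links:
        result.setdefault(first, set()).add(second)
        result.setdefault(second, set()).add(first)
    return result


def clustering_coefficient(node, adjacent):
    """Fraction of possible links between a node's neighbours that exist."""
    nearby = sorted(adjacent[node])
    k = len(nearby)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbours
    present = sum(1 for i, a in enumerate(nearby)
                  for b in nearby[i + 1:] if b in adjacent[a])
    return 2.0 * present / (k * (k - 1))


# Hypothetical network: a triangle A-B-C with an extra node D attached to C.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
adjacent = neighbours(links)
degree = {node: len(adjacent[node]) for node in sorted(adjacent)}
clustering = {node: clustering_coefficient(node, adjacent)
              for node in sorted(adjacent)}
print(degree)
print(clustering)
```

Here nodes A and B sit in a fully connected neighbourhood and have a coefficient of 1, whereas node C has the highest degree but a lower coefficient because its neighbour D is not linked to A or B.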

2.4 Methods for the creation of Integrated Functional Networks

The different experimental techniques used to measure interactions each have their own strengths, weaknesses and error rates, and each experiment performed may focus on one particular set of genes. However, by combining the results from different experiments, a more complete picture of the functional interactions occurring within an organism can be obtained (von Mering 2002, Bork 2004, Joyce 2006). If the experiments have some degree of overlap, the results from one experiment can be verified (or refuted) by the others, and the number of erroneous interactions can be reduced. Furthermore, interactions too weak to be detected reliably by only one method might be revealed after integrating the results from several experiments. Since the interactions that a gene (or its product) participates in help to define its function within the cell, predictions about the function of an un-characterised gene can sometimes be made by considering the function of its characterised interaction partners (Oliver 2000).
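One simple way to make such a prediction is a majority vote over the annotated interaction partners of a gene. The sketch below illustrates this idea only; the gene names, annotations and the function `predict_function` are hypothetical, and the voting rule is not necessarily the prediction method used later in this work.

```python
from collections import Counter


def predict_function(gene, links, annotations):
    """Return the most common annotation among a gene's annotated partners."""
    votes = Counter()
    for first, second in links:
        if gene == first:
            partner = second
        elif gene == second:
            partner = first
        else:
            continue
        if partner in annotations:
            votes[annotations[partner]] += 1
    if not votes:
        return None  # no characterised partners, so no prediction is made
    return votes.most_common(1)[0][0]


# Hypothetical network: geneX is uncharacterised, g1 to g3 are characterised.
links = [("geneX", "g1"), ("geneX", "g2"), ("g3", "geneX"), ("g1", "g2")]
annotations = {"g1": "DNA repair", "g2": "DNA repair", "g3": "transport"}

prediction = predict_function("geneX", links, annotations)
print(prediction)
```

In a weighted network the votes could instead be weighted by the confidence of each link, so that well-supported interactions contribute more to the prediction.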

There are two main methods used to perform data integration – (1) the individual data sets are weighted and combined; and (2) the presence (absence) of an interaction (or type of interaction) is learned or inferred from a collection of data sets.

In the first method, reference data containing known interactions (‘Gold Standard’) is used to assess the quality of the individual data sets to be integrated: those data sets that ‘match’ the reference data set well are assigned a high weight (for example see Lee 2004). If the metric describing how well the data sets match the reference data is based upon a statistical estimate of probabilities derived from Bayes' theorem (a likelihood ratio), and the scored networks are combined according to the probability rules of Bayesian inference [Bayesian Inference Def], then this technique is known as Bayesian data fusion. Note that Bayesian data fusion is used in other fields, for example integrating surveillance data from multiple sensors such as radar, sonar and infra-red imaging, and much of the same formalism applies (see for example Koks & Challa 2005).
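The general idea can be illustrated with a minimal naive Bayes sketch. It assumes that each data set is scored by a single likelihood ratio (the fraction of reference links it reports divided by the fraction of reference non-links it reports), that the data sets are conditionally independent, and that only data sets reporting a given pair contribute to its score. The data set names, gene pairs, prior odds and function names are all hypothetical, and this simplified form is not the scoring scheme developed in this work.

```python
def likelihood_ratio(data_links, positives, negatives):
    """P(reported | reference link) / P(reported | reference non-link).

    Assumes the data set reports at least one negative pair, otherwise the
    ratio is undefined.
    """
    sensitivity = len(data_links & positives) / len(positives)
    false_positive_rate = len(data_links & negatives) / len(negatives)
    return sensitivity / false_positive_rate


def posterior_odds(pair, datasets, ratios, prior_odds):
    """Multiply the prior odds by the ratio of every data set reporting the pair."""
    odds = prior_odds
    for name, data_links in datasets.items():
        if pair in data_links:
            odds *= ratios[name]
    return odds


# Hypothetical reference data: known links and known non-links.
positives = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}
negatives = {("a", "e"), ("b", "e"), ("c", "e"), ("d", "e")}

# Hypothetical experimental data sets.
datasets = {
    "y2h": {("a", "b"), ("a", "c"), ("b", "c"), ("a", "e")},
    "tap": {("a", "b"), ("c", "d"), ("b", "e")},
}
ratios = {name: likelihood_ratio(data_links, positives, negatives)
          for name, data_links in datasets.items()}

PRIOR_ODDS = 0.5  # assumed prior odds of a link, for illustration only
odds_ab = posterior_odds(("a", "b"), datasets, ratios, PRIOR_ODDS)
probability_ab = odds_ab / (1.0 + odds_ab)
print(ratios, odds_ab, probability_ab)
```

A pair reported by both data sets receives the product of both likelihood ratios, so agreement between independent experiments increases the posterior odds of a link.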

In the second approach, Bayesian classifiers (e.g. Camoglu 2006, Lu 2005, Srinivasan 2006), neural networks, support vector machines (e.g. Yellaboina 2007) or other machine learning techniques can be trained to recognise which combinations of the data give rise to the known interactions in the reference data. In this context of supervised learning, a machine learning algorithm is trained by automatically setting and adjusting internal parameters, such that when some interaction data is input, the ‘true’ interactions are output (according to the reference network). Once the internal parameters have been learnt, the ‘true’ interactions present in the rest of the data can be inferred. However, a different class of machine learning algorithms can infer links between genes without using training data in this way. These unsupervised learning algorithms adjust their internal parameters to maximise the consistency between the input data sets and thus infer the link probabilities between genes. Examples of unsupervised machine learning techniques include Bayesian network inference (Heckerman 1995) and Markov Random Fields (Jaimovich 2005).

[Bayesian Inference Def] http://en.wikipedia.org/wiki/Bayesian_inference

To perform the data integration, or to test the quality of the integrated network, reference data sets, also known as ‘Gold Standards’, containing known interactions are required. Gold Standard interaction data sets representing the currently understood ‘ground truth’ are generally derived from the highly manually curated reference databases as described in 2.2.1. Typical examples of reference data sets used as gold standards are given in Table 2-1.

Table 2-1: Gold Standard Datasets

Database | Interaction Data | Protein / Gene Links | Reference
KEGG | Metabolic Pathways | On same pathway | Kanehisa 2006
GO | Biological Process, Molecular Function, Cellular Components | Share GO classification terms | Ashburner 2000
MIPS | Complexes | In same complex | Güldener 2006
MIPS | FunCat | In same functional category | Güldener 2006

A survey of the reference data sets used by various groups as positive gold standards (capturing the expected links between genes / proteins) and negative gold standards (describing the absence of interactions between genes / proteins) is displayed in Table 2-2. The integration methods used are also displayed in Table 2-2. Note that Rhodes and co-workers and Yellaboina and co-workers perform their integration on Human and E. coli data respectively, rather than S. cerevisiae.

Table 2-2: Survey of Gold Standards and Integration methods used in the literature.

Author | Positive Gold Standard | Negative Gold Standard | Method
Antonov 2006 | MIPS FunCat | - | Data Fusion
Jansen 2003 | MIPS Complexes | Different Cellular Compartments | Data Fusion
Kiemer 2007 | Own literature curated set | Own literature curated set | Data Fusion
Lee 2004 | KEGG | Sub-cellular localisation data | Data Fusion
Rhodes 2005 | Human Protein Reference Database | GO Plasma membrane and Nucleus components | Data Fusion
Ulitsky 2007 | - | - | Data Fusion based on network topology
Myers 2005 | Own expert created gold standard | Own expert created gold standard | Data Fusion
Hwang 2005 | - | - | Data consistency scoring
Camoglu 2006 | SCOP | - | Bayesian Classifier
Lu 2005 | MIPS Complexes | Different Cellular Compartments | Bayesian Classifier
Huttenhower 2006 | Biological Process GO | - | Bayesian Network Inference
Srinivasan 2006 | KEGG | - | Network inference (supervised)
Yamanishi 2004 | KEGG | - | Network inference (supervised)
Deng 2003 | MIPS FunCat | - | Markov Random Fields
Jaimovich 2005 | - | - | Markov Random Fields
Yellaboina 2007 | EcoCyc (manually curated) | Secreted and Cytoplasmic proteins | Support Vector Machine

The main advantages of data fusion methods, where data sets are weighted and combined, are that the procedure is relatively simple and quick to implement and run, and that the weights are biologically meaningful measures of data quality. Machine learning techniques are generally more complicated and take longer to train and run, and it can be difficult to work out which features of the data are important to the final classifier. One of the main conceptual differences between the two approaches lies in where and how the biological meaning (‘domain knowledge’) of the integration procedure is captured. When performing data fusion, the domain knowledge is encapsulated by the choice of weighting and combining methods at the time of the algorithm design, hence the actual implementation and running is relatively straightforward. However, when using machine learning techniques, the algorithm acts like a biological expert and makes inferences from the training data at run-time, so that all of the domain knowledge is captured in the internal weights of the classifier. For supervised machine learning algorithms, the choice of training data becomes critically important: if the training data does not encompass all the types of interaction that are of interest, then neither will the output network. The unsupervised machine learning algorithms offer a way around this problem, because the possible output classes are deduced from the data. However, both supervised and unsupervised machine learning techniques can be difficult to extend, because the classifiers must be retrained or re-run in order to include a new type of data should one become available.

In Table 2-2 it can be seen that many studies have constructed integrated networks, employing a variety of methods and gold standards. The choice of gold standard is important: Myers and co-workers (Myers 2006) found that scoring the same data against different gold standards gave quite different results. Furthermore, it has been suggested that different data sets have different biases (Antonov 2006, Myers 2006). If this is the case, any gold standard used to inform the integration of the data will necessarily bias the resultant network towards the type of functional interaction present in the gold standard. Some groups have avoided gold standards altogether, preferring to integrate the data on the basis of common functional modules (Ulitsky 2007) or statistical tests measuring the consistency between the individual data sets (Hwang 2005). Jaimovich and co-workers (Jaimovich 2005) tried to infer the interaction probabilities between genes from four data sources, assuming that the probability of interaction could be affected by some hidden variables, such as cellular localisation. However, the number of variables required for the entire S. cerevisiae interactome was too large, so Jaimovich and co-workers concentrated on a sub-set of the genes.


2.5 Motivation for the study

The interactions between genes or their products can be studied in order to improve the understanding of the role of those genes or their products within cellular processes or structures. Interactions can be measured by a wide variety of experiments. Integrated functional networks combine multiple interaction data sets from diverse sources in a principled fashion to produce a network where an edge and its corresponding weight can be taken as an indication of the combined evidence that some kind of functional interaction exists. An advantage of this approach is that weak lines of evidence for an interaction from multiple data sources can be combined to give a strong indication of a functional interaction. The resulting networks can be studied using computational mechanisms to measure aspects of their topology and global properties, and can also be examined manually to reveal previously unknown functional interactions.

To produce a valid network from multiple data sets, the quality of each dataset must be taken into account when calculating its contribution to the weight on a composite edge. The most commonly applied method for calculating dataset quality is by scoring it against a Gold standard or reference network believed to contain a high quality set of functional interactions.

However, a major problem with this approach is that assessing the quality of diverse experimental data sets against a single gold standard, or using a single gold standard to train a machine learning algorithm, will bias the resulting integrated network to the type of interactions present in the gold standard. Consequently, new methods are required to integrate gene or protein interaction data taking into account the characteristics of the data source being combined and the type of experiment that was used to generate it.

The purpose of this project was to investigate a method of data integration to create a functional network that avoids biasing the integrated functional network when scoring individual data sources against a gold standard.


3 Methods and Data

In this chapter the experimental data sets used are described, and methods for compiling the reference data sets from public databases are outlined. Additionally, the computational representation of the data used and the evaluation methods applied to the final integrated network in the subsequent chapters are described. The initial data analysis, subsequent development of the integration method and evaluation of the final integrated network are the subject of the next chapter.

3.1 Data Sources

The data used in this project was interaction data of the form ‘Gene / Protein A interacts with Gene / Protein B’, where A and B are the systematic gene names. In some cases ‘interacts with’ may have a weight associated with it indicating the strength of the interaction according to some criterion. Two categories of data are considered: experimental data and reference data. Experimental data sets are assumed to be noisy and incomplete, and not all interactions or absence of interactions reported are to be believed. In contrast the interactions that are present in the reference data sets are assumed to have a high probability of representing an aspect of observed biology. However, the absence of an interaction in a reference data set does not necessarily imply that the genes / proteins do not interact.

3.1.1 Experimental Data Sets

The experimental data sets used in this project were downloaded as tab delimited text files from four main sources as described in the following subsections.

3.1.1.1 Lee Data Sets

In 2004, Lee and co-workers (Lee 2004) published a highly cited paper on creating a functional interaction network for S. cerevisiae. In this paper data from 11 different data sets were combined to create one integrated functional network. The scored datasets are available for download as supplementary information [Lee Data]. Since these data sets have been well studied, it was decided to use them for the initial analysis in this project. Table 3-3 shows the 11 data sources used in the Lee study together with the number of genes and interactions per network.

Table 3-3: Individual Lee data sets

Data Set Name | Experiment Type | Number of Genes | Number of Interactions
Co-expression | High Throughput | 5158 | 90520
Co-citation | Composite data mining | 2397 | 17567
Gavin et al. (2002) | Tandem Affinity Purification | 1361 | 3221
Ho et al. (2002) | Tandem Affinity Purification | 1560 | 3588
Ito et al. (2000) | Yeast Two Hybrid | 1364 | 1482
Phylogenetic profile | Predictions of homology via sequence similarity | 2818 | 67421
Gene fusion | Composite experiments | 1297 | 3810
DIP small scale | Composite (high quality) | 1515 | 2822
Tong et al. (2001) | Synthetic Lethal | 195 | 275
Tong et al. (2002) | Yeast Two Hybrid + Predictions | 141 | 211
Uetz et al. (2000) | Yeast Two Hybrid | 934 | 854

[Lee Data] http://www.sciencemag.org/cgi/content/full/sci;306/5701/1555/DC1

3.1.1.2 MIPS Experimental Data Sets

Experimental yeast interaction data contained in the MIPS database [MIPS] (Güldener 2006) was downloaded from [MIPS FTP], and the file PPI_18052006.tab was used in this work. Each interaction has one or more experimental evidence codes associated with it (the codes are described in more detail in the evidencecat.scheme file also available from the FTP site). Sub-networks containing interactions with the same evidence code were created from the original data. The experimental evidence codes are hierarchical, with longer codes corresponding to more detailed descriptions. Initial work involving the calculation of network sizes and overlaps was performed to determine the most appropriate level of detail to use. The table below shows the names of the networks compiled from the MIPS interactions database using the evidence codes at the preferred level of detail, together with the number of genes and interactions in each network.

Table 3-4: MIPS Experimental data sets

Evidence Type | Number of Genes | Number of Interactions
Physical (IPI) | 308 | 262
Co-immunoprecipitation | 590 | 955
Affinity chromatography | 232 | 208
Centrifugation | 33 | 27
Gel retardation | 51 | 35
Cross linking | 31 | 21
In vitro reconstitution | 15 | 13
Two hybrid | 3979 | 7780
Overlay (e.g. Far-Western Blot) | 8 | 9
Genetic (IGI) | 197 | 150
Suppression | 415 | 421
Synthetic phenotype | 1410 | 5793
Genetic experiment type | 1045 | 4813
Experiment type | 2 | 1
Individual experiment | 694 | 878
High throughput experiment | 4129 | 11910

3.1.1.3 BioGRID Data Sets

The BioGRID database [BioGRID] (Stark 2006) contains both high throughput and human curated interaction data from the literature from many species including S. cerevisiae. It is updated monthly. S. cerevisiae data from BioGRID version 2.0.30 was used in this study. Results from both high-throughput and small scale studies are included in the database. For each interaction the name of the author of the original study and the experiment type is recorded. In version 2.0.30 of the database, there are 4855 contributing studies covering 22 experimental categories. Table 3-5 shows the number of studies, genes and reported interactions per experiment type.

Some of the interactions in BioGRID are from large scale experiments, and it was decided that it would be useful to create and analyse these networks separately. For example, as indicated in Figure 3-3, there are only 2 studies out of the 4855 in BioGRID with 10,000 or more interactions (0.04% of the total number of studies); 9 with 1,000 or more interactions (0.19% of the total number of studies); and 45 with 100 or more interactions (1% of the total number of studies). The remaining 99% of studies have fewer than 100 interactions, while 26% of studies have only one interaction. Consequently, it was decided to create one network per study if the study contained at least 100 reported interactions. The remaining data was compiled into one network per experiment type, making a total of 68 individual data networks. Please see the appendix in section 8.1 for details.

Table 3-5: BioGRID data sets by experiment type only

Experiment Type | Number of Contributing Studies | Number of Genes | Number of Interactions per Experiment Type
synthetic lethality | 996 | 2337 | 12257
affinity capture-ms | 255 | 3721 | 38767
affinity capture-western | 1563 | 1989 | 6553
dosage rescue | 1183 | 1689 | 3181
reconstituted complex | 894 | 1139 | 2155
synthetic growth defect | 544 | 1556 | 6285
synthetic rescue | 846 | 1200 | 2065
two-hybrid | 990 | 3102 | 8681
biochemical activity | 342 | 1722 | 5293
co-crystal structure | 66 | 118 | 134
far western | 24 | 53 | 43
FRET | 8 | 35 | 61
protein-peptide | 29 | 105 | 107
co-localization | 120 | 262 | 309
affinity capture-rna | 11 | 44 | 57
protein-rna | 8 | 17 | 10
co-purification | 256 | 825 | 1420
co-fractionation | 156 | 443 | 470
dosage lethality | 171 | 394 | 427
phenotypic enhancement | 904 | 1827 | 17619
phenotypic suppression | 286 | 1332 | 4600
dosage growth defect | 22 | 63 | 44

[MIPS FTP] ftp://ftpmips.gsf.de/yeast/PPI

Figure 3-3: Percentage of studies in BioGRID with at least the given number of interactions.
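The rule for splitting the BioGRID data (one network per study with at least 100 reported interactions, with the remainder pooled by experiment type) can be sketched in Python as below. This is not the custom code written for the project. The record layout, the function name and the decision to count raw records per study are assumptions made for illustration.

```python
from collections import defaultdict


def partition_biogrid(records, min_study_size=100):
    """Group interaction records into per-study and per-experiment networks.

    records: list of (gene_a, gene_b, study, experiment_type) tuples
             (a hypothetical layout, not the BioGRID file format).
    Studies with at least min_study_size records get their own network;
    all other records are pooled by experiment type.
    """
    by_study = defaultdict(list)
    for record in records:
        by_study[record[2]].append(record)

    networks = defaultdict(set)
    for study, study_records in by_study.items():
        is_large = len(study_records) >= min_study_size
        for gene_a, gene_b, _, experiment_type in study_records:
            key = ("study", study) if is_large else ("experiment", experiment_type)
            # Links are undirected, so store each pair in a canonical order.
            networks[key].add(tuple(sorted((gene_a, gene_b))))
    return dict(networks)
```

Each key of the returned dictionary names one network, so the number of keys corresponds to the number of individual data networks produced by the rule.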


3.1.2 Reference Data Sets

3.1.2.1 Kyoto Encyclopedia of Genes and Genomes

In this project two reference networks were compiled from KEGG PATHWAYS:

‘KEGG’: Genes on the same pathway are linked.
‘KEGG Direct’: Genes sharing the same enzyme classification, involved in the same reaction and on the same pathway are linked.

The KEGG Direct network is intended to reflect the explicit links in a pathway, and therefore contains fewer links than the KEGG network. For example if a pathway consisted of gene A linked to B and B linked to C then the undirected links A-B, B-C and A-C would be present in the KEGG network. However, the KEGG Direct network would only contain the undirected links A-B and B-C.
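The A, B, C example above can be reproduced with a short Python sketch. It is a simplification: the ‘KEGG Direct’ function simply takes the explicit links as input, and does not model the enzyme classification and reaction criteria. Both function names are hypothetical.

```python
from itertools import combinations


def same_pathway_links(pathway_genes):
    """'KEGG'-style network: link every pair of genes on the same pathway."""
    return {tuple(sorted(pair)) for pair in combinations(set(pathway_genes), 2)}


def direct_links(explicit_edges):
    """'KEGG Direct'-style network: keep only the explicit undirected links."""
    return {tuple(sorted(edge)) for edge in explicit_edges}


# Pathway in which A is linked to B, and B is linked to C.
kegg = same_pathway_links(["A", "B", "C"])
kegg_direct = direct_links([("A", "B"), ("B", "C")])

print(sorted(kegg))         # [('A', 'B'), ('A', 'C'), ('B', 'C')]
print(sorted(kegg_direct))  # [('A', 'B'), ('B', 'C')]
```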

Data was extracted from the KEGG database via the KEGG webservice interface using Taverna (Hull 2006) in June 2007. The KEGG webservice was used because it is well documented and reliable. Taverna provided a convenient framework to create the workflows to invoke the webservices and use custom text formatters to process the resulting output. See the Appendix section 8.2 for details.

3.1.2.2 Gene Ontology

To generate reference networks from the Gene Ontology database the GO terms were downloaded from [GO] on 16/05/2007 in file gene_ontology_edit.obo (Format Version 1.2, Revision: 5.326), and gene associations with GO terms were downloaded from [GO] on 19/05/2007 in file gene_association.sgd. Custom Java code was written to perform the following procedure for each root term (biological process, cellular component or molecular function):

1) An order of GO terms was created by performing a depth first search over the descendants of the root term. The first child of any term was just the first term found in the input file that was a child of the term being considered.

2) A ‘level’ was assigned to each GO term as follows:

Set the root term to level 0 (L = 0)
For L = 0 to 11
    (loop through list of GO terms)
        Find GO term P that has level L
        Set immediate children of P to have level L+1
    (end loop)
end

If a term had more than one immediate parent, the ‘level’ of the term became the deepest (largest) level assigned.

3) Each gene was associated with one GO term. If a gene was associated with more than one GO term in the original gene association file, then the ‘deepest’ GO term in the GO term order in (1) was used. If the gene was not associated with any GO term then it was associated with the root term.

4) A gene network was created by linking genes which share deepest GO terms of level N and below, where N = 5 – 9.
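Steps (1) and (2) of the procedure above can be sketched in Python as follows. This is an illustrative reconstruction and not the custom Java code written for the project. The `children` dictionary (term to ordered list of immediate children) and the function names are assumptions.

```python
def depth_first_order(children, root):
    """Return GO terms in depth first (pre-order) order starting at root."""
    order, seen, stack = [], set(), [root]
    while stack:
        term = stack.pop()
        if term in seen:
            continue
        seen.add(term)
        order.append(term)
        # Push children in reverse so that the first child is visited first.
        stack.extend(reversed(children.get(term, [])))
    return order


def assign_levels(children, root, max_level=11):
    """Assign a level to every term; with several parents the deepest wins."""
    level = {root: 0}
    for current in range(max_level + 1):
        # Snapshot the terms at this level before any levels are updated.
        parents = [term for term, lv in level.items() if lv == current]
        for parent in parents:
            for child in children.get(parent, []):
                # Later passes only assign larger values, so the deepest
                # level assigned to a term is the one that is kept.
                level[child] = current + 1
    return level


# Toy ontology: d is a child of both root and c.
toy = {"root": ["a", "d"], "a": ["c"], "c": ["d"]}
print(depth_first_order(toy, "root"))  # ['root', 'a', 'c', 'd']
print(assign_levels(toy, "root"))      # {'root': 0, 'a': 1, 'd': 3, 'c': 2}
```

In the toy ontology the term d is first given level 1 through the root and is later moved to level 3 through c, which matches the rule that a term with more than one immediate parent takes the deepest level assigned.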

http://www.geneontology.org/GO.current.annotations.shtml

http://www.geneontology.org/GO.downloads.ontology.shtml


3.1.2.3 MIPS

Reference networks were created from MIPS Enzymes and Complexes by downloading lists of genes with the same enzyme classification, or that are part of the same complex, from the website (April 2007) and using custom Java code to convert the information to a suitable format. Since the GO biological process, cellular component and molecular function networks had already been created, it was decided not to use the MIPS FunCat and Localisation networks as well, because these would contain similar information.

3.1.2.4 Size and Overlap of Reference Networks

The reference networks described above do not provide complete coverage of the S. cerevisiae genome, but rather provide particular views on interactions between different sub-sets of genes. Figure 3-4 shows the number of genes and unique interactions for each of the reference networks, and Figure 3-5 shows the number of genes in common between pairs of selected reference networks. Note that all of the reference networks were used for the initial analysis in chapter 4, but subsequent analysis only used the sub-set displayed in Figure 3-5. See 4.3.1 for details on how the sub-set was chosen.

[Plot: Size of Reference Gene Networks. Horizontal axis: Number of Genes (0 to 3500). Vertical axis: Number of Interactions (logarithmic scale, 1.E+00 to 1.E+06). Series: Biological Process GO Gene Network, Cellular Component GO Gene Network, Molecular Function GO Gene Network, KEGG, KEGG Direct, MIPS Complexes, MIPS Enzymes. Labelled points: BP(5), BP(6), BP(8), BP(9), CC(5), CC(7), MF(5), MF(9), KEGG, KEGG Direct, MIPS Enzymes, MIPS Complexes.]

Figure 3-4: Size of reference networks. The figures in brackets indicate the highest ‘level’ GO terms in the respective GO networks.

Figure 3-5: Heat map of the number of common genes between selected reference sets and tabulated values

 | KEGG | BP GO 6 | CC GO 5 | MF GO 5 | Complexes | Enzymes
KEGG | 1198 | 627 | 387 | 436 | 472 | 120
BP GO 6 | 627 | 2208 | 987 | 745 | 605 | 100
CC GO 5 | 387 | 987 | 2125 | 485 | 690 | 63
MF GO 5 | 436 | 745 | 485 | 1351 | 263 | 65
Complexes | 472 | 605 | 690 | 263 | 1228 | 29
Enzymes | 120 | 100 | 63 | 65 | 29 | 175


3.2 Computational Methods

3.2.1 Choice of Programming Languages

In order to research and develop network integration methods using the available data on a current desktop computer with a good specification (e.g. 2 Gbytes of RAM), the networks were represented in a computationally amenable form as adjacency matrices (see section 2.4). Since S. cerevisiae contains around 6700 genes, a full adjacency matrix containing double precision (64-bit, i.e. 8-byte) values would require around 0.36 Gbytes (6700 x 6700 x 8 bytes) of RAM just to hold a single network, so only a few full networks could be held in memory at the same time. Fortunately it was anticipated that many genes would not interact with each other (e.g. genes in completely different cellular compartments). Consequently, a sparse matrix format in which zero values are omitted was used to reduce the working memory requirements. Efficient matrix manipulation methods, compatible with the sparse matrix format, were also required.

To convert the basic text based interaction data, consisting of pairs of systematic gene names, into an adjacency matrix, each gene (node) is given a unique numerical index (see 3.2.2). A further refinement was to use a three dimensional data structure, with the third dimension representing the different types of link (e.g. experiment type or study name) as unique numerical link codes.
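The conversion from named gene pairs to indexed, sparsely stored links can be illustrated with the Python sketch below. It is not the project's Java or MATLAB code. Alphabetical ordering is used for the indices purely for simplicity, and the record layout, function name and link type labels are assumptions.

```python
def build_sparse_network(records):
    """Convert named interactions into indexed, sparsely stored links.

    records: list of (gene_a, gene_b, link_type) tuples.
    Returns (gene_index, link_code, sparse) where sparse maps an index
    pair (i, j) with i <= j to the set of link codes seen for that pair.
    Pairs with no interaction are simply absent, as in a sparse matrix.
    """
    genes = sorted({gene for a, b, _ in records for gene in (a, b)})
    gene_index = {gene: i for i, gene in enumerate(genes)}
    link_types = sorted({link_type for _, _, link_type in records})
    link_code = {link_type: k for k, link_type in enumerate(link_types)}

    sparse = {}
    for gene_a, gene_b, link_type in records:
        # Links are undirected, so store each pair once with i <= j.
        i, j = sorted((gene_index[gene_a], gene_index[gene_b]))
        sparse.setdefault((i, j), set()).add(link_code[link_type])
    return gene_index, link_code, sparse


# Hypothetical records; the link type labels are for illustration only.
records = [
    ("YCR088W", "YJL095W", "affinity"),
    ("YJL095W", "YCR088W", "synthetic lethal"),
]
gene_index, link_code, sparse = build_sparse_network(records)
print(sparse)  # {(0, 1): {0, 1}}
```

Only pairs with at least one reported link consume memory, which is the property that makes the sparse representation attractive when most gene pairs do not interact.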

Java was used for the initial data conversion from text based interaction files to gene indices and link codes because it provided convenient and robust text processing and sorting classes and methods. MATLAB [MATLAB] was chosen for the subsequent data analysis and integration development work because it offered excellent and efficient in-built sparse matrix operations, many mathematical and statistical functions, and useful image display functions. The custom Java and MATLAB code written as part of this project is included on the attached CD.

3.2.2 Ordering Genes by GO Terms

The first stage in creating an interaction network is to convert the gene names into numerical indices. A simple way to do this would be to assign indices according to the alphabetical order of the gene names. However, although the order of the nodes in the network is not important in a mathematical analysis, a biologically meaningful order could be useful in a visual analysis.

In section 3.1.2.2 the method of ordering GO terms by depth first search from the three root terms (biological process, cellular component, and molecular function) was described. Considering each root term in turn, genes associated with a descendant GO term of the root can be ordered in the same way as the GO terms themselves, with the caveats described in Table 3-6. So using the three GO root terms, three biologically meaningful gene orders were derived, giving three different sets of gene indices.

Table 3-6: Caveats to ordering genes by GO terms

Caveat | Condition | Description and Result
1 | Gene has more than one GO term descended from current root term | The gene is ordered by its ‘deepest’ GO term
2 | Gene has some GO terms, but none descended from current root term | The gene is associated with the current root and ordered accordingly
3 | Gene has no GO term annotations at all | The gene is added to the end of the list of genes and will get a unique index > 6300

[MATLAB] http://www.mathworks.com/

3.3 Evaluation Methods

3.3.1 Network Performance Metrics

To evaluate the quality of the links in a network, the network can be compared to another known network. The final integrated network was evaluated against a new functional network gold standard derived by Myers and co-workers (Myers 2006) using standard Precision-Recall and Receiver Operating Characteristic curves (e.g. see Deng 2003, Lu 2005, Reguly 2006, Myers 2006).

Assuming that the gold standard contains a set of interactions deemed to be ‘true’ and a set of interactions deemed to be ‘false’, the links in a probabilistic network with a weight greater than or equal to a particular threshold (T) were counted as true positives if they are ‘true’ links and as false positives if they are ‘false’ links. Similarly, the links with a weight below the threshold were counted as false negatives if they should represent ‘true’ links and as true negatives if they should represent ‘false’ links.

The number of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) were combined into the quantities Precision (also known as Positive Predictive Value), Recall (also known as Sensitivity) and Specificity.

Precision = TP / (TP + FP)
Recall or Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

By varying the threshold on the probabilistic network, Precision-Recall curves were plotted. The fraction of true positives ( TP / (TP + FN) ) against the fraction of false positives ( FP / (FP + TN) ), or equivalently Sensitivity against 1 - Specificity, was also plotted as a function of threshold in a Receiver Operating Characteristic [ROC] curve.
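The counting and the three metrics defined above can be sketched in Python as below. The function names and the toy data are hypothetical, and gold standard links that are missing from the network are assumed to have a score of zero. This is an illustration and not the evaluation code used in the project.

```python
def confusion_counts(scores, true_links, false_links, threshold):
    """Count TP, FP, FN and TN for one threshold.

    scores: dict mapping a link to its probability in the network.
    Links absent from scores are treated as having probability 0.0
    (an assumption made for this sketch).
    """
    tp = sum(1 for link in true_links if scores.get(link, 0.0) >= threshold)
    fp = sum(1 for link in false_links if scores.get(link, 0.0) >= threshold)
    fn = len(true_links) - tp
    tn = len(false_links) - fp
    return tp, fp, fn, tn


def precision_recall_specificity(tp, fp, fn, tn):
    """Apply the three definitions; undefined ratios are returned as NaN."""
    nan = float("nan")
    precision = tp / (tp + fp) if tp + fp else nan
    recall = tp / (tp + fn) if tp + fn else nan
    specificity = tn / (tn + fp) if tn + fp else nan
    return precision, recall, specificity


scores = {("A", "B"): 0.95, ("A", "C"): 0.6, ("B", "D"): 0.2}
true_links = {("A", "B"), ("A", "C"), ("C", "D")}
false_links = {("B", "D"), ("A", "D")}

for threshold in (0.1, 0.5):
    counts = confusion_counts(scores, true_links, false_links, threshold)
    print(threshold, counts, precision_recall_specificity(*counts))
```

Repeating the calculation over a range of thresholds gives the points of a Precision-Recall curve (precision against recall) and of a ROC curve (recall against 1 - specificity).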

3.3.2 Functional Prediction

The final integrated functional network was used to predict the function of a gene by assuming that the function of any given gene is related to the function of its interacting partners (Oliver 2000). For a gene of interest:

1) All the interaction partners with probability of interaction >= 0.9 were listed.

2) The deepest GO terms of the interaction partners were sorted by frequency of occurrence to find the most popular (i.e. frequent) GO terms (a separate list was created for each of the descendants of biological process, cellular component and molecular function).

3) The most popular GO term in each GO category, or alternatively the five most popular GO terms in each category, were used as the functional predictions.

The predictions from the final integrated functional network were validated by comparing the predictions from genes with known annotations to their actual deepest GO annotations.
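The prediction procedure described in this section amounts to a frequency count over the deepest GO terms of the high-probability partners. A minimal Python sketch for a single GO category is given below. The data structures, function name, gene labels and the placeholder term "GO:other" are assumptions for illustration.

```python
from collections import Counter


def predict_function(gene, partners, deepest_term, min_prob=0.9, top_n=1):
    """Predict GO terms for a gene from its interaction partners.

    partners:     dict gene -> dict of partner gene -> interaction probability.
    deepest_term: dict gene -> deepest GO term within one GO category.
    Returns the top_n most frequent terms among partners whose
    probability of interaction is at least min_prob.
    """
    counts = Counter(
        deepest_term[partner]
        for partner, prob in partners.get(gene, {}).items()
        if prob >= min_prob and partner in deepest_term
    )
    return [term for term, _ in counts.most_common(top_n)]


# Hypothetical network around an uncharacterised gene X.
partners = {"X": {"P1": 0.95, "P2": 0.92, "P3": 0.99, "P4": 0.50}}
deepest_term = {
    "P1": "GO:0030468",
    "P2": "GO:0030468",
    "P3": "GO:other",
    "P4": "GO:other",
}
print(predict_function("X", partners, deepest_term))  # ['GO:0030468']
```

Setting `top_n=5` gives the alternative form of the prediction in which the five most popular terms in a category are reported.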

http://en.wikipedia.org/wiki/Receiver_Operating_Characteristic


4 Results and Discussion

4.1 Introduction

One of the goals of systems biology is to gain an understanding of how the cell functions. As a step towards this goal the interactions between genes and / or their products are studied using a wide variety of experimental techniques, and systems biologists are interested in combining the results into ‘functional interaction networks’. The term functional interaction is a somewhat loosely defined concept. It could mean that one gene (or its product) affects the activity of the other gene or its product; that the gene products are involved in the same process, complex or metabolic pathway; or that the gene products have the same function (and so could substitute for each other). Given the inclusiveness of the term functional interaction, but the specificity of some of the experimental techniques currently used, the following questions arise:

Do data sets generated by different techniques measure different types of functional interaction?

How can different data sets best be integrated if they measure different types of functional interaction?

These questions are investigated in the following subsections 4.2 and 4.3 respectively, where methods are developed to perform the integration. Finally in section 4.4, the integrated functional network is evaluated against a newly derived standard in the literature. The ability of the network to make correct functional predictions is verified and functional predictions for un-annotated genes are made.

4.2 Interaction Types

It is hypothesised that different experimental techniques measure different types of functional interaction.

This issue is addressed by considering:

The extent of similarity between data sets generated by different experimental techniques. This is assessed using the individual data sources in the Lee data set, and also the MIPS and BioGRID data sets compiled by experimental type.

Differences in the type of interactions measured in the individual data sources in the Lee data set.

4.2.1 Similarity between data sets

To investigate whether different experimental techniques measure different types of functional interaction, the similarities and differences in number of interactions measured from the same subset of genes in different experiments are considered in this sub-section, and a statistical similarity measure is described.


4.2.1.1 Common interactions in individual Lee Data Sets

As an initial example of the differences between data generated using different experimental techniques, the Tandem Affinity Purification (Gavin 2002) and Synthetic Lethal (Tong 2001) data sets (from the Lee data superset) were compared. Of the 1391 genes measured in the Tandem Affinity Purification Gavin data set and 195 genes measured in the Synthetic Lethal Tong 2001 data set, there are only 64 genes common to both (see footnote 1). Figure 4-6 shows a superposition of subsets of the full adjacency matrices including only the 64 common genes. The red pixels represent interactions between genes in the Tong data set (i.e. where the Tong adjacency matrix = 1), and the yellow pixels represent interactions between genes in the Gavin data set (i.e. where the Gavin adjacency matrix = 1). White pixels represent interactions present in both the Gavin and Tong data.

Figure 4-6: Superimposed adjacency matrices of the Gavin 2002 and Tong 2001 data, restricted to the common sub-set of genes (nodes). Interactions from the Tong data are coloured red, Gavin interactions are yellow, and common interactions are white.

Figure 4-6 shows only 2 white pixels, meaning that there is only one interaction in common between the Gavin and Tong 2001 data sub-sets (since all links are assumed to be undirected, the adjacency matrices are symmetrical about the diagonal, so 2 pixels = 1 interaction). The common interaction is between YCR088W and YJL095W, and they are both involved with establishment of cell polarity (GO:0030468). There are 17 interactions in the Gavin data sub-set that are not in the Tong 2001 data sub-set, and 48 interactions in the Tong data sub-set that are not in the Gavin data sub-set. Hence out of a total of 17 + 48 + 1 = 66 interactions, only 1 interaction (1.5%) is shared by both sets. Figure 4-7 shows the percentage of shared interactions over common nodes between all pairs of data sets.

1 The sub-set of common genes between two networks is found by the intersection of each network’s list of genes. To guarantee that the gene was measured in the dataset, only those genes that have at least one interaction are included.
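The percentage of shared interactions over common nodes can be computed as in the Python sketch below. Networks are represented as sets of undirected links, and the common genes are those with at least one interaction in each network, as in footnote 1. The function name and the representation are assumptions, and this is not the MATLAB code used in the project.

```python
def shared_interaction_percentage(net_a, net_b):
    """Percentage of interactions shared by two networks over common genes.

    net_a, net_b: sets of frozenset({gene1, gene2}) undirected links.
    """
    # A gene counts as measured only if it has at least one interaction.
    genes_a = {gene for link in net_a for gene in link}
    genes_b = {gene for link in net_b for gene in link}
    common = genes_a & genes_b

    # Keep only the links whose genes are both in the common sub-set.
    sub_a = {link for link in net_a if link <= common}
    sub_b = {link for link in net_b if link <= common}

    total = len(sub_a | sub_b)
    if total == 0:
        return 0.0
    return 100.0 * len(sub_a & sub_b) / total


def link(gene1, gene2):
    return frozenset((gene1, gene2))


net_a = {link("A", "B"), link("B", "C"), link("C", "X")}
net_b = {link("A", "B"), link("A", "C"), link("C", "Y")}
print(round(shared_interaction_percentage(net_a, net_b), 1))  # 33.3
```

For the Gavin and Tong 2001 sub-sets described above the same ratio is 1 / (17 + 48 + 1), which is approximately 1.5%.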


Figure 4-7: Percentage of shared interactions between genes in common in pairs of experimental interaction data sets. The horizontal axis is the same as the vertical axis, hence the diagonal represents the 100% similarity between an experimental data set and itself.

In Figure 4-7 it is noticeable that most pairs of networks share less than 10% of the interactions between common nodes. The pairs that share less than 2% of interactions between common nodes are listed in Table 4-7.

Table 4-7: Dissimilarity of interactions between common nodes of experimental data sub-sets.

Network 1 | Dissimilarity (< 2 % shared interactions)
Co-expression | Ho (TAP); Ito (Yeast Two Hybrid); Tong 2002 (Yeast Two Hybrid); Uetz (Yeast Two Hybrid)
Gavin (Tandem Affinity Purification) | Tong 2001 (Synthetic Lethal)
Gene Fusion | Ho (TAP); Ito (Yeast Two Hybrid); Tong 2001 (Synthetic Lethal)
DIP small scale experiments | Tong 2001 (Synthetic Lethal)
Tong 2001 (Synthetic Lethal) | Gavin (TAP); Ho (TAP); Uetz (Yeast Two Hybrid); Tong 2002 (Yeast Two Hybrid); Gene Fusion; DIP small scale

From Table 4-7 it is clear that Yeast Two Hybrid experiments share less than 2% of the total interactions amongst genes in common with Co-expression, Gene Fusion or Synthetic Lethal experiments. Additionally, the Synthetic Lethal experiment was dissimilar to the Tandem Affinity Purification data sets, and dissimilar to the DIP small scale data set.

[Figure 4-7 axis labels, identical on both axes: Co-expression, Co-citation, Gavin, Ho, Ito, Phylogenetic Profile, Gene Fusion, DIP small scale, Tong 2001, Tong 2002, Uetz]


4.2.1.2 Network Similarity Measure

In order to test whether interaction networks are statistically similar to or different from each other, a network similarity measure (score) was developed. In this sub-section, the likelihood ratio concept introduced in section 2.4 is used to develop this measure, largely based on ideas by Dr. Malcolm Farrow (personal communication, see footnote 2). A likelihood is a measure of how well the data fit a hypothesis, and the ratio of likelihoods from two competing hypotheses can be used to decide between them.

Suppose that a network (G) of interactions between genes or proteins has already been measured (or compiled) and another network (D) has just been measured, and we wish to determine whether G and D describe the same set of interactions (Hypothesis 1) or not (Hypothesis 0). For example, if network G was a Gold Standard and network D was some experimental data, we might like to calculate the likelihood ratio of D representing some ‘true’ interactions or not.

As in section 4.2.1.1, representing each interaction network as an adjacency matrix, and considering only the nodes common to both networks, the number of interactions that are the same or different are:

Symbol | Meaning | Name
n1,1 | number of links in both D and G | True Positives (TP)
n1,0 | number of links in D that are not in G | False Positives (FP)
n0,1 | number of links in G that are not in D | False Negatives (FN)
n0,0 | number of links absent from both D and G | True Negatives (TN)
N1 | total number of positives in G = n1,1 + n0,1 | Number of positives (NP)
N0 | total number of negatives in G = n0,0 + n1,0 | Number of negatives (NN)

Table 4-8: Symbols for number of links

Note that the quantities in Table 4-8 were calculated efficiently in MATLAB; see the Appendix section 8.4.1 for details. Once the numbers of interactions have been counted, the probability that a link found in D is a ‘true positive’ link (i.e. also found in G) is estimated as:

‘true positive’ probability p1,1 = number of true positives / number of positives

p1,1 = n1,1 / N1

Similarly the other probabilities are:

‘false positive’ probability p1,0 = n1,0 / N0

‘false negative’ probability p0,1 = n0,1 / N1

‘true negative’ probability p0,0 = n0,0 / N0
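As an illustration only, the counts of Table 4-8 and the four probabilities can be sketched in Python. This is not the MATLAB code of Appendix section 8.4.1: networks are assumed to be given as lists of undirected node pairs, and the function names and toy node labels are hypothetical.

```python
from itertools import combinations


def count_links(links_d, links_g):
    """Count n11, n10, n01, n00 over the nodes common to D and G."""
    d = {frozenset(e) for e in links_d}
    g = {frozenset(e) for e in links_g}
    nodes_d = set().union(*d) if d else set()
    nodes_g = set().union(*g) if g else set()
    common = sorted(nodes_d & nodes_g)
    n11 = n10 = n01 = n00 = 0
    # Every unordered pair of common nodes is a potential link.
    for pair in combinations(common, 2):
        key = frozenset(pair)
        in_d, in_g = key in d, key in g
        if in_d and in_g:
            n11 += 1
        elif in_d:
            n10 += 1
        elif in_g:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def link_probabilities(n11, n10, n01, n00):
    """Probabilities as defined in the text, with N1 and N0 taken from G."""
    positives = n11 + n01   # N1
    negatives = n00 + n10   # N0
    return {
        "p11": n11 / positives,
        "p10": n10 / negatives,
        "p01": n01 / positives,
        "p00": n00 / negatives,
    }


# Toy networks over the nodes a, b, c, d.
data = [("a", "b"), ("b", "c"), ("a", "d"), ("c", "d")]
gold = [("a", "b"), ("a", "c"), ("c", "d")]
counts = count_links(data, gold)
print(counts, link_probabilities(*counts))
```

The sketch assumes that N1 and N0 are both non-zero; a pair of networks without common links or common non-links would need separate handling.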

Hence:

The likelihood that data D measures ‘true’ links (Hypothesis 1) is: L1 = p1,1 / p0,1

The likelihood that data D measures ‘false’ links (Hypothesis 0) is: L0 = p1,0 / p0,0

Consequently, the likelihood ratio for D measuring true links as opposed to false links, given that ‘truth’ is represented by G, is:

$$\Lambda = \frac{L_1}{L_0} = \frac{p_{1,1}\, p_{0,0}}{p_{0,1}\, p_{1,0}} = \frac{n_{1,1}\, n_{0,0}}{n_{1,0}\, n_{0,1}}$$

Notice that it would not make any difference to the likelihood ratio if D was assumed to represent the truth, and G was the measurement, because the products n1,1n0,0 and n1,0n0,1 would remain the same. Consequently the likelihood ratio Λ can also be used as a similarity measure between any two networks A and B, and can be thought of as measuring the likelihood that network A matches network B, as opposed to the likelihood that A does not match B. Finally, to avoid excessively large or small numbers, the natural log (ln) of the likelihood ratio was used as the similarity measure between networks.

2 “Notes from Dec 13th Meeting”, personal communication, M. Farrow, Newcastle (2006)
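The symmetry argument above can be checked numerically. The following Python sketch assumes that the likelihood ratio is the ratio of the products n1,1n0,0 and n1,0n0,1 named in the text; the function name and the example counts are hypothetical, and raising an error for zero counts is a choice made for this sketch only.

```python
import math


def log_likelihood_ratio(n11, n10, n01, n00):
    """Natural log of (n11 * n00) / (n10 * n01)."""
    if min(n11, n10, n01, n00) <= 0:
        # A zero count makes the ratio zero or undefined; handled elsewhere.
        raise ValueError("all four counts must be positive")
    return math.log((n11 * n00) / (n10 * n01))


# Swapping the roles of D and G exchanges n10 and n01 only.
score_d_vs_g = log_likelihood_ratio(10, 5, 4, 200)
score_g_vs_d = log_likelihood_ratio(10, 4, 5, 200)
print(score_d_vs_g, score_g_vs_d)
```

Because only the product n1,0n0,1 appears in the denominator, the two scores are identical.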

4.2.1.3 Similarity and differences between data from different types of experiment

To test whether other data sources also show that interaction networks resulting from different types of experiments are dissimilar, the MIPS experimental data and BioGRID data sets were used. As described in 3.1.1.2, the interactions in both the MIPS experimental database and the BioGRID data are classified by experimental type, hence networks per experimental type were created (see Table 3-4 and Table 3-5). Note that each network could contain links from a variety of studies.

In the previous sub-section, a log-likelihood ratio score calculated over the common nodes was proposed as a similarity measure between pairs of networks. The log-likelihood ratio score (ln Λ, where Λ is the likelihood ratio) is a good metric to use because it is symmetric (ln Λ(A,B) = ln Λ(B,A)) and accounts for the number of interactions that are the same (n1,1), the number of interactions that are different (n1,0, n0,1), and the number of interactions absent in both networks (n0,0).

A log-likelihood ratio score of 0 means that the networks are neither similar nor dissimilar, a large positive score means that the networks are very similar and a large negative score means that the networks are very dissimilar. Figure 4-8 and Figure 4-9 show a heat map image of the log-likelihood ratio score between pairs of networks from the MIPS and BioGRID databases respectively. The log-likelihood ratio has only been calculated over the nodes (genes) common to both.

Additionally, dendrograms of experiment types, grouping similar scoring networks from MIPS or BioGRID together, were created using seqlinkage, the built-in phylogenetic tree function from the MATLAB Bioinformatics Toolbox. To use seqlinkage, the log-likelihood ratio score was converted into a simple ‘distance’ measure, d.

The dendrograms corresponding to the log-likelihood ratio distance (d) are also displayed on the left hand side of each log-likelihood heat map in Figure 4-8 and Figure 4-9. To try to discount the effect of small networks that may have a very small overlap with other data sets, only those BioGRID networks with more than 1000 links per experiment type have been included in this analysis. However, since most of the MIPS networks have fewer than 1000 links per experiment type, all of the MIPS data was used.

Figure 4-8: Groupings of MIPS experimental data networks according to the log-likelihood ratio score over sub-sets of common genes.

Figure 4-9: Groupings of BioGRID data networks according to the log-likelihood ratio score over sub-sets of common genes.

In the results from both the MIPS and BioGRID data, it can be seen that the networks of physical interactions (highlighted in blue) form a separate group from those containing genetic interactions (highlighted in green). Since Figure 4-8 and Figure 4-9 show that similar types of experiment give rise to similar interaction networks, it is inferred that different types of experiment tend to have less similar interaction networks.

[Figure 4-8 labels, as extracted: Affinity Chromatography, Retardation, Cross Linking, Physical interaction (general), In vitro reconstitution, Genetic interaction (general), Co-immunoprecipitation, Yeast two hybrid, Individual experiment, Experiment type, Centrifugation, High throughput experiment, Genetic experiment (other), Genetic suppression, Synthetic phenotype, Overlay (Far western Blot)]

[Figure 4-9 labels, as extracted: Synthetic Growth Defect, Phenotypic Enhancement, Synthetic Lethality, Phenotypic Suppression, Biochemical Activity, Synthetic Rescue, Dosage Rescue, Affinity Capture-ms, Co-purification, Reconstituted Complex, Affinity Capture-western, Yeast Two Hybrid]


4.2.2 Types of functional interactions present in data from different experiment types

In the previous sub-section, the similarities and differences between the number of common interactions in various types of experimental data networks were explored. In this sub-section the type of functional interactions present in the individual networks used by Lee et al are investigated.

To gain a visual appreciation of the interactions between genes, ‘compressed’ adjacency matrices were used instead of the usual ball-and-spoke network images. Figure 4-10 shows the Gavin 2002 network in the form of a compressed image of the adjacency matrix with the genes ordered by biological process (see 3.2.2). The compressed image was created by condensing the 6674 x 6674 adjacency matrix into a 256 x 256 image array. Each pixel of the compressed image is the sum of a 27 x 27 square of the adjacency matrix divided by the number of elements (27 x 27 = 729). If the 27 x 27 genes formed a fully connected cluster (clique) then the value of the equivalent pixel in the compressed image would be 1. The colour scale used in the compressed image is displayed on the bar to the right hand side, with blue representing the minimum value in the compressed matrix (usually 0) and red representing the maximum value in the compressed matrix (0.21 in the case of Figure 4-10).
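The compression step can be sketched as block averaging. The Python below is an illustration rather than the code used for Figure 4-10: the function name is hypothetical, the matrix is a small toy example, and edge blocks that extend past the matrix are assumed to be padded with zeros.

```python
def compress(adjacency, block):
    """Average a square adjacency matrix over block x block squares."""
    n = len(adjacency)
    size = -(-n // block)                      # ceiling division
    image = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(adjacency):
        for j, value in enumerate(row):
            if value:
                image[i // block][j // block] += value
    area = block * block                       # e.g. 27 x 27 = 729 in the text
    return [[total / area for total in image_row] for image_row in image]


# Toy 4 x 4 matrix compressed with 2 x 2 blocks. The top-left block is
# completely filled, so its pixel value is 1.
toy = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
print(compress(toy, 2))
```

A pixel value of 1 corresponds to a completely filled block, and sparse blocks give values close to 0.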

Figure 4-10: Compressed adjacency matrix of Gavin 2002 data, genes ordered by Biological Process

Since the genes in Figure 4-10 have been ordered according to biological function via their deepest GO terms (see 3.2.2), interacting genes with similar biological function will form clusters along the diagonal. Consequently, interactions along the diagonal can be thought of as having a ‘type’ corresponding to the shared GO term of their interacting genes.


To provide a visual indication of the different types of interactions present in experimental data networks according to the biological process, cellular component or molecular function GO terms, the values of the pixels along the diagonals of the compressed image adjacency matrices were extracted from all the individual data sets used by Lee and co-workers (Lee 2004, see 3.1.1.1). These values indicate the number of interacting genes that share related deepest GO terms. The values are normalised for display purposes, so that the maximum value per profile is 1, and are plotted in Figure 4-11 a-c.
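The profile extraction can be sketched in a few lines of Python. The function name and the toy image are hypothetical, and an all-zero diagonal is returned unchanged, which is an assumption of this sketch.

```python
def diagonal_profile(image):
    """Take the diagonal of a compressed image and scale its maximum to 1."""
    diagonal = [image[i][i] for i in range(len(image))]
    peak = max(diagonal, default=0.0)
    if peak == 0:
        return diagonal
    return [value / peak for value in diagonal]


# Toy 3 x 3 compressed image.
toy_image = [
    [0.5, 0.25, 0.0],
    [0.25, 0.125, 0.0],
    [0.0, 0.0, 0.25],
]
print(diagonal_profile(toy_image))
```

Because each profile is scaled by its own maximum, profiles from different data sets can be compared by shape but not by absolute height.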

[Figure 4-11 a labels. Data sets: Uetz (Y2H), Tong 2002 (Y2H), Tong 2001 (SL), DIP small scale, Gene fusion, Phylogenetic profile, Ito (Y2H), Ho (TAP), Gavin (TAP), Co-citation, Co-expression. Annotated GO terms: GO:0006414 translational elongation; GO:0006511 ubiquitin-dependent protein catabolic process; GO:0000398 nuclear mRNA splicing, via spliceosome; GO:0045041 protein import into mitochondrial intermembrane space; GO:0045143 homologous chromosome segregation; GO:0000282 cellular bud site selection; GO:0000723 telomere maintenance; GO:0000910 cytokinesis; GO:0030011 maintenance of cell polarity; GO:0042254 ribosome biogenesis and assembly]

Figure 4-11 a: Compressed Profiles of 11 Data sets, genes ordered by Biological Process

Figure 4-11a (compressed profiles with genes ordered by biological process) shows that the Co-expression data and the Phylogenetic profile data share a large cluster of interacting genes relating to nuclear mRNA splicing, but no such cluster can be seen in the Yeast Two Hybrid data sets (Ito, Uetz, Tong 2002). When the data sets are viewed ordered by cellular component, most data sets have interactions between genes associated with ribosomes (left hand side of Figure 4-11b), and Co-expression also has a large cluster of interactions between genes associated with the nucleolus. In terms of molecular function, the Tandem Affinity Purification data sets (Gavin and Ho), the Yeast Two Hybrid data sets (Ito and Uetz), and DIP small scale data sets have a medium sized peak indicating interactions between genes with an RNA binding function. The Co-expression and Tandem Affinity Purification data sets (Gavin and Ho) show many interactions between genes annotated with an mRNA binding function.


Figure 4-11 b: Compressed Profiles, genes ordered by Cellular Component

[Figure 4-11 b annotated GO terms: GO:0005737 cytoplasm; GO:0005886 plasma membrane; GO:0005811 lipid particle; GO:0005759 mitochondrial matrix; GO:0005730 nucleolus; GO:0005840 ribosome; GO:0005762 mitochondrial large ribosomal subunit; GO:0005842 cytosolic large ribosomal subunit]

Figure 4-11 c: Compressed Profiles, genes ordered by Molecular Function

[Figure 4-11 c annotated GO terms: GO:0003674 molecular function; GO:0004842 ubiquitin-protein ligase activity; GO:0004596 peptide alpha-N-acetyltransferase activity; GO:0004674 protein serine/threonine kinase activity; GO:0003977 UDP-N-acetylglucosamine diphosphorylase activity; GO:0005198 structural molecule activity; GO:0003723 RNA binding; GO:0003729 mRNA binding; GO:0000182 rDNA binding; GO:0030533 triplet codon-amino acid adaptor activity]


The compressed profiles of Figure 4-11a-c provide an indication of clusters of interacting genes that share a similar biological process, cellular component or molecular function. To provide a more detailed examination of the differences between experiments, the number of interactions that share a GO term was considered. For each experiment a ‘GO profile’ is calculated. Here a ‘GO profile’ consists of the number of interactions between genes that share ‘deepest’ GO terms according to one of the three GO term orders. To elucidate the differences between the experimental data sets, the top 10 most populated GO terms in the GO profiles were compared. Table 4-9 shows which biological process GO terms were present in the top 10 GO terms from each experimental data set.
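A ‘GO profile’ of this kind can be sketched as a simple tally. The Python below is illustrative only: it assumes one ‘deepest’ GO term per gene for a given GO term order, and the function names and gene labels are hypothetical (the two GO identifiers are taken from the figures above).

```python
from collections import Counter


def go_profile(links, deepest_term):
    """Count interactions whose two genes share the same deepest GO term."""
    profile = Counter()
    for gene_a, gene_b in links:
        term_a = deepest_term.get(gene_a)
        term_b = deepest_term.get(gene_b)
        if term_a is not None and term_a == term_b:
            profile[term_a] += 1
    return profile


def top_terms(profile, k=10):
    """The k most populated GO terms in a profile."""
    return [term for term, _ in profile.most_common(k)]


# Toy annotation and network.
deepest = {
    "g1": "GO:0006414", "g2": "GO:0006414", "g3": "GO:0006414",
    "g4": "GO:0042254", "g5": "GO:0042254",
}
links = [("g1", "g2"), ("g2", "g3"), ("g1", "g3"), ("g4", "g5"), ("g1", "g4")]
profile = go_profile(links, deepest)
print(profile, top_terms(profile))
```

A Y or n entry for a data set then corresponds to whether a given term appears in the list returned by top_terms for that data set.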

Table 4-9: Biological Process GO terms from shared interactions in 11 data sets. Y means term is present in top 10 terms, n means term is not present in top 10 terms.

GO ID | GO Description | Co-exp. | Phylo. | Co-cit. | Gene Fus. | Tong ‘01 (SL) | DIP | Gav. (TAP) | Ho (TAP) | Ito (Y2H) | Uetz (Y2H) | Tong ‘02 (Y2H)
GO:0006414 | translational elongation | Y | Y | Y | Y | n | Y | Y | Y | Y | Y | n
GO:0006511 | ubiquitin-dependent protein catabolic process | n | n | Y | Y | n | Y | Y | n | Y | Y | n
GO:0000398 | nuclear mRNA splicing, via spliceosome | Y | Y | n | n | n | n | Y | n | n | n | n
GO:0006365 | 35S primary transcript processing | Y | Y | n | n | n | n | n | n | n | n | n
GO:0000282 | cellular bud site selection | n | n | n | Y | n | n | n | n | n | n | n
GO:0042254 | ribosome biogenesis and assembly | Y | Y | Y | n | n | Y | Y | Y | Y | Y | n
GO:0000723 | telomere maintenance | n | Y | Y | Y | n | Y | n | n | n | n | Y
GO:0007047 | cell wall organization and biogenesis | n | n | Y | n | Y | Y | n | Y | Y | Y | n
GO:0000910 | cytokinesis | n | Y | n | n | n | n | Y | Y | Y | n | n

Reassuringly, most of the experimental data sets seem to contain many interactions between genes pertaining to translational elongation or to ribosome biogenesis and assembly (pink boxes). Interestingly, none of the other GO terms that appear in the top 10 GO terms for at least two of Ho, Ito and Uetz protein-protein interaction experiments (blue boxes) appear amongst the top 10 terms in Co-expression. Similarly Co-expression and Phylogenetic profile both have mRNA splicing and 35S primary transcript processing GO terms in their top 10 (green boxes), whereas none of the Yeast Two Hybrid experiments do.


4.2.3 Summary

In this section, the similarities and differences between interactions in the data sets used by Lee et al. and composite datasets from MIPS and BioGRID databases have been investigated.

Firstly, it was found that some pairs of Lee data sets shared less than 2% of the interactions between genes that were common to both sets. In particular from the Lee data it was found that the Yeast Two Hybrid sets share very few interactions with the Co-expression, Gene Fusion or Synthetic Lethal experiments; and the Synthetic Lethal experiment shared very few interactions with the Tandem Affinity Purification and DIP small scale data sets. Co-expression, Gene Fusion and Synthetic Lethal experiments measure various types of genetic interaction, while Yeast Two Hybrid and Tandem Affinity Purification measure types of physical protein-protein interaction. These results imply that genetic and physical interaction experiments measure different interactions between the same set of genes or their products.

Secondly, to verify the initial results from the Lee data set, networks of interactions from the same type of experiment from both MIPS and BioGRID databases were examined. A log-likelihood ratio method to measure the similarity between two networks was developed, which has the useful property of being symmetric (the log likelihood score of network A vs B is the same as B vs A) and accounts for the number of interactions that are the same and different. Each type of MIPS experimental data was scored against all the other types of MIPS experiment data using the log-likelihood ratio method. Also data from BioGRID was compiled into networks by experimental type, and those networks with more than 1000 links were scored against each other. The matrix of log-likelihood ratio scores from each database was converted into a distance measure. The distance measure was used to create dendrograms and thus provide a means to cluster similar scoring networks. These dendrograms confirmed that networks from physical interaction experiments were similar to each other, and that networks from genetic interaction experiments were similar to each other. The results also implied that interaction networks from different experimental techniques are dissimilar.

Thirdly, the types of functional interaction present in the Lee data sets were considered by finding interacting genes with shared GO terms. The results showed that the types of interaction exhibited were similar between the Yeast Two Hybrid and Tandem Affinity Purification data sets. It was also found that the Yeast Two Hybrid experiments showed interactions between genes with markedly different GO term annotations from those in the Co-expression data set. These results again confirmed that physical and genetic interaction experiments measure different types of functional interaction.

Taking all of the results in this section together, it is concluded that different experimental techniques do indeed highlight qualitatively different types of interaction.


4.3 Network Integration

In the previous section it was shown that different types of experiment measure different types of interaction amongst the same sub-set of genes. Consequently, combining interaction networks from diverse experimental types will enable a more comprehensive integrated network of functional interactions to be formed than can be provided by any individual experiment type. The purpose of the work reported and discussed in this section is to investigate the question:

How can diverse data sets be integrated?

The main problem with integrating diverse data sets is how to judge their relative importance and quality. In section 4.2.1, a log-likelihood ratio was used as a means to measure the similarity between two networks, while in previous work by several groups (Deng 2003, Jensen 2003, Lee 2004, Yamanishi 2004, Myers 2006, Keimer 2007), data sets have been given a quality score by comparing the measured interactions with known interactions. Typically, either a single Gold Standard (GS) set of interactions is used; or alternatively two Gold Standards are used - one containing links between genes that are known to interact (used to measure positive links), and one containing links between genes that are assumed not to interact (negative links). Gold Standard interaction sets are usually obtained from highly curated databases such as KEGG, MIPS and GO. Notably, Myers et al. (Myers 2006) compiled their own Gold Standard based on GO because they found that scoring experimental data sets against KEGG PATHWAYS and a Biological Process GO Network gave quite different results.

Since networks from manually curated databases are intended to represent particular types of interaction (e.g. KEGG PATHWAYS captures interactions between genes on the same pathway, and Biological Process GO captures genes annotated with the same biological process etc), it is hypothesised that several of these ‘Gold Standard’ networks are needed to score different aspects of the interaction data. For example, it might be that one data set scores highly against one particular ‘Gold Standard’ (GS) because the experimental technique used to gather the data strongly reflects the particular type of interactions described by the GS. This same data set could score weakly against another GS, because the second GS represents a different type of interaction. However, both GS could be equally valid reflections of reality.

In this section, the choice of reference networks is investigated and a method for Bayesian data fusion is developed.


4.3.1 Choice of Reference Networks

4.3.1.1 Candidate Reference Networks

The purpose of a reference network is to provide some information on currently known interactions, which can be used to validate an experimental data set. In this part of the project, a set of reference networks is sought, each characterising a different type of functional interaction. Therefore the reference networks should be as different from each other as possible, although they do not necessarily have to be non-overlapping or orthogonal. Importantly, the set of reference networks should represent a set of biologically meaningful interactions, for example links between genes / proteins that:

- Are in the same pathway
- Are in the same cellular compartment
- Are involved in the same biological process
- Have the same molecular function
- Are part of the same complex

Consequently, the following reference data sets were considered as candidate reference networks (see 3.1.2 for a full description):

Reference Data | Network | Link Types
KEGG | KEGG (Pathways) | Genes in the same pathway are linked
KEGG | KEGG 2 (Direct) | Genes are linked if: genes have the same Enzyme classification AND the enzymes are associated with the same Reaction AND the reaction is on the same Pathway
GO | BP GO X | Biological Process, X = Levels 5 to 9
GO | CC GO X | Cellular Component, X = Levels 5 to 7
GO | MF GO X | Molecular Function, X = Levels 5 to 9
MIPS | MIPS Complexes | Genes linked if part of same complex
MIPS | MIPS Enzymes | Genes linked if have same enzyme classification

Table 4-10: Candidate Reference Networks

To select a suitable set of reference networks from the candidate list in Table 4-10, two criteria were considered:

1. Choose reference networks that are dissimilar to each other (have low similarity scores against each other);

2. Choose reference networks that give a range of similarity scores over the experimental data sets

The underlying assumptions behind these choices are that:

- Different experimental techniques measure different types of links
- Several data sets spanning a range of experimental techniques should be used
- Similarity scores between networks are calculated using the Log Likelihood Ratio over nodes common to both networks


4.3.1.2 Difference between reference networks

Similarity between pairs of reference networks has previously been measured using log-likelihood ratios over the genes they have in common. The resulting matrix of log-likelihood ratios is displayed in the form of a heat map in Figure 4-12. Dark colours represent dissimilar networks, and white represents very similar networks. The log-likelihood data has been thresholded for display purposes so that log-likelihoods greater than 16 appear white (the self against self values on the diagonal are infinite), and those less than –3 appear black. The order of the networks is the same horizontally as vertically, but for clarity only some networks are labelled on the horizontal axis.
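The display thresholding described above amounts to clipping each score to a fixed range before it is mapped to a colour. A minimal Python sketch follows; the function name and the toy score matrix are hypothetical, and only the limits of -3 and 16 come from the text.

```python
import math


def clip_for_display(scores, low=-3.0, high=16.0):
    """Clip log-likelihood ratio scores to [low, high] for a heat map."""
    return [[min(max(value, low), high) for value in row] for row in scores]


# Toy matrix: self-against-self scores are infinite and map to the top value.
toy_scores = [
    [math.inf, -5.0],
    [2.5, 20.0],
]
print(clip_for_display(toy_scores))
```

Clipping changes only the displayed colour range; the underlying scores are left as calculated.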

Figure 4-12: Log-likelihood score heat map for reference networks

The most obvious feature of Figure 4-12 is the presence of bright blocks down the diagonal, which show that networks derived from the same reference source score highly against each other. This observation is entirely unsurprising because, within a block, the later listed networks are sub-sets of their predecessors, e.g. BP GO 9 is a subset of BP GO 5. However, more interesting relationships are found between networks from different reference sources. For example, MIPS Enzymes scores quite highly against MF GO 6 but has a very low score against MF GO 7. The very low score of MIPS Enzymes against MF GO 7 occurs because there is only one common node. Note that to avoid infinities, a score of 0 is given to pairs of networks that do not have any nodes in common. A summary of the findings for each type of network is displayed in Table 4-11.

[Figure 4-12 labels. Labelled on the horizontal axis: Kegg, BP GO 5, CC GO 5, MF GO 5, MIPS Complexes. Vertical axis: Kegg, Kegg2, BP GO 5, BP GO 6, BP GO 7, BP GO 8, BP GO 9, CC GO 5, CC GO 6, CC GO 7, MF GO 5, MF GO 6, MF GO 7, MF GO 8, MF GO 9, MIPS Complexes, MIPS Enzymes. Colour scale: Dis-similar to Very similar]


Reference Network | Similar | Not very similar | No nodes in common
KEGG | KEGG 2; BP GO 9; MIPS Complexes | - | CC GO 7
KEGG 2 | KEGG; MF GO 5 - 7 | CC GO 6 - 7; MF GO 9 | -
BP GO 5 | BP GO 6 - 9 | CC GO 5; MF GO 8; MIPS Enzymes | -
CC GO 6 - 7 | CC GO 5; MF GO 9 | KEGG 2; MIPS Enzymes | KEGG (with 7)
MF GO 8 | MF GO 5 - 7 | BP GO 5; CC GO 5 | MIPS Enzymes
MIPS Complexes | KEGG | MIPS Enzymes | -
MIPS Enzymes | MF GO 6 | BP GO 5; CC GO 5 - 6; MF GO 7 | CC GO 7; MF GO 8 - 9

Table 4-11: Similar and dissimilar reference networks

Figure 4-13: Dendrogram showing relationship between reference networks

[Figure 4-13 leaf labels: MIPS Enzymes, CC GO 7, CC GO 6, CC GO 5, MF GO 9, MF GO 8, MF GO 7, MF GO 6, MF GO 5, BP GO 9, BP GO 8, BP GO 7, BP GO 6, BP GO 5, MIPS Complexes, KEGG 2, KEGG]


The relationship between the reference networks is also shown in a dendrogram in Figure 4-13. The dendrogram was created using the method described in section 4.2.1 by converting the log-likelihood ratio similarity score into a distance measure. The dendrogram shows that there are 5 main groups of reference network : KEGG & Complexes; Biological Process GO; Molecular Function GO; Cellular Component GO; and MIPS Enzymes.

Since the networks within each group are similar, one network from each group was selected for further analysis. The number of interacting genes in each network was a factor in the selection: a network with too few interacting genes may not have sufficient overlap with an experimental data set to be of any use in validating it. However, a network containing large fully connected clusters of interacting genes may not be specific enough to provide a good measure of experimental false positives. Considering the dissimilar networks shown in Table 4-11, and the sizes of the networks in Figure 3-4, the short list of candidate reference networks is:

- KEGG (Pathways)
- Biological Process GO 6
- Cellular Component GO 5
- Molecular Function GO 5
- MIPS Enzymes

KEGG (Pathways) was chosen above KEGG 2 (Direct) because KEGG is closer in size to the other networks. KEGG is preferred over MIPS Complexes because KEGG represents links between proteins on the same pathway, and pathway information is not well represented by the other groups of networks. Biological Process GO 6, Cellular Component GO 5 and Molecular Function GO 5 were chosen from the GO networks because their size provides a good compromise between potential overlap with experimental data and describing specific interactions.

4.3.2 Data scores against Reference Networks

In the previous sub-section, five candidate reference networks were shortlisted for potential inclusion in the set of reference networks for the data integration. In this sub-section, the ability of the 5 candidate reference networks to distinguish between different experimental data sets will be investigated, and the final selection of reference networks will be made.

4.3.2.1 Data score diversity using Reference Networks

To gain an understanding of the diversity of log-likelihood scores on experimental data by the reference networks, the scores of all of the individual data networks in the Lee, MIPS and BioGRID data sets were calculated. The BioGRID data was split into networks by study, but smaller studies were combined into data sets by experimental type (see section 3.1.1.3 and the Appendix section 8.1). Figure 4-14 shows the log-likelihood ratios for those experimental data sets with absolute values of the log-likelihood ratio less than 7.0.


[Figure 4-14. Title: Log-Likelihood Ratios for Selected Data Networks. Vertical axis: Log-Likelihood Ratio (-2 to 7). Horizontal axis: Experiment, grouped as Lee Data, BioGRID by Study, BioGRID by Small Study Experiment Type, MIPS. Legend: KEGG, BP GO 6, CC GO 5, MF GO 5, Enzymes. Annotated data sets: Collins (Affinity Capture), Pan (Synthetic Lethality), Other (Co-crystal structure), Co-immunoprecipitation]

Figure 4-14: Log-likelihood ratios of experimental data sets against the candidate reference networks

Figure 4-14 indicates that there may be some correlation between the scores from the experimental data sets against different reference networks. To try to identify a non-redundant set of reference networks, Principal Component Analysis (PCA) was used. PCA is a standard mathematical technique used to find the most important orthogonal directions in a multidimensional data set (e.g. see Shlens 2005). Here, PCA was performed using the MATLAB singular value decomposition function svd [MATLAB svd]. The input to svd is a matrix, each column of which represents one vector of log-likelihood ratio scores per reference network, while each element of the vector corresponds to an experimental data set. The output of svd is a column matrix of eigen vectors and a set of weights indicating the importance of each eigen vector. The elements of each column correspond to the weight of the reference networks contributing to that eigen vector. The columns output from svd are arranged in order of importance: the first column corresponds to the eigen vector with the highest (absolute) eigen value, and represents the direction of maximum variation in the data.
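The idea of a direction of maximum variation can be illustrated without svd. The Python sketch below swaps in power iteration on the covariance matrix, which recovers only the first eigen vector; it is not the MATLAB procedure used in this work. Mean-centring of the columns is an assumption of the sketch, and the function names and the toy scores are hypothetical.

```python
import math


def centre_columns(matrix):
    """Subtract the column mean from every column (assumed preprocessing)."""
    rows, cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / rows for j in range(cols)]
    return [[row[j] - means[j] for j in range(cols)] for row in matrix]


def leading_direction(matrix, iterations=200):
    """Approximate the first principal direction by power iteration."""
    x = centre_columns(matrix)
    cols = len(x[0])
    # Unnormalised covariance matrix X^T X (cols x cols).
    cov = [[sum(row[i] * row[j] for row in x) for j in range(cols)]
           for i in range(cols)]
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(cols)) for i in range(cols)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v


# Rows are experimental data sets, columns are three reference networks.
toy_scores = [
    [3.0, 0.1, 0.0],
    [-3.0, -0.1, 0.1],
    [2.0, 0.0, -0.1],
    [-2.0, 0.0, 0.0],
]
direction = leading_direction(toy_scores)
dominant = max(range(len(direction)), key=lambda i: abs(direction[i]))
print(direction, dominant)
```

In this toy example almost all of the variation lies in the first column, so the first reference network has the largest absolute weight in the leading direction.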

Figure 4-15 shows a heat map of the output matrix of eigen vectors of the log-likelihood ratio scores of the experimental networks depicted in Figure 4-14. The first three eigen vectors are highlighted in green. If each reference network represented a unique direction in the log-likelihood ratio score data, then each eigen vector would represent only one of the reference networks, hence each column in the heat map would have one high value element (white) and all the other elements would have a low value (black). Since the reference networks do not represent completely different aspects of the data, the columns contain more than one high element. Nevertheless, it can be seen that the KEGG reference network is the most important in Eigen Vector 1 (EV 1), MIPS Enzymes is the most important in the second eigen vector (EV 2) and Molecular Function GO 5 is the most important in the third eigen vector (EV 3).

[MATLAB svd] http://www.mathworks.com/access/helpdesk/help/techdoc/ref/svd.html

Figure 4-15: PCA results from log-likelihood ratio scores of selected experimental data

To check the robustness of the PCA results, the analysis was repeated using the log-likelihood ratio scores of all 95 experimental data sets from Lee, MIPS and BioGRID data sets. Additionally, PCA analysis was performed excluding each of the reference networks in turn. Table 4-12 and Table 4-13 summarise which reference network contributed most to each eigen vector, while the number of times each reference network contributes the most to each eigen vector is displayed in Table 4-14.

| Reference Networks | EV 1 | EV 2 | EV 3 | EV 4 | EV 5 |
|---|---|---|---|---|---|
| Missing KEGG | MF | E | BP | CC | - |
| Missing BP GO 6 | K | E | MF | CC | - |
| Missing CC GO 5 | K | E | MF | K | - |
| Missing MF GO 5 | K | E | BP | CC | - |
| Missing Enzymes | K | MF | BP | CC | - |
| All Present | K | E | MF | BP | CC |

Table 4-12: Highest contributions to eigen vectors using selected experimental data (51 networks)

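| Reference Networks | EV 1 | EV 2 | EV 3 | EV 4 | EV 5 |
|---|---|---|---|---|---|
| Missing KEGG | E | MF | MF | CC | - |
| Missing BP GO 6 | E | E | MF | CC | - |
| Missing CC GO 5 | E | E | K | BP | - |
| Missing MF GO 5 | E | K | BP | CC | - |
| Missing Enzymes | K | BP | MF | CC | - |
| All Present | E | E | BP | MF | CC |

Table 4-13: Highest contributions to eigen vectors using all experimental data (95 networks)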

| Occurrences | EV 1 | EV 2 | EV 3 | EV 4 |
|---|---|---|---|---|
| KEGG | 6 | 1 | 1 | 1 |
| BP GO 6 | 0 | 1 | 5 | 2 |
| CC GO 5 | 0 | 0 | 0 | 8 |
| MF GO 5 | 1 | 2 | 6 | 1 |
| MIPS Enzymes | 5 | 8 | 0 | 0 |

Table 4-14: Number of times reference network contributes the most to each eigen vector

The consensus results in Table 4-14 show that KEGG, MIPS Enzymes, Molecular Function GO 5 and Cellular Component GO 5 make a good set of reference networks. Hence these networks were chosen as the final reference set.
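As a cross-check, the counts in Table 4-14 can be reproduced by tallying the entries of Table 4-12 and Table 4-13. The sketch below is illustrative only: the rows are transcribed from the two tables (K = KEGG, E = MIPS Enzymes), and the helper name `tally` is invented here.

```python
from collections import Counter

# Rows transcribed from Table 4-12 (51 networks), in table order
table_4_12 = [
    ["MF", "E", "BP", "CC"],        # Missing KEGG
    ["K", "E", "MF", "CC"],         # Missing BP GO 6
    ["K", "E", "MF", "K"],          # Missing CC GO 5
    ["K", "E", "BP", "CC"],         # Missing MF GO 5
    ["K", "MF", "BP", "CC"],        # Missing Enzymes
    ["K", "E", "MF", "BP", "CC"],   # All Present
]
# Rows transcribed from Table 4-13 (95 networks), in table order
table_4_13 = [
    ["E", "MF", "MF", "CC"],
    ["E", "E", "MF", "CC"],
    ["E", "E", "K", "BP"],
    ["E", "K", "BP", "CC"],
    ["K", "BP", "MF", "CC"],
    ["E", "E", "BP", "MF", "CC"],
]

def tally(rows, n_vectors=4):
    """Count how often each network is the top contributor to EV 1..n_vectors."""
    counts = [Counter() for _ in range(n_vectors)]
    for row in rows:
        for ev_index, label in enumerate(row[:n_vectors]):
            counts[ev_index][label] += 1
    return counts

counts = tally(table_4_12 + table_4_13)
occurrences = {name: [c[name] for c in counts] for name in ["K", "BP", "CC", "MF", "E"]}
```

Each list in `occurrences` corresponds to one row of Table 4-14 (EV 1 to EV 4).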

4.3.3 Bayesian Data Fusion

As previously discussed, it is hypothesised that several reference networks are needed to score interaction data sets because each reference network represents a different type of the currently known functional interactions. In order to combine all the data sets into an integrated functional network, a three stage process was used.

Firstly, the data sets are scored against each reference network. Secondly, the scores from the individual data sets against each reference network are combined in turn, so that each link has a vector of four probabilities associated with it, each element of the vector corresponding to a different type of functional interaction, as measured by each of the reference networks. Finally, a single probability representing the strength of functional interaction between pairs of genes or proteins is calculated from the four aspect probabilities.

4.3.3.1 Probability Networks

A probabilistic network containing weighted links representing the probability of a ‘true’ interaction, can be derived from one or more data sets using Bayesian inference (e.g. see Lu 2005, Rhodes 2005, Kiemer 2007). In this sub-section a brief introduction to Bayesian inference is given and its use in network integration is described. The probability of a set of links (L) being present, given that they are present in the measured data (D) can be calculated using Bayes theorem:

$$P(L|D) = \frac{P(D|L)\,P(L)}{P(D)}$$

Where:
P(L|D) = the posterior probability of the links being present (after observing the data)
P(D|L) = the likelihood of getting the data given that the links are present
P(L) = the prior probability of the links being present before any data is observed
P(D) = a normalising constant

However, it is often helpful to consider Bayes theorem in odds ratio form (because then the normalising constants are not required):

$$O(L|D) = \frac{P(L|D)}{P(L^c|D)} = \frac{L_1}{L_0}\,O(L)$$

Where:
O(L|D) = posterior odds of links being present given that data D was measured vs links not present (L^c) given that data D was measured
L1 = Likelihood of links being present
L0 = Likelihood of links not being present
O(L) = prior odds of links being present vs links not present

In previous sections the likelihood ratio has been used as a network similarity score. Here it is seen that the likelihood ratio = L1 / L0 is used to calculate the posterior odds. The posterior probability of the links being present, P(L|D), can be recovered from the posterior odds via:

$$P(L|D) = \frac{O(L|D)}{1 + O(L|D)}$$

Now, suppose a link between nodes i and j is measured in many conditionally independent data sets.
Let d_ij^k represent the value of a link between nodes i & j in data set k
Let a_ij represent the value of a link between nodes i & j in the integrated network
Let D_k represent a collection of k data sets

Further suppose a value for a_ij has already been established from k-1 data sets (D_{k-1}), but now a new data set k provides more evidence. Combining the new evidence with the existing evidence means that the posterior odds for a link between node i and node j given the k data sets is given by:

$$O(a_{ij}|D_k) = \frac{L_1(d_{ij}^{k})}{L_0(d_{ij}^{k})}\,O(a_{ij}|D_{k-1})$$

where L1(d_ij^k) / L0(d_ij^k) is the likelihood ratio for the link in data set k. In other words, the posterior odds for the link given the new evidence and all of the existing evidence is equal to the likelihood ratio score of the new evidence (kth data set) multiplied by the odds for the link considering the existing evidence (k-1 th … 1st data sets). Decomposing the odds for the k-1th to 1st data sets into the component likelihood ratios, and taking natural logs of both sides results in:

$$\ln O(a_{ij}|D_k) = \sum_{m=1}^{k} \ln\frac{L_1(d_{ij}^{m})}{L_0(d_{ij}^{m})} + \ln O_{init}(a_{ij})$$

This equation reads that the log of the posterior odds of the link between node i and j is equal to the sum of the individual log likelihood ratio scores for the link plus the log of the prior odds of there being a link between node i and j (O_init(a_ij)). Finally, the actual posterior probability (p_ij) for the link between node i and j being present given all the data and prior assumptions, is:

$$p_{ij} = \frac{O(a_{ij}|D_k)}{1 + O(a_{ij}|D_k)}$$
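The update described above amounts to summing log-likelihood ratios, adding the log prior odds, and converting the resulting log odds back to a probability. A minimal Python sketch follows; the function name `posterior_probability` is invented for illustration and this is not the project's MATLAB implementation.

```python
import math

def posterior_probability(log_likelihood_ratios, log_prior_odds=0.0):
    """Posterior link probability from per-data-set log-likelihood ratios.

    log_likelihood_ratios: natural-log likelihood ratios, one per data set
                           in which the link was measured.
    log_prior_odds: natural log of the prior odds for the link.
    """
    log_odds = sum(log_likelihood_ratios) + log_prior_odds
    # p = O / (1 + O), evaluated in a numerically stable way
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)

no_evidence = posterior_probability([])               # prior odds of 1, no data
one_data_set = posterior_probability([math.log(3)])   # odds of 3 to 1
```

With no data and log prior odds of 0 the probability is 0.5; a single data set with a likelihood ratio of 3 raises it to 0.75.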

4.3.3.2 Use of Priors

In the above equations, the prior odds for the link between node i and j is important, but what should this quantity be? The prior odds is the ratio of the prior probability that a link between nodes i and j exists divided by the prior probability that no link exists. In Bayesian statistics, prior probability means the subjective expected probability of an event before any data is observed (see for example [Prior Def]).

Consider scoring and integrating experimental data with respect to the KEGG Pathways network. Before any experimental data is integrated, it is expected that any links present in the KEGG Pathways network will have a high probability of actually being present in nature, because KEGG is a well established and manually curated source of reference data for the biological community. Therefore, it is reasonable that high log prior odds are used for those links present in the KEGG pathways network. Considering the range of log-likelihood ratios measured (±7), a value of 20 seems appropriate. However, just because a link is not present in the KEGG pathways network does not mean that the link is unlikely to exist – it may just mean that the particular link in question has not yet been identified. As there is no definite information about non-present links in KEGG, the probability of the non-present link actually existing must be equal to the probability of it not existing, hence the prior odds for non-present links are 1 so the log-odds are 0.

Next, consider the other reference networks Molecular Function GO and MIPS Enzymes. In both cases, these networks are assumed to contain high probability links, so by applying the same argument as for the KEGG reference network, it is concluded that the links in the MF GO and MIPS Enzymes should be included as priors with high log odds (20) in their respective integrations also. However, in the case of the Cellular Component GO reference network, it is not true that genes or proteins annotated as being in the same cellular compartment are expected to have a functional interaction - perhaps they are more likely to, but it is difficult to determine on a link by link basis a priori. For this reason, the log prior odds for the integration with respect to Cellular Component GO is set to 0 for all the links.
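The choice of priors described above can be summarised in a few lines of Python. This is a sketch under stated assumptions: links are represented as frozensets of two gene names, the gene names and the helper `log_prior_odds` are hypothetical, and only the values 20 and 0 come from the text.

```python
HIGH_LOG_PRIOR_ODDS = 20.0     # value quoted in the text for curated reference links
NEUTRAL_LOG_PRIOR_ODDS = 0.0   # prior odds of 1: no information either way

def log_prior_odds(link, reference_links, use_reference_as_prior=True):
    """Log prior odds for one link with respect to one reference network.

    use_reference_as_prior is True for KEGG, Molecular Function GO and
    MIPS Enzymes, and False for Cellular Component GO.
    """
    if use_reference_as_prior and link in reference_links:
        return HIGH_LOG_PRIOR_ODDS
    return NEUTRAL_LOG_PRIOR_ODDS

# Hypothetical toy reference network containing a single link
toy_reference = {frozenset({"geneA", "geneB"})}
present = log_prior_odds(frozenset({"geneA", "geneB"}), toy_reference)
absent = log_prior_odds(frozenset({"geneA", "geneC"}), toy_reference)
cc_style = log_prior_odds(frozenset({"geneA", "geneB"}), toy_reference,
                          use_reference_as_prior=False)
```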

4.3.3.3 Network Integration with Priors

The experimental data sets from BioGRID consisting of 47 individual studies and 22 data sets of small scale studies combined by experiment type were scored and integrated with respect to the KEGG, Cellular Component GO 5, Molecular Function GO 5 and MIPS Enzymes reference networks. As discussed above, the KEGG, Molecular Function and MIPS Enzymes networks were also included as priors in the respective integrations. Figure 4-16 shows the compressed adjacency matrices of the resulting four combined networks, with genes ordered by their deepest GO terms from molecular function.

Figure 4-16: Compressed Adjacency matrix images of KEGG, CC GO 5, MF GO 5 & MIPS Enzymes Combined Networks (panels: KEGG Combined Network; CC GO 5 Combined Network; MF GO 5 Combined Network; MIPS Enzymes Combined Network). The KEGG, MF GO 5 & MIPS Enzymes reference networks were included as priors in the respective combined networks. Note that the maximum of the CC GO 5 is 0.5 rather than 1.

[Prior Def] http://cancerweb.ncl.ac.uk/cgi-bin/omd?prior+probability

4.3.3.4 Integration of the Combined Networks

Sections 4.3.3.1 and 4.3.3.2 describe how experimental data networks were combined with the prior network with respect to each reference network. In this section, the approach to the final integration of the combined networks is considered.

Previously, four probabilities were calculated, each one with respect to a reference network, for each possible link between nodes i and j. Since the same data was used for each probability (apart from the prior networks), it might be anticipated that the probabilities are not independent of each other. However, a scatter plot of all the link probabilities with respect to the KEGG reference network versus those from the Cellular Component GO reference network, or Molecular Function GO reference network or MIPS Enzymes reference network did not seem to show a particularly strong correlation (Figure 4-17). If the link probabilities were strongly correlated, the scatter plot would be concentrated about the diagonal, but instead there were clusters of points about the edges. For example, the scatter plot indicates that there were many links with a high probability with respect to KEGG, but no particular probability with respect to another network (points on the right hand side of the figure). Additionally, repeating the scatter plot using the combination of only the data sets to calculate the four types of link probability and not including the priors yields the same results (figure not shown). The lack of a strong correlation between the four link probabilities is reassuring because the reference networks were specifically chosen to represent different aspects of functional interactions and so produced diverse data scores.

Figure 4-17: Scatter plot of link probabilities, KEGG vs CC (blue), KEGG vs MF (red), KEGG vs Enzymes (green)

Although there may well be some residual dependence between the four combined probabilistic networks, it was assumed that the combined probabilistic networks are conditionally independent for the purposes of the final integration.

If the networks are conditionally independent, then the final probability for a functional interaction between node i and j is the probability of that link under the KEGG combined network OR the Cellular Component combined network, OR the Molecular Function combined network OR the Enzymes combined network (see footnote 3). The OR operation is appropriate because in section 4.3.1 an inclusive definition of functional interaction was considered. The final link probability under an OR operation is calculated as one minus the probability of the link not existing in any of the networks:

$$p_{Final} = 1 - (1 - p_{KEGG})(1 - p_{CC})(1 - p_{MF})(1 - p_{Enzymes}) \qquad \text{(Eqn 6)}$$

As a worked example of how the combined network probabilities are integrated into a final probability consider the case where for a particular link pKEGG = 0.9, pCC = 0.5, pMF = 0.8, pEnzymes = 0.2. The final link probability (pFinal) is now greater than the maximum of the individual probabilities because:

$$p_{Final} = 1 - (1 - 0.9)(1 - 0.5)(1 - 0.8)(1 - 0.2) = 1 - 0.008 = 0.992$$

Even if the individual probabilities are very weak, e.g. 0.01 then integrating the weak lines of evidence still produces a higher final link probability (0.039).
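The OR combination of equation 6 and the two numerical examples above can be checked with a short Python sketch. The function name `integrate_or` is invented for illustration; the independence assumption is the one stated in the text.

```python
def integrate_or(probabilities):
    """Probability that a link exists in at least one network,
    assuming the networks are conditionally independent."""
    p_absent_everywhere = 1.0
    for p in probabilities:
        p_absent_everywhere *= (1.0 - p)
    return 1.0 - p_absent_everywhere

# Worked example from the text: pKEGG, pCC, pMF, pEnzymes
p_final = integrate_or([0.9, 0.5, 0.8, 0.2])
# Four very weak lines of evidence
p_weak = integrate_or([0.01, 0.01, 0.01, 0.01])
```

Here `p_final` evaluates to approximately 0.992 and `p_weak` to approximately 0.039, matching the values quoted in the text.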

3 If the networks were not independent then equation 6 would have extra terms describing the correlation between the networks. The extra terms would reduce the final link probability.

4.3.3.5 Final Integration

The final integration of the KEGG, CC GO 5, MF GO 5 and MIPS Enzymes combined networks of Figure 4-16 was performed using equation 6 to give an integrated functional network. The compressed adjacency matrix images with genes ordered by biological process, cellular component and molecular function respectively are displayed in Figure 4-18. The properties of this integrated functional network will be examined in section 4.4.

Figure 4-18: Compressed adjacency matrix image of Integrated Functional Network (panels: genes ordered by biological process; genes ordered by cellular component; genes ordered by molecular function)

4.3.4 Discussion

The methods developed in this subsection depart in several respects from those previously reported in the literature. Firstly, although log-likelihood scoring as part of data integration has been used by several groups, the log-likelihood ratio score presented here is somewhat different to that used by Lee et al. (Lee 2004). Secondly, the data is scored against four reference networks, each reference network representing a different type of functional interaction; and an explicit computational method was developed to determine which reference networks to use. Thirdly, the possibility of including the reference networks as priors when forming an integrated network has not been previously addressed. Each of these three points is described in more detail below.

4.3.4.1 Log-Likelihood Ratio Score

In section 4.2.1.2 a log-likelihood ratio score was introduced as a statistical means of measuring the similarity between two networks. This log-likelihood ratio is used in section 4.3.3.1 to compute a set of posterior probabilities for each interaction by comparing the experimental data to a set of reference networks, assuming that the reference networks represent some aspect of the ‘true’ functional associations. Because the log-likelihood ratio used here is comparing a data network to a reference network for the purposes of determining how similar the data network is to the reference network, it is fair to calculate the ‘true’ positives and ‘false’ positives etc. with respect to the links present or absent in the reference network. Since the data or reference network may not contain interaction information about all the genes, the number of ‘true’ and ‘false’ positives is only calculated over genes common to both networks. Consequently a very small data set containing only one interaction (gene A – gene B) may nevertheless receive a very high log-likelihood ratio (see footnote 4) if that interaction is present in the reference data.

An alternative method for calculating the number of ‘true’ and ‘false’ positives etc. is to compare the data network to one reference network containing positive links and a different reference network containing negative links, where a negative link indicates the definite absence of an interaction. These positive and negative gold standards are often used to train machine learning algorithms and for performance evaluation using ROC curves. Additionally a positive and negative reference network can be used to calculate the numbers of true and false positives etc. for a log-likelihood calculation (Kiemer 2007).

Both Lee et al. and Kiemer et al. (Lee 2004, Kiemer 2007) use a slightly different log-likelihood ratio to the one used in this project. In this project the log likelihood ratio is the ratio of the likelihood that the data measures ‘true’ links over the likelihood that the data measures ‘false’ links, whereas Lee and co-workers and Kiemer and co-workers use only the likelihood that the data measures ‘true’ links (see 8.3). The numerical difference between the likelihood ratio used here and that of Lee and co-workers was quite small when considering the scores of the Lee data sets against each other. However, unlike the Lee scores, the likelihood ratio used here has the useful properties of being symmetric (the likelihood ratio score of A against B is the same as B against A) and independent of gold standard size.

4 Note that in the MATLAB implementation of the log-likelihood ratio score, the value 20 is returned instead of infinity


4.3.4.2 Choice of Reference Networks

The concept behind using multiple reference networks is to enable different types of experimental data set to be scored appropriately. In principle, a reference network representing the outcome of a perfect experiment could be used for all experiments of that type. However, no such references exist, and if they did then it would not be necessary to do the experimental data integration in the first place since all the interaction information would be known. Instead, there are a few manually curated reference sources containing known (or assumed) linkages of particular functional types. In this project, rather than manually choosing an appropriate set of reference networks that would adequately cover the range and type of interactions to be scored and have minimal overlap, it was decided to use a computational means in which minimal expert knowledge was required. Hence principal component analysis was used to try to discover a suitable set of reference networks. Ideally, the set of reference networks would be orthogonal to each other – i.e. they would contain totally different interactions even over the same sub-set of genes. Since the output of principal component analysis is an orthogonal set of basis functions it is possible to create orthogonal networks by taking linear combinations of the original reference networks (e.g. basis 1 = A x network 1 + B x network 2, basis 2 = C x network 1 – D x network 2 etc). However to keep the biological meaning of the reference networks it was decided not to use the constructed orthogonal set, and instead use the reference networks that best approximated the orthogonal set. Using the PCA procedure over all the experimental data, the KEGG Pathways, Cellular Component GO, Molecular Function GO and MIPS Enzymes networks were identified as a suitable reference set.

4.3.4.3 Prior Information

In the Bayesian data fusion method described in 4.3.3.1, the posterior odds of a link existing is calculated by updating its prior odds with the likelihood ratio of the observations. Since four different reference networks were identified, each describing a different view of the functional interactions, four likelihood ratios for each link are generated.

As the KEGG, Molecular Function and MIPS Enzymes reference networks represent currently known interactions, the links they contain are included in their respective integrations as prior information with a high probability. The inclusion of reference networks as priors in the probabilistic functional integrated networks has not been examined in depth before, since other groups have preferred to focus on integration of experimental evidence only. Including reference networks as priors means that the integrated networks contain validated experimental data and human curated functional interaction data.


4.4 Evaluation of the Integrated Network

In order to assess the value of the multiple gold standard approach, an integrated functional network was created using the method developed in 4.3.3 from the BioGRID data. This network was composed of 47 individual studies and 22 composite data sets of smaller studies combined by experiment type, using KEGG, Cellular Component, Molecular Function and MIPS Enzymes as reference networks. Compressed images of the adjacency matrix are displayed in Figure 4-18.

In this section the properties of the integrated functional network are described and assessed. The discriminatory performance of the network was measured with ROC and Precision-Recall curves using the Myers positive and negative gold standard (Myers 2006). Additionally, the potential of the network for gene functional prediction was assessed by making annotation predictions for genes of unknown function.

4.4.1 Relative Performance Evaluation

To assess the quality of the integrated functional network, precision-recall and sensitivity-specificity curves were calculated against the Myers expert-curated positive and negative gold standards (Myers 2006).

Figure 4-19 shows the Precision-Recall curves for the combined networks using KEGG, CC GO 5, MF GO 5 or MIPS Enzymes as the reference network on the 69 BioGRID data sets together with the curve for the final integrated functional network (‘whitening integration’). The same results are displayed in Figure 4-20, but in this case networks were combined and integrated without including the KEGG, MF GO 5 or MIPS Enzymes reference networks as priors. Precision-Recall curves for the integrated functional network formed from BioGRID data decomposed into 22 networks by experiment type (rather than by study), and for integrated networks with no prior reference networks included, are also displayed in Figure 4-21. The equivalent ROC (Sensitivity vs 1-Specificity) curves can be found in Figure 4-22.
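Each point on such curves is obtained by thresholding the link probabilities and counting agreements with the positive and negative gold standards. The Python sketch below illustrates one such point; the function name, the toy links and the decision to ignore links that appear in neither gold standard are assumptions made here, not details taken from the project.

```python
def evaluate_at_threshold(link_probs, positives, negatives, threshold):
    """Return (precision, recall, one_minus_specificity) at one threshold.

    link_probs: dict mapping a link to its probability.
    positives / negatives: sets of gold standard links.
    Links outside both gold standards do not contribute to any count.
    """
    predicted = {link for link, p in link_probs.items() if p >= threshold}
    tp = len(predicted & positives)
    fp = len(predicted & negatives)
    fn = len(positives - predicted)
    tn = len(negatives - predicted)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    one_minus_specificity = fp / (fp + tn) if (fp + tn) else float("nan")
    return precision, recall, one_minus_specificity

# Toy data, invented for illustration
toy_probs = {("a", "b"): 0.95, ("a", "c"): 0.60, ("b", "c"): 0.92}
toy_positives = {("a", "b"), ("a", "c")}
toy_negatives = {("b", "c"), ("c", "d")}
point = evaluate_at_threshold(toy_probs, toy_positives, toy_negatives, 0.9)
```

Sweeping the threshold from high to low and collecting the resulting points traces out the Precision-Recall and ROC curves.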

Figure 4-19: Precision-Recall curve for the combined networks and final integrated network. [Chart title: Evaluation of Integrated Networks against Myers Positive and Negative Gold Standards. x-axis: Recall (TP / (TP + FN)), 0.00 to 0.08. y-axis: Precision (TP / (TP + FP)), 0 to 1. Series: Whitening Integration; Integration by Kegg; Integration by Cellular Compartment; Integration by Molecular Function; Integration by Enzymes.]

Figure 4-20: Precision-Recall curve for the networks combined and integrated without including the reference networks as priors. [Chart title: Evaluation of Integrated Networks without Priors against Myers Positive and Negative Gold Standards. x-axis: Recall (TP / (TP + FN)), 0.00 to 0.08. y-axis: Precision (TP / (TP + FP)), 0 to 1. Series: Whitening Integration; Integration by Kegg; Integration by Cellular Compartment; Integration by Molecular Function; Integration by Enzymes.]

Figure 4-21: Precision-Recall curves for variations on the final integrated network. [x-axis: Recall (TP / (TP + FN)), 0.00 to 0.08. y-axis: Precision (TP / (TP + FP)), 0 to 1. Series: by study; by experiment type; by study (no priors); by experiment type (no priors).]


Figure 4-22: ROC curve for variations on the final integrated network. [x-axis: 1-Specificity, 0.000 to 0.008. y-axis: Sensitivity (TP / (TP + FN)), 0 to 0.08. Series: by study; by experiment type; by study (no priors); by experiment type (no priors).]

4.4.2 Interaction Predictions

The final integrated network contains probabilistic links between genes or their products, and since the links represent functional interactions, it is assumed that the function of a particular gene is related to the functions of its high probability neighbours, using the ‘guilt-by-association’ principle (Oliver 2000). To test the ability of the final integrated network to correctly predict gene function, the functions of the interaction partners of some genes with known annotation were examined. Additionally, the functions of the interaction partners of genes with no annotation were sought and predictions about the unknown genes were made.

4.4.2.1 Prediction verification using known genes

Using the same example KEGG pathways (SCE03010 : Ribosome; SCE04111 : Cell cycle; and KO00193 : ATP synthesis) as those used in a paper by Myers (Myers 2005), the high probability neighbours of a gene (YBL027W; YAL023C; and YLR447C) from each pathway were found. Since the KEGG network was included as a prior in the integration, all the genes on the same pathway are linked. However, retrieving the neighbours and their respective GO annotations for an example gene on a known pathway serves to verify the usefulness of the integrated functional network and the guilt-by-association principle. The tables below show the GO term predictions made by finding the most popular deepest GO terms of the neighbours for three example genes, together with the actual GO annotations of the genes.

Next, for each of the three pathways considered, the predicted (most popular) and actual deepest GO annotations for all the genes on the pathway were compared across the biological process, cellular component and molecular function categories, and the number of exact matches was recorded. Figure 4-23 displays the percentage of genes on the pathway with 0, 1, 2 or 3 exactly correct predictions for the deepest GO terms.

The figure shows that in the SCE03010 : Ribosome pathway, the deepest GO term predictions match the annotated deepest GO terms in all three categories in about 35% of the genes, and the predicted deepest GO terms do not match any of the annotations in about 2% of the genes. For SCE04111 : Cell Cycle it can be seen that for about 50% of the genes, the most popular deepest GO terms predictions do not match any of the annotated terms, and none of the genes in KO00193 : ATP Synthesis have correct predictions considering the most popular terms. However, as exemplified in Table 4-16 and Table 4-17, sometimes the correct deepest GO terms are not quite the most popular. Consequently the lower part of Figure 4-23 shows the percentage of genes on the pathway that have 0, 1, 2 or 3 correct deepest GO term predictions within the top 5 most popular terms of the biological process, cellular component, and molecular function categories. When the top 5 most popular terms are considered, the percentages of genes with at least one correctly predicted annotation increases to over 80% in all three pathways.
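The 'most popular term' prediction and the top 5 comparison described above can be sketched as a simple vote over neighbours. This is an illustrative sketch only: the function name, the probability threshold of 0.9 used to define a 'high probability' neighbour, and the toy gene names are assumptions introduced here and are not stated in the text.

```python
from collections import Counter

def predict_terms(gene, link_probs, annotations, threshold=0.9, top_n=5):
    """Rank GO terms by how many high-probability neighbours carry them.

    link_probs: dict mapping frozenset({gene_a, gene_b}) -> link probability.
    annotations: dict mapping gene -> deepest GO term for one GO category.
    threshold: hypothetical cut-off for a 'high probability' neighbour.
    """
    votes = Counter()
    for link, p in link_probs.items():
        if len(link) != 2 or gene not in link or p < threshold:
            continue
        (neighbour,) = link - {gene}
        term = annotations.get(neighbour)
        if term is not None:  # un-annotated neighbours cannot vote
            votes[term] += 1
    return [term for term, _ in votes.most_common(top_n)]

# Toy network, invented for illustration
toy_links = {
    frozenset({"g1", "g2"}): 0.95,
    frozenset({"g1", "g3"}): 0.99,
    frozenset({"g1", "g4"}): 0.97,
    frozenset({"g1", "g5"}): 0.20,   # too weak to count as a neighbour
}
toy_annotations = {"g2": "GO:0006412", "g3": "GO:0006412",
                   "g4": "GO:0005842", "g5": "GO:0003735"}
ranked = predict_terms("g1", toy_links, toy_annotations)
```

An exact match corresponds to the actual annotation equalling `ranked[0]`; a top 5 match corresponds to the actual annotation appearing anywhere in `ranked`.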

Results for predictions from an integrated network created without using the reference networks as priors are shown in Figure 4-23. In both the Ribosome and Cell cycle pathways, the inclusion of priors improves the prediction performance; however, the performance on the ATP synthesis reference pathway is better when the priors are not included.

Table 4-15: Predicted GO terms for YBL027W

| | Name | Description |
|---|---|---|
| Gene | YBL027W | Ribosomal Protein of the Large subunit |
| Pathway | sce03010 | Ribosome |

| | Predicted GO Terms | Actual GO Annotation |
|---|---|---|
| Biological Process | GO:0006412 - translation | GO:0006412 - translation |
| Cellular Component | GO:0005842 - cytosolic large ribosomal subunit | GO:0005842 - cytosolic large ribosomal subunit |
| Molecular Function | GO:0003735 - structural constituent of ribosome | GO:0003735 - structural constituent of ribosome |

Table 4-16: Predicted GO terms for YAL023C

| | Name | Description |
|---|---|---|
| Gene | YAL023C | Protein O-MannosylTransferase |
| Pathway | sce04111 | Cell cycle |

| | Predicted GO Terms | Actual GO Annotation |
|---|---|---|
| Biological Process | GO:0006493 protein amino acid O-linked glycosylation | GO:0006493 protein amino acid O-linked glycosylation |
| Cellular Component | GO:0005783 endoplasmic reticulum | GO:0005783 endoplasmic reticulum |
| Molecular Function | GO:0000030 mannosyltransferase activity (4th: GO:0004169 dolichyl-phosphate-mannose-protein mannosyltransferase activity) | GO:0004169 dolichyl-phosphate-mannose-protein mannosyltransferase activity |

Table 4-17: Predicted GO Terms for YLR447C

| | Name | Description |
|---|---|---|
| Gene | YLR447C | V-ATPase V0 sector subunit d (gene product) |
| Pathway | ko00193 | ATP synthesis |

| | Predicted GO Terms | Actual GO Annotation |
|---|---|---|
| Biological Process | GO:0009060 aerobic respiration (3rd: GO:0007035 vacuolar acidification) | GO:0007035 vacuolar acidification |
| Cellular Component | GO:0005739 mitochondrion (18th: GO:0005774 vacuolar membrane) | GO:0005774 vacuolar membrane |
| Molecular Function | GO:0003924 GTPase activity (4th: GO:0046933 hydrogen ion transporting ATP synthase activity, rotational mechanism) | GO:0046961 hydrogen ion transporting ATPase activity, rotational mechanism |

Figure 4-23: Percentage of exact GO term (upper) and top 5 (lower) prediction matches for all the genes in 3 selected KEGG pathways.

4.4.2.2 Predictions for genes with unknown function

The original gene – GO term association data downloaded from GO only contained 6300 genes, but the number of genes in the integrated data was 6674. Hence there were 374 genes with no annotation in the data sets used. 253 of these genes with no annotation had at least one interaction with another gene, so were candidates for function prediction. However, 32 of these genes had very weak interactions or had interactions only with genes with no annotation either. Consequently, there were a total of 221 genes for which predictions have been made on the basis of the most popular GO terms of their interacting partners, using the ‘guilt-by-association’ principle (Oliver 2000).


Using the GO term orders (see 3.1.2.2) the levels of the predicted GO terms of the un-annotated genes were recorded, and histograms of the number of genes with predicted GO term levels for biological process, cellular component and molecular function are plotted in Figure 4-24. The distribution of the predicted GO terms amongst levels 1 - 11 for the un-annotated genes approximates the distribution for all the genes. For the un-annotated genes most of the predicted biological process GO terms are of level 7, most of the cellular component terms are of level 3 and most of the molecular function terms are of level 5. However, when considering all the genes, there are slightly more GO term annotations at level 5 than 3 for cellular component.

Figure 4-24: Number of GO term predictions at GO term level for un-annotated genes. [Chart title: Number of Genes with Predicted GO Term Level. x-axis: GO Level, 1 to 11. y-axis: Number of Genes, 0 to 120. Series: Biological Process; Cellular Component; Molecular Function.]

Since the higher level GO terms are more general, to gain an appreciation of the percentage of detailed predictions, the percentage of the un-annotated genes that have all three GO term predictions of a particular level or below are plotted in Figure 4-25. Here it can be seen that just under 20% of the un-annotated genes have all three GO term predictions at level 5 or below. The predictions for the 3 genes (2%) that have all three GO terms at level 6 or below are displayed in Table 4-18, Table 4-19 and Table 4-20, together with information from the SGD [SGD] and Uniprot [Uniprot].
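The cumulative percentages plotted in Figure 4-25 can be computed as sketched below. Two assumptions are made here and should be read as such: 'at level N or below' is interpreted as a level number of N or greater (i.e. at least as deep in the GO hierarchy), which is consistent with the levels listed in Table 4-18 to Table 4-20, and the function name and the fourth toy gene are invented for illustration.

```python
def percent_all_at_or_below(predicted_levels, max_level=7):
    """Percentage of genes whose three predicted GO terms are all at
    level N or deeper, for N = 1..max_level.

    predicted_levels: list of (bp_level, cc_level, mf_level) tuples.
    """
    n_genes = len(predicted_levels)
    if n_genes == 0:
        return {}
    percentages = {}
    for level in range(1, max_level + 1):
        # A gene qualifies only if its shallowest prediction is deep enough
        hits = sum(1 for levels in predicted_levels if min(levels) >= level)
        percentages[level] = 100.0 * hits / n_genes
    return percentages

# First three tuples are the levels listed in Tables 4-18, 4-19 and 4-20;
# the fourth gene is invented to make the toy example less uniform.
toy_levels = [(6, 6, 6), (11, 6, 10), (7, 6, 8), (7, 3, 5)]
profile = percent_all_at_or_below(toy_levels)
```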

Figure 4-25: Percentage of un-annotated genes with all GO term predictions at or below GO term level. [Chart title: Percentage of Genes with All GO Term Predictions at GO Level or Below. x-axis: GO Level, 1 to 7. y-axis: Percentage of Genes, 0 to 100.]

http://www.pir.uniprot.org/



Table 4-18: Predictions for un-annotated gene YGL263W

Gene Name: YGL263W
SGD Description: Protein of unknown function, member of the DUP380 subfamily of conserved, often subtelomerically-encoded proteins [ygl263w]

| Predicted GO Terms | GO ID | GO Description | Level |
|---|---|---|---|
| Biological Process | GO:0015744 | succinate transport | 6 |
| Cellular Component | GO:0005743 | mitochondrial inner membrane | 6 |
| Molecular Function | GO:0005469 | succinate:fumarate antiporter activity | 6 |

Uniprot [P53053], additional information from iProClass:

| Category | GO ID | GO Description |
|---|---|---|
| Cellular Component | GO:0016020 | membrane |
| Cellular Component | GO:0016021 | integral to membrane |

Table 4-19: Predictions for un-annotated gene YBR007C

Gene Name: YBR007C
SGD Description: Deletion suppressor of mpt5 mutation [ybr007c]

| Predicted GO Terms | GO ID | GO Description | Level |
|---|---|---|---|
| Biological Process | GO:0006348 | chromatin silencing at telomere | 11 |
| Cellular Component | GO:0000790 | nuclear chromatin | 6 |
| Molecular Function | GO:0004406 | H3/H4 histone acetyltransferase activity | 10 |

Uniprot [P38213], additional information from iProClass:

| Category | GO ID | GO Description |
|---|---|---|
| Molecular Function | GO:0042802 | identical protein binding |
| Molecular Function | GO:0005488 | binding |

Table 4-20: Predictions for un-annotated gene YBR259W

Gene Name: YBR259W
SGD Description: Putative protein of unknown function; YBR259W is not an essential gene [ybr259w]

| Predicted GO Terms | GO ID | GO Description | Level |
|---|---|---|---|
| Biological Process | GO:0006366 | transcription from RNA polymerase II promoter | 7 |
| Cellular Component | GO:0005665 | DNA-directed RNA polymerase II, core complex | 6 |
| Molecular Function | GO:0004707 | MAP kinase activity | 8 |

Uniprot [P38338], additional information from iProClass: (None)



4.4.3 Discussion

4.4.3.1 Performance and use of Priors

The Precision-Recall and ROC curves of Figure 4-19 to Figure 4-22 show the performance of the integrated network against the Myers gold standard positive and negative interaction sets. Precision-Recall and ROC curves are a common means of measuring discrimination performance and allow this work to be compared with that of other groups. The Myers gold standard was used for performance assessment because it was designed to represent the known functional interactions (Myers 2006) but was not used to inform the creation of the integrated network. (The Myers gold standard was also not considered for use as a reference network because it is not derived from a public reference database.) Of course, the Myers gold standard may not itself be an accurate representation of the true functional interactions in S. cerevisiae; however, some source of assumed true positive and negative interactions is required in order to generate Precision-Recall and ROC curves.
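For reference, the standard definitions behind these curves are given below. Here TP, FP, FN and TN are the numbers of true positive, false positive, false negative and true negative links obtained when the integrated network is cut at a given probability threshold and compared with the gold standard positive and negative sets. This notation is introduced here for illustration only; section 8.4.1 uses n1,1, n1,0, n0,1 and n0,0 for the same four counts.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \text{sensitivity} = \text{true positive rate} = \frac{TP}{TP + FN}
\]
\[
\text{False positive rate} = 1 - \text{specificity} = \frac{FP}{FP + TN}
\]
```

A Precision-Recall curve plots precision against recall as the threshold is varied, and a ROC curve plots the true positive rate against the false positive rate.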

The Precision-Recall curves of Figure 4-20 show that integrating the same BioGRID data with respect to different reference networks does not give a large variation in performance against the Myers gold standards if the appropriate reference networks are not included as priors. Additionally, there is only a small performance gain when the combined networks are integrated if the priors are excluded. However, including the KEGG, MF GO 5 and MIPS Enzymes reference networks as priors makes a substantial difference to the performance of the combined and final integrated networks, as can be seen in Figure 4-19. For instance, the Precision-Recall values of the integrated network without priors are (0.43, 0.03) at a threshold of 0.9, whereas the values are (0.42, 0.07) at the same threshold for the final integrated network (with priors), so recall more than doubles at almost the same precision. Although these values might seem low, note that for the Bayesian integration with the Literature Curated data set by Reguly et al. (Reguly 2006), measured against a Biological Process GO gold standard, the values are (0.4, 0.05) (threshold unknown), whilst the values for the Literature Curated data set itself are (0.7, 0.02).

The effect of performing the integration over the BioGRID data split into studies and small scale studies, or by experiment type, does not appear to be very large (see Figure 4-21 and Figure 4-22), but the ROC curve in Figure 4-22 again shows that the inclusion of priors has a large effect on performance with respect to the Myers gold standards. The low sensitivity (true positive rate) and 1-specificity (false positive rate) in all the ROC curves indicate that the BioGRID integrated networks contain many interactions that differ from those in the Myers gold standards.

An alternative way of validating the performance of the integrated network is to use the guilt-by-association principle to assign GO annotations. In section 4.4.2.1, the most popular GO annotations amongst the high probability (>=0.9) neighbours of a gene were used to predict the GO annotations of the gene in the three GO categories (biological process, cellular component, molecular function). Comparing the predictions with the actual annotations for known genes on three KEGG pathways showed that the integrated functional network with the guilt-by-association principle correctly predicted all three annotations in over 90% of the genes on the Ribosome pathway, and that at least one annotation per gene was correctly predicted in over 80% of the genes on the Cell cycle pathway. However, the predictions for the ATP synthesis reference pathway were poor.

Since the integrated functional network contains the links between genes on the same KEGG pathways as a prior, the performance on the Ribosome and Cell cycle pathways is not surprising. However, excluding the priors from the integration results in only a modest decrease in performance on these two pathways, which means that the priors do not account for all of the performance and that the experimental data links add significant information. In the GO term prediction results for the ATP synthesis reference pathway, the inclusion of the priors has an apparently detrimental effect. However, closer examination of the actual GO term predictions shows that the effect of the priors is to boost the popularity of related terms. For example, the molecular function of YBR127C and YEL051W is correctly predicted by the integrated network with no priors as:

GO:0046961 hydrogen ion transporting ATPase activity, rotational mechanism

When priors are included, the top 5 predictions for the molecular function of YBR127C and YEL051W are:

GO:0003924 GTPase activity
GO:0016887 ATPase activity
GO:0042626 ATPase activity, coupled to transmembrane movement of substances
GO:0046933 hydrogen ion transporting ATP synthase activity, rotational mechanism
GO:0004004 ATP-dependent RNA helicase activity

Consequently, the apparent detrimental effect of using the priors on the ATP synthesis reference pathway is in fact caused by the simple guilt-by-association prediction metric used. If a more sophisticated GO term prediction mechanism were used, taking into account the relationships between the GO terms retrieved from the interaction partners of each gene, then the prediction performance for the integrated network with priors would be expected to become more consistent across the different pathways.
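The simple guilt-by-association vote discussed above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used in section 4.4.2.1: the probability matrix `P`, the annotation list `goTerms` and all variable names are invented, and only a single GO category is shown. The two GO IDs are taken from the ATP synthesis example above.

```matlab
% Invented example: symmetric link probabilities between 4 genes and one
% molecular function annotation per gene ('' means un-annotated).
P = [0    0.95 0.92 0.93;
     0.95 0    0.30 0.91;
     0.92 0.30 0    0.10;
     0.93 0.91 0.10 0   ];
goTerms = {'', 'GO:0046961', 'GO:0046961', 'GO:0016887'};

target    = 1;     % gene whose annotation is to be predicted
threshold = 0.9;   % high probability cut-off

% Neighbours linked to the target with probability >= threshold
neighbours = find(P(target, :) >= threshold);

% Known annotations of those neighbours, ignoring un-annotated ones
known = goTerms(neighbours);
known = known(~cellfun(@isempty, known));

% Count each distinct term and take the most popular one
[terms, ~, idx] = unique(known);
counts = accumarray(idx(:), 1);
[~, best] = max(counts);
prediction = terms{best};

disp(prediction)
```

Because each term is counted independently, closely related terms such as GO:0046961 and GO:0016887 compete with each other rather than reinforcing each other. A vote that aggregated counts along the GO hierarchy would be one possible form of the more sophisticated mechanism suggested above.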

Overall, considering the performance of the integrated functional network with and without priors against the Myers gold standard and in the GO term predictions, it is concluded that using the reference networks as priors results in a significant improvement in the predictive performance of the network.


4.4.3.2 Predictions for Un-annotated Genes

Analysing the functional predictions for un-annotated genes is somewhat problematic because the function must be deduced from the available sequence, structure and interaction data. Nevertheless, to validate the predictions about the unknown genes, a comparison is made with predictions from the similar integrated functional network tools AVID (Jiang 2005) and BioPIXIE (Myers 2005). Both AVID and BioPIXIE output a list of interactions for a given gene, and AVID makes GO term predictions for biological process, cellular component and molecular function terms if there is enough interaction information available.

To indicate the differences between the networks, the number of interactions was found for three genes on the reference pathways. In Table 4-21 it can be seen that the numbers of AVID and high probability (>=0.9) BioPIXIE interactions are similar, whereas the number of interactions in the integrated functional network (posterior probability network, PP) is similar to the number of medium probability (>=0.5) BioPIXIE interactions. This means that the individual link probabilities are higher in the integrated functional network than in the BioPIXIE network. Note that although a substantial fraction of the interactions in the integrated functional network have come from the priors, there are still a large number of high probability links derived from the BioGRID data.

Table 4-21: Comparison of the number of interactions from genes on KEGG pathways from different integrated networks. Where available, the number of interactions greater than or equal to a particular interaction probability is given.

Gene      PP>=0.9   PP>=0.9 (no priors)   AVID   BioPixie>=0.9   BioPixie>=0.5   BioPixie>=0.2
YBL027W   149       55                    6      8               122             295
YAL023C   85        51                    4      4               12              73
YLR447C   269       29                    12     8               23              391

The same pattern of greater numbers of high probability links in the integrated functional network (PP) than in AVID or BioPIXIE is also seen for the three example un-annotated genes. In the examples considered, there appears to be little agreement about the interactions of the un-annotated genes, although the single high probability link between YBR007C and YMR127C in the integrated functional network is also found as a medium probability link in BioPIXIE. Since AVID did not report any interactions for YBR007C and YBR259W, no GO term predictions were made for these genes. A single AVID prediction of GO:0016040 vesicle organisation and biogenesis (biological process) was made for YGL263W, but this prediction does not seem to be very specific. In contrast, because the integrated functional network has many higher probability links, predictions can be made for most un-annotated genes (221 out of 374).


Table 4-22: Interaction comparison for YGL263W

Gene      PP Score   AVID DT Score   BioPixie Score
YJR095W   0.959      -               -
YBR302C   -          0.788 (BP)      -
YHL044W   -          0.788 (BP)      -
YHL048W   -          0.788 (BP)      -
YNR075W   -          0.708 (BP)      -
YJR020W   -          0.750 (CC)      0.203
YGL051W   -          0.810 (MF)      -
YDL116W   -          -               0.203

Table 4-23: Interaction comparison for YBR007C

Gene      PP Score   AVID DT Score       BioPixie Score
YMR127C   1.000      (No interactions)   0.703
YLL024C   -          -                   0.208
YDR190C   -          -                   0.208
YDL229W   -          -                   0.208
YER165W   -          -                   0.208
YBR118W   -          -                   0.208
YBR127C   -          -                   0.208

Table 4-24: Interaction comparison for YBR259W

Gene      PP Score   AVID DT Score       BioPixie Score
YOL005C   1          (No interactions)   (No interactions > 0.2)
YBL016W   0.974      -                   -

Although it is difficult to assess the quality of the predictions for un-annotated genes without further laboratory investigation, a relatively high degree of confidence can be placed in them because the predictions for the annotated genes in the previous section were reasonably good. In summary, a greater number of high probability links are predicted by the integrated network resulting from this study than are generated by the AVID and BioPIXIE tools considered here. Furthermore, unlike these tools, the network constructed in this study is able to assign functional predictions to most of the previously un-annotated genes selected.


5 Conclusions

Knowledge of the behaviour of genes and proteins can aid the understanding of biological processes and functions within a cell. The interaction patterns between different genes or their products are captured in a functional interaction network by linking those that share a functional interaction. Functional interactions can be measured by a variety of experimental techniques, and the results from several experiments can be combined to give a more complete picture of the functional interaction network. Expert curated data sets representing particular types of functional interaction information are often used as gold standards, to which other data sets are compared. However, existing methods for creating integrated functional networks tend either to bias the resulting network towards a particular gold standard, or to avoid the use of gold standards altogether and thus not incorporate information from expert-curated data sets. Hence, the original aim of this project was to develop a method for creating integrated functional interaction networks from diverse data sources including high-throughput experimental results and expert-curated interaction data.

In the first phase of the project, the types of functional interaction present in a range of data sets, typified by those used by Lee and co-workers (Lee 2004), or downloaded from MIPS and BioGRID, were investigated. In addition, a log-likelihood ratio method for measuring the similarity between networks was developed. It was found that data sets arising from different experimental techniques contained different types of functional interactions as measured by network overlaps, the log-likelihood ratio scores and analysis of GO term annotations. These findings confirmed that it was not appropriate to use a single gold standard to validate interaction data from all the different types of experiment.

In the second phase of the project novel integration methods were developed. Firstly, a method for choosing the most appropriate set of reference networks to validate the diverse data types was developed. This method involved identifying reference networks that were as different as possible from each other. As a result of the analysis a set of four reference networks derived from expert-curated databases were chosen to represent different types of functional link. The reference networks chosen were derived from KEGG Pathways, Cellular Component GO annotations, Molecular Function GO annotations and MIPS Enzymes. Secondly, a Bayesian data fusion method was developed to calculate the probability of functional interactions by updating the prior odds for the functional interactions with the integrated log-likelihood ratio scores from the experimental data sets as measured against each of the four reference networks. Since the links in the KEGG Pathways, Molecular Function GO and MIPS Enzymes reference networks represent known functional interactions, the links in these networks were included in their respective integrations as prior information with a high probability. Finally, a method for combining the four separately integrated networks was developed.
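The odds update described above can be written schematically as follows. This is a generic statement of Bayesian data fusion under an assumption of conditionally independent data sets, given here for orientation only; the symbols are introduced for illustration, and the exact weighting and combination used in Chapter 4 may differ in detail. For a gene pair (i, j), a given reference network, and data sets k = 1, ..., K:

```latex
\[
\ln O_{\text{post}}(i,j) \;=\; \ln O_{\text{prior}}(i,j) \;+\; \sum_{k=1}^{K} \mathrm{LLR}_k(i,j),
\qquad
P(i,j) \;=\; \frac{O_{\text{post}}(i,j)}{1 + O_{\text{post}}(i,j)}
\]
```

Here LLR_k(i, j) is the log-likelihood ratio score of data set k against the reference network if the pair is linked in that data set (and zero otherwise), and a link that is present in a reference network used as a prior enters through a large prior odds value.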

In the third phase of the project an integrated functional network was created using the integration method described above, together with data from BioGRID. This integrated functional network was evaluated against a gold standard derived by Myers and co-workers (Myers 2006). By comparing the Precision-Recall characteristics of the integrated functional network with those of a network created without including the reference networks as priors, it was concluded that including the reference networks as priors significantly improved the performance of the integrated functional network. To test the predictive capability of the integrated functional network, predictions of GO term annotations for genes / proteins on known pathways were made and verified. Predictions of GO term annotations for un-annotated genes / proteins were also made. Comparing these results with those from previous studies showed that the integrated functional network developed here was able to predict the function of more unknown genes / proteins at a higher level of detail than previously described methods.

In conclusion, it has been shown that the methods developed here result in a comprehensive integrated functional network for Saccharomyces cerevisiae, weighted using a principled statistical approach which incorporates the maximum available information and reduces bias in the resulting network. The network exhibits superior gene / protein function prediction to other interaction-based guilt-by-association methods previously published. Hence this project contributes to the field of Systems Biology by providing a set of improved algorithms and methods for the integration of interaction data from diverse sources and for the prediction of gene / protein function. Networks constructed using this approach can be further analysed in many ways to increase our understanding of the behaviour of genes and their products within cells.

6 Future Work

The methods developed here are suitable for creating integrated functional networks for a range of organisms, provided that some experimental and reference data exist. Both KEGG and GO contain data for many organisms, including humans. MIPS has data for a range of fungi and plants. If an alternative reference source is required, then it is recommended that the Principal Component Analysis in section 4.3 is performed in order to select the best set of reference data.


7 References

7.1 Literature

Antonov A. V., Tetko I. V., Mewes H. W., (2006) A systematic approach to infer biological relevance and biases of gene network structures Nucleic Acids Res. 34:1 e6

Ashburner M., (2000) Gene Ontology: tool for the unification of biology Nature Genetics 25 : 25-29

Bader G. et al., (2001) BIND—The Biomolecular Interaction Network Database Nucleic Acids Research 29(1):242-245

Barabási A-L., Oltvai Z. N., (2004) Network Biology: Understanding the cell's functional organization Nature Reviews Genetics 5 : 101-113

Bork P., Jensen L. J., von Mering C., Ramani A. K., Lee I., Marcotte E. M. (2004) Protein interaction networks from yeast to human Curr Opin Struct Biol 14(3) : 292-299

Bruggeman F. J., Westerhoff H. V., (2007) The nature of systems biology Trends in Microbiology 15 1:45-50

Camoglu O., Can T., Singh A. K. (2006) Integrating multi-attribute similarity networks for robust representation of the protein space, Bioinformatics 22(13):1585-1592

Cusick M. E., Klitgord N., Vidal M., Hill D. E., (2005) Interactome: gateway into systems biology Human Molecular Genetics 14 Review Issue 2

Deng M., Chen T., Sun F., (2003) An integrated probabilistic model for functional prediction of proteins RECOMB. 2003 April 10-13 2003 95-103, Berlin, Germany

Gavin A-C., et al (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes Nature 415 : 141-147

Güldener U et al., (2006) MPact: the MIPS protein interaction resource on yeast Nucleic Acids Research 34 : D436-D441

Hart G. T., Ramani A. K., Marcotte E. M. (2006) How complete are current yeast and human protein-interaction networks ? Genome Biology 7(11):120

Heckerman D., Geiger D., Chickering D. M., (1995) Learning Bayesian Networks: The Combination of Knowledge and Statistical Data Machine Learning 20 : 197-243

Hermjakob H., et al., (2004) IntAct: an open source molecular interaction database Nucleic Acids Research 32(Database issue):D452-D455

Ho Y., et al (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry Nature 415 : 180-183

Huh W-K., et al., (2003) Global analysis of protein localization in budding yeast Nature 425: 686-691

Hull D. et al., (2006) Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 34(Web Server issue):W729-W732

Huttenhower, C. and Troyanskaya, O. (2006) Bayesian data integration: a functional perspective. Comput. Syst. Bioinformatics 341-351


Hwang D., et al., (2005) A data integration methodology for systems biology PNAS 102 48:17296-17301

Ideker T., Galitski T., Hood L., (2001) A New Approach to Decoding Life : Systems Biology Ann Rev Genomics Human Genet 2 : 343-372

Ito T., et al., (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome PNAS 98(8) : 4569-4574

Jaimovich A., Elidan G., Margalit H., Friedman N., (2005) Towards an Integrated Protein-protein Interaction network Proceedings of RECOMB 2005

Jansen R., et al., (2003) A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data Science 302 5644:449-453

Jeong H., Tombor B., Albert R., Oltvai Z. N., Barabási A-L., (2000) The large-scale organization of metabolic networks Nature 407 : 651-654

Jeong H., Mason S. P., Barabási A.-L., Oltvai Z. N., (2001) Lethality and centrality in protein networks Nature 411 : 41-42

Jiang T., Keating A. E., (2005) AVID: An integrative framework for discovering functional relationships among proteins BMC Bioinformatics 6:136

Joyce A. R., Palsson B. O., (2006) The model organism as a system: integrating 'omics data sets Nature Reviews Molecular Cell Biology 7 : 198-210

Kanehisa M., Goto S., (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes Nucleic Acids Research 28(1) : 27-30

Kanehisa, M., et al., (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 34 : D354-357

Kiemer L., Costa S., Ueffing M., Cesareni G., (2007) WI-PHI: A weighted yeast interactome enriched for direct physical interactions Proteomics 7 932-943

Kitano H. (2002) Computational Systems Biology Nature 420 206-210

Kitano H. et al (2002) Systems Biology: A brief overview Science 295 : 1662-1664

Koks D., Challa S., (2005) An Introduction to Bayesian and Dempster-Shafer Data Fusion DSTO, Australia DSTO-TR-1436

Lee I., Date S. V., Adai A. T., Marcotte E. M., (2004) A Probabilistic Functional Network of Yeast Genes Science 306(5701) : 1555-1558

Lu J. L., Xia Y., Paccanaro A., Yu H., Gerstein M., (2005) Assessing the limits of genomic data integration for predicting protein networks Genome Res. 15 945-953

Mewes H. W., et al., (1999) MIPS: a database for genomes and protein sequences Nucleic Acids Research 27(1) : 44-48

Mukherjee S., Bal S., Saha P., (2001) Protein interaction maps using yeast two-hybrid assay Current Science 81(5):458-464

Myers, C.L. et al. (2005) Discovery of biological networks from diverse functional genomic data. Genome Biology 6 13:R114

Myers, C.L., Barrett, D.R., Hibbs, M.A., Huttenhower, C. & Troyanskaya, O.G. (2006) Finding function: an evaluation framework for functional genomics, BMC Genomics 7:187


Oinn T. et al., (2004) Taverna: a tool for the composition and enactment of bioinformatics workflows, Bioinformatics 20(17) : 3045 - 3054

Oinn T., et al., (2005) Taverna: Lessons in creating a workflow environment for the life sciences Concurrency and Computation: Practice and Experience, 18(10):1067-1100, Grid Workflow Special Issue, August 2005 (Online version December 2005)

Oliver S., (2000) Guilt-by-association goes global Nature 403 601-603

Przulj N., Wigle D. A., Jurisica I., (2004) Functional topology in a network of protein interactions Bioinformatics 20 3 : 340-348

Reguly T., et al., (2006) Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae J. Biology 5 4:11

Rhodes D. R. et al., (2005) Probabilistic model of the human protein-protein interaction network Nature Biotech. 23 8:951

Ruepp A., et al., (2004) The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes Nucleic Acids Res. 32(18) : 5539-5545

Stark C., et al., (2006) BioGRID: a general repository for interaction datasets Nucleic Acids Research 34(Database Issue) : D535-D539

Searls D., (2005) Data integration: challenges for drug discovery Nature Reviews Drug Discovery 4 : 45-58

Spellman P. T., et al., (1998) Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization Mol. Bio. Cell 9 12:3273-3297

Shlens J. (2005) A tutorial on Principal Component Analysis, Carnegie Mellon Univ., School of Computer Science, http://www.cs.cmu.edu/~elaw/papers/pca.pdf (last accessed 16/08/07).

Srinivasan B. S., Novak A. F., Flannick J. A., Batzoglou S., McAdams H. H., (2006) Integrated Protein Interaction Networks for 11 Microbes. RECOMB 2006: 1-14


Tong A. H. Y., et al., (2002) A Combined Experimental and Computational Strategy to Define Protein Interaction Networks for Peptide Recognition Modules Science 295 (5553) : 321-324

Tong A. H. Y., et al., (2001) Systematic Genetic Analysis with Ordered Arrays of Yeast Deletion Mutants Science 294 (5550) : 2364-2368

Tyers M., Mann M., (2003) From genomics to proteomics Nature 422 : 193 - 197

Uetz P., et al., (2000) A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae Nature 403 : 623-627

Ulitsky I., Shamir R., (2007) Identification of functional modules using network topology and high-throughput data BMC Systems Biology 1 : 8

von Mering C., et al., (2002) Comparative assessment of large-scale data sets of protein–protein interactions Nature 417 : 399-403


Wilkinson D. J., (2007) Bayesian methods in bioinformatics and computational systems biology Briefings in Bioinformatics Advance Access 12 April 07

Xenarios I., et al., (2002) DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions Nucleic Acids Research 30(1):303-305

Yamanishi Y., Vert J. P., Kanehisa M., (2004) Protein network inference from multiple genomic data: a supervised approach. Bioinformatics 20(Suppl 1):I363-I370

Yellaboina S., Goyal K., Mande S. C., (2007) Inferring genome-wide functional linkages in E. coli by combining improved genome context methods: Comparison with high-throughput experimental data, Genome Res. Mar 5 2007


7.2 URLs

[Bayesian Inference Def] Description of Bayesian Inference – Wikipedia, http://en.wikipedia.org/wiki/Bayesian_inference (last accessed: 25/08/07)

[BioGRID] BioGRID database, http://www.thebiogrid.org/ (last accessed: 16/08/07)

[BIND] Biomolecular Interaction Network Database, http://bond.unleashedinformatics.com/ (last accessed: 24/08/07)

[DIP] Database of Interacting Proteins, http://dip.doe-mbi.ucla.edu/dip/Download.cgi (last accessed: 24/08/07)

[GO] Gene Ontology project, http://www.geneontology.org/ (last accessed: 16/08/07)

[IntAct] IntAct interaction database, http://www.ebi.ac.uk/intact/site/index.jsf (last accessed: 24/08/07)

[KEGG] Kyoto Encyclopedia of Genes and Genomes, http://www.genome.ad.jp/kegg/ (last accessed: 16/08/07)

[Lee Data] Supplementary Data from Lee (2004), http://www.sciencemag.org/cgi/content/full/sci;306/5701/1555/DC1 (last accessed: 25/08/07)

[MATLAB] MATLAB (High Level Language), http://www.mathworks.com (last accessed: 16/08/07)

[MATLAB svd] Singular Value Decomposition function, http://www.mathworks.com/access/helpdesk/help/techdoc/ref/svd.html (last accessed: 16/08/07)

[MIPS] MIPS database, http://mips.gsf.de/genre/proj/yeast/index.jsp (last accessed: 16/08/07)

[MIPS FTP] MIPS Experimental Data Download, ftp://ftpmips.gsf.de/yeast/PPI (last accessed: 16/08/07)

[Prior Def] Prior probability definition example, http://cancerweb.ncl.ac.uk/cgi-bin/omd?prior+probability (last accessed: 17/08/07)

[ROC] ROC Curves – Wikipedia, http://en.wikipedia.org/wiki/Receiver_Operating_Characteristic (last accessed: 20/08/07)

[SGD] Saccharomyces Genome Database, http://www.yeastgenome.org/ (last accessed: 16/08/07)

[SGD Glos] SGD Glossary of Experimental Terms, http://www.yeastgenome.org/help/glossary.html (last accessed: 16/08/07)

[SMD] Stanford Microarray Database, http://genome-www5.stanford.edu/ (last accessed: 16/08/07)

[Uniprot] Universal Protein Resource, http://www.pir.uniprot.org/ (last accessed: 22/08/07)


8 Appendix

8.1 BioGRID Networks by Study and Small Scale Studies

Study                    ExperimentType            Number of Genes   Number of Interactions
Krogan nj et al.         affinity capture-ms       2700              8138
Hazbun tr et al.         affinity capture-ms       143               151
Gavin ac et al.          affinity capture-ms       1708              10992
Ho y et al.              affinity capture-ms       1563              3666
Lindstrom dl et al.      affinity capture-ms       93                109
Nissan ta et al.         affinity capture-ms       77                162
Grandi p et al.          affinity capture-ms       90                509
Ohi md et al.            affinity capture-ms       45                161
Sanders sl et al.        affinity capture-ms       194               536
Saveanu c et al.         affinity capture-ms       64                133
Baetz kk et al.          affinity capture-ms       26                122
Panse vg et al.          affinity capture-ms       117               117
Frazier ae et al.        affinity capture-ms       63                102
Allen np et al.          affinity capture-ms       54                170
Collins sr et al.        affinity capture-ms       1620              9064
Hannich jt et al.        affinity capture-ms       146               145
Zhao r et al.            affinity capture-ms       127               131
Graumann j et al.        affinity capture-ms       317               478
other                    affinity capture-ms       1345              3881
other                    affinity capture-rna      44                57
other                    affinity capture-western  1989              6553
Ubersax ja et al.        biochemical activity      188               369
Ptacek j et al.          biochemical activity      1359              4183
other                    biochemical activity      448               741
other                    co-crystal structure      118               134
other                    co-fractionation          443               470
other                    co-localization           262               309
Stevens sw et al.        co-purification           98                118
other                    co-purification           763               1302
other                    dosage growth defect      63                44
other                    dosage lethality          394               427
other                    dosage rescue             1689              3181
other                    far western               53                43
other                    fret                      35                61
Collins sr et al.        phenotypic enhancement    716               11606
Schuldiner m et al.      phenotypic enhancement    391               4043
other                    phenotypic enhancement    1126              1970
Collins sr et al.        phenotypic suppression    582               2815
Schuldiner m et al.      phenotypic suppression    301               1038
other                    phenotypic suppression    652               747
other                    protein-peptide           105               107
Measday v et al.         protein-rna               85                175
Pan x et al.             protein-rna               770               4452
other                    protein-rna               1109              1658
other                    protein-rna               17                10
other                    reconstituted complex     1139              2155
Davierwala ap et al.     synthetic lethality       299               564
Krogan nj et al.         synthetic lethality       200               954
Tong ah et al.           synthetic lethality       918               4211
Pan x et al.             synthetic lethality       402               1190
Finger fp et al.         synthetic lethality       19                133
Daniel ja et al.         synthetic lethality       128               214
Kong se et al.           synthetic lethality       30                112
Zhao r et al.            synthetic lethality       273               272
Lesage g et al.          synthetic lethality       228               497
Loeillet s et al.        synthetic lethality       69                127
Milgrom e et al.         synthetic lethality       108               107
other                    synthetic lethality       1587              3876
other                    synthetic rescue          1200              2065
Ito t et al.             two-hybrid                778               786
Tong ah et al.           two-hybrid                144               232
Uetz p et al.            two-hybrid                926               875
Miller jp et al.         two-hybrid                535               1977
Fromont-racine m et al.  two-hybrid                304               433
Drees bl et al.          two-hybrid                108               205
Newman jr et al.         two-hybrid                78                177
Millson sh et al.        two-hybrid                147               160
other                    two-hybrid                1789              3836

Table 8-25: BioGRID data sets by study and small scale study


8.2 Taverna Workflows used to create KEGG Networks

Two workflows to access the KEGG database through its webservice interface were written using Taverna (Hull 2006, Oinn 2004, Oinn 2005) as shown below.

Workflow to create KEGG Pathways network

Workflow to create the Direct KEGG network

In these workflows the boxes are functions, and the colour of each box represents the type of webservice / program:
Green boxes – webservices to the KEGG database provided by DDBJ.
Purple boxes – in-built Taverna local Java services.
Orange boxes – user defined Java Bean Shell Scripts.

The custom written Java Bean Shell Scripts are:
Remove_Prefix – removes the prefix ‘sce:’ from the list of gene names output from get_genes_by_enzymes.
List_Interactions – converts each input list of gene names into a tab delimited list of interactions in the form GeneA GeneB.
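The effect of the two custom scripts can be illustrated with the following sketch. It is written in MATLAB rather than in the Java Bean Shell of the actual workflow, the gene names are placeholders, and it assumes that List_Interactions emits every unordered pair of genes in a list, which is not stated explicitly above.

```matlab
% Placeholder gene list, standing in for one list returned by the KEGG
% web service (not a real response).
genes = {'sce:GeneA', 'sce:GeneB', 'sce:GeneC'};

% Remove_Prefix step: strip the leading 'sce:' organism prefix.
genes = regexprep(genes, '^sce:', '');

% List_Interactions step, assumed here to mean all unordered gene pairs.
pairs = nchoosek(1:numel(genes), 2);
rows = cell(size(pairs, 1), 1);
for k = 1:size(pairs, 1)
    rows{k} = sprintf('%s\t%s', genes{pairs(k, 1)}, genes{pairs(k, 2)});
end

% One tab delimited interaction per line.
fprintf('%s\n', rows{:});
```

For the three placeholder genes this gives three interactions: GeneA with GeneB, GeneA with GeneC, and GeneB with GeneC.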


8.3 Lee Log Likelihood Score

The log-likelihood ratio score used in this project is:

The log-likelihood ratio score used by Lee et al. (Lee 2004) is :

Where the terms map to the terms used in 4.2.1.2 as follows:

Lee Term   Symbol   Meaning                     Lee meaning
P(L|E)     n1,1     Number of true positives    “frequency of linkages L observed in Experiment E between genes in the same pathway”
~P(L|E)    n1,0     Number of false positives   “frequency of linkages L observed in Experiment E between genes in different pathways”
-          n0,1     Number of false negatives   -
-          n0,0     Number of true negatives    -
P(L)       n1       Number of positives         “total frequency of linkages in the same pathway”
~P(L)      n0       Number of negatives         “total frequency of linkages in different pathways”
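In the symbols of the mapping above, the two scores can be written as follows. The first expression follows the comments in the slogL.m listing of section 8.4.1, which give L = (n1,1 / n1,0) / (n0,1 / n0,0). The second is the log-likelihood score as commonly stated for Lee et al. (Lee 2004), with its translation into counts obtained by applying the mapping term by term. Both are restatements under these assumptions rather than quotations of the equations in section 4.2.1.2.

```latex
\[
LL_{\text{project}}
  = \ln\!\left(\frac{n_{1,1} / n_{1,0}}{n_{0,1} / n_{0,0}}\right)
  = \ln\!\left(\frac{n_{1,1}\, n_{0,0}}{n_{1,0}\, n_{0,1}}\right)
\]
\[
LLS_{\text{Lee}}
  = \ln\!\left(\frac{P(L \mid E) \,/\, {\sim}P(L \mid E)}{P(L) \,/\, {\sim}P(L)}\right)
  \;\longleftrightarrow\;
  \ln\!\left(\frac{n_{1,1} / n_{1,0}}{n_{1} / n_{0}}\right)
\]
```

The two forms share the numerator n1,1 / n1,0 and differ in the denominator: the project score uses the links of the gold standard that the data set misses (n0,1 / n0,0), whereas the Lee score uses the overall totals (n1 / n0).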

8.4 Algorithm Implementation

8.4.1 Log-Likelihood Ratio Scores

To calculate the log-likelihood ratio score between a data network (D) and a gold standard network (G), or in fact any pair of networks, the following quantities are required:

Symbol   Meaning                                        Name
n1,1     number of links in both D and G                True Positives (TP)
n1,0     number of links in D that are not in G         False Positives (FP)
n0,1     number of links in G that are not in D         False Negatives (FN)
n0,0     number of links present in neither D nor G     True Negatives (TN)
N1       total number of positives in G = n1,1 + n0,1   Number of positives (NP)
N0       total number of negatives in G = n0,0 + n1,0   Number of negatives (NN)

In MATLAB the counts n1,1, n1,0, n0,1 and n0,0 can be calculated efficiently by representing the data D and the gold standard G as sparse adjacency matrices and calculating (by element-wise matrix addition) M = 2*D + G. n1,1 is then just the number of elements in the matrix M with the value 3, n1,0 is the number of elements with the value 2, n0,1 is the number of elements with the value 1 and n0,0 is the number of elements with the value 0. Note that calculating n0,0 directly is very inefficient when the MATLAB sparse matrix format is used, so the calculation n0,0 = total possible links - n1,1 - n1,0 - n0,1 is used instead.
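The counting scheme can be checked on a toy example. The sketch below uses two made-up 4-gene networks (not thesis data) and nnz in place of the find/size idiom of the listing. As in the listing, the counts are over all ordered entries of the matrix, so each undirected link is counted twice and the diagonal is included in the total.

```matlab
% Toy symmetric 0/1 adjacency matrices for 4 genes (assumed values)
D = sparse([1 2 1 3], [2 1 3 1], 1, 4, 4);   % data links: 1-2 and 1-3
G = sparse([1 2 2 3], [2 1 3 2], 1, 4, 4);   % gold standard links: 1-2 and 2-3

M = 2*D + G;

n11 = nnz(M == 3);            % linked in D and in G
n10 = nnz(M == 2);            % linked in D only
n01 = nnz(M == 1);            % linked in G only
n00 = numel(M) - nnz(M);      % linked in neither, obtained by subtraction

fprintf('n11=%d n10=%d n01=%d n00=%d\n', n11, n10, n01, n00);
```

Obtaining n00 by subtraction touches only the stored non-zero entries, whereas testing M == 0 would produce a nearly full logical matrix.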

Listing for MATLAB function slogL.m

% function to calculate logL for sparse matrices
% S. J. Lycett
% 29 April 07
% 17 July 07

% function to calculate log likelihood ratio
% inputs:
% D = data adjacency matrix
% G = gold standard, the 'truth', assume 0's and 1's only

% L1 = likelihood for H1, genes linked
% L0 = likelihood for H0, genes not linked
% L = L1 / L0 = (n1,1 / n1,0) / (n0,1 / n0,0)
% L = (n1,1 x n0,0) / (n1,0 x n0,1)
% n1,1 = number of gene-pairs linked in D, and linked in G
% n1,0 = number of gene-pairs linked in D, but not linked in G
% n0,1 = number of gene-pairs not linked in D, but linked G
% n0,0 = number of gene-pairs not linked in D, and not linked in G

% this function is OK with big sparse matrices

function [LL,L1,L0,n11,n10,n01,n00]=slogL(D, G, maxVal)

if (nargin < 3)
    maxVal = 10;
end

DG = (2*D)+G;
% DG = 0 -> not D and not G = n00
%        - do not calculate n00 directly !
% DG = 1 -> not D and G = n01
% DG = 2 -> D and not G = n10
% DG = 3 -> D and G = n11

inds = find( DG == 3 );
n11 = size(inds,1);

inds = find( DG == 2);
n10 = size(inds, 1);

inds = find( DG == 1 );
n01 = size(inds, 1);

inds = find( DG >= 1);
n00 = ( size(DG, 1)^2 ) - size(inds, 1);

if ( (n11 > 0) & (n00 > 0) )
    logn11 = log(n11);
    logn00 = log(n00);
else
    if (n11 > 0)
        logn11 = log(n11);
        logn00 = 0;
    elseif (n00 > 0)
        logn11 = -maxVal;
        logn00 = log(n00);
    else
        logn11 = -maxVal;
        logn00 = -maxVal;
    end
end

if ( (n01 > 0) & (n10 > 0) )
    logn10 = log(n10);
    logn01 = log(n01);
else
    if (n01 > 0)
        % if some links in G that are not in D,
        % but no links in D that are not also in G (False Negs)
        logn10 = 0;
        logn01 = log(n01);
    elseif (n10 > 0)
        % if some links in D that are not in G
        % but no links in G that are not also in D (False Pos)
        logn10 = log(n10);
        logn01 = 0;
    else
        % no false negs or pos
        logn10 = -maxVal;
        logn01 = -maxVal;
    end
end

N1 = n11 + n01;
N0 = n00 + n10;

L1 = (logn11 - log(N1)) - (logn10 - log(N0));
L0 = (logn01 - log(N1)) - (logn00 - log(N0));
LL = logn11 + logn00 - logn10 - logn01;

8.4.2 Network integration with respect to a reference network

Listing for the MATLAB function networkIntegration.m

% sub-network integration (integration given one reference network)
% S. J. Lycett
% 15 July 07
% 20 July 07

% inputs
% data network      : di, dj, ds
% reference network : ri, rj

% optional inputs
% maxVal : number of nodes in final network
%          set to 6674 for S. cerevisiae
% incRef : log likelihood ratio for including reference
%          network as a prior, default = 1
%          set to zero if don't want to include
% infL   : value to use instead of infinite log likelihood
%          ratio in data log likelihood ratio calc.
%          Default = 20

% outputs
% PP : posterior probability sparse adjacency matrix
%      (the integrated data)

% optional outputs
% LL : vector of log likelihood ratios
%      one for each data type (link code)

% Format for input networks is 'list format'
% di, ri = 1D array of genes A indices
% dj, rj = 1D array of genes B indices
% ds     = 1D array of link codes for data

% Format for output network is MATLAB sparse matrix

function [PP, LL, ln_c] = networkIntegration(di, dj, ds, ri, rj, maxVal, incRef, infL)

if (nargin < 6)
    maxVal = max( [ max(di), max(dj), max(ri), max(rj) ] );
end
if (nargin < 7)
    incRef = 1;
end
if (nargin < 8)
    infL = 20;
end

du = unique(ds, 1);
numData = size(du, 2);
LL = zeros(numData, 1);
PP = sparse(maxVal, maxVal);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Step 1 & 2 : calculate log likelihood ratios for data against
% reference and sum log likelihood ratios
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
for (k = 1 : numData)
    % get individual data set
    k_inds = (ds == du(k) );
    di_k = di( k_inds );
    dj_k = dj( k_inds );

    % make adj matrix for data and reference over common nodes only
    [dataNet, refNet] = reduceMatrices(di_k, dj_k, ri, rj);

    % Step 1
    % calculate log likelihood score
    LL(k) = slogL(dataNet, refNet);
    if ( LL(k) == Inf )
        LL(k) = infL;
    elseif ( LL(k) == -Inf )
        LL(k) = -infL;
    elseif ( ~isfinite( LL(k) ) )
        LL(k) = 0;
    end

    % Step 2
    % add scored data to integrated network
    % weights are log L at moment
    % make sure have undirected data links
    PP = PP + ( makeUMat( di_k, dj_k, maxVal ) * LL( k ) );

    clear di_k;
    clear dj_k;
    clear ds_k;
    clear k_inds;
    clear dataNet;
    clear refNet;
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Step 3 : add the prior
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if ( incRef > 0 )
    % include actual reference network in integrated network
    % as a prior
    % add reference network to likelihood ratios
    % don't want to add infinities so assume that likelihood ratio
    % of ref network is a large value
    % add the reference network as a prior to the integrated data
    % make sure have undirected reference links
    refNet = makeUMat(ri, rj, maxVal);
    PP = PP + (refNet * infL);
    clear refNet;
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Step 4 : convert log posterior odds to posterior probability
% and make posterior probability adj matrix
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% extract the ppi, ppj, pps
[ppi, ppj, pps] = find(PP);

% now pps = sum log likelihood ratios + log prior
%         = log posterior odds
% convert pps to posterior probability
pps = exp(pps);
pps = pps ./ (1 + pps);

% make posterior probability adjacency matrix
PP = sparse(ppi, ppj, pps, maxVal, maxVal);
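Step 4 of the listing converts log posterior odds to a probability as odds / (1 + odds). The sketch below applies that conversion to three made-up log-odds values (not thesis data) and compares it with the algebraically equivalent logistic form. The logistic form is shown only as a possible alternative: it avoids the Inf/Inf = NaN result that the odds form gives for very large log odds, which is unlikely to arise with the default infL = 20 and a handful of data types.

```matlab
% Toy log posterior odds for three links (assumed values)
logOdds = [log(3); 0; -log(3)];

% Form used in the listing: odds / (1 + odds)
odds = exp(logOdds);
pListing = odds ./ (1 + odds);

% Equivalent logistic form, stable for very large positive log odds
pLogistic = 1 ./ (1 + exp(-logOdds));

disp([pListing, pLogistic]);
```

Log odds of ln 3, 0 and -ln 3 correspond to probabilities of approximately 0.75, 0.5 and 0.25.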


8.4.3 Final Integration

Listing for whiteningIntegration.m

% integration of integrations vs references
% S. J. Lycett
% 16 July 07
% 27 July 07

% Functional link if
% Linked in wrt Kegg, CC, MF or Enzyme
% Not linked if not linked in any of these

% p(linked) = 1 - (1-pKegg)(1-pCC)(1-pMF)(1-pEnzyme)

% Inputs - all MATLAB sparse matrices
% posterior probabilities from networkIntegration
% PP1 = Post prob wrt Kegg
% PP2 = Post prob wrt CC GO 5
% PP3 = Post prob wrt MF GO 5
% PP4 = Post prob wrt Enzyme

% Optional inputs (for text output)
% fname - the file name for the output file
% gname - the file name containing the systematic gene names
%         for each gene index. If gname is not supplied,
%         then the gene indices are used

% Output
% M = Final integrated probability adjacency matrix
%     (MATLAB sparse format)

function [M] = whiteningIntegration(PP1, PP2, PP3, PP4, fname, gname)

maxVal = size(PP1,1);
deg1 = sum(PP1);
deg2 = sum(PP2);
deg3 = sum(PP3);
deg4 = sum(PP4);
degAny = deg1 + deg2 + deg3 + deg4;
M = sparse(maxVal, maxVal);

if (nargin >= 5)
    fid = fopen(fname, 'w');
    header = sprintf( 'GeneA\tGeneB\tP(Kegg)\tP(CC)\tP(MF)\tP(Enzyme)\n');
    fprintf(fid, '%s', header );
end
if (nargin == 6)
    [geneNames] = loadGeneNames(gname);
end

for (a = 1 : maxVal)
    if ( mod(a, 100) == 0 )
        sprintf('Doing %d of %d',a,maxVal)
    end
    for (b = 1 : maxVal)
        p1 = full(PP1(a,b));
        p2 = full(PP2(a,b));
        p3 = full(PP3(a,b));
        p4 = full(PP4(a,b));
        if ( (p1 | p2 | p3 | p4) > 0 )
            if ( (p1 < 1) & (p2 < 1) & (p3 < 1) & (p4 < 1) )
                np1 = 1 - p1;
                np2 = 1 - p2;
                np3 = 1 - p3;
                np4 = 1 - p4;
                fp = 1 - (np1 * np2 * np3 * np4);
            else
                fp = 1;
            end
            M(a,b) = fp;
            if (nargin == 5)
                line = sprintf( ...
                    '%d\t%d\t%1.4f\t%1.4f\t%1.4f\t%1.4f\t%1.4f\n', ...
                    a, b, p1, p2, p3, p4, fp);
                fprintf(fid, '%s', line );
            end
            if (nargin == 6)
                geneAName = char(geneNames( a ));
                geneBName = char(geneNames( b ));
                line = sprintf( ...
                    '%s\t%s\t%1.4f\t%1.4f\t%1.4f\t%1.4f\t%1.4f\n', ...
                    geneAName, geneBName, p1, p2, p3, p4, fp);
                fprintf(fid, '%s', line );
            end
        end
    end
end

if (nargin >= 5)
    fclose(fid);
end
end
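The combination rule in the listing, p(linked) = 1 - (1-pKegg)(1-pCC)(1-pMF)(1-pEnzyme), is applied there one gene pair at a time in a double loop over all maxVal x maxVal pairs. The sketch below is a possible vectorised alternative, not the code used in this project: it applies the same rule only to pairs with a non-zero entry in at least one input matrix. The matrices are small made-up examples.

```matlab
% Toy posterior probability matrices for 3 genes (assumed values)
n = 3;
PP1 = sparse([1 2], [2 1], [0.5 0.5], n, n);
PP2 = sparse([1 2], [2 1], [0.2 0.2], n, n);
PP3 = sparse(n, n);
PP4 = sparse([1 2 2 3], [2 1 3 2], [0.1 0.1 1 1], n, n);

% Gene pairs with a non-zero probability in any of the four networks
[i, j] = find(PP1 + PP2 + PP3 + PP4);
idx = sub2ind([n n], i, j);

% Probability that the pair is not linked in any network
q = (1 - full(PP1(idx))) .* (1 - full(PP2(idx))) ...
    .* (1 - full(PP3(idx))) .* (1 - full(PP4(idx)));

% Final integrated probability, stored sparsely
M = sparse(i, j, 1 - q, n, n);
disp(full(M));
```

For the pair (1,2) this gives 1 - 0.5 x 0.8 x 1 x 0.9 = 0.64. For the pair (2,3), where one input probability is 1, the result is 1, which matches the special case handled explicitly in the listing.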
