hugh shanahan association talk


Upload: hugh-shanahan

Post on 25-May-2015


DESCRIPTION

Talk I gave at this year's CCC Summer School at Zhejiang University, Hangzhou

TRANSCRIPT

Page 1: Hugh Shanahan Association Talk

Associative methods in Systems Biology

Hugh Shanahan

Outline

Gene Ontologies
    Over-representation
    Semantic similarity

Associative Measures
    Hypotheses
    Linear Correlation
    Partial Correlation
    Non-linear measures

Validation
    DREAM

Associative methods in Systems Biology

Hugh Shanahan

Department of Computer Science
Royal Holloway, University of London

September 22, 2009

Hugh Shanahan Associative methods in Systems Biology

Page 2: Hugh Shanahan Association Talk


Outline

1 Outline

2 Gene Ontologies
    Over-representation
    Semantic similarity

3 Associative Measures
    Hypotheses
    Linear Correlation
    Partial Correlation
    Non-linear measures

4 Validation
    DREAM

Page 3: Hugh Shanahan Association Talk

Gene Ontologies

Before finding interactions, need to be able

    to systematically annotate all genes
    to determine which functional groupings are over-represented
    to measure objectively the “functional similarity” of two genes.

Gene Ontology (GO) is a means to do this.

Page 4: Hugh Shanahan Association Talk

Ontologies

Abstract method for expressing structured data.
Annotation of any gene can be expressed in terms of an increasingly accurate description, e.g.
    This gene is involved in transport.
    This gene is involved in vesicle-mediated transport.
    This gene is involved in vesicle fusion.
Genes may not have an accurate annotation, so the definition can stop higher up in this hierarchy.

Page 5: Hugh Shanahan Association Talk

More complexity required

Annotation is not a simple chain.
A single gene can have a very specific annotation, which comes from two (or more) more general descriptions.
Different types of annotation as well: location, biochemistry, part of organism expressed in, and so on.
An Ontology is a Directed Acyclic Graph (DAG), not a Tree (this means a lot to Graph Theorists).
Each node in the DAG is an annotation term.
Each “child” node can have more than one “parent” node.
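The DAG structure can be sketched with a plain dictionary that maps each term to its parents. This is only an illustrative toy: the term names come from the vesicle fusion example in the Rhee et al. figure, intermediate terms below the root are left out, and the names `PARENTS`, `ancestors` and `annotation_nodes` are hypothetical helpers, not part of any GO software.

```python
# Toy fragment of the Biological Process ontology, stored as child -> parents.
# Intermediate terms between the top-level terms and the root are omitted.
PARENTS = {
    "vesicle fusion": ["vesicle-mediated transport", "membrane fusion"],
    "vesicle-mediated transport": ["transport"],
    "membrane fusion": ["membrane organization and biogenesis"],
    "transport": ["biological process"],
    "membrane organization and biogenesis": ["biological process"],
    "biological process": [],  # root: no parents
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def annotation_nodes(term, parents=PARENTS):
    """A term plus all of its ancestors: the full set of nodes it implies."""
    return {term} | ancestors(term, parents)


print(sorted(annotation_nodes("vesicle fusion")))
```

Because "vesicle fusion" has two parents, walking upwards reaches the root along two different routes, which is exactly what a tree would not allow.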

Page 6: Hugh Shanahan Association Talk

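GOs visualised

[Figure 1 of Rhee et al. (2008). Panels a and b: a simple tree and a directed acyclic graph, each drawn with parent and child nodes. Panel c: the term "vesicle fusion", which is part_of "vesicle-mediated transport" (is_a "transport") and is_a "membrane fusion" (is_a "membrane organization and biogenesis"), below the root "biological process". Specificity and/or granularity increase from parent to child.]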

not been shown robustly. As of October 2007, there are over 16 million GO annotations. Strikingly, over 95% of these annotations are computationally derived and have not been manually curated; these are associated with the ‘inferred from electronic annotation’ (IEA) evidence code. Most of these annotations come from the GO annotation project at the European Bioinformatics Institute (GOA9). In addition to the GOA set, individual model organisms also have a substantial portion of annotations not derived

from direct experimental evidence (TABLE 2). Among the 27 organisms with more than 5,000 annotations, the portion of genes with at least one experimentally derived annotation varies widely from 1.1% to 90.9%. Although computational and indirectly derived annotations increase coverage significantly, they probably contain a higher portion of false positives. Researchers who use GO annotations should be cognizant of the differences between annotations associated with different evidence codes.

Figure 1 | Simple trees versus directed acyclic graphs. Boxes represent nodes and arrows represent edges. a | An example of a simple tree, in which each child has only one parent and the edges are directed, that is, there is a source (parent) and a destination (child) for each edge. b | A directed acyclic graph (DAG), in which each child can have one or more parents. The node with multiple parents is coloured red and the additional edge is coloured grey. c | An example of a node, vesicle fusion, in the biological process ontology with multiple parentage. The dashed edges indicate that there are other nodes not shown between the nodes and the root node (biological process). A root is a node with no incoming edges, and at least one leaf (also called a sink). A leaf node is a node with no outgoing edges, that is, a terminal node with no children (vesicle fusion). Similar to a simple tree, a DAG has directed edges and does not have cycles, that is, no path starts and ends at the same node, and will always have at least one root node. The depth of a node is the length of the longest path from the root to that node, whereas the height is the length of the longest path from that node to a leaf (ref. 41 in Rhee et al.). is_a and part_of are types of relationships that link the terms in the GO ontology. More information about the relationships between GO terms are found online (An Introduction to the Gene Ontology).

Table 1 | Evidence codes used by GO

Evidence code | Evidence code description | Source of evidence | Manually checked | Current number of annotations*
IDA | Inferred from direct assay | Experimental | Yes | 71,050
IEP | Inferred from expression pattern | Experimental | Yes | 4,598
IGI | Inferred from genetic interaction | Experimental | Yes | 8,311
IMP | Inferred from mutant phenotype | Experimental | Yes | 61,549
IPI | Inferred from physical interaction | Experimental | Yes | 17,043
ISS | Inferred from sequence or structural similarity | Computational | Yes | 196,643
RCA | Inferred from reviewed computational analysis | Computational | Yes | 103,792
IGC | Inferred from genomic context | Computational | Yes | 4
IEA | Inferred from electronic annotation | Computational | No | 15,687,382
IC | Inferred by curator | Indirectly derived from experimental or computational evidence made by a curator | Yes | 5,167
TAS | Traceable author statement | Indirectly derived from experimental or computational evidence made by the author of the published article | Yes | 44,564
NAS | Non-traceable author statement | No ‘source of evidence’ statement given | Yes | 25,656
ND | No biological data available | No information available | Yes | 132,192
NR | Not recorded | Unknown | Yes | 1,185

*October 2007 release

Rhee et al., Nature Reviews Genetics, 9, 510 (July 2008)

Page 7: Hugh Shanahan Association Talk

GOs visualised

http://amigo.geneontology.org/

Page 8: Hugh Shanahan Association Talk

Different types of Annotation

Typically, three distinct ontologies are used (overwhelmingly so):
    Cellular Component
    Biological Process
    Molecular Function
Many other ontologies have been constructed, e.g. Plant Organ for Arabidopsis.

Page 9: Hugh Shanahan Association Talk

Caveat

The annotation of most genes (90%) has been carried out computationally. The annotations usually work pretty well, though they tend not to be as accurate as those obtained by direct assay.

All annotations have an evidence code (e.g. IDA or IEA) associated with them, which indicates how much we can rely on them.

Increasingly, co-expression is being used as a means to annotate genes, so one should be careful not to use such annotations when validating co-expression results, otherwise the argument becomes circular !

Page 11: Hugh Shanahan Association Talk

Over-representation

One of the most useful tools to hand when one analyses micro-array data is to ask what functional groupings occur more often than one expects.

Notation
    N number of genes in the genome.
    n number of genes which have been found to be differentially expressed.
    m number of genes in the genome with a specific annotation.
    k number of genes which are differentially expressed with the same annotation.

Page 12: Hugh Shanahan Association Talk

Probabilities

One can derive the probability $P_k$ that k genes would be found by chance amongst n genes using the hypergeometric probability distribution and the above parameters.
For the record

$$ P_k = \frac{{}^{m}C_{k}\; {}^{N-m}C_{n-k}}{{}^{N}C_{n}} , \tag{1} $$

$$ {}^{N}C_{n} = \frac{N!}{(N-n)!\,n!} . \tag{2} $$
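Equation (1) can be evaluated directly with exact integer binomial coefficients. The sketch below is a minimal illustration: the function names and the example numbers are made up, and the upper-tail sum (k or more annotated genes) is the usual way of turning $P_k$ into an over-representation p-value rather than something stated on the slide.

```python
from math import comb


def prob_exactly_k(N, n, m, k):
    """Equation (1): probability that exactly k of the n selected genes carry the annotation."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


def over_representation_p_value(N, n, m, k):
    """Probability of seeing k or more annotated genes by chance (upper tail)."""
    upper = min(n, m)
    return sum(prob_exactly_k(N, n, m, j) for j in range(k, upper + 1))


# Hypothetical numbers, chosen only for illustration.
N, n, m, k = 20000, 200, 100, 8
print(prob_exactly_k(N, n, m, k))
print(over_representation_p_value(N, n, m, k))
```

With these numbers one expects about n·m/N = 1 annotated gene by chance, so observing 8 gives a very small tail probability.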

Page 13: Hugh Shanahan Association Talk

Difficulties

There are thousands of possible GO terms and one should adjust the probabilities to deal with multiple hypotheses.
Applying a Bonferroni correction using all GO terms gives a p-value of $10^{-7}$ as the equivalent of 1% significance.
Because of the structure of the GO terms these probabilities are highly correlated with each other.
In all these cases, focussing on as short a list of possible biological processes as possible will minimise the above difficulties.
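The Bonferroni correction simply divides the desired significance level by the number of hypotheses tested. A minimal sketch follows; the function names and the three example p-values are hypothetical. Note that a cut-off of 10^-7 at 1% significance corresponds to roughly 0.01 / 10^-7 = 100,000 tests.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cut-off that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def significant_terms(p_values, alpha=0.01):
    """Return the GO terms whose p-value survives the Bonferroni correction."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [term for term, p in p_values.items() if p <= cutoff]


# Hypothetical p-values for three terms; the cut-off here is 0.01 / 3.
example = {"transport": 0.002, "vesicle fusion": 0.02, "membrane fusion": 0.0001}
print(significant_terms(example))
print(bonferroni_threshold(0.01, 100_000))
```

Testing only a short, pre-selected list of terms keeps `n_tests` small, which is why the cut-off becomes far less severe in that case.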

Page 15: Hugh Shanahan Association Talk

What genes match

In benchmarking methods to infer interactions between gene products, we expect genes that interact to have similar GO terms, though perhaps not entirely the same.

Semantic Similarity is a means to measure how similar the annotations of two genes are (0 being no similarity, 1 meaning total similarity).

GO provides us with a means to do this in a quantitative fashion.

Page 16: Hugh Shanahan Association Talk

Simple implementation

Determine the ratio of the number of nodes two genes share with the total number of nodes they have between them.

$$ \mathrm{GOsim}_{UI} = \frac{\#\{N(G_1) \cap N(G_2)\}}{\#\{N(G_1) \cup N(G_2)\}} \tag{3} $$

$N(G_1)$ being the set of nodes associated with $G_1$'s annotation.

More elaborate methods are available.
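Equation (3) is a ratio of set sizes, so it maps directly onto Python sets. In this sketch the gene annotations are invented for illustration, each set is assumed to already contain the ancestor terms of the annotation, and the treatment of two unannotated genes (similarity 0) is an assumption.

```python
def go_sim_ui(nodes_g1, nodes_g2):
    """Equation (3): shared nodes divided by all nodes of the two genes."""
    union = nodes_g1 | nodes_g2
    if not union:
        return 0.0  # assumption: two unannotated genes are treated as dissimilar
    return len(nodes_g1 & nodes_g2) / len(union)


# Hypothetical annotations, each already expanded to include ancestor terms.
gene1 = {"biological process", "transport", "vesicle-mediated transport", "vesicle fusion"}
gene2 = {"biological process", "transport", "vesicle-mediated transport"}
gene3 = {"biological process", "membrane organization and biogenesis"}

print(go_sim_ui(gene1, gene2))  # 3 shared nodes out of 4
print(go_sim_ui(gene1, gene3))  # 1 shared node out of 5
```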

Page 18: Hugh Shanahan Association Talk

Motivation

Yesterday, encountered clustering.
Hypothesis 1 (weak) :- coexpression implies involvement in the same process.
Expand to many different experiments.
Hypothesis 2 (strong) :- the greater the level of association, the greater the chance of interaction.
Hypothesis 2 is often referred to as “guilt by association”.
Association may tell us about interactions between gene products. It does not tell us anything about the regulation mechanism.

Page 19: Hugh Shanahan Association Talk

http://www.arabidopsis.leeds.ac.uk/act/index.php

266841_at AT2G26150 heat shock transcription factor family protein; contains Pfam profile PF00447 (HSF-type DNA-binding domain)
260978_at AT1G53540 17.6 kDa class I small heat shock protein

Page 20: Hugh Shanahan Association Talk

What do we mean by association ?

Knowing something about the expression level of one gene (over many different experiments) means we know something about the expression level of the other.
Replotting the above

Page 22: Hugh Shanahan Association Talk

Linear Correlation
coexpression

Simplest form of association.
Assume that there is a linear relationship between genes.
Formally :-

$$ y_1 = a_{12} + c_{12}\, y_2 + \eta_{12} , \tag{4} $$

    $y_1$, $y_2$ are (log) expression levels.
    $\eta_{12}$ noise term.
    $a_{12}$, $c_{12}$ parameters to be determined.

But we’re not interested in that !
We are interested in asking how good a model is this for this pair of genes ?

Page 23: Hugh Shanahan Association Talk

Covariance

Can estimate how good the linear model is by computing

$$ E\big((y_1 - \bar{y}_1)(y_2 - \bar{y}_2)\big) , $$

where $\bar{y}_1, \bar{y}_2 = E(y_1), E(y_2)$ are the means of $y_1$ and $y_2$.
E means the expectation value of the above (think of it for now as taking the average over all the points in the previous figure).
Can prove to oneself (exercise) that the magnitude of the covariance is largest when $y_1$ can be perfectly expressed as a linear function of $y_2$.
The covariance is zero when there is no relationship at all between $y_1$ and $y_2$.
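The two limiting cases can be checked numerically. The sketch below uses NumPy and simulated data; the seed, sample size and coefficients are arbitrary choices, and `covariance` is a hypothetical helper that takes the expectation as a plain average over the points.

```python
import numpy as np


def covariance(y1, y2):
    """E((y1 - mean(y1)) * (y2 - mean(y2))), averaging over all points."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    return np.mean((y1 - y1.mean()) * (y2 - y2.mean()))


rng = np.random.default_rng(0)          # fixed seed, arbitrary choice
y2 = rng.normal(size=1000)
y1_linear = 1.5 + 2.0 * y2              # exact linear function of y2
y1_unrelated = rng.normal(size=1000)    # generated independently of y2

# For an exact linear relationship |covariance| reaches its upper bound,
# the product of the two standard deviations.
print(covariance(y1_linear, y2), y1_linear.std() * y2.std())
# For independent data the covariance is close to (not exactly) zero.
print(covariance(y1_unrelated, y2))
```

With a finite number of points the second value is only approximately zero, which already hints at the estimation problems that come with limited data.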

Page 24: Hugh Shanahan Association Talk

[Figure: two scatter plots of y2 against y1, one titled “Maximum covariance” and the other “Zero covariance”.]

Page 25: Hugh Shanahan Association Talk

Correlation

We want to compare every possible pair of genes, so using the covariance is not very practical since the maximum covariance will vary from one pair of genes to another.
However,

$$ \rho_{12} = \frac{E\big((y_1 - \bar{y}_1)(y_2 - \bar{y}_2)\big)}{\sqrt{E\big((y_1 - \bar{y}_1)^2\big)\, E\big((y_2 - \bar{y}_2)^2\big)}} , \tag{5} $$

is bounded: $-1 \le \rho_{12} \le 1$.
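A short numerical check of equation (5) shows why the normalisation helps: the value does not change when the units of one gene are rescaled, so different pairs become comparable. The data here are simulated and the helper name `correlation` is hypothetical; `np.corrcoef` is NumPy's built-in Pearson correlation.

```python
import numpy as np


def correlation(y1, y2):
    """Equation (5): covariance divided by the product of standard deviations."""
    d1 = np.asarray(y1, dtype=float) - np.mean(y1)
    d2 = np.asarray(y2, dtype=float) - np.mean(y2)
    return np.mean(d1 * d2) / np.sqrt(np.mean(d1 ** 2) * np.mean(d2 ** 2))


rng = np.random.default_rng(1)               # fixed seed, arbitrary choice
y2 = rng.normal(size=200)
y1 = 0.5 * y2 + rng.normal(scale=0.5, size=200)

r = correlation(y1, y2)
r_rescaled = correlation(1000.0 * y1, y2)    # change the units of y1

print(r, r_rescaled)                         # unchanged by the rescaling
print(np.corrcoef(y1, y2)[0, 1])             # NumPy's built-in gives the same value
```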

Page 26: Hugh Shanahan Association Talk

How well does it work ?

A number of examples of improved functional annotation.
An unannotated gene which is highly correlated with a gene in a known response is likely to be in the same response.

Page 28: Hugh Shanahan Association Talk

Difficulty : genes correlate with many other genes, not just a few.
Why ?
Suggestion : correlations do not distinguish between potential direct interactions and indirect interactions between gene products.

Page 29: Hugh Shanahan Association Talk

Example

[Figure: network diagram of genes A, B, C, D, E and F, with a link to “other interactions” not shown in detail.]

B directly interacts with three other genes, but could be highly correlated with others.
C and D would be highly correlated with each other even though they are not directly interacting.

Page 30: Hugh Shanahan Association Talk

Artificial example: create randomised data to represent expression of B.
Generate two other sets of data (C and D) that are by construction highly correlated to the original data set, but are not set out to have a relationship with each other.

[Figure: three scatter plots of the artificial data. B against C: ρ = 0.98. B against D: ρ = 0.98. C against D: ρ = 0.96 (!!!)]
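An experiment of this kind is easy to repeat. The sketch below is one possible construction, not necessarily the one used for the slide: C and D are each taken to be B plus independent Gaussian noise, and the noise level (0.2), sample size and seed are assumptions chosen so that the correlations land near the quoted values.

```python
import numpy as np

rng = np.random.default_rng(42)     # fixed seed, arbitrary choice
n_points = 500                      # assumed sample size

b = rng.normal(size=n_points)                   # "expression" of B
c = b + 0.2 * rng.normal(size=n_points)         # C = B + independent noise
d = b + 0.2 * rng.normal(size=n_points)         # D = B + independent noise

r_bc = np.corrcoef(b, c)[0, 1]
r_bd = np.corrcoef(b, d)[0, 1]
r_cd = np.corrcoef(c, d)[0, 1]      # C and D are linked only through B

print(f"rho(B,C) = {r_bc:.2f}, rho(B,D) = {r_bd:.2f}, rho(C,D) = {r_cd:.2f}")
```

Under this construction the expected correlation of C with D is 1/1.04, about 0.96, even though neither was generated from the other.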

Page 31: Hugh Shanahan Association Talk

Extending correlations :- partial correlations

Correlations only take pairs of genes into consideration.
Partial correlations extend the initial pairwise regression model introduced in equation (4).

$$ y_1 = a_1 + b_{12}\, y_2 + b_{13}\, y_3 + \cdots + b_{1p}\, y_p + \eta_1 . \tag{6} $$

Again, we are not interested in solving this explicitly.
We want to understand the correlation that each one of the genes $y_2 \ldots y_p$ has with $y_1$ once we have removed the effect of all the other genes.
We will use the notation $PC_{ij}$ to refer to this partial correlation.

Page 32: Hugh Shanahan Association Talk

Derivation

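Computed easily once you have all the correlations between all the genes.

$$ R = \begin{pmatrix} 1 & \rho_{12} & \rho_{13} & \cdots \\ \rho_{12} & 1 & \rho_{23} & \cdots \\ \rho_{13} & \rho_{23} & 1 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} , \tag{7} $$

Covariance matrix C is defined similarly.
$\rho_{ij}$ is the correlation between gene i and gene j.

$$ PC_{ij} = -\frac{(R^{-1})_{ij}}{\sqrt{(R^{-1})_{ii}\,(R^{-1})_{jj}}} \tag{8} $$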

Page 33: Hugh Shanahan Association Talk

Questions

$\rho_{ij} = \rho_{ji}$ - why ?
Diagonals are 1 - why ?
Exercise :- compute PC using the covariance matrix.

Page 34: Hugh Shanahan Association Talk

Artificial example

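$$ R_{BCD} = \begin{pmatrix} 1.0 & 0.96 & 0.98 \\ 0.96 & 1.0 & 0.98 \\ 0.98 & 0.98 & 1.0 \end{pmatrix} , \tag{9} $$

$$ PC_{BCD} = \begin{pmatrix} -1.0 & -0.01 & 0.70 \\ -0.01 & -1.0 & 0.70 \\ 0.70 & 0.70 & -1.0 \end{pmatrix} . \tag{10} $$

Comparing with the correlations quoted for the artificial data (0.98 for B with C and for B with D, 0.96 for C with D), the rows and columns here appear to be ordered C, D, B: the C and D pair has a partial correlation of −0.01, while each keeps 0.70 with B.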
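The step from equation (9) to equation (10) is a matrix inversion followed by a rescaling, as in equation (8). A minimal NumPy sketch, with `partial_correlations` as a hypothetical helper name, reproduces the printed values to two decimal places.

```python
import numpy as np


def partial_correlations(R):
    """Equation (8): PC_ij = -(R^-1)_ij / sqrt((R^-1)_ii * (R^-1)_jj)."""
    precision = np.linalg.inv(np.asarray(R, dtype=float))
    scale = np.sqrt(np.diag(precision))
    return -precision / np.outer(scale, scale)


# Correlation matrix as printed in equation (9).
R = np.array([
    [1.00, 0.96, 0.98],
    [0.96, 1.00, 0.98],
    [0.98, 0.98, 1.00],
])

PC = partial_correlations(R)
print(np.round(PC, 2))
```

The diagonal entries of −1 are just what the formula returns for i = j and carry no information; only the off-diagonal entries are of interest.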

Page 35: Hugh Shanahan Association Talk

Disadvantages of using Partial correlations

High partial correlations no longer tend to go to 1 (or -1).
Dependent on ranking.
How large should/can p (the number of genes examined) be ?
Taking inverses of matrices should make us jumpy - especially when there is limited data.
Problem also dates to computing correlations.

Page 36: Hugh Shanahan Association Talk

“Large p, small n”

Notation

p :- number of variables (in this case, expression of a gene)
n :- number of measurements (total number of Affymetrix slides)

R has of the order $p^2$ ($p(p-1)/2$ to be exact) potentially interesting correlations.
Could be dealing with of the order $10{,}000^2$ variables.
Have at best a few thousand measurements per gene :- $n \sim 1000$.
If $p \ll n$, then the definition of equation (5) gives a robust estimate of all those correlations, but that is not where we are !
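The arithmetic behind these orders of magnitude is worth spelling out. In this small sketch the function name is hypothetical and the values of p and n are the round numbers quoted above.

```python
def n_pairwise_correlations(p):
    """Number of distinct off-diagonal entries of a p x p correlation matrix."""
    return p * (p - 1) // 2


p = 10_000      # order-of-magnitude number of genes
n = 1_000       # order-of-magnitude number of measurements per gene

print(n_pairwise_correlations(p))        # 49995000 distinct correlations
print(n_pairwise_correlations(p) / n)    # correlations to estimate per measurement
```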

Page 37: Hugh Shanahan Association Talk

Artificial example

[Figure: four panels plotting eigenvalue against index (0 to 100), for p/n = 0.1, 0.5, 2 and 10.]

Figure 1: Ordered eigenvalues of the sample covariance matrix S (thin black line) and that of an alternative estimator S* (fat green line, for definition see Tab. 1), calculated from simulated data with underlying p-variate normal distribution, for p = 100 and various ratios p/n. The true eigenvalues are indicated by a thin black dashed line.

Schäfer and Strimmer: Large-Scale Covariance Matrix Estimation. Published by The Berkeley Electronic Press, 2005.

Schäfer and Strimmer, Statistical Applications in Genetics and Molecular Biology, 4, 1, (2005)

Page 38: Hugh Shanahan Association Talk

Explanation

Spectrum of eigenvalues.
Any eigenvalue equal to zero means the matrix is non-invertible.
Dashed lines - actual eigenvalues.
Thin black lines - estimated eigenvalues using equation (5).
Green line - improved estimator.
In general, if n < p then the correlation/covariance matrix is non-invertible.
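The last point can be seen directly on simulated data. In the sketch below the dimensions and seed are arbitrary assumptions; it builds a sample covariance matrix from fewer measurements than variables and inspects its rank and smallest eigenvalues with NumPy.

```python
import numpy as np

rng = np.random.default_rng(7)      # fixed seed, arbitrary choice
p, n = 50, 20                       # more variables than measurements

data = rng.normal(size=(n, p))      # rows: measurements, columns: genes
S = np.cov(data, rowvar=False)      # p x p sample covariance matrix

eigenvalues = np.linalg.eigvalsh(S)         # ascending order, S is symmetric
rank = np.linalg.matrix_rank(S)

# After subtracting the mean, n measurements span at most n - 1 directions,
# so at least p - (n - 1) eigenvalues are zero up to rounding error.
print("rank:", rank, "out of", p)
print("smallest eigenvalues:", eigenvalues[:3])
```

A matrix with zero eigenvalues cannot be inverted, so equation (8) cannot be applied to it as it stands.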

Page 39: Hugh Shanahan Association Talk

Strategies

Reduce p - cluster data initially, then perform analysis on each cluster (Toh and Horimoto, 2002).
Compute lower order partial correlations - compute first order partial correlations (Magwene and Kim, 2004).
Employ improved estimator of correlations (Schäfer and Strimmer, 2005).
These options are not necessarily exclusive.

Page 40: Hugh Shanahan Association Talk

Shrinkage estimate - wordy explanation

What is computed in equation (5) is an estimate of the correlation based on the available data, not the actual correlation we would obtain if we knew the underlying multi-variate distribution. They would coincide if we had much greater statistics. That said, we can use other estimators of correlation. Statisticians have pointed out that many other possible estimators can be used which work better in the regime we are in (large p, small n).

Shrinkage estimates attempt to combine different naive estimates to get an improved estimate. The principle has been around for some time (Stein, 1956) though its use has increased significantly in the last ten years.

Page 41: Hugh Shanahan Association Talk

Shrinkage estimates - the details

Notation:

$C$ is the actual covariance matrix.
$\hat{C}$ is an estimate of the covariance matrix.
In computing $\hat{C}$ we could either attempt to compute it using the standard definition (“full”) or assume (for example) that all the off-diagonal entries are zero (“reduced”).
$\hat{C}_F$ for the “full” estimate of covariance matrix.
$\hat{C}_R$ for the “reduced” estimate of covariance matrix.

Page 42: Hugh Shanahan Association Talk

Mean Square Error

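Defining, for an estimate $\hat{C}$ of the actual covariance matrix $C$,

$$
\begin{aligned}
\mathrm{MSE}(\hat{C}) &= E\big((\hat{C} - C)^2\big), && (11)\\
&= \mathrm{Var}(\hat{C}) + \mathrm{Bias}^2(\hat{C}), && (12)\\
\mathrm{Var}(\hat{C}) &= E\big((\hat{C} - E(\hat{C}))^2\big), && (13)\\
\mathrm{Bias}(\hat{C}) &= E(\hat{C}) - C. && (14)
\end{aligned}
$$

(The expectation operator is over the data that we have.)

$\mathrm{Bias}(C_F)$ is small but $\mathrm{Var}(C_F)$ will be large.
$\mathrm{Var}(C_R)$ will be small but $\mathrm{Bias}(C_R)$ will be large.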
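The trade-off between the two estimates can be checked numerically. The following is a minimal simulation sketch, not part of the original talk: it assumes a known "true" covariance matrix, draws many small data sets from it, and sums the squared error over all matrix entries (an assumed way of turning the matrix expression into a single number). All sizes and the choice of true covariance are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, n_rep = 10, 8, 2000   # genes, samples per data set, simulated data sets

# Assumed "true" covariance: unit variances, all off-diagonal entries 0.5.
C_true = np.full((p, p), 0.5) + 0.5 * np.eye(p)

full = np.empty((n_rep, p, p))
reduced = np.empty((n_rep, p, p))
for k in range(n_rep):
    X = rng.multivariate_normal(np.zeros(p), C_true, size=n)
    S = np.cov(X, rowvar=False)        # "full" estimate
    full[k] = S
    reduced[k] = np.diag(np.diag(S))   # "reduced" estimate: off-diagonals set to zero

def decompose(estimates, truth):
    """Return (MSE, variance, squared bias), each summed over all matrix entries."""
    mse = ((estimates - truth) ** 2).mean(axis=0).sum()
    var = estimates.var(axis=0).sum()
    bias2 = ((estimates.mean(axis=0) - truth) ** 2).sum()
    return mse, var, bias2

mse_F, var_F, bias2_F = decompose(full, C_true)
mse_R, var_R, bias2_R = decompose(reduced, C_true)
print(f"full:    MSE={mse_F:.2f}  Var={var_F:.2f}  Bias^2={bias2_F:.2f}")
print(f"reduced: MSE={mse_R:.2f}  Var={var_R:.2f}  Bias^2={bias2_R:.2f}")
```

Under these assumptions the full estimate shows a small squared bias and a large variance, while the reduced estimate shows the opposite pattern; in both cases the MSE equals the variance plus the squared bias.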

Page 43: Hugh Shanahan Association Talk

The problem

Depending on the assumptions we use to estimate the correlation/covariance matrix,

we can either compute a very poor estimate of the parameters in a very accurate model,
or compute a good estimate of the parameters for a very inaccurate model (!)
But maybe we can reconcile the two...

Page 46: Hugh Shanahan Association Talk

Combining the two

One can combine these two estimates:

$$
\begin{aligned}
C^{*} &= \lambda C_R + (1 - \lambda)\, C_F, && (15)\\
0 &\le \lambda \le 1, && (16)
\end{aligned}
$$

choosing $\lambda$ such that $\mathrm{MSE}(C^{*})$ is minimised.

Computing $\lambda$ is normally very expensive.
Ledoit and Wolf (2003) came up with a short analytical way of computing $\lambda$; Schäfer and Strimmer modified this for genomic data.
We have an R package to do this.
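To make the combination concrete, here is a minimal Python sketch of shrinkage towards the diagonal ("zero-covariance") estimate. It is an illustration under stated assumptions, not the R package mentioned above: the function name `shrink_covariance` is hypothetical, the data are simulated, and the intensity is estimated with the analytic formula for a diagonal target, $\lambda = \sum_{i \ne j} \widehat{\mathrm{Var}}(s_{ij}) / \sum_{i \ne j} s_{ij}^2$, clipped to $[0, 1]$. The published estimator differs in detail (for instance, it is usually applied to correlations rather than covariances).

```python
import numpy as np

def shrink_covariance(X):
    """Shrink the sample covariance of X (n samples x p variables) towards its diagonal.

    Returns (C_star, lam), where C_star = lam * C_R + (1 - lam) * C_F.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # w[k, i, j]: contribution of sample k to the covariance of variables i and j
    w = Xc[:, :, None] * Xc[:, None, :]
    w_bar = w.mean(axis=0)
    C_F = n / (n - 1) * w_bar                                   # "full" estimate
    var_s = n / (n - 1) ** 3 * ((w - w_bar) ** 2).sum(axis=0)   # estimated Var(s_ij)
    off = ~np.eye(p, dtype=bool)                                # off-diagonal mask
    lam = float(np.clip(var_s[off].sum() / (C_F[off] ** 2).sum(), 0.0, 1.0))
    C_R = np.diag(np.diag(C_F))                                 # "reduced" estimate
    return lam * C_R + (1.0 - lam) * C_F, lam

# Hypothetical data: n = 8 samples of p = 20 genes sharing one common factor.
rng = np.random.default_rng(2)
factor = rng.normal(size=(8, 1))
X = factor + 0.5 * rng.normal(size=(8, 20))

C_star, lam = shrink_covariance(X)
C_F = np.cov(X, rowvar=False)
print("lambda =", round(lam, 3))
print("rank of full estimate:", np.linalg.matrix_rank(C_F), "of", X.shape[1])
print("smallest eigenvalue of shrunk estimate:", np.linalg.eigvalsh(C_star).min())
```

With fewer samples than genes the full estimate is rank deficient and cannot be inverted, whereas the shrunk estimate is positive definite whenever $\lambda > 0$, so partial correlations can be computed from it.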

Page 47: Hugh Shanahan Association Talk

Final Note

The Schäfer and Strimmer estimate uses the “zero-covariance” low-dimensional model for its estimate, but this isn’t necessarily the best choice.
Notably, while shrinkage estimates make much of incorporating information, they don’t explicitly include biological information.

Page 48: Hugh Shanahan Association Talk

Results

Using E. coli time series data (8 time slices), Schäfer and Strimmer examined 102 genes using correlations and partial correlations.

[Figure: two network diagrams over the 102 E. coli genes, labelled on the slide “Correlations (Relevance Network)” and “Partial Correlations (Graphical Gaussian Network)”. Reproduced from Figure 5 of Schäfer and Strimmer, “Large-Scale Covariance Matrix Estimation”, The Berkeley Electronic Press, 2005. Original caption: Gene networks inferred from the E. coli data by (a) the shrinkage GGM approach presented in this paper (Tab. 1), (b) the lasso GGM approach by Meinshausen and Bühlmann (2005), and (c) the relevance network with abs(r) > 0.8. Black and grey edges indicate positive and negative (partial) correlation, respectively.]

Page 49: Hugh Shanahan Association Talk

Results of comparison

The centrality of the sucA gene is recovered.
The lacA, lacZ and lacY genes have the largest absolute partial correlations.
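The difference between the two kinds of network being compared can be sketched in a few lines. This is an illustrative example on simulated data, with hypothetical function names: a relevance network keeps every pair whose absolute correlation exceeds a threshold (0.8, as in the figure caption), while a graphical Gaussian model is built from full-order partial correlations, obtained here by inverting a covariance matrix. The example assumes many more samples than variables so that the sample covariance is invertible; with few samples a shrinkage estimate would be passed in instead.

```python
import numpy as np

def relevance_network(X, threshold=0.8):
    """Edges (i, j), i < j, whose absolute Pearson correlation exceeds the threshold."""
    r = np.corrcoef(X, rowvar=False)
    i, j = np.triu_indices_from(r, k=1)
    keep = np.abs(r[i, j]) > threshold
    return list(zip(i[keep].tolist(), j[keep].tolist()))

def partial_correlations(C):
    """Full-order partial correlations from an invertible covariance matrix C."""
    omega = np.linalg.inv(C)
    d = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Simulated chain a -> b -> c: a and c are related only through b.
rng = np.random.default_rng(5)
a = rng.normal(size=500)
b = a + 0.3 * rng.normal(size=500)
c = b + 0.3 * rng.normal(size=500)
X = np.column_stack([a, b, c])

edges = relevance_network(X)
pcor = partial_correlations(np.cov(X, rowvar=False))
print("relevance network edges:", edges)
print("partial correlations:\n", np.round(pcor, 2))
```

In this toy chain the relevance network links a and c directly, whereas their partial correlation is close to zero, so a network based on partial correlations would keep only the two direct links.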

Page 51: Hugh Shanahan Association Talk

So far, we have concerned ourselves with linear relationships.
However, such an approximation may not be valid.
Naively, one expects a more non-linear relationship between gene products.
For example, Transcription Factor - target interactions are typically modelled using Michaelis-Menten kinetics.
Furthermore, expression levels are derived after a number of transformations.

Page 52: Hugh Shanahan Association Talk

One approach: Spearman correlation

Basic idea: use ranks rather than raw data.
Use nearly the same definition as linear (Pearson) correlation, applied to the ranks.
Must be careful about ties, i.e. raw data having precisely the same value (unlikely for floating point data).
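A minimal sketch of this idea follows. The helper names are hypothetical, ties are handled by giving tied values the average of the ranks they occupy (one common convention), and the saturating curve used as test data is an assumed Michaelis-Menten-like response chosen only for illustration.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1; tied values share the mean of the ranks they occupy."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Pearson correlation applied to the tie-averaged ranks of x and y."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]

# Hypothetical saturating response: monotone but far from linear.
x = np.linspace(0.0, 50.0, 40)
y = x / (1.0 + x)   # Michaelis-Menten-like curve with an assumed constant of 1
pearson = np.corrcoef(x, y)[0, 1]
rho = spearman(x, y)
print(f"Pearson: {pearson:.2f}, Spearman: {rho:.2f}")
print("ranks with a tie:", average_ranks([10, 20, 20, 30]))
```

Because the curve is strictly increasing, the ranks of x and y agree exactly and the Spearman correlation is 1, while the Pearson correlation is noticeably lower.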

Page 53: Hugh Shanahan Association Talk

Comparison

Comparison of different measures.
Many other methods for non-linear measures are possible, the best known being Mutual Information.
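As a rough illustration of why Mutual Information is attractive, the sketch below uses a simple histogram ("plug-in") estimate, which is only one of several possible estimators and is known to be biased for small samples. The function name, the bin count and the simulated data are assumptions for this example. The dependence y = x² is chosen because it is neither linear nor monotone, so correlation-based measures miss it.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Plug-in estimate of mutual information (in nats) from a 2-D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nonzero = p_xy > 0
    ratio = p_xy[nonzero] / (p_x @ p_y)[nonzero]
    return float((p_xy[nonzero] * np.log(ratio)).sum())

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=5000)
y_dependent = x**2 + 0.05 * rng.normal(size=5000)   # non-monotone dependence
y_independent = rng.normal(size=5000)

r_dep = np.corrcoef(x, y_dependent)[0, 1]
mi_dep = mutual_information(x, y_dependent)
mi_ind = mutual_information(x, y_independent)
print(f"Pearson (dependent): {r_dep:.3f}")
print(f"MI (dependent): {mi_dep:.3f}   MI (independent): {mi_ind:.3f}")
```

Here the Pearson correlation is close to zero even though y is almost a function of x, while the mutual information is clearly larger for the dependent pair than for the independent one.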

Page 55: Hugh Shanahan Association Talk

Modelling of gene interactions

We have only just touched upon methods for inferring interactions between gene products using transcriptomic data. Some of the others include the use of

Mutual Information/Spearman Correlations - address non-linearities.
Kinetic models - attempt to infer interactions.
Boolean Networks - model interactions as circuitry.
Petri Nets - Prof. Ming Chen.
Bayesian Networks - Dr. Chris Needham.
Machine Learning methods - unsupervised/semi-supervised/supervised learning.
Integration of other data sources.
...

Page 56: Hugh Shanahan Association Talk

Explosion of methods

Excerpt from Stolovitzky et al., Ann. N.Y. Acad. Sci. (2009), p. 160, as shown on the slide:

FIGURE 1. The number of publications retrieved from a PubMed search of “Pathway Inference” or “Reverse Engineering.” On average, this number has been doubling every two years since about 1995. (Color is shown in online version.)

[...] reverse-engineering methods was going to be a key prerequisite to their increasing value to biology. Indeed, at that time, the field of reverse engineering biological networks was beginning to experience considerable expansion, which generated much confusion about which methods were truly valuable from a practical perspective. Evidence of this trend is shown in Figure 1, where we report the number of publications retrieved from a PubMed search of “Pathway Inference” or “Reverse Engineering.” Figure 1 shows what appears to be an exponential growth, in which citations to these key words have been roughly doubling every two years for the last decade or so.

While such growth in the number of reverse engineering–related publications was fueled by very innovative and elegant computational methods for network reconstruction, arising from physics, computer science, mathematics, and engineering, the group shared the feeling that, ultimately, an algorithm’s worth was to be found in the experimentally validated quality of its predictions. A key problem is that computational methods can, in the blink of an eye, generate large numbers of predictions, from a few hundred to hundreds of thousands, most (if not all) of which usually go untested. Even worse, and this would be a best case scenario, a very small subsample of predictions (usually three or more, but rarely more than ten) would be validated using sound experimental assays and then presented as valuable criteria for the soundness of the entire set of predictions. Thus, a clear characterization of the relative strengths and weaknesses of the algorithms on an objective basis was usually missing. It should be noted that the same could be said for high-throughput (and even low-throughput) experimental approaches, whose false positive and, equally importantly, false negative rates are rarely considered a requisite for publication. This generated, for instance, more than a few puzzled looks when the first experimentally generated, genome-wide interactomes in yeast [4–6] showed minimal overlap.

At the spring 2006 meeting, it was agreed that while the obstacles for the creation of a [...]

Page 57: Hugh Shanahan Association Talk

Validation - DREAM

While there are a colossal number of methods out there, their validation is very much in its infancy.
DREAM (Dialogue for Reverse Engineering Assessments and Methods) is an attempt to deal with this question.
Features: unseen experimental data on (for example)

Transcription Factor binding sites,
artificial data (in silico),
genome-wide interactions,

is gathered and groups are invited to reproduce the interactions. The different groups' results are then compared against the data to determine how well they did. New challenges are presented on an annual basis.

Page 58: Hugh Shanahan Association Talk

Some Results from DREAM 2 (2007)

Challenge 1: Identify targets of transcription factor BCL6.

53 genuine targets of BCL6 inferred from unpublished ChIP-chip data.
147 decoys added.
Task: identify the genuine targets from decoys by picking genes with similar expression patterns to BCL6.
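A task of this shape (rank 200 genes, then compare against a hidden list of 53 targets) can be mimicked on simulated data. The sketch below is purely illustrative: the expression data are randomly generated, the genes are scored by absolute correlation with the transcription factor profile, and the ranking is summarised by the area under the ROC curve. The choice of AUROC as the score and all numerical settings other than the 53/147 split are assumptions; the slides do not state how the challenge was scored.

```python
import numpy as np

def auroc(scores, is_target):
    """Area under the ROC curve via the rank-sum identity (assumes no tied scores)."""
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int(is_target.sum())
    n_neg = len(scores) - n_pos
    return (ranks[is_target].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Simulated stand-in: 53 targets and 147 decoys measured in 30 conditions.
rng = np.random.default_rng(4)
n_conditions = 30
tf = rng.normal(size=n_conditions)                        # transcription factor profile
targets = 0.7 * tf + rng.normal(size=(53, n_conditions))  # targets follow the factor
decoys = rng.normal(size=(147, n_conditions))             # decoys are unrelated
genes = np.vstack([targets, decoys])
is_target = np.arange(200) < 53

# Score each gene by its absolute correlation with the transcription factor.
scores = np.array([abs(np.corrcoef(tf, g)[0, 1]) for g in genes])
result = auroc(scores, is_target)
print(f"AUROC of the correlation ranking: {result:.2f}")
```

An AUROC of 0.5 corresponds to a random ranking and 1.0 to a ranking that places every genuine target above every decoy.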

Page 59: Hugh Shanahan Association Talk

Results

Best approaches were selective in the data sets employed. (Data sets where BCL6 was highly expressed or not expressed at all were used.)
Semi-supervised learning was employed - using known targets of BCL6 to train the best method.
Used correlations.
1st-order partial correlations did badly.
Basic correlations came close to the most sophisticated approaches.

Page 60: Hugh Shanahan Association Talk

More results from DREAM

Challenge 2: E. coli network.
From RegulonDB, good evidence for the targets of assorted Transcription Factors.
Task: identify targets.
Results

Best method used Mutual Information and ideas behind 1st-order partial correlation.
Correlations and Partial Correlations were not too far behind.
Level of identification of targets was low - perhaps 5%.

Page 61: Hugh Shanahan Association Talk

Conclusions

GO terms allow us to handle large amounts of annotation in a structured fashion.
Associative measures are a first attempt at using the huge amounts of expression data that are out there.
Very simple ideas such as correlation work surprisingly well (or rather, more complicated methods of association don’t give orders of magnitude better performance).
A long way to go nonetheless.
The type of expression data we select and a clear understanding of the microarray/RNA-seq/... technology may be even more important.
