4. Gene Expression Data Analysis

EECS 600: Systems Biology & Bioinformatics

Instructor: Mehmet Koyuturk

Analyzing Gene Expression Data



Clustering: How are genes related in terms of their expression under different conditions?

Differential gene expression: Which genes are affected by a change in condition, tissue, or disease?

Classification (supervised analysis): Given the expression profile for a gene, can we assign a function? Given the expression levels of several genes in a sample, can we characterize the type of sample (e.g., cancerous or normal)?

Regulatory network inference: How do genes regulate each other's expression to orchestrate cellular function?

Clustering



Group similar items together.

Clustering genes based on their expression profiles: We can measure the expression of multiple genes in multiple samples. Genes that are functionally related should have similar expression profiles.

Gene expression profile: A vector (or a point) in multi-dimensional space, where each dimension corresponds to a sample.

Clustering of multi-dimensional real-valued data is a well-studied problem.

Motivating Example



Expression levels of 2,000 genes in 22 normal and 40 tumor colon tissues (Alon et al., PNAS, 1999)

Applications of Clustering



Functional annotation: If a gene with unknown function is clustered together with genes that perform a particular function, then that gene is likely to be associated with that function.

Identification of regulatory motifs: If a group of genes is co-regulated, then it is likely that their regulation is modulated by similar transcription factors. By looking for common elements in the neighborhood of the coding sequences of the genes in a cluster, we can identify regulatory motifs and their locations (promoters).

Modular analysis

Gene Expression Matrix



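The gene expression matrix $E = [e_{ij}]$, $1 \le i \le m$, $1 \le j \le n$, has m rows (genes) and n columns (samples).

Generally, m >> n: $m = O(10^3)$, $n = O(10^1)$.

Each row is an n-dimensional vector, the expression profile of a gene:

$$e_i = [e_{i1}, e_{i2}, \ldots, e_{in}]^T$$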

Proximity Measures



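How do we decide which genes are similar to each other?

Euclidean distance:

$$\mathrm{Euclidean}(e_i, e_j) = \|e_i - e_j\|_2 = \sqrt{\sum_{k=1}^{n} (e_{ik} - e_{jk})^2}$$

Manhattan distance:

$$\mathrm{Manhattan}(e_i, e_j) = \|e_i - e_j\|_1 = \sum_{k=1}^{n} |e_{ik} - e_{jk}|$$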

Distance



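Minkowski distance: General version of Euclidean, Manhattan, etc. p is a parameter.

$$\mathrm{Minkowski}_p(e_i, e_j) = \|e_i - e_j\|_p = \left( \sum_{k=1}^{n} |e_{ik} - e_{jk}|^p \right)^{1/p}$$

$$\|e_i - e_j\|_\infty = \max_{1 \le k \le n} |e_{ik} - e_{jk}|$$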
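The distance definitions on this slide can be sketched in a few lines of Python. The function names and the two toy profiles are illustrative assumptions, not part of the lecture.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length profiles."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    # Minkowski distance with p = 1
    return minkowski(x, y, 1)

def euclidean(x, y):
    # Minkowski distance with p = 2
    return minkowski(x, y, 2)

def max_distance(x, y):
    # Limit of the Minkowski distance as p grows without bound
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical expression profiles of two genes over three samples
e_i = [1.0, 4.0, 2.0]
e_j = [4.0, 0.0, 2.0]

print(manhattan(e_i, e_j))     # 7.0
print(euclidean(e_i, e_j))     # 5.0
print(max_distance(e_i, e_j))  # 4.0
```

Larger values of p give more weight to the sample with the largest difference, which is why the maximum appears as the limiting case.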

Normalization



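If we want to measure the distance between directions rather than absolute magnitudes, it may be necessary to standardize the mean and variance of the expression levels for each gene:

$$\mu_i = \frac{1}{n} \sum_{k=1}^{n} e_{ik}, \qquad \sigma_i^2 = \frac{1}{n} \sum_{k=1}^{n} (e_{ik} - \mu_i)^2$$

$$e'_{ik} = \frac{e_{ik} - \mu_i}{\sigma_i}, \qquad e'_i = [e'_{i1}, e'_{i2}, \ldots, e'_{in}]^T$$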

Correlation



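Correlation measures the similarity between the variation of two random variables.

A vector is treated as a sampling of a random variable.

Covariance:

$$\mathrm{Cov}[e_i, e_j] = \frac{1}{n} \sum_{k=1}^{n} (e_{ik} - \mu_i)(e_{jk} - \mu_j)$$

$$\mathrm{Var}[e_i] = \mathrm{Cov}[e_i, e_i] = \sigma_i^2$$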

Pearson Correlation Coefficient



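Pearson correlation coefficient:

$$\mathrm{Pearson}(e_i, e_j) = \frac{\mathrm{Cov}[e_i, e_j]}{\sqrt{\mathrm{Var}[e_i]\,\mathrm{Var}[e_j]}} = \frac{\sum_{k=1}^{n} (e_{ik} - \mu_i)(e_{jk} - \mu_j)}{n\,\sigma_i \sigma_j}$$

Pearson correlation is equal to the cosine of the angle between the normalized expression profiles (their inner product divided by n).

Pearson correlation is normalized:

$$-1 \le \mathrm{Pearson}(e_i, e_j) \le 1$$

$$\mathrm{Pearson}(e_i, e_j) = \mathrm{Pearson}(e'_i, e'_j)$$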

Euclidean Distance & Correlation

Euclidean distance (normalized) and Pearson correlation coefficient are closely related:

$$\mathrm{Euclidean}(e'_i, e'_j) = \sqrt{2n\,\bigl(1 - \mathrm{Pearson}(e_i, e_j)\bigr)}$$

These are the two most commonly used proximity measures in gene expression data analysis.

Without loss of generality, we will use $\delta_{ij} = \delta(e_i, e_j)$ to denote the distance between two expression profiles.
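A quick numerical check of the relation between normalized Euclidean distance and Pearson correlation may help. This is a minimal sketch that assumes the standardization uses the 1/n (population) variance; the helper names and the toy profiles are made up for illustration.

```python
import math

def standardize(e):
    """Subtract the mean and divide by the (1/n) standard deviation."""
    n = len(e)
    mu = sum(e) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in e) / n)
    return [(x - mu) / sigma for x in e]

def pearson(x, y):
    # Inner product of the standardized profiles, divided by n
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical profiles over n = 4 samples
e_i = [2.0, 4.0, 6.0, 9.0]
e_j = [1.0, 3.0, 2.0, 8.0]
n = len(e_i)

lhs = euclidean(standardize(e_i), standardize(e_j))
rhs = math.sqrt(2 * n * (1 - pearson(e_i, e_j)))
print(abs(lhs - rhs) < 1e-9)
```

The identity follows because each standardized profile has squared norm n, so the squared distance expands to n + n minus twice the inner product.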

Other Measures of Correlation

Pearson is vulnerable to outliers: If two genes have very high expression in a single sample, that sample might dominate and make the two expression profiles appear highly correlated.

Jackknife correlation: Estimate n correlations by leaving each dimension (sample) out in turn, then take the minimum among them.

Pearson is not robust for non-Gaussian distributions.

Spearman’s rank-order correlation coefficient: Rank the expression levels and replace each expression level with its rank. More robust against outliers, but a lot of information is lost.
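The effect of a single extreme sample on the three measures can be illustrated as follows. This is only a sketch: the rank helper assumes there are no ties, and the two profiles are invented so that they are unrelated except for one shared outlier.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def jackknife(x, y):
    # Leave each sample out in turn and keep the smallest correlation
    return min(
        pearson(x[:k] + x[k + 1:], y[:k] + y[k + 1:])
        for k in range(len(x))
    )

def ranks(x):
    # Rank 1 is the smallest value; ties are not handled in this sketch
    order = sorted(range(len(x)), key=lambda k: x[k])
    r = [0.0] * len(x)
    for rank, k in enumerate(order, start=1):
        r[k] = float(rank)
    return r

def spearman(x, y):
    # Pearson correlation computed on ranks instead of raw values
    return pearson(ranks(x), ranks(y))

# The last sample is an outlier shared by both genes
x = [1.0, 2.0, 3.0, 4.0, 50.0]
y = [4.0, 1.0, 3.0, 2.0, 60.0]

print(round(pearson(x, y), 3))    # close to 1, driven by the outlier
print(round(jackknife(x, y), 3))  # -0.4
print(round(spearman(x, y), 3))   # 0.3
```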

Clustering Methods



Hierarchical clustering: Group genes into a tree (a.k.a. dendrogram), so that each branch of the tree corresponds to a cluster. Higher branches correspond to coarser clusters.

Partitioning: Partition genes into several groups so that similar genes will be in the same partition.

Hierarchical Clustering

Direction of clustering:

Bottom-up (agglomerative): Start from individual genes, join them into groups until only one group is left.

Top-down (divisive): Start with one group consisting of all genes, keep partitioning groups until each group contains exactly one gene.

Agglomerative clustering is computationally less expensive. Why?

Hierarchical clustering methods are greedy: Once a decision is made, it cannot be undone.

Agglomerative Clustering

Start with m clusters: Each cluster contains one gene.

At each step, choose the two clusters that are closest (or most correlated) and merge them.

How do we evaluate the distance between two clusters?

Single linkage: If two clusters contain two very close genes, then the clusters are close to each other.

$$\delta(C_k, C_l) = \min_{i \in C_k,\, j \in C_l} \delta_{ij}$$

Agglomerative Clustering



Complete linkage: Two clusters are close to each other only if all genes inside them are close to each other.

$$\delta(C_k, C_l) = \max_{i \in C_k,\, j \in C_l} \delta_{ij}$$

Group average: Two clusters are close to each other if their centers are close to each other.

$$\delta(C_k, C_l) = \frac{1}{|C_k|\,|C_l|} \sum_{i \in C_k} \sum_{j \in C_l} \delta_{ij}$$
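The agglomerative procedure with the three linkage criteria can be sketched as below. This is a naive illustration over a precomputed gene-by-gene distance matrix (every cluster pair is rescanned at each step), not an efficient implementation; the function names and the 4-gene matrix are assumptions.

```python
def cluster_distance(A, B, dist, linkage):
    """Distance between clusters A and B (lists of gene indices)."""
    pairwise = [dist[i][j] for i in A for j in B]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: " + linkage)

def agglomerate(dist, linkage="single"):
    """Return the sequence of merges as (cluster A, cluster B, distance)."""
    clusters = [[i] for i in range(len(dist))]  # one gene per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], dist, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical symmetric distance matrix for four genes
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 2.0, 6.0],
    [4.0, 2.0, 0.0, 3.0],
    [5.0, 6.0, 3.0, 0.0],
]
single = agglomerate(dist, "single")
complete = agglomerate(dist, "complete")
print([d for _, _, d in single])    # [1.0, 2.0, 3.0]
print([d for _, _, d in complete])  # [1.0, 3.0, 6.0]
```

On this toy matrix single linkage chains gene 2 onto the first cluster, while complete linkage pairs genes 2 and 3 first, which shows how the choice of linkage changes the tree.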

Divisive Clustering



Recursive bipartitioning: Find an “optimal” partitioning of the genes into two clusters, then recursively work on each partition.

Since the number of clusters is an issue for partitioning-based clustering algorithms, the magic number 2 solves a lot of problems.

May be computationally expensive: The problem is “global”. At every level of the tree, we have to work on all of the genes. If the tree is imbalanced, there might be as many as m levels.

With a reasonable stopping criterion, it may be considered a partition-based clustering method as well.

Partition Based Clustering



Find groups of genes such that genes in each group are similar to each other, while being somewhat less similar to those in other clusters.

Easily interpretable, especially for large datasets (as compared to hierarchical clustering).

Number of Clusters



Clustering is “unsupervised”, so generally we do not have prior knowledge of how many clusters underlie the data.

It is very difficult to partition data into an “unknown” number of clusters.

Most algorithms assume that K (number of clusters) is known.

Try different values of K, find the one that results in the best clustering: Very expensive.

Overlapping vs. Disjoint Clusters



Genes do not have a single function: Most genes might be involved in different processes, so their expression profiles might demonstrate similarities with different genes in different contexts.

Can we allow a gene to be included in more than one cluster?

Allowing overlaps between clusters poses additional challenges: To what extent do we allow overlaps? (We definitely don’t want to identify two identical clusters.)

Fuzzy Clustering



Assign weights to each gene-cluster pair, showing the extent (or likelihood) of the gene belonging to the cluster.

Difficult interpretation.

Partitioning is a special case of fuzzy clustering, where the weights are restricted to binary values.

Hierarchical clustering is also “fuzzy” in some sense.

Continuous relaxation might alleviate computational complexity as well.

K-Means Clustering



The most famous clustering algorithm.

Given K, find K disjoint clusters such that the total intracluster variation is minimized.

Cluster mean:

$$c_k = \frac{1}{|C_k|} \sum_{i \in C_k} e_i$$

Intracluster variation:

$$V_k = \sum_{i \in C_k} \delta(e_i, c_k)$$

Total intracluster variation:

$$V = \sum_{k=1}^{K} V_k$$

K-Means Algorithm



K-Means is an iterative algorithm that alternately updates cluster assignments and cluster centers, each based on the other’s current values, until no improvement is possible.

1. Choose K expression profiles randomly, designate each of them as the center of one of the K clusters.

2. Assign each gene to a cluster.

2.1. Each gene is assigned to the cluster whose center is closest to its profile.

3. Redetermine cluster centers.

4. If any gene was moved, go back to Step 2, else stop.
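The four steps above can be sketched in plain Python. This is a minimal illustration, assuming squared Euclidean distance as the proximity measure; the function names, the fixed random seed, and the toy profiles are not from the lecture.

```python
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(profiles):
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def kmeans(profiles, K, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: choose K profiles at random as the initial centers
    centers = [list(p) for p in rng.sample(profiles, K)]
    assignment = [None] * len(profiles)
    for _ in range(max_iter):
        # Step 2: assign each gene to the cluster with the closest center
        new_assignment = [
            min(range(K), key=lambda k: sq_dist(p, centers[k]))
            for p in profiles
        ]
        # Step 4: stop when no gene moved
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its members
        for k in range(K):
            members = [p for p, a in zip(profiles, assignment) if a == k]
            if members:  # keep the old center if a cluster became empty
                centers[k] = mean(members)
    return assignment, centers

# Two well-separated groups of hypothetical two-sample profiles
profiles = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
assignment, centers = kmeans(profiles, K=2)
print(assignment)
```

Because the result depends on the random initial centers, K-means is usually run several times and the run with the lowest total intracluster variation is kept.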

Sample Run of K-Means




Self Organizing Maps



Just like K-means, we have K clusters, but this time they are organized into a map, often a 2D grid.

We want to organize clusters so that similar clusters will be in proximity in the map: a way of visualizing in low-dimensional (2D) space.

Just like K-means, each cluster is associated with a weight vector (it was the cluster center in K-means).

Each weight vector is first initialized randomly to some gene’s expression profile.

SOM Algorithm



At each step, a gene is selected at random.

The distance between the gene’s expression profile and each cluster’s weight vector is calculated, and the cluster with the closest weight vector becomes the winner.

The weight vectors of the winner and its neighbors (according to the 2D mapping) are adjusted to represent the gene’s expression profile better:

$$w_k(t+1) = w_k(t) + \alpha(t)\,\theta(C_k, C_j)\,\bigl(e_i - w_k(t)\bigr)$$

Here C_j is the winner cluster for gene i at time t, α is a decreasing function of time, and θ is the neighborhood function.
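A single SOM update can be sketched as follows. The Gaussian neighborhood on the grid is one common choice for θ and is an assumption here, as are the function name, the 2-by-2 map, and the parameter values.

```python
import math

def som_step(weights, grid, e, alpha, radius):
    """Apply one SOM update for expression profile e, in place.

    weights[k] is the weight vector of cluster k and grid[k] is its
    (row, col) position on the map. Returns the index of the winner.
    """
    # Winner: cluster whose weight vector is closest to e
    j = min(range(len(weights)),
            key=lambda k: sum((w - x) ** 2 for w, x in zip(weights[k], e)))
    for k in range(len(weights)):
        # Gaussian neighborhood around the winner (assumed form of theta)
        d2 = (grid[k][0] - grid[j][0]) ** 2 + (grid[k][1] - grid[j][1]) ** 2
        theta = math.exp(-d2 / (2.0 * radius ** 2))
        # w_k <- w_k + alpha * theta * (e - w_k)
        weights[k] = [w + alpha * theta * (x - w)
                      for w, x in zip(weights[k], e)]
    return j

grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
winner = som_step(weights, grid, e=[0.9, 0.1], alpha=0.5, radius=1.0)
print(winner)  # 1
```

A full training run would repeat this step for randomly chosen genes while shrinking both alpha and the neighborhood radius over time.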

Sample SOM Output




Gene Co-expression Network



Nodes represent genes.

Weighted edges between nodes represent proximity (correlation) between genes’ expression profiles.

This is indeed a way of predicting interactions between genes.

Graph Theoretical Clustering



Partition the graph into heavy subgraphs: Maximize total weight (number of edges) inside a cluster. Minimize total weight (number of edges) between clusters.

Heuristic algorithms:

CLICK: Recursive min-cut.

CAST: Iterative improvement, one by one for each cluster.

Loss of information?

Model Based Clustering



Generating model: Each cluster is associated with a distribution (that generates expression profiles for associated genes) specified by model parameters. The probability that a gene belongs to a cluster is specified by hidden parameters.

Expectation Maximization (EM) algorithm:

Start with a guess of model parameters.

E-step: Compute expected values of hidden parameters based on model parameters.

M-step: Based on hidden parameters, estimate model parameters to maximize the likelihood of observing the data at hand.

Iterate.

K-means is a special case.

Evaluation of Clusters



In general, we want to maximize intra-cluster similarity, while minimizing inter-cluster similarity.

Homogeneity, separation: Based on the proximity metric.

Reference partition: Information on “true clusters” that comes from a different source (apart from expression data), such as molecular annotation (e.g., Gene Ontology). Compared using Jaccard coefficient, sensitivity, specificity.

Cluster annotation: Processes that are significantly enriched in a cluster.

Homogeneity & Separation



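Heterogeneity (or homogeneity in reverse direction): How similar are the genes in one cluster?

$$H(C_k) = \frac{2}{|C_k|\,(|C_k| - 1)} \sum_{i, j \in C_k,\ i < j} \delta_{ij}$$

Separation: How dissimilar are different clusters?

$$S(C_k, C_l) = \frac{1}{|C_k|\,|C_l|} \sum_{i \in C_k} \sum_{j \in C_l} \delta_{ij}$$

Good clustering: low heterogeneity, high separation (when δ_ij denotes the distance between the profiles of genes i and j).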

Overall Quality



Overall heterogeneity:

$$H = \frac{1}{m} \sum_{C_k} |C_k|\, H(C_k)$$

Overall separation:

$$S = \frac{1}{\sum_{k \ne l} |C_k|\,|C_l|} \sum_{k \ne l} |C_k|\,|C_l|\, S(C_k, C_l)$$

How do these change with respect to the number of clusters? Can we optimize these values to choose the best number of clusters?
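The cluster-level and overall measures can be computed directly from a pairwise distance matrix. The sketch below assumes distances (so lower heterogeneity and higher separation are better) and treats a singleton cluster as having heterogeneity 0; the names and the 4-gene matrix are illustrative.

```python
def heterogeneity(C, dist):
    """Average pairwise distance inside cluster C (list of gene indices)."""
    pairs = [(i, j) for a, i in enumerate(C) for j in C[a + 1:]]
    if not pairs:  # singleton cluster: no pairs to average (assumption)
        return 0.0
    return sum(dist[i][j] for i, j in pairs) / len(pairs)

def separation(Ck, Cl, dist):
    """Average distance between genes of two different clusters."""
    return sum(dist[i][j] for i in Ck for j in Cl) / (len(Ck) * len(Cl))

def overall_heterogeneity(clusters, dist):
    m = sum(len(C) for C in clusters)
    return sum(len(C) * heterogeneity(C, dist) for C in clusters) / m

def overall_separation(clusters, dist):
    # Summing over unordered cluster pairs gives the same ratio as k != l
    num = den = 0.0
    for a, Ck in enumerate(clusters):
        for Cl in clusters[a + 1:]:
            w = len(Ck) * len(Cl)
            num += w * separation(Ck, Cl, dist)
            den += w
    return num / den

dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 2.0, 6.0],
    [4.0, 2.0, 0.0, 3.0],
    [5.0, 6.0, 3.0, 0.0],
]
clusters = [[0, 1], [2, 3]]
H = overall_heterogeneity(clusters, dist)
S = overall_separation(clusters, dist)
print(H, S)  # 2.0 4.25
```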

Bayesian Information Criterion



A statistical criterion for evaluating a model: Penalizes model complexity (number of free parameters to be estimated).

k is the number of free parameters in the model, which increases with the number of clusters.

RSS is the “total error” in the model.

Trade off the number of clusters against the optimization function to choose the best number of clusters.

Reference Partitioning



If there is information about “ground truth” from an independent source, we can compare our clustering to such a reference partitioning.

Pairwise assessment:

Let C_ij = 1 if gene i and gene j are assigned to the same cluster by the clustering algorithm, 0 otherwise.

Let R_ij = 1 if gene i and gene j are in the same cluster according to the reference partition, 0 otherwise.

$$n_{11} = \sum_{i < j} C_{ij} R_{ij}, \qquad n_{00} = \sum_{i < j} (1 - C_{ij})(1 - R_{ij})$$

$$n_{10} = \sum_{i < j} C_{ij} (1 - R_{ij}), \qquad n_{01} = \sum_{i < j} (1 - C_{ij}) R_{ij}$$

Comparing Partitions



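Rand index (symmetric):

$$\mathrm{Rand} = \frac{n_{11} + n_{00}}{n_{11} + n_{00} + n_{10} + n_{01}}$$

Jaccard coefficient (sparse):

$$\mathrm{Jaccard} = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}$$

Minkowski measure (sparse):

$$\mathrm{Minkowski} = \sqrt{\frac{n_{10} + n_{01}}{n_{11} + n_{01}}}$$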
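The three comparison measures follow from the four pair counts. The sketch below takes each partition as a list of cluster labels per gene; the label lists and function names are made up for illustration.

```python
import math
from itertools import combinations

def pair_counts(clustering, reference):
    """Count gene pairs by co-membership in the clustering (first index)
    and in the reference partition (second index)."""
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(clustering)), 2):
        c = clustering[i] == clustering[j]
        r = reference[i] == reference[j]
        if c and r:
            n11 += 1
        elif c:
            n10 += 1
        elif r:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00

def rand_index(n11, n10, n01, n00):
    return (n11 + n00) / (n11 + n00 + n10 + n01)

def jaccard(n11, n10, n01, n00):
    return n11 / (n11 + n10 + n01)

def minkowski_measure(n11, n10, n01, n00):
    return math.sqrt((n10 + n01) / (n11 + n01))

# Hypothetical cluster labels for six genes
clustering = [0, 0, 0, 1, 1, 1]
reference = [0, 0, 1, 1, 1, 1]
counts = pair_counts(clustering, reference)
print(counts)  # (4, 2, 3, 6)
print(rand_index(*counts), jaccard(*counts), minkowski_measure(*counts))
```

Rand and Jaccard are similarities (higher is better), whereas the Minkowski measure is an error (lower is better). Jaccard and Minkowski ignore n00, which matters when most gene pairs are in different clusters in both partitions.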

Cluster Annotation



Clustering results in groups of genes that are co-expressed (or co-regulated).

For each group, can we tell something about the biological phenomena that underlie our observation (their co-expression)?

We have partial knowledge of the function of many individual genes: Gene Ontology, COG (Clusters of Orthologous Groups), PFAM (Protein Domain Families).

Taking a statistical approach, we can assign function to each group of genes: A function popular in a cluster is associated with that cluster.

Gene Ontology



Ontology: Study of being (e.g., conceptualization).

Gene Ontology is an attempt to develop a standardized library of cellular function.

Unified view of life: Processes, structures, and functions recur in diverse organisms.

Three concepts of Gene Ontology:

Biological process: A recognized series of events or molecular functions (e.g., cell cycle, development, metabolism)

Molecular function: What does a gene’s product do? (e.g., binding, enzyme activity, receptor activity)

Cellular component: Localization within the cell (e.g., membrane, nucleus, ubiquitin ligase complex)

Hierarchy in Gene Ontology



Gene Ontology is hierarchical.

A process might have subprocesses: Seed maturation is part of seed development.

A process might be described at different levels of detail: Seed dormancy is a(n example of) seed maturation.

Same for function and component.

Gene Ontology terms are related to each other via “is a” and “part of” relationships.

If process A is part of process B, then A is B’s child (B is A’s parent); B involves A.

If function C is a function D, then C is D’s child; C is a more detailed specification of D.




GO Hierarchy is a DAG



Gene Ontology is hierarchical, but the hierarchy is not represented by a tree; it is represented by a directed acyclic graph (DAG).

A GO term can have multiple parents (and obviously a GO term might (should?) have multiple children).

Annotation



GO-based annotation assigns GO terms to a gene.

A gene might have multiple functions and can be involved in multiple processes.

Multiple genes might be associated with the same function; multiple genes take part in a process.

True-path rule: If a gene is annotated with a term, then it is also annotated with the term’s parents (consequently, all of its ancestors).

How does the number of genes associated with each term change as we go down the GO DAG?

GO Annotation of Gene Clusters



There are |C| genes in a cluster C.

|T| genes are associated with GO term t.

|C ∩ T| genes are in C and are associated with t.

What is the association between cluster C and term t?

If we chose random clusters, would we be able to observe that at least this many (|C ∩ T|) of the |C| genes in C are associated with t? What is the probability of this observation?

Statistical significance based on the hypergeometric distribution.

Hypergeometric Distribution



We have n items, m of which are good.

If we choose r items from the entire set of items at random, what is the probability that at least k of them will be good?

$$p = P[K \ge k] = \sum_{i=k}^{\min(m, r)} \frac{\binom{m}{i} \binom{n-m}{r-i}}{\binom{n}{r}}$$

n is the number of genes in the organism, m = |T|, r = |C|, k = |C ∩ T|.

The lower p is, the more likely it is that there is an underlying association between the term and the cluster (the term is significantly enriched in the cluster).
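The tail probability above can be evaluated exactly with integer binomial coefficients. This is a small sketch; the function name and the toy numbers are assumptions, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def enrichment_p_value(n, m, r, k):
    """P[K >= k] when drawing r of n genes, m of which carry the term."""
    tail = sum(comb(m, i) * comb(n - m, r - i)
               for i in range(k, min(m, r) + 1))
    return tail / comb(n, r)

# Toy example: 10 genes, 4 annotated with the term, cluster of 3 genes,
# 2 of which are annotated
p = enrichment_p_value(n=10, m=4, r=3, k=2)
print(p)  # 0.3333...
```

For realistic genome sizes the same function applies unchanged, since Python integers do not overflow, although a statistics library would be faster when many terms are tested.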

GO Hierarchy & Cluster Annotation



How specific (general) is the annotation we attach to a cluster?

If a cluster is larger, then it might correspond to a more general process.

Some processes might be over-represented in the study set.

How do we find the best location of a cluster in the GO hierarchy?

Parent-child annotation: Condition the probability of enrichment of a term in a cluster on the enrichment of its parent terms in the cluster. The gene space is defined as the set of genes that are associated with t’s parents.

Parent-Child Annotation




Multiple Hypotheses Testing



The p-value for a single term provides an estimate of the probability of having at least the observed number of genes attached to that particular term.

We have many terms: even if the likelihood of enrichment is small for a particular term, it might be very probable that some term will be enriched as much as observed in the cluster.

We have to account for all hypotheses being tested simultaneously.

Bonferroni correction: Apply the union rule. The probability that at least one of N tested terms reaches a given p-value is at most N times that p-value, so each p-value is multiplied by the number of terms tested.

Which terms should we consider while correcting for multiple hypotheses for a single term?

Representativity of Terms



How well does a significantly enriched term represent a cluster?

How many of the genes in the cluster are attached to the term?

How many of the genes attached to the term are in the cluster?

For a term t that is significantly enriched in cluster C:

Specificity: |C ∩ T|/|C|, a.k.a. precision.

Sensitivity: |C ∩ T|/|T|, a.k.a. recall.

Biclustering



A particular process might be active only in certain conditions.

A group of genes might be expressed (or up-regulated, suppressed, co-regulated, etc.) in only a subset of samples.

They might behave almost independently under other conditions.

Clustering vs. Biclustering



Clustering is a global approach: Each gene is a point in the space defined by all samples. How about points that are clustered in a subspace?

Biclustering: While clustering genes, also choose a set of dimensions (samples) that provides the best clustering, and vice versa.

A.k.a. co-clustering, subspace clustering…

This is a much harder problem, because you are not only trying to find groups of points that are close to each other in multi-dimensional space, but also trying to identify a subspace in which the groups are more evident.

Biclustering Applications



Sample/tissue classification for diagnosis: The samples with leukemia show specific characteristics for a subset of genes.

Identification of co-regulated genes: Certain sets of genes exhibit coherent activations under specific conditions (while behaving more or less arbitrarily with respect to each other under other conditions).

Functional annotation: Biological processes and functional classes are overlapping. Different sets of samples reveal different functional relationships.

Biclustering Principles



A cluster of genes is defined with respect to a cluster of samples, and vice versa.

The clusters are not necessarily exclusive or exhaustive: A gene/condition may belong to more than one cluster. A gene/condition may not belong to any cluster at all.

Biclusters are not “perfect”: Noise. Statistical inference becomes particularly important.

Biclustering Formulation



Given a gene expression matrix A with gene set G and sample set S, a bicluster is defined by a subset of genes I and a subset of samples J.

General idea: A bicluster is a “good” one if $A_{IJ}$, the submatrix defined by I and J, has some coherence (low variance, low rank, similar ordering of rows, etc.).

The biclustering problem can be defined as one of finding a single bicluster in the entire gene expression matrix, or as one of extracting all biclusters (with some restriction on the relationship between biclusters)

Coherence of a Submatrix




Distribution of Biclusters




Bipartite Graph Model




Just like symmetric matrices, which can be modeled as arbitrary graphs, rectangular matrices can be modeled using bipartite graphs

With proper definition of edge weights, biclustering can be posed as the problem of finding “heavy” subgraphs

Row, Column, Matrix Means




Objective Function



Low-variance (constant) bicluster:

Ideal bicluster:

Minimize bicluster variance

Low-rank (constant row, constant column, coherent values) bicluster:

Ideal constant row:

Ideal constant column:

General rank-one bicluster:

Define residue for each value:

Minimize mean squared residue
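The formulas for this slide did not survive in the text, so the sketch below assumes the widely used Cheng and Church definition, in which the residue of an entry is its value minus its row mean, minus its column mean, plus the overall bicluster mean. The function name and the toy matrices are illustrative.

```python
def mean_squared_residue(A, rows, cols):
    """Mean squared residue of the submatrix of A given by rows and cols.

    Assumed residue: a_ij - (row mean) - (column mean) + (bicluster mean).
    """
    sub = [[A[i][j] for j in cols] for i in rows]
    row_means = [sum(r) / len(cols) for r in sub]
    col_means = [sum(c) / len(rows) for c in zip(*sub)]
    total_mean = sum(row_means) / len(rows)
    return sum(
        (sub[a][b] - row_means[a] - col_means[b] + total_mean) ** 2
        for a in range(len(rows)) for b in range(len(cols))
    ) / (len(rows) * len(cols))

# Additive pattern (row effect + column effect): residue is zero
A = [[1.0, 2.0, 3.0],
     [2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0]]
print(mean_squared_residue(A, [0, 1, 2], [0, 1, 2]))  # 0.0 up to rounding

# A 2x2 block that is not additive
B = [[1.0, 2.0],
     [2.0, 9.0]]
print(mean_squared_residue(B, [0, 1], [0, 1]))  # 2.25
```

Under this definition, constant, constant-row, and constant-column biclusters all have zero residue, which is why it serves as a single objective for the low-rank cases listed above.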

Missing Values



Not all expression levels are available for each gene/sample pair.

A solution is to replace missing values (random values, gene mean, sample mean, regression).

Generalize the definitions of row, column, and bicluster means to handle missing values implicitly.

Occupancy threshold: A bicluster is one with an adequate number of (non-missing) values in each row and column.

Overlapping Biclusters



The expression of a gene in one sample may be thought of as a superposition of contributions from multiple biclusters.

Plaid model:

Each bicluster k makes a contribution to the expression value of the ith gene in the jth sample, and two (generally binary) indicators specify the membership of row i and column j in the kth bicluster, respectively.

Minimize

The contribution term is defined to reflect the “bicluster type”.

Discrete Coherence



A bicluster is defined to be one with coherent ordering of the values on rows and/or columns (as compared to the values themselves).

Order-preserving submatrix (OPSM): A submatrix is order preserving if there is an ordering of its columns such that the sequence of values in every row is increasing.

Gene expression motifs (xMOTIFs): The expression level of a gene is conserved across a subset of conditions if the gene is in the same “state” in each of the conditions. An xMOTIF is a subset of genes that are simultaneously conserved across a subset of samples.

Binary Biclusters



Quantize the gene expression matrix to binary values:

SAMBA: A 1 corresponds to a significant change in the expression value.

PROXIMUS: A 1 means that the gene is “expressed” in the corresponding sample.

A bicluster is a “dense submatrix”, i.e., one with significantly more 1’s than one would expect.

Bipartite graph model: Bicliques, heavy subgraphs.

It is possible to statistically quantify the density of a submatrix:

Log-likelihood:

p-value:

Biclustering Algorithms



Enumeration: Go for it!

Greedy algorithms: Make a locally optimal choice at every step.

Divide and conquer: Solve the problem recursively.

Alternating iterative heuristics: Fix one dimension, solve for the other, alternate iteratively.

Model based: Parameter estimation, e.g., the EM algorithm.

Enumerating Biclusters



m rows, n columns in the matrix: $2^m \times 2^n$ possible biclusters in total.

Not doable in realistic amounts of time. Is it really necessary?

Put some restriction on the size of biclusters.

SAMBA models the problem as one of finding heavy subgraphs in a bipartite graph.

Key assumption is sparsity: Nodes of the bipartite graph have bounded degree.

Find K heavy bipartite subgraphs (biclusters) with bounded-degree enumeration.

Refine them to optimize overlap and add/remove nodes that improve bicluster quality.

Greedy Algorithms



Basic idea: Refine existing biclusters by adding/removing genes/samples to improve the objective function.

Generally, quite fast.

How to choose initial biclusters?

How to jump over bad local optima? (Global awareness, hill-climbing)

Optimization function: mean squared residue.

Node deletion: Start with a large bicluster, keep removing genes/samples that contribute most to total residue.

Node addition: Start with a small bicluster, keep adding genes/samples that contribute least to total residue.

Repeat these alternately to improve global awareness.
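Node deletion can be sketched as a simple greedy loop. This is an illustration only: it assumes the Cheng and Church style residue (value minus row mean, minus column mean, plus bicluster mean), removes one row or column per step, and stops at a user-chosen threshold. The names, the threshold, and the toy matrix are assumptions.

```python
def squared_residues(A, rows, cols):
    """Squared residue of each entry of the submatrix A[rows][cols]."""
    row_mean = {i: sum(A[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(A[i][j] for i in rows) / len(rows) for j in cols}
    total = sum(row_mean.values()) / len(rows)
    return {(i, j): (A[i][j] - row_mean[i] - col_mean[j] + total) ** 2
            for i in rows for j in cols}

def node_deletion(A, threshold):
    """Start from the whole matrix and greedily drop the worst row or
    column until the mean squared residue is at most threshold."""
    rows = list(range(len(A)))
    cols = list(range(len(A[0])))
    while len(rows) > 1 and len(cols) > 1:
        sq = squared_residues(A, rows, cols)
        if sum(sq.values()) / len(sq) <= threshold:
            break
        # Average squared residue contributed by each row and column
        row_score = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
        col_score = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        if row_score[worst_row] >= col_score[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols

# Three additive rows plus one row that breaks the pattern
A = [[1.0, 2.0, 3.0],
     [2.0, 3.0, 4.0],
     [3.0, 4.0, 5.0],
     [9.0, 0.0, 7.0]]
rows, cols = node_deletion(A, threshold=0.01)
print(rows, cols)  # [0, 1, 2] [0, 1, 2]
```

A node addition pass would run the same scoring in the other direction, adding back rows or columns whose score stays below the current mean squared residue.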

Finding All Biclusters



If biclusters are identified one by one, we should make sure that we do not identify the same bicluster again and again.

Masking discovered biclusters: Fill the bicluster with random values.

First identify disjoint biclusters, then grow them to capture overlaps.

Flexible Overlapped Biclustering (FLOC): Generate K initial biclusters. Make decisions from the gene/sample perspective (as compared to the bicluster perspective): Choose the best (maximum gain) action for each gene.

Generalizing K-Means to Biclustering



Assume K gene clusters, L sample clusters.

Notice that this is a little counter-intuitive: we do not have well-defined biclusters; rather, we have clusters of genes and samples, and each pair of gene and sample clusters defines a bicluster.

R: m×K gene clustering matrix, C: n×L sample clustering matrix.

R(i,k) = 1 if gene i belongs to cluster k (actually, columns are normalized to have unit norm).

Minimize total residue:

KL-Means Algorithm



We can show that

Batch iteration:

Given R, compute an m×L matrix that serves as a prototype for column clusters.

For each column, find the column of this prototype matrix that is closest to it, and update the corresponding entry of C accordingly.

Once C is fixed, repeat the same for rows to compute R.

Converges to a local minimum of the objective function.

OPSM Algorithm

Recall that an order-preserving submatrix (OPSM) is one such that all rows have their entries in the same order.

Growing partial models: Fix the extremes first.

The idea: Columns with very high or low values are more informative for identifying rows that support the assumed linear order.

Start with all (1,1) partial models, i.e., only consider the preservation of the first and last elements, and keep the best ones.

Expand these to obtain (2,1) models, then (2,2), and so on until we have (s/2, s/2) models, s being the number of columns in the target bicluster.
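The definition recalled at the top of this slide can be checked directly. The sketch below only tests whether a given set of rows and columns is order preserving; it is not the partial-model search itself. It assumes there are no tied values within a row, and the function name and matrix are illustrative.

```python
def is_order_preserving(A, rows, cols):
    """True if every selected row ranks the selected columns identically."""
    def column_order(i):
        # Columns sorted by increasing value in row i (no ties assumed)
        return sorted(cols, key=lambda j: A[i][j])
    first = column_order(rows[0])
    return all(column_order(i) == first for i in rows[1:])

A = [[1.0, 5.0, 3.0],
     [10.0, 50.0, 20.0],
     [2.0, 1.0, 3.0]]

print(is_order_preserving(A, [0, 1], [0, 1, 2]))     # True
print(is_order_preserving(A, [0, 1, 2], [0, 1, 2]))  # False
print(is_order_preserving(A, [0, 1, 2], [0, 2]))     # True
```

The last call shows why the choice of columns matters: row 2 breaks the order on all three columns but supports it once column 1 is dropped.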




Divide and Conquer Algorithms

Block clustering (a.k.a. direct clustering): Recursive bipartitioning.

Sort rows according to their mean, choose a row such that the total variance above and below the row is minimized.

Do the same for columns.

Pick the row or column that results in minimum intra-cluster variances, split the matrix into two based on that row or column.

Continue splitting recursively.

One problem is that once two rows/columns go to different biclusters, they can never come together.

Gap statistics: Find a large number of biclusters, then recombine.




Binormalization

Normalize the matrix on both dimensions.

Independent scaling of rows and columns:

Here, R and C are diagonal matrices that contain row and column means, respectively.

Bistochastization:

Goal: Rows will add up to a constant (or will have constant norm), columns will add up to a separate constant.

Repeat independent scaling of rows and columns until stability is reached.

The residual of the entire matrix is also normalized in the sense that both rows and columns have zero mean.
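Bistochastization by repeated row and column scaling can be sketched as below. This is a minimal illustration that assumes strictly positive entries (for example raw intensities rather than log-ratios); the target sums (n for rows, m for columns), the iteration count, and the function name are choices made here, not taken from the slides.

```python
def bistochastize(A, iters=100):
    """Alternately rescale rows and columns of a positive matrix so that
    every row sums to n and every column sums to m."""
    B = [row[:] for row in A]
    m, n = len(B), len(B[0])
    for _ in range(iters):
        # Scale each row so that it sums to n
        for i in range(m):
            s = sum(B[i])
            B[i] = [x * n / s for x in B[i]]
        # Scale each column so that it sums to m
        for j in range(n):
            s = sum(B[i][j] for i in range(m))
            for i in range(m):
                B[i][j] *= m / s
    return B

A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
B = bistochastize(A)
print([round(sum(row), 6) for row in B])                       # [3.0, 3.0]
print([round(sum(B[i][j] for i in range(2)), 6) for j in range(3)])  # [2.0, 2.0, 2.0]
```

The two targets are chosen so that the total sum is the same whether counted by rows or by columns, which is what allows both constraints to hold at once.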




Spectral Biclustering

Singular value decomposition:

The nonzero eigenvalues of the matrices $A^TA$ and $AA^T$ (say, $\sigma^2$) are the same.

Each σ is called a singular value of A, and the corresponding eigenvectors of $AA^T$ and $A^TA$ are called the left and right singular vectors.

If $\sigma_1$ is the largest singular value of A, with $A^TAv_1 = \sigma_1^2 v_1$ and $AA^Tu_1 = \sigma_1^2 u_1$, then $\sigma_1 u_1 v_1^T$ is the best rank-one approximation to A, i.e., $\|A - \sigma uv^T\|_2$ is minimized by $\sigma_1$, $u_1$, and $v_1$ (over all unit-norm vectors u and v).

Consequently, the entries of u and v are ordered in such a way that similar rows have similar values on u, and similar columns have similar values on v.

Split the matrix based on u and v.
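The leading singular triplet can be approximated with power iteration, which makes the link between u, v, and the block structure easy to see. This is a plain-Python sketch with a fixed iteration count and a uniform starting vector (which would fail if it happened to be orthogonal to the leading right singular vector); the function name and the toy matrix are assumptions.

```python
import math

def leading_singular_triplet(A, iters=200):
    """Approximate (sigma_1, u_1, v_1) of A by alternating u = Av, v = A^T u."""
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    sigma = 0.0
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        v = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma, u, v

# Rows 0-1 and columns 0-1 form a strong block; row 2 is active elsewhere
A = [[5.0, 5.0, 0.0, 0.0],
     [5.0, 5.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
sigma, u, v = leading_singular_triplet(A)
print(round(sigma, 6))             # 10.0
print([round(x, 3) for x in u])    # [0.707, 0.707, 0.0]
print([round(x, 3) for x in v])    # [0.707, 0.707, 0.0, 0.0]
```

Rows 0 and 1 receive equal entries in u and columns 0 and 1 receive equal entries in v, so thresholding u and v separates the dominant block from the rest of the matrix.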


