BASIC METHODOLOGIES OF ANALYSIS:
SUPERVISED ANALYSIS: HYPOTHESIS TESTING USING CLINICAL INFORMATION (MLL VS. NO TRANSLOCATION)
IDENTIFY DIFFERENTIATING GENES
SUPERVISED METHODS CAN ONLY VALIDATE OR REJECT HYPOTHESES; THEY CANNOT LEAD TO DISCOVERY OF UNEXPECTED PARTITIONS.
UNSUPERVISED: EXPLORATORY ANALYSIS
• NO PRIOR KNOWLEDGE IS USED
• EXPLORE STRUCTURE OF DATA ON THE BASIS OF CORRELATIONS AND SIMILARITIES
[FIGURE: EXPRESSION HEAT MAP, ~1000 GENES x ~140 SAMPLES; COLOR SCALE: EXPRESSION 1-99%]
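The heat map above is only an image, but the computation behind "correlations and similarities" is simple. Below is a minimal sketch (toy random data; the 1000 x 140 shape echoes the figure's axes) of computing all pairwise Pearson correlations with NumPy:

```python
# Toy illustration: pairwise correlation structure of an expression matrix
# (rows = genes, columns = patients). Data and shapes are invented.
import numpy as np

rng = np.random.default_rng(0)
expression = rng.normal(size=(1000, 140))  # toy: 1000 genes x 140 patients

gene_corr = np.corrcoef(expression)        # gene-gene correlations (1000 x 1000)
patient_corr = np.corrcoef(expression.T)   # patient-patient correlations (140 x 140)
print(gene_corr.shape, patient_corr.shape)
```

Clustering the rows of `gene_corr` addresses Goal A below (groups of co-expressed genes); clustering `patient_corr` addresses Goal B (groups of similar tissues).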
AIMS: ASSIGN PATIENTS TO GROUPS ON THE BASIS OF THEIR EXPRESSION PROFILES
IDENTIFY DIFFERENCES BETWEEN TUMORS AT DIFFERENT STAGES
IDENTIFY GENES THAT PLAY CENTRAL ROLES IN DISEASE PROGRESSION
EACH PATIENT IS DESCRIBED BY 30,000 NUMBERS: ITS EXPRESSION PROFILE
UNSUPERVISED ANALYSIS
• GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL PROCESS.
• GOAL B: DIVIDE TISSUES INTO GROUPS WITH SIMILAR GENE EXPRESSION PROFILES. THESE TISSUES ARE EXPECTED TO BE IN THE SAME BIOLOGICAL (CLINICAL) STATE.
CLUSTERING, SORTING
DEFINITION OF THE CLUSTERING PROBLEM
[FIGURE: GIRAFFES - CLUSTER ANALYSIS YIELDS A DENDROGRAM; VERTICAL AXIS T (RESOLUTION); LEAVES GIVE A LINEAR ORDERING OF THE DATA]
[FIGURE: GIRAFFE + OKAPI]
BUT WHAT ABOUT THE OKAPI?
STATEMENT OF THE PROBLEM
GIVEN DATA POINTS Xi, i = 1, 2, ..., N, EMBEDDED IN D-DIMENSIONAL SPACE, IDENTIFY THE UNDERLYING STRUCTURE OF THE DATA.
AIMS:
• PARTITION THE DATA INTO M CLUSTERS, POINTS OF THE SAME CLUSTER "MORE SIMILAR" (M ALSO TO BE DETERMINED!)
• GENERATE A DENDROGRAM
• IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS
"ILL-POSED": WHAT IS "MORE SIMILAR"?
[FIGURE: CLUSTER ANALYSIS YIELDS A DENDROGRAM; VERTICAL AXIS T (RESOLUTION); STABLE CLUSTERS PERSIST OVER A RANGE OF T; LEAVES GIVE A LINEAR ORDERING OF THE DATA (YOUNG vs. OLD SAMPLES)]
CLUSTERING METHODS
• CENTROID (REPRESENTATIVE)
  – SELF-ORGANIZING MAPS (KOHONEN 1997; GENES: GOLUB ET AL., SCIENCE 1999)
  – K-MEANS (GENES: TAMAYO ET AL., PNAS 1999)
• AGGLOMERATIVE HIERARCHICAL
  – AVERAGE LINKAGE (GENES: EISEN ET AL., PNAS 1998)
• PHYSICALLY MOTIVATED
  – DETERMINISTIC ANNEALING (ROSE ET AL., PRL 1990; GENES: ALON ET AL., PNAS 1999)
  – SUPER-PARAMAGNETIC CLUSTERING (SPC) (BLATT ET AL., PRL 1996; GENES: GETZ ET AL., PHYSICA 2000, PNAS 2000)
  – COUPLED MAPS (ANGELINI ET AL., PRL 2000)
CLUSTERING METHODS (CONT.)
• INFORMATION THEORY
  – AGGLOMERATIVE INFORMATION BOTTLENECK (TISHBY ET AL.)
• LINEAR ALGEBRA
  – SPECTRAL METHODS (MALIK ET AL.)
• MULTIGRID-BASED METHODS (BRANDT ET AL.)
Centroid methods – K-means
i = 1, ..., N DATA POINTS, AT Xi
α = 1, ..., K CENTROIDS, AT Yα
ASSIGN DATA POINT i TO CENTROID α; Si = α
COST E:
E(S1, S2, ..., SN; Y1, ..., YK) = Σ_{i=1..N} (Xi − Y_{Si})²
MINIMIZE E OVER Si, Yα
K-means; algorithm to find minima
"Guess" K = 3
• Iteration 0: start with random positions of centroids.
• Iteration 1: assign each data point to the closest centroid.
• Move centroids to the center of assigned points.
• Iterate till minimal cost.
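The iteration just described (Lloyd's algorithm) as a compact runnable sketch; the data, K, and the convergence test are illustrative choices, not prescribed by the slides:

```python
# K-means / Lloyd's algorithm: alternate assignment and centroid updates.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(len(X), size=K, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Assign each point to its closest centroid: S_i = argmin_k ||X_i - Y_k||^2
        S = np.argmin(((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1), axis=1)
        # Move each centroid to the center (mean) of its assigned points.
        new_Y = np.array([X[S == k].mean(axis=0) if np.any(S == k) else Y[k]
                          for k in range(K)])
        if np.allclose(new_Y, Y):  # centroids stopped moving: local minimum of E
            break
        Y = new_Y
    return S, Y

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0.0, 3.0, 6.0)])
S, Y = kmeans(X, K=3)
print(np.bincount(S), Y.round(1))
```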
K-means – Summary
• Result depends on initial centroids' positions
• Fast algorithm: compute distances from data points to centroids, O(N) operations (vs. O(N²) for all pairs)
• Must preset K
• Fails for non-spherical distributions
[PLOT: E = total sum of squares vs. K]
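Since K must be preset, a common heuristic (not from the slides) is to sweep K and look for an "elbow" in the E-vs-K curve shown in the plot above. The sketch below assumes scikit-learn's KMeans, whose `inertia_` attribute is exactly the total sum of squares E:

```python
# Sweep K and report E(K); E drops sharply until K reaches the true
# number of clusters (3 in this toy data), then flattens.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, size=(60, 2)) for m in (0.0, 4.0, 8.0)])

for K in range(1, 7):
    E = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).inertia_
    print(K, round(E, 1))
```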
Agglomerative Hierarchical Clustering
[FIGURE: five points (1-5) merged step by step; vertical axis = distance between joined clusters; the resulting dendrogram]
The dendrogram induces a linear ordering of the data points.
Initially, each point is a cluster; at each step, merge the pair of nearest clusters.
Need to define the distance between the new cluster and the other clusters.
Single Linkage: distance between the closest pair.
Complete Linkage: distance between the farthest pair.
Average Linkage: average distance between all pairs, or distance between cluster centers.
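All three linkage rules are available in SciPy; a minimal sketch on toy two-cloud data (the data and the two-cluster cut are illustrative):

```python
# Agglomerative hierarchical clustering with single, complete and
# average linkage; Z encodes the full sequence of merges (the dendrogram).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # N-1 merges
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, np.bincount(labels)[1:])
```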
[FIGURES: COMPACT, WELL-SEPARATED CLOUDS – EVERYTHING WORKS. TWO FLAT CLOUDS – SINGLE LINKAGE WORKS (AVERAGE LINKAGE SHOWN FOR COMPARISON). A FILAMENT, AND THE SAME FILAMENT WITH ONE POINT REMOVED – SINGLE LINKAGE IS SENSITIVE TO NOISE.]
Hierarchical Clustering – Summary
• Results depend on the distance update method
• Greedy iterative process
• NOT robust against noise
• No inherent measure to identify stable clusters
Average Linkage is the most widely used clustering method in gene expression analysis.
[FIGURE: average-linkage clustering of breast cancer expression data (Nature, 2002)]
How many clusters? 3 large, many small?
Super-Paramagnetic Clustering (SPC)
[FIGURES: toy problem – SPC vs. other methods]
Graph-based clustering
Undirected graph: vertices (nodes), edges, and a cut.
[FIGURE: two nodes i, j joined by an edge with weight J_i,j]
Graph-based clustering (cont.)
i = 1, 2, ..., N data points = vertices (nodes) of the graph
J_i,j – weight associated with edge (i, j) (e.g. J_5,8 for the edge between nodes 5 and 8)
J_i,j depends on the distance D_i,j
[FIGURE: J_i,j plotted as a function of D_i,j]
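The slides only say that J_i,j depends on D_i,j. One common concrete choice (an assumption here, close in spirit to the SPC papers) is a Gaussian kernel on a k-nearest-neighbor graph:

```python
# Build edge weights J_ij = exp(-D_ij^2 / (2 a^2)) between each point
# and its k nearest neighbors; a = mean neighbor distance. Data is a toy.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                    # data points = graph nodes
D = squareform(pdist(X))                        # D_ij: pairwise distances

k = 5
neighbors = np.argsort(D, axis=1)[:, 1:k + 1]   # k nearest (skip self)
a = D[np.arange(len(X))[:, None], neighbors].mean()

J = np.zeros_like(D)
for i, nbrs in enumerate(neighbors):
    J[i, nbrs] = np.exp(-D[i, nbrs] ** 2 / (2 * a ** 2))
J = np.maximum(J, J.T)                          # symmetrize: undirected graph
```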
A cut in the graph represents a clustering solution (partition).
[FIGURE: cut edges marked; a cut with high cost (high resolution) vs. one with low cost (low resolution)]
COST OF A CUT, I.E. A PARTITION = SUM OF WEIGHTS OF ALL CUT EDGES.
Highest cost = sum of all edge weights: each point is its own cluster.
Lowest cost = 0: one cluster.
Conclusion – minimization/maximization of the cost alone is meaningless.
Clustering: The SPC spirit
M. Blatt, S. Wiseman and E. Domany (1996)
• SPC's idea – consider ALL cuts, i.e. partitions {S}.
• Each partition appears with probability p({S}).
• Measure the correlation between points i, j connected by an edge, over all partitions:
Cij = probability that the edge i-j was NOT cut.
[FIGURE: four example partitions {S}1 ... {S}4 of the same points, with probabilities p({S}1) ... p({S}4); the edge i-j is cut only in {S}1]
Cij = p({S}2) + p({S}3) + p({S}4)
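The definition of Cij, computed literally on four invented partitions with invented probabilities:

```python
# C_ij = total probability of the partitions in which points i and j
# share a cluster. Partitions and probabilities below are toy values.
partitions = [              # cluster labels for points i = 0..3
    [0, 0, 1, 1],           # {S}1
    [0, 0, 0, 1],           # {S}2
    [0, 1, 1, 1],           # {S}3
    [0, 0, 0, 0],           # {S}4
]
p = [0.4, 0.3, 0.2, 0.1]    # p({S}), sums to 1

def C(i, j):
    """Probability that points i and j end up in the same cluster."""
    return sum(pk for S, pk in zip(partitions, p) if S[i] == S[j])

print(C(0, 1))   # 0.4 + 0.3 + 0.1 = 0.8
print(C(2, 3))   # 0.4 + 0.2 + 0.1 = 0.7
```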
Clustering: The SPC spirit (cont.)
• We have a graph whose edge values are the correlations Cij.
[FIGURE: example graph with edge correlations between 0.2 and 1]
• Create the clustering solution by deleting edges for which Cij < 0.5.
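Deleting edges with Cij < 0.5 and reading off the clusters is a connected-components computation; a sketch with an invented correlation matrix:

```python
# Threshold the correlation graph at 0.5 and take connected components.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

C = np.array([[1.0, 0.9, 0.8, 0.2],
              [0.9, 1.0, 0.7, 0.1],
              [0.8, 0.7, 1.0, 0.45],
              [0.2, 0.1, 0.45, 1.0]])

adjacency = csr_matrix(C >= 0.5)   # keep only strongly correlated edges
n_clusters, labels = connected_components(adjacency, directed=False)
print(n_clusters, labels)          # 2 clusters: {0, 1, 2} and {3}
```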
What is p({S})?
• COST OF {S} = H({S}) CORRESPONDS TO THE RESOLUTION.
• SOUNDS REASONABLE TO FIND A SOLUTION FOR EACH VALUE OF THE COST/RESOLUTION, E.
• FIX H = E, AND GENERATE PARTITIONS FOR WHICH H({S}) = E:
P({S}) = 1 / (# PARTITIONS WITH H({S}) = E), IF H({S}) = E; 0 OTHERWISE
What is p({S})? (Cont.)
• Due to computational issues it is easier to generate partitions with an AVERAGE cost E:
INSTEAD OF FINDING PARTITIONS WITH H({S}) = E,
FIND PARTITIONS WITH <H({S})> = E:
P({S}) = exp[-H({S})/T] / Z
This is the Boltzmann distribution; T is the temperature = the resolution parameter.
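A hedged sketch of drawing partitions from this Boltzmann distribution. The SPC papers use efficient Swendsen-Wang cluster moves; the single-site Metropolis updates shown here are simpler and sample the same distribution. The number of Potts labels q, the sweep count, and the toy J are illustrative choices:

```python
# Sample partitions {S} with P({S}) ~ exp(-H({S})/T), where
# H({S}) = sum over edges (i,j) of J_ij * [S_i != S_j].
# J must be symmetric with zero diagonal.
import numpy as np

def metropolis_partitions(J, T, q=20, n_sweeps=500, seed=0):
    rng = np.random.default_rng(seed)
    N = J.shape[0]
    S = rng.integers(q, size=N)            # random initial partition
    samples = []
    for _ in range(n_sweeps):
        for i in range(N):
            new = rng.integers(q)
            # Energy change from relabeling point i: edges to i flip
            # between "aligned" (cost 0) and "cut" (cost J_ij).
            dH = np.sum(J[i] * ((S != new).astype(float)
                                - (S != S[i]).astype(float)))
            if dH <= 0 or rng.random() < np.exp(-dH / T):
                S[i] = new
        samples.append(S.copy())
    return samples

# Estimate C_ij as the fraction of sampled partitions with S_i == S_j:
J = np.array([[0.0, 1.0, 0.1],
              [1.0, 0.0, 0.1],
              [0.1, 0.1, 0.0]])
samples = metropolis_partitions(J, T=0.3)
print(np.mean([S[0] == S[1] for S in samples[100:]]))  # strongly coupled pair
```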
Outline of SPC
• Map the data to a graph G.
• Go over resolutions T (from minT to maxT in steps of deltaT):
  – Generate thousands (Cycles) of partitions with average cost corresponding to the current resolution.
  – Calculate pair correlations Cij(T).
  – Clusters(T): connected components of {Cij > 0.5}.
[EXAMPLE: N = 4800 points in D = 2]
Output of SPC
• Size of the largest clusters as a function of T
• Dendrogram: stable clusters "live" for a large range of T
• A function χ(T) that peaks when stable clusters break
• Identify the stable clusters
Same data – Average Linkage: examining this cluster, there is no analog of χ(T).
Advantages of SPC
• Scans all resolutions (T)
• Robust against noise and initialization – calculates collective correlations
• Identifies "natural" (χ) and stable (in T) clusters
• No need to pre-specify the number of clusters
• Clusters can have any shape
• Can use a distance matrix as input (instead of coordinates)