BASIC METHODOLOGIES OF ANALYSIS:
SUPERVISED ANALYSIS: HYPOTHESIS TESTING USING CLINICAL INFORMATION (MLL VS. NO TRANSLOCATION)
IDENTIFY DIFFERENTIATING GENES
SUPERVISED METHODS CAN ONLY VALIDATE OR REJECT HYPOTHESES; THEY CANNOT LEAD TO DISCOVERY OF UNEXPECTED PARTITIONS.
UNSUPERVISED: EXPLORATORY ANALYSIS
• NO PRIOR KNOWLEDGE IS USED
• EXPLORE STRUCTURE OF DATA ON THE BASIS OF CORRELATIONS AND SIMILARITIES
[FIGURE: EXPRESSION HEAT MAP, ~1000 GENES x ~140 SAMPLES; COLOR SCALE: EXPRESSION 1-99%]
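The heat map above is only an image, but the computation behind "correlations and similarities" is simple. Below is a minimal sketch (toy random data; the 1000 x 140 shape echoes the figure's axes) of computing all pairwise Pearson correlations with NumPy:

```python
# Toy illustration: pairwise correlation structure of an expression matrix
# (rows = genes, columns = patients). Data and shapes are invented.
import numpy as np

rng = np.random.default_rng(0)
expression = rng.normal(size=(1000, 140))  # toy: 1000 genes x 140 patients

gene_corr = np.corrcoef(expression)        # gene-gene correlations (1000 x 1000)
patient_corr = np.corrcoef(expression.T)   # patient-patient correlations (140 x 140)
print(gene_corr.shape, patient_corr.shape)
```

Clustering the rows of `gene_corr` addresses Goal A below (groups of co-expressed genes); clustering `patient_corr` addresses Goal B (groups of similar tissues).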
AIMS: ASSIGN PATIENTS TO GROUPS ON THE BASIS OF THEIR EXPRESSION PROFILES
IDENTIFY DIFFERENCES BETWEEN TUMORS AT DIFFERENT STAGES
IDENTIFY GENES THAT PLAY CENTRAL ROLES IN DISEASE PROGRESSION
EACH PATIENT IS DESCRIBED BY 30,000 NUMBERS: ITS EXPRESSION PROFILE
UNSUPERVISED ANALYSIS
• GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL PROCESS.
• GOAL B: DIVIDE TISSUES INTO GROUPS WITH SIMILAR GENE EXPRESSION PROFILES. THESE TISSUES ARE EXPECTED TO BE IN THE SAME BIOLOGICAL (CLINICAL) STATE.
CLUSTERING, SORTING
DEFINITION OF THE CLUSTERING PROBLEM
[FIGURE: GIRAFFES - CLUSTER ANALYSIS YIELDS A DENDROGRAM; VERTICAL AXIS T (RESOLUTION); LEAVES GIVE A LINEAR ORDERING OF THE DATA]
[FIGURE: GIRAFFE + OKAPI]
BUT WHAT ABOUT THE OKAPI?
STATEMENT OF THE PROBLEM
GIVEN DATA POINTS Xi, i = 1, 2, ..., N, EMBEDDED IN D-DIMENSIONAL SPACE, IDENTIFY THE UNDERLYING STRUCTURE OF THE DATA.
AIMS:
• PARTITION THE DATA INTO M CLUSTERS, POINTS OF THE SAME CLUSTER "MORE SIMILAR" (M ALSO TO BE DETERMINED!)
• GENERATE A DENDROGRAM
• IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS
"ILL-POSED": WHAT IS "MORE SIMILAR"?
[FIGURE: CLUSTER ANALYSIS YIELDS A DENDROGRAM; VERTICAL AXIS T (RESOLUTION); STABLE CLUSTERS PERSIST OVER A RANGE OF T; LEAVES GIVE A LINEAR ORDERING OF THE DATA (YOUNG vs. OLD SAMPLES)]
CLUSTERING METHODS
• CENTROID (REPRESENTATIVE)
  – SELF-ORGANIZING MAPS (KOHONEN 1997; GENES: GOLUB ET AL., SCIENCE 1999)
  – K-MEANS (GENES: TAMAYO ET AL., PNAS 1999)
• AGGLOMERATIVE HIERARCHICAL
  – AVERAGE LINKAGE (GENES: EISEN ET AL., PNAS 1998)
• PHYSICALLY MOTIVATED
  – DETERMINISTIC ANNEALING (ROSE ET AL., PRL 1990; GENES: ALON ET AL., PNAS 1999)
  – SUPER-PARAMAGNETIC CLUSTERING (SPC) (BLATT ET AL., PRL 1996; GENES: GETZ ET AL., PHYSICA 2000, PNAS 2000)
  – COUPLED MAPS (ANGELINI ET AL., PRL 2000)
CLUSTERING METHODS (CONT.)
• INFORMATION THEORY
  – AGGLOMERATIVE INFORMATION BOTTLENECK (TISHBY ET AL.)
• LINEAR ALGEBRA
  – SPECTRAL METHODS (MALIK ET AL.)
• MULTIGRID-BASED METHODS (BRANDT ET AL.)
Centroid methods – K-means
i = 1, ..., N DATA POINTS, AT Xi
α = 1, ..., K CENTROIDS, AT Yα
ASSIGN DATA POINT i TO CENTROID α; Si = α
COST E:
E(S1, S2, ..., SN; Y1, ..., YK) = Σ_{i=1..N} (Xi − Y_{Si})²
MINIMIZE E OVER Si, Yα
K-means; algorithm to find minima
"Guess" K = 3
• Iteration 0: start with random positions of centroids.
• Iteration 1: assign each data point to the closest centroid.
• Move centroids to the center of assigned points.
• Iterate till minimal cost.
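The iteration just described (Lloyd's algorithm) as a compact runnable sketch; the data, K, and the convergence test are illustrative choices, not prescribed by the slides:

```python
# K-means / Lloyd's algorithm: alternate assignment and centroid updates.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(len(X), size=K, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Assign each point to its closest centroid: S_i = argmin_k ||X_i - Y_k||^2
        S = np.argmin(((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1), axis=1)
        # Move each centroid to the center (mean) of its assigned points.
        new_Y = np.array([X[S == k].mean(axis=0) if np.any(S == k) else Y[k]
                          for k in range(K)])
        if np.allclose(new_Y, Y):  # centroids stopped moving: local minimum of E
            break
        Y = new_Y
    return S, Y

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0.0, 3.0, 6.0)])
S, Y = kmeans(X, K=3)
print(np.bincount(S), Y.round(1))
```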
K-means – Summary
• Result depends on initial centroids' positions
• Fast algorithm: compute distances from data points to centroids, O(N) operations (vs. O(N²) for all pairs)
• Must preset K
• Fails for non-spherical distributions
[PLOT: E = total sum of squares vs. K]
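Since K must be preset, a common heuristic (not from the slides) is to sweep K and look for an "elbow" in the E-vs-K curve shown in the plot above. The sketch below assumes scikit-learn's KMeans, whose `inertia_` attribute is exactly the total sum of squares E:

```python
# Sweep K and report E(K); E drops sharply until K reaches the true
# number of clusters (3 in this toy data), then flattens.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, size=(60, 2)) for m in (0.0, 4.0, 8.0)])

for K in range(1, 7):
    E = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).inertia_
    print(K, round(E, 1))
```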
Agglomerative Hierarchical Clustering
[FIGURE: five points (1-5) merged step by step; vertical axis = distance between joined clusters; the resulting dendrogram]
The dendrogram induces a linear ordering of the data points.
Initially, each point is a cluster; at each step, merge the pair of nearest clusters.
Need to define the distance between the new cluster and the other clusters.
Single Linkage: distance between the closest pair.
Complete Linkage: distance between the farthest pair.
Average Linkage: average distance between all pairs, or distance between cluster centers.
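All three linkage rules are available in SciPy; a minimal sketch on toy two-cloud data (the data and the two-cluster cut are illustrative):

```python
# Agglomerative hierarchical clustering with single, complete and
# average linkage; Z encodes the full sequence of merges (the dendrogram).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # N-1 merges
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, np.bincount(labels)[1:])
```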
[FIGURES: COMPACT, WELL-SEPARATED CLOUDS – EVERYTHING WORKS. TWO FLAT CLOUDS – SINGLE LINKAGE WORKS (AVERAGE LINKAGE SHOWN FOR COMPARISON). A FILAMENT, AND THE SAME FILAMENT WITH ONE POINT REMOVED – SINGLE LINKAGE IS SENSITIVE TO NOISE.]
Hierarchical Clustering – Summary
• Results depend on the distance update method
• Greedy iterative process
• NOT robust against noise
• No inherent measure to identify stable clusters
Average Linkage is the most widely used clustering method in gene expression analysis.
[FIGURE: average-linkage clustering of breast cancer expression data (Nature, 2002)]
How many clusters? 3 large, many small?
Super-Paramagnetic Clustering (SPC)
[FIGURES: toy problem – SPC vs. other methods]
Graph-based clustering
Undirected graph: vertices (nodes), edges, and a cut.
[FIGURE: two nodes i, j joined by an edge with weight J_i,j]
Graph-based clustering (cont.)
i = 1, 2, ..., N data points = vertices (nodes) of the graph
J_i,j – weight associated with edge (i, j) (e.g. J_5,8 for the edge between nodes 5 and 8)
J_i,j depends on the distance D_i,j
[FIGURE: J_i,j plotted as a function of D_i,j]
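The slides only say that J_i,j depends on D_i,j. One common concrete choice (an assumption here, close in spirit to the SPC papers) is a Gaussian kernel on a k-nearest-neighbor graph:

```python
# Build edge weights J_ij = exp(-D_ij^2 / (2 a^2)) between each point
# and its k nearest neighbors; a = mean neighbor distance. Data is a toy.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                    # data points = graph nodes
D = squareform(pdist(X))                        # D_ij: pairwise distances

k = 5
neighbors = np.argsort(D, axis=1)[:, 1:k + 1]   # k nearest (skip self)
a = D[np.arange(len(X))[:, None], neighbors].mean()

J = np.zeros_like(D)
for i, nbrs in enumerate(neighbors):
    J[i, nbrs] = np.exp(-D[i, nbrs] ** 2 / (2 * a ** 2))
J = np.maximum(J, J.T)                          # symmetrize: undirected graph
```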
A cut in the graph represents a clustering solution (partition).
[FIGURE: cut edges marked; a cut with high cost (high resolution) vs. one with low cost (low resolution)]
COST OF A CUT, I.E. A PARTITION = SUM OF WEIGHTS OF ALL CUT EDGES.
Highest cost = sum of all edge weights: each point is its own cluster.
Lowest cost = 0: one cluster.
Conclusion – minimization/maximization of the cost alone is meaningless.
Clustering: The SPC spirit
M. Blatt, S. Wiseman and E. Domany (1996)
• SPC's idea – consider ALL cuts, i.e. partitions {S}.
• Each partition appears with probability p({S}).
• Measure the correlation between points i, j connected by an edge, over all partitions:
Cij = probability that the edge i-j was NOT cut.
[FIGURE: four example partitions {S}1 ... {S}4 of the same points, with probabilities p({S}1) ... p({S}4); the edge i-j is cut only in {S}1]
Cij = p({S}2) + p({S}3) + p({S}4)
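The definition of Cij, computed literally on four invented partitions with invented probabilities:

```python
# C_ij = total probability of the partitions in which points i and j
# share a cluster. Partitions and probabilities below are toy values.
partitions = [              # cluster labels for points i = 0..3
    [0, 0, 1, 1],           # {S}1
    [0, 0, 0, 1],           # {S}2
    [0, 1, 1, 1],           # {S}3
    [0, 0, 0, 0],           # {S}4
]
p = [0.4, 0.3, 0.2, 0.1]    # p({S}), sums to 1

def C(i, j):
    """Probability that points i and j end up in the same cluster."""
    return sum(pk for S, pk in zip(partitions, p) if S[i] == S[j])

print(C(0, 1))   # 0.4 + 0.3 + 0.1 = 0.8
print(C(2, 3))   # 0.4 + 0.2 + 0.1 = 0.7
```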
Clustering: The SPC spirit (cont.)
• We have a graph whose edge values are the correlations Cij.
[FIGURE: example graph with edge correlations between 0.2 and 1]
• Create the clustering solution by deleting edges for which Cij < 0.5.
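Deleting edges with Cij < 0.5 and reading off the clusters is a connected-components computation; a sketch with an invented correlation matrix:

```python
# Threshold the correlation graph at 0.5 and take connected components.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

C = np.array([[1.0, 0.9, 0.8, 0.2],
              [0.9, 1.0, 0.7, 0.1],
              [0.8, 0.7, 1.0, 0.45],
              [0.2, 0.1, 0.45, 1.0]])

adjacency = csr_matrix(C >= 0.5)   # keep only strongly correlated edges
n_clusters, labels = connected_components(adjacency, directed=False)
print(n_clusters, labels)          # 2 clusters: {0, 1, 2} and {3}
```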
What is p({S})?
• COST OF {S} = H({S}) CORRESPONDS TO THE RESOLUTION.
• SOUNDS REASONABLE TO FIND A SOLUTION FOR EACH VALUE OF THE COST/RESOLUTION, E.
• FIX H = E, AND GENERATE PARTITIONS FOR WHICH H({S}) = E:
P({S}) = 1 / (# PARTITIONS WITH H({S}) = E), IF H({S}) = E; 0 OTHERWISE
What is p({S})? (Cont.)
• Due to computational issues it is easier to generate partitions with an AVERAGE cost E:
INSTEAD OF FINDING PARTITIONS WITH H({S}) = E,
FIND PARTITIONS WITH <H({S})> = E:
P({S}) = exp[-H({S})/T] / Z
This is the Boltzmann distribution; T is the temperature = the resolution parameter.
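A hedged sketch of drawing partitions from this Boltzmann distribution. The SPC papers use efficient Swendsen-Wang cluster moves; the single-site Metropolis updates shown here are simpler and sample the same distribution. The number of Potts labels q, the sweep count, and the toy J are illustrative choices:

```python
# Sample partitions {S} with P({S}) ~ exp(-H({S})/T), where
# H({S}) = sum over edges (i,j) of J_ij * [S_i != S_j].
# J must be symmetric with zero diagonal.
import numpy as np

def metropolis_partitions(J, T, q=20, n_sweeps=500, seed=0):
    rng = np.random.default_rng(seed)
    N = J.shape[0]
    S = rng.integers(q, size=N)            # random initial partition
    samples = []
    for _ in range(n_sweeps):
        for i in range(N):
            new = rng.integers(q)
            # Energy change from relabeling point i: edges to i flip
            # between "aligned" (cost 0) and "cut" (cost J_ij).
            dH = np.sum(J[i] * ((S != new).astype(float)
                                - (S != S[i]).astype(float)))
            if dH <= 0 or rng.random() < np.exp(-dH / T):
                S[i] = new
        samples.append(S.copy())
    return samples

# Estimate C_ij as the fraction of sampled partitions with S_i == S_j:
J = np.array([[0.0, 1.0, 0.1],
              [1.0, 0.0, 0.1],
              [0.1, 0.1, 0.0]])
samples = metropolis_partitions(J, T=0.3)
print(np.mean([S[0] == S[1] for S in samples[100:]]))  # strongly coupled pair
```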
Outline of SPC
• Map the data to a graph G.
• Go over resolutions T (from minT to maxT in steps of deltaT):
  – Generate thousands (Cycles) of partitions with average cost corresponding to the current resolution.
  – Calculate pair correlations Cij(T).
  – Clusters(T): connected components of {Cij > 0.5}.
[EXAMPLE: N = 4800 points in D = 2]
Output of SPC
• Size of the largest clusters as a function of T
• Dendrogram: stable clusters "live" for a large range of T
• A function χ(T) that peaks when stable clusters break
• Identify the stable clusters
Same data – Average Linkage: examining this cluster, there is no analog of χ(T).
Advantages of SPC
• Scans all resolutions (T)
• Robust against noise and initialization – calculates collective correlations
• Identifies "natural" (χ) and stable (in T) clusters
• No need to pre-specify the number of clusters
• Clusters can have any shape
• Can use a distance matrix as input (instead of coordinates)