Finding Local Correlations in High Dimensional Data
USTC Seminar
Xiang Zhang, Case Western Reserve University
Finding Latent Patterns in High Dimensional Data
• An important research problem with wide applications: biology (gene expression analysis, genotype-phenotype association studies), customer transactions, and so on
• Common approaches: feature selection, feature transformation, subspace clustering
Existing Approaches
• Feature selection: find a single representative subset of features that is most relevant to the data mining task at hand
• Feature transformation: find a set of new (transformed) features that retain as much of the information in the original data as possible, e.g., Principal Component Analysis (PCA)
• Correlation clustering: find clusters of data points that may not exist in the axis-parallel subspaces but only in the projected subspaces
Motivation Example
(figure: gene expression data containing a subset of linearly correlated genes)
Question: how can we find these local linear correlations using existing methods?
Applying PCA: Correlated?
• PCA is an effective way to determine whether a set of features is strongly correlated
• It is a global transformation applied to the entire dataset: a few eigenvectors describe most of the variance in the dataset; only a small amount of variance is represented by the remaining eigenvectors; small residual variance indicates strong correlation
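As an illustration (a minimal sketch, not part of the original slides; the data and threshold are made up), the residual-variance check can be written as:

```python
# Compute the covariance eigenvalues and measure how much variance falls on
# the k directions with the smallest eigenvalues. A small ratio indicates a
# strong linear correlation among the features.
import numpy as np

def residual_variance_ratio(X, k):
    """Fraction of total variance on the k smallest-eigenvalue directions."""
    cov = np.cov(X, rowvar=False)        # features in columns
    eigvals = np.linalg.eigvalsh(cov)    # ascending order
    return eigvals[:k].sum() / eigvals.sum()

rng = np.random.default_rng(0)
t = rng.normal(size=500)
# Three features lying (noisily) on a line: almost all variance on one direction.
X = np.column_stack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(500, 3))
print(residual_variance_ratio(X, k=2))   # close to 0: strongly correlated
```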
Applying PCA: Representation?
• The linear correlation is represented as the hyperplane that is orthogonal to the eigenvectors with the minimum variances
(figure: the embedded linear correlations vs. the linear correlations reestablished by full-dimensional PCA; a hyperplane with coefficient vector [1, -1, 1])
Applying Bi-clustering or Correlation Clustering Methods
• Correlation clustering: no obvious clustering structure
• Bi-clustering: no strong pair-wise correlations
(figure: the linearly correlated genes)
Revisiting Existing Work
• Feature selection: finds only one representative subset of features
• Feature transformation: performs one and the same transformation for the entire dataset; it does not really eliminate the impact of any original attribute
• Correlation clustering: the projected subspaces are usually found by applying a standard feature transformation method, such as PCA
Local Linear Correlations - formalization
• Idea: formalize local linear correlations as strongly correlated feature subsets
• Determining whether a feature subset is correlated: small residual variance
• The correlation may not be supported by all data points (noise, domain knowledge, …), so we require it to be supported by a large portion of the data points
Problem Formalization
• Let F (m × n) be a submatrix of the dataset D (M × N)
• Let $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ be the eigenvalues of the covariance matrix of F, arranged in ascending order
• F is a strongly correlated feature subset if

  (1) $f(F, k) = \sum_{i=1}^{k} \lambda_i \,/\, \sum_{j=1}^{n} \lambda_j \le \varepsilon$  and  (2) $m \ge \delta M$

  Here $\sum_{i=1}^{k} \lambda_i$ is the variance on the k eigenvectors having the smallest eigenvalues (the residual variance), $\sum_{j=1}^{n} \lambda_j$ is the total variance, m is the number of supporting data points, and M is the total number of data points
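A minimal sketch of this definition (not from the slides; the data, ε, and δ below are made-up values):

```python
# Check both conditions for a submatrix F of m points over n features.
import numpy as np

def is_strongly_correlated(F, k, eps, delta, M):
    """Condition (1): residual/total variance <= eps; condition (2): m >= delta*M."""
    m = F.shape[0]
    lam = np.linalg.eigvalsh(np.cov(F, rowvar=False))   # ascending eigenvalues
    f_value = lam[:k].sum() / lam.sum()                 # f(F, k)
    return f_value <= eps and m >= delta * M

rng = np.random.default_rng(1)
t = rng.normal(size=200)
correlated = np.column_stack([t, t + 0.01 * rng.normal(size=200), 3 * t])
independent = rng.normal(size=(200, 3))
print(is_strongly_correlated(correlated, k=2, eps=0.01, delta=0.5, M=200))   # True
print(is_strongly_correlated(independent, k=2, eps=0.01, delta=0.5, M=200))  # False
```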
Problem Formalization
• Let F (m × n) be a submatrix of the dataset D (M × N), and recall $f(F, k) = \sum_{i=1}^{k} \lambda_i \,/\, \sum_{j=1}^{n} \lambda_j \le \varepsilon$
• Larger k: stronger correlation
• Smaller ε: stronger correlation
• k and ε together control the strength of the correlation
(figure: eigenvalue spectra illustrating larger k and smaller ε)
Goal
• Goal: find all strongly correlated feature subsets
• Enumerate all sub-matrices? Not feasible ($2^{M \times N}$ sub-matrices in total), so an efficient algorithm is needed
• Any property we can use? Monotonicity of the objective function
Monotonicity
• Monotonic w.r.t. the feature subset: if a feature subset is strongly correlated, all of its supersets are also strongly correlated
• Derived from the interlacing eigenvalue theorem: if $\lambda'_1 \le \cdots \le \lambda'_{n+1}$ are the eigenvalues of the covariance matrix of a feature subset and $\lambda_1 \le \cdots \le \lambda_n$ are those after one feature is removed, then $\lambda'_i \le \lambda_i \le \lambda'_{i+1}$
• Allows us to focus on finding the smallest feature subsets that are strongly correlated
• Enables an efficient algorithm: no exhaustive enumeration needed
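A quick numeric check of the interlacing property (a sketch, not from the slides; the random symmetric matrix is arbitrary):

```python
# The eigenvalues of a principal submatrix of a symmetric matrix interlace
# those of the full matrix (Cauchy interlacing theorem).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 6))
A = X + X.T                      # symmetric 6 x 6 matrix
B = A[:5, :5]                    # principal submatrix (one row/column removed)

a = np.linalg.eigvalsh(A)        # ascending eigenvalues of A
b = np.linalg.eigvalsh(B)        # ascending eigenvalues of B
interlaced = all(a[i] <= b[i] <= a[i + 1] for i in range(5))
print(interlaced)                # True
```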
The CARE Algorithm
• Selecting the feature subsets: enumerate feature subsets from smaller size to larger size (DFS or BFS)
• If a feature subset is strongly correlated, then its supersets are pruned (by monotonicity of the objective function)
• Further pruning is possible
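A minimal sketch of the pruned bottom-up search (not from the slides; the predicate here is a toy stand-in for the $f(F,k) \le \varepsilon$ test, with {0, 1} and {2, 3} as the planted correlations):

```python
# Enumerate feature subsets from small to large, reporting each strongly
# correlated subset and pruning all of its supersets (monotonicity).
from itertools import combinations

def minimal_correlated_subsets(features, is_correlated, max_size):
    """BFS over subset sizes; returns the smallest correlated subsets."""
    found = []
    for size in range(2, max_size + 1):
        for subset in combinations(features, size):
            s = frozenset(subset)
            if any(f <= s for f in found):   # superset of a reported subset
                continue                     # pruned by monotonicity
            if is_correlated(s):
                found.append(s)
    return found

cores = [frozenset({0, 1}), frozenset({2, 3})]
result = minimal_correlated_subsets(range(5), lambda s: any(c <= s for c in cores), 5)
print(sorted(tuple(sorted(s)) for s in result))   # [(0, 1), (2, 3)]
```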
Monotonicity
• Non-monotonic w.r.t. the point subset: adding (or deleting) a point can increase or decrease the correlation among the features
• Exhaustive enumeration is infeasible, so an effective heuristic is needed
The CARE Algorithm
• Selecting the point subsets: a feature subset may only be correlated on a subset of the data points
• If a feature subset is not strongly correlated on all data points, how do we choose the proper point subset?
The CARE Algorithm
• Successive point deletion heuristic: a greedy algorithm; in each iteration, delete the point whose removal yields the maximum increase in the correlation among the subset of features
• Inefficient: the objective function must be evaluated for every data point in each iteration
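A sketch of the greedy heuristic (not from the slides; the data and thresholds are made up):

```python
# In each iteration, evaluate f(F, k) with every single point removed and
# delete the point whose removal decreases f the most, until a delta
# fraction of the points remains.
import numpy as np

def f_value(F, k):
    lam = np.linalg.eigvalsh(np.cov(F, rowvar=False))   # ascending
    return lam[:k].sum() / lam.sum()

def successive_deletion(F, k, delta):
    """Greedily delete points until ceil(delta * M) points remain."""
    keep = int(np.ceil(delta * len(F)))
    while len(F) > keep:
        scores = [f_value(np.delete(F, i, axis=0), k) for i in range(len(F))]
        F = np.delete(F, int(np.argmin(scores)), axis=0)  # best single deletion
    return F

rng = np.random.default_rng(3)
t = rng.normal(size=40)
F = np.column_stack([t, -t, 2 * t]) + 0.01 * rng.normal(size=(40, 3))
F[:5] = np.array([3.0, 3.0, 0.0]) + 0.01 * rng.normal(size=(5, 3))  # 5 outliers
print(f_value(successive_deletion(F, 2, delta=0.8), 2) < f_value(F, 2))  # True
```

Each iteration costs one objective evaluation per remaining point, which is what makes this heuristic expensive on large datasets.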
The CARE Algorithm
• Distance-based point deletion heuristic: let S1 be the subspace spanned by the k eigenvectors with the smallest eigenvalues, and S2 the subspace spanned by the remaining n − k eigenvectors
• Intuition: reduce the variance in S1 as much as possible while retaining the variance in S2
• Directly delete the (1 − δ)M points having large variance in S1 and small variance in S2 (refer to the paper for details)
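A sketch of this one-shot heuristic (the exact scoring is in the paper; the variance-difference score below is a simplification, and the data are made up):

```python
# Score each point by its squared projection onto S1 minus its squared
# projection onto S2, then delete the (1 - delta) * M highest-scoring points
# in a single pass.
import numpy as np

def f_value(F, k):
    lam = np.linalg.eigvalsh(np.cov(F, rowvar=False))   # ascending
    return lam[:k].sum() / lam.sum()

def distance_based_deletion(F, k, delta):
    """Keep the ceil(delta * M) points with the lowest S1-minus-S2 score."""
    Xc = F - F.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(F, rowvar=False))   # ascending eigenvalues
    S1, S2 = vecs[:, :k], vecs[:, k:]                   # residual / retained subspaces
    score = ((Xc @ S1) ** 2).sum(axis=1) - ((Xc @ S2) ** 2).sum(axis=1)
    keep = int(np.ceil(delta * len(F)))
    return F[np.argsort(score)[:keep]]

rng = np.random.default_rng(4)
t = rng.normal(size=40)
F = np.column_stack([t, -t, 2 * t]) + 0.01 * rng.normal(size=(40, 3))
F[:5] = np.array([3.0, 3.0, 0.0]) + 0.01 * rng.normal(size=(5, 3))  # 5 outliers
print(f_value(distance_based_deletion(F, 2, delta=0.8), 2) < f_value(F, 2))  # True
```

Unlike the successive heuristic, this needs only one eigen-decomposition and one sort, rather than one objective evaluation per point per iteration.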
The CARE Algorithm
(figure: a comparison between the successive and distance-based point deletion heuristics)
Experimental Results (Synthetic)
(figure: the embedded linear correlation and the correlations reestablished by full-dimensional PCA vs. CARE)
Experimental Results (Synthetic)
(figure: pair-wise correlations and the embedded linear correlation, hyperplane representation)
Experimental Results (Synthetic)
(figure: scalability evaluation)
Experimental Results (Wage)
A comparison between a correlation clustering method and CARE (dataset: 534 × 11, http://lib.stat.cmu.edu/datasets/CPS_85_Wages)
(figure: linear correlations found by both the correlation clustering method and CARE vs. those found by CARE only)
Experimental Results
Linearly correlated genes (hyperplane representations; 220 genes for 42 mouse strains)
• Nrg4: cell part; Myh7: cell part, intracellular part; Hist1h2bk: cell part, intracellular part; Arntl: cell part, intracellular part
• Nrg4: integral to membrane; Olfr281: integral to membrane; Slco1a1: integral to membrane; P196867: N/A
• Oazin: catalytic activity; Ctse: catalytic activity; Mgst3: catalytic activity
• Hspb2: cellular physiological process; 2810453L12Rik: cellular physiological process; 1010001D01Rik: cellular physiological process; P213651: N/A
• Ldb3: intracellular part; Sec61g: intracellular part; Exosc4: intracellular part; BC048403: N/A
• Mgst3: catalytic activity, intracellular part; Nr1d2: intracellular part, metal ion binding; Ctse: catalytic activity; Pgm3: metal ion binding
• Hspb2: cellular metabolism; Sec61b: cellular metabolism; Gucy2g: cellular metabolism; Sdh1: cellular metabolism
• Ptk6: membrane; Gucy2g: integral to membrane; Clec2g: integral to membrane; H2-Q2: integral to membrane
An example
An example
(figures: result of applying PCA; result of applying ISOMAP)
Finding local correlations
• Dimension reduction performs a single feature transformation for the entire dataset
• To find local correlations: first identify the correlated feature subspaces, then apply dimension reduction methods to uncover the low-dimensional structure
• Dimension reduction addresses the second step; our focus is the first
Finding local correlations
• Challenges: modeling subspace correlations (measures of pair-wise correlation may not suffice) and designing the search algorithm (exhaustive enumeration is too time-consuming)
Modeling correlated subspaces
• Intrinsic dimensionality (ID): the minimum number of free variables required to define the data without any significant information loss
• The correlation dimension is used as the ID estimator
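A minimal sketch of a correlation-dimension estimate (not from the slides; a Grassberger–Procaccia-style two-radius slope, with the radii chosen as assumptions for this toy data):

```python
# The correlation dimension estimates intrinsic dimensionality as the slope
# of log C(r) vs. log r, where C(r) is the fraction of point pairs within
# distance r of each other.
import numpy as np

def correlation_dimension(X, r1, r2):
    diffs = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diffs ** 2).sum(-1))
    dists = d[np.triu_indices(len(X), k=1)]       # all pairwise distances
    c1, c2 = (dists < r1).mean(), (dists < r2).mean()
    return (np.log(c2) - np.log(c1)) / (np.log(r2) - np.log(r1))

rng = np.random.default_rng(5)
t = rng.uniform(0, 1, size=400)
line = np.column_stack([t, t, t])                 # a 1-D manifold in 3-D space
print(round(correlation_dimension(line, 0.05, 0.2)))   # 1
```

Even though the points live in a 3-dimensional space, the estimate recovers that one free variable generates them.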
Modeling correlated subspaces
• Strong correlation: subspace V and feature $f_a$ have a strong correlation if $ID(V, f_a) = ID(V \cup \{f_a\}) - ID(V) \le \varepsilon$
• Redundancy: feature $f_{v_i}$ in subspace V is redundant if $ID(V / f_{v_i}, f_{v_i}) \le \varepsilon$
Modeling correlated subspaces
• Reducible Subspace and Core Space: subspace Y is reducible if there exists a subspace V of Y such that
  (1) $\forall f_a \in Y:\ ID(V, f_a) \le \varepsilon$ (all features in Y are strongly correlated with the core space V)
  (2) $\forall U \subseteq Y$ with $|U| \le |V|$: U is non-redundant (the core space is the smallest non-redundant subspace of Y with which all other features in Y are strongly correlated)
• V is the core space of Y, and Y is reducible to V
Modeling correlated subspaces
• Maximum reducible subspace: if Y is a reducible subspace with core space V, Y is maximum if it includes all features that are strongly correlated with the core space V
• Goal: find all maximum reducible subspaces in the full-dimensional space
Finding reducible subspaces
• General idea: first find the overall reducible subspace (OR), which is the union of all maximum reducible subspaces; then identify the individual maximum reducible subspaces (IR) from OR
Finding OR
• Property: suppose Y is a maximum reducible subspace with core space V; then for any subspace U of Y with |U| = |V|, U is also a core space of Y
• Let $RF_{f_a}$ denote the remaining features in the dataset after deleting $f_a$; then $OR = \{ f_a \mid ID(RF_{f_a}, f_a) \le \varepsilon \}$
• A linear scan of all features in the dataset therefore finds OR
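A sketch of the linear scan (not from the slides; a simple two-radius correlation-dimension estimate stands in for the paper's ID estimator, and ε and the radii are made-up values for this toy data):

```python
# A feature f_a belongs to OR when adding it to the remaining features adds
# essentially no free variable, i.e. ID(RF ∪ {f_a}) - ID(RF) <= eps.
import numpy as np

def corr_dim(X, r1=0.1, r2=0.4):
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    dists = d[np.triu_indices(len(X), k=1)]
    return np.log((dists < r2).mean() / (dists < r1).mean()) / np.log(r2 / r1)

def find_OR(D, eps=0.5):
    """One ID evaluation per feature: a single linear scan over the columns."""
    full_id = corr_dim(D)
    members = []
    for a in range(D.shape[1]):
        rest = np.delete(D, a, axis=1)           # RF_{f_a}
        if full_id - corr_dim(rest) <= eps:      # f_a adds ~no free variable
            members.append(a)
    return members

rng = np.random.default_rng(6)
t, w = rng.uniform(0, 1, 400), rng.uniform(0, 1, 400)
D = np.column_stack([t, 2 * t, w])   # f1 = 2 * f0; f2 (= w) is independent
print(find_OR(D))                    # [0, 1]
```

The independent feature w is excluded: removing it drops the estimated ID by about one, which exceeds the threshold.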
Finding Individual RS
• Assumption: maximum reducible subspaces are disjoint
• Method: enumerate candidate core spaces from size 1 to |OR| (a candidate core space is a subset of OR); find the features that are strongly correlated with the candidate core space and remove them from OR
Finding Individual RS
• Determining whether a feature is strongly correlated with a candidate core space:
• ID-based method: quadratic in the number of data points
• Sampling-based method: sample some data points and count the data points distributed around them (see the paper for details)
Experimental result
A synthetic dataset consisting of 50 features with 3 RS
Experimental result
Efficiency evaluation on finding OR
Experimental result
Sampling-based vs. ID-based method for finding individual RS
Experimental result
Reducible subspaces in the NBA dataset (from the ESPN website): 28 features for 200 players
Thank You!