COMBINATORIAL OPTIMIZATION AND SPARSE COMPUTATION FOR IMAGE SEGMENTATION AND LARGE SCALE DATA MINING
Dorit S. Hochbaum University of California, Berkeley
LNMB, Jan 2019
OVERVIEW
• Motivation: Image segmentation and normalized cut
• Insights on how combinatorial optimization relates to spectral clustering
• Efficient polynomial time algorithm(s)
• The power of using pairwise similarities
• Lessons from experimental studies on effectiveness for image segmentation and for machine learning and data mining classification tasks
NOTATIONS AND PRELIMINARIES
AN INTUITIVE CLUSTERING CRITERION
Find a cluster that combines two objectives: large similarity within the cluster, and small similarity between the cluster and its complement.
The combination of the two objectives can be expressed as:
We call this problem H-normalized-cut (H for Hochbaum), or HNC.
Variants labeled on the slide: HNC1, HNC2, HNC3.
MOTIVATION FOR THE HNC1 PROBLEM
NP-hard?? Same problem??
HNC is polynomial-time solvable: monotone IP3 (Hochbaum 2010) [H10, H13]
This formulation is monotone
HOW DO NC AND HNC1 COMPARE [H10]
Shi and Malik 2000:
HNC1:
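For reference, the two criteria in their commonly cited forms. The notation is assumed here: $C(A,B) = \sum_{i \in A,\, j \in B} w_{ij}$ is the total similarity between node sets $A$ and $B$, $\bar{S} = V \setminus S$, and $d(S) = \sum_{i \in S} d_i$ with $d_i = \sum_j w_{ij}$.

$$\text{NC (Shi and Malik 2000):}\quad \min_{S \subset V} \; \frac{C(S,\bar{S})}{d(S)} + \frac{C(S,\bar{S})}{d(\bar{S})}$$

$$\text{HNC1:}\quad \min_{S \subset V} \; \frac{C(S,\bar{S})}{C(S,S)}$$

NC penalizes both sides of the partition, whereas HNC1 normalizes the cut only by the similarity inside the selected cluster $S$.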
Solving HNC1 with the Spectral method [Sharon et al. 07] and optimally [H10]
• The {0,1} discrete values of x are relaxed. This continuous problem was shown to be solved approximately by an eigenvector.
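$$\min_{x_i \in \{0,1\}} \frac{\sum_{i<j} w_{ij}(x_i - x_j)^2}{\sum_{i,j} w_{ij}\, x_i x_j} = \min_{x_i \in \{0,1\}} \frac{x^T L x}{x^T W x}$$
W is not a diagonal matrix.
However, using the formulation:
$$\min_{S \subset V} \frac{C(S,\bar{S})}{d(S)} = \min_{x_i \in \{0,1\}} \frac{\sum_{i<j} w_{ij}(x_i - x_j)^2}{\sum_{i} d_i x_i^2} = \min_{x_i \in \{0,1\}} \frac{x^T L x}{x^T D x}$$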
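As a quick numeric check of the quadratic forms used on this slide, the sketch below evaluates them on a small graph. The 4-node weight matrix and the set S are made up for illustration. The identities checked, for the 0/1 indicator vector x of S and L = D − W, are x^T L x = C(S, S̄), x^T D x = d(S), and x^T W x = the within-cluster weight summed over ordered pairs.

```python
# Hypothetical symmetric similarity matrix with zero diagonal (illustration only).
W = [
    [0, 3, 1, 0],
    [3, 0, 1, 0],
    [1, 1, 0, 5],
    [0, 0, 5, 0],
]
n = len(W)
d = [sum(row) for row in W]  # weighted degrees d_i

# D = diag(d) and the graph Laplacian L = D - W.
D = [[d[i] if i == j else 0 for j in range(n)] for i in range(n)]
L = [[D[i][j] - W[i][j] for j in range(n)] for i in range(n)]

def quad(M, x):
    """Quadratic form x^T M x."""
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))

S = {0, 1}
x = [1 if i in S else 0 for i in range(n)]  # indicator vector of S

cut = sum(W[i][j] for i in S for j in range(n) if j not in S)  # C(S, S-bar)
inside = sum(W[i][j] for i in S for j in S)                    # ordered pairs inside S
vol = sum(d[i] for i in S)                                     # d(S)

xLx, xWx, xDx = quad(L, x), quad(W, x), quad(D, x)
print(xLx == cut, xWx == inside, xDx == vol)
```

Because d(S) = C(S, S̄) + x^T W x on 0/1 vectors, the two ratios on the slide rank candidate sets in the same order even though their values differ.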
How does the HNC1 solution relate to the spectral solution?
We will answer a more general question: Consider q-normalized cut
TWO-TERMS FORMS OF THE PROBLEMS
TWO TERMS EXPRESSIONS AND THE RAYLEIGH RATIO
Lemma 1 (cf. Hochbaum 2011): A special case of this was shown by Shi and Malik.
THE COMBINATORIAL VS. THE SPECTRAL CONTINUOUS RELAXATIONS
Rayleigh ratio problem (RRP)
Spectral continuous relaxation Combinatorial relaxation
THE SPECTRAL METHOD
• Where λ is the smallest non-zero eigenvalue (Fiedler Eigenvalue). We solve for the eigenvector z:
• and set which solves the continuous relaxation.
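A sketch of the textbook form of this computation, assuming the continuous relaxation minimizes the Rayleigh ratio $x^T L x / x^T Q x$ with $Q = \mathrm{diag}(q)$ a positive diagonal matrix of node weights ($Q = D$ for the normalized cut weights). The symbol $Q$ and the substitution below are assumptions:

$$Q^{-1/2} L\, Q^{-1/2} z = \lambda z, \qquad x = Q^{-1/2} z$$

With $z = Q^{1/2} x$ the ratio becomes $z^T Q^{-1/2} L Q^{-1/2} z / z^T z$, so the eigenvector $z$ of the smallest non-zero eigenvalue gives a minimizer $x$ among the vectors satisfying $x^T Q \mathbf{1} = 0$. The entries of $x$ are real-valued and must still be rounded to obtain a bipartition.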
SOLVING THE COMBINATORIAL RELAXATION
THE COMBINATORIAL RELAXATION RAYLEIGH PROBLEM
Lemma 2:
SOLVING THE COMBINATORIAL RAYLEIGH PROBLEM OPTIMALLY
The problem is a ratio problem.
General technique for ratio problems: the ratio problem can be solved if one can solve the following λ-question:
* This λ is unrelated to the eigenvalue of the spectral method; it is just a parameter.
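One generic way to reduce a ratio minimization to a sequence of λ-questions is a Dinkelbach-style iteration: ask whether some set has f(S) − λ·g(S) < 0, and if so lower λ to that set's ratio. This is a swapped-in textbook technique shown for illustration, not necessarily the procedure used in the talk (which relies on parametric minimum cut). The weight matrix is hypothetical, the ratio is C(S, S̄)/d(S), and the oracle is brute-force enumeration.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical 4-node similarity matrix, used only for illustration.
W = [
    [0, 3, 1, 0],
    [3, 0, 1, 0],
    [1, 1, 0, 5],
    [0, 0, 5, 0],
]
n = len(W)
d = [sum(row) for row in W]

def cut(S):
    """C(S, S-bar): total weight between S and its complement."""
    return sum(W[i][j] for i in S for j in range(n) if j not in S)

def vol(S):
    """d(S): sum of weighted degrees in S."""
    return sum(d[i] for i in S)

# All nonempty proper subsets of the node set (brute force; fine for n = 4).
CANDIDATES = [S for r in range(1, n) for S in combinations(range(n), r)]

def lambda_question(lam):
    """Minimize cut(S) - lam * vol(S); a negative value means some ratio is below lam."""
    best = min(CANDIDATES, key=lambda S: cut(S) - lam * vol(S))
    return cut(best) - lam * vol(best), best

def minimize_ratio():
    S = CANDIDATES[0]
    lam = Fraction(cut(S), vol(S))
    while True:
        value, T = lambda_question(lam)
        if value >= 0:  # no subset has a ratio below lam, so lam is optimal
            return lam, S
        S, lam = T, Fraction(cut(T), vol(T))

best_ratio, best_set = minimize_ratio()
print(best_ratio, best_set)
```

Each round either certifies the current λ or strictly decreases it, so the loop stops after finitely many λ-questions. Exact `Fraction` arithmetic avoids floating-point ties in the stopping test.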
SOLVING THE LAMBDA QUESTION
The λ-question of whether the value of RRP is less than λ is equivalent to determining whether:
Or, from Lemma 1, this is equivalent to:
THE GRAPH GST FOR TESTING THE LAMBDA-QUESTION: A CASE OF IP ON MONOTONE CONSTRAINTS [HOC02]
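[Figure: the graph $G_{st}$. Nodes: s, t, s', t', i, j. Arc capacity labels: $\lambda q_i$, $\lambda q_j$, $\lambda b^2 q_i$, $\lambda b^2 q_j$, $(1+b)^2 w_{ij}$, $(1+b)^2 w_{ji}$, $\infty$.]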
THEOREM
The source set of a minimum cut in the graph $G_{st}$ is an optimal solution to the linearized (RRP).
Proof:
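The statement can be exercised on a toy instance. The sketch below covers only the case b = 0 with q_i = d_i, where the linearized problem is min C(S, S̄) − λ·d(S). The graph construction (source arcs of capacity λ·d_i, arcs i→j of capacity w_ij, and large-capacity arcs tying one seed node to the source and another to the sink) is one reading of the $G_{st}$ figure. The weight matrix, the seeds, and the helper names are hypothetical. A plain Edmonds-Karp max-flow stands in for whatever minimum-cut routine is used in practice.

```python
from collections import deque
from fractions import Fraction

def min_cut_source_set(cap, s, t):
    """Run Edmonds-Karp and return the nodes reachable from s in the final residual graph."""
    res = {}
    for (u, v), c in cap.items():
        res.setdefault(u, {})
        res.setdefault(v, {})
        res[u][v] = res[u].get(v, 0) + c
        res[v].setdefault(u, 0)  # reverse arc for residual flow
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:  # breadth-first search for an augmenting path
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)  # source side of a minimum cut
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push

def lambda_question(W, lam, seed_in, seed_out):
    """Return (S, value): S minimizes C(S, S-bar) - lam * d(S), seed_in in S, seed_out not in S."""
    n = len(W)
    d = [sum(row) for row in W]
    big = (1 + lam) * sum(d) + 1  # exceeds every finite cut, so it acts as infinity
    cap = {}
    for i in range(n):
        cap[("s", i)] = lam * d[i]
        for j in range(n):
            if W[i][j] > 0:
                cap[(i, j)] = W[i][j]
    cap[("s", seed_in)] = big
    cap[(seed_out, "t")] = big
    S = min_cut_source_set(cap, "s", "t") - {"s"}
    cut = sum(W[i][j] for i in S for j in range(n) if j not in S)
    return S, cut - lam * sum(d[i] for i in S)

# Hypothetical 4-node similarity matrix; node 0 is the source seed, node 3 the sink seed.
W = [
    [0, 3, 1, 0],
    [3, 0, 1, 0],
    [1, 1, 0, 5],
    [0, 0, 5, 0],
]
S_low, v_low = lambda_question(W, Fraction(1, 5), seed_in=0, seed_out=3)
S_high, v_high = lambda_question(W, Fraction(1, 2), seed_in=0, seed_out=3)
print(S_low, v_low, S_high, v_high)
```

A cut with source set {s} ∪ S has capacity λ·d(V \ S) + C(S, S̄), which differs from C(S, S̄) − λ·d(S) only by the constant λ·d(V). The minimum cut therefore minimizes the linearized objective. A negative value answers the λ-question with "yes".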
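SIMPLIFYING THE GRAPH (TO MAKE IT PARAMETRIC)
[Figure: the graph on nodes s, t, i, j before and after simplification, with panels for b<1 and b>1. Arc capacity labels before: $\lambda q_i$, $\lambda b^2 q_i$. After: $\lambda(1-b^2) q_i$ for b<1 and $\lambda(b^2-1) q_i$ for b>1. Arcs between i and j: $(1+b)^2 w_{ji}$.]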
SCALING ARCS WEIGHTS
[Figure: panels for b<1 and b>1 on nodes s, t, s', t', i, j. Arc capacity labels: $\lambda(1-b^2) q_i$ (b<1), $\lambda(b^2-1) q_i$ (b>1), $(1+b)^2 w_{ji}$, $\infty$.]
SCALING ARCS WEIGHTS
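[Figure: the same panels after dividing all capacities by $(1+b)^2$. Arc capacity labels: $\lambda \frac{1-b}{1+b} q_i$ (b<1), $\lambda \frac{b-1}{1+b} q_i$ (b>1), $w_{ji}$, $\infty$.]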
THE SIMPLIFIED EQUIVALENT GRAPH
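[Figure: the simplified graph on nodes s, t, s', t', i, j, with arcs of capacity $\infty$ at s' and t' and arcs $w_{ji}$ between i and j. For b<1 the parametric arc capacities are $\lambda q_i \frac{1-b}{1+b}$ and $\lambda q_j \frac{1-b}{1+b}$. For b>1 they are $\lambda q_i \frac{b-1}{1+b}$ and $\lambda q_j \frac{b-1}{1+b}$.]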
SOLVING THE PROBLEM AS A PARAMETRIC MIN-CUT
The problem is a parametric cut problem: a graph setup in which the source-adjacent arc capacities are monotone nondecreasing in the parameter and the sink-adjacent arc capacities are monotone nonincreasing (for b<1). A parametric cut problem can be solved in the complexity of a single minimum cut (plus finding the zero of n monotone functions) [GGT89], [H08]. Here we let the parameter be β:
$$\beta = \begin{cases} \lambda\,\dfrac{1-b}{1+b} & b < 1 \\[2ex] \lambda\,\dfrac{b-1}{1+b} & b \ge 1 \end{cases}$$
IN GST
The cut problem in the graph $G_{st}$, as a function of β, is parametric (the capacities are linear in the parameter on one side and independent of it on the other).
• In a parametric graph, the sequence of source sets of minimum cuts for increasing source-adjacent capacities is nested.
• There are no more than n breakpoints for β.
• There are k ≤ n nested source sets of minimum cuts.
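The nesting property can be observed on a toy instance. The sketch below assumes the b = 0 case with q_i = d_i, so that the objective for a given β is C(S, S̄) − β·d(S), and it finds the best source set by brute force rather than by a parametric min-cut algorithm. The weight matrix and the seed nodes are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical 4-node similarity matrix; node 0 is forced into S, node 3 out of S.
W = [
    [0, 3, 1, 0],
    [3, 0, 1, 0],
    [1, 1, 0, 5],
    [0, 0, 5, 0],
]
n = len(W)
d = [sum(row) for row in W]
SEED_IN, SEED_OUT = 0, 3
free = [i for i in range(n) if i not in (SEED_IN, SEED_OUT)]

def objective(S, beta):
    """C(S, S-bar) - beta * d(S)."""
    cut = sum(W[i][j] for i in S for j in range(n) if j not in S)
    return cut - beta * sum(d[i] for i in S)

def best_source_set(beta):
    """Brute-force minimizer over all sets that contain SEED_IN and exclude SEED_OUT."""
    candidates = [
        frozenset((SEED_IN,) + extra)
        for r in range(len(free) + 1)
        for extra in combinations(free, r)
    ]
    return min(candidates, key=lambda S: objective(S, beta))

betas = [Fraction(0), Fraction(1, 5), Fraction(2, 5), Fraction(1, 2), Fraction(1)]
source_sets = [best_source_set(beta) for beta in betas]
for beta, S in zip(betas, source_sets):
    print(beta, sorted(S))
```

On this instance the optimal set changes exactly once as β grows, and the earlier set is contained in the later one. Only the distinct sets need to be stored, one per breakpoint.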
SOLVING FOR ALL VALUES OF B EFFICIENTLY
For
$$\beta = \begin{cases} \lambda\,\dfrac{1-b}{1+b} & b < 1 \\[2ex] \lambda\,\dfrac{b-1}{1+b} & b \ge 1 \end{cases}$$
given the values of β at the breakpoints, we can generate, for each value of b, all the breakpoints. Consequently, by solving the parametric problem for β once, we obtain simultaneously all the breakpoint solutions for all b, in the complexity of a single minimum cut. For each b we find the last (largest value) breakpoint where the objective value is < 0.
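The conversion between β and λ for a given b is a one-line computation. The sketch below assumes β = λ(1−b)/(1+b) for b < 1 and β = λ(b−1)/(1+b) for b ≥ 1, which is how the arc capacities on the simplified graph read. The function names are hypothetical.

```python
from fractions import Fraction

def beta(lam, b):
    """Parameter beta as a function of lambda and b (assumed piecewise form)."""
    if b < 1:
        return lam * (1 - b) / (1 + b)
    return lam * (b - 1) / (1 + b)

def lam_from_beta(beta_value, b):
    """Invert beta(lam, b); impossible at b = 1, where beta is 0 for every lambda."""
    if b == 1:
        raise ValueError("beta is identically 0 at b = 1")
    return beta_value * (1 + b) / abs(1 - b)

# A breakpoint found once in terms of beta maps to a lambda value for each b.
breakpoint_beta = Fraction(3, 7)
lam_values = {b: lam_from_beta(breakpoint_beta, b)
              for b in (Fraction(0), Fraction(1, 2), Fraction(2))}
print(lam_values)
```

This is why a single parametric solve suffices: the breakpoints are computed in terms of β, and each value of b only rescales them.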
RECALL PROBLEM HNC1: IT IS A SPECIAL CASE
It is equal to the problem solved for b=0:
which has the same solution as:
IMAGE SEGMENTATION WITH HNC1 VS SPECTRAL
[Figure: three panels. Original image; eigenvector result (Shi & Malik); HNC1 result. Objective values shown: $NC = 35 \cdot 10^{-4}$ and $NC = 1.702 \cdot 10^{-4}$.]
Another comparison
[Figure: three panels. Original image; eigenvector result (Shi & Malik); NC' result. Objective values shown: $NC = 127 \cdot 10^{-4}$ and $NC = 1.466 \cdot 10^{-4}$.]
Spectral objective Vs. Combinatorial algorithm's objective [H,Lu,Bertelli13]
[Plot: ratio Obj(Spectral) / Obj(Combinatorial)]
SCALABILITY OF THE ALGORITHM
HNC IN DATA MINING
HNC can be applied to binary classification problems:
• Unsupervised: the method finds a cluster that is distinct from the rest of the nodes and similar to itself
• Supervised (called SNC): training data is linked a priori to either the source or the sink, based on the respective labels
HNC was successfully used in data mining contexts, e.g., denoising spectra of nuclear detectors [YFHNS2014] and ranking drugs according to their effectiveness [HHY2012], and it has been a leading algorithm in the Neurofinder benchmark for cell identification in calcium imaging movies [SHA2017].
TESTING THE EFFECTIVENESS OF HNC AS A DATA MINING PROCEDURE [BAUMANN, H, YANG,17]
Two variants of HNC were tested:
1. The node weights are $d_i$: SNC (supervised HNC)
2. The node weights are the average label of the k nearest neighbors: SNCK
DATA SETS FROM UCI AND LIBSVM REPOSITORY
LOWER AND UPPER BOUNDS OF TUNING PARAM. VALUES
PARTITIONING OF DATA SETS
LOWER AND UPPER BOUNDS OF TUNING PARAM. VALUES
F1-SCORE, PRECISION, RECALL, ACCURACY
NORMALIZED F1-SCORE (TESTED FOR SAME TUNING TIME)
SNC achieves best and most robust performance across data sets
RANK OF TECHNIQUES BASED ON F1-SCORE
STANDARD DEVIATION OF F1-SCORE ACROSS SPLITS
EVALUATION TIME [SEC]
TUNING TIME [SEC]
TAKE AWAYS
• All classification algorithms that use pairwise comparisons perform better than other methods
• Challenge: SNC and other data mining and clustering algorithms that perform well (e.g., KNN and SVM with kernel methods) require a similarity matrix as input
• The number of pairwise similarities grows quadratically in the size of the data set
SPARSE COMPUTATION FOR LARGE-SCALE DATA MINING WITH PAIRWISE COMPARISONS
Known literature: existing sparsification approaches require the complete matrix as input -> not applicable to massively large data sets
Proposed methodology:
• Sidesteps the computationally expensive task of constructing the complete similarity matrix
• Generates only the relevant entries in the similarity matrix without performing pairwise comparisons
SPARSE COMPUTATION WITH APPROXIMATE PCA
Input: Data set as an n x d matrix A containing n objects with d attributes
Output: Sparse n x n similarity matrix
Procedure:
1. Embed the d-dimensional space in a p-dimensional space, for p<<d, with the use of approximate Principal Component Analysis (PCA), based on ConstantTimeSVD of Drineas, Kannan, and Mahoney (2006). Pick r rows/objects of the matrix/data set.
2. Subdivide the range of values in each dimension into k intervals of equal length (a different number of intervals can be used in each dimension).
3. Assign each object to a single block based on its p entries. O(1) work per object.
4. Compute distances, in the original d-dimensional space, between objects that are assigned to the same block.
5. Identify neighboring blocks and compute similarities between objects in those blocks.
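Steps 2 to 5 of the procedure can be sketched as follows. The sketch assumes the low-dimensional coordinates are already available (the approximate-PCA step is not implemented here), uses the same grid resolution k in every dimension, and treats blocks as neighbors when their indices differ by at most 1 in each dimension. The function names, the sample data, and the similarity exp(−distance) are hypothetical choices for illustration.

```python
from itertools import product
from math import dist, exp

def block_index(point, lows, highs, k):
    """Map a p-dimensional point to its grid block (one interval index per dimension)."""
    index = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / k
        if width == 0:
            width = 1.0  # degenerate dimension: every point falls in interval 0
        index.append(min(int((x - lo) / width), k - 1))
    return tuple(index)

def sparse_pairs(low_dim, k):
    """Return pairs (i, j), i < j, of objects whose blocks are identical or adjacent."""
    p = len(low_dim[0])
    lows = [min(pt[a] for pt in low_dim) for a in range(p)]
    highs = [max(pt[a] for pt in low_dim) for a in range(p)]
    blocks = {}
    for i, pt in enumerate(low_dim):
        blocks.setdefault(block_index(pt, lows, highs, k), []).append(i)
    pairs = set()
    for b, members in blocks.items():
        for offset in product((-1, 0, 1), repeat=p):
            neighbor = tuple(bi + oi for bi, oi in zip(b, offset))
            for i in members:
                for j in blocks.get(neighbor, ()):
                    if i < j:
                        pairs.add((i, j))
    return pairs

def sparse_similarities(data, low_dim, k):
    """Similarities are computed in the original space, but only for the selected pairs."""
    return {(i, j): exp(-dist(data[i], data[j])) for i, j in sparse_pairs(low_dim, k)}

# Hypothetical data: 5 objects with 2 attributes; the first attribute stands in
# for a p = 1 dimensional projection.
data = [(0.0, 1.0), (0.1, 0.9), (0.5, 0.2), (0.9, 0.4), (1.0, 0.5)]
low_dim = [(x,) for x, _ in data]

sparse = sparse_similarities(data, low_dim, k=5)
complete = sparse_similarities(data, low_dim, k=2)
print(sorted(sparse), len(complete))
```

With k = 2 every block is adjacent to every other block, so all pairs are generated. Larger k shrinks the blocks and leaves only pairs of nearby objects. No distance is ever computed for a pair that is not selected.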
BLOCK DATA STRUCTURE FOR SPARSIFICATION
Example of block data structure in the space of the p = 3 leading principal components. Here the grid resolution is k = 5 and the length of the intervals is the same for all dimensions.
EFFECTIVENESS OF APPROXIMATE-PCA
• Data set with 583 objects and 10 attributes
• Blue dots represent 416 liver patients and green dots represent 167 non-liver patients
• Two clusters refer to male and female patients
[Figure panels: Exact PCA; Approximate-PCA with r=5]
EMPIRICAL ANALYSIS: LARGE SCALE DATASETS
Source: Machine Learning Repository of the University of California at Irvine
Selection criteria:
• Thousands of objects
• Data from different domains
• Balanced and unbalanced data sets
EXPERIMENTAL DESIGN
Tuning:
• Grid search
• Exponential similarity
• Tuning parameters:
  • Epsilon = {1,...,30}
  • Lambda = {$10^{-5}$,...,$10^{-1}$}
• Normalization of input data
• Sub-sampling validation
• Complete similarity matrix
Testing:
• Number of rows for approx.-PCA
  • CO2: r = 100
  • Remaining sets: r = 30
• Value for grid resolution k
  • CAR, LE1, LE2: k = {2,...,20}
  • ADU, BAN, CO1: k = {3,...,30}
  • CO2: k = {100,...,500}
• k=2 corresponds to complete similarity matrix
• k>2 generates sparse similarity matrix
Implementation: Matlab and C
Machine: Workstation with two Intel E5-2687W (3.1 GHz) and 128 GB RAM
COMPUTATIONAL RESULTS FOR ALL DATA SETS EXCEPT CO2
RESULTS FOR DATA SET CO2:
• Accuracy achieved with the very sparse similarity matrices is very similar to the accuracy obtained with the complete similarity matrix
• Accuracy changes little with increasing grid resolution
• Running time decreases substantially (roughly proportional to density)
• CO2: Accuracy of 89.72% possible with density of 0.008%. The complete similarity matrix would contain over 54 billion entries
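To put the CO2 figures in perspective, the arithmetic below estimates how many similarities a density of 0.008% corresponds to. It takes 54 billion as the size of the complete matrix, which the slide gives only as a lower bound, so the result is a rough order of magnitude.

```python
full_entries = 54e9        # "over 54 billion" entries in the complete similarity matrix
density = 0.008 / 100      # 0.008 % expressed as a fraction

sparse_entries = full_entries * density
print(f"about {sparse_entries / 1e6:.1f} million similarities computed")
```

That is a few million similarities instead of tens of billions.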
SUMMARY
• A clustering/classification optimization model and a combinatorial optimization algorithm that uses pairwise comparisons
• Effective for general classification tasks as well as for specific application contexts
• Efficient in theory and in practice
• The approach of sparse computation enables the use of the method for massively large data sets
SELECTED REFERENCES
D.S. Hochbaum. Polynomial time algorithms for ratio regions and a variant of normalized cut. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5), pp. 889-898, May 2010.
D.S. Hochbaum. A polynomial time algorithm for Rayleigh ratio on discrete variables: Replacing spectral techniques for expander ratio, normalized cut and Cheeger constant. Operations Research, 61(1), pp. 184-198, 2013.
D.S. Hochbaum, C.N. Hsu, Y.T. Yang. Ranking of Multidimensional Drug Profiling Data by Fractional Adjusted Bi-Partitional Scores. Bioinformatics, 28(12), pp. 106-114, 2012.
D.S. Hochbaum, C. Lu, E. Bertelli. Evaluating Performance of Image Segmentation Criteria and Techniques. EURO Journal on Computational Optimization, 1(1-2), pp. 155-180, May 2013.
D.S. Hochbaum, P. Baumann, Y.T. Yang. A comparative study of machine learning methodologies and two new effective optimization algorithms. EJOR, 2019.
D.S. Hochbaum, P. Baumann. Sparse Computation for Large-Scale Data Mining. IEEE Transactions on Big Data, 2(2), pp. 151-174, 2016.
QUESTIONS
Dorit S. Hochbaum [email protected] http://www.ieor.berkeley.edu/~hochbaum/