Thesis Proposal
Learning with Sparsity: Structures, Optimization and Applications
Xi Chen
Committee Members: Jaime Carbonell (chair), Tom Mitchell, Larry Wasserman, Robert Tibshirani
Machine Learning Department, Carnegie Mellon University
Modern Data Analysis
- Gene expression data for tumor classification. Characteristics: high-dimensional; very few samples; complex structure
- Climate data. Characteristic: dynamic, complex structure
- Web-text data. Characteristics: both high-dimensional and massive in volume; structure among word features (e.g., synonyms)
Challenges: high dimensionality; complex and dynamic structures
Solutions: Sparse Learning
- Sparse regression for feature selection and prediction: smooth convex loss + L1 regularization [Tibshirani 96]
- Incorporating structural prior knowledge: structured penalties (e.g., group, hierarchical tree, graph) [Tibshirani et al., 05; Jenatton et al., 09; Peng et al., 09; Friedman et al., 10; Kim et al., 10]
- Nonparametric sparse regression (flexible models): sparse additive models [Ravikumar et al., 09]
Sparse Learning in Graphical Models
Undirected graphical models (Markov random fields): e.g., a gene graph, or a pairwise model for images
Goal: learn the sparse structure of the graphical model
- Graphical Lasso (gLasso) [Yuan et al., 06; Friedman et al., 07; Banerjee et al., 08]
- Iterated Lasso [Meinshausen and Buhlmann, 06]
- Forest Density Estimator [Liu et al., 10]
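As a toy illustration of what "sparse structure" means here (my own sketch, not one of the estimators cited above): in a Gaussian graphical model the missing edges are exactly the zeros of the precision matrix, so with many samples even a thresholded inverse of the empirical covariance reveals a chain graph. The chain example and the 0.3 threshold are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparse precision matrix for a 4-node chain graph 0-1-2-3.
Omega = np.array([[2.0, 0.6, 0.0, 0.0],
                  [0.6, 2.0, 0.6, 0.0],
                  [0.0, 0.6, 2.0, 0.6],
                  [0.0, 0.0, 0.6, 2.0]])
Sigma = np.linalg.inv(Omega)

# Sample n observations from N(0, Sigma).
n = 20000
X = rng.multivariate_normal(np.zeros(4), Sigma, size=n)

# Empirical precision; with n >> p a plain inverse is stable enough.
Omega_hat = np.linalg.inv(X.T @ X / n)

# Recover the edge set by thresholding the off-diagonal entries.
edges = {(i, j) for i in range(4) for j in range(i + 1, 4)
         if abs(Omega_hat[i, j]) > 0.3}
```

A real estimator such as gLasso replaces the naive inversion with L1-penalized maximum likelihood, which also works when n is small relative to p.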
Thesis Overview: High-Dimensional Sparse Learning with Structures

1. Sparse single/multi-task regression with general structured penalties
   - Challenge: computation
   - Completed work: a unified optimization framework, Smoothing Proximal Gradient [UAI 11, AOAS]
   - Future work: (1) online learning for massive data; (2) incorporating structured penalties into other models (e.g., PCA, CCA)
2. Learning sparse structures for undirected graphical models
   - Existing work: static or time-varying graphs; challenge: dynamic structures
   - Completed work: conditional Gaussian graphical models, via (1) a kernel smoothing method for spatial-temporal graphs [AAAI 10] and (2) a partition-based method [NIPS 10]
   - Future work: relax the conditional Gaussian assumption to handle continuous and discrete data
3. Nonparametric sparse regression
   - Existing work: additive models; challenges: (1) generalized models, (2) structures
   - Completed work: (1) generalized forward regression [NIPS 09]; (2) penalized tree regression [NIPS 10]
   - Future work: incorporating rich structures

Application areas: tumor classification using gene expression data [UAI 11, AOAS], climate data analysis [AAAI 10, NIPS 10], web-text mining [ICDM 10, SDM 10]
Roadmap
- Smoothing Proximal Gradient for Structured Sparse Regression
- Structure Learning in Graphical Models
- Nonparametric Sparse Regression
- Summary and Timeline
- Q & A
Useful Structures and Structured Penalty
Group structure (group-wise selection): group lasso [Yuan 06]; multi-task and overlapping-group variants [Peng et al., 09; Kim et al., 10]; hierarchical tree penalty [Bach et al., 09]
- Application: pathway selection for gene-expression data in tumor classification
- Example of a hierarchical tree structure over word features: WordNet
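To make the group penalty concrete, here is a minimal sketch (my own, with made-up example groups) of the group-lasso penalty, a sum of weighted L2 norms over coefficient groups, and its proximal operator for non-overlapping groups, which is group-wise soft-thresholding. Overlapping groups, the case SPG targets, have no such closed form.

```python
import numpy as np

def group_lasso_penalty(beta, groups, weights=None):
    """Sum of (weighted) L2 norms over coefficient groups."""
    if weights is None:
        weights = [1.0] * len(groups)
    return sum(w * np.linalg.norm(beta[g]) for g, w in zip(groups, weights))

def prox_group_lasso(v, groups, t):
    """Proximal operator of t * sum_g ||beta_g||_2 for NON-overlapping
    groups: shrink each group's norm by t, zeroing small groups."""
    beta = v.copy()
    for g in groups:
        nrm = np.linalg.norm(v[g])
        beta[g] = 0.0 if nrm <= t else (1 - t / nrm) * v[g]
    return beta

groups = [np.array([0, 1]), np.array([2, 3, 4])]
v = np.array([3.0, 4.0, 0.1, 0.1, 0.1])
b = prox_group_lasso(v, groups, t=1.0)
# first group (norm 5) is shrunk to norm 4; second (norm ~0.17) is zeroed
```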
Useful Structures and Structured Penalty
Graph structure (to enforce smoothness among coefficients):
- Fused lasso: encourages piece-wise constant coefficients [Tibshirani et al., 05]
- Graph-guided fused penalty: encourages smoothness over a feature graph [Kim et al., 10]
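A small sketch of the two penalties (my own toy functions; the edge set and correlations are made up): the chain fused-lasso term penalizes successive differences, while a GFlasso-style graph-guided term weights each edge by the strength and sign of the output correlation, so negatively correlated pairs are pushed toward opposite-sign coefficients.

```python
import numpy as np

def fused_penalty(beta):
    """Chain fused-lasso penalty: sum of |beta_j - beta_{j+1}|."""
    return float(np.sum(np.abs(np.diff(beta))))

def graph_guided_penalty(beta, edges, r):
    """Graph-guided fusion (GFlasso-style): for each edge (i, j) with
    correlation r_ij, penalize |r_ij| * |beta_i - sign(r_ij) * beta_j|."""
    return float(sum(abs(rij) * abs(beta[i] - np.sign(rij) * beta[j])
                     for (i, j), rij in zip(edges, r)))

beta = np.array([1.0, 1.0, -1.0])
edges = [(0, 1), (1, 2)]
r = [0.9, -0.8]   # edge (1, 2) is negatively correlated
chain = fused_penalty(beta)                     # |1-1| + |1-(-1)| = 2
gg = graph_guided_penalty(beta, edges, r)       # 0: signs match the graph
```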
Challenge
A unified, efficient, and scalable optimization framework for solving all of these structured penalties, in both single-task and multi-task regression.
Difficulty: the penalties are nonsmooth and nonseparable.
Existing Optimization Methods
- Interior-point method (IPM) for second-order cone programming (SOCP) / quadratic programming (QP): poor scalability, since a huge Newton linear system must be solved at every iteration
- Subgradient descent (a first-order method): slow convergence
- Block coordinate descent (optimize one block at a time while keeping the other variables fixed): cannot be applied to non-separable penalties
- Proximal gradient descent (a first-order method) [Nesterov 07; Beck and Teboulle, 09]: cannot be applied to complex structured penalties, because the proximal operator has no exact solution
- Augmented Lagrangian / alternating-direction methods: no convergence-rate result; global convergence known only for 2 blocks; a large-scale linear system must be solved at every iteration

Proximal operator: prox_h(v) = argmin_b { (1/2)||b - v||^2 + h(b) }
Overview: Smoothing Proximal Gradient (SPG)
A first-order method (gradient information only), hence fast and scalable, even though the proximal operator of the structured penalty has no exact solution.
Idea [Nesterov 05]:
1) Reformulate the structured penalty via its dual norm
2) Introduce a smooth approximation of the penalty
3) Plug the smooth approximation back into the original problem and solve it by an accelerated proximal gradient method
Convergence result: an epsilon-accurate solution in O(1/epsilon) iterations.
Why is the Approximation Smooth?
Geometric interpretation: the original penalty is the uppermost envelope of a family of linear functions, which is nonsmooth; after subtracting a strongly convex proximity term inside the max, the maximizer becomes unique and the uppermost envelope becomes smooth.
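For the simplest case, |x| = max over |a| <= 1 of a*x, the construction can be checked numerically (my own sketch; the value mu = 0.5 is arbitrary): subtracting a strongly convex term mu*a^2/2 inside the max yields the Huber function, which is smooth and never further than mu/2 below |x|.

```python
import numpy as np

def smoothed_abs(x, mu, grid=np.linspace(-1, 1, 2001)):
    """Nesterov smoothing of |x| = max_{|a|<=1} a*x: subtract the
    strongly convex term mu*a^2/2 inside the max (grid search over a)."""
    return float(np.max(grid * x - 0.5 * mu * grid ** 2))

def huber(x, mu):
    """Closed form of the smoothed penalty (the Huber function)."""
    return x ** 2 / (2 * mu) if abs(x) <= mu else abs(x) - mu / 2

mu = 0.5
for x in [-2.0, -0.2, 0.0, 0.3, 1.5]:
    # the max formulation matches the closed form ...
    assert abs(smoothed_abs(x, mu) - huber(x, mu)) < 1e-3
    # ... and sits at most mu/2 below |x|
    assert 0 <= abs(x) - huber(x, mu) <= mu / 2 + 1e-12
```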
Smoothing Proximal Gradient (SPG)
- Original problem: a convex smooth loss plus a nonsmooth penalty with complex structure (plus a separable L1 term)
- Approximated problem: replace the complex structured penalty by its smooth approximation; the only remaining nonsmooth term (the L1 penalty) has good separability
- The gradient of the smooth approximation follows from Danskin's theorem
- Solve with an accelerated proximal gradient method [Nesterov 07; Beck and Teboulle, 09]; the proximal operator reduces to soft-thresholding
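The whole recipe can be sketched compactly on a toy graph-guided fused lasso (my own simplified implementation and toy data; the chain graph, lambda, gamma, and mu are arbitrary): the nonseparable term gamma*||C b||_1 is smoothed via its dual, its gradient comes from the clipped dual maximizer (Danskin's theorem), and the separable L1 term is handled exactly by soft-thresholding inside a FISTA loop.

```python
import numpy as np

def spg_ggfl(X, y, C, lam, gam, mu=1e-2, iters=500):
    """Smoothing proximal gradient sketch for
        0.5*||X b - y||^2 + lam*||b||_1 + gam*||C b||_1,
    where the rows of C encode graph-fusion differences."""
    p = X.shape[1]
    # Lipschitz constant of the smoothed objective's gradient.
    L = np.linalg.norm(X, 2) ** 2 + (gam ** 2) * np.linalg.norm(C, 2) ** 2 / mu
    b = np.zeros(p)
    w = b.copy()
    t = 1.0
    for _ in range(iters):
        # Optimal dual variable of the smoothed fusion term (clipping).
        alpha = np.clip(gam * (C @ w) / mu, -1.0, 1.0)
        grad = X.T @ (X @ w - y) + gam * (C.T @ alpha)
        v = w - grad / L
        # Prox of the separable L1 term: soft-thresholding.
        b_new = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2   # FISTA momentum
        w = b_new + ((t - 1) / t_new) * (b_new - b)
        b, t = b_new, t_new
    return b

rng = np.random.default_rng(1)
p = 6
beta_true = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
X = rng.normal(size=(200, p))
y = X @ beta_true + 0.1 * rng.normal(size=200)
# Chain-graph incidence matrix: rows are (b_i - b_{i+1}).
C = (np.eye(p) - np.eye(p, k=1))[:-1]
b = spg_ggfl(X, y, C, lam=0.5, gam=0.5)
```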
Convergence Rate

Method      | Iterations to an epsilon-accurate solution | Per-iteration time / storage
SPG         | O(1/epsilon)                               | cheap (gradient)
Subgradient | O(1/epsilon^2)                             | cheap (gradient)
IPM         | O(log(1/epsilon))                          | expensive (Newton linear system / Hessian)
Multi-Task Extension
The single-task formulation and the SPG algorithm carry over directly to multi-task regression with structured penalties coupling the tasks.
Simulation Study
Multi-task graph-guided fused lasso: SNP genotype data as inputs, gene-expression data as outputs.
Biological Application
Breast cancer tumor classification: gene-expression data for 8,141 genes in 295 breast cancer tumors (78 metastatic, 217 non-metastatic), with a logistic regression loss.
- Canonical pathways from MSigDB, containing 637 groups of genes
- SPG for overlapping group lasso; training:test = 2:1
- Entire regularization path (20 parameters) computed in 331 seconds
- Important pathways identified: proteasome, nicotinate (ENPP1)
Proposed Research
More applications for SPG:
- Web-scale learning with massive amounts of data: inputs arrive sequentially at a high rate, and real-time service must be provided
- Solution: stochastic optimization for online learning
- Complex structured penalties: handled via the smoothing technique
- Simple penalties with good separability: closed-form solution for the proximal operator (e.g., low rank + sparse)
Proposed Research
Stochastic optimization:
- Replace the deterministic gradient with a stochastic (online) gradient
- Existing methods, e.g., RDA [Xiao 10] and accelerated stochastic gradient descent [Lan et al., 10], can ruin the sparsity pattern
- Goal: sparsity-preserving stochastic optimization for large-scale online learning
Structured sparsity beyond regression:
- Canonical correlation analysis and its application in genome-wide association studies
Roadmap
- Smoothing Proximal Gradient for Structured Sparse Regression
- Structure Learning in Graphical Models
- Nonparametric Sparse Regression
- Summary and Timeline
- Q & A
Gaussian Graphical Model [Lauritzen 96]
- The sparsity pattern of the precision (inverse covariance) matrix encodes the graph: a zero entry means no edge
- Graphical Lasso (gLasso): L1-penalized maximum-likelihood estimation of a sparse precision matrix [Yuan et al., 06; Friedman et al., 07; Banerjee et al., 08]
- Challenge: dynamic graph structure
Idea: Graph-Valued Regression
- Combine multivariate regression with undirected graphical models [Zhou et al., 08; Song et al., 09]
- Graph-valued regression: given paired input data (x_i, y_i), estimate the conditional independence graph of Y as a function of the input, i.e., G(x)
- Application: graphs that vary with covariates such as time or location
Applications for Higher-Dimensional X
Example: X contains patient symptom characterizations; Y contains gene-expression levels.
Kernel Smoothing Estimator
- Assume Y | X = x is conditionally Gaussian
- Estimate a local covariance at each query point x by kernel-weighted averaging, then obtain the local graph estimate G^(x) by applying gLasso to it
- Cons: (1) unstable when the dimension of x is high; (2) computationally heavy and difficult to analyze; (3) hard to visualize
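The kernel smoothing step can be sketched as follows (my own toy with a scalar x and 2-dimensional Y; the sign-flipping correlation is made up): weight the samples by a Gaussian kernel centered at the query point and form the weighted covariance, which would then be handed to a graph estimator; here we just inspect a covariance entry directly.

```python
import numpy as np

def local_covariance(x0, X, Y, h):
    """Kernel-smoothing covariance estimate of Y at the query point x0:
    a weighted covariance with Gaussian kernel weights on x."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    w = w / w.sum()
    mu = w @ Y
    Yc = Y - mu
    return (Yc * w[:, None]).T @ Yc

rng = np.random.default_rng(2)
n = 4000
X = rng.uniform(-1, 1, size=n)
# The correlation between the two Y coordinates flips sign with x.
rho = np.where(X > 0, 0.8, -0.8)
z = rng.normal(size=(n, 2))
Y = np.stack([z[:, 0], rho * z[:, 0] + np.sqrt(1 - rho ** 2) * z[:, 1]], axis=1)
S_left = local_covariance(-0.5, X, Y, h=0.1)
S_right = local_covariance(0.5, X, Y, h=0.1)
# the two local estimates disagree in sign, so the local graphs differ
```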
Partition-Based Estimator: Graph-Optimized CART (Go-CART)
- Builds on CART (Classification and Regression Trees) [Breiman 84; Tibshirani et al., 09]: partition the X-space and fit one graphical model per cell
- Difficulty: with a graphical model fit inside each cell, searching for the best split point is expensive
Dyadic Partitioning Tree (DPT) [Scott and Nowak, 04]
- Recursively split the input space in half along coordinate axes, always at dyadic (midpoint) positions
- Assumptions and notation: the inputs are rescaled to the unit hypercube [0,1]^d
Graph-Optimized CART (Go-CART)
- Penalized risk minimization estimator: choose the dyadic partition minimizing the empirical risk plus a complexity penalty
- Held-out risk minimization estimator: split the data into a training set (to fit the per-cell graphs) and a held-out set (to select the partition)
- Practical algorithm: greedy learning using held-out data
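A stripped-down sketch of the greedy held-out procedure (my own toy, not the thesis implementation: a scalar x on [0,1], a full-covariance Gaussian per cell, and midpoint-only splits): split a cell whenever the held-out-weighted average Gaussian risk of the two children beats the parent's.

```python
import numpy as np

def heldout_risk(Ytr, Yho):
    """Gaussian negative log-likelihood (up to constants) of held-out
    rows Yho under the mean/covariance fitted on training rows Ytr."""
    if len(Ytr) < 5 or len(Yho) < 1:
        return np.inf
    mu = Ytr.mean(axis=0)
    S = np.cov(Ytr.T) + 1e-6 * np.eye(Ytr.shape[1])
    P = np.linalg.inv(S)
    d = Yho - mu
    return 0.5 * np.log(np.linalg.det(S)) + 0.5 * np.mean(np.sum((d @ P) * d, axis=1))

def grow(lo, hi, Xtr, Ytr, Xho, Yho, depth=0, max_depth=4):
    """Greedily grow a dyadic partition of [lo, hi): split at the
    midpoint whenever doing so lowers the held-out risk."""
    itr = (Xtr >= lo) & (Xtr < hi)
    iho = (Xho >= lo) & (Xho < hi)
    risk_leaf = heldout_risk(Ytr[itr], Yho[iho])
    if depth >= max_depth:
        return [(lo, hi)]
    mid = (lo + hi) / 2
    rl = heldout_risk(Ytr[itr & (Xtr < mid)], Yho[iho & (Xho < mid)])
    rr = heldout_risk(Ytr[itr & (Xtr >= mid)], Yho[iho & (Xho >= mid)])
    nl = int(np.sum(iho & (Xho < mid)))
    nr = int(np.sum(iho & (Xho >= mid)))
    if (nl > 0 and nr > 0 and np.isfinite(rl) and np.isfinite(rr)
            and (nl * rl + nr * rr) / (nl + nr) < risk_leaf):
        return (grow(lo, mid, Xtr, Ytr, Xho, Yho, depth + 1, max_depth)
                + grow(mid, hi, Xtr, Ytr, Xho, Yho, depth + 1, max_depth))
    return [(lo, hi)]

rng = np.random.default_rng(3)
n = 2000
X = rng.uniform(0, 1, n)
rho = np.where(X < 0.5, 0.9, -0.9)       # correlation flips at x = 0.5
z = rng.normal(size=(n, 2))
Y = np.stack([z[:, 0], rho * z[:, 0] + np.sqrt(1 - rho ** 2) * z[:, 1]], axis=1)
half = n // 2
cells = grow(0.0, 1.0, X[:half], Y[:half], X[half:], Y[half:])
# the recovered partition respects the true change point at 0.5
```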
Statistical Properties
- Without assuming that the underlying partition is dyadic: define the oracle risk, and prove an oracle inequality bounding the excess risk over the oracle
- Adding the assumption that the underlying partition is dyadic: tree partitioning consistency (the estimator might return a finer partition than the truth)
Real Climate Data Analysis
Data description [Lozano et al., 09, IBM]: 125 locations in the U.S., 1990-2002 (13 years), monthly observations of 18 variables/factors.

Variables | Type
CO2, CH4, H2, CO | Greenhouse gases
Precipitation (PRE); Vapor (VAP); Cloud Cover (CLD); Wet Days (WET); Frost Days (FRS) | Weather
Avg. Temp. (TMP); Diurnal Temp. Range (DTR); Min. Temp. (TMN); Max. Temp. (TMX) | Temperature
Global Radiation (GLO); Direct Radiation (DIR); Extraterrestrial Global Radiation (ETR); Extraterrestrial Direct Normal Radiation (ETRN) | Solar radiation
Ultraviolet irradiance (UV) | Aerosol index
Real Climate Data Analysis
Observations:
(1) For graphical lasso, no edge connects the greenhouse gases (CO2, CH4, CO, H2) with the solar radiation factors (GLO, DIR), which contradicts the IPCC report; with Go-CART, such edges are present.
(2) Graphs along the coasts are sparser than those in the mainland.
Proposed Research
Limitations of Go-CART:
(1) the conditional Gaussian assumption;
(2) it only handles continuous Y; for discrete Y an approximate likelihood is needed.
Forest graphical models [Chow and Liu, 68; Tan et al., 09; Liu et al., 11]:
- The density involves only univariate and bivariate marginals
- Compute the mutual information for each pair of variables
- Greedily learn the tree structure via the Chow-Liu algorithm
- Handle both continuous and discrete data
Proposed: forest-valued regression.
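The tree-learning bullets above can be sketched directly (my own minimal implementation for discrete data; the binary Markov chain and its flip probability are made up): compute pairwise empirical mutual information, then run a maximum-weight spanning tree (Kruskal's algorithm with union-find), which yields the Chow-Liu tree.

```python
import numpy as np
from itertools import combinations

def mutual_info(a, b):
    """Empirical mutual information between two discrete sequences."""
    mi = 0.0
    for x in np.unique(a):
        for y in np.unique(b):
            pxy = np.mean((a == x) & (b == y))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(a == x) * np.mean(b == y)))
    return mi

def chow_liu(X):
    """Chow-Liu tree: maximum-weight spanning tree under pairwise MI
    (Kruskal's algorithm with a simple union-find)."""
    p = X.shape[1]
    parent = list(range(p))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    scored = sorted(((mutual_info(X[:, i], X[:, j]), i, j)
                     for i, j in combinations(range(p), 2)), reverse=True)
    tree = []
    for _, i, j in scored:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Binary Markov chain 0 -> 1 -> 2 -> 3: each bit copies the previous
# one and flips with probability 0.1.
rng = np.random.default_rng(4)
n = 5000
X = np.zeros((n, 4), dtype=int)
X[:, 0] = rng.integers(0, 2, n)
for k in range(1, 4):
    X[:, k] = X[:, k - 1] ^ (rng.random(n) < 0.1)
tree = chow_liu(X)
```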
Roadmap
- Smoothing Proximal Gradient for Structured Sparse Regression
- Structure Learning in Graphical Models
- Nonparametric Sparse Regression
- Summary and Timeline
- Q & A
Nonparametric Regression
- From parametric models to additive models [Hastie et al., 90] to sparse additive models [Ravikumar et al., 09]
- Generalized nonparametric models: also model interactions between variables
- Bottleneck: computation
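A minimal sketch of sparse additive model fitting (my own simplified backfitting with a Nadaraya-Watson smoother; the bandwidth, threshold, and test functions are arbitrary): cycle over coordinates, smooth the partial residual, and soft-threshold each component function by its empirical norm so irrelevant variables are driven toward zero.

```python
import numpy as np

def smooth(x, r, h=0.1):
    """Nadaraya-Watson smoother of the residual r against covariate x."""
    W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (W @ r) / W.sum(axis=1)

def spam_backfit(X, y, lam, iters=20, h=0.1):
    """Backfitting sketch for sparse additive models: smooth each
    partial residual, then soft-threshold the component function by its
    empirical norm, zeroing out weak components."""
    n, p = X.shape
    F = np.zeros((n, p))           # fitted component functions
    yc = y - y.mean()
    for _ in range(iters):
        for j in range(p):
            r = yc - F.sum(axis=1) + F[:, j]   # partial residual
            fj = smooth(X[:, j], r, h)
            fj = fj - fj.mean()
            s = np.sqrt(np.mean(fj ** 2))
            F[:, j] = 0.0 if s <= lam else (1 - lam / s) * fj
    return F

rng = np.random.default_rng(5)
n, p = 300, 4
X = rng.uniform(-1, 1, size=(n, p))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)
F = spam_backfit(X, y, lam=0.1)
norms = np.sqrt(np.mean(F ** 2, axis=0))
# the two signal components keep large norms; the others stay near zero
```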
My Work and Proposed Research
Greedy learning methods:
- Additive Forward Regression (AFR): a generalization of Orthogonal Matching Pursuit [Tropp et al., 06] to the nonparametric setting
- Generalized Forward Regression (GFR)
Penalized regression tree method.
Proposed research:
- Formulate the functional forms of structured penalties
- Develop efficient algorithms for the corresponding nonparametric structured sparse regression
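The AFR idea can be sketched as a function-space analogue of OMP (my own simplified version, not the thesis implementation; the smoother, bandwidth, and toy data are arbitrary): at each step, smooth the current residual against every unselected variable and greedily add the one giving the largest drop in residual sum of squares.

```python
import numpy as np

def nw(x, r, h=0.15):
    """Nadaraya-Watson smoother of residual r against covariate x."""
    W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (W @ r) / W.sum(axis=1)

def additive_forward_regression(X, y, steps=2):
    """Greedy OMP-style rule in function space: add the variable whose
    univariate smooth fit of the residual reduces the RSS the most."""
    n, p = X.shape
    selected = []
    fhat = np.full(n, y.mean())
    for _ in range(steps):
        resid = y - fhat
        best_j, best_drop, best_fj = None, 0.0, None
        for j in range(p):
            if j in selected:
                continue
            fj = nw(X[:, j], resid)
            drop = np.sum(resid ** 2) - np.sum((resid - fj) ** 2)
            if drop > best_drop:
                best_j, best_drop, best_fj = j, drop, fj
        if best_j is None:
            break
        selected.append(best_j)
        fhat = fhat + best_fj
    return selected, fhat

rng = np.random.default_rng(6)
n, p = 400, 6
X = rng.uniform(-1, 1, size=(n, p))
y = np.sin(np.pi * X[:, 2]) + 0.5 * X[:, 4] + 0.1 * rng.normal(size=n)
selected, fhat = additive_forward_regression(X, y, steps=2)
# the two signal variables are picked, strongest first
```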
Roadmap
- Smoothing Proximal Gradient for Structured Sparse Regression
- Structure Learning in Graphical Models
- Nonparametric Sparse Regression
- Summary and Timeline
- Q & A
Summary and Timeline
Acknowledgements
My Committee Members
Jaime Carbonell (advisor),
Tom Mitchell,
Larry Wasserman,
Robert Tibshirani
Thanks also to: Eric P. Xing, John Lafferty, Seyoung Kim, Manuel Blum, Aarti Singh,
Jeff Schneider, Javier Pena, Han Liu, Qihang Lin, Junming Yin,
Xiong Liang, Tzu-Kuo Huang, Min Xu, Mladen Kolar, Yan Liu,
Jingrui He, Yanjun Qi, Bing Bai
IBM Fellowship
Feedback: Xi Chen ([email protected])