![Page 1: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/1.jpg)
Sparse Proteomics Analysis (SPA)
Toward a Mathematical Theory for Feature Selection from Forward Models
Martin Genzel
Technische Universität Berlin
Winter School on Compressed Sensing
December 5, 2015
![Page 2: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/2.jpg)
Outline
1 Biological Background
2 Sparse Proteomics Analysis (SPA)
3 Theoretical Foundation by High-dimensional Estimation Theory
Martin Genzel Sparse Proteomics Analysis (SPA) WiCoS 2015 2 / 19
![Page 3: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/3.jpg)
1 Biological Background
![Page 4: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/4.jpg)
What is Proteomics?
The pathological mechanisms of many diseases, such as cancer, are manifested at the level of protein activities.
To improve clinical treatment options and early diagnostics, we need to understand protein structures and their interactions!
Proteins are long chains of amino acids, controlling many biological and chemical processes in the human body.
The entire set of proteins at a certain point in time is called a proteome.
Proteomics is the large-scale study of the human proteome.
http://www.topsan.org/Proteins/JCSG/3qxb
![Page 7: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/7.jpg)
What is Mass Spectrometry?
How to “capture” a proteome?
Mass spectrometry (MS) is a popular technique to detect the abundance of proteins in samples (blood, urine, etc.).
[Figure: Schematic Work-Flow. Labels: Sample, Laser, Detector, Mass spectrum; axes: Mass (m/z) vs. Intensity (cts).]
![Page 10: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/10.jpg)
Real-World MS-Data
[Figure: real-world mass spectrum; axes: Mass (m/z) vs. Intensity (cts).]
MS-vector: $x = (x_1, \dots, x_d) \in \mathbb{R}^d$, $d \approx 10^4 \dots 10^6$
Index = Mass/Feature, Entry = Intensity/Amplitude
![Page 15: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/15.jpg)
Feature Selection from MS-Data
Goal: Detect a small set of features (disease fingerprint) that allows for an appropriate distinction between the diseased and healthy group.
[Figure: Schematic Work-Flow with three stages. Samples: blood from healthy individual, blood from diseased individual. Mass Spectra: obtained by MS; axes: Mass (m/z) vs. Intensity (cts). Feature Selection: comparing the spectra yields the Disease Fingerprint.]
![Page 16: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/16.jpg)
Mathematical Problem Formulation
Supervised Learning: We are given n samples $(x_1, y_1), \dots, (x_n, y_n)$.
$x_k \in \mathbb{R}^d$: Mass spectrum of the k-th patient
$y_k \in \{-1, +1\}$: Health status of the k-th patient (healthy = +1, diseased = −1)
Goal: Learn a feature vector $\omega \in \mathbb{R}^d$ which is
sparse, i.e., has few non-zero entries (⇒ stability, avoid overfitting),
and whose entries correspond to peaks that are highly correlated with the disease (⇒ interpretability, biological relevance).
![Page 21: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/21.jpg)
How to learn a fingerprint ω?
![Page 22: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/22.jpg)
2 Sparse Proteomics Analysis (SPA)
![Page 23: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/23.jpg)
Sparse Proteomics Analysis (SPA)
Sparse Proteomics Analysis is a generic framework to meet this challenge.
Input: Sample pairs $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \{-1, +1\}$
Compute:
1 Preprocessing (Smoothing, Standardization)
2 Feature Selection (LASSO, $\ell_1$-SVM, Robust 1-Bit CS)
3 Postprocessing (Sparsification)
Output: Sparse feature vector $\omega \in \mathbb{R}^d$
⇒ Biomarker extraction, dimension reduction
[Figure: Blood Sample, mass spectrum (axes: Mass (m/z) vs. Intensity (cts)), Biomarker Identification.]
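The three SPA steps can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the slides only name "Smoothing", "Standardization" and "Sparsification", so the moving-average filter, the per-feature standardization and the keep-the-s-largest rule are assumed choices, and `correlation_selector` is a hypothetical stand-in for step 2 (it is not LASSO, $\ell_1$-SVM or 1-bit CS).

```python
import statistics

def smooth(spectrum, window=3):
    """Moving-average smoothing (assumed filter; the slides only say 'Smoothing')."""
    half = window // 2
    out = []
    for i in range(len(spectrum)):
        lo, hi = max(0, i - half), min(len(spectrum), i + half + 1)
        out.append(sum(spectrum[lo:hi]) / (hi - lo))
    return out

def standardize(spectra):
    """Center and scale every feature (column) across all samples."""
    d = len(spectra[0])
    cols = [[x[j] for x in spectra] for j in range(d)]
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # constant column: scale by 1
    return [[(x[j] - means[j]) / stds[j] for j in range(d)] for x in spectra]

def sparsify(omega, s):
    """Keep the s largest-magnitude entries of omega and zero out the rest."""
    order = sorted(range(len(omega)), key=lambda j: abs(omega[j]), reverse=True)
    keep = set(order[:s])
    return [w if j in keep else 0.0 for j, w in enumerate(omega)]

def correlation_selector(spectra, labels):
    """Hypothetical stand-in for step 2: label/feature correlation scores."""
    n, d = len(spectra), len(spectra[0])
    return [sum(labels[k] * spectra[k][j] for k in range(n)) / n for j in range(d)]

def spa(spectra, labels, select, s):
    """Preprocessing -> feature selection -> postprocessing."""
    pre = standardize([smooth(x) for x in spectra])
    return sparsify(select(pre, labels), s)

spectra = [[0.0, 5.0, 0.0, 1.0, 0.0],
           [0.0, 4.0, 0.0, 0.0, 1.0],
           [0.0, 0.0, 0.0, 1.0, 5.0],
           [1.0, 0.0, 0.0, 0.0, 4.0]]
labels = [+1, +1, -1, -1]
omega = spa(spectra, labels, correlation_selector, s=2)
```

Passing the selector as a function keeps the sketch generic, mirroring the slide's point that SPA is a framework in which step 2 can be exchanged.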
Rest of this talk: step 2, Feature Selection.
![Page 26: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/26.jpg)
Feature Selection (Geometric Intuition)
Linear Separation Model: Find a feature vector $\omega \in \mathbb{R}^d$ such that
$y_k = \operatorname{sign}(\langle x_k, \omega \rangle)$ for “many” $k \in \{1, \dots, n\}$.
Moreover, ω should be sparse and interpretable.
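To make "for many k" concrete, one can count the fraction of samples whose label agrees with the sign of the inner product. The snippet below is a small sketch with made-up toy data; the convention sign(0) = +1 is an assumption, since the slides do not specify it.

```python
def inner(x, w):
    """Euclidean inner product <x, w>."""
    return sum(a * b for a, b in zip(x, w))

def sign(t):
    """Sign function with the (assumed) convention sign(0) = +1."""
    return 1 if t >= 0 else -1

def separation_accuracy(samples, labels, omega):
    """Fraction of samples k with y_k == sign(<x_k, omega>)."""
    hits = sum(1 for x, y in zip(samples, labels) if sign(inner(x, omega)) == y)
    return hits / len(labels)

# Toy data (hypothetical): three "spectra" with d = 3 features.
samples = [[2.0, 0.0, 1.0], [-1.0, 3.0, 0.0], [1.0, 7.0, 0.0]]
labels = [+1, -1, -1]
omega = [1.0, 0.0, 0.0]          # 1-sparse candidate fingerprint
accuracy = separation_accuracy(samples, labels, omega)
```

Here the first two samples are classified correctly and the third is not, so the candidate separates two of the three samples.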
![Page 27: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/27.jpg)
Feature Selection via the LASSO
The LASSO (Tibshirani ’96)
$\min_{\omega \in \mathbb{R}^d} \sum_{k=1}^{n} (y_k - \langle x_k, \omega \rangle)^2$ subject to $\|\omega\|_1 \le R$
Multivariate approach, originally designed for linear regression models:
$y_k \approx \langle x_k, \omega \rangle$, $k = 1, \dots, n$.
But also applicable to non-linear models → Next part
Later: $R \approx \sqrt{s}$ to allow for s-sparse solutions (with unit norm).
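The constrained LASSO above can be solved numerically in several ways. The following sketch uses projected gradient descent with a sort-based Euclidean projection onto the $\ell_1$-ball; this solver is an assumption chosen for brevity, not the method used in the talk, and the step size $1/L$ with $L = 2\|X\|_F^2$ is a conservative bound that assumes a non-zero data matrix.

```python
import math

def inner(x, w):
    return sum(a * b for a, b in zip(x, w))

def project_l1_ball(v, R):
    """Euclidean projection of v onto {w : ||w||_1 <= R} (sort-based)."""
    if sum(abs(t) for t in v) <= R:
        return list(v)
    u = sorted((abs(t) for t in v), reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumsum += ui
        t = (cumsum - R) / i
        if ui - t > 0:            # condition holds exactly for the first rho indices
            theta = t
    # Soft-threshold every entry by theta, keeping its sign.
    return [math.copysign(max(abs(t) - theta, 0.0), t) for t in v]

def lasso_constrained(X, y, R, steps=200):
    """Minimize sum_k (y_k - <x_k, w>)^2 subject to ||w||_1 <= R."""
    n, d = len(X), len(X[0])
    L = 2.0 * sum(v * v for row in X for v in row)   # upper bound on the gradient's Lipschitz constant
    omega = [0.0] * d
    for _ in range(steps):
        resid = [inner(X[k], omega) - y[k] for k in range(n)]
        grad = [2.0 * sum(resid[k] * X[k][j] for k in range(n)) for j in range(d)]
        omega = project_l1_ball([w - g / L for w, g in zip(omega, grad)], R)
    return omega

# Toy check: with X = identity the solution is the projection of y onto the ball.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 1.0]
omega = lasso_constrained(X, y, R=2.0)
```

In the toy check the unconstrained minimizer would be (3, 1); the constraint $\|\omega\|_1 \le 2$ pushes the smaller entry to zero, which illustrates how the $\ell_1$-constraint promotes sparsity.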
![Page 29: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/29.jpg)
Some Numerical Results
5-fold cross-validation for real-world pancreas data (156 samples):
1 Learn a feature vector ω by SPA, using 80% of the samples.
2 Classify the remaining 20% of the samples by an ordinary SVM, after projecting onto supp(ω).
3 Iterate this procedure 12 times for random partitions.
[Figure: Classification accuracy for different sparsity levels s = # supp(ω).]
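The evaluation protocol can be sketched as a generator of train/test index splits plus the projection onto supp(ω). The slide does not say whether each of the 12 repetitions runs all five folds or a single 80/20 split; the sketch assumes the former. The function names are hypothetical, and the SPA learner and SVM classifier are deliberately left out.

```python
import random

def five_fold_splits(n_samples, repetitions=12, folds=5, seed=0):
    """Yield (train, test) index lists: `repetitions` random partitions,
    each split into `folds` parts that serve once as the ~20% test set."""
    rng = random.Random(seed)
    for _ in range(repetitions):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        for f in range(folds):
            test = idx[f::folds]                 # every folds-th shuffled index
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test

def support(omega):
    """Indices of the non-zero entries of omega, i.e. supp(omega)."""
    return [j for j, w in enumerate(omega) if w != 0.0]

def restrict(x, supp):
    """Project a spectrum onto the selected features."""
    return [x[j] for j in supp]

splits = list(five_fold_splits(156))             # 156 samples as on the slide
```

For each split, one would learn ω on the `train` indices, apply `restrict(x, support(omega))` to every spectrum, and then train and score the classifier on the reduced data.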
![Page 30: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/30.jpg)
But what about theoretical guarantees?
![Page 31: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/31.jpg)
3 Theoretical Foundation by High-dimensional Estimation Theory
![Page 32: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/32.jpg)
Toward a Theoretical Foundation of SPA
Linear Separation Model: Explains the observations/labels:
$y_k = \operatorname{sign}(\langle x_k, \omega_0 \rangle)$, $k = 1, \dots, n$
Forward Model: Explains the random distribution of the data:
$x_k = \sum_{m=1}^{M} s_{m,k} a_m + n_k$, $k = 1, \dots, n$
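$a_m$: Deterministic feature atom, sampled Gaussian peak ($\in \mathbb{R}^d$)
$s_{m,k}$: Random latent factor specifying the peak amplitude ($\in \mathbb{R}$)
$n_k$: Random baseline noise ($\in \mathbb{R}^d$)
[Figure: Gaussian peak $s_{m,k} \cdot \exp(-(\cdot - c_m)^2 / \beta_m^2)$ with amplitude $s_{m,k}$.]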
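A synthetic spectrum from this forward model can be generated as follows. This is a sketch under assumptions: the atoms are Gaussian peaks sampled on the integer grid 0, ..., d−1 with hypothetical centers and widths, and the amplitudes and noise are drawn as independent Gaussians, matching the distributional assumptions stated later in the talk.

```python
import math
import random

def gaussian_atom(d, center, width):
    """Sampled Gaussian peak a_m in R^d: exp(-(i - c)^2 / beta^2) for i = 0..d-1."""
    return [math.exp(-((i - center) ** 2) / width ** 2) for i in range(d)]

def sample_spectrum(atoms, sigma, rng):
    """Draw x = sum_m s_m * a_m + n with s_m ~ N(0, 1) and n ~ N(0, sigma^2 I_d).
    Returns the spectrum x together with the latent amplitudes s."""
    d = len(atoms[0])
    s = [rng.gauss(0.0, 1.0) for _ in atoms]
    noise = [rng.gauss(0.0, sigma) for _ in range(d)]
    x = [sum(s_m * a[i] for s_m, a in zip(s, atoms)) + noise[i] for i in range(d)]
    return x, s

# Two hypothetical peaks on a grid of d = 50 mass channels.
atoms = [gaussian_atom(50, center=10, width=2.0),
         gaussian_atom(50, center=30, width=3.0)]
x, s = sample_spectrum(atoms, sigma=0.0, rng=random.Random(1))
```

With `sigma=0.0` the spectrum is an exact superposition of the two peaks, so the value at each peak center is dominated by the corresponding amplitude; a positive `sigma` adds the baseline noise.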
Supposing that sufficiently many samples are given, can we learn the sparse fingerprint $\omega_0$?
Problem: The vector $\omega_0$ is not unique because some features are perfectly correlated ⇒ No hope for support recovery or approximation
Idea: Separate the fingerprint from its data representation!
![Page 40: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/40.jpg)
Combining the Models
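$x_k = \sum_{m=1}^{M} s_{m,k} a_m + n_k$, $k = 1, \dots, n$
Assumptions:
$s_k := (s_{1,k}, \dots, s_{M,k}) \sim \mathcal{N}(0, I_M)$ – peak amplitudes
$n_k \sim \mathcal{N}(0, \sigma^2 I_d)$ – noise vector
$a_1, \dots, a_M \in \mathbb{R}^d$ – arbitrary (peak) atoms, $D := \begin{bmatrix} a_1^\top \\ \vdots \\ a_M^\top \end{bmatrix} \in \mathbb{R}^{M \times d}$
Put this into the classification model:
$y_k = \operatorname{sign}(\langle x_k, \omega_0 \rangle) = \operatorname{sign}(\langle \sum_{m=1}^{M} s_{m,k} a_m + n_k, \omega_0 \rangle) = \operatorname{sign}(\langle D^\top s_k + n_k, \omega_0 \rangle)$
$= \operatorname{sign}(\langle s_k, \underbrace{D \omega_0}_{=: z_0} \rangle + \langle n_k, \omega_0 \rangle)$
$= \operatorname{sign}(\langle s_k, z_0 \rangle + \langle n_k, \omega_0 \rangle)$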
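The key step in this derivation is the identity $\langle D^\top s_k + n_k, \omega_0 \rangle = \langle s_k, D\omega_0 \rangle + \langle n_k, \omega_0 \rangle$. It can be checked numerically on a small example; all numbers below are made up for illustration (M = 2 atoms, d = 3 features).

```python
def inner(x, w):
    return sum(a * b for a, b in zip(x, w))

def matvec(D, w):
    """Compute D w, where the rows of D are the atoms a_m."""
    return [inner(row, w) for row in D]

def transpose_matvec(D, s):
    """Compute D^T s = sum_m s_m * a_m."""
    return [sum(D[m][j] * s[m] for m in range(len(D))) for j in range(len(D[0]))]

D = [[1, 2, 0],       # a_1
     [0, 1, 3]]       # a_2
s = [2, -1]           # latent peak amplitudes s_k
n = [1, 0, -1]        # baseline noise n_k
omega0 = [1, -1, 2]   # fingerprint in the coefficient space R^d

x = [a + b for a, b in zip(transpose_matvec(D, s), n)]   # x_k = D^T s_k + n_k
z0 = matvec(D, omega0)                                   # z_0 = D omega_0 in R^M

lhs = inner(x, omega0)
rhs = inner(s, z0) + inner(n, omega0)
```

Both sides evaluate to the same number, so the label $y_k$ depends on the data only through $\langle s_k, z_0 \rangle$ and the noise term.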
![Page 43: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/43.jpg)
Signal Space vs. Coefficient Space
$x_k = \sum_{m=1}^{M} s_{m,k} a_m + n_k = D^\top s_k + n_k$
Let us first assume that $n_k = 0$ (no baseline noise). Then
$y_k = \operatorname{sign}(\langle x_k, \omega_0 \rangle) = \operatorname{sign}(\langle s_k, z_0 \rangle)$,
where $z_0 = D \omega_0$.
$z_0$ has a (non-unique) representation in the dictionary D with sparse coefficients $\omega_0$.
$z_0$ “lives” in the signal space $\mathbb{R}^M$ (independent of specific data type).
$\omega_0$ “lives” in the coefficient space $\mathbb{R}^d$ (data dependent).
⇒ Try to show a recovery result for $z_0$!
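The non-uniqueness of the coefficients can be seen in a tiny example: if two features are perfectly correlated (here, two identical columns of D, a made-up dictionary), different vectors in the coefficient space map to the same point in the signal space.

```python
def matvec(D, w):
    """Compute z = D w."""
    return [sum(a * b for a, b in zip(row, w)) for row in D]

# Hypothetical dictionary with M = 2 atoms (rows) and d = 3 features (columns).
# Columns 0 and 1 are identical, i.e. features 0 and 1 are perfectly correlated.
D = [[1.0, 1.0, 0.0],
     [2.0, 2.0, 1.0]]

omega_a = [1.0, 0.0, 0.0]    # fingerprint supported on feature 0
omega_b = [0.0, 1.0, 0.0]    # fingerprint supported on feature 1
omega_c = [0.5, 0.5, 0.0]    # mixture of both

z_a, z_b, z_c = matvec(D, omega_a), matvec(D, omega_b), matvec(D, omega_c)
```

All three coefficient vectors yield the same $z_0$ and hence the same labels, so the support of $\omega_0$ cannot be identified from the data, whereas $z_0$ itself is well defined.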
![Page 48: Sparse Proteomics Analysis (SPA) - TU Berlin · Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universit](https://reader030.vdocuments.site/reader030/viewer/2022040808/5e4ca6cfc14e565f1103528d/html5/thumbnails/48.jpg)
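What Does This Mean for the LASSO?
$y_k = \operatorname{sign}(\langle x_k, \omega_0 \rangle) = \operatorname{sign}(\langle s_k, z_0 \rangle)$ with $z_0 = D\omega_0$
SPA via the LASSO
$$\min_{\omega \in \mathbb{R}^d} \sum_{k=1}^{n} (y_k - \langle x_k, \omega \rangle)^2 \quad \text{subject to } \|\omega\|_1 \le R$$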
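$$\underbrace{\min_{\omega \in R \cdot B_1^d} \sum_{k=1}^{n} \big(y_k - \underbrace{\langle x_k, \omega \rangle}_{= \langle s_k, z \rangle}\big)^2}_{\text{Solvable in practice!}} \;\overset{z := D\omega}{=}\; \underbrace{\min_{z \in R \cdot D B_1^d} \sum_{k=1}^{n} (y_k - \langle s_k, z \rangle)^2}_{\text{Solvable in theory!}}$$
Warning: The minimizers “live” in different spaces!
Warning: We know neither $D$ nor $s_k$, only their product $x_k = D^\top s_k$.
Idea: Apply results for the $K$-LASSO with $K = R \cdot D B_1^d$!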
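The "solvable in practice" side can be sketched in a few lines. This is an illustrative assumption, not the solver used in the talk: it runs projected gradient descent on the $\ell_1$-constrained least-squares problem, with a sort-based projection onto the $\ell_1$-ball. The simulated `D`, `S`, `omega0` and all sizes are hypothetical. On real data $D$ and $s_k$ are unknown, so the signal-space error at the end can only be computed in a simulation like this one.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {w : ||w||_1 <= radius}, radius > 0."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                 # magnitudes in descending order
    css = np.cumsum(u)
    ks = np.arange(1, v.size + 1)
    rho = np.nonzero(u * ks > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)    # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lasso_l1_constrained(X, y, radius, n_iter=500):
    """Projected gradient descent for min ||y - X w||_2^2 s.t. ||w||_1 <= radius."""
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)
        w = project_l1_ball(w - step * grad, radius)
    return w

# Simulated data (all values below are assumptions for illustration).
rng = np.random.default_rng(1)
M, d, n = 20, 200, 400
D = rng.normal(size=(M, d))
D /= np.linalg.norm(D, axis=0)            # normalized columns
omega0 = np.zeros(d)
omega0[[5, 50, 120]] = [1.0, -1.0, 0.5]
omega0 /= np.linalg.norm(D @ omega0)      # enforce ||z0||_2 = 1
z0 = D @ omega0
R = np.abs(omega0).sum()

S = rng.normal(size=(n, M))               # s_k ~ N(0, I_M)
X = S @ D                                 # only X and y are observed in practice
y = np.sign(X @ omega0)

omega_hat = lasso_l1_constrained(X, y, R)
z_hat = D @ omega_hat                     # switch to the signal space
err = np.linalg.norm(z_hat - np.sqrt(2 / np.pi) * z0)
print(f"l1 norm: {np.abs(omega_hat).sum():.3f} (R = {R:.3f}), signal-space error: {err:.3f}")
```

The solver only ever touches `X` and `y`. The dictionary enters solely when mapping `omega_hat` to `z_hat` for evaluation.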
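A Simplified Version of Roman Vershynin’s Result
Theorem (Plan, Vershynin ’15)
Suppose that $s_k \sim \mathcal{N}(0, I_M)$, $z_0 \in S^{M-1}$, and the observations follow
$$y_k = \operatorname{sign}(\langle s_k, z_0 \rangle), \quad k = 1, \dots, n.$$
Put $\mu = \sqrt{2/\pi}$ and assume that $\mu z_0 \in K$, where $K$ is convex, and
$$n \gtrsim w(K)^2.$$
Then, with high probability, the solution $\hat{z}$ of the $K$-LASSO satisfies
$$\|\hat{z} - \mu z_0\|_2 \lesssim \sqrt{\frac{w(K)}{\sqrt{n}}}.$$
The (global) mean width for bounded $K \subset \mathbb{R}^M$ is given by
$$w(K) = \mathbb{E} \sup_{u \in K} \langle g, u \rangle, \quad \text{where } g \sim \mathcal{N}(0, I_M).$$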
Assume that $K = \mu R \cdot D B_1^d$ ⇒ $z_0 = D\omega_0$ for some $\omega_0 \in R \cdot B_1^d$.
Assume that the columns of $D$ are normalized. Then
$$w(K) \lesssim R \cdot \sqrt{\log(d)}.$$
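The mean-width bound can be illustrated by simulation. The sketch below is an assumption-laden illustration, not part of the talk. It uses the fact that a linear functional over $R \cdot D B_1^d$ is maximized at a scaled column of $\pm D$, so $w(R \cdot D B_1^d) = R \cdot \mathbb{E} \max_j |\langle g, D_j \rangle|$, and compares a Monte Carlo estimate with the standard Gaussian maximal bound $R\sqrt{2\log(2d)}$. The random dictionaries and sizes are hypothetical.

```python
import numpy as np

def mean_width_DB1(D, R, n_draws=1000, seed=0):
    """Monte Carlo estimate of w(R * D B_1^d) = R * E max_j |<g, D[:, j]>|."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(n_draws, D.shape[0]))   # each row is a draw g ~ N(0, I_M)
    return R * np.abs(G @ D).max(axis=1).mean()

rng = np.random.default_rng(42)
M, R = 30, 2.0                                   # hypothetical signal dimension and radius
results = []
for d in (50, 500, 2000):
    D = rng.normal(size=(M, d))
    D /= np.linalg.norm(D, axis=0)               # normalized columns, as assumed on the slide
    est = mean_width_DB1(D, R)
    bound = R * np.sqrt(2 * np.log(2 * d))       # bound for the max of d standard Gaussians in absolute value
    results.append((d, est, bound))
    print(d, round(est, 2), round(bound, 2))
```

With normalized columns each $\langle g, D_j \rangle$ is standard Gaussian, which is why the bound depends on $d$ only through $\sqrt{\log(d)}$.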
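A Recovery Guarantee for SPA
Theorem (G. ’15)
Suppose that $s_k \sim \mathcal{N}(0, I_M)$. Let $z_0 \in S^{M-1}$ and assume that there exists $R > 0$ such that $z_0 = D\omega_0$ for some $\omega_0 \in R \cdot B_1^d$. The observations follow
$$y_k = \operatorname{sign}(\langle s_k, z_0 \rangle) = \operatorname{sign}(\langle x_k, \omega_0 \rangle), \quad k = 1, \dots, n,$$
and the number of samples satisfies
$$n \gtrsim R^2 \cdot \log(d).$$
Then, with high probability, the solution of the LASSO
$$\hat{z} = \operatorname*{argmin}_{z \in R \cdot D B_1^d} \sum_{k=1}^{n} (y_k - \langle s_k, z \rangle)^2$$
satisfies
$$\left\|\hat{z} - \sqrt{\tfrac{2}{\pi}}\, z_0\right\|_2 \lesssim \left(\frac{R^2 \cdot \log(d)}{n}\right)^{1/4}.$$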
Then, with high probability, the solution of the LASSO
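$$\hat{z} = D \cdot \hat{\omega} = D \cdot \operatorname*{argmin}_{\omega \in R \cdot B_1^d} \sum_{k=1}^{n} (y_k - \langle x_k, \omega \rangle)^2$$
satisfies
$$\left\|D\hat{\omega} - \sqrt{\tfrac{2}{\pi}}\, D\omega_0\right\|_2 = \left\|\hat{z} - \sqrt{\tfrac{2}{\pi}}\, z_0\right\|_2 \lesssim \left(\frac{R^2 \cdot \log(d)}{n}\right)^{1/4}.$$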
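To get a feeling for the rate, the right-hand side can be evaluated directly. The sketch below ignores the unspecified constant hidden in $\lesssim$, so the numbers are only meaningful relative to each other. The values of `s`, `d`, and the choice $R = \sqrt{s}$ are hypothetical assumptions.

```python
import math

def spa_error_rate(R, d, n):
    """Right-hand side (R^2 log(d) / n)^(1/4), up to the unspecified constant."""
    return (R ** 2 * math.log(d) / n) ** 0.25

def samples_for_rate(R, d, eps):
    """Smallest n with (R^2 log(d) / n)^(1/4) <= eps, again ignoring constants."""
    return math.ceil(R ** 2 * math.log(d) / eps ** 4)

s, d = 10, 40_000        # hypothetical: 10 relevant features among 40,000
R = math.sqrt(s)         # assumes the regime R ≈ sqrt(s)
for n in (100, 1_600, 25_600):
    print(n, round(spa_error_rate(R, d, n), 3))

print(samples_for_rate(R, d, eps=0.5))
```

Because of the exponent $1/4$, halving the bound requires 16 times as many samples, while the dependence on the dimension $d$ is only logarithmic.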
Practical Relevance for MS-Data?
Extensions:
- Baseline noise $n_k \sim \mathcal{N}(0, \sigma^2 I_d)$
- Non-trivial covariance matrix, i.e., $s_k \sim \mathcal{N}(0, \Sigma)$
- Adversarial bit-flips in the model $y_k = \operatorname{sign}(\langle x_k, \omega_0 \rangle)$
How to achieve normalized columns in $D$? How to guarantee that $R \approx \sqrt{s}$, i.e., $s$-sparse vectors are allowed?
→ Standardize the data (centering + normalizing)
Given $\hat{\omega}$, how to switch over to the signal space? ($D$ is unknown)
→ Identify $\operatorname{supp}(\hat{\omega})$ with peaks (manual approach)
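The two practical steps above might look as follows. This is a hedged sketch under assumptions, not the preprocessing used in the talk. `standardize` centers every feature and scales it to unit standard deviation, which under $s_k \sim \mathcal{N}(0, I_M)$ corresponds to rescaling the columns of $D$ to unit norm. `support` returns the indices of the non-negligible entries of a coefficient vector. The function names and the threshold `rel_tol` are hypothetical, and matching indices to peaks remains a manual step.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Center each feature (column) of X and scale it to unit standard deviation."""
    Xc = X - X.mean(axis=0)
    scale = Xc.std(axis=0)
    return Xc / np.maximum(scale, eps)   # eps guards against constant features

def support(omega_hat, rel_tol=1e-3):
    """Indices whose magnitude exceeds rel_tol times the largest magnitude."""
    mags = np.abs(omega_hat)
    if mags.max() == 0:
        return np.array([], dtype=int)
    return np.flatnonzero(mags > rel_tol * mags.max())

# Simulated data whose dictionary columns have very different norms (assumed).
rng = np.random.default_rng(3)
D = rng.normal(size=(5, 8)) * rng.uniform(0.1, 10.0, size=8)
X = rng.normal(size=(1000, 5)) @ D + 4.0   # the offset mimics a nonzero mean
Z = standardize(X)

print(np.round(X.std(axis=0), 2))   # widely spread before standardization
print(np.round(Z.std(axis=0), 2))   # all ones afterwards
print(support(np.array([0.0, 2.0, 1e-6, -1.0])))
```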
Message of this talk
An $s$-sparse disease fingerprint can be accurately recovered from only $O(s \log(d))$ samples!
THANK YOU FOR YOUR ATTENTION!
Further Reading
- M. Genzel. Sparse Proteomics Analysis: Toward a Mathematical Foundation of Feature Selection and Disease Classification. Master’s Thesis, 2015.
- Y. Plan, R. Vershynin. The generalized Lasso with non-linear observations. arXiv:1502.04071, 2015.
What to Do Next?
- Development of an abstract framework → What kind of properties should the dictionary $D$ have?
- Extension/generalization of the results → More complicated models and algorithms
- Numerical verification of the theory
- Other examples from real-world applications → Bio-informatics, neuro-imaging, astronomy, chemistry, ...
- Dictionary learning / Factor analysis → What can we learn about $D$?