bioinformatics applications and feature selection for svms
TRANSCRIPT
![Page 1: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/1.jpg)
Class 23, 2001
CBCl/AI MIT
Bioinformatics Applications and Feature Selection for SVMs
S. Mukherjee
![Page 2: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/2.jpg)
Class 23, 2001
CBCl/AI MIT
Outline
I. Basic Molecular biologyII. Some Bioinformatics problemsIII. Microarray technology
a. Purposeb. cDNA and Oligonucleotide arraysc. Yeast experiment
IV. Cancer classification using SVMsV. Rejects and Confidence of classificationVI. Feature Selection for SVMs
a. Leave-one-out boundsb. The algorithm
VII Results on several datasets
![Page 3: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/3.jpg)
Class 23, 2001
CBCl/AI MIT
What is Bioinformatics
Pre 1995Application of computing technology to providing statistical anddatabase solutions to problems in molecular biology.
Post 1995Defining and addressing problems in molecular biology using methodologies from statistics and computer science.
The genome project, genome wide analysis/screening of disease,genetic regulatory networks, analysis of expression data.
![Page 4: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/4.jpg)
Class 23, 2001
CBCl/AI MITSome Basic Molecular Biology
CGAACAAACCTCGAACCTGCTDNA:
mRNA:
Polypeptide:
Translation
Transcription
GCU UGU UUA CGA
Ala Cys Leu Arg
![Page 5: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/5.jpg)
Class 23, 2001
CBCl/AI MITExamples of Problems
Gene sequence problems: Given a DNA sequence state which sectionsare coding or noncoding regions. Which sections are promoters etc...
Protein Structure problems: Given a DNA or amino acid sequence state what structure the resulting protein takes.
Gene expression problems: Given DNA/gene microarray expression data infer either clinical or biological class labels or genetic machinery that gives rise to the expression data.
Protein expression problems: Study expression of proteins and their function.
![Page 6: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/6.jpg)
Class 23, 2001
CBCl/AI MITMicroarray Technology
Basic idea:
The state of the cell is determined by proteins. A gene codes for a protein which is assembled via mRNA.Measuring amount particular mRNA gives measure ofamount of corresponding protein.Copies of mRNA is expression of a gene.
Microarray technology allows us to measure the expressionof thousands of genes at once.
Measure the expression of thousands of genesunder different experimental conditions and ask what isdifferent and why.
![Page 7: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/7.jpg)
Class 23, 2001
CBCl/AI MITOligo vs cDNA arrays
Lockhart and Winzler 2000
![Page 8: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/8.jpg)
Class 23, 2001
CBCl/AI MITA DNA Microarray Experiment
![Page 9: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/9.jpg)
Class 23, 2001
CBCl/AI MITCancer Classification
38 examples of Myeloid and Lymphoblastic leukemias Affymetrix human 6800, (7128 genes including control genes)
34 examples to test classifier
Results: 33/34 correct
d perpendicular distancefrom hyperplane d
Test data
![Page 10: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/10.jpg)
Class 23, 2001
CBCl/AI MITGene expression and Coregulation
![Page 11: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/11.jpg)
Class 23, 2001
CBCl/AI MITNonlinear classifier
![Page 12: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/12.jpg)
Class 23, 2001
CBCl/AI MITNonlinear SVM
Nonlinear SVM does not help when using all genes but does help whenremoving top genes, ranked by Signal to Noise (Golub et al).
![Page 13: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/13.jpg)
Class 23, 2001
CBCl/AI MITRejections
Golub et al classified 29 test points correctly, rejected 5 of which 2 were errors using 50 genes
Need to introduce concept of rejects to SVM
g1
g2
Normal
Cancer
Reject
![Page 14: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/14.jpg)
Class 23, 2001
CBCl/AI MITRejections
![Page 15: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/15.jpg)
Class 23, 2001
CBCl/AI MITEstimating a CDF
![Page 16: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/16.jpg)
Class 23, 2001
CBCl/AI MITThe Regularized Solution
![Page 17: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/17.jpg)
Class 23, 2001
CBCl/AI MITRejections for SVM
95% confidence or p = .05 d = .107
P(c=1 | d)
.95
1/d
![Page 18: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/18.jpg)
Class 23, 2001
CBCl/AI MITResults with rejections
Results: 31 correct, 3 rejected of which 1 is an error
d
Test data
![Page 19: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/19.jpg)
Class 23, 2001
CBCl/AI MITWhy Feature Selection
• SVMs as stated use all genes/features
• Molecular biologists/oncologists seem to be conviced that only a small subset of genes are responsible for particular biological properties, so they want which genes are are most important in discriminating
• Practical reasons, a clinical device with thousands of genes is not financially practical
•Possible performance improvement
![Page 20: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/20.jpg)
Class 23, 2001
CBCl/AI MITResults with Gene Selection
AML vs ALL: 40 genes 34/34 correct, 0 rejects.5 genes 31/31 correct, 3 rejects of which 1 is an error.
d
Test data
d
Test data
B vs T cells for AML: 10 genes 33/33 correct, 0 rejects.
![Page 21: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/21.jpg)
Class 23, 2001
CBCl/AI MITLeave-one-out Procedure
![Page 22: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/22.jpg)
Class 23, 2001
CBCl/AI MITThe Basic Idea
Use leave-one-out (LOO) bounds for SVMs as a criterion to select features by searching over all possible subsets of n features for the ones that minimizes the bound.
When such a search is impossible because of combinatorial explosion, scale each feature by a real value variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables.
The rescaling can be done in the input space or in a “Principal Components” space.
![Page 23: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/23.jpg)
Class 23, 2001
CBCl/AI MITPictorial Demonstration
Rescale features to minimize the LOO bound R2/M2
x2
R2/M2 =1
M = R
x2
x1
R2/M2 >1
M
R
![Page 24: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/24.jpg)
Class 23, 2001
CBCl/AI MITSVM Functional
To the SVM classifier we add an extra scaling parameters for feature selection:
where the parameters α, b are computed by maximizing the the following functional, which is equivalent to maximizing the margin:
![Page 25: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/25.jpg)
Class 23, 2001
CBCl/AI MITRadius Margin Bound
![Page 26: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/26.jpg)
Class 23, 2001
CBCl/AI MITJaakkola-Haussler Bound
![Page 27: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/27.jpg)
Class 23, 2001
CBCl/AI MITSpan Bound
![Page 28: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/28.jpg)
Class 23, 2001
CBCl/AI MITThe Algorithm
![Page 29: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/29.jpg)
Class 23, 2001
CBCl/AI MITComputing Gradients
![Page 30: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/30.jpg)
Class 23, 2001
CBCl/AI MITToy Data
Linear problem with 6 relevant dimensions of 202
Nonlinear problem with 2 relevant dimensions of 52
![Page 31: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/31.jpg)
Class 23, 2001
CBCl/AI MITFace Detection
On the CMU testset consisting of 479 faces and 57,000,000 non-faces we compare ROC curves obtained for different number of selected features.We see that using more than 60 features does not help.
![Page 32: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/32.jpg)
Class 23, 2001
CBCl/AI MITMolecular Classification of Cancer
Dataset TotalSamples
Class 0 Class 1
LeukemiaMorphology (train)
38 27ALL
11AML
LeukemiaMorpholgy (test)
34 20ALL
14AML
Leukemia Lineage(ALL)
23 15B-Cell
8T-Cell
Lymphoma Outcome(AML)
15 8Low risk
7High risk
Dataset TotalSamples
Class 0 Class 1
LymphomaMorphology
77 19FSC
58DLCL
LymphomaOutcome
58 20Low risk
14High risk
Brain Morphology 41 14Glioma
27MD
Brain Outcome 50 38Low risk
12High risk
![Page 33: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/33.jpg)
Class 23, 2001
CBCl/AI MITMorphology Classification
Dataset Algorithm TotalSamples
Totalerrors
Class 1errors
Class 0errors
NumberGenes
SVM 35 0/35 0/21 0/14 40
WV 35 2/35 1/21 1/14 50
LeukemiaMorphology (trest)AML vs ALL
k-NN 35 3/35 1/21 2/14 10
SVM 23 0/23 0/15 0/8 10
WV 23 0/23 0/15 0/8 9
Leukemia Lineage(ALL)B vs T
k-NN 23 0/23 0/15 0/8 10
SVM 77 4/77 2/32 2/35 200
WV 77 6/77 1/32 5/35 30
LymphomaFS vs DLCL
k-NN 77 3/77 1/32 2/35 250
SVM 41 1/41 1/27 0/14 100WV 41 1/41 1/27 0/14 3
BrainMD vs Glioma
k-NN 41 0/41 0/27 0/14 5
![Page 34: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/34.jpg)
Class 23, 2001
CBCl/AI MITOutcome Classification
Dataset Algorithm TotalSamples
Totalerrors
Class 1errors
Class 0errors
NumberGenes
SVM 58 13/58 3/32 10/26 100
WV 58 15/58 5/32 10/26 12
Lymphoma
LBC treatmentoutcome
k-NN 58 15/58 8/32 7/26 15
SVM 50 7/50 6/12 1/38 50
WV 50 13/50 6/12 7/38 6
Brain
MD treatmentoutcome
k-NN 50 10/50 6/12 4/38 5
![Page 35: Bioinformatics Applications and Feature Selection for SVMs](https://reader030.vdocuments.site/reader030/viewer/2022012713/61ac0bc684f59e2b2646a307/html5/thumbnails/35.jpg)
Class 23, 2001
CBCl/AI MITOutcome Classification
Error rates ignore temporal information such as when a patient dies. Survivalanalysis takes temporal information into account. The Kaplan-Meier survivalplots and statistics for the above predictions show significance.
0 20 40 60 80 100 120
0.0
0.2
0.4
0.6
0.8
1.0
p-val = 0.0015
0 50 100 150
0.0
0.2
0.4
0.6
0.8
1.0
p-val = 0.00039
Lymphoma Medulloblastoma