![Page 1: Gene Selection Using Support Vector Machine Recursive ...cbs/projects/2004_presentation_huynh_john.pdfEvaluation of Gene Selection Using Support Vector Machine Recursive Feature Elimination](https://reader034.vdocuments.site/reader034/viewer/2022043018/5f3aafb65ee5f856d60233e8/html5/thumbnails/1.jpg)
Evaluation of Gene Selection Using Support Vector Machine Recursive Feature Elimination
John Huynh
E-mail: [email protected]
Committee:
• Dr. Rosemary Renaut
• Dr. Kenneth Hoober
• Dr. Bradford Kirkman-Liff
Advisor:
• Dr. Adrienne C. Scheck
Contents
• Introduction
• Data
• Support Vector Machine
• Feature Selection
• Hypothesis & Experimental design
• Result
• Conclusion
• Future work
• Experience
• Reference
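Terminology
• A sample is a data set consisting of:
  • Gene = feature = attribute = column
  • Example = data point = slide = array = row

| | x_1 | x_2 | … | x_j | … | x_n | Class |
|---|---|---|---|---|---|---|---|
| r_1 | x_11 | x_12 | … | x_1j | … | x_1n | c_1 |
| r_2 | x_21 | x_22 | … | x_2j | … | x_2n | c_2 |
| … | … | … | … | … | … | … | … |
| r_i | x_i1 | x_i2 | … | x_ij | … | x_in | c_i |
| … | … | … | … | … | … | … | … |
| r_m | x_m1 | x_m2 | … | x_mj | … | x_mn | c_m |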
Meningioma
• Dr. Adrienne C. Scheck’s Lab, BNI (Barrow Neurological Institute)
• Meningioma: 20% of primary intracranial tumors
• Mortality/Morbidity: In one series by Coke et al., the overall survival rates for all patients at 5 and 10 years were 87% and 58%, respectively.
• Figure: Medial sphenoid wing meningioma
Meningioma …
• Correlating clinical process, microarray, NMR, and FISH with WHO classification grades I, II, and III.
• Figure: Tuberculum sellae meningioma
Anatomy
• Meningioma is a tumor of the arachnoid.
Histology
• Neuron & Purkinje cell (cerebellum)
• Neuroglial cells
  • Astrocytes: nurture, support
    • Protoplasmic astrocytes (gray matter)
    • Fibrous astrocytes (white matter)
  • Oligodendrocytes: myelin, support
  • Microglia: immune system in brain
  • Ependymal cells: epithelium
• Blood vessels
Meningioma - Histopathology
• Meningioma: whorl-like structure + psammoma bodies
• WHO grade I: benign
• WHO grade II (atypical): “A meningioma with increased mitotic activity or three or more of the following features: increased cellularity, small cells with high nucleus:cytoplasm ratio, prominent nucleoli, uninterrupted patternless or sheet-like growth, and foci of spontaneous or geographic necrosis.”
• WHO grade III (anaplastic): “A meningioma exhibiting histological features of frank malignancy far in excess of the abnormalities present in atypical meningioma.”
BNI Meningioma Data
• Affymetrix HG-U133 Plus 2.0 with 54,675 genes.
• Small data set with many genes

| | Grade I | Grade II | Grade III | Total |
|---|---|---|---|---|
| Primary | 15 | 7 | 0 | 22 |
| Recurrence | 3 | 0 | 1 | 4 |
| Total | 18 | 7 | 1 | 26 |
BNI Meningioma Data …
• Plan A: consider data as large data set
• Plan B: consider data as small data set

| Grade | Train | Test | Total |
|---|---|---|---|
| I | 11 | 4 | 15 |
| II | 5 | 2 | 7 |
| Total | 16 | 6 | 22 |
BNI Meningioma Data …
High quality
Microarray
• Gene expression - Microarray
• Pattern of gene expressions for each tissue
• Oligo-microarray vs cDNA
  • High density
  • Fixed probe length (25)
  • In-situ synthesis
Microarray …
• Microarrays explore gene expression on a global scale.
• PM & MM (perfect match and mismatch probes)
Lymphoma Data
• Amersham cDNA microarray with 7129 genes
• Tissue = bone marrow, blood
• ALL: acute lymphocytic leukemia
• AML: acute myelogenous leukemia
• Incidence: peak 2-3 yrs old: 80/1,000,000; 2400 new/yr/USA, 31% of all cancers

| | Train | Test | Total |
|---|---|---|---|
| ALL | 27 | 20 | 47 |
| AML | 11 | 14 | 25 |
| Total | 38 | 34 | 72 |
Lymphoma Data …
• Good quality
• Large sample size, smaller feature dimension
Inducer Problem
• The purpose of a learning machine is to find the most accurate classifier by learning on the training set and testing on the testing set.
• Mathematically, this is the problem of minimizing an error function E.
• Let f be the learning algorithm, with data points $X = \{x_1, x_2, \dots, x_i, \dots, x_m\}$ in $\mathbb{R}^n$ and targets $\{y_1, y_2, \dots, y_i, \dots, y_m\}$ in $Y = \{-1, +1\}$:
  $f: X \to Y,\quad x_i \mapsto f(x_i),\quad E = \sum_{i=1}^{m} (y_i - f(x_i))^2$
• Testing set requirement: the testing set must never be seen during the training process; otherwise the accuracy measured in the testing phase is unexpectedly high.
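As a small illustration of the error function above, the sketch below evaluates the squared error, summed over the data points, for labels in {-1, +1}. The function names and the toy labels are illustrative assumptions, not taken from the slides.

```python
def squared_error(targets, predictions):
    """Sum over all data points of (y_i - f(x_i))**2."""
    return sum((y - p) ** 2 for y, p in zip(targets, predictions))


def error_rate(targets, predictions):
    """Fraction of data points whose predicted label differs from the target."""
    wrong = sum(1 for y, p in zip(targets, predictions) if y != p)
    return wrong / len(targets)


# Toy labels (made up): the second prediction is the only mistake.
y_true = [+1, -1, +1, -1]
y_pred = [+1, +1, +1, -1]

# With labels in {-1, +1}, each mistake contributes (+-2)**2 = 4 to E.
E = squared_error(y_true, y_pred)
rate = error_rate(y_true, y_pred)
```

With labels restricted to -1 and +1, E is simply four times the number of misclassified points, so minimizing E and minimizing the error rate agree.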
Support Vector Machine
• Map data into the feature space
• Learn in the feature space
• Return the result to the output space
• Learning function: $f(x_i) = x_i \cdot w + b$
  • $f(x_i) > 0$ for $y_i = +1$
  • $f(x_i) < 0$ for $y_i = -1$
  • $f(x_i) = 0$ on the decision boundary
• Figure: Input space → Feature space → Output space
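The learning function on this slide can be sketched in a few lines. The weights, bias, and points below are made-up values for illustration only; sending points that lie exactly on the boundary to -1 is an arbitrary convention of this sketch.

```python
def decision_value(x, w, b):
    """Linear learning function f(x) = x . w + b."""
    return sum(xj * wj for xj, wj in zip(x, w)) + b


def classify(x, w, b):
    """Return +1 when f(x) > 0, otherwise -1 (boundary points go to -1 here)."""
    return 1 if decision_value(x, w, b) > 0 else -1


# Hypothetical weight vector and bias for a 2-feature problem.
w = [1.0, -1.0]
b = 0.5

positive_point = [2.0, 1.0]   # f = 2 - 1 + 0.5 = 1.5
negative_point = [0.0, 2.0]   # f = 0 - 2 + 0.5 = -1.5
```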
SVM Characteristics
• Maximum margin
• Low computational cost ($n_{sv}$ = number of support vectors):
  • Kernel function costs $O(n)$.
  • Training cost: the worst case costs $O(n_{sv}^3 + n_{sv}^2 m + n_{sv} m n)$; the best case costs $O(n m^2)$.
  • Testing cost: $O(n)$.
Linear SVM - Separable Case
• No kernel = scalar dot product
• Margin = $2/\lVert w \rVert$, so maximizing the margin means minimizing $\lVert w \rVert^2$
• Constraints: $(x_i \cdot w + b)\, y_i > 0$
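For reference, the separable case is usually written as the optimization problem below, following the standard textbook presentation (for example Burges98). It assumes the canonical scaling in which the closest points satisfy $y_i (x_i \cdot w + b) = 1$; under that scaling the margin equals $2/\lVert w \rVert$. The slide itself states only the weaker separation condition, so treat this as the conventional form rather than a transcription of the slide.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(x_i \cdot w + b) \ge 1, \qquad i = 1, \dots, m .
```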
Linear SVM - Non-Separable Case
• Introduce slack variables $\xi_i$ to relax the margin constraints, and thus the choice of support vectors, when needed.
• This amounts to adding an upper bound C on the Lagrange multipliers.
• C = 100 in our experiment.
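In the standard soft-margin formulation (again following the textbook presentation, not a formula shown on the slide), the slack variables enter the primal problem and C appears in the dual as a box constraint on the Lagrange multipliers $\alpha_i$:

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\,\lVert w \rVert^{2} + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to} \quad
y_i\,(x_i \cdot w + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0,

\text{with dual constraints} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{m} \alpha_i\, y_i = 0 .
```

A large C, such as the value of 100 used in the experiment, penalizes margin violations heavily and pushes the solution toward the separable case.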
Non-Linear SVM
There is no linear decision boundary in the input space
Non-Linear Support Vector Machine
• Introduce a kernel function to map the data into a high-dimensional Euclidean space, where the kernel plays the role of the dot product.
Non-Linear Support Vector Machine
• Now the data and the weight vector are in the high-dimensional feature space.
• The training and testing processes take place in the high-dimensional space.
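A small numeric check can make the kernel idea concrete. The sketch below uses the degree-2 polynomial kernel on 2-dimensional inputs, chosen here purely for illustration (the slides do not say which kernel was used), and compares it with an explicit dot product in the 3-dimensional feature space.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x . z)**2, computed in input space."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit feature map for poly2_kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, 4.0)
k_implicit = poly2_kernel(x, z)       # (1*3 + 2*4)**2 = 121
k_explicit = dot(phi(x), phi(z))      # same value, up to rounding
```

The kernel gives the feature-space dot product without ever building `phi(x)`, which is what keeps the cost manageable when the feature space is very large.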
Problem of Microarray Data
• Instance space: $F_1 \times F_2 \times \dots \times F_i \times \dots \times F_n$
• The training set must be a large enough subset of the instance space.
• Over-fitting problem of a small data set: the inducer performs well on the training set but poorly on the test set.
• The computational cost of high-dimensional data is very high (n = 54675).
• Multiple testing correction: FDR, SAM
• Classical analysis methods are not suitable.
Feature Selection
• The benefits of feature selection are reductions in:
  • Computational cost
  • Over-fitting
• Feature selection is in effect a search through the space of feature subsets for the optimal feature subset.
• Given an inducer I and a data set D with features $X_1, \dots, X_i, \dots, X_n$, drawn from a distribution $\mathcal{D}$ over the labeled instance space, an optimal feature subset $X_{opt}$ is a subset of the features such that the accuracy of the induced classifier C = I(D) is maximal (Kohavi97).
Feature Selection: How?
• Filter method vs wrapper method
• Feature ranking criteria
  • Correlation coefficient
  • Weight
Recursive Feature Elimination
• RFE is a top-down (backward) wrapper that uses the weight as the feature ranking criterion.
• Eliminate:
  • One feature in every loop: slow
  • A subset in every loop: fast
• Do both yield the same optimal subsets? Are the feature ranking criteria the same?
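The elimination loop can be sketched as follows. This is a minimal illustration, not the Matlab code used in the study: the inducer is a plain perceptron standing in for the SVM (an assumption made to keep the example dependency-free), the ranking criterion is the squared weight, and `step` controls how many features are dropped per loop. The function names and the toy data are hypothetical.

```python
def train_perceptron(X, y, epochs=20):
    """Hypothetical stand-in for the SVM inducer; returns (w, b) of a linear classifier."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:  # misclassified or exactly on the boundary
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
    return w, b


def rfe(X, y, n_survive, step=1, train=train_perceptron):
    """Backward elimination: retrain, rank by w_j**2, drop the `step` lowest-ranked features."""
    surviving = list(range(len(X[0])))
    eliminated = []
    while len(surviving) > n_survive:
        X_sub = [[row[j] for j in surviving] for row in X]
        w, _ = train(X_sub, y)
        # Positions within `surviving`, from smallest to largest squared weight.
        order = sorted(range(len(surviving)), key=lambda k: w[k] ** 2)
        n_drop = min(step, len(surviving) - n_survive)
        dropped = [surviving[k] for k in order[:n_drop]]
        eliminated.extend(dropped)
        surviving = [j for j in surviving if j not in dropped]
    return surviving, eliminated


# Toy data (made up): feature 0 separates the classes, feature 2 is constant.
X = [[2.0, 0.5, 0.0], [1.5, -0.5, 0.0], [-2.0, 0.5, 0.0], [-1.5, -0.5, 0.0]]
y = [1, 1, -1, -1]

one_at_a_time = rfe(X, y, n_survive=1, step=1)
two_at_a_time = rfe(X, y, n_survive=1, step=2)
```

On this toy data both elimination rates keep feature 0, but that is a property of the example; whether the rates agree on real data is the question the experiment addresses.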
Feature Selection Meaning
• Create nested subsets
• Define:
  • Rate of elimination
  • Surviving subset
• Note that the feature selection module includes an inducer, so the test set must never be seen during training in either:
  • the feature selection module
  • the evaluation module
  (Kohavi97)
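The definitions on this slide did not survive extraction, so the sketch below rests on an assumption: the rate of elimination is taken to be the fraction of the remaining features removed in each loop, and the surviving subset is the set left when the loop stops. Under that reading, the sizes of the nested subsets follow directly. The function name and parameters are hypothetical.

```python
def subset_sizes(n_features, rate, n_survive):
    """Sizes of the nested subsets when a fraction `rate` of the remaining
    features is removed per loop (at least one), down to `n_survive` features."""
    sizes = [n_features]
    while sizes[-1] > n_survive:
        remaining = sizes[-1]
        n_drop = max(1, int(remaining * rate))
        sizes.append(max(n_survive, remaining - n_drop))
    return sizes


halving = subset_sizes(16, 0.5, 1)            # remove half of the features per loop
meningioma = subset_sizes(54675, 0.5, 4)      # array size and subset size from the slides
```

Each subset is contained in the previous one, which is what makes the subsets nested; a smaller rate gives more loops and a finer-grained sequence.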
Full Two-Factor Factorial Experiment Design
• The evaluation measure is accuracy.
• The evaluation methods: independent test and cross-validation.
• The inducer is SVM for both feature selection and evaluation (Guyon02).
• Factor A (row) is the rate of elimination.
• Factor B (column) is the surviving subset.
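The cross-validation side of this design can be sketched as a leave-one-out loop in which feature selection is repeated inside every fold, so the held-out example is never seen by the selector. Everything below is an illustrative assumption: the selector is a simple class-mean-gap filter and the classifier is nearest class mean, both standing in for the SVM-based modules described in the slides.

```python
def loocv_accuracy(X, y, select_features, fit, predict):
    """Leave-one-out CV; selection and training only see the training part of each fold."""
    correct = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        features = select_features(X_train, y_train)   # row i is never seen here
        model = fit([[row[j] for j in features] for row in X_train], y_train)
        x_test = [X[i][j] for j in features]
        correct += int(predict(model, x_test) == y[i])
    return correct / len(X)


def class_means(X, y):
    """Stand-in 'training': per-class mean vector, keyed by label."""
    means = {}
    for label in set(y):
        rows = [row for row, yi in zip(X, y) if yi == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return means


def top1_by_mean_gap(X, y):
    """Stand-in selector: keep the single feature with the largest class-mean gap."""
    m = class_means(X, y)
    gaps = [abs(a - b) for a, b in zip(m[1], m[-1])]
    return [max(range(len(gaps)), key=gaps.__getitem__)]


def predict_nearest_mean(model, x):
    """Predict the label whose class mean is closest to x."""
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(model[lab], x)))


# Toy data (made up): feature 0 carries the class signal.
X = [[2.0, 0.5], [1.5, -0.5], [-2.0, 0.5], [-1.5, -0.5]]
y = [1, 1, -1, -1]
accuracy = loocv_accuracy(X, y, top1_by_mean_gap, class_means, predict_nearest_mean)
```

Running the selector once on all the data and then cross-validating would leak the held-out examples into the selection step and inflate the estimate.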
Software Design
• Preprocessing data: linear normalization + log2 transformation (prep.java)
• SVM, feature selection and evaluation: Matlab 6.5 (R13)
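The contents of prep.java are not shown, so the following is only a guess at what such a preprocessing step might look like. It assumes that "linear normalization" means scaling each array by a constant so that all arrays share the same mean intensity, followed by a log2 transform. The `target_mean` and `floor` parameters are hypothetical.

```python
import math


def preprocess(arrays, target_mean=1000.0, floor=1.0):
    """Scale each array to a common mean intensity, then take log2.

    `floor` (an assumption) clamps very small values so that log2 is defined.
    """
    processed = []
    for intensities in arrays:
        scale = target_mean / (sum(intensities) / len(intensities))
        processed.append([math.log2(max(v * scale, floor)) for v in intensities])
    return processed


# Two made-up arrays with the same expression pattern but different brightness.
raw = [[500.0, 1500.0], [1000.0, 3000.0]]
normalized = preprocess(raw)
```

After scaling, the two toy arrays become identical, which is the point of the normalization: overall brightness differences between arrays are removed before the values are compared.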
Result: Lymphoma
The optimal subset is 32 genes.
Result: Lymphoma Box Plots
[Figure: box plots]
Result: Lymphoma ANOVA
[Figure: ANOVA tables for Tsuc and Vsuc]
Result: Meningioma
The optimal subset is 4 genes.
Result: Meningioma Box Plots
• Small plan: 4
• Large plan: 2
[Figure: box plots for the large and small plans]
Result: Meningioma ANOVA
Correct choice is 4.
| Index | Probe |
|---|---|
| 37881 | 238018_at |
| 22501 | 222608_s_at |
| 50198 | 1564431_a_at |
| 16979 | 21552_x_at |
Conclusion
No interaction between the rate of elimination and the optimal feature subset
Small data set: rely on cross-validation
Future Works
• More published data sets: large + small, difficult + easy
• How small is small?
• Evaluation method for small data sets: master gene lists + LOOCV
• Over-fitting and cross-validation
Experience
• Not every data mining task will succeed.
• Business focus: communication, learning, negotiation, teamwork, leadership, …
• Understand and live with the data: a high-dimensional small data set
• Never alter the data in the preprocessing step (time cost)
• Experimental design: good planning
• Observation + Think + Reaction = Strategy Loop; deal with the facts, not with the people.
• Repeatable: document experiment results and analysis
• Welcome new ideas: good + bad; read, read, and read
• “Never seen” rule of test data, evaluation algorithm, over-fitting
• Feature selection – SVM – Software
References
• (Blum) Avrim L. Blum and Pat Langley, Selection of Relevant Features and Examples in Machine Learning, http://citeseer.ist.psu.edu/blum97selection.html.
• (Burges98) Christopher J.C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, (1998), Web-print: http://citeseer.ist.psu.edu/397919.html.
• (Golub99) Golub et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science 286 (1999), 531-7, http://www.broad.mit.edu/mpr/publications/projects/Leukemia/Golub et al 1999.pdf.
• (Guyon02) Isabelle Guyon et al., Gene Selection for Cancer Classification using Support Vector Machines, Machine Learning 46 (2002), no. 1-3, 389-422, Web-print: http://citeseer.ist.psu.edu/guyon00gene.html.
• (Gunn98) Steve R. Gunn, Support Vector Machines for Classification and Regression, (1998), http://www.ecs.soton.ac.uk/~srg/publications/pdf/SVM.pdf.
• (Kohavi97) Ron Kohavi and George H. John, Wrappers for Feature Subset Selection, Artificial Intelligence 97 (1997), 273-324.
• (Soroin03) Sorin Drăghici, Data Analysis Tools for DNA Microarrays, Chapman and Hall/CRC, 2003.
• WHO Classification, http://neurosurgery.mgh.harvard.edu/newwhobt.htm
Question?