COT6930 Course Project
Outline
• Gene Selection
• Sequence Alignment
Why Gene Selection
• Identify marker genes that characterize different tumor statuses.
• Many genes are redundant and introduce noise that lowers performance.
• Can eventually lead to a diagnostic chip (“breast cancer chip”, “liver cancer chip”).
Gene Selection
• Methods fall into three categories:
– Filter methods
– Wrapper methods
– Embedded methods
Filter methods are the simplest and the most frequently used in the literature.
Wrapper methods are likely the most accurate ones.
Filter Method
• Features (genes) are scored according to the evidence of their predictive power and then ranked.
• The top s genes with the highest scores are selected and used by the classifier.
– Scores: t-statistics, F-statistics, signal-to-noise ratio, …
– The number of features selected, s, is then determined by cross-validation.
• Advantage: fast and easy to interpret.
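The score-then-rank procedure above can be sketched as follows. This is a minimal illustration, not the course's implementation: it assumes a two-class problem, uses the Welch t-statistic as the score, and fixes s by hand rather than by cross-validation. The function names and the toy expression values are made up for the example.

```python
from math import sqrt
from statistics import mean, variance

def t_score(values_a, values_b):
    """Welch two-sample t-statistic for one gene across two classes."""
    se = sqrt(variance(values_a) / len(values_a) + variance(values_b) / len(values_b))
    return (mean(values_a) - mean(values_b)) / se if se > 0 else 0.0

def rank_genes(X, y, s):
    """Return the indices of the s genes with the largest |t|.

    X: list of samples, each a list of expression values (one per gene).
    y: list of 0/1 class labels, one per sample.
    """
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append(abs(t_score(a, b)))
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return ranked[:s]

# Toy data (made up): gene 0 separates the classes, genes 1 and 2 do not.
X = [[1.0, 5.0, 0.2], [1.2, 4.0, 0.9], [0.9, 6.0, 0.4],
     [3.0, 5.5, 0.3], [3.1, 4.5, 0.8], [2.9, 5.0, 0.5]]
y = [0, 0, 0, 1, 1, 1]
top = rank_genes(X, y, s=1)
```

Each gene is scored on its own, which is what makes the method fast and also what causes the problems listed on the next slide.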
Good versus bad features
Filter Method: Problem
• Genes are considered independently.
– Redundant genes may be included.
– Genes that have strong discriminant power jointly but are weak individually will be ignored.
• Good single features do not necessarily form a good feature set.
• The filtering procedure is independent of the classification method.
– The selected features can be applied to all types of classification methods.
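A small made-up example shows how individually weak genes can be jointly strong. It assumes an XOR-style relationship (the label is 1 exactly when two binary genes differ) and uses the difference of class means as a stand-in filter score. Both the data and the helper name are hypothetical.

```python
def class_mean_gap(X, y, g):
    """Absolute difference between the class means of gene g (a simple filter score)."""
    a = [row[g] for row, label in zip(X, y) if label == 0]
    b = [row[g] for row, label in zip(X, y) if label == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))

# Made-up XOR-style data: the label is 1 exactly when the two genes differ.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 2
y = [0, 1, 1, 0] * 2

# Scored one at a time, neither gene shows any class difference.
single_scores = [class_mean_gap(X, y, g) for g in range(2)]

# Used together, a trivial rule recovers every label.
joint_accuracy = sum((row[0] != row[1]) == bool(label) for row, label in zip(X, y)) / len(y)
```

A filter that ranks genes by this score would discard both genes, even though the pair predicts the label perfectly.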
Wrapper Method
• Iterative search: many “feature subsets” are scored based on classification performance, and the best one is used.
– Select a good subset of features.
• Subset selection: forward selection, backward selection, and their combinations.
– Exhaustive search is infeasible.
– Greedy algorithms are used instead.
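Greedy forward selection can be sketched as below. The classifier (nearest centroid) and the performance measure (leave-one-out accuracy) are assumptions chosen to keep the example short; the slides do not name either. All function names and data are illustrative.

```python
def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on the given features."""
    correct = 0
    for i in range(len(X)):
        # Build per-class sums from every sample except i.
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += row[f]
            counts[label] = counts.get(label, 0) + 1
        test = [X[i][f] for f in features]

        def dist(label):
            # Squared distance from the held-out sample to the class centroid.
            return sum((t - s / counts[label]) ** 2 for t, s in zip(test, sums[label]))

        predicted = min(sorted(sums), key=dist)
        correct += predicted == y[i]
    return correct / len(X)

def forward_selection(X, y, max_features):
    """Greedily add the feature that most improves leave-one-out accuracy."""
    selected, best_score = [], 0.0
    remaining = set(range(len(X[0])))
    while remaining and len(selected) < max_features:
        # Ties are broken in favour of the lowest feature index.
        score, neg_f = max((loo_accuracy(X, y, selected + [f]), -f) for f in remaining)
        if score <= best_score:
            break  # no candidate improves the current subset
        selected.append(-neg_f)
        remaining.remove(-neg_f)
        best_score = score
    return selected, best_score

# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[1.0, 0.3], [1.1, 0.7], [0.9, 0.5], [3.0, 0.4], [3.2, 0.6], [2.9, 0.5]]
y = [0, 0, 0, 1, 1, 1]
selected, score = forward_selection(X, y, max_features=2)
```

Note that the classifier is rebuilt and evaluated for every candidate subset, which is the source of the cost discussed on the next slide.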
Wrapper Method: Problem
• Computationally expensive
– For each feature subset considered, the classifier is built and evaluated.
• Exhaustive search is infeasible.
– Greedy search only.
• Easy to overfit.
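A quick count illustrates why exhaustive search is out of reach. The gene count and target subset size below are assumed values for illustration only.

```python
from math import comb

n_genes = 10_000      # assumed gene count, for illustration only
subset_size = 10      # assumed target subset size

# Exhaustive search over all non-empty subsets of genes.
all_subsets = 2 ** n_genes - 1

# Exhaustive search restricted to subsets of exactly `subset_size` genes.
fixed_size_subsets = comb(n_genes, subset_size)

# Greedy forward selection: each round tries every remaining gene once.
greedy_evaluations = sum(n_genes - k for k in range(subset_size))

print(f"all subsets: a {len(str(all_subsets))}-digit number")
print(f"subsets of size {subset_size}: about {fixed_size_subsets:.2e}")
print(f"greedy evaluations: {greedy_evaluations}")
```

Even the greedy count is the number of classifiers that must be trained and evaluated, so the wrapper approach stays expensive.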
Embedded Method
• Attempt to train a classifier and select a feature subset jointly (simultaneously).
• Often optimize an objective function that rewards classification accuracy and penalizes the use of more features.
• Intuitively appealing
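One common way to write such an objective is a classification loss plus a sparsity penalty. This is a generic illustration, not a formula taken from the slides, and every symbol in it is an assumption: $\ell$ is a classification loss, $f_w$ is a classifier with one weight $w_j$ per gene, $n$ is the number of samples, and $\lambda > 0$ trades accuracy against the number of genes used.

```latex
\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(y_i,\, f_w(x_i)\bigr) \;+\; \lambda \,\lVert w \rVert_1
```

Genes whose weight is driven to zero by the penalty are effectively removed, so training the classifier and selecting the features happen in the same optimization.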
Relief-F
• Relief-F is a filter approach for feature selection.
– Relief
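The original two-class Relief algorithm can be sketched as follows. This is a minimal version assuming numerical attributes only, one nearest hit and one nearest miss per sampled instance, and a Manhattan distance over range-normalized attributes. Function and variable names are illustrative.

```python
import random

def relief(X, y, n_iterations, seed=0):
    """Original (two-class) Relief for numerical attributes."""
    rng = random.Random(seed)
    n_attrs = len(X[0])
    # Attribute ranges normalize differences into [0, 1]; a constant attribute gets span 1.
    spans = [max(r[a] for r in X) - min(r[a] for r in X) or 1.0 for a in range(n_attrs)]

    def diff(a, r1, r2):
        return abs(r1[a] - r2[a]) / spans[a]

    def distance(r1, r2):
        return sum(diff(a, r1, r2) for a in range(n_attrs))

    weights = [0.0] * n_attrs
    for _ in range(n_iterations):
        i = rng.randrange(len(X))
        others = [j for j in range(len(X)) if j != i]
        # Nearest instance of the same class (hit) and of the other class (miss).
        hit = min((j for j in others if y[j] == y[i]), key=lambda j: distance(X[i], X[j]))
        miss = min((j for j in others if y[j] != y[i]), key=lambda j: distance(X[i], X[j]))
        for a in range(n_attrs):
            # Reward attributes that differ on the miss, penalize those that differ on the hit.
            weights[a] += (diff(a, X[i], X[miss]) - diff(a, X[i], X[hit])) / n_iterations
    return weights

# Toy data (made up): attribute 0 separates the classes, attribute 1 is noise.
X = [[1.0, 0.3], [1.1, 0.7], [0.9, 0.5], [3.0, 0.4], [3.2, 0.6], [2.9, 0.5]]
y = [0, 0, 0, 1, 1, 1]
weights = relief(X, y, n_iterations=20)
```

Attributes with larger weights are ranked as more relevant, so on this toy data attribute 0 ends up ahead of attribute 1.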
Relief-F
• The original Relief can only handle binary classification problems. An extension was made to handle multi-class problems.
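For reference, the multi-class weight update is usually stated as below. The notation is the commonly used one and is not taken from the slides: $R_i$ is the sampled instance, $H_j$ are its $k$ nearest hits, $M_j(C)$ are its $k$ nearest misses from class $C$, $P(C)$ is the prior probability of class $C$, and $m$ is the number of sampled instances.

```latex
W[A] \leftarrow W[A]
  - \sum_{j=1}^{k} \frac{\operatorname{diff}(A, R_i, H_j)}{m \cdot k}
  + \sum_{C \neq \operatorname{class}(R_i)}
      \frac{P(C)}{1 - P(\operatorname{class}(R_i))}
      \sum_{j=1}^{k} \frac{\operatorname{diff}(A, R_i, M_j(C))}{m \cdot k}
```

Compared with the binary case, the single nearest miss is replaced by misses from every other class, each weighted by that class's share of the remaining prior probability.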
Relief-F
• Categorical attributes
• Numerical attributes
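The difference function used by Relief-style methods is normally defined separately for the two attribute types. The sketch below follows the usual definitions (0/1 mismatch for categorical values, range-normalized absolute difference for numerical values). The function and parameter names are illustrative.

```python
def diff_categorical(value_1, value_2):
    """0 if the two categorical values are equal, 1 otherwise."""
    return 0.0 if value_1 == value_2 else 1.0

def diff_numerical(value_1, value_2, attr_min, attr_max):
    """Absolute difference scaled by the attribute's range, so the result lies in [0, 1]."""
    span = attr_max - attr_min
    return abs(value_1 - value_2) / span if span > 0 else 0.0

# Example values (made up).
same_base = diff_categorical("A", "A")
other_base = diff_categorical("A", "G")
expression_gap = diff_numerical(2.0, 5.0, attr_min=0.0, attr_max=10.0)
```

Normalizing by the range keeps attributes measured on different scales comparable when their differences are summed into a distance.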
Relief-F Problem
• Time Complexity
– $m \times (m \times a + c \times m \times a + a) = O(c m^2 a)$
– Assume $m = 100$, $c = 3$, $a = 10{,}000$
– Time complexity $300 \times 10^6$
• Scores each attribute individually, so it cannot select a subset of “good” genes.
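The arithmetic behind the estimate can be checked directly. The reading of the symbols is an assumption, since the slide does not define them: m as the number of instances, c as the number of classes, and a as the number of attributes (genes).

```python
m, c, a = 100, 3, 10_000   # values assumed on the slide

# The three terms inside the parentheses of the slide's formula.
per_iteration = m * a + c * m * a + a
exact_count = m * per_iteration

# The dominant term that gives the O(c * m^2 * a) bound.
dominant_term = c * m**2 * a

print(exact_count)    # 401000000
print(dominant_term)  # 300000000
```

The slide's figure of 300×10^6 is the dominant term; the full expression evaluates to 401×10^6, which is the same order of magnitude.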
Solution: Parallel Relief-F
• Version 1:
– Clusters run ReliefF in parallel, and the updated weight values are collected at the master.
– Theoretical time complexity $O(c m^2 a / p)$
• p is the number of clusters
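The master/worker pattern of Version 1 might look like the sketch below. Several things here are assumptions rather than the course's design: the simpler two-class Relief update is used in place of full ReliefF, every instance is visited once instead of being sampled at random, and a local thread pool stands in for the cluster (threads do not give a real speedup in CPython). All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def relief_partial(X, y, sample_indices, spans):
    """Sum of unnormalized Relief weight updates for one worker's share of instances."""
    n_attrs = len(X[0])

    def diff(a, r1, r2):
        return abs(r1[a] - r2[a]) / spans[a]

    def distance(r1, r2):
        return sum(diff(a, r1, r2) for a in range(n_attrs))

    partial = [0.0] * n_attrs
    for i in sample_indices:
        others = [j for j in range(len(X)) if j != i]
        hit = min((j for j in others if y[j] == y[i]), key=lambda j: distance(X[i], X[j]))
        miss = min((j for j in others if y[j] != y[i]), key=lambda j: distance(X[i], X[j]))
        for a in range(n_attrs):
            partial[a] += diff(a, X[i], X[miss]) - diff(a, X[i], X[hit])
    return partial

def parallel_relief(X, y, n_workers):
    """Split the instances across workers, then combine their updates at the master."""
    n_attrs = len(X[0])
    spans = [max(r[a] for r in X) - min(r[a] for r in X) or 1.0 for a in range(n_attrs)]
    indices = list(range(len(X)))
    # Worker w handles instances w, w + n_workers, w + 2 * n_workers, ...
    chunks = [indices[w::n_workers] for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda chunk: relief_partial(X, y, chunk, spans), chunks))
    # Master step: add up the partial updates and normalize by the number of instances.
    return [sum(p[a] for p in partials) / len(X) for a in range(n_attrs)]

# Toy data (made up): attribute 0 separates the classes, attribute 1 is noise.
X = [[1.0, 0.3], [1.1, 0.7], [0.9, 0.5], [3.0, 0.4], [3.2, 0.6], [2.9, 0.5]]
y = [0, 0, 0, 1, 1, 1]
weights_parallel = parallel_relief(X, y, n_workers=3)
weights_serial = parallel_relief(X, y, n_workers=1)
```

Because each worker only reads the data and its updates are summed afterwards, the combined weights match a single-worker run up to floating-point rounding.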
Parallel Relief-F
• Version 2:
– Clusters run ReliefF in parallel, and each cluster directly updates the global weight values.
– Each cluster also considers the current weight values when selecting nearest neighbour instances.
– Theoretical time complexity $O(c m^2 a / p)$
• p is the number of clusters
Parallel Relief-F
• Version 3
– Consider selecting a subset of important features.
– Compare the difference between including and excluding a specific feature to understand the importance of a gene with respect to an existing subset of features.
– Discussion in private!
Outline
• Gene Selection
• Sequence Alignment
– Given a dataset D with N=1000 sequences (e.g., 1000 each)
– Given an input x,
– Do pair-wise global sequence alignment between x and all sequences in D
• Dispatch jobs to clusters
• And aggregate the results
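The alignment task can be sketched as below. The slides do not specify the algorithm or scoring scheme, so this assumes Needleman-Wunsch global alignment with a simple match/mismatch/linear-gap scoring, and uses a local thread pool as a stand-in for dispatching jobs to the cluster. The function names and sequences are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def global_alignment_score(x, s, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score of the best global alignment of x and s."""
    # previous[j] holds the best score for aligning x[:i-1] with s[:j].
    previous = [j * gap for j in range(len(s) + 1)]
    for i in range(1, len(x) + 1):
        current = [i * gap]
        for j in range(1, len(s) + 1):
            substitution = match if x[i - 1] == s[j - 1] else mismatch
            current.append(max(previous[j - 1] + substitution,  # align x[i-1] with s[j-1]
                               previous[j] + gap,               # x[i-1] against a gap
                               current[j - 1] + gap))           # s[j-1] against a gap
        previous = current
    return previous[-1]

def align_against_dataset(x, D, n_workers=4):
    """Score x against every sequence in D in parallel and aggregate the results."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scores = list(pool.map(lambda s: global_alignment_score(x, s), D))
    best = max(range(len(D)), key=lambda i: scores[i])
    return scores, best

# Toy dataset (made up); the real D would hold N = 1000 sequences.
D = ["GATTACA", "GCATGCU", "TTTT"]
scores, best = align_against_dataset("GATTACA", D)
```

Each pairwise alignment is independent of the others, so the jobs can be split across workers in any way and the aggregation step only has to collect one score per sequence.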