COT6930 Course Project

Outline

• Gene Selection
• Sequence Alignment

Why Gene Selection

• Identify marker genes that characterize different tumor status.

• Many genes are redundant and introduce noise that lowers classification performance.

• Can eventually lead to a diagnostic chip (“breast cancer chip”, “liver cancer chip”).

Gene Selection

• Methods fall into three categories:
– Filter methods
– Wrapper methods
– Embedded methods

Filter methods are simplest and most frequently used in the literature

Wrapper methods are likely the most accurate ones

Filter Method

• Features (genes) are scored by their evidence of predictive power and then ranked.

• The top s genes with the highest scores are selected and used by the classifier.
– Scores: t-statistics, F-statistics, signal-to-noise ratio, …
– The number of features selected, s, is then determined by cross-validation.
• Advantage: Fast and easy to interpret.
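The ranking step above can be sketched in a few lines. This is a minimal illustration, not the course's implementation: it assumes a two-class problem with labels 0 and 1, uses one common form of the signal-to-noise score, and fixes s by hand rather than choosing it by cross-validation. The function names and the toy matrix are hypothetical.

```python
from statistics import mean, stdev

def signal_to_noise(values, labels):
    """Signal-to-noise score for one gene: |mean1 - mean0| / (sd1 + sd0)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    denom = stdev(pos) + stdev(neg)
    return abs(mean(pos) - mean(neg)) / denom if denom > 0 else 0.0

def top_s_genes(X, labels, s):
    """Score every gene (column of X) independently and return the top-s indices."""
    n_genes = len(X[0])
    scores = [signal_to_noise([row[g] for row in X], labels) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return ranked[:s]

# Toy expression matrix (assumed data): 6 samples x 3 genes.
# Only gene 0 differs clearly between the two classes.
X = [
    [5.1, 1.0, 0.3],
    [4.9, 1.2, 0.9],
    [5.3, 0.8, 0.5],
    [1.0, 1.1, 0.4],
    [1.2, 0.9, 0.8],
    [0.8, 1.0, 0.6],
]
y = [1, 1, 1, 0, 0, 0]
selected = top_s_genes(X, y, s=1)
```

Each gene is scored without reference to any other gene, which is what makes the method fast and also what causes the problems listed on the next slide.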

Good versus bad features

Filter Method: Problem

• Genes are considered independently.
– Redundant genes may be included.
– Genes that are jointly strong discriminators but individually weak will be ignored.

• Good single features do not necessarily form a good feature set

• The filtering procedure is independent of the classification method.
– The selected features can be applied to all types of classification methods.

Wrapper Method

• Iterative search: many “feature subsets” are scored based on classification performance, and the best one is used.
– Select a good subset of features.

• Subset selection: forward selection, backward selection, or their combinations.
– Exhaustive search is infeasible: n genes yield 2^n subsets.
– Greedy algorithms are used instead.
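Greedy forward selection can be sketched as follows. This is an illustration under assumptions, not the method the course prescribes: the classifier is a simple nearest-centroid rule, subsets are scored by leave-one-out accuracy, and the search stops when no added feature improves the score. All names and the toy data are hypothetical.

```python
def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on `features`."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
        point = [X[i][f] for f in features]
        pred = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += pred == y[i]
    return correct / len(X)

def forward_selection(X, y, max_features):
    """Greedily add the feature that most improves the wrapped classifier."""
    selected, best_score = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        # The classifier is rebuilt and evaluated for every candidate subset.
        scored = [(loo_accuracy(X, y, selected + [f]), f) for f in remaining]
        score, feat = max(scored, key=lambda t: t[0])
        if score <= best_score:
            break  # no candidate improves the current subset
        selected.append(feat)
        remaining.remove(feat)
        best_score = score
    return selected, best_score

# Toy data (assumed): gene 0 separates the classes, genes 1 and 2 do not.
X = [
    [5.1, 1.0, 0.3],
    [4.9, 1.2, 0.9],
    [5.3, 0.8, 0.5],
    [1.0, 1.1, 0.4],
    [1.2, 0.9, 0.8],
    [0.8, 1.0, 0.6],
]
y = [1, 1, 1, 0, 0, 0]
subset, accuracy = forward_selection(X, y, max_features=2)
```

The nested evaluation inside the loop is the source of the cost discussed on the next slide: every candidate subset requires a full train-and-evaluate cycle.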

Wrapper Method: Problem

• Computationally expensive
– For each feature subset considered, the classifier is built and evaluated.
• Exhaustive search is infeasible
– Greedy search only.
• Easy to overfit.

Embedded Method

• Attempt to jointly or simultaneously train both a classifier and a feature subset.

• Often optimize an objective function that jointly rewards accuracy of classification and penalizes use of more features.

• Intuitively appealing
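One common shape for such an objective is a penalized loss. The slides do not name a specific method, so the form below is an assumption chosen for illustration: w is a weight vector over genes, ℓ is a classification loss, f is the classifier, and λ controls how strongly extra features are penalized.

```latex
\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(y_i,\, f(x_i; w)\bigr) \;+\; \lambda\,\lVert w \rVert_1
```

With an L1 penalty, many weights are driven to exactly zero, and the genes with nonzero weight form the selected subset.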

Relief-F

• Relief-F is a filter approach for feature selection.

– Relief

Relief-F

• The original Relief can only handle binary classification problems.

An extension was made to handle multi-class problems.

Relief-F

• Categorical attributes

• Numerical attributes
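The formulas for these two cases did not survive on this slide. The sketch below uses the definitions that are standard in the Relief literature and is assumed, not confirmed, to match the course's: categorical attributes differ by 0 or 1, numerical attributes by their range-normalized absolute difference. For brevity it implements basic two-class Relief with one nearest hit and one nearest miss, and it visits every instance once instead of sampling. Relief-F proper averages over k neighbours and weights misses by class prior.

```python
def diff(a, b, value_range=None):
    """Attribute difference: 0/1 for categorical, range-normalized for numerical."""
    if value_range is None:  # categorical attribute
        return 0.0 if a == b else 1.0
    return abs(a - b) / value_range if value_range > 0 else 0.0

def relief(X, y, ranges):
    """Basic two-class Relief; `ranges[a]` is max-min for numerical a, None for categorical."""
    m, n_attr = len(X), len(X[0])
    weights = [0.0] * n_attr

    def distance(r, s):
        return sum(diff(r[a], s[a], ranges[a]) for a in range(n_attr))

    for i, r in enumerate(X):
        hits = [X[j] for j in range(m) if j != i and y[j] == y[i]]
        misses = [X[j] for j in range(m) if y[j] != y[i]]
        hit = min(hits, key=lambda s: distance(r, s))    # nearest same-class instance
        miss = min(misses, key=lambda s: distance(r, s))  # nearest other-class instance
        for a in range(n_attr):
            # Reward attributes that separate classes, penalize those that split a class.
            weights[a] += (diff(r[a], miss[a], ranges[a]) - diff(r[a], hit[a], ranges[a])) / m
    return weights

# Toy data (assumed): attribute 0 is numerical and informative,
# attribute 1 is categorical and unrelated to the class.
X = [[0.1, "a"], [0.2, "b"], [0.9, "a"], [1.0, "b"]]
y = [0, 0, 1, 1]
w = relief(X, y, ranges=[0.9, None])
```

The informative attribute ends up with a positive weight and the unrelated one with a negative weight, which is the ranking signal the filter uses.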

Relief-F Problem

• Time complexity
– m×(m×a + c×m×a + a) = O(cm²a)
– Assume m=100, c=3, a=10,000
– Time complexity ≈ 300×10⁶ (dominant term c×m²×a)

• Considers only one attribute at a time; cannot select a subset of “good” genes.
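The arithmetic behind the estimate can be checked directly. The slide does not define the symbols; the sketch assumes the usual reading of m as the number of instances, c as the number of classes, and a as the number of attributes (genes). The function name is hypothetical.

```python
def relieff_operation_count(m, c, a):
    """Return (exact count from the slide's expression, dominant term c*m^2*a)."""
    exact = m * (m * a + c * m * a + a)
    dominant = c * m**2 * a
    return exact, dominant

exact, dominant = relieff_operation_count(m=100, c=3, a=10_000)
# exact    = 401,000,000
# dominant = 300,000,000
```

The full expression gives about 401×10⁶ operations, so the 300×10⁶ figure counts only the dominant c×m²×a term.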

Solution: Parallel Relief-F

• Version 1:
– Clusters run Relief-F in parallel, and the updated weight values are collected at the master.
– Theoretical time complexity: O(cm²a/p)
• p is the number of clusters
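A minimal sketch of the Version 1 pattern follows, under several assumptions: worker threads stand in for clusters, attributes are numerical and already scaled to [0, 1], the update is the basic two-class Relief rule with a single nearest hit and miss, and every instance is visited once. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_relief(X, y, indices):
    """Worker: weight updates contributed by the instances in `indices`."""
    n_attr = len(X[0])
    w = [0.0] * n_attr

    def dist(r, s):
        return sum(abs(p - q) for p, q in zip(r, s))

    for i in indices:
        r = X[i]
        hit = min((X[j] for j in range(len(X)) if j != i and y[j] == y[i]),
                  key=lambda s: dist(r, s))
        miss = min((X[j] for j in range(len(X)) if y[j] != y[i]),
                   key=lambda s: dist(r, s))
        for a in range(n_attr):
            w[a] += abs(r[a] - miss[a]) - abs(r[a] - hit[a])
    return w

def parallel_relief(X, y, p):
    """Master: split instances into p chunks, run the workers, sum their weights."""
    m = len(X)
    chunks = [range(k, m, p) for k in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        partials = list(pool.map(lambda idx: partial_relief(X, y, idx), chunks))
    return [sum(col) / m for col in zip(*partials)]

# Toy data (assumed): attribute 0 separates the classes, attribute 1 does not.
X = [[0.1, 0.5], [0.2, 0.3], [0.9, 0.6], [1.0, 0.4]]
y = [0, 0, 1, 1]
weights = parallel_relief(X, y, p=2)
```

Because each worker only reads the data and returns its own partial sums, the result matches the serial run regardless of p, and the per-worker cost falls roughly in proportion to 1/p.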

Parallel Relief-F

• Version 2:
– Clusters run Relief-F in parallel, and each cluster directly updates the global weight values.
– Each cluster also considers the current weight values when selecting nearest-neighbour instances.
– Theoretical time complexity: O(cm²a/p)
• p is the number of clusters

Parallel Relief-F

• Version 3:
– Consider selecting a subset of important features.
– Compare the difference between including and excluding a specific feature to understand the importance of a gene with respect to an existing subset of features.
– Discussion in private!

Outline

• Gene Selection
• Sequence Alignment

– Given a dataset D with N=1000 sequences (e.g., 1000 each)

– Given an input x,
– Do pair-wise global sequence alignment between x and all sequences in D
• Dispatch jobs to clusters
• And aggregate the results
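The dispatch-and-aggregate pattern can be sketched as below. The assumptions are these: global alignment is scored with the Needleman-Wunsch recurrence, the scoring scheme is match +1, mismatch -1, gap -1, and a thread pool stands in for the cluster. On a real cluster each job would go to a separate node. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def global_alignment_score(x, s, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score, keeping only two DP rows."""
    prev = [j * gap for j in range(len(s) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * gap]
        for j in range(1, len(s) + 1):
            sub = match if x[i - 1] == s[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align x[i-1] with s[j-1]
                            prev[j] + gap,       # gap in s
                            curr[j - 1] + gap))  # gap in x
        prev = curr
    return prev[-1]

def align_against_dataset(x, D, workers=4):
    """Dispatch one alignment job per sequence, then aggregate the scores."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda s: global_alignment_score(x, s), D))
    # Aggregate: best-scoring sequences first.
    return sorted(zip(scores, D), reverse=True)

# Toy dataset (assumed); the project's D would hold N=1000 sequences.
D = ["ACGT", "ACT", "TTTT"]
results = align_against_dataset("ACGT", D)
```

The alignments are independent of one another, so the work divides cleanly across jobs and the only shared step is the final aggregation.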
