
Page 1

A Short and Simple Introduction to Linear Discriminants (with almost no math)

Jennifer Listgarten, November 2002.

Page 2

Introduction

• Linear discriminants are a family of mathematical models that allow us to classify data (like microarray data) into preset groups (e.g. cancer vs. non-cancer, metastatic vs. non-metastatic, responds well to drug vs. responds poorly to drug).

• 'Discriminant' simply means that the model has the ability to discriminate between two classes.

• The meaning of the word ‘linear’ will become clearer later.

Page 3

Motivation I

• We previously spoke at great length about common clustering methods for microarray data (unsupervised learning).

• Supervised techniques, which make use of known class labels, are often much more powerful and useful for classification.

• Linear discriminants are one of the older, well-studied supervised techniques, in both traditional statistics and machine learning.

Page 4

Motivation II

• Linear discriminants are widely used today in many application domains, including the modeling of various types of biological data.

• Many classes or sub-classes of techniques are actually linear discriminants (e.g. single-layer Artificial Neural Networks, the Fisher Discriminant, Support Vector Machines and many more).

• They provide a very general framework upon which much has been built, i.e. they can be extended into very sophisticated, robust techniques.

Page 5

eg. Classifying Cancer Patients vs. Healthy Patients from Microarray

Patient_X = (gene_1, gene_2, gene_3, …, gene_N)

N (the number of dimensions) is normally larger than 2, so we can't visualize the data.

[Figure: Cancerous and Healthy patient points in N-dimensional space]

Page 6

eg. Classifying Cancer Patients vs. Healthy Patients from Microarray

For simplicity, pretend that we are only looking at the expression levels of 2 genes.

[Figure: scatter plot of Gene_1 expression level (x-axis, -5 to 5) vs. Gene_2 expression level (y-axis, -5 to 5), showing Cancerous and Healthy points; axes annotated Up-regulated and Down-regulated]

Page 7

eg. Classifying Cancer Patients vs. Healthy Patients from Microarray

Question: How can we build a classifier for this data?

[Figure: the same Gene_1 vs. Gene_2 scatter plot of Cancerous and Healthy points]

Page 8

eg. Classifying Cancer Patients vs. Healthy Patients from Microarray


Simple Classification Rule:

IF gene_1 < 0 AND gene_2 < 0 THEN person = healthy
IF gene_1 > 0 AND gene_2 > 0 THEN person = cancerous

[Figure: the same scatter plot, with the thresholds at 0 on each axis separating Healthy (lower left) from Cancerous (upper right)]
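A minimal sketch of this rule in Python (the function and its "undecided" fallback are illustrative, not from the original talk):

```python
def classify(gene_1, gene_2):
    """Toy threshold rule from the slide: both genes below 0 -> healthy,
    both above 0 -> cancerous."""
    if gene_1 < 0 and gene_2 < 0:
        return "healthy"
    if gene_1 > 0 and gene_2 > 0:
        return "cancerous"
    return "undecided"  # the rule says nothing about mixed signs

print(classify(-1.2, -0.5))  # healthy
print(classify(2.0, 1.3))    # cancerous
```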

Page 9

eg. Classifying Cancer Patients vs. Healthy Patients from Microarray

Simple Classification Rule:

IF gene_1 < 0 AND gene_2 < 0 AND … AND gene_5000 < Y THEN person = healthy
IF gene_1 > 0 AND gene_2 > 0 AND … AND gene_5000 > W THEN person = cancerous

If we move away from our simple example with 2 genes to a realistic case with, say, 5000 genes, then:

1. What will these rules look like?
2. How will we find them?

Gets a little complicated, unwieldy…

Page 10

eg. Classifying Cancer Patients vs. Healthy Patients from Microarray

[Figure: the same scatter plot with a single straight line separating the Healthy points from the Cancerous points]

Reformulate the previous rule

SIMPLE RULE:

• If a data point lies to the 'left' of the line, then 'healthy'.
• If a data point lies to the 'right' of the line, then 'cancerous'.

It is easier to generalize this line to 5000 genes than a list of rules. It is also easier to solve mathematically.

Page 11

More Than 2 Genes (dimensions) ? Easy to Extend

[Figure: the separating line between the Cancerous and Healthy points]

• Line in 2D: x1*C1 + x2*C2 = T

• If we had 3 genes, and needed to build a 'line' in 3-dimensional space, then we would be seeking a plane.

Plane in 3D: x1*C1 + x2*C2 + x3*C3 = T

• If we were looking in more than 3 dimensions, the 'plane' is called a hyperplane. A hyperplane is simply the generalization of a plane to dimensions higher than 3.

Hyperplane in N dimensions: x1*C1 + x2*C2 + x3*C3 + … + xN*CN = T
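As a small sketch, the N-dimensional rule is just a dot product compared against a threshold (the weights C and threshold T below are made-up illustrative values):

```python
import numpy as np

def side_of_hyperplane(x, C, T):
    """Return which side of the hyperplane C.x = T the point x lies on."""
    return "cancerous" if np.dot(x, C) > T else "healthy"

# Illustrative 4-gene example: weights and threshold chosen arbitrarily.
C = np.array([0.5, 1.0, -0.2, 0.8])
x = np.array([1.2, 0.7, -0.3, 2.1])  # one patient's expression levels
print(side_of_hyperplane(x, C, T=0.0))
```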

Page 12

eg. Classifying Cancer Patients vs. Healthy Patients from Microarray

[Figure: the separating line, with the region where gene1*C1 + gene2*C2 < T on one side and > T on the other]

Why is it called ‘linear’?

The rule of 'which side is the point on' looks, mathematically, like:

gene1*C1 + gene2*C2 > T then cancer
gene1*C1 + gene2*C2 < T then healthy

It is linear in the input (the gene expression levels).


Page 13

Linear Vs. Non-Linear

Linear discriminants:

gene1*C1 + gene2*C2 > T
gene1*C1 + gene2*C2 < T

1/[1 + exp(-(gene1*C1 + gene2*C2 + T))] > 0.5
1/[1 + exp(-(gene1*C1 + gene2*C2 + T))] < 0.5   ('logistic' linear discriminant: the sigmoid is a monotone function of a linear combination, so the decision boundary is still a line)

Non-linear discriminants:

gene1^2*C1 + gene2*C2 > T
gene1^2*C1 + gene2*C2 < T

gene1*gene2*C > T
gene1*gene2*C < T

Mathematically, linear problems are generally much easier to solve than non-linear problems.
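A small sketch contrasting the plain linear rule and the logistic form (the weights are illustrative, and the sign convention inside the sigmoid is chosen so the two rules agree):

```python
import numpy as np

def linear_rule(g1, g2, C1, C2, T):
    return "cancer" if g1 * C1 + g2 * C2 > T else "healthy"

def logistic_rule(g1, g2, C1, C2, T):
    # Sigmoid of the same linear combination; p > 0.5 exactly when the
    # linear combination exceeds T, so the boundary is still a line.
    p = 1.0 / (1.0 + np.exp(-(g1 * C1 + g2 * C2 - T)))
    return "cancer" if p > 0.5 else "healthy"

C1, C2, T = 1.0, 1.0, 0.0  # illustrative values
print(linear_rule(1.5, 2.0, C1, C2, T), logistic_rule(1.5, 2.0, C1, C2, T))
```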

Page 14

Back to our Linear Discriminant

[Figure: the same data with several different lines, each of which properly separates the two classes]

There are actually infinitely many lines that 'properly' divide the points.

Which is the correct one?

Page 15

[Figure: a separating line with its margin – the smallest distance from the line to any point – marked]

One solution (the one that SVMs use):

1. Find a line that has all the data points on the proper side.
2. Of all lines that satisfy (1), find the one that maximizes the 'margin' (the smallest distance between any point and the line).
3. This is called 'constrained optimization' in mathematics (see the sketch below).

[Figure: two candidate separating lines, one with a smaller margin and one with the largest margin]
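As an illustration of steps (1)–(2), a maximum-margin line can be fit with an off-the-shelf SVM implementation; a minimal sketch with scikit-learn on made-up toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-gene data: negative quadrant = healthy (0), positive = cancerous (1).
X = np.array([[-3, -2], [-2, -4], [-4, -1], [2, 3], [3, 1], [4, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear-kernel SVM with a very large C approximates the hard-margin
# 'all points on the proper side' formulation from the slide.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# coef_ and intercept_ give the line w1*x1 + w2*x2 + b = 0; the distance
# from the line to the closest points (the margin) is 1/||w||.
w, b = clf.coef_[0], clf.intercept_[0]
print("line:", w, b, " margin:", 1.0 / np.linalg.norm(w))
```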

Page 16

Obtaining Different 'Lines': Objective Functions

• In general, the line that you end up with depends on some criterion, defined by the 'objective function' (for the SVM, the margin).

• An objective function is chosen by the modeler, and varies depending on exactly what the modeler is trying to achieve or thinks will work well (e.g. margin, posterior probabilities, sum-of-squares error, small weight vector).

• The function usually has a theoretical foundation (e.g. risk minimization, maximum likelihood / Gaussian processes / zero-mean Gaussian noise).

Page 17

What if the data looked like this?

[Figure: a Gene_1 vs. Gene_2 scatter plot in which the Cancerous and Healthy points overlap]

How could we build a suitable line that divides the data nicely?

Depends…

•Is it just a few points that are small ‘outliers’?

•Or is the data simply not amenable to this kind of classification?

Page 18

[Figure: three panels of Cancerous and Healthy points]

Linearly separable data: can make a great classifier.

Almost linearly separable data: a few outliers, but probably can still find a 'good' line.

Not linearly separable data: inherently, the data cannot be separated by any one line.

Page 19

Not linearly separable data: inherently, the data cannot be separated by any one line.

[Figure: overlapping Cancerous and Healthy points, with several lines jointly carving out the class regions]

• If we allow the model to have more than one line (or hyperplane), then maybe we can still form a nice model.

• This is much more complicated.

• This is one thing that neural networks allow us to do: combine linear discriminants together to form a single classifier (no longer a linear classifier) – see the small sketch below.

• No time to delve further during this talk.
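A minimal illustrative sketch (not from the talk) of combining two linear discriminants into a non-linear classifier, assuming the positive class sits in a band between two parallel lines:

```python
import numpy as np

def two_line_classifier(x, w1, b1, w2, b2):
    """Combine two linear discriminants: class 1 only if the point is on
    the positive side of BOTH lines (an AND of two half-planes). The
    combined decision region is no longer a single half-plane, so the
    overall classifier is not linear."""
    h1 = np.dot(w1, x) + b1 > 0
    h2 = np.dot(w2, x) + b2 > 0
    return int(h1 and h2)

# Illustrative weights: a band between the lines x1 + x2 = -1 and x1 + x2 = 1.
w1, b1 = np.array([1.0, 1.0]), 1.0    # positive side: x1 + x2 > -1
w2, b2 = np.array([-1.0, -1.0]), 1.0  # positive side: x1 + x2 < 1
print(two_line_classifier(np.array([0.0, 0.0]), w1, b1, w2, b2))  # 1 (inside band)
print(two_line_classifier(np.array([3.0, 3.0]), w1, b1, w2, b2))  # 0 (outside)
```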

Page 20

[Figure: Cancerous and Healthy points thoroughly intermixed across the plot]

Not linearly separable data.

Now what??

Even with many lines it would be extremely difficult to build a good classifier.

Page 21

Sometimes Need to Transform the Data

Not linearly separable data. Need to transform the coordinates: polar coordinates, Principal Components coordinates, kernel transformation into a higher-dimensional space (support vector machines).

[Figure: after a polar-coordinate transform – angular degree (phase) vs. distance from center (radius) – the same data become linearly separable]
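A small sketch of the polar-coordinate idea on made-up ring-shaped data (the data generation is illustrative):

```python
import numpy as np

def to_polar(X):
    """Map 2D points (x1, x2) to (radius, angle); a ring-shaped class
    becomes a horizontal band in these coordinates, which a straight
    line can then separate."""
    r = np.hypot(X[:, 0], X[:, 1])
    theta = np.arctan2(X[:, 1], X[:, 0])
    return np.column_stack([r, theta])

# Toy data: inner cluster (one class) and a surrounding ring (the other).
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 50)
inner = rng.normal(0, 0.5, (50, 2))
outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)])

# In polar coordinates, a single threshold on the radius separates them.
print(to_polar(inner)[:, 0].max() < to_polar(outer)[:, 0].min())  # True (typically)
```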

Page 22

Caveats

• May need to find a subset of the data that is linearly separable (this is called feature selection).

• Feature selection is what we call, in computer science, an NP-complete problem, which means, in layman's terms: there is no known efficient way to solve it exactly once the number of features is large. Feature selection is an open research problem.

• There is a spate of techniques that give you approximate solutions to feature selection; a small sketch follows below.

• Feature selection is mandatory in microarray expression experiments because there is so much noisy, irrelevant data.

• Also, with microarray data, there is much missing data, which introduces difficulties.
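One common approximate approach is simple univariate filtering: score each gene independently and keep the top k. A hedged sketch with scikit-learn, on random placeholder data standing in for a 5000-gene experiment:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Placeholder data: 20 patients x 5000 genes, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5000))
y = rng.integers(0, 2, size=20)

# Keep the 50 genes whose expression best separates the two classes
# according to a per-gene ANOVA F-test (a greedy approximation, not an
# exact solution to the feature-selection problem).
selector = SelectKBest(f_classif, k=50).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)  # (20, 50)
```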

Page 23

Other Biological Applications

• Gene finding in DNA: the input is part of a DNA strand, the output is whether or not the nucleotide at the centre is inside a gene.

• Sequence-based gene classification: the input is a gene sequence, output is a functional class.

• Protein secondary structure prediction: input is a sequence of amino acids, output is the local secondary structure.

• Protein localization in cell: the input is an amino acid sequence, the output is position in the cell (eg. nucleus, membrane, etc.)

Taken from Introduction to Support Vector Machines and Applications to Computational Biology, Jean-Philippe Vert.

Page 24

Wrap-Up

• Intuitive feel for linear discriminants.
• Widely applicable technique – for many problems in Polyomx and many other areas.
• Difficulties: missing data, feature selection.
• Have used linear discriminants for our SNP data and microarray data.

If interested in knowing more, a great book is: Neural Networks for Pattern Recognition, Christopher Bishop, 1995.

Page 25

Finding the Equation of the Linear Discriminant (How a Single Layer Neural Network Might Do It)

The discriminant function:

y(x) = w^T x + w0

where w is the weight vector; y(x) > 0 on one side of the hyperplane, y(x) < 0 on the other, and y(x) = 0 on the hyperplane itself.

Eg. sum-of-squares error function (more for regression), with t_n in {-1, +1}:

E(w, w0) = sum_{n=1..N} [ t_n - (w^T x_n + w0) ]^2

Minimize the objective function:

1. Exact solution via matrix algebra, since here E is convex.
2. Iterative algorithms (gradient descent, conjugate gradient, Newton's method, etc.) for cases where E may not be convex; these use the gradient ∇E = (∂E/∂w_1, ∂E/∂w_2, …, ∂E/∂w_K).

Can regularize by adding ||w||^2 to E.
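A small sketch of option (2), plain gradient descent on the sum-of-squares error (the toy data, learning rate, and step count are illustrative):

```python
import numpy as np

def fit_linear_discriminant(X, t, lr=0.01, steps=1000, reg=0.0):
    """Minimize E(w, w0) = sum_n (t_n - (w.x_n + w0))^2 by gradient
    descent; reg adds the optional ||w||^2 regularizer from the slide."""
    N, D = X.shape
    w, w0 = np.zeros(D), 0.0
    for _ in range(steps):
        err = t - (X @ w + w0)                        # residuals t_n - y(x_n)
        w += lr * (2 * X.T @ err / N - 2 * reg * w)   # step along -dE/dw (averaged)
        w0 += lr * 2 * err.mean()                     # step along -dE/dw0
    return w, w0

# Toy data: two separable clusters with labels -1 and +1.
X = np.array([[-2.0, -1.5], [-1.0, -2.0], [1.5, 2.0], [2.0, 1.0]])
t = np.array([-1.0, -1.0, 1.0, 1.0])
w, w0 = fit_linear_discriminant(X, t)
print(np.sign(X @ w + w0))  # should match t
```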

Page 26

Finding the Equation of the Linear Discriminant (How an SVM Would Do It)

The discriminant function, with t_n in {-1, +1}:

y(x) = w^T x + w0

The margin is given by the distance from the hyperplane to the closest points; for a point x this distance is |w^T x + w0| / ||w||.

Minimize ||w||^2 subject to the following constraints:

t_i (w^T x_i + w0) - 1 >= 0, for i = 1, …, N

(so every point is on the proper side, at distance at least 1/||w|| from the hyperplane).

Use Lagrange multipliers:

L(w, w0, λ) = ||w||^2 - sum_{i=1..N} λ_i [ t_i (w^T x_i + w0) - 1 ]
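A sketch of this constrained optimization with a general-purpose solver (scipy's SLSQP, on made-up toy data; a real SVM solver would typically work with the Lagrangian dual, but this illustrates the primal formulation directly):

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data with labels t in {-1, +1}.
X = np.array([[-2.0, -1.5], [-1.0, -2.0], [1.5, 2.0], [2.0, 1.0]])
t = np.array([-1.0, -1.0, 1.0, 1.0])

# Parameters packed as [w1, w2, w0]; the objective is ||w||^2.
def objective(p):
    return p[0] ** 2 + p[1] ** 2

# One inequality constraint per point: t_i (w.x_i + w0) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": (lambda p, i=i: t[i] * (X[i] @ p[:2] + p[2]) - 1.0)}
    for i in range(len(X))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, w0 = res.x[:2], res.x[2]
print("w:", w, "w0:", w0, "margin:", 1.0 / np.linalg.norm(w))
```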