
Page 1

A Short and Simple Introduction to Linear Discriminants (with almost no math)

Jennifer Listgarten, November 2002.

Page 2

Introduction

• Linear discriminants are a family of mathematical models that allow us to classify data (like microarray data) into preset groups (e.g. cancer vs. non-cancer, metastatic vs. non-metastatic, responds well to drug vs. responds poorly to drug).

• 'Discriminant' simply means that the model has the ability to discriminate between two classes.

• The meaning of the word ‘linear’ will become clearer later.

Page 3

Motivation I

• We previously spoke at great length about common clustering methods for microarray data (unsupervised learning).

• Supervised techniques, which make use of known class labels, are often much more powerful and useful for classification.

• Linear discriminants are one of the older, well-studied supervised techniques, in both traditional statistics and machine learning.

Page 4

Motivation II

• Linear discriminants are widely used today in many application domains, including the modeling of various types of biological data.

• Many classes or sub-classes of techniques are actually linear discriminants (e.g. single-layer Artificial Neural Networks, the Fisher Discriminant, Support Vector Machines and many more).

• They provide a very general framework upon which much has been built, i.e. they can be extended into very sophisticated, robust techniques.

Page 5

eg. Classifying Cancer Patients vs. Healthy Patients from Microarray

Patient_X = (gene_1, gene_2, gene_3, …, gene_N)

N (the number of dimensions) is normally larger than 2, so we can't visualize the data.

[Figure: Cancerous and Healthy patient points in N-dimensional space]

Page 6

eg. Classifying Cancer Patients vs. Healthy Patients from Microarray

For simplicity, pretend that we are only looking at the expression levels of 2 genes.

[Figure: scatter plot of Gene_1 expression level (x-axis, -5 to 5) vs. Gene_2 expression level (y-axis, -5 to 5), showing Cancerous and Healthy points; axes annotated Up-regulated and Down-regulated]

Page 7

eg. Classifying Cancer Patients vs. Healthy Patients from Microarray

Question: How can we build a classifier for this data?

[Figure: the same Gene_1 vs. Gene_2 scatter plot of Cancerous and Healthy points]

Page 8

eg. Classifying Cancer Patients vs. Healthy Patients from Microarray


Simple Classification Rule:

IF gene_1 < 0 AND gene_2 < 0 THEN person = healthy
IF gene_1 > 0 AND gene_2 > 0 THEN person = cancerous

[Figure: the same scatter plot, with the thresholds at 0 on each axis separating Healthy (lower left) from Cancerous (upper right)]
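A minimal sketch of this rule in Python (the function and its "undecided" fallback are illustrative, not from the original talk):

```python
def classify(gene_1, gene_2):
    """Toy threshold rule from the slide: both genes below 0 -> healthy,
    both above 0 -> cancerous."""
    if gene_1 < 0 and gene_2 < 0:
        return "healthy"
    if gene_1 > 0 and gene_2 > 0:
        return "cancerous"
    return "undecided"  # the rule says nothing about mixed signs

print(classify(-1.2, -0.5))  # healthy
print(classify(2.0, 1.3))    # cancerous
```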

Page 9

eg. Classifying Cancer Patients vs. Healthy Patients from Microarray

Simple Classification Rule:

IF gene_1 < 0 AND gene_2 < 0 AND … AND gene_5000 < Y THEN person = healthy
IF gene_1 > 0 AND gene_2 > 0 AND … AND gene_5000 > W THEN person = cancerous

If we move away from our simple example with 2 genes to a realistic case with, say, 5000 genes, then:

1. What will these rules look like?
2. How will we find them?

Gets a little complicated, unwieldy…

Page 10

eg. Classifying Cancer Patients vs. Healthy Patients from Microarray

[Figure: the same scatter plot with a single straight line separating the Healthy points from the Cancerous points]

Reformulate the previous rule

SIMPLE RULE:

• If a data point lies to the 'left' of the line, then 'healthy'.
• If a data point lies to the 'right' of the line, then 'cancerous'.

It is easier to generalize this line to 5000 genes than a list of rules. It is also easier to solve mathematically.

Page 11

More Than 2 Genes (dimensions) ? Easy to Extend

[Figure: the separating line between the Cancerous and Healthy points]

• Line in 2D: x1*C1 + x2*C2 = T

• If we had 3 genes, and needed to build a 'line' in 3-dimensional space, then we would be seeking a plane.

Plane in 3D: x1*C1 + x2*C2 + x3*C3 = T

• If we were looking in more than 3 dimensions, the 'plane' is called a hyperplane. A hyperplane is simply the generalization of a plane to dimensions higher than 3.

Hyperplane in N dimensions: x1*C1 + x2*C2 + x3*C3 + … + xN*CN = T
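As a small sketch, the N-dimensional rule is just a dot product compared against a threshold (the weights C and threshold T below are made-up illustrative values):

```python
import numpy as np

def side_of_hyperplane(x, C, T):
    """Return which side of the hyperplane C.x = T the point x lies on."""
    return "cancerous" if np.dot(x, C) > T else "healthy"

# Illustrative 4-gene example: weights and threshold chosen arbitrarily.
C = np.array([0.5, 1.0, -0.2, 0.8])
x = np.array([1.2, 0.7, -0.3, 2.1])  # one patient's expression levels
print(side_of_hyperplane(x, C, T=0.0))
```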

Page 12

eg. Classifying Cancer Patients vs. Healthy Patients from Microarray

[Figure: the separating line, with the region where gene1*C1 + gene2*C2 < T on one side and > T on the other]

Why is it called ‘linear’?

The rule of 'which side is the point on' looks, mathematically, like:

gene1*C1 + gene2*C2 > T then cancer
gene1*C1 + gene2*C2 < T then healthy

It is linear in the input (the gene expression levels).


Page 13

Linear Vs. Non-Linear

Linear discriminants:

gene1*C1 + gene2*C2 > T
gene1*C1 + gene2*C2 < T

1/[1 + exp(-(gene1*C1 + gene2*C2 + T))] > 0.5
1/[1 + exp(-(gene1*C1 + gene2*C2 + T))] < 0.5   ('logistic' linear discriminant: the sigmoid is a monotone function of a linear combination, so the decision boundary is still a line)

Non-linear discriminants:

gene1^2*C1 + gene2*C2 > T
gene1^2*C1 + gene2*C2 < T

gene1*gene2*C > T
gene1*gene2*C < T

Mathematically, linear problems are generally much easier to solve than non-linear problems.
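A small sketch contrasting the plain linear rule and the logistic form (the weights are illustrative, and the sign convention inside the sigmoid is chosen so the two rules agree):

```python
import numpy as np

def linear_rule(g1, g2, C1, C2, T):
    return "cancer" if g1 * C1 + g2 * C2 > T else "healthy"

def logistic_rule(g1, g2, C1, C2, T):
    # Sigmoid of the same linear combination; p > 0.5 exactly when the
    # linear combination exceeds T, so the boundary is still a line.
    p = 1.0 / (1.0 + np.exp(-(g1 * C1 + g2 * C2 - T)))
    return "cancer" if p > 0.5 else "healthy"

C1, C2, T = 1.0, 1.0, 0.0  # illustrative values
print(linear_rule(1.5, 2.0, C1, C2, T), logistic_rule(1.5, 2.0, C1, C2, T))
```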

Page 14

Back to our Linear Discriminant

[Figure: the same data with several different lines, each of which properly separates the two classes]

There are actually infinitely many lines that 'properly' divide the points.

Which is the correct one?

Page 15

[Figure: a separating line with its margin – the smallest distance from the line to any point – marked]

One solution (the one that SVMs use):

1. Find a line that has all the data points on the proper side.
2. Of all lines that satisfy (1), find the one that maximizes the 'margin' (the smallest distance between any point and the line).
3. This is called 'constrained optimization' in mathematics (see the sketch below).

[Figure: two candidate separating lines, one with a smaller margin and one with the largest margin]
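As an illustration of steps (1)–(2), a maximum-margin line can be fit with an off-the-shelf SVM implementation; a minimal sketch with scikit-learn on made-up toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-gene data: negative quadrant = healthy (0), positive = cancerous (1).
X = np.array([[-3, -2], [-2, -4], [-4, -1], [2, 3], [3, 1], [4, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear-kernel SVM with a very large C approximates the hard-margin
# 'all points on the proper side' formulation from the slide.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# coef_ and intercept_ give the line w1*x1 + w2*x2 + b = 0; the distance
# from the line to the closest points (the margin) is 1/||w||.
w, b = clf.coef_[0], clf.intercept_[0]
print("line:", w, b, " margin:", 1.0 / np.linalg.norm(w))
```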

Page 16

Obtaining Different 'Lines': Objective Functions

• In general, the line that you end up with depends on some criterion, defined by the 'objective function' (for the SVM, the margin).

• An objective function is chosen by the modeler, and varies depending on exactly what the modeler is trying to achieve or thinks will work well (e.g. margin, posterior probabilities, sum-of-squares error, small weight vector).

• The function usually has a theoretical foundation (e.g. risk minimization, maximum likelihood / Gaussian processes / zero-mean Gaussian noise).

Page 17

What if the data looked like this?

[Figure: a Gene_1 vs. Gene_2 scatter plot in which the Cancerous and Healthy points overlap]

How could we build a suitable line that divides the data nicely?

Depends…

•Is it just a few points that are small ‘outliers’?

•Or is the data simply not amenable to this kind of classification?

Page 18

[Figure: three panels of Cancerous and Healthy points]

Linearly separable data: can make a great classifier.

Almost linearly separable data: a few outliers, but probably can still find a 'good' line.

Not linearly separable data: inherently, the data cannot be separated by any one line.

Page 19

Not linearly separable data: inherently, the data cannot be separated by any one line.

[Figure: overlapping Cancerous and Healthy points, with several lines jointly carving out the class regions]

• If we allow the model to have more than one line (or hyperplane), then maybe we can still form a nice model.

• This is much more complicated.

• This is one thing that neural networks allow us to do: combine linear discriminants together to form a single classifier (no longer a linear classifier) – see the small sketch below.

• No time to delve further during this talk.
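A minimal illustrative sketch (not from the talk) of combining two linear discriminants into a non-linear classifier, assuming the positive class sits in a band between two parallel lines:

```python
import numpy as np

def two_line_classifier(x, w1, b1, w2, b2):
    """Combine two linear discriminants: class 1 only if the point is on
    the positive side of BOTH lines (an AND of two half-planes). The
    combined decision region is no longer a single half-plane, so the
    overall classifier is not linear."""
    h1 = np.dot(w1, x) + b1 > 0
    h2 = np.dot(w2, x) + b2 > 0
    return int(h1 and h2)

# Illustrative weights: a band between the lines x1 + x2 = -1 and x1 + x2 = 1.
w1, b1 = np.array([1.0, 1.0]), 1.0    # positive side: x1 + x2 > -1
w2, b2 = np.array([-1.0, -1.0]), 1.0  # positive side: x1 + x2 < 1
print(two_line_classifier(np.array([0.0, 0.0]), w1, b1, w2, b2))  # 1 (inside band)
print(two_line_classifier(np.array([3.0, 3.0]), w1, b1, w2, b2))  # 0 (outside)
```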

Page 20

[Figure: Cancerous and Healthy points thoroughly intermixed across the plot]

Not linearly separable data.

Now what??

Even with many lines it would be extremely difficult to build a good classifier.

Page 21

Sometimes Need to Transform the Data

Not linearly separable data. Need to transform the coordinates: polar coordinates, Principal Components coordinates, kernel transformation into a higher-dimensional space (support vector machines).

[Figure: after a polar-coordinate transform – angular degree (phase) vs. distance from center (radius) – the same data become linearly separable]
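A small sketch of the polar-coordinate idea on made-up ring-shaped data (the data generation is illustrative):

```python
import numpy as np

def to_polar(X):
    """Map 2D points (x1, x2) to (radius, angle); a ring-shaped class
    becomes a horizontal band in these coordinates, which a straight
    line can then separate."""
    r = np.hypot(X[:, 0], X[:, 1])
    theta = np.arctan2(X[:, 1], X[:, 0])
    return np.column_stack([r, theta])

# Toy data: inner cluster (one class) and a surrounding ring (the other).
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 50)
inner = rng.normal(0, 0.5, (50, 2))
outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)])

# In polar coordinates, a single threshold on the radius separates them.
print(to_polar(inner)[:, 0].max() < to_polar(outer)[:, 0].min())  # True (typically)
```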

Page 22

Caveats

• May need to find a subset of the data that is linearly separable (this is called feature selection).

• Feature selection is what we call, in computer science, an NP-complete problem, which means, in layman's terms: there is no known efficient way to solve it exactly once the number of features is large. Feature selection is an open research problem.

• There is a spate of techniques that give you approximate solutions to feature selection; a small sketch follows below.

• Feature selection is mandatory in microarray expression experiments because there is so much noisy, irrelevant data.

• Also, with microarray data, there is much missing data, which introduces difficulties.
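One common approximate approach is simple univariate filtering: score each gene independently and keep the top k. A hedged sketch with scikit-learn, on random placeholder data standing in for a 5000-gene experiment:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Placeholder data: 20 patients x 5000 genes, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5000))
y = rng.integers(0, 2, size=20)

# Keep the 50 genes whose expression best separates the two classes
# according to a per-gene ANOVA F-test (a greedy approximation, not an
# exact solution to the feature-selection problem).
selector = SelectKBest(f_classif, k=50).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)  # (20, 50)
```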

Page 23

Other Biological Applications

• Gene finding in DNA: the input is part of a DNA strand, the output is whether or not the nucleotide at the centre is inside a gene.

• Sequence-based gene classification: the input is a gene sequence, output is a functional class.

• Protein secondary structure prediction: input is a sequence of amino acids, output is the local secondary structure.

• Protein localization in cell: the input is an amino acid sequence, the output is position in the cell (eg. nucleus, membrane, etc.)

Taken from Introduction to Support Vector Machines and Applications to Computational Biology, Jean-Philippe Vert.

Page 24

Wrap-Up

• Intuitive feel for linear discriminants.
• Widely applicable technique – for many problems in Polyomx and many other areas.
• Difficulties: missing data, feature selection.
• Have used linear discriminants for our SNP data and microarray data.

If interested in knowing more, a great book is: Neural Networks for Pattern Recognition, Christopher Bishop, 1995.

Page 25

Finding the Equation of the Linear Discriminant (How a Single Layer Neural Network Might Do It)

The discriminant function:

y(x) = w^T x + w0

where w is the weight vector; y(x) > 0 on one side of the hyperplane, y(x) < 0 on the other, and y(x) = 0 on the hyperplane itself.

Eg. sum-of-squares error function (more for regression), with t_n in {-1, +1}:

E(w, w0) = sum_{n=1..N} [ t_n - (w^T x_n + w0) ]^2

Minimize the objective function:

1. Exact solution via matrix algebra, since here E is convex.
2. Iterative algorithms (gradient descent, conjugate gradient, Newton's method, etc.) for cases where E may not be convex; these use the gradient ∇E = (∂E/∂w_1, ∂E/∂w_2, …, ∂E/∂w_K).

Can regularize by adding ||w||^2 to E.
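A small sketch of option (2), plain gradient descent on the sum-of-squares error (the toy data, learning rate, and step count are illustrative):

```python
import numpy as np

def fit_linear_discriminant(X, t, lr=0.01, steps=1000, reg=0.0):
    """Minimize E(w, w0) = sum_n (t_n - (w.x_n + w0))^2 by gradient
    descent; reg adds the optional ||w||^2 regularizer from the slide."""
    N, D = X.shape
    w, w0 = np.zeros(D), 0.0
    for _ in range(steps):
        err = t - (X @ w + w0)                        # residuals t_n - y(x_n)
        w += lr * (2 * X.T @ err / N - 2 * reg * w)   # step along -dE/dw (averaged)
        w0 += lr * 2 * err.mean()                     # step along -dE/dw0
    return w, w0

# Toy data: two separable clusters with labels -1 and +1.
X = np.array([[-2.0, -1.5], [-1.0, -2.0], [1.5, 2.0], [2.0, 1.0]])
t = np.array([-1.0, -1.0, 1.0, 1.0])
w, w0 = fit_linear_discriminant(X, t)
print(np.sign(X @ w + w0))  # should match t
```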

Page 26

Finding the Equation of the Linear Discriminant (How an SVM Would Do It)

The discriminant function, with t_n in {-1, +1}:

y(x) = w^T x + w0

The margin is given by the distance from the hyperplane to the closest points; for a point x this distance is |w^T x + w0| / ||w||.

Minimize ||w||^2 subject to the following constraints:

t_i (w^T x_i + w0) - 1 >= 0, for i = 1, …, N

(so every point is on the proper side, at distance at least 1/||w|| from the hyperplane).

Use Lagrange multipliers:

L(w, w0, λ) = ||w||^2 - sum_{i=1..N} λ_i [ t_i (w^T x_i + w0) - 1 ]
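A sketch of this constrained optimization with a general-purpose solver (scipy's SLSQP, on made-up toy data; a real SVM solver would typically work with the Lagrangian dual, but this illustrates the primal formulation directly):

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data with labels t in {-1, +1}.
X = np.array([[-2.0, -1.5], [-1.0, -2.0], [1.5, 2.0], [2.0, 1.0]])
t = np.array([-1.0, -1.0, 1.0, 1.0])

# Parameters packed as [w1, w2, w0]; the objective is ||w||^2.
def objective(p):
    return p[0] ** 2 + p[1] ** 2

# One inequality constraint per point: t_i (w.x_i + w0) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": (lambda p, i=i: t[i] * (X[i] @ p[:2] + p[2]) - 1.0)}
    for i in range(len(X))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, w0 = res.x[:2], res.x[2]
print("w:", w, "w0:", w0, "margin:", 1.0 / np.linalg.norm(w))
```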