(Alignment) Score Statistics

cbio course, spring 2005, Hebrew University

Page 1: (Alignment) Score Statistics

Page 2: Motivation

• Reminder / basic motivation: we want to check whether two sequences are “related” or not.
• We align the two sequences and get a score s which measures how similar they are.
• Given s, do we accept the hypothesis that the two are related, or reject it?

How high should s be so that we “believe” they are related?

Page 3: Motivation (2)

• We need a “rigorous way” to decide on a threshold s* so that for s > s* we call the sequences “related”.
• Note:
  • s* should obviously be s*(n,m), where n and m are the lengths of the two sequences aligned.
  • When we match a sequence x against a database of N (N >> 1) sequences, we need to account for the fact that we might see high scores “just by chance”.
• We can make two kinds of mistakes in our calls: false positives (FP) and false negatives (FN).

→ We want our “rigorous way” to control both FP and FN mistakes.

Page 4: Motivation (3)

• The problem of assigning statistical significance to scores, and of controlling our FP and FN mistakes, is of general interest.
• Examples:
  • Similarity between a protein sequence and a profile HMM
  • Log-ratio scores when searching for DNA sequence motifs
  • ….
• The methods we develop now will be of general use.

Page 5: Reminder

• In the last lesson we talked about two ways to analyze alignment scores and their significance:
  • Bayesian
  • Classical EVD approach
• We reviewed how the number of FP mistakes can be controlled using each of these approaches.
• We reviewed the Karlin & Altschul (1990) results.

Page 6: Review First Approach – Bayesian

Assume we have two states in our world:

M (Model = related sequences)
R (Random = unrelated sequences)

Given a fixed alignment of two sequences (x, y), we ask: “from which state did it come, M or R?”

We saw: (the formula, and the “Where:” definitions of its terms, were not captured in the transcript; a reconstruction follows below)
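A hedged reconstruction of the missing formulas, assuming the standard Bayesian log-odds setup (my assumption, e.g. as in Durbin et al.; the transcript does not confirm the exact form shown):

\[
S(x,y) = \log \frac{P(x,y \mid M)}{P(x,y \mid R)}
\]

where P(x,y | M) and P(x,y | R) are the likelihoods of the aligned pair under the related and random models. By Bayes' rule, P(M | x,y) > P(R | x,y) exactly when S(x,y) > log[ P(R) / P(M) ], so the alignment score acts as the test statistic and the prior odds set the threshold.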

Pages 7-8: (slide content not captured in the transcript)

Page 9: Review Bayesian Approach cont.

• We saw that in order to control the expected number of false identifications when testing scores that came from R, we need the threshold S* over the scores to satisfy S* ~ log(number of trials * K).

Where:

• The number of trials for scoring a sequence of length m in local alignment against N sequences of length n is nmN.
• K in [0,1] is a correlation factor compensating for the fact that the trials are correlated.
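To make the log relation concrete, a hedged sketch using the Karlin & Altschul form reviewed last lesson (the slide's own algebra was not captured): the expected number of chance hits with score at least S over nmN trials is

\[
E(S) = K\,nmN\,e^{-\lambda S} .
\]

Requiring E(S^*) \le \varepsilon and solving gives

\[
S^* = \frac{1}{\lambda}\,\log\frac{K\,nmN}{\varepsilon},
\]

i.e. S* grows like log(number of trials * K), as stated above.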

Page 10: Review EVD Approach

• In the EVD approach we are interested in the question: “given a score s for aligning x and y, if this s came from a distribution of scores for unrelated sequences (like R in the Bayesian approach), what is the probability of seeing a score as good as s by chance, simply because I tried so many matches of sequences against x?”
• R here is the null hypothesis we are testing against.
• If P(score >= s | we tried N scores) < threshold (say 0.01), then we “reject” the null hypothesis (R).
• NOTE:
  • There is no “second” hypothesis here.
  • We are guarding against type I errors (FP).
  • No control or assumptions are made about FN here!
  • This setting is appropriate for the problem we have at hand (database search).

Pages 11-12: (slide content not captured in the transcript)

Page 13: Toy Problem

• Let s, t be two randomly chosen DNA sequences of length n, sampled from the uniform distribution over the DNA alphabet.
• Align s versus t with no gaps (i.e. s[1] is aligned to t[1], and so on until s[n] is aligned to t[n]).
• What is the probability that there are k matches (not necessarily contiguous ones) between s and t?
• Suppose you are a researcher and you have two main hypotheses: either these two sequences are totally unrelated, or there was a common ancestor to both of them (there is no indel option here).
• How would you use the number of matches to decide between the two options, and attach a statistical confidence measure to this decision? (A worked answer to the probability question follows below.)
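A worked answer to the probability question above (a standard derivation; the slide's own is not in the transcript): under the uniform model each position matches independently with probability 1/4, so the number of matches K is binomial,

\[
P(K = k) = \binom{n}{k}\left(\frac{1}{4}\right)^{k}\left(\frac{3}{4}\right)^{n-k},
\qquad
\mathrm{pvalue}(k) = P(K \ge k) = \sum_{j=k}^{n} \binom{n}{j}\left(\frac{1}{4}\right)^{j}\left(\frac{3}{4}\right)^{n-j}.
\]

Deciding between the two hypotheses then amounts to rejecting the “unrelated” model when this p-value falls below a chosen significance level.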

Page 14:

[Figure: empirical distribution over scores, for p = 0.25, using M = 100K samples; curves for n = 100 and n = 20; the p-value for score = 30 and n = 100 is marked.]

NOTE: As in our “real” problem, the p-value of a score depends on n.
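A minimal Python sketch of how such an empirical distribution can be produced (my reconstruction, not the course's code; with a delta scoring matrix the score is just the match count, so a binomial sampler suffices):

    import numpy as np

    def empirical_scores(n, p=0.25, M=100_000, seed=0):
        """Sample M ungapped-alignment scores for two length-n sequences.
        Each position matches with probability p, so the score
        (the number of matches) is Binomial(n, p)."""
        rng = np.random.default_rng(seed)
        return rng.binomial(n, p, size=M)

    def empirical_pvalue(scores, s):
        """Fraction of the sampled scores that are >= s."""
        return float(np.mean(scores >= s))

    scores = empirical_scores(n=100)
    print(empirical_pvalue(scores, 30))   # p-value for score = 30, n = 100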

Page 15: EVD for our problem

In the EVD approach, we are interested in the question:

“What is the probability of seeing a score as good as S*, only from matches to non-related sequences, if I tried N such matches?”

Compute:

P{ Max(S1 … SN) >= S* } = 1 - [1 - pvalue(S*)]^N

If we want to guarantee that P{ Max(S1 … SN) >= S* } < 0.05, where the Si are scores of matches against non-related sequences sampled i.i.d., then:

[1 - pvalue(S*)]^N > 0.95

i.e.

pvalue(S*) < 1 - 0.95^(1/N)

Page 16:

[Figure: the original distribution over scores compared to its EVD with N = 10 (original's parameters: n = 100, p = 0.25, M = 100K); x-axis: Score, y-axis: Prob.]
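A hedged sketch of how the EVD curve in this figure can be reproduced, reusing empirical_scores() from the earlier snippet: each EVD sample is the maximum over N = 10 i.i.d. scores.

    N = 10
    raw = empirical_scores(n=100, M=100_000 * N)
    evd = raw.reshape(-1, N).max(axis=1)   # 100K samples of Max(S1 ... SN)
    # np.bincount(evd) / len(evd) gives the EVD histogram, to compare
    # against np.bincount(raw) / len(raw), the original distribution.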

Page 17: Guarding against mistakes & evaluating performance

• In the EVD approach we kept guarding against FP mistakes. This is very important in tasks where many tests are performed, as in our case of database search.
• Sometimes we are not able to compute the EVD and still want to control the FPR. A very strict and simple solution is the “Bonferroni corrected p-value” = pvalue * N, where N is the number of tests performed.

Note: the relation to the “union bound” is clear (spelled out below).

Problem: Bonferroni controls the FWER (family-wise error rate), i.e. the probability of seeing even one mistake (a FP) among the results we report as significant. It does so with basically no assumptions on the distribution or on the relations between the hypotheses tested, and still guarantees control over the FWER. The price to pay is in FN…
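Spelling out the union-bound remark (a standard step, not written on the slide): if each of the N null hypotheses is tested at level alpha/N, then

\[
P\Big(\bigcup_{i=1}^{N}\{p_i \le \alpha/N\}\Big)
\;\le\; \sum_{i=1}^{N} P\big(p_i \le \alpha/N\big)
\;\le\; N\cdot\frac{\alpha}{N} \;=\; \alpha,
\]

with no assumptions on the dependence between the tests; this is exactly the FWER guarantee described above.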

Page 18: Bonf. example on our case

• We saw: if we want to guarantee P{ Max(S1 … SN) >= S* } < 0.05, where the Si are scores of matches against non-related sequences sampled i.i.d., then:

[1 - pvalue(S*)]^N > 0.95

i.e.

pvalue(S*) < 1 - 0.95^(1/N)

• Compare, for N = 10, the result of this equation, 0.005116, to the Bonferroni-corrected p-value threshold: 0.05/N = 0.005.
• For N = 20: 0.00256 vs. 0.0025.
• etc…

If we used the strict Bonferroni threshold for the same guarantee level we wanted, we might have rejected some “good” results.
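A quick numeric check of the numbers on this slide (a sketch, not course code):

    for N in (10, 20):
        exact = 1 - 0.95 ** (1 / N)   # threshold from [1 - pvalue(S*)]^N > 0.95
        bonf = 0.05 / N               # Bonferroni-corrected threshold
        print(N, round(exact, 6), bonf)
    # -> 10 0.005116 0.005
    # -> 20 0.002561 0.0025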

Page 19: How to estimate performance?

• Say you have a method with a score (in our case: “method” = scoring a local alignment with affine gaps and a scoring matrix Sigma, e.g. Sigma = PAM1).
• You set a threshold over the scores based on some criterion (e.g. EVD estimation of the scores in random matches).
• You want to evaluate your method's performance on some “test” data set. The data set would typically contain some true and some false examples.
• Assumption: you KNOW the answers for this test set!

Aim: you want to see the tradeoff you get from using various thresholds on the scores, in terms of FP and FN on the data set.

Page 20: ROC curves

• ROC = Receiver Operating Characteristic curve

FPR = False Positive Rate = empirical p-value = FP / (FP + TN) = FP / (“real negatives”) = “what ratio of the bad ones we pass”

Sensitivity = TP / (TP + FN) = “what ratio of the true ones we capture”

[Figure: schematic ROC curve; both axes run from 0% to 100%; the “Best Performance” corner is marked.]

Page 21: ROC curves

• NOTE: each point on the ROC curve corresponds to a certain threshold over the method's scores.
• Each method gives a different curve.
• We can now compare the methods' performance:
  • at a certain point on the graph
  • via the total area under the graph

[Figure: the same ROC sketch as on the previous slide, with a comparison point marked at FPR = 2%.]
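A minimal sketch of computing the ROC points from scores and known test-set labels (my illustration; the function name is hypothetical):

    import numpy as np

    def roc_points(scores, labels):
        """FPR and sensitivity for every possible score threshold.
        labels: 1 for true (related) examples, 0 for false (unrelated)."""
        order = np.argsort(-np.asarray(scores))    # sort descending by score
        labels = np.asarray(labels)[order]
        tp = np.cumsum(labels)                     # TP above each threshold
        fp = np.cumsum(1 - labels)                 # FP above each threshold
        sensitivity = tp / labels.sum()            # TP / (TP + FN)
        fpr = fp / (len(labels) - labels.sum())    # FP / (FP + TN)
        return fpr, sensitivity

    # Comparing methods by area under the curve:
    # fpr, sens = roc_points(scores, labels); auc = np.trapz(sens, fpr)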

Page 22: Back to our Toy Problem

[Figure: distribution over scores, p = 0.25 vs. p = 0.4, M = 100K; x-axis: Score, y-axis: Prob.]

Assume the data we need to handle came from two sources, as in the Bayesian approach:

R – unrelated sequences, p(a,a) = 0.25

M – related sequences, p(a,a) = 0.4, p(a,b) = 0.2

Delta scoring matrix, i.e. S(a,a) = 1, S(a,b) = 0.
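Extending the earlier sampler to this two-source setup (parameter values from the slide; with the delta matrix the score is again the match count):

    scores_R = empirical_scores(n=100, p=0.25)   # unrelated pairs
    scores_M = empirical_scores(n=100, p=0.40)   # related pairs
    # Plotting both histograms reproduces the figure above; pooling the
    # scores with their true labels and calling roc_points() connects
    # this graph to the ROC question on the next slide.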

Page 23: Finish with a Thought…

• In our toy problem, what is the relation between the graph on the last slide and the ROC curve we talked about?
• How does the relative number of samples from M and R in our data set affect the ROC? What should the total distribution over the scores look like?