better than chance: the importance of null models
TRANSCRIPT
![Page 1: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/1.jpg)
Better than Chance:the importance of null models
Kevin Karplus
Biomolecular Engineering Department
Undergraduate and Graduate Director, Bioinformatics
University of California, Santa Cruz
Better than Chance – p.1/33
![Page 2: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/2.jpg)
Outline of Talk
What is a null model (or null hypothesis) for?
Example 1: is a conserved ORF a protein?
Example 2: is residue-residue contact prediction betterthan chance?
Example 3: how should we remove composition biasesin HMM searches?
Better than Chance – p.2/33
![Page 3: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/3.jpg)
Scoring (Bayesian view)
Model M is a computable function that assigns aprobability P (A | M) to each sequence A.
When given a sequence A, we want to know how likelythe model is. That is, we want to compute something likeP (M | A).
Bayes Rule:
P(M
∣∣∣ A)
= P(A
∣∣∣ M) P(M)
P(A).
Problem: P(A) and P(M) are inherently unknowable.
Better than Chance – p.3/33
![Page 4: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/4.jpg)
Null models
Standard solution: ask how much more likely M is than somenull hypothesis (represented by a null model):
P (M | A)
P (N | A)=
P“A|M”
P“A|N” P(M)
P(N).
↑ ↑ ↑posterior odds likelihood ratio prior odds
Better than Chance – p.4/33
![Page 5: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/5.jpg)
Test your hypothesis
Thanks to Larry Gonick The Cartoon Guide to Statistics
Better than Chance – p.5/33
![Page 6: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/6.jpg)
Scoring (frequentist view)
We believe in models when they give a large score to ourobserved data.
Statistical tests (p-values or E-values) quantify how oftenwe should expect to see such good scores “by chance”.
These tests are based on a null model or null hypothesis.
Better than Chance – p.6/33
![Page 7: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/7.jpg)
Small p-value to reject null hypothesis
Thanks to Larry Gonick The Cartoon Guide to Statistics
Better than Chance – p.7/33
![Page 8: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/8.jpg)
Statistical Significance (2 approaches)
Markov’s inequality For any scoring scheme that uses
lnP (seq | M)
P (seq | N)
the probability of a score better than T is less than e−T forsequences distributed according to N .
Parameter fitting For “random” sequences drawn from somedistribution other than N , we can fit a parameterizedfamily of distributions to scores from a random sample,then compute P and E values.
Better than Chance – p.8/33
![Page 9: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/9.jpg)
Null models
P-values (and E-values) often tell us nothing about howgood our hypothesis is.
What they tell us is how bad our null model (nullhypothesis) is at explaining the data.
A badly chosen null model can make a very wronghypothesis look good.
Better than Chance – p.9/33
![Page 10: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/10.jpg)
Example 1: long ORF
A colleague found an ORF in an archæal genome thatwas 388 codons long and was wondering if it coded for aprotein and what the protein’s structure was.
We know that short ORFs can appear “by chance”.
So how likely is this ORF to be a chance event?
Better than Chance – p.10/33
![Page 11: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/11.jpg)
Null Model 1
DNA is undergoing no selection at all
G+C content bias. (GC is 36.7%, AT is 63.3%.)
Probability of stop codonTAG= 0.3165*0.3165*0.1835=0.0184, TGA=0.0184,TAA=0.0317, so p(STOP)=0.0685.
P(388 codons without stop) = (1 − p(STOP))388 = 1.1e-12
E-value in a 3 Megabase genome is about 3.3e-6.
We can easily reject the null hypothesis!
Better than Chance – p.11/33
![Page 12: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/12.jpg)
Null Model 2
I forgot to tell you: this ORF is on the opposite strand of aknown 560-codon ribosomal gene.
What is the probability of this long an ORF, on oppositestrand of known gene?
Generative model: simulate random codons using thecodon bias of the organism, take reverse complement,and see how often ORFs 388-long or longer appear.
Taking 100,000 samples, we get estimates of P-value inthe range 3e-05 to 6e-05.
There are about 3000 genes, giving us an E-value of 0.09to 0.18.
Better than Chance – p.12/33
![Page 13: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/13.jpg)
Null Model 3
We can do the same sort of simulation, but restrict thecodons to ones that would code for exactly the sameprotein on the forward strand.
Now we get P-value of around 0.01 for long ORFs on thereverse strand of genes coding for this protein.
Better than Chance – p.13/33
![Page 14: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/14.jpg)
Protein or chance ORF?
Thanks to Larry Gonick The Cartoon Guide to Statistics
Better than Chance – p.14/33
![Page 15: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/15.jpg)
Not a protein
+ A tblastn search with the sequence revealed similar ORFsin many genomes.
− All are on opposite strand of homologs of same gene.
− “Homologs” found by tblastn often include stop codons.
− There is no evidence for a TATA box upstream of the ORF.
− No strong evidence for selection beyond that explained byknown gene.
Conclusion: it is rather unlikely that this ORF encodes aprotein.
Better than Chance – p.15/33
![Page 16: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/16.jpg)
Example 1b: another ORF
pae0037: ORF, but probably not protein gene inPyrobaculum aerophilum
chr:--->
JGI genes
PAE0037
16850 16900 16950 17000 17050 17100 17150 17200
GC Percent in 20 Base Windows
Genbank RefSeq Gene Annotations
Arkin Lab Operon PredictionsGene annotation from JGI
Alternative ORFs noted by Sorel Fitz-Gibbon
Log-odds scan for promoters on plus strand (16 base window)
Log-odds scan for promoters on minus strand (16 base window)
Poly-T Terminators plus strand (7 nt window)
Poly-T Terminators minus strand (7 nt window)
PAE0039
GC Percent
Promoter +
Promoter -
Poly-T term (+)
Poly-T term (-)
Promoter on wrong side of ORF.
High GC content (need local, not global, null)
Strong RNA secondary structure.
Better than Chance – p.16/33
![Page 17: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/17.jpg)
Example 2: contacts
Is residue-residue contact prediction better than chance?
Early predictors (1994) reported results that were 1.4 to5.1 times “better than chance” on a sample of 11 proteins.
But they used a uniform null model:
P(residue i contacts residue j) = constant .
A better null model:
P (residue i contacts residue j) =
P(
contact∣∣∣ separation = |i − j|
).
Better than Chance – p.17/33
![Page 18: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/18.jpg)
P(contact|separation)
Using CASP definition of contact, CB within 8 Å, CA for GLY.
0.001
0.01
0.1
1
1 10 100
actu
al c
onta
cts
/ pos
sibl
e co
ntac
ts
separation
dunbrack-40pc-3157-CB8
Better than Chance – p.18/33
![Page 19: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/19.jpg)
Can get accuracy of 100%
By ignoring chain separations, the early predictors gotwhat sounded like good accuracy (0.37–0.68 for L/5predicted contacts)
But just predicting that i and i + 1 are in contact wouldhave gotten accuracy of 1.0 for even more predictions.
More recent work has excluded small-separation pairs,with different authors choosing different thresholds.
CASP uses separation ≥ 6, ≥ 12, and ≥ 24, with mostfocus on ≥ 24.
Better than Chance – p.19/33
![Page 20: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/20.jpg)
Evaluating contact prediction
Two measures of contact prediction:
Accuracy: ∑χ(i, j)∑
1
Weighted accuracy:
∑ χ(i,j)
P“contact|separation=|i−j|
”∑
1
= 1 if predictions no better than chance, independent ofseparations for predicted pairs.
Better than Chance – p.20/33
![Page 21: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/21.jpg)
Separation as predictor
If we predict all pairs with given separation as in contact, wedo much better than uniform model.
sep P“
contact˛̨˛ |i − j| = sep
”P
“contact
˛̨˛ |i − j| ≥ sep
”“better than chance”
6 0.0751 0.0147 4.96
9 0.0486 0.0142 3.42
12 0.0424 0.0136 3.13
24 0.0400 0.0116 3.46
Better than Chance – p.21/33
![Page 22: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/22.jpg)
Now doing better
separation ≥ 9
contacts/residue taken separately for each protein
0
0.1
0.2
0.3
0.4
0.5
0.6
0.1 1
true
pos
itive
s / p
redi
ctio
ns
predictions / residue
accuracy
CASP7 neural netNN(sep,mi e-value,propensity)
mi e-valuepropensity (wt)
0
5
10
15
20
25
0.1 1
wei
ghte
d po
sitiv
es /
pred
ictio
ns
predictions / residue
weighted accuracy
CASP7 neural netNN(sep,mi e-value,propensity)
mi e-valuepropensity
Better than Chance – p.22/33
![Page 23: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/23.jpg)
Contacts per residue
We can also use our null model to predict the number ofcontacts per residue (which is not a constant).
0
0.5
1
1.5
2
2.5
3
3.5
10 100 1000
cont
acts
/res
idue
number of residues
contacts with separation>=9
Better than Chance – p.23/33
![Page 24: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/24.jpg)
Example 3: HMM
Hidden Markov models assign a probability to eachsequence in a protein family.
A common task is to choose which of several proteinfamilies (represented by different HMMs) a proteinbelongs to.
Better than Chance – p.24/33
![Page 25: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/25.jpg)
Standard Null Model
Null model is an i.i.d (independent, identically distributed)model.
P(A
∣∣∣ N, len (A))
=
len(A)∏i=1
P(Ai) .
P(A
∣∣∣ N)
= P(sequence of length len (A))
len(A)∏i=1
P(Ai) .
Better than Chance – p.25/33
![Page 26: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/26.jpg)
Composition as source of error
When using the standard null model, certain sequencesand HMMs have anomalous behavior. Many of theproblems are due to unusual composition—a largenumber of some usually rare amino acid.
For example, metallothionein, with 24 cysteines in only 61total amino acids, scores well on any model with multiplehighly conserved cysteines.
Better than Chance – p.26/33
![Page 27: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/27.jpg)
Composition examples
Metallothionein Isoform II (4mt2)
Kistrin (1kst)
Better than Chance – p.27/33
![Page 28: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/28.jpg)
Composition examples
Kistrin (1kst)
Trypsin-binding domain of Bowman-Birk Inhibitor (1tabI)
Better than Chance – p.28/33
![Page 29: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/29.jpg)
Reversed model for null
We avoid this (and several other problems) by using areversed model M r as the null model.
The probability of a sequence in M r is exactly the sameas the probability of the reversal of the sequence givenM .
This method corrects for composition biases, lengthbiases, and several subtler biases.
Better than Chance – p.29/33
![Page 30: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/30.jpg)
Helix examples
Tropomyosin (2tmaA)
Colicin Ia (1cii)
Flavodoxin mutant (1vsgA)
Better than Chance – p.30/33
![Page 31: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/31.jpg)
Improvement from reversed model
1
10
100
1000
150 200 250 300 350 400 450
Fal
se P
ositi
ves
True Positives
SCOP whole chains
without reversed-model scoringwith reversed-model scoring
Better than Chance – p.31/33
![Page 32: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/32.jpg)
Take-home messages
Base your null models on biologically meaningful nullhypotheses, not just computationally convenient math.
Generative models and simulation can be useful for morecomplicated models.
Picking the right model remains more art than science.
Better than Chance – p.32/33
![Page 33: Better than Chance: the importance of null models](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fadd00337bd920db596d53/html5/thumbnails/33.jpg)
Web sites
List of my papers: http://www.soe.ucsc.edu/˜karplus/papers/
These slides: http://www.soe.ucsc.edu/˜karplus/papers/
better-than-chance-jul-07.pdf
Reverse-sequence null: Calibrating E-values for hidden Markov models with
reverse-sequence null models. Bioinformatics, 2005. 21(22):4107–4115;
doi:10.1093/bioinformatics/bti629
Archæal genome browser: http://archaea.ucsc.edu
UCSC bioinformatics (research and degree programs) info:
http://www.soe.ucsc.edu/research/compbio/
Better than Chance – p.33/33