biol335: homology search
DESCRIPTION
Course material for: http://www.canterbury.ac.nz/courseinfo/GetCourseDetails.aspx?course=BIOL335
TRANSCRIPT
Homology Search
Paul Gardner
March 24, 2015
Paul Gardner Homology Search
News & Views reminder (20% of your course grade; due March 26, reviewed April 2 (5/20), revisions April 28 (15/20))
- Meredith et al. (2014) Evidence for a single loss of mineralized teeth in the common avian ancestor. Science
- Nunez et al. (2015) Integrase-mediated spacer acquisition during CRISPR-Cas adaptive immunity. Nature
Homology search
- In a huge collection of biological sequences, how can you locate similar sequences?
- By using heuristic, super-fast sequence alignment methods
BLAST
- Identify all 'hits' (word matches) of length at least W
- Find any hits that lie on the same diagonal of an alignment matrix
- Trigger a full alignment in that region
Basic idea: identify near-identical sub-sequences first → align any hits in full
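The seeding steps above can be sketched in a few lines of Python. This is a simplified illustration and not BLAST's actual implementation: it uses exact word matches only, and the function names (find_word_hits, group_by_diagonal, candidate_diagonals) and the min_hits parameter are hypothetical choices made for this example.

```python
from collections import defaultdict


def find_word_hits(query, subject, w):
    """Return (query_pos, subject_pos) pairs where an identical word of length w starts."""
    # Index every word of length w in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    # Scan the subject and look each of its words up in the query index.
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


def group_by_diagonal(hits):
    """Group hits by diagonal; hits with the same (subject_pos - query_pos) share a diagonal."""
    diagonals = defaultdict(list)
    for i, j in hits:
        diagonals[j - i].append((i, j))
    return dict(diagonals)


def candidate_diagonals(hits, min_hits=2):
    """Diagonals with at least min_hits word hits; these would trigger a full alignment."""
    grouped = group_by_diagonal(hits)
    return sorted(d for d, members in grouped.items() if len(members) >= min_hits)


if __name__ == "__main__":
    query = "ACGTACGTTT"
    subject = "GGACGTACGAAA"
    hits = find_word_hits(query, subject, w=4)
    print(candidate_diagonals(hits))
```

Only the regions around the reported diagonals would then be passed to a full (and much slower) dynamic-programming alignment, which is where the speed-up comes from.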
What does that E-value (Expect) mean?
>gb|CP001191.1| Rhizobium leguminosarum bv. trifolii WSM2304, complete genome
Length=4537948
Features in this part of subject sequence:
cold-shock DNA-binding domain protein
Score = 57.2 bits (62), Expect = 2e-05
Identities = 78/106 (74%), Gaps = 6/106 (6%)
Strand=Plus/Plus
Query  1       CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC
               || |||||||| ||||||||| ||||||    |  | | || |||| ||||     ||||
Sbjct  828507  CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC

Query  57      TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG
                | | || |||||| ||| ||||||||||| ||||||  ||| |||
Sbjct  828567  ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG
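The Identities and Gaps figures in the report can be recomputed directly from the two aligned rows. The sketch below is an illustration only; the function name alignment_stats is hypothetical, and the two strings are the Query and Sbjct rows of the hit above joined end to end.

```python
def alignment_stats(query_aln, subject_aln):
    """Count identical columns, gapped columns and total columns of a pairwise alignment."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have the same length")
    columns = len(query_aln)
    # A column is an identity when both rows carry the same residue (not a gap).
    identities = sum(q == s and q != "-" for q, s in zip(query_aln, subject_aln))
    # A column is a gap column when either row has a gap character.
    gaps = sum(q == "-" or s == "-" for q, s in zip(query_aln, subject_aln))
    return identities, gaps, columns


query_aln = (
    "CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC"
    "TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG"
)
subject_aln = (
    "CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC"
    "ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG"
)

identities, gaps, columns = alignment_stats(query_aln, subject_aln)
print(f"Identities = {identities}/{columns} ({round(100 * identities / columns)}%)")
print(f"Gaps = {gaps}/{columns} ({round(100 * gaps / columns)}%)")
```

Note that these counts describe the alignment itself; they say nothing yet about whether a hit of this quality could have arisen by chance.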
How can we evaluate the significance of a score?
- Note that a bit-score of 57.2 by itself is not that useful.
- Its significance depends on the sequence and database size and composition.
- To account for this we can compute an Expect-value (E-value).
- This is the number of hits expected by chance with at least the observed score, for the given query and database sizes.
- P-values can also be used
[Figure: Separating true from false hits. Number of matches (0 to 10000) plotted against score (bits, 0 to 700) for random sequences/negative controls and true homologs/positive controls. A score threshold divides the hits into true positives, false positives, true negatives and false negatives.]
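The four regions of the figure can be counted for any threshold once scores are available for known positives and negatives. This is a minimal sketch: the function name confusion_counts and the example scores are invented for illustration and are not data from the lecture.

```python
def confusion_counts(positive_scores, negative_scores, threshold):
    """Classify scored hits against a bit-score threshold.

    positive_scores: scores of true homologs (positive controls)
    negative_scores: scores of random sequences (negative controls)
    """
    tp = sum(s >= threshold for s in positive_scores)  # homologs above threshold
    fn = len(positive_scores) - tp                     # homologs missed
    fp = sum(s >= threshold for s in negative_scores)  # random hits above threshold
    tn = len(negative_scores) - fp                     # random hits correctly rejected
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


if __name__ == "__main__":
    # Hypothetical bit scores, for illustration only.
    positives = [120, 80, 45, 300]
    negatives = [20, 35, 50, 10]
    for threshold in (40, 60):
        print(threshold, confusion_counts(positives, negatives, threshold))
```

Raising the threshold trades false positives for false negatives, which is the compromise the figure illustrates.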
How can we evaluate the significance of a score?
E = \kappa M N \, 2^{-\lambda x}
E: E-value
M & N: query & database size
\kappa & \lambda: fitting parameters
x: score (bits)
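As a worked example, the formula can be evaluated directly. This is a sketch under stated assumptions: the function name e_value is hypothetical, and the values used for kappa and lambda are placeholders, since in practice they are fitted to the scoring system.

```python
def e_value(score, query_size, database_size, kappa, lam):
    """E = kappa * M * N * 2^(-lambda * x) for a bit score x."""
    return kappa * query_size * database_size * 2.0 ** (-lam * score)


if __name__ == "__main__":
    # Placeholder parameters (kappa = lambda = 1), chosen only to keep the arithmetic simple.
    for n in (1_000_000, 2_000_000):
        print(n, e_value(score=30.0, query_size=100, database_size=n, kappa=1.0, lam=1.0))
```

Two properties follow from the form of the equation: doubling the database size doubles the E-value for the same score, and each additional 1/lambda bits of score halves it.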
BLAST is not the only, or the best, tool for the job!
Profile-based homology search
Krogh, A. et al. (1994) Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol.
Image provided by Eric Nawrocki.
Profile-based homology search – scoring sequences
Image provided by Eric Nawrocki.
Profile HMMs are slightly more complicated
- A tree-weighting scheme takes care of unbalanced alignments
- Dirichlet-mixture priors are used to incorporate information about amino-acid biochemistry
- Effective sequence number is used to down-weight priors when many sequences are available
- Transition probabilities to Insert & Delete states are estimated from the alignment
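To make the role of priors concrete, here is a deliberately simplified profile: per-column log-odds scores with a uniform pseudocount. It is not a profile HMM. It has no Insert or Delete states and no sequence weighting, and the uniform pseudocount is a crude stand-in for the Dirichlet-mixture priors mentioned above. The function names, the four-letter alphabet and the toy alignment are assumptions made for illustration.

```python
import math


def build_profile(alignment, alphabet="ACGT", pseudocount=1.0):
    """Build per-column log2-odds scores from an ungapped alignment.

    Each column's counts start at `pseudocount` (a simple uniform prior),
    so residues never seen in a column still get a finite score.
    """
    background = 1.0 / len(alphabet)  # assumed uniform background frequencies
    profile = []
    for col in range(len(alignment[0])):
        counts = {a: pseudocount for a in alphabet}
        for seq in alignment:
            if seq[col] in counts:
                counts[seq[col]] += 1.0
        total = sum(counts.values())
        profile.append({a: math.log2((counts[a] / total) / background) for a in alphabet})
    return profile


def score_sequence(profile, seq):
    """Sum the column scores of an ungapped sequence of the same length as the profile."""
    return sum(column[residue] for column, residue in zip(profile, seq))


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGT", "ACGA"]  # hypothetical seed alignment
    profile = build_profile(toy_alignment)
    print(score_sequence(profile, "ACGT"), score_sequence(profile, "TTTT"))
```

With only three sequences the pseudocounts carry a lot of weight; as more sequences are added the observed counts dominate, which is the effect the effective sequence number is meant to control.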
Why not just use BLAST?
- ACCURACY!
  - Every benchmark of homology search tools has shown that profile methods are more accurate than single-sequence methods.
Eddy (2011) Accelerated Profile HMM Searches. PLoS Computational Biology.
Why not just use BLAST?
- SPEED! To search a single query vs a database of all proteins:
  - BLAST: searches 42 million UniProt sequences
  - HMMER: searches 15,000 Pfam profiles
- The search space is ~3,000x smaller for profiles
- Save Planet Earth, use HMMER3
Eddy (2011) Accelerated Profile HMM Searches. PLoS Computational Biology.
Pfam
What is a Pfam-A Entry?
[Figure: a Pfam-A entry. Files: SEED, HMM, OUTPUT, ALIGN, DESC. Programs: hmmbuild, hmmsearch, hmmalign.]
Slide borrowed from Rob Finn.
But, what about RNA?
[Figure: three RNA secondary-structure diagrams, each drawn 5' to 3' with a sequence conservation scale from 0 to 1.]
Covariance models
Nawrocki & Eddy (2007) Query-Dependent Banding (QDB) for Faster RNA Similarity Searches. PLoS Computational Biology.
Benchmark
Freyhult, Bollback & Gardner (2007) Exploring genomic dark matter: A critical assessment of the performance of homology search methods on noncoding RNA. Genome Research.
Rfam
Relevant reading
- Reviews:
  - Eddy SR (2004) What is a hidden Markov model? Nature Biotechnology.
- Methods:
  - Altschul SF et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research.
  - Eddy (2011) Accelerated Profile HMM Searches. PLoS Computational Biology.
The End