biol335: homology search

Homology Search

Paul Gardner

March 24, 2015


News & Views reminder (20% of your course grade; due March 26, reviewed April 2 (5/20), revisions April 28 (15/20))

- Meredith et al. (2014) Evidence for a single loss of mineralized teeth in the common avian ancestor. Science
- Nunez et al. (2015) Integrase-mediated spacer acquisition during CRISPR-Cas adaptive immunity. Nature


Homology search

- In a huge collection of biological sequences, how can you locate similar sequences?
- By using heuristic, super-fast sequence alignment methods




BLAST

- Identify all 'hits' (word matches) at least W long
- Find any hits on the same diagonal of an alignment matrix
- Trigger a full alignment in that region

Basic idea: identify near-identical sub-sequences first → align any hits in full
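The three steps above can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: it uses exact word matches only, the function names and toy sequences are made up for this example, and the final "full alignment" step is left out.

```python
from collections import defaultdict


def find_seed_hits(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a word of length w matches exactly."""
    # Index every word of length w in the subject sequence.
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    # Look up every query word in that index.
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits


def group_by_diagonal(hits):
    """Group hits by diagonal (subject_pos - query_pos) of the alignment matrix."""
    diagonals = defaultdict(list)
    for i, j in hits:
        diagonals[j - i].append((i, j))
    return dict(diagonals)


# Toy sequences (hypothetical): the subject is the query with two extra bases at each end.
query = "GATTACAGATTACA"
subject = "TTGATTACAGATTACATT"

hits = find_seed_hits(query, subject, w=7)
diagonals = group_by_diagonal(hits)

# The diagonal carrying the most hits is the region worth aligning in full.
best_diagonal = max(diagonals, key=lambda d: len(diagonals[d]))
print(best_diagonal, len(diagonals[best_diagonal]))
```

Several hits falling on one diagonal mean the two sequences match at a constant offset, which is the signal used to trigger the slower full alignment.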


What does that E-value (Expect) mean?

>gb|CP001191.1| Rhizobium leguminosarum bv. trifolii WSM2304, complete genome

Length=4537948

Features in this part of subject sequence:

cold-shock DNA-binding domain protein

Score = 57.2 bits (62), Expect = 2e-05

Identities = 78/106 (74%), Gaps = 6/106 (6%)

Strand=Plus/Plus

Query 1       CTTCGTCAGATTTCCTCTCAATATCGATCATACCGGACTGATATTCGTCCGG----GAAC
              || |||||||| ||||||||| ||||||    |  | | || |||| ||||     ||||
Sbjct 828507  CTCCGTCAGATATCCTCTCAACATCGATACGGCTTGTCGGACATTCTTCCGCAGGCGAAC

Query 57      TCTAGCGATTGAAA-GGAAATCGTTATGAACTCAGGCACCGTAAAG
               | | || |||||| ||| ||||||||||| ||||||  ||| |||
Sbjct 828567  ACAA-CGGTTGAAAAGGAGATCGTTATGAATTCAGGCGTCGTCAAG


How can we evaluate the significance of a score?

- Note that a bit score of 57.2 by itself is not that useful.
  - It depends on the sequence and database size and composition.
  - To counter this we can compute an Expect value (E-value).
- The E-value is the expected number of chance hits scoring at least as well as the observed score, for the given query and database sizes.
- P-values can also be used.
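E-values and P-values are closely related. A minimal sketch, assuming the number of chance hits follows a Poisson distribution with mean E (the assumption used in BLAST statistics), so that the probability of at least one chance hit is P = 1 - exp(-E). The function name is made up for this example.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed chance hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-e_value)


# E-value taken from the BLAST hit shown earlier (Expect = 2e-05).
p_small = p_value_from_e_value(2e-05)
# A much weaker hit, for comparison.
p_large = p_value_from_e_value(1.0)

print(p_small, p_large)
```

For small E the two numbers are almost identical; they only diverge once E approaches 1 or more.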

[Figure: "Separating true from false hits". Number of matches plotted against score (bits) for random sequences/negative controls and for true homologs/positive controls. A score threshold splits the hits into true positives, false positives, true negatives and false negatives.]
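The four outcomes in the figure can be counted directly once a threshold is chosen. The sketch below uses made-up scores for a handful of positive and negative controls; the function name and all numbers are illustrative only.

```python
def confusion_counts(positive_scores, negative_scores, threshold):
    """Count the four outcomes when every hit scoring >= threshold is accepted."""
    tp = sum(s >= threshold for s in positive_scores)  # true homologs accepted
    fn = len(positive_scores) - tp                     # true homologs missed
    fp = sum(s >= threshold for s in negative_scores)  # random sequences accepted
    tn = len(negative_scores) - fp                     # random sequences rejected
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


# Hypothetical bit scores for five true homologs and five random sequences.
positives = [35, 60, 120, 300, 450]
negatives = [10, 20, 25, 40, 55]

counts = confusion_counts(positives, negatives, threshold=50)
print(counts)
```

Raising the threshold trades false positives for false negatives, which is why the choice depends on how costly each kind of error is.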



$E = \kappa M N \, 2^{-\lambda x}$

where E is the E-value, x is the score, M and N are the query and database sizes, and κ and λ are fitting parameters.
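A short sketch of how the E-value formula above behaves. The default values of kappa and lam (both 1.0) and the query and database sizes are assumptions chosen for illustration. In practice κ and λ are fitted for the scoring system in use, so the numbers here are not expected to reproduce the E-value of the BLAST hit shown earlier.

```python
def e_value(score, query_len, db_len, kappa=1.0, lam=1.0):
    """E = kappa * M * N * 2^(-lambda * x), with x the score in bits."""
    return kappa * query_len * db_len * 2.0 ** (-lam * score)


# Hypothetical sizes: a 100-base query against a database of one billion bases.
base = e_value(57.2, query_len=100, db_len=1_000_000_000)
one_more_bit = e_value(58.2, query_len=100, db_len=1_000_000_000)
bigger_db = e_value(57.2, query_len=100, db_len=2_000_000_000)

print(base, one_more_bit, bigger_db)
```

With lam = 1.0, each extra bit of score halves the E-value, while doubling the database size doubles it. This is why the same score can be significant in a small database and unremarkable in a large one.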


BLAST is not the only, or best tool for the job!


Profile-based homology search

Krogh, A. et al. (1994) Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol.

Image provided by Eric Nawrocki.


Profile-based homology search – scoring sequences

Image provided by Eric Nawrocki.
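As a rough illustration of scoring a sequence against a profile, the sketch below builds per-column log-odds scores from a toy alignment and sums them along a candidate sequence. It is a deliberate simplification: match columns only (no insert or delete states), a uniform background, and an add-one pseudocount. The alignment, alphabet and function names are all made up for this example.

```python
import math

ALPHABET = "ACGT"
BACKGROUND = {base: 0.25 for base in ALPHABET}  # assumed uniform background


def build_profile(alignment, pseudocount=1.0):
    """One dict of log2-odds scores per alignment column."""
    profile = []
    for col in range(len(alignment[0])):
        counts = {base: pseudocount for base in ALPHABET}
        for seq in alignment:
            counts[seq[col]] += 1
        total = sum(counts.values())
        profile.append(
            {base: math.log2((counts[base] / total) / BACKGROUND[base])
             for base in ALPHABET}
        )
    return profile


def score_sequence(profile, seq):
    """Sum the column scores for an ungapped sequence of the same length."""
    return sum(column[base] for column, base in zip(profile, seq))


# Toy ungapped alignment of four sequences (hypothetical).
alignment = ["ACGT", "ACGT", "ACGA", "TCGT"]
profile = build_profile(alignment)

consensus_score = score_sequence(profile, "ACGT")
unrelated_score = score_sequence(profile, "GTAC")
print(consensus_score, unrelated_score)
```

A sequence that follows the column preferences scores above zero (more likely under the profile than under the background), and one that does not scores below zero.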


Profile HMMs are slightly more complicated

- A tree-weighting scheme takes care of unbalanced alignments
- Dirichlet-mixture priors are used to incorporate information about amino-acid biochemistry
- Effective sequence number is used to down-weight priors when many sequences are available
- Transition probabilities to Insert & Delete states are estimated from the alignment
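To make the Dirichlet-mixture bullet concrete, here is a minimal sketch of the posterior-mean estimate of emission probabilities under a mixture of Dirichlet priors. The four-letter alphabet, the two components ("hydrophobic" and "acidic") and their parameters are invented for illustration; real priors cover all 20 amino acids with more components.

```python
import math


def log_marginal(counts, alpha):
    """log P(counts | one Dirichlet component), up to a term shared by all components."""
    n = sum(counts.values())
    a = sum(alpha.values())
    total = math.lgamma(a) - math.lgamma(n + a)
    for res, al in alpha.items():
        total += math.lgamma(counts.get(res, 0) + al) - math.lgamma(al)
    return total


def mixture_estimate(counts, components):
    """Posterior-mean emission probabilities under a Dirichlet mixture prior."""
    # Posterior weight of each component given the observed column counts.
    logs = [math.log(q) + log_marginal(counts, alpha) for q, alpha in components]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]
    norm = sum(weights)
    weights = [w / norm for w in weights]

    n = sum(counts.values())
    estimate = {}
    for res in components[0][1]:
        estimate[res] = sum(
            w * (counts.get(res, 0) + alpha[res]) / (n + sum(alpha.values()))
            for w, (_, alpha) in zip(weights, components)
        )
    return estimate


# Hypothetical two-component prior: (mixture coefficient, Dirichlet parameters).
components = [
    (0.5, {"L": 2.0, "I": 2.0, "D": 0.1, "E": 0.1}),  # "hydrophobic" column
    (0.5, {"L": 0.1, "I": 0.1, "D": 2.0, "E": 2.0}),  # "acidic" column
]

few = mixture_estimate({"L": 3}, components)     # column with three leucines
many = mixture_estimate({"L": 300}, components)  # same column, many more sequences
print(few, many)
```

After seeing only leucine, the estimate still gives isoleucine far more probability than aspartate, because the data favour the hydrophobic component. With 300 observations the counts dominate and the prior contributes very little.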


Why not just use BLAST?

- ACCURACY!
  - Every benchmark of homology search tools has shown that profile methods are more accurate than single-sequence methods.

Eddy (2011) Accelerated Profile HMM Searches. PLoS Computational Biology.


Why not just use BLAST?

- SPEED! To search a single query vs a database of all proteins:
  - BLAST: searches 42 million UniProt sequences
  - HMMER: searches 15,000 Pfam profiles
- The search space is ∼3,000x smaller for profiles
- Save Planet Earth, use HMMER3
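The ∼3,000x figure is simply the ratio of the two database sizes quoted above, rounded up. A quick check (variable names are made up for this example):

```python
blast_targets = 42_000_000  # UniProt sequences searched by BLAST
hmmer_targets = 15_000      # Pfam profiles searched by HMMER

ratio = blast_targets / hmmer_targets
print(f"{ratio:,.0f}x")  # prints "2,800x"
```

This compares only the number of targets. It says nothing about the cost of each individual comparison, which differs between the two tools.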

Eddy (2011) Accelerated Profile HMM Searches. PLoS Computational Biology.


Pfam

What is a Pfam-A Entry?

[Figure: the files that make up a Pfam-A entry (SEED, HMM, OUT, ALIGN, DESC) and the HMMER programs that link them (hmmbuild, hmmsearch, hmmalign).]

Slide borrowed from Rob Finn.
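The three HMMER programs named on this slide are typically chained as build, search, align. The sketch below only assembles the command lines so they can be inspected, and does not run anything. All file names are placeholders. It also assumes the hit sequences have already been extracted into a FASTA file (a step not shown here), and it is not a description of the actual Pfam production pipeline.

```python
import shlex


def hmmer_commands(seed="SEED", hmm="HMM", seqdb="seqdb.fasta",
                   hits_table="hits.tbl", hits_fasta="hits.fasta",
                   align="ALIGN"):
    """Return the three HMMER command lines as argument lists (nothing is executed)."""
    return [
        # Build a profile HMM from the seed alignment.
        ["hmmbuild", hmm, seed],
        # Search the sequence database with the profile; write a table of hits.
        ["hmmsearch", "--tblout", hits_table, hmm, seqdb],
        # Align the hit sequences back to the profile.
        ["hmmalign", "-o", align, hmm, hits_fasta],
    ]


commands = hmmer_commands()
for command in commands:
    print(shlex.join(command))
```

To run them, each list could be passed to `subprocess.run(command, check=True)` on a machine with HMMER installed.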

But, what about RNA?

[Figure: RNA secondary structure diagrams, drawn 5' to 3', with a sequence conservation scale from 0 to 1.]

Covariance models

Nawrocki & Eddy (2007) Query-Dependent Banding (QDB) for Faster RNA Similarity Searches. PLoS Computational Biology.


Benchmark

Freyhult, Bollback & Gardner (2007) Exploring genomic dark matter: A critical assessment of the performance of homology search methods on noncoding RNA. Genome Research.


Rfam


Relevant reading

- Reviews:
  - Eddy SR (2004) What is a hidden Markov model? Nature Biotechnology.
- Methods:
  - Altschul SF et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. http://nar.oxfordjournals.org/content/25/17/3389.full
  - Eddy (2011) Accelerated Profile HMM Searches. PLoS Computational Biology. http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002195

The End

