Lecture 4 Sequence alignment: how to discover similarities between biological sequences Chapter 6 in Jones and Pevzner Fall 2019 September 10, 2019


Page 1:

Lecture 4
Sequence alignment: how to discover similarities between biological sequences
Chapter 6 in Jones and Pevzner

Fall 2019
September 10, 2019

Page 2:

Evolution as a tool for biological insight

•  “Nothing in biology makes sense except in the light of evolution” - Theodosius Dobzhansky.

•  The functionality of many genes is virtually the same among many organisms: we can understand biology in simpler organisms than ourselves (“model organisms”).

Page 3:

Homology

•  Genes in organisms A and B that have evolved from the same ancestral gene are said to be homologs.

•  Homology between genes typically indicates conserved function.

•  Sequence similarity is used to infer homology.

Page 4:

Sequence Comparison: Early Success Story

•  In 1983 Russell Doolittle and colleagues found similarities between a cancer-causing gene from the Simian Sarcoma virus and a normal growth factor gene (PDGF).

•  Finding sequence similarities with genes of known function is a common approach to inferring a newly sequenced gene’s function.

Page 5:

The Drosophila “eyeless” gene

•  W. Gehring discovered that turning on the “eyeless” gene in Drosophila leads to the growth of ectopic eyes.

•  “eyeless” is a master control gene for eye formation (a transcription factor).

Page 6:

A similar gene in humans

•  The aniridia gene in humans has a sequence that is similar to the Drosophila eyeless gene.

•  Eye morphogenesis is under similar genetic control in vertebrates and insects.

Page 7:

  5 HSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVS 54
    ||||||||||||.|||||||||||||||||||||||||||||||||||||
 57 HSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVS 106

 55 KILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRD 104
    ||||||||||||||||||||||||||.||||||:||||||||||||||||
107 KILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRD 156

105 RLLSEGVCTNDNIPSVSSINRVLRNLASEKQQMGA--------------- 139
    |||.|.|||||||||||||||||||||::|:|...
157 RLLQENVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISA 206

155 -----------SWGTR---PGWYPGTSVPGQPTQ---------------- 174
               ||..|   ..||| ||:...|..
307 NHQALQQHQQQSWPPRHYSGSWYP-TSLSEIPISSAPNIASVTAYASGPS 355

175 ------------------------------------DGCQQQE---GGGE 185
                                        ||.|..|   |.||
356 LAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGE 405

186 NTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYP 235
    |:|..:||..::::.|.||.|||||||||||||.:||::|||||||||||
406 NSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYP 455

PAX6_HUMAN aligned against PAX6_DRO

Page 8:

Sequence alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings $v = v_1 v_2 \ldots v_m$ and $w = w_1 w_2 \ldots w_n$, an alignment is an assignment of gaps to positions $0, \ldots, m$ in $v$ and $0, \ldots, n$ in $w$, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

Page 9:

Mutations at the DNA level

…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…

SEQUENCE EDITS: Substitution, Deletion

REARRANGEMENTS: Inversion, Translocation, Duplication

Page 10:

Scoring an alignment

•  A simple scoring scheme:
   •  Penalize mismatches by -μ
   •  Penalize indels by -σ
   •  Reward matches with +1

•  Resulting score: #matches - μ(#mismatches) - σ(#indels)

•  Objective: find the best scoring alignment
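The scoring scheme above can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score` and the default values `mu=1` and `sigma=1` are assumptions, not part of the lecture. It takes the two rows of an alignment (with `-` as the gap character) and counts matches, mismatches and indels column by column.

```python
def alignment_score(top, bottom, mu=1, sigma=1, gap="-"):
    """Score two aligned rows: +1 per match, -mu per mismatch, -sigma per indel."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == gap and b == gap:
            raise ValueError("a column cannot pair a gap with a gap")
        if a == gap or b == gap:
            score -= sigma      # indel column
        elif a == b:
            score += 1          # match column
        else:
            score -= mu         # mismatch column
    return score


# The alignment shown on the "Sequence alignment" slide.
top = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
bottom = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
print(alignment_score(top, bottom, mu=1, sigma=1))
```

Changing `mu` and `sigma` changes which alignment scores best, which is why the penalties are part of the problem input.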

Page 11:

Number of pairwise alignments

•  Given sequences of length m and n, the number of alignments is:

$$\sum_{k=0}^{\min(m,n)} \binom{m}{k} \binom{n}{k} = \binom{n+m}{n}$$

•  For two sequences of length n:

$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$

Derived using Stirling’s approximation:

$$n! \approx \sqrt{2 \pi n} \left( \frac{n}{e} \right)^n$$
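A quick numerical check of these formulas, as a sketch using only Python's standard library (the helper names `count_alignments` and `stirling_estimate` are made up for this example). It sums the binomial products directly, confirms the closed form, and compares the exact count for two length-n sequences with the estimate obtained from Stirling's approximation.

```python
from math import comb, pi, sqrt


def count_alignments(m, n):
    """Sum over k of C(m, k) * C(n, k), for k = 0 .. min(m, n)."""
    return sum(comb(m, k) * comb(n, k) for k in range(min(m, n) + 1))


def stirling_estimate(n):
    """Estimate of C(2n, n) from Stirling's formula: 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / sqrt(pi * n)


for n in (2, 5, 10, 20):
    exact = count_alignments(n, n)
    assert exact == comb(2 * n, n)   # closed form from the slide
    print(n, exact, round(stirling_estimate(n)))
```

The count grows roughly like 4^n, so enumerating all alignments is hopeless even for short sequences.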

Page 12:

Substrings and subsequences

Definition: A string $x'$ is a substring of a string $x$ if $x = u x' v$ for some prefix string $u$ and suffix string $v$ ($x' = x_i \ldots x_j$, for some $1 \le i \le j \le |x|$).

A string $x'$ is a subsequence of a string $x$ if $x'$ can be obtained from $x$ by deleting 0 or more letters ($x' = x_{i_1} \ldots x_{i_k}$, for some $1 \le i_1 < \ldots < i_k \le |x|$).

Note: a substring is always a subsequence.

Example: x = abracadabra
y = cadabr: substring
z = brcdbr: subsequence, not substring
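The two definitions translate directly into code. The sketch below uses hypothetical helper names `is_substring` and `is_subsequence` and checks the abracadabra example.

```python
def is_substring(xp, x):
    """True if xp occurs in x as a contiguous block."""
    return xp in x


def is_subsequence(xp, x):
    """True if xp can be obtained from x by deleting zero or more letters."""
    remaining = iter(x)
    # "ch in remaining" advances the iterator past the first occurrence of ch,
    # so later letters of xp must be found further to the right in x.
    return all(ch in remaining for ch in xp)


x = "abracadabra"
for candidate in ("cadabr", "brcdbr"):
    print(candidate, is_substring(candidate, x), is_subsequence(candidate, x))
```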

Page 13:

Encoding alignment as a path in a 2-D grid

i coords:       0 1 2 2 3 3 4 5 6 7 8
elements of v:    A T - C - T G A T C
elements of w:    - T G C A T - A - C
j coords:       0 0 1 2 3 4 5 5 6 6 7

Every alignment is a path in the 2-D grid:

(0,0) → (1,0) → (2,1) → (2,2) → (3,3) → (3,4) → (4,5) → (5,5) → (6,6) → (7,6) → (8,7)
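The encoding can be sketched as follows, assuming the alignment is given as two equal-length rows with `-` for gaps (the function name `alignment_to_path` is illustrative). Each column advances i if it consumes a letter of v and advances j if it consumes a letter of w.

```python
def alignment_to_path(v_row, w_row, gap="-"):
    """Return the list of (i, j) grid coordinates visited by an alignment."""
    if len(v_row) != len(w_row):
        raise ValueError("aligned rows must have the same length")
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(v_row, w_row):
        if a != gap:
            i += 1          # column uses the next letter of v
        if b != gap:
            j += 1          # column uses the next letter of w
        path.append((i, j))
    return path


# Alignment of v = ATCTGATC and w = TGCATAC from this slide.
print(alignment_to_path("AT-C-TGATC", "-TGCAT-A-C"))
```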

Page 14:

Alignment as a path

[Figure: the alignment drawn as a path in a 2-D grid. One axis carries the letters A T C T G A T C at positions 0 to 8, the other carries the letters T G C A T A C at positions 0 to 7; the axes are labeled i and j.]

Page 15:

Alignment as a Path in the Edit Graph

0 1 2 2 3 4 5 6 7 7
  A T - G T T A T -
  A T C G T - A - C
0 1 2 3 4 5 5 6 6 7

Corresponding path: (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

Page 16:

Alignment as a Path in the Edit Graph

Horizontal and vertical edges represent indels in v and w, with score -1. Diagonal edges represent matches, with score 1. The score of the alignment is 1.

Page 17:

Alignment as a Path in the Edit Graph

Every path in the edit graph corresponds to an alignment:
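To make the correspondence concrete, here is a small sketch that turns a path back into an alignment. The name `path_to_alignment` and the convention that the first coordinate indexes v and the second indexes w are assumptions for this example. A diagonal step emits a pair of letters; a step in only one coordinate emits a letter against a gap.

```python
def path_to_alignment(path, v, w, gap="-"):
    """Convert a list of (i, j) edit-graph vertices into two aligned rows."""
    v_row, w_row = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step not in {(1, 1), (1, 0), (0, 1)}:
            raise ValueError(f"illegal step {step} in path")
        # Moving from i0 to i0 + 1 consumes v[i0] (0-based indexing).
        v_row.append(v[i0] if step[0] else gap)
        w_row.append(w[j0] if step[1] else gap)
    return "".join(v_row), "".join(w_row)


path = [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4), (4, 5),
        (5, 5), (6, 6), (7, 6), (7, 7)]
for row in path_to_alignment(path, "ATGTTAT", "ATCGTAC"):
    print(row)
```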

Page 18:

Alignment algorithms we will cover

•  Global alignment
•  Local alignment
•  Alignment with affine gap penalties
•  Scoring matrices

Page 19:

Our simple scoring scheme

•  The score when mismatches are penalized by -μ, indels are penalized by -σ, and matches are rewarded by +1:

#matches - μ(#mismatches) - σ(#indels)

Page 20:

Global Alignment: The Needleman-Wunsch algorithm [1]

Find the best alignment between two strings under our scoring scheme.

Input: Strings v and w and a scoring scheme
Output: Maximum scoring alignment

$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\ s_{i-1,j-1} - \mu & \text{if } v_i \neq w_j \\ s_{i-1,j} - \sigma \\ s_{i,j-1} - \sigma \end{cases}$$

$s_{i,j}$: the score for the best alignment of a length-i prefix of v and a length-j prefix of w
μ: mismatch penalty
σ: indel penalty

[1] Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 48(3):443-53, 1970.

Page 21:

Needleman-Wunsch (cont)

•  What about the base case?

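Page 22:

NW as a DP algorithm

def NW(v, w, sigma, mu):
    m, n = len(v), len(w)
    # s[i][j]: best score for aligning the first i letters of v with the first j letters of w
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):      # base case: prefix of v against the empty string
        s[i][0] = -sigma * i
    for j in range(n + 1):      # base case: empty string against a prefix of w
        s[0][j] = -sigma * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = 1 if v[i - 1] == w[j - 1] else -mu
            s[i][j] = max(s[i - 1][j - 1] + diag,
                          s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma)
    return s[m][n]

Runtime: O(nm)
Memory: O(nm)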

Page 23:

Now What?

•  The DP algorithm created the alignment grid.

•  To read the best alignment: follow the pointers from the sink.
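One way to follow the pointers is sketched below. It is an illustration under stated assumptions rather than the lecture's own code: the function name `needleman_wunsch`, the `back` table of pointer labels, and the tie-breaking order (diagonal first, then up, then left) are all choices made for this example. The table is filled as in the recurrence with +1, -mu and -sigma, each cell records which predecessor gave the maximum, and the alignment is read off backwards from the sink (m, n) to the source (0, 0).

```python
def needleman_wunsch(v, w, mu=1, sigma=1, gap="-"):
    """Return (score, aligned_v, aligned_w) for a best global alignment."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]   # pointer per cell

    for i in range(1, m + 1):            # first column: only deletions
        s[i][0], back[i][0] = -sigma * i, "up"
    for j in range(1, n + 1):            # first row: only insertions
        s[0][j], back[0][j] = -sigma * j, "left"

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            up = s[i - 1][j] - sigma
            left = s[i][j - 1] - sigma
            best = max(diag, up, left)
            s[i][j] = best
            back[i][j] = "diag" if best == diag else "up" if best == up else "left"

    # Traceback: walk the pointers from the sink (m, n) to the source (0, 0).
    i, j = m, n
    v_row, w_row = [], []
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            v_row.append(v[i - 1]); w_row.append(w[j - 1]); i -= 1; j -= 1
        elif move == "up":
            v_row.append(v[i - 1]); w_row.append(gap); i -= 1
        else:
            v_row.append(gap); w_row.append(w[j - 1]); j -= 1
    return s[m][n], "".join(reversed(v_row)), "".join(reversed(w_row))


score, top, bottom = needleman_wunsch("ATGTTAT", "ATCGTAC")
print(score)
print(top)
print(bottom)
```

When several predecessors tie, different tie-breaking orders give different alignments with the same score, so the output of the traceback is one optimal alignment, not necessarily the only one.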

Page 24:

Scoring Matrices

To generalize scoring, we use a scoring matrix δ.

Size of the matrix:
Alignment of DNA sequences: (4+1) x (4+1)
Alignment of amino acids: (20+1) x (20+1)

The additional row/column includes scores for the gap character “-”.

$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
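The generalized recurrence can be sketched with δ stored as a dictionary keyed by pairs of symbols. The helpers `make_delta` and `global_score` and the example values (+1, -1, -1) are assumptions for illustration; a real application would load a published substitution matrix instead. The gap-versus-gap entry is left out because no alignment column uses it, so the DNA table built here has 24 of the 25 entries.

```python
def make_delta(alphabet, match=1, mismatch=-1, indel=-1, gap="-"):
    """Build a simple scoring matrix over alphabet + gap as a dict of pairs."""
    symbols = list(alphabet) + [gap]
    delta = {}
    for a in symbols:
        for b in symbols:
            if a == gap and b == gap:
                continue                     # never used by an alignment
            if a == gap or b == gap:
                delta[a, b] = indel
            else:
                delta[a, b] = match if a == b else mismatch
    return delta


def global_score(v, w, delta, gap="-"):
    """Best global alignment score of v and w under scoring matrix delta."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        s[i][0] = s[i - 1][0] + delta[v[i - 1], gap]
    for j in range(1, n + 1):
        s[0][j] = s[0][j - 1] + delta[gap, w[j - 1]]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s[i][j] = max(s[i - 1][j - 1] + delta[v[i - 1], w[j - 1]],
                          s[i - 1][j] + delta[v[i - 1], gap],
                          s[i][j - 1] + delta[gap, w[j - 1]])
    return s[m][n]


dna_delta = make_delta("ACGT")
print(global_score("ATGTTAT", "ATCGTAC", dna_delta))
```

With `match=1`, `mismatch=-mu` and `indel=-sigma` this reduces to the simple scoring scheme used earlier in the lecture.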