Lecture 4
Sequence alignment: how to discover similarities between biological sequences
Chapter 6 in Jones and Pevzner
Fall 2019
September 10, 2019
Evolution as a tool for biological insight
• “Nothing in biology makes sense except in the light of evolution” - Theodosius Dobzhansky.
• The functionality of many genes is virtually the same among many organisms: can understand biology in simpler organisms than ourselves (“model organisms”).
Homology
• Genes in organisms A and B that have evolved from the same ancestral gene are said to be homologs.
• Homology between genes typically indicates conserved function.
• Sequence similarity is used to infer homology.
Sequence Comparison: Early Success Story
• In 1983 Russell Doolittle and colleagues found similarities between a cancer-causing gene from the Simian Sarcoma virus and a normal growth factor gene (PDGF).
• Finding sequence similarities with genes of known function is a common approach to infer a newly sequenced gene’s function.
The Drosophila “eyeless” gene
• W. Gehring discovered that turning on the “eyeless” gene in Drosophila leads to the growth of ectopic eyes.
• “eyeless” is a master control gene for eye formation (it encodes a transcription factor).
A similar gene in humans
• The aniridia gene in humans has a sequence that is similar to the Drosophila eyeless gene.
• Eye morphogenesis is under similar genetic control in vertebrates and insects.
  5 HSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVS  54
    ||||||||||||.|||||||||||||||||||||||||||||||||||||
 57 HSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVS 106

 55 KILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRD 104
    ||||||||||||||||||||||||||.||||||:||||||||||||||||
107 KILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRD 156

105 RLLSEGVCTNDNIPSVSSINRVLRNLASEKQQMGA--------------- 139
    |||.|.|||||||||||||||||||||::|:|...
157 RLLQENVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISA 206

155 -----------SWGTR---PGWYPGTSVPGQPTQ---------------- 174
               ||..|   ..||| ||:...|..
307 NHQALQQHQQQSWPPRHYSGSWYP-TSLSEIPISSAPNIASVTAYASGPS 355

175 ------------------------------------DGCQQQE---GGGE 185
                                        ||.|..|   |.||
356 LAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGE 405

186 NTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYP 235
    |:|..:||..::::.|.||.|||||||||||||.:||::|||||||||||
406 NSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYP 455
PAX6_HUMAN aligned against PAX6_DRO
Sequence alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition
Given two strings $v = v_1 v_2 \ldots v_m$ and $w = w_1 w_2 \ldots w_n$, an alignment is an assignment of gaps to positions $0, \ldots, m$ in $v$ and $0, \ldots, n$ in $w$, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.
Mutations at the DNA level
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
SEQUENCE EDITS: Substitution, Deletion
REARRANGEMENTS: Inversion, Translocation, Duplication
Scoring an alignment
• A simple scoring scheme:
  • Penalize mismatches by -μ
  • Penalize indels by -σ
  • Reward matches with +1
• Resulting score: #matches - μ(#mismatches) - σ(#indels)
• Objective: find the best-scoring alignment
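The scoring scheme above can be sketched in a few lines of Python. This is only an illustration: the function name `score_alignment` and the convention of passing the alignment as two equal-length rows with `-` for a gap are assumptions, not part of the lecture.

```python
def score_alignment(row_v, row_w, mu, sigma):
    """Score a gapped alignment given as two equal-length rows ('-' marks a gap).

    Returns #matches - mu * #mismatches - sigma * #indels.
    """
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have equal length")
    matches = mismatches = indels = 0
    for a, b in zip(row_v, row_w):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            indels += 1        # letter aligned with a gap
        elif a == b:
            matches += 1       # identical letters
        else:
            mismatches += 1    # two different letters
    return matches - mu * mismatches - sigma * indels


# 5 matches, 0 mismatches, 4 indels with mu = sigma = 1
print(score_alignment("AT-GTTAT-", "ATCGT-A-C", mu=1, sigma=1))  # prints 1
```

Finding the best-scoring alignment then amounts to maximizing this function over all valid pairs of rows, which is what the dynamic programming algorithms later in the lecture do without enumerating every pair.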
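Number of pairwise alignments
• Given sequences of length m and n, the number of alignments is:
$$\sum_{k=0}^{\min(m,n)} \binom{m}{k} \binom{n}{k} = \binom{n+m}{n}$$
• For two sequences of length n:
$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$
Derived using Stirling’s approximation:
$$n! \approx \sqrt{2 \pi n} \left( \frac{n}{e} \right)^n$$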
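A quick numerical check of the counting identity and of the Stirling-based estimate may help build intuition for how fast the number of alignments grows. The sketch below assumes Python 3.8+ (for `math.comb`); the function names are illustrative and not from the lecture.

```python
from math import comb, pi, sqrt


def count_alignments(m, n):
    """Sum over k of C(m, k) * C(n, k), the left-hand side of the identity."""
    return sum(comb(m, k) * comb(n, k) for k in range(min(m, n) + 1))


def central_binomial_estimate(n):
    """Stirling-based estimate 2^(2n) / sqrt(pi * n) for C(2n, n)."""
    return 2 ** (2 * n) / sqrt(pi * n)


# The sum agrees with the closed form C(n + m, n).
print(count_alignments(3, 5), comb(8, 3))

# Compare the exact count with the estimate for two sequences of length 10.
print(comb(20, 10), round(central_binomial_estimate(10)))
```

Even for short sequences the count is far too large for brute-force enumeration, which motivates the dynamic programming approach.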
Substrings and subsequences
Definition: A string x’ is a substring of a string x if x = u x’ v for some prefix string u and suffix string v
($x' = x_i \ldots x_j$, for some $1 \le i \le j \le |x|$).
A string x’ is a subsequence of a string x if x’ can be obtained from x by deleting 0 or more letters
($x' = x_{i_1} \ldots x_{i_k}$, for some $1 \le i_1 < \ldots < i_k \le |x|$).
Note: a substring is always a subsequence.
Example: x = abracadabra
y = cadabr: substring
z = brcdbr: subsequence, not substring
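The two definitions can be checked mechanically. The following is a minimal sketch; the helper names `is_substring` and `is_subsequence` are illustrative. The subsequence test scans x once from left to right and consumes letters as it matches them, so the matched positions are strictly increasing.

```python
def is_substring(candidate, x):
    """True if candidate occurs in x as a contiguous block."""
    return candidate in x


def is_subsequence(candidate, x):
    """True if candidate can be obtained from x by deleting 0 or more letters."""
    remaining = iter(x)
    # 'ch in remaining' advances the iterator past the first match,
    # so later letters can only be matched further to the right.
    return all(ch in remaining for ch in candidate)


x = "abracadabra"
print(is_substring("cadabr", x), is_subsequence("cadabr", x))  # True True
print(is_substring("brcdbr", x), is_subsequence("brcdbr", x))  # False True
```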
Encoding alignment as a path in a 2-D grid
elements of v:     A  T  -  C  -  T  G  A  T  C
i coords:       0  1  2  2  3  3  4  5  6  7  8
elements of w:     -  T  G  C  A  T  -  A  -  C
j coords:       0  0  1  2  3  4  5  5  6  6  7
Every alignment is a path in the 2-D grid
(0,0) → (1,0) → (2,1) → (2,2) → (3,3) → (3,4) → (4,5) → (5,5) → (6,6) → (7,6) → (8,7)
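The encoding can be sketched as follows: walk over the alignment columns and advance i whenever the column holds a letter of v, and j whenever it holds a letter of w. The function name `alignment_to_path` is illustrative, and the two gapped rows used in the example are read off from the path listed above, which is an assumption about how the slide drew the alignment.

```python
def alignment_to_path(row_v, row_w):
    """Convert two gapped rows ('-' marks a gap) into grid coordinates (i, j)."""
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(row_v, row_w):
        if a != "-":
            i += 1  # this column consumes a letter of v
        if b != "-":
            j += 1  # this column consumes a letter of w
        path.append((i, j))
    return path


print(alignment_to_path("AT-C-TGATC", "-TGCAT-A-C"))
```

Each column adds exactly one edge to the path, so an alignment with c columns yields a path with c + 1 grid points.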
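Alignment as a path
[Figure: 2-D grid with axes labeled i and j. The letters A T C T G A T C lie along the axis numbered 0 to 8 and the letters T G C A T A C along the axis numbered 0 to 7; the alignment is drawn as a path through the grid.]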
Alignment as a Path in the Edit Graph
      0  1  2  2  3  4  5  6  7  7
v:       A  T  -  G  T  T  A  T  -
w:       A  T  C  G  T  -  A  -  C
      0  1  2  3  4  5  5  6  6  7
Corresponding path: (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
Horizontal and vertical edges represent indels in v and w, with score -1. Diagonal edges represent matches, with score 1. The score of the alignment is 1.
Every path in the edit graph corresponds to an alignment.
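Going in the other direction, a path can be decoded back into an alignment and scored edge by edge. This is a sketch under stated assumptions: the names `path_to_alignment` and `path_score` are illustrative, the first coordinate is taken to index v and the second w, and diagonal edges over unequal letters are scored as mismatches with penalty mu (the slide's example contains none).

```python
def path_to_alignment(v, w, path):
    """Turn an edit-graph path [(i, j), ...] into two gapped rows."""
    row_v, row_w = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 1):      # diagonal edge: v[i0] aligned with w[j0]
            row_v.append(v[i0])
            row_w.append(w[j0])
        elif step == (1, 0):    # letter of v only: gap in w
            row_v.append(v[i0])
            row_w.append("-")
        elif step == (0, 1):    # letter of w only: gap in v
            row_v.append("-")
            row_w.append(w[j0])
        else:
            raise ValueError(f"not an edit-graph edge: {step}")
    return "".join(row_v), "".join(row_w)


def path_score(v, w, path, mu=1, sigma=1):
    """Sum the edge scores along the path: +1 match, -mu mismatch, -sigma indel."""
    row_v, row_w = path_to_alignment(v, w, path)
    score = 0
    for a, b in zip(row_v, row_w):
        if a == "-" or b == "-":
            score -= sigma
        elif a == b:
            score += 1
        else:
            score -= mu
    return score


path = [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4), (4, 5), (5, 5), (6, 6), (7, 6), (7, 7)]
print(path_to_alignment("ATGTTAT", "ATCGTAC", path))  # ('AT-GTTAT-', 'ATCGT-A-C')
print(path_score("ATGTTAT", "ATCGTAC", path))         # 1
```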
Alignment algorithms we will cover
• Global alignment
• Local alignment
• Alignment with affine gap penalties
• Scoring matrices
Our simple scoring scheme
• The score when mismatches are penalized by -μ, indels are penalized by -σ, and matches are rewarded by +1:
#matches - μ(#mismatches) - σ(#indels)
Global Alignment: The Needleman-Wunsch algorithm [1]
Find the best alignment between two strings under our scoring scheme.
Input: Strings v and w and a scoring scheme
Output: Maximum-scoring alignment
$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\ s_{i-1,j-1} - \mu & \text{if } v_i \ne w_j \\ s_{i-1,j} - \sigma \\ s_{i,j-1} - \sigma \end{cases}$$
$s_{i,j}$: the score for the best alignment of a length-i prefix of v and a length-j prefix of w
μ: mismatch penalty
σ: indel penalty
[1] A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol. 48(3):443-53, 1970.
Needleman-Wunsch (cont.)
• What about the base case?
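NW as a DP algorithm
def NW(v, w, sigma, mu):
    m, n = len(v), len(w)
    # s[i][j]: best score for the length-i prefix of v and the length-j prefix of w
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):        # base case: prefix of v aligned against gaps only
        s[i][0] = -sigma * i
    for j in range(n + 1):        # base case: prefix of w aligned against gaps only
        s[0][j] = -sigma * j
    for i in range(1, m + 1):     # ranges run to m and n inclusive
        for j in range(1, n + 1):
            diag = 1 if v[i - 1] == w[j - 1] else -mu
            s[i][j] = max(s[i - 1][j - 1] + diag,
                          s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma)
    return s[m][n]
Runtime: O(nm)
Memory: O(nm)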
Now What?
• The DP algorithm created the alignment grid.
• To read off the best alignment: follow the backtracking pointers from the sink (m, n) back to the source (0, 0).
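The pointer-following step can be sketched as below. This is one possible implementation, not necessarily the one used in the course: the name `needleman_wunsch`, the string labels for the pointers, and the tie-breaking rule (whatever `max` picks on the (score, label) pairs) are all assumptions. When several alignments share the optimal score, only one of them is returned.

```python
def needleman_wunsch(v, w, mu=1, sigma=1):
    """Global alignment with +1 match, -mu mismatch, -sigma indel.

    Returns (best score, gapped row for v, gapped row for w).
    """
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]  # pointer stored per cell

    for i in range(1, m + 1):          # first column: only gaps in w
        s[i][0] = -sigma * i
        back[i][0] = "up"
    for j in range(1, n + 1):          # first row: only gaps in v
        s[0][j] = -sigma * j
        back[0][j] = "left"

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            up = s[i - 1][j] - sigma
            left = s[i][j - 1] - sigma
            s[i][j], back[i][j] = max((diag, "diag"), (up, "up"), (left, "left"))

    # Follow the pointers from the sink (m, n) back to the source (0, 0).
    row_v, row_w = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            row_v.append(v[i - 1]); row_w.append(w[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            row_v.append(v[i - 1]); row_w.append("-")
            i -= 1
        else:  # "left"
            row_v.append("-"); row_w.append(w[j - 1])
            j -= 1
    # The rows were built from the sink backwards, so reverse them.
    return s[m][n], "".join(reversed(row_v)), "".join(reversed(row_w))


score, a, b = needleman_wunsch("ATGTTAT", "ATCGTAC")
print(score)
print(a)
print(b)
```

Storing a pointer per cell keeps the traceback simple at the cost of a second O(nm) table; the pointers could instead be recomputed from the score table during the traceback.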
Scoring Matrices
To generalize scoring, we use a scoring matrix δ.
Size of the matrix:
• Alignment of DNA sequences: (4+1) x (4+1)
• Alignment of amino acids: (20+1) x (20+1)
The additional row/column includes scores for the gap character “-”.
$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
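The generalized recurrence can be sketched with δ stored as a dictionary keyed by letter pairs. Several details here are assumptions rather than something the slides specify: the names `simple_dna_delta` and `global_score`, the dictionary representation, and the base case, which accumulates the gap scores along the first row and column. The dictionary omits the unused ("-", "-") entry, so it holds 24 of the 25 cells of the (4+1) x (4+1) matrix.

```python
def simple_dna_delta(mu, sigma):
    """Build delta for DNA so that it reproduces the simple +1 / -mu / -sigma scheme."""
    alphabet = "ACGT"
    delta = {}
    for a in alphabet:
        for b in alphabet:
            delta[(a, b)] = 1 if a == b else -mu
        delta[(a, "-")] = -sigma   # letter of v against a gap
        delta[("-", a)] = -sigma   # gap against a letter of w
    return delta


def global_score(v, w, delta):
    """Best global alignment score under an arbitrary scoring matrix delta."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):      # assumed base case: accumulate gap scores
        s[i][0] = s[i - 1][0] + delta[(v[i - 1], "-")]
    for j in range(1, n + 1):
        s[0][j] = s[0][j - 1] + delta[("-", w[j - 1])]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + delta[(v[i - 1], w[j - 1])],
                s[i - 1][j] + delta[(v[i - 1], "-")],
                s[i][j - 1] + delta[("-", w[j - 1])],
            )
    return s[m][n]


delta = simple_dna_delta(mu=1, sigma=1)
print(global_score("ATGTTAT", "ATCGTAC", delta))  # prints 2
```

With this δ the result matches the simple scheme; swapping in a different dictionary changes the scoring without touching the recurrence.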