Lesson 2: Aligning sequences and searching databases
![Page 1: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/1.jpg)
1
Lesson 2
Aligning sequences and searching databases
![Page 2: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/2.jpg)
2
Homology and sequence alignment.
![Page 3: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/3.jpg)
Homology
Homology = similarity between objects due to common ancestry
Hund = Dog, Schwein = Pig
![Page 4: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/4.jpg)
4
Sequence homology
VLSPAVKWAKVGAHAAGHG
||| || |||| |  ||||
VLSEAVLWAKVEADVAGHG
Similarity between sequences as a result of common ancestry.
![Page 5: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/5.jpg)
5
Sequence alignment
Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.
![Page 6: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/6.jpg)
6
Why align?
VLSPAVKWAKV
||| || ||||
VLSEAVLWAKV
1. To detect if two sequences are homologous. If so, homology may indicate similarity in function (and structure).
2. Required for evolutionary studies (e.g., tree reconstruction).
3. To detect conservation (e.g., a tyrosine that is evolutionarily conserved is more likely to be a phosphorylation site).
4. To locate a sequenced DNA fragment from an unknown region by aligning it to the genome.
![Page 7: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/7.jpg)
7
Insertions, deletions, and substitutions
![Page 8: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/8.jpg)
8
Sequence alignment
If two sequences share a common ancestor (for example, human and dog hemoglobin), we can represent their evolutionary relationship using a tree.
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
[Tree leaves: VLSPAV-WAKV, VLSEAVLWAKV]
![Page 9: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/9.jpg)
9
Perfect match
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
[Tree leaves: VLSPAV-WAKV, VLSEAVLWAKV]
A perfect match suggests that no change has occurred from the common ancestor (although this is not always the case).
![Page 10: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/10.jpg)
10
A substitution
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
[Tree leaves: VLSPAV-WAKV, VLSEAVLWAKV]
A substitution suggests that at least one change has occurred since the common ancestor (although we cannot say in which lineage it has occurred).
![Page 11: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/11.jpg)
11
Indel
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
Option 1: The ancestor had an L at this position and it was lost in one lineage. In such a case, the event was a deletion.
Ancestor: VLSEAVLWAKV
Descendants: VLSPAV-WAKV and VLSEAVLWAKV
![Page 12: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/12.jpg)
12
Indel
VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV
Option 2: The ancestor was shorter and the L was inserted in one lineage. In such a case, the event was an insertion.
Ancestor: VLSEAVWAKV
Descendants: VLSPAV-WAKV and VLSEAVLWAKV (with the inserted L)
![Page 13: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/13.jpg)
13
Indel
VLSPAV-WAKV
VLSEAVLWAKV
Deletion? Insertion?
Normally, given only two sequences, we cannot tell whether the event was an insertion or a deletion, so we call it an indel.
![Page 14: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/14.jpg)
14
Indels in protein coding genes
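Indels in protein-coding genes are often 3 bp, 6 bp, 9 bp, and so on in length, because an indel whose length is a multiple of 3 does not shift the reading frame.
Gene Search
In fact, searching for indels of length 3K (K = 1, 2, 3, …) can help algorithms that search a genome for coding regions.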
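The 3K rule is easy to check in code. The sketch below is only an illustration: the function names and the list of indel lengths are made up for this example and do not come from any particular gene-finding tool.

```python
def preserves_reading_frame(indel_length_bp):
    """Return True if an indel of this length (in bp) keeps the codon frame."""
    if indel_length_bp <= 0:
        raise ValueError("indel length must be positive")
    return indel_length_bp % 3 == 0


def fraction_in_frame(indel_lengths):
    """Fraction of indels whose length is 3K (K = 1, 2, 3, ...)."""
    if not indel_lengths:
        return 0.0
    in_frame = sum(1 for n in indel_lengths if preserves_reading_frame(n))
    return in_frame / len(indel_lengths)


# Hypothetical indel lengths observed in a candidate region.
print(fraction_in_frame([3, 6, 1, 9]))  # 0.75
```

A region where most indels are in frame is, under this heuristic, a better candidate for a coding region than one where indel lengths look unconstrained.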
![Page 15: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/15.jpg)
15
Global and Local pairwise alignments
![Page 16: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/16.jpg)
16
Global vs. Local
• Global alignment – finds the best alignment across the entire two sequences.
• Local alignment – finds regions of similarity in parts of the sequences.
Global alignment (forces alignment in regions which differ):
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment (returns only regions of good alignment):
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
![Page 17: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/17.jpg)
17
Global alignment
PTK2 protein tyrosine kinase 2 of human and rhesus monkey
![Page 18: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/18.jpg)
18
Proteins are composed of domains
[Figure: human PTK2 drawn as three domains: Domain A, a protein tyrosine kinase domain, and Domain B.]
![Page 19: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/19.jpg)
19
Protein tyrosine kinase domain
In leukocytes, a different gene for tyrosine kinase is expressed.
[Figure: domain diagram labeled Domain X, protein tyrosine kinase domain, and Domain A.]
![Page 20: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/20.jpg)
20
[Figure: domain diagrams of leukocyte TK and PTK2, labeled Domain X, protein tyrosine kinase domain, Domain A, and Domain B.]
The sequence similarity is restricted to a single domain.
![Page 21: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/21.jpg)
21
Global alignment of PTK and LTK
![Page 22: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/22.jpg)
22
Local alignment of PTK and LTK
![Page 23: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/23.jpg)
23
Conclusions
Use global alignment when the two sequences share the same overall sequence arrangement.
Use local alignment to detect regions of similarity.
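The slides do not name an algorithm at this point. As a minimal sketch, the difference between the two modes can be shown with the classic dynamic-programming recurrences (Needleman-Wunsch style for global, Smith-Waterman style for local). The function name and the default scores (+1, -2, -1) are assumptions chosen for illustration.

```python
def alignment_score(a, b, match=1, mismatch=-2, gap=-1, local=False):
    """Best alignment score of sequences a and b by dynamic programming.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of subsequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                # Local mode: a negative prefix is dropped by restarting at 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


seq1, seq2 = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print("global:", alignment_score(seq1, seq2))
print("local: ", alignment_score(seq1, seq2, local=True))
```

The local score can never be lower than the global score, because the end-to-end alignment is one of the candidates the local search considers.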
![Page 24: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/24.jpg)
24
How alignments are computed
![Page 25: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/25.jpg)
25
Pairwise alignment
AAGCTGAATTCGAA
AGGCTCATTTCTGA
One possible alignment:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
![Page 26: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/26.jpg)
26
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
This alignment includes:
2 mismatches
4 indels (gaps)
10 perfect matches
![Page 27: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/27.jpg)
27
Choosing an alignment for a pair of sequences
Many different alignments are possible for 2 sequences:
AAGCTGAATTCGAA
AGGCTCATTTCTGA
Alignment 1:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Alignment 2:
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Which alignment is better?
![Page 28: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/28.jpg)
28
Scoring system (naïve)
Perfect match: +1
Mismatch: -2
Indel (gap): -1
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)x10 + (-2)x2 + (-1)x4 = 2
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)x9 + (-2)x2 + (-1)x6 = -1
Higher score => better alignment
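The naïve scoring rule can be applied mechanically to any gapped alignment. The following is a small sketch; the function name `score_alignment` is made up for this example, and the defaults are the slide's values (+1, -2, -1).

```python
def score_alignment(top, bottom, match=1, mismatch=-2, gap=-1):
    """Sum per-column scores of a gapped pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap          # indel column
        elif x == y:
            total += match        # identical characters
        else:
            total += mismatch     # substitution
    return total


print(score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))    # 2
print(score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))  # -1
```

Under this scoring system the first alignment (10 matches, 2 mismatches, 4 gaps) beats the second (9 matches, 2 mismatches, 6 gaps).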
![Page 29: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/29.jpg)
29
Alignment scoring - scoring of sequence similarity:
Assumes independence between positions: each position is considered separately.
Scores each position:
• Positive if identical (match)
• Negative if different (mismatch or gap)
Total score = sum of position scores
Can be positive or negative
![Page 30: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/30.jpg)
30
Scoring systems
![Page 31: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/31.jpg)
31
Scoring system
• In the example above, the choice of +1 for a match, -2 for a mismatch, and -1 for a gap is quite arbitrary
• Different scoring systems => different alignments
• We want a good scoring system…
![Page 32: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/32.jpg)
32
Scoring matrix
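
|   | A | G | C | T |
|---|---|---|---|---|
| A | 2 |   |   |   |
| G | -6 | 2 |   |   |
| C | -6 | -6 | 2 |   |
| T | -6 | -6 | -6 | 2 |

• The scoring system is represented as an n x n table or matrix, where n is the number of letters in the alphabet (n = 4 for nucleotides, n = 20 for amino acids).
• The matrix is symmetric, so only the lower triangle is shown.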
![Page 33: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/33.jpg)
33
DNA scoring matrices
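• Uniform substitutions between all nucleotides:

| From / To | A | G | C | T |
|---|---|---|---|---|
| A | 2 |   |   |   |
| G | -6 | 2 |   |   |
| C | -6 | -6 | 2 |   |
| T | -6 | -6 | -6 | 2 |

Match: diagonal entries (2). Mismatch: off-diagonal entries (-6).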
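Because the matrix is symmetric, only one triangle needs to be stored. The sketch below holds the slide's values in a dictionary and looks up either orientation of a pair; the names `LOWER_TRIANGLE` and `substitution_score` are illustrative only.

```python
# Lower triangle of the uniform DNA matrix: 2 for a match, -6 for a mismatch.
LOWER_TRIANGLE = {
    ("A", "A"): 2,
    ("G", "A"): -6, ("G", "G"): 2,
    ("C", "A"): -6, ("C", "G"): -6, ("C", "C"): 2,
    ("T", "A"): -6, ("T", "G"): -6, ("T", "C"): -6, ("T", "T"): 2,
}


def substitution_score(x, y, table=LOWER_TRIANGLE):
    """Look up the score of aligning x with y in a symmetric matrix."""
    if (x, y) in table:
        return table[(x, y)]
    return table[(y, x)]  # symmetric: score(x, y) == score(y, x)


print(substitution_score("A", "G"), substitution_score("G", "A"))  # -6 -6
```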
![Page 34: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/34.jpg)
34
DNA scoring matrices
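Can take into account biological phenomena such as:
• Transition-transversion bias (transitions, A<->G and C<->T, are typically observed more often than transversions, so they can be penalized less)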
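One way to encode this in a scoring matrix is to give transitions (purine to purine, or pyrimidine to pyrimidine) a milder penalty than transversions. The sketch below does that; the specific values (2, -4, -6) and the function name are assumptions for illustration, not values taken from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_scoring_matrix(match=2, transition=-4, transversion=-6):
    """Build a 4 x 4 DNA scoring matrix as a dict keyed by (from, to)."""
    matrix = {}
    for x in "AGCT":
        for y in "AGCT":
            if x == y:
                matrix[(x, y)] = match
            elif {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
                matrix[(x, y)] = transition      # A<->G or C<->T
            else:
                matrix[(x, y)] = transversion    # purine <-> pyrimidine
    return matrix


m = dna_scoring_matrix()
print(m[("A", "G")], m[("A", "C")])  # -4 -6
```

Setting `transition` equal to `transversion` recovers a uniform matrix.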
![Page 35: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/35.jpg)
35
Amino-acid scoring matrices
• Take into account physico-chemical properties
![Page 36: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/36.jpg)
36
Scoring gaps (I)
In advanced algorithms, two separate gaps of one amino acid each are scored differently from one gap of two amino acids. This is achieved by charging a penalty for each gap that is opened and a smaller penalty for each position by which an open gap is extended.
Gap extension penalty < Gap opening penalty
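A common way to express this is an affine gap penalty. The sketch below assumes one convention (the opening penalty covers the first gap position, and each further position costs the extension penalty); other tools define it slightly differently, and the values -5 and -1 are illustrative only.

```python
def affine_gap_penalty(length, gap_open=-5, gap_extend=-1):
    """Penalty for one contiguous gap of the given length.

    The first gap position costs gap_open; every additional position
    costs gap_extend (smaller in magnitude than gap_open).
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# Two separate gaps of length 1 versus one gap of length 2.
print(2 * affine_gap_penalty(1), affine_gap_penalty(2))  # -10 -6
```

With these values a single two-residue gap is penalized less than two one-residue gaps, which favors alignments with fewer, longer gaps.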
![Page 37: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/37.jpg)
37
Scoring gaps (II)
The dependency between the penalty and the length of the gap need not be linear.
Linear penalty (each sequence is aligned against AGGGTTCTGA):
AGGGTTC-GA   Score = -2
AGGGTT--GA   Score = -4
AGGGT---GA   Score = -6
AGGG----GA   Score = -8
![Page 38: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/38.jpg)
38
Scoring gaps (II)
The dependency between the penalty and the length of the gap need not be linear.
Non-linear penalty (each sequence is aligned against AGGGTTCTGA):
AGGGTTC-GA   Score = -4
AGGGTT--GA   Score = -6
AGGGT---GA   Score = -7
AGGG----GA   Score = -8
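The two penalty schemes can be compared by looking at how much each additional gap position costs. This is a small sketch: the helper names are invented for this example, and the non-linear values are simply the four scores shown on the slide.

```python
def linear_penalty(length, per_position=-2):
    """Linear scheme: every gap position costs the same."""
    return per_position * length


# Non-linear penalties read off the slide for gap lengths 1 to 4.
SLIDE_NONLINEAR_PENALTY = {1: -4, 2: -6, 3: -7, 4: -8}


def marginal_costs(penalty_by_length):
    """Cost added by each extra gap position, for lengths 1, 2, 3, ..."""
    costs = []
    previous = 0
    for length in sorted(penalty_by_length):
        costs.append(penalty_by_length[length] - previous)
        previous = penalty_by_length[length]
    return costs


linear = {k: linear_penalty(k) for k in range(1, 5)}
print(marginal_costs(linear))                   # [-2, -2, -2, -2]
print(marginal_costs(SLIDE_NONLINEAR_PENALTY))  # [-4, -2, -1, -1]
```

In the non-linear scheme opening the gap is the expensive step, and each further extension costs less.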
![Page 39: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/39.jpg)
39
PAM AND BLOSUM
![Page 40: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/40.jpg)
40
Amino-acid substitution matrices
• Actual substitutions:
– Based on empirical data
– Commonly used by many bioinformatics programs
– PAM & BLOSUM
![Page 41: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/41.jpg)
41
Protein matrices – actual substitutions
The idea: Given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute each other.
M G Y D E
M G Y D E
M G Y E E
M G Y D E
M G Y Q E
M G Y D E
M G Y E E
M G Y E E
In the fourth column, E and D are found in 7 of 8 sequences.
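The counting step behind this idea can be sketched as follows. The code tallies the residues in one alignment column and the unordered residue pairs they form; the helper names are made up for this illustration, and real matrices are built from far larger alignments.

```python
from collections import Counter
from itertools import combinations

ALIGNMENT = ["MGYDE", "MGYDE", "MGYEE", "MGYDE",
             "MGYQE", "MGYDE", "MGYEE", "MGYEE"]


def column(alignment, index):
    """Residues found at one position (0-based) across all sequences."""
    return [seq[index] for seq in alignment]


def pair_counts(residues):
    """Count unordered residue pairs over all pairs of sequences."""
    counts = Counter()
    for x, y in combinations(residues, 2):
        counts[tuple(sorted((x, y)))] += 1
    return counts


fourth = column(ALIGNMENT, 3)
print(Counter(fourth))      # D: 4, E: 3, Q: 1
print(pair_counts(fourth))  # (D, E) is the most frequent pair, 12 of 28
```

The frequent D/E pairing in this column is the kind of evidence that leads to a favorable D/E substitution score.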
![Page 42: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/42.jpg)
42
PAM Matrix - Point Accepted Mutations
• The Dayhoff PAM matrix is based on a database of 1,572 changes in 71 groups of closely related proteins (85% identity => Alignment was easy and reliable).
• Counted the number of substitutions per amino-acid pair (20 x 20)
• Found that common substitutions occurred between chemically similar amino acids
![Page 43: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/43.jpg)
43
PAM Matrices
• Family of matrices PAM 80, PAM 120, PAM 250
• The number in a PAM matrix name represents evolutionary distance (PAM n corresponds to n accepted point mutations per 100 amino acids)
• Larger numbers are for larger distances
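In Dayhoff's construction, the mutation probability matrix for distance n is obtained by multiplying the PAM 1 matrix by itself n times. The sketch below illustrates only that extrapolation step, on an invented 2-letter alphabet; `TOY_PAM1` is a made-up matrix, not real PAM data, and real PAM matrices are 20 x 20 and are converted to log-odds scores afterwards.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(matrix, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Toy "PAM 1": each letter changes with probability 0.01 per step (invented).
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

for n in (80, 120, 250):
    # Probability that letter 0 has been replaced by letter 1 after n steps.
    print(n, round(mat_power(TOY_PAM1, n)[0][1], 3))
```

The replacement probability grows with n, which is why larger PAM numbers suit more distant sequences.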
![Page 44: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/44.jpg)
44
Example: PAM 250
Similar amino acids have higher scores
![Page 45: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/45.jpg)
45
PAM - limitations
• Based on a single, limited dataset
• Examines proteins with few differences (85% identity)
• Based mainly on small globular proteins so the matrix is biased
![Page 46: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/46.jpg)
46
BLOSUM
• Henikoff and Henikoff (1992) derived a set of matrices based on a much larger dataset
• BLOSUM observes significantly more replacements than PAM, even for infrequent pairs
![Page 47: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/47.jpg)
47
BLOSUM: Blocks Substitution Matrix
• Based on the BLOCKS database
– ~2000 blocks from 500 families of related proteins
– Families of proteins with identical function
• Blocks are short conserved patterns of 3-60 amino acids without gaps
AABCDA----BBCDA
DABCDA----BBCBB
BBBCDA-AA-BCCAA
AAACDA-A--CBCDB
CCBADA---DBBDCC
AAACAA----BBCCC
![Page 48: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/48.jpg)
48
BLOSUM
• Each block represents a sequence alignment with different identity percentage
• For each block the amino-acid substitution rates were calculated to create the BLOSUM matrix
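The step from observed substitutions to matrix entries is usually a log-odds calculation: how often a pair is observed compared with how often it would be expected by chance. The sketch below follows the general scheme described by Henikoff and Henikoff, with scores in half-bit units. The 2-letter alphabet and the pair counts are invented for illustration, and every pair is assumed to have a non-zero count.

```python
import math


def log_odds_scores(pair_counts):
    """Turn unordered pair counts into rounded log-odds scores (half-bits)."""
    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency of each letter: identical pairs count fully,
    # mixed pairs contribute half to each of their two letters.
    letters = sorted({x for pair in q for x in pair})
    p = {}
    for a in letters:
        p[a] = sum(freq if pair == (a, a) else freq / 2
                   for pair, freq in q.items() if a in pair)

    scores = {}
    for (x, y), freq in q.items():
        expected = p[x] * p[x] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = round(2 * math.log2(freq / expected))
    return scores


# Hypothetical pair counts pooled over the columns of some blocks.
toy_counts = {("A", "A"): 6, ("A", "B"): 2, ("B", "B"): 2}
print(log_odds_scores(toy_counts))
```

Pairs seen more often than expected by chance get positive scores, and pairs seen less often get negative scores.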
![Page 49: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/49.jpg)
49
BLOSUM Matrices
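• BLOSUMn is built from blocks in which sequences sharing at least n percent identity are clustered and counted as a single sequence, so the matrix reflects substitutions between sequences that are less than n percent identical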
• BLOSUM62 represents closer sequences than BLOSUM45
![Page 50: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/50.jpg)
50
Example : Blosum62
Derived from blocks in which sequences sharing at least 62% identity are clustered and weighted as a single sequence
![Page 51: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/51.jpg)
51
PAM vs. BLOSUM
Roughly equivalent matrices, with more distant sequences toward the bottom:
PAM100 ≈ BLOSUM90
PAM120 ≈ BLOSUM80
PAM160 ≈ BLOSUM60
PAM200 ≈ BLOSUM52
PAM250 ≈ BLOSUM45
![Page 52: 1 Lesson 2 Aligning sequences and searching databases](https://reader036.vdocuments.site/reader036/viewer/2022062307/551bea27550346be588b6284/html5/thumbnails/52.jpg)
52
Intermediate summary
1. Scoring system = substitution matrix + gap penalty.
2. Used for both global and local alignment
3. For amino acids, there are two commonly used families of substitution matrices: PAM and BLOSUM.
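Point 1 can be made concrete by scoring a gapped alignment with a substitution matrix and a gap penalty together. This is a sketch under stated assumptions: the function name, the uniform DNA matrix (2 / -6), and the affine gap values (-5 to open, -1 to extend) are illustrative choices, not a prescribed scoring system.

```python
def score_with_system(top, bottom, matrix, gap_open=-5, gap_extend=-1):
    """Score a gapped alignment: substitution matrix + affine gap penalty."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    open_gap_row = None  # which row holds the currently open gap, if any
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            # Continuing a gap in the same row is cheaper than opening one.
            total += gap_extend if open_gap_row == row else gap_open
            open_gap_row = row
        else:
            total += matrix[(x, y)]
            open_gap_row = None
    return total


# Uniform DNA matrix: 2 for a match, -6 for a mismatch.
UNIFORM_DNA = {(x, y): (2 if x == y else -6) for x in "AGCT" for y in "AGCT"}

print(score_with_system("AGGGTTC-GA", "AGGGTTCTGA", UNIFORM_DNA))  # 13
print(score_with_system("AGGG----GA", "AGGGTTCTGA", UNIFORM_DNA))  # 4
```

The same function serves global and local alignment alike, since both modes only differ in which alignments they search over, not in how a given alignment is scored.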