Lecture 3 – Sequence alignment
15th September 2010
Bioinformatics course
Course webpage: http://courses.cs.ut.ee/2010/bioinformatics/
- Lecture 1 – Introduction to Bioinformatics
- Lecture 2 – Biological Databases, Assignment 1
- Lecture 3 – Sequence alignments, Assignment 2
- Lecture 4 – Sequence alignments, Assignment 3
Outline
- Introduction
- Sequence similarity searches
- Similarity and homology
- Sequence alignment
- Alignment algorithms
- Scores
Search word? Word, book name, text, title, reference, …
Organisms/Species, cells (DNA-RNA-Protein)
Stored in databases. What next?
DNA Mutation and Repair
- A mutation, which may arise during replication and/or recombination, is a permanent change in the nucleotide sequence of DNA.
- Damaged DNA can be mutated by substitution, deletion or insertion of base pairs.
- Mutations, for the most part, are harmless except when they lead to cell death or tumor formation.
- Because of the lethal potential of DNA mutations, cells have evolved mechanisms for repairing damaged DNA.

Types of Mutations
- There are three types of DNA mutations: base substitutions, deletions and insertions.
Mutations/substitutions
Point mutations
Codon table
Mutations / substitutions
- Synonymous substitutions: TTC (Phe) >>> TTT (Phe)
- Non-synonymous substitutions: TTC (Phe) >>> TTA (Leu)
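
The distinction between the two kinds of substitution can be checked with a small lookup. This is only a sketch: the table holds just four codons of the standard genetic code, and the function name `classify_substitution` is made up for illustration.

```python
# Tiny excerpt of the standard genetic code; only the codons needed here.
CODON_TO_AA = {
    "TTT": "Phe",
    "TTC": "Phe",
    "TTA": "Leu",
    "TTG": "Leu",
}


def classify_substitution(before: str, after: str) -> str:
    """Return 'synonymous' if both codons encode the same amino acid,
    otherwise 'non-synonymous'. Codons must be present in CODON_TO_AA."""
    aa_before = CODON_TO_AA[before]
    aa_after = CODON_TO_AA[after]
    return "synonymous" if aa_before == aa_after else "non-synonymous"


print(classify_substitution("TTC", "TTT"))  # Phe -> Phe
print(classify_substitution("TTC", "TTA"))  # Phe -> Leu
```

A full version would use all 64 codons; the excerpt keeps the example short.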
Why or when to compare two sequences?
- Are they homologous, i.e. do they share a common ancestor?
- Do they share the same domain?
- Identify the exact locations of common features, e.g. active sites
- Compare a gene and its product
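Q. Is similarity due to chance or to ancestry?
- Homology and similarity are often used interchangeably, although homology specifically means shared ancestry
- Alignment can reveal homology
- Homologs can be orthologous (separated by speciation) or paralogous (separated by gene duplication)
- Similarity: the sequence in question shows some degree of match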
Sequence
- A sequence in question: Query
- A matching sequence: Hit
Principles of sequence alignment (figure with four numbered panels, 1 to 4)
Key concepts in sequence alignment
Goal: locate equivalent regions of two or more sequences so as to maximize their similarity
Score: identity = 85%
Sequences of the same length
Score: identity = 30%
Sequences of different length
Score: identity = 69%
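
An identity score of this kind can be computed from two aligned rows. The sketch below assumes one common convention (identical columns divided by the total number of alignment columns, gaps written as `-`); other tools divide by the shorter sequence length or by gap-free columns only, so the slides' figures may follow a different convention.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns that hold the same residue in both rows.

    Assumption: both rows are already aligned (equal length, '-' marks a gap)
    and the denominator is the full alignment length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


print(percent_identity("ACATG", "CATTG"))    # ungapped rows
print(percent_identity("ACA-TG", "-CATTG"))  # one gap in each row
```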
When do you say sequences are homologous?
- Nucleotide: if the paired sequences share at least 70% identity over more than 100 bases (E-value lower than 10e-4)
- Protein: if the paired sequences share at least 25% identity over more than 100 amino acids (E-value lower than 10e-4)
Choosing a method
Pairwise comparisons

| Method | Situations |
| --- | --- |
| Dot plot | General exploration of your sequence; discovering repeats; finding long insertions/deletions; extracting portions of sequences to make a multiple alignment |
| Local alignments | Comparing sequences with partial homology; making high-quality alignments; making residue-per-residue analysis |
| Global alignments | Comparing two sequences over their entire length; identifying long insertions/deletions; checking the quality of your data; identifying every mutation in your sequence |

BLAST
FASTA
Dot plot
Definition: a graphical method that allows the comparison of two biological sequences and identifies regions of close similarity between them
Compare each sequence against the other
Results:
- Repeated regions / domains
- Regions with small motifs repeated many times (low complexity)
- Palindromes (portions of DNA repeated in opposite directions)
- Potential secondary structures in RNA
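
A text-mode dot plot takes only a few lines. This sketch assumes the simplest variant (window size 1, no noise filtering), which real dot-plot tools usually refine; the function name `dot_plot` is illustrative.

```python
def dot_plot(seq_x: str, seq_y: str) -> list[str]:
    """One text row per residue of seq_y; '*' marks positions where the
    residue of seq_y equals the residue of seq_x, '.' everything else."""
    return [
        "".join("*" if y == x else "." for x in seq_x)
        for y in seq_y
    ]


seq_x, seq_y = "ACATG", "CATTG"
print("  " + seq_x)
for label, row in zip(seq_y, dot_plot(seq_x, seq_y)):
    print(label, row)
```

Runs of `*` along a diagonal correspond to stretches that match without gaps; a shift from one diagonal to another suggests an insertion or deletion.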
Aligning text
Raw data: A C A T G and C A T T G
How many possible ways can we align them?
Aligning text
Raw data: A C A T G and C A T T G

2 matches, 0 gaps

    A C A T G
          | |
    C A T T G

3 matches (2 gaps in ends)

    A C A T G .
      | | |
    . C A T T G

4 matches, 1 insertion

    A C A - T G
      | |   | |
    . C A T T G

4 matches, 1 insertion

    A C A T - G
      | | |   |
    . C A T T G

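
One way to answer the "how many ways" question is to count every global alignment as a path through a grid, where each step is an aligned pair, a gap in the first sequence, or a gap in the second. The sketch below assumes that counting convention (alignments that differ only in the order of adjacent gaps count as distinct); the function name is illustrative.

```python
def count_alignments(m: int, n: int) -> int:
    """Number of global alignments of sequences of length m and n,
    counted as grid paths built from three kinds of column:
    residue/residue, residue/gap, gap/residue."""
    # counts[i][j] = number of ways to align the first i and first j residues.
    # With one sequence empty there is exactly one way (all gaps).
    counts = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (
                counts[i - 1][j - 1]  # last column pairs two residues
                + counts[i - 1][j]    # last column: residue over gap
                + counts[i][j - 1]    # last column: gap over residue
            )
    return counts[m][n]


print(count_alignments(5, 5))  # two 5-letter sequences such as ACATG / CATTG
```

The count grows very quickly with sequence length, which is why enumerating all alignments by hand stops being practical almost immediately.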
Dynamic programming
What to do if the text is bigger?
SSDSEREEHVKRFRQALDDTGMKVPMATTNLFTHPVFKDGGFTANDRDVRRYALRKTIRNIDLAVELGAETYVAWGGREGAESGGAKDVRDALDRMKEAFDLLGEYVTSQGYDIRFAIEP
KPNEPRGDILLPTVGHALAFIERLERPELYGVNPEVGHEQMAGLNFPHGIAQALWAGKLFHIDLNGQNGIKYDQDLRFGAGDLRAAFWLVDLLESAGYSGPRHFDFKPPRTEDFDGVWAS
Needleman-Wunsch (1970) provided the first automatic method: dynamic programming to find a global alignment
Aligning a 4 character A C B P
A C P M - A C P M A C - P M
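
Global alignment by dynamic programming in the style of Needleman-Wunsch can be sketched as below. The scores (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not values from the lecture, and the sketch reads the final score from the bottom-right cell, so end gaps are penalized like any other gap.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: a prefix aligned against nothing but gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the matrix: each cell takes the best of three predecessors.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )
    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("ACATG", "CATTG")
print(best)
print(row_a)
print(row_b)
```

When several alignments share the best score, the traceback order (diagonal first, then up, then left) decides which one is returned.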
Global and Local alignments
Local alignment
- The scoring system uses negative scores for mismatches
- The minimum score at a matrix element is zero
- Find the best score anywhere in the matrix (not just the last column or row)

These three changes cause the algorithm to seek high-scoring subsequences, which are not penalized for their global effects, which don't include areas of poor match, and which can occur anywhere
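
The three changes can be seen in a short local-alignment sketch in the style of Smith-Waterman. The scores (match +2, mismatch -1, gap -2) are assumptions for illustration only, and ties for the best cell are resolved by keeping the first one found.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                          # change 2: never drop below zero
                score[i - 1][j - 1] + sub,  # change 1: mismatches score negative
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:          # change 3: best cell anywhere
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGGACGGGG"))
```

Only the shared segment is reported; the poorly matching flanks contribute nothing because the running score is reset to zero there.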
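Scoring matrices (BLOSUM, PAM), e.g.

|   | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A | 4 | -1 | -2 | -2 | 0 | -1 | -1 | 0 | -2 | -1 | -1 | -1 | -1 | -2 | -1 | 1 | 0 | -3 | -2 | 0 |
| R | -1 | 5 | 0 | -2 | -3 | 1 | 0 | -2 | 0 | -3 | -2 | 2 | -1 | -3 | -2 | -1 | -1 | -3 | -2 | -3 |
| N | -2 | 0 | 6 | 1 | -3 | 0 | 0 | 0 | 1 | -3 | -3 | 0 | -2 | -3 | -2 | 1 | 0 | -4 | -2 | -3 |
| D | -2 | -2 | 1 | 6 | -3 | 0 | 2 | -1 | -1 | -3 | -4 | -1 | -3 | -3 | -1 | 0 | -1 | -4 | -3 | -3 |
| C | 0 | -3 | -3 | -3 | 8 | -3 | -4 | -3 | -3 | -1 | -1 | -3 | -1 | -2 | -3 | -1 | -1 | -2 | -2 | -1 |
| Q | -1 | 1 | 0 | 0 | -3 | 5 | 2 | -2 | 0 | -3 | -2 | 1 | 0 | -3 | -1 | 0 | -1 | -2 | -1 | -2 |
| E | -1 | 0 | 0 | 2 | -4 | 2 | 5 | -2 | 0 | -3 | -3 | 1 | -2 | -3 | -1 | 0 | -1 | -3 | -2 | -2 |
| G | 0 | -2 | 0 | -1 | -3 | -2 | -2 | 6 | -2 | -4 | -4 | -2 | -3 | -3 | -2 | 0 | -2 | -2 | -3 | -3 |
| H | -2 | 0 | 1 | -1 | -3 | 0 | 0 | -2 | 7 | -3 | -3 | -1 | -2 | -1 | -2 | -1 | -2 | -2 | 2 | -3 |
| I | -1 | -3 | -3 | -3 | -1 | -3 | -3 | -4 | -3 | 4 | 2 | -3 | 1 | 0 | -3 | -2 | -1 | -3 | -1 | 3 |
| L | -1 | -2 | -3 | -4 | -1 | -2 | -3 | -4 | -3 | 2 | 4 | -2 | 2 | 0 | -3 | -2 | -1 | -2 | -1 | 1 |
| K | -1 | 2 | 0 | -1 | -3 | 1 | 1 | -2 | -1 | -3 | -2 | 5 | -1 | -3 | -1 | 0 | -1 | -3 | -2 | -2 |
| M | -1 | -1 | -2 | -3 | -1 | 0 | -2 | -3 | -2 | 1 | 2 | -1 | 5 | 0 | -2 | -1 | -1 | -1 | -1 | 1 |
| F | -2 | -3 | -3 | -3 | -2 | -3 | -3 | -3 | -1 | 0 | 0 | -3 | 0 | 6 | -4 | -2 | -2 | 1 | 3 | -1 |
| P | -1 | -2 | -2 | -1 | -3 | -1 | -1 | -2 | -2 | -3 | -3 | -1 | -2 | -4 | 6 | -1 | -1 | -4 | -3 | -2 |
| S | 1 | -1 | 1 | 0 | -1 | 0 | 0 | 0 | -1 | -2 | -2 | 0 | -1 | -2 | -1 | 4 | 1 | -3 | -2 | -2 |
| T | 0 | -1 | 0 | -1 | -1 | -1 | -1 | -2 | -2 | -1 | -1 | -1 | -1 | -2 | -1 | 1 | 5 | -2 | -2 | 0 |
| W | -3 | -3 | -4 | -4 | -2 | -2 | -3 | -2 | -2 | -3 | -2 | -3 | -1 | 1 | -4 | -3 | -2 | 10 | 2 | -3 |
| Y | -2 | -2 | -2 | -3 | -2 | -1 | -2 | -3 | 2 | -1 | -1 | -2 | -1 | 3 | -3 | -2 | -2 | 2 | 6 | -1 |
| V | 0 | -3 | -3 | -3 | -1 | -2 | -2 | -3 | -3 | 3 | 1 | -2 | 1 | -1 | -2 | -2 | 0 | -3 | -1 | 4 |
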
Where do matrices come from?
- Manually align protein structures
- Look at the frequency of amino acid substitutions at structurally constant sites
- Compute log-odds: $S(aa_1, aa_2) = \log_2 \left( \mathrm{freq}(O) / \mathrm{freq}(E) \right)$, where O = observed exchanges and E = expected exchanges
- odds = freq(observed) / freq(expected), and $S_{ij}$ = log odds
- freq(expected) = $f(i) \cdot f(j)$, the chance of getting amino acid i in a column and then having it change to j
- e.g. an A-R pair observed only a tenth as often as expected
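
The log-odds arithmetic for a pair seen a tenth as often as expected can be worked through directly. The background frequencies below are placeholders picked for the example, not measured amino acid frequencies, and published matrices additionally scale and round the result.

```python
import math


def expected_pair_freq(f_i: float, f_j: float) -> float:
    """Expected frequency of the pair (i, j) if residues were independent."""
    return f_i * f_j


def log_odds_score(observed: float, expected: float) -> float:
    """Base-2 log of the observed/expected ratio."""
    return math.log2(observed / expected)


# Placeholder background frequencies (assumptions, not real data).
f_a, f_r = 0.08, 0.05
expected = expected_pair_freq(f_a, f_r)
observed = expected / 10  # pair seen a tenth as often as expected

print(round(log_odds_score(observed, expected), 2))
```

A ratio below 1 gives a negative score, a ratio above 1 a positive score, and a pair seen exactly as often as expected scores 0.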
Local vs. Global Alignment
GLOBAL
- Best alignment of the entirety of both sequences
- For optimum global alignment, we want the best score in the final row or final column
- Are these sequences generally the same?
- Needleman-Wunsch: find the alignment in which the total score is highest, perhaps at the expense of areas of great local similarity

LOCAL
- Best alignment of segments, without regard to the rest of the sequence
- For optimum local alignment, we want the best score anywhere in the matrix
- Do these two sequences contain high-scoring subsequences?
- Smith-Waterman: find the alignment in which the highest-scoring subsequences are identified, at the expense of the overall score
Global vs local alignments (figure with two panels: Global, Local)
BLAST
- Extend hits into High Scoring Segment Pairs (HSPs)
- Stop extension when the total score doesn’t increase
- Starts with all overlapping words from the query
- Calculates the “neighborhood” of each word using a PAM matrix and probability threshold
- Looks up all words and neighbors from the query in the database index
- Extends High Scoring Pairs (HSPs) left and right to maximal length
- Finds Maximal Segment Pairs (MSPs) between query and database
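
The word-lookup step can be illustrated with a toy index. This is a heavily simplified sketch, not BLAST itself: it finds exact word matches only, skips the neighborhood calculation and the extension into HSPs, and all names, the word size of 3, and the two database sequences are made up for the example.

```python
from collections import defaultdict


def overlapping_words(seq: str, word_size: int = 3):
    """All overlapping words of length word_size, with their start positions."""
    return [(i, seq[i:i + word_size]) for i in range(len(seq) - word_size + 1)]


def build_index(database: dict, word_size: int = 3):
    """Map each word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in overlapping_words(seq, word_size):
            index[word].append((name, pos))
    return index


def find_seeds(query: str, index, word_size: int = 3):
    """Exact word hits between the query and the indexed database."""
    seeds = []
    for q_pos, word in overlapping_words(query, word_size):
        for name, db_pos in index.get(word, []):
            seeds.append((word, q_pos, name, db_pos))
    return seeds


# Hypothetical two-sequence database and query.
database = {"seq1": "MKVPMATTNLF", "seq2": "GGREGAESGG"}
index = build_index(database)
for seed in find_seeds("PMATT", index):
    print(seed)
```

Each seed would then be the starting point for extension to the left and right, which is the part this sketch leaves out.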
Types of BLAST
Practicals and assignments