Inferring function by homology
The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology searching, the equivalent genes in one species to those known to be important in other model species.
Logic: if the linear alignment of a pair of sequences is similar, then we can infer that the 3-dimensional structure is similar; if the 3-D structure is similar then there is a good chance that the function is similar.
BASIC LOCAL ALIGNMENT SEARCH TOOL (BLAST)
BLAST programs (there are several) compare a query sequence to all the sequences in a database in a pairwise manner.
BLAST breaks the query and database sequences into fragments known as "words", and seeks matches between them.
It attempts to align query words of length "W" to words in the database such that the alignment scores at least a threshold value, "T". These matches are known as High-Scoring Segment Pairs (HSPs).
HSPs are then extended in either direction in an attempt to generate an alignment with a score exceeding another threshold, "S", known as a Maximal-Scoring Segment Pair (MSP).
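The word-matching step can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it finds exact word matches of length W between two sequences, whereas protein BLAST also accepts similar words that score at least T under a substitution matrix, and it leaves out the extension step. The function names `words` and `find_seeds` are made up for this example.

```python
from collections import defaultdict


def words(seq, w):
    """Return (offset, word) for every overlapping word of length w in seq."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def find_seeds(query, subject, w=3):
    """Return (query_offset, subject_offset, word) for each exact word match."""
    # Index the subject words once so each query word is a dictionary lookup.
    index = defaultdict(list)
    for j, word in words(subject, w):
        index[word].append(j)
    return [(i, j, word)
            for i, word in words(query, w)
            for j in index.get(word, [])]


seeds = find_seeds("GARFIELDTHECAT", "GARFIELDTHERAT", w=3)
for q_off, s_off, word in seeds:
    print(q_off, s_off, word)
```

Each seed printed here is a candidate starting point; a real search would then try to extend it in both directions.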
2 sequence alignment
To align GARFIELDTHECAT with GARFIELDTHERAT is easy:
GARFIELDTHECAT
||||||||||| ||
GARFIELDTHERAT
Gaps
Sometimes, you can get a better overall alignment if you insert gaps:
GARFIELDTHECAT
||||||||   |||
GARFIELDA--CAT
is better (scores higher) than
GARFIELDTHECAT
||||||||
GARFIELDACAT
No gap penalty
But there has to be some sort of gap penalty, otherwise you can align ANY two sequences:
G-R--E------AT
| |  |      ||
GARFIELDTHECAT
Affine gap penalty
Could set a single score for each indel.
Usually use an affine penalty (open + extend), so that opening a gap costs much more than lengthening one.
Example: open -10, extend -0.05
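A small Python sketch of the affine scheme, using the open and extend values quoted above. The convention assumed here is that the first gapped position pays the open penalty and each further position pays the extend penalty; some programs instead charge open plus extend for every position, so treat the exact numbers as illustrative. The function names are invented for this example.

```python
import re


def affine_gap_penalty(length, gap_open=-10.0, gap_extend=-0.05):
    """Penalty for one gap: open once, then extend for each further position."""
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


def gap_lengths(aligned):
    """Lengths of each run of '-' in one row of an alignment."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned)]


def total_gap_penalty(aligned):
    """Sum of the affine penalties for every gap in an aligned sequence."""
    return sum(affine_gap_penalty(n) for n in gap_lengths(aligned))


for row in ("GARFIELDA--CAT", "G-R--E------AT"):
    print(row, gap_lengths(row), f"{total_gap_penalty(row):.2f}")
```

With these values the three-gap row G-R--E------AT is penalised roughly three times as heavily as the single-gap row, which is what stops arbitrary sequences from being aligned for free.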
2+ similar sequences
When doing a similarity search against a database you are trying to decide which of many sequences is the CLOSEST match to your search sequence.
Which of the following alignment pairs is better?
Scoring Alignments
GARFIELDTHECAT
||||   |||||||
GARFRIEDTHECAT

GARFIELDTHECAT
||| |||  |||||
GARWIELESHECAT

GARFIELDTHECAT
||  ||||||| ||
GAVGIELDTHEMAT
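As a quick check, the three alignments can be scored by simply counting identical positions. The Python sketch below (the function name `identities` is made up for this example) shows that this naive count cannot separate them, which is one motivation for scoring each substitution by how likely it is.

```python
def identities(a, b):
    """Count aligned positions where both rows hold the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    return sum(x == y and x != "-" for x, y in zip(a, b))


query = "GARFIELDTHECAT"
candidates = ["GARFRIEDTHECAT", "GARWIELESHECAT", "GAVGIELDTHEMAT"]

for candidate in candidates:
    print(candidate, identities(query, candidate))
```

All three candidates come out with the same number of identities, so a simple match count gives no way to rank them.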
Willie Taylor’s AA Venn Diagram
Substitution matrices
#BLOSUM 90
   A  R  N  D  C  Q  E  G  H  I  L
A  5 -2 -2 -3 -1 -1 -1  0 -2 -2 -2
R -2  6 -1 -3 -5  1 -1 -3  0 -4 -3
N -2 -1  7  1 -4  0 -1 -1  0 -4 -4
D -3 -3  1  7 -5 -1  1 -2 -2 -5 -5
C -1 -5 -4 -5  9 -4 -6 -4 -5 -2 -2
Q -1  1  0 -1 -4  7  2 -3  1 -4 -3
E -1 -1 -1  1 -6  2  6 -3 -1 -4 -4
G  0 -3 -1 -2 -4 -3 -3  6 -3 -5 -5
H -2  0  0 -2 -5  1 -1 -3  8 -4 -4
I -2 -4 -4 -5 -2 -4 -4 -5 -4  5  1
L -2 -3 -4 -5 -2 -3 -4 -5 -4  1  5
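To show how such a matrix is used, the Python sketch below loads the excerpt shown on this slide (only the eleven residues A R N D C Q E G H I L) and sums the pair scores along an ungapped alignment. The helper names and the test words LEADER, LEANER and LEACER are invented for this example; a real search would use the full 20-residue matrix.

```python
BLOSUM90_EXCERPT = """
   A  R  N  D  C  Q  E  G  H  I  L
A  5 -2 -2 -3 -1 -1 -1  0 -2 -2 -2
R -2  6 -1 -3 -5  1 -1 -3  0 -4 -3
N -2 -1  7  1 -4  0 -1 -1  0 -4 -4
D -3 -3  1  7 -5 -1  1 -2 -2 -5 -5
C -1 -5 -4 -5  9 -4 -6 -4 -5 -2 -2
Q -1  1  0 -1 -4  7  2 -3  1 -4 -3
E -1 -1 -1  1 -6  2  6 -3 -1 -4 -4
G  0 -3 -1 -2 -4 -3 -3  6 -3 -5 -5
H -2  0  0 -2 -5  1 -1 -3  8 -4 -4
I -2 -4 -4 -5 -2 -4 -4 -5 -4  5  1
L -2 -3 -4 -5 -2 -3 -4 -5 -4  1  5
"""


def parse_matrix(text):
    """Turn the whitespace-separated table into a {(row, col): score} dict."""
    rows = [line.split() for line in text.strip().splitlines()]
    columns = rows[0]
    return {(row[0], col): int(value)
            for row in rows[1:]
            for col, value in zip(columns, row[1:])}


def alignment_score(a, b, matrix):
    """Sum the substitution scores along an ungapped, equal-length alignment."""
    return sum(matrix[(x, y)] for x, y in zip(a, b))


matrix = parse_matrix(BLOSUM90_EXCERPT)
for other in ("LEADER", "LEANER", "LEACER"):
    print("LEADER", other, alignment_score("LEADER", other, matrix))
```

Swapping D for the chemically similar N costs far less than swapping D for C, so two alignments with the same number of identities can still receive different scores.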
Low Complexity Masking
Some sequences are similar even if they have no recent common ancestor.
Huntington's disease is caused by poly-CAG tracts in the DNA, which result in polyglutamine (Gln, Q) tracts in the protein.
If you do a homology search with QQQQQQQQQQ you get hits to other proteins that have a lot of glutamines but have totally different functions.
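One way to picture masking is to slide a window along the sequence and replace any window with very low compositional variety by X. The Python sketch below does this with Shannon entropy; it is a simple stand-in, not the filter BLAST itself uses (the SEG and DUST programs use their own complexity measures), and the window size and entropy threshold are arbitrary values chosen for this example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=1.5):
    """Replace every window whose entropy is below min_entropy with X."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            masked[start:start + window] = "X" * window
    return "".join(masked)


# N-terminal huntingtin residues as shown on the slides, with spaces removed.
huntingtin = "".join(
    "MATLEKLMKA FESLKSFQQQ QQQQQQQQQQ QQQQQQQQQQ PPPPPPPPPP PQLPQPPPQA".split()
)
print(huntingtin)
print(mask_low_complexity(huntingtin))
```

The masked residues are ignored when seeding a search, so the polyglutamine tract no longer pulls in unrelated glutamine-rich proteins.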
2 sequence alignment
Huntingtin:
MATLEKLMKA FESLKSFQQQ QQQQQQQQQQ
QQQQQQQQQQ PPPPPPPPPP PQLPQPPPQA
hits
>MM16_MOUSE MATRIX METALLOPROTEINASE-16
Score = 34.4 bits (78), Expect = 0.18
Identities = 21/65 (32%), Positives = 25/65 (38%), Gaps = 2/65 (3%):
FQQQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQ--AQPLLPQPQPPPPPP
F Q  +    +      Q  Q+   PP   PPP   LP  PP     P  P+    P PP
FYQYMETDNFKLPNDDLQGIQKIYGPPDKIPPPTRPLPTVPPHRSVPPADPRRHDRPKPP
But not because it is involved in microtubule-mediated transport!
E values
An E-value is the number of hits with a score at least this good that you would expect to see by chance. For small values it is close to the probability of the hit occurring by chance.
It depends on the size of the query sequence and of the database.
The lower the E-value, the more confidence you can have that a hit is a true homologue (a sequence related by common descent).
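The dependence on query and database size can be made concrete with the standard relationship between a bit score S' and the expected number of chance hits, E = m × n × 2^(-S'), where m is the query length and n the database length. The Python sketch below assumes this textbook form; real BLAST output uses corrected "effective" lengths, and the lengths used here are hypothetical values chosen for illustration, not the database behind the hit shown earlier.

```python
import math


def expected_hits(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def chance_probability(e_value):
    """Probability of seeing at least one such chance hit."""
    return -math.expm1(-e_value)


# Hypothetical sizes: a 60-residue query against databases of two sizes.
for db_len in (1_000_000, 100_000_000):
    e = expected_hits(34.4, query_len=60, db_len=db_len)
    print(db_len, f"E = {e:.3g}", f"P = {chance_probability(e):.3g}")
```

The same bit score becomes less convincing as the database grows, and each extra bit of score halves the expected number of chance hits.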
Dotplot theory
  A T G A T A T T C T T
A . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
G . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
C . . . . . . . . . . .
Task: align ATGATATTCTT and ATTGTTC
Another way of comparing 2 sequences
  A T G A T A T T C T T
A . . . . . . . . . . .
T . + . . + . + . . + .
T . . . . . . . . . . .
G . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
C . . . . . . . . . . .
Go along the first sequence inserting a + wherever at least 2 of the 3 bases in a moving window match. Here the first sequence is compared to ATT (the first 3 bases in the vertical sequence).
  A T G A T A T T C T T
A . . . . . . . . . . .
T . + . . + . + . . + .
T . + . . . . . + . . .
G . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
C . . . . . . . . . . .
Then go along the first sequence again, inserting a + wherever at least 2 of the 3 bases in a moving window match. This time the first sequence is compared to TTG (the next 3 bases in the vertical sequence).
  A T G A T A T T C T T
A . . . . . . . . . . .
T . + . . + . + . . + .
T . + . . . . . + . . .
G . . + . . . . . + . .
T . . . + . . . . . + .
T . . . . . . . + . . .
C . . . . . . . . . . .
Iterate until the whole vertical sequence has been covered:
  A T G A T A T T C T T
A
T   +     +   +     +
T   +           +
G     +           +
T       +           +
T               +
C
The human eye is particularly good at picking up structure from the pattern of dots. You might see a hint of a duplicated region in the horizontal sequence that is not so clear from the sequence itself.
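The moving-window procedure is easy to automate. The Python sketch below assumes a window of 3 bases, a threshold of at least 2 matches, and that the + is placed at the centre of the window, which is how the grids above appear to be drawn. The function names are made up for this example. Applying the rule mechanically can mark a few cells that the hand-drawn grids leave blank.

```python
def dotplot(horizontal, vertical, window=3, min_matches=2):
    """Return a grid of '.' and '+' with '+' at matching window centres."""
    half = window // 2
    grid = [["." for _ in horizontal] for _ in vertical]
    for r in range(len(vertical) - window + 1):
        for c in range(len(horizontal) - window + 1):
            matches = sum(horizontal[c + k] == vertical[r + k]
                          for k in range(window))
            if matches >= min_matches:
                grid[r + half][c + half] = "+"
    return grid


def render(horizontal, vertical, grid):
    """Format the grid with the two sequences as column and row labels."""
    lines = ["  " + " ".join(horizontal)]
    for base, row in zip(vertical, grid):
        lines.append(base + " " + " ".join(row))
    return "\n".join(lines)


grid = dotplot("ATGATATTCTT", "ATTGTTC")
print(render("ATGATATTCTT", "ATTGTTC", grid))
```

Raising `window` and `min_matches` together is the usual way to suppress background noise when the sequences are longer than this toy example.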