Lecture 4, BNFO 235, Usman Roshan. IUPAC nucleic acid symbols
Post on 21-Dec-2015
Splitting and joining strings
• split: splits a string by a regular expression and returns an array
– @s = split(/,/);
– @s = split(/\s+/);
• join: joins the elements of an array and returns a string (the opposite of split)
– $seq = join("", @pieces);
– $seq = join("X", @pieces);
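The two functions can be tried together in a short script. This is a minimal sketch: the sample strings and variable names are made up for illustration. The slide calls split with only a pattern, which splits the default variable $_; here the string to split is passed explicitly as the second argument.

```perl
use strict;
use warnings;

# Split a comma-separated line into fields.
my $line   = "ACGT,TTGA,CCAT";
my @fields = split(/,/, $line);      # ("ACGT", "TTGA", "CCAT")

# Split on runs of whitespace (spaces or tabs).
my $text  = "ACGT  TTGA\tCCAT";
my @words = split(/\s+/, $text);     # ("ACGT", "TTGA", "CCAT")

# Join the pieces back together, without and with a separator.
my $seq    = join("", @fields);      # "ACGTTTGACCAT"
my $marked = join("X", @fields);     # "ACGTXTTGAXCCAT"

print "$seq\n$marked\n";
```

Joining with an empty string is a common way to rebuild a sequence that was read from a file in several pieces.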
Searching and substitution
• $x =~ /$y/ --- true if the expression $y is found in $x
• $x =~ /ATG/ --- true if the start codon ATG is found in $x
• $x !~ /GC/ --- true if GC is not found in $x
• $x =~ s/T/U/g --- replace all T's with U's
• $x =~ s/g/G/g --- convert all lower-case g to upper-case G
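The operators above can be combined in a small script. This is a sketch under assumptions: the subroutine name to_rna and the sample sequence are hypothetical and chosen only to show that matching is case-sensitive.

```perl
use strict;
use warnings;

# Return an RNA copy of a DNA string: fix lower-case g, then turn T into U.
sub to_rna {
    my ($dna) = @_;
    $dna =~ s/g/G/g;    # convert every lower-case g to upper-case G
    $dna =~ s/T/U/g;    # replace every T with U
    return $dna;
}

my $x = "AAATGgCTT";

print "contains ATG\n"     if $x =~ /ATG/;      # true
print "no upper-case GC\n" if $x !~ /GC/;       # true: "gC" does not match /GC/

my $motif = "CTT";
print "contains $motif\n"  if $x =~ /$motif/;   # pattern taken from a variable

print to_rna($x), "\n";                         # AAAUGGCUU
```

Because to_rna works on a copy of its argument, $x itself is left unchanged; applying s/// directly to $x would modify it in place.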
DNA Sequence Evolution
[Figure: an evolutionary tree in which an ancestral sequence mutates over three million years; _ marks a gap.]
-3 mil yrs: AAGACTT
-2 mil yrs: T_GACTT, AAGGCTT
-1 mil yrs: _GGGCTT, TAGACCTT, A_CACTT
today: TAGGCCTT (Human), TAGCCCTTA (Monkey), ACCTT (Cat), ACACTTC (Lion), GGCTT (Mouse)
today, with gaps shown: TAGGCCTT (Human), TAGCCCTTA (Monkey), A_C_CTT (Cat), A_CACTTC (Lion), _G_GCTT (Mouse)
Comparative Bioinformatics
• Fundamental notion of biology: all life is related by an unknown evolutionary Tree of Life.
• Therefore, if we know something about one species we can make inferences about other ones.
• Also, by comparing multiple species we can make inferences about sets of species.
• How do we compare DNA or protein sequences of two different species?
Comparative Bioinformatics
• We need to know how often mutations such as A to T or A to C occur.
• To determine this we manually create a set of “true” alignments and estimate the likelihood of, for example, A changing to C by counting the number of times A changes to C and computing related statistics.
• Now we have a realistic “scoring matrix” which can be used to evaluate how closely related two species are based on their DNA.
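The counting and scoring steps can be sketched in Perl. This is an illustration under assumptions, not the course's reference solution: the subroutine names are hypothetical, the two sequence pairs are taken from the evolution figure and treated as if already aligned, and the matrix values are the example ones listed under Problems.

```perl
use strict;
use warnings;

# Count how often base $x in the first sequence is aligned with base $y in
# the second, over a list of aligned pairs. Columns with a gap (_) are skipped.
sub count_substitutions {
    my (@pairs) = @_;
    my %count;
    for my $pair (@pairs) {
        my ($s1, $s2) = @$pair;
        die "aligned sequences must have equal length\n"
            unless length($s1) == length($s2);
        for my $i (0 .. length($s1) - 1) {
            my $x = substr($s1, $i, 1);
            my $y = substr($s2, $i, 1);
            next if $x eq '_' or $y eq '_';
            $count{$x}{$y}++;
        }
    }
    return \%count;
}

# Of all counted columns where the first sequence has $x, the fraction
# where the second sequence has $y.
sub substitution_frequency {
    my ($count, $x, $y) = @_;
    my $total = 0;
    $total += $_ for values %{ $count->{$x} || {} };
    return 0 if $total == 0;
    return ($count->{$x}{$y} || 0) / $total;
}

# Sum the matrix score of every column of two equal-length, gap-free sequences.
sub score_pair {
    my ($matrix, $s1, $s2) = @_;
    my $score = 0;
    for my $i (0 .. length($s1) - 1) {
        $score += $matrix->{ substr($s1, $i, 1) }{ substr($s2, $i, 1) };
    }
    return $score;
}

my $counts = count_substitutions(
    [ "AAGACTT", "AAGGCTT" ],
    [ "AAGACTT", "T_GACTT" ],
);
printf "A->G frequency: %.2f\n", substitution_frequency($counts, "A", "G");

my %matrix = (
    A => { A => 10, C => 3,  G => 1,  T => 4  },
    C => { A => 3,  C => 12, G => 3,  T => 5  },
    G => { A => 1,  C => 3,  G => 15, T => 2  },
    T => { A => 4,  C => 5,  G => 2,  T => 11 },
);
print "score: ", score_pair(\%matrix, "AAGACTT", "AAGGCTT"), "\n";
```

A hash of hashes is used here so that scores can be looked up directly by base letter. A two-dimensional array indexed by position would work equally well, given a mapping from A, C, G, T to 0 through 3.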
Problems
• Write a Perl subroutine called readmatrix that reads a DNA substitution scoring matrix from a file called “dna.txt” and stores it in a two-dimensional array. The format of the scoring matrix in the file is
   A  C  G  T
A 10  3  1  4
C  3 12  3  5
G  1  3 15  2
T  4  5  2 11
• Write a Perl subroutine called translate that takes an mRNA sequence, converts it into a protein sequence, and returns the protein sequence.
Problems
• Write a Perl program that reads in a substitution scoring matrix from a file called “matrix.txt”, reads in a pair of DNA sequences of equal length from a file called “dna.txt”, and returns the total substitution score between the two sequences.
• Write a Perl program that reads pairs of DNA sequences from a file called “DNApairs.txt” and estimates the frequency of nucleotide substitutions.