6.1
Before we start
(Photo: Eitan Shor)
Let’s talk a bit about the last exercise, and Eclipse…
6.2
Comments following the last exercise
• Use chomp to remove \n from inputs
• Add remarks and document your code (see nice_code_example.pl)
• Treat @ARGV as you treat any other array
• Use $! to give the correct error after failing to open a file, e.g. die "failed to open file '$file': $!";
• Make sure your outputs are as requested
• Debug, debug & debug!!!
• Let us know if one of the questions causes you trouble
• Make sure you understand the solutions on the course web-site, and ask if something remains unclear.
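Two of the comments above (chomp and $!) can be tried in a few lines. This is only a sketch: the helper open_or_message and the file name no_such_file.txt are made up for the illustration, and the string "ACGT\n" stands in for a line typed by the user.

```perl
use strict;
use warnings;

# chomp removes the trailing "\n" that comes with every input line.
my $input = "ACGT\n";      # stands in for a line read from <STDIN>
chomp($input);
print "[$input]\n";        # the newline is gone

# Hypothetical helper: try to open a file; on failure return a message
# that includes $! (the reason given by the operating system).
sub open_or_message {
    my ($file) = @_;
    open(my $in, "<", $file)
        or return "failed to open file '$file': $!";
    close($in);
    return "ok";
}

# 'no_such_file.txt' is an assumed name for a file that does not exist.
my $message = open_or_message("no_such_file.txt");
print "$message\n";
```

In a real script you would usually call die with the same message instead of returning it.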
6.3
if
• The order of conditions:
if ((substr($fastaline,0,1) ne ">") and (defined $fastaline))
• What will happen if $fastaline is undefined?
Use of uninitialized value $fastaline in substr…
• The solution: test defined first (1) and only then use the value (2):
if ((defined $fastaline) and (substr($fastaline,0,1) ne ">"))
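The corrected order works because "and" stops at the first false condition, so substr never sees an undefined value. A small sketch (the helper name is_sequence_line is an assumption, not part of the exercise):

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 for a defined line that is not a Fasta header.
sub is_sequence_line {
    my ($fastaline) = @_;
    # defined is tested first; "and" short-circuits, so substr is
    # evaluated only when $fastaline is defined.
    if ((defined $fastaline) and (substr($fastaline, 0, 1) ne ">")) {
        return 1;
    }
    return 0;
}

print is_sequence_line("ACGT"), "\n";     # a sequence line
print is_sequence_line(">gi|1|"), "\n";   # a header line
print is_sequence_line(undef), "\n";      # e.g. after the end of the file, no warning
```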
6.4
Loops: foreach
The foreach loop passes through all the elements of an array:
my @arr = (2,3,4,5,6);
my $mul = 1;
foreach my $num (@arr) {
    $mul = $mul * $num;
}
[Figure: $num takes the values of $arr[0] to $arr[4] in turn (2, 3, 4, 5, 6), while $mul goes from 1 to 2, 6, 24, 120 and finally 720]
6.5
Some Eclipse Tips
• Try Ctrl+Shift+L Quick help (keyboard shortcuts)
• Try Ctrl+SPACE Auto-complete
• Source→Format (Ctrl+Shift+F) Correct indentation
• You can maximize a single view of Eclipse.
• Debug Debug & Debug!!!
• Break points . . .
• The (default) location of your files is:
At home: D:\eclipse\perl_ex
Computer class: C:\eclipse\perl_ex
• Remove auto-complete of (), {}, "" etc.: Window -> Preferences -> Perl EPIC -> Editor, then make changes in "Smart typing"...
6.6
Pattern matching
6.7
Pattern matching
We often want to find a certain piece of information within a file, for example:
1. Extract GI numbers or accessions from Fasta:
>gi|16127995|ref|NP_414542.1| thr operon …
>gi|145698229|ref|YP_001165309.1| hypothetical …
>gi|90111153|ref|NP_415149.4| citrate …
2. Extract the coordinates of all open reading frames from the annotation of a genome:
CDS 1542..2033
CDS complement(3844..5180)
3. Extract the accession, description and score of every hit in the output of BLAST:
                                                                  Score  E
Sequences producing significant alignments:                       (bits) Value
ref|NT_039621.4|Mm15_39661_34 Mus musculus chromosome 15 genomic... 186 1e-45
ref|NT_039353.4|Mm6_39393_34 Mus musculus chromosome 6 genomic c... 38 0.71
ref|NT_039477.4|Mm9_39517_34 Mus musculus chromosome 9 genomic c... 36 2.8
All these examples are patterns in the text.
We will see a wide range of the pattern-matching capabilities of Perl, but much more is available; you are welcome to use documentation/tutorials/google.
6.8
Regular expression
Finding a sub-string (match) somewhere in a string:
if ($line =~ m/he/) ...
Remember to use slash (/) and not back-slash.
Will be true for “hello” and for “the cat” but not for “good bye” or “Hercules”.
You can ignore case of letters by adding an “i” after the pattern:
m/he/i
(matches for “the”, “Hello”, “Hercules” and “hEHD”)
There is a negative form of the match operator:
if ($line !~ m/he/) ...
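The three forms of the match operator can be tried on the strings mentioned above. A minimal sketch; the array names are chosen here only for the illustration:

```perl
use strict;
use warnings;

my @lines = ("hello", "the cat", "good bye", "Hercules");
my (@matched, @matched_nocase, @unmatched);

foreach my $line (@lines) {
    push @matched, $line        if $line =~ m/he/;    # case-sensitive match
    push @matched_nocase, $line if $line =~ m/he/i;   # ignore case
    push @unmatched, $line      if $line !~ m/he/;    # negative form
}

print "m/he/ is true for:  ", join(", ", @matched), "\n";
print "m/he/i is true for: ", join(", ", @matched_nocase), "\n";
print "!~ m/he/ is true for: ", join(", ", @unmatched), "\n";
```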
6.9
Single-character patterns
m/./ Matches any character (except “\n”)
You can also match one of a group of characters:
m/[atcg]/ Matches “a” or “t” or “c” or “g”
m/[a-d]/ Matches “a” through “d” (a, b, c or d)
m/[a-zA-Z]/ Matches any letter
m/[a-zA-Z0-9]/ Matches any letter or digit
m/[a-zA-Z0-9_]/ Matches any letter or digit or an underscore
m/[^atcg]/ Matches any character except “a” or “t” or “c” or “g”
m/[^0-9]/ Matches any character except a digit
6.10
Single-character patterns
For example:
if ($line =~ m/TATAA[AT]/)
Will be true for?
TATTAA ✗
TATAATA ✔
CTATATAATAGCTAGGCGCATG ✔
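You can check the answers yourself with a short loop. This is a sketch only; the hash name %has_box is an assumption made for the example:

```perl
use strict;
use warnings;

my @sequences = ("TATTAA", "TATAATA", "CTATATAATAGCTAGGCGCATG");
my %has_box;   # sequence => 1 if the pattern matches, 0 otherwise

foreach my $seq (@sequences) {
    # "TATAA" followed by one character that is either A or T
    $has_box{$seq} = ($seq =~ m/TATAA[AT]/) ? 1 : 0;
    print "$seq\t", ($has_box{$seq} ? "match" : "no match"), "\n";
}
```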
6.11
Single-character patterns
Perl provides predefined character classes:
\d a digit (same as: [0-9])
\w a “word” character (same as: [a-zA-Z0-9_])
\s a space character (same as: [ \t\n\r\f])
And their negatives:
\D anything but a digit
\W anything but a word char
\S anything but a space char
For example:
if ($line =~ m/class\.ex\d\.\S/)
class.ex3.1.pl ✔
class.ex3. ✗
my class.ex8.(old) ✔
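The same example as a small script. A sketch, assuming the three names are already chomped (a trailing "\n" is a space character, so it would not count as \S either):

```perl
use strict;
use warnings;

my @names = ("class.ex3.1.pl", "class.ex3.", "my class.ex8.(old)");
my @matching;

foreach my $name (@names) {
    # "class.ex", one digit, a literal dot, then any non-space character
    if ($name =~ m/class\.ex\d\.\S/) {
        push @matching, $name;
    }
}

print join("\n", @matching), "\n";
```

Note that "class.ex3." fails only because nothing follows the last dot.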
6.12
Repetitive patterns
? means zero or one repetitions of what’s before it:
m/ab?c/ Matches “ac” or “abc”
+ means one or more repetitions of what’s before it:
m/ab+c/ Matches “abc” ; “abbbbc” but not “ac”
* means zero or more repetitions of what’s before it:
m/ab*c/ Matches “abc” ; “ac” ; “abbbbc”
Generally – use { } for a certain number of repetitions, or a range:
m/ab{3}c/ Matches “abbbc”
m/ab{3,6}c/ Matches “a”, 3-6 times “b” and then “c”
m/ab{3,}c/ Matches “a”, “b” 3 times or more and then “c”
Use parentheses to mark more than one character for repetition:
m/h(el)*lo/ Matches “hello” ; “hlo” ; “helelello”
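To see the difference between the repetition forms, the sketch below tries five of the patterns above on four test words (the words and the hash name %matched_by are chosen here for the illustration):

```perl
use strict;
use warnings;

my @words = ("ac", "abc", "abbbc", "abbbbc");
my %matched_by;   # word => the patterns that matched it

foreach my $word (@words) {
    my @hits;
    push @hits, 'ab?c'    if $word =~ m/ab?c/;     # zero or one "b"
    push @hits, 'ab+c'    if $word =~ m/ab+c/;     # one or more "b"
    push @hits, 'ab*c'    if $word =~ m/ab*c/;     # zero or more "b"
    push @hits, 'ab{3}c'  if $word =~ m/ab{3}c/;   # exactly three "b"
    push @hits, 'ab{3,}c' if $word =~ m/ab{3,}c/;  # three "b" or more
    $matched_by{$word} = join(" ", @hits);
    print "$word: $matched_by{$word}\n";
}
```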
![Page 13: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/13.jpg)
6.13
We are now ready for some bad humor
Question: What did one regular expression say to the other?
Answer: .*
Credit: http://slashdot.org/~jdew
![Page 14: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/14.jpg)
6.14
Repetitive patterns
For example:
if ($line =~ m/TATAA[AT][ATCG]{2,4}ATG/)
Will be true for?
TATAAAGAATG ✔
ACTATAATAAAAATG ✔
TATAATGATGTATAATATG ✗
![Page 15: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/15.jpg)
6.15
Example code
Consider the following code:
print "please enter a line...\n";
my $line = <STDIN>;
chomp($line);
if ($line =~ m/-?\d+/) {
    print "This line seems to contain a number...\n";
}
else {
    print "This is certainly not a number...\n";
}
![Page 16: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/16.jpg)
6.16
Consider the following code:
Example code
open(my $in, "<", "numbers.txt") or die "cannot open numbers.txt";
my $line = <$in>;
while (defined $line) {
    if ($line =~ m/-?\d+/) {
        print "This line seems to contain a number...\n";
    }
    else {
        print "This is certainly not a number...\n";
    }
    $line = <$in>;
}
![Page 17: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/17.jpg)
6.17
RegEx Coach
An easy-to-use tool for testing regular expressions: http://weitz.de/files/regex-coach.exe
• Also in Eclipse: select Window -> Show View -> Other... from the Eclipse menu, then select EPIC -> RegExp view from the list.
![Page 18: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/18.jpg)
6.18
Class exercise 6a
Write the following regular expressions. Test them with a script that reads a line from STDIN and prints "yes" if it matches and "no" if not.
1. Match a name containing a capital letter followed by three lower case letters.
2. Match an NLS (nuclear localization signal) that starts with K, followed by K or R, followed by any character, followed by either K or R.
3. Match an NLS that starts with K, followed by K or R, followed by any character except D or E, followed by either K or R. Match either lowercase or uppercase letters.
4*. Match a line that contains 3 to 15 characters between quotes (without another quote inside the quotes).
![Page 19: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/19.jpg)
6.19
http://xkcd.com/208/
![Page 20: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/20.jpg)
6.20
Pattern matching
Replacing a sub string (substitute):
$line = "the cat on the tree";
$line =~ s/he/hat/;
$line will be turned to “that cat on the tree”
To replace all occurrences of a sub string add a “g” (for “globally”):
$line = "the cat on the tree";
$line =~ s/he/hat/g;
$line will be turned to “that cat on that tree”
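Both substitutions side by side, as a runnable sketch (the variable names $first_only and $all are chosen here for the illustration):

```perl
use strict;
use warnings;

my $first_only = "the cat on the tree";
$first_only =~ s/he/hat/;     # replaces only the first "he"

my $all = "the cat on the tree";
$all =~ s/he/hat/g;           # "g" replaces every "he"

print "$first_only\n";
print "$all\n";
```

The substitution changes the variable itself, so keep a copy if you still need the original line.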
![Page 21: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/21.jpg)
6.21
Single-character patterns
Perl provides predefined character classes:
\d a digit (same as: [0-9])
\w a “word” character (same as: [a-zA-Z0-9_])
\s a space character (same as: [ \t\n\r\f])
And their negatives:
\D anything but a digit
\W anything but a word char
\S anything but a space char
And a substitute example for $line = "class.ex3.1.pl";
$line =~ s/\W/-/;
class-ex3.1.pl
$line =~ s/\W/-/g;
class-ex3-1-pl
![Page 22: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/22.jpg)
6.22
Class exercise 6b
1. Write the following regular expression substitutions. For each string, print it before the substitution and after it.
a) Replace every T with U in a DNA sequence.
b) Replace every digit in the line with a #, and print the result.
c) Replace any number of white space characters (new-line, tab or space) by a single space.
d*) Remove all appearances of "is" from the line (both lowercase and uppercase letters), and print it.
![Page 23: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/23.jpg)
6.23
Enforce line start/end
To force the pattern to be at the beginning of the string add a “^”:
m/^>/ Matches only strings that begin with a “>”
“$” forces the end of string:
m/\.pl$/ Matches only strings that end with a “.pl”
And together:
m/^\s*$/ Matches empty lines and all lines that contain only space characters.
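The three anchored patterns can be compared on a few test strings. A sketch only; the strings and the hash name %kind are made up for the illustration:

```perl
use strict;
use warnings;

my @lines = (">gi|16127995| thr operon", "script.pl", "perl.pl.bak", "   ", "");
my %kind;   # line => description of the first pattern that matched

foreach my $line (@lines) {
    if ($line =~ m/^>/) {
        $kind{$line} = "begins with >";
    }
    elsif ($line =~ m/\.pl$/) {
        $kind{$line} = "ends with .pl";
    }
    elsif ($line =~ m/^\s*$/) {
        $kind{$line} = "empty or spaces only";
    }
    else {
        $kind{$line} = "no match";
    }
    print "[$line] $kind{$line}\n";
}
```

"perl.pl.bak" contains ".pl" but does not end with it, so the “$” makes the match fail.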
![Page 24: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/24.jpg)
6.24
Some examples
m/\d+(\.\d+)?/ Matches numbers that may contain a decimal point: “10”; “3.0”; “4.75” …
m/^NM_\d+/ Matches Genbank RefSeq accessions like “NM_079608”
OK… now let's do something more complex…
![Page 25: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/25.jpg)
6.25
Some GenBank examples
Let's take a look at the adeno12.gb GenBank record….
Matches annotation of a coding sequence in a Genbank DNA/RNA record:
CDS 87..1109
m/^\s*CDS\s+\d+\.\.\d+/
Allows also a CDS on the minus strand of the DNA:
CDS complement(4815..5888)
m/^\s*CDS\s+(complement\()?\d+\.\.\d+\)?/
Note: We could just use m/^\s*CDS/ - it is a question of the strictness of the format. Sometimes we want to make sure.
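A quick comparison of the two CDS patterns. This is a sketch: the three lines below are typed in by hand in the style of a GenBank feature table, and the "gene" line is added only to show a line that neither pattern accepts.

```perl
use strict;
use warnings;

my @lines = (
    "     CDS             87..1109",
    "     CDS             complement(4815..5888)",
    "     gene            87..1109",
);
my (@plus_only, @both_strands);

foreach my $line (@lines) {
    # first pattern: coordinates must follow "CDS" directly
    push @plus_only, $line    if $line =~ m/^\s*CDS\s+\d+\.\.\d+/;
    # second pattern: an optional "complement(" before and ")" after
    push @both_strands, $line if $line =~ m/^\s*CDS\s+(complement\()?\d+\.\.\d+\)?/;
}

print scalar(@plus_only), " line(s) match the first pattern\n";
print scalar(@both_strands), " line(s) match the second pattern\n";
```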
![Page 26: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/26.jpg)
6.26
Extracting part of a pattern
We can extract parts of the pattern by parentheses:
$line = "1.35";
if ($line =~ m/(\d+)\.(\d+)/ ) {
    print "$1\n";    # prints 1
    print "$2\n";    # prints 35
}
![Page 27: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/27.jpg)
6.27
Extracting part of a pattern
We can extract parts of the string that matched parts of the pattern that are marked by parentheses:
my $line = " CDS 87..1109";
if ($line =~ m/CDS\s+(\d+)\.\.(\d+)/ ) {
    print "regexp:$1,$2\n";    # prints regexp:87,1109
    my $start = $1;
    my $end = $2;
}
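It is a good habit to copy $1 and $2 into your own variables right away, because the next successful match overwrites them. One possible way to do this is a small subroutine; the name cds_coordinates is hypothetical, and this is a sketch rather than the course solution:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (start, end) of a plus-strand CDS line,
# or an empty list if the line does not match.
sub cds_coordinates {
    my ($line) = @_;
    if ($line =~ m/CDS\s+(\d+)\.\.(\d+)/) {
        return ($1, $2);
    }
    return;
}

my ($start, $end) = cds_coordinates("     CDS             87..1109");
my $length = $end - $start + 1;
print "start=$start end=$end length=$length\n";
```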
![Page 28: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/28.jpg)
6.28
Finding a pattern in an input file
Usually, we want to scan all lines of a file, and find lines with a specific pattern. E.g.:
my ($start,$end);
foreach my $line (@lines) {
    if ($line =~ m/CDS\s+(\d+)\.\.(\d+)/ ) {
        $start = $1; $end = $2;
        ...
        ...
    }
}
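A complete version of such a scan might look like the sketch below. To keep it self-contained it reads a few made-up annotation lines from a string through a file handle; with a real file you would pass the file name to open instead of \$text.

```perl
use strict;
use warnings;

# A few made-up annotation lines stand in for a GenBank file.
my $text = <<'END';
     gene            87..1109
     CDS             87..1109
     CDS             complement(4815..5888)
     CDS             6087..8109
END

# Open the string as if it were a file.
open(my $in, "<", \$text) or die "cannot open in-memory text: $!";

my (@starts, @ends);
my $line = <$in>;
while (defined $line) {
    chomp($line);
    if ($line =~ m/CDS\s+(\d+)\.\.(\d+)/) {
        push @starts, $1;   # keep every match, not only the last one
        push @ends, $2;
    }
    $line = <$in>;
}
close($in);

print "starts: @starts\n";
print "ends:   @ends\n";
```

The complement(...) line is skipped by this pattern, because "CDS" and the spaces are not followed directly by a digit.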
![Page 29: 6.1 Before we start ( צילום : איתן שור ) Let’s talk a bit about the last exercise, and Eclipse…](https://reader035.vdocuments.site/reader035/viewer/2022081511/56649f515503460f94c74581/html5/thumbnails/29.jpg)
6.29
Extracting a part of a pattern
We can extract parts of the string that matched parts of the pattern that are marked by parentheses. Suppose we want to match
both $line = " CDS complement(4815..5888)";
and $line = " CDS 6087..8109";
if ($line =~ m/CDS\s+(complement\()?((\d+)\.\.(\d+))\)?/ )
{
    print "regexp:$1,$2,$3,$4.\n";
    $start = $3; $end = $4;
}
For the first line this prints:
regexp:complement(,4815..5888,4815,5888.
For the second line the optional first group matches nothing, so $1 is undefined. Perl warns "Use of uninitialized value in concatenation..." and prints:
regexp:,6087..8109,6087,8109.
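One way to avoid the warning is to test $1 with defined before using it; this also tells us the strand. A sketch under that assumption (the helper name parse_cds and the "+"/"-" strand labels are made up for the example):

```perl
use strict;
use warnings;

# Hypothetical helper: returns (strand, start, end) for a CDS line,
# or an empty list if the line does not match.
sub parse_cds {
    my ($line) = @_;
    if ($line =~ m/CDS\s+(complement\()?((\d+)\.\.(\d+))\)?/) {
        # $1 is defined only when "complement(" was part of the match
        my $strand = (defined $1) ? "-" : "+";
        return ($strand, $3, $4);
    }
    return;
}

my @minus = parse_cds(" CDS complement(4815..5888)");
my @plus  = parse_cds(" CDS 6087..8109");
print "@minus\n";
print "@plus\n";
```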
6.30
Class exercise 6c
Write a script that extracts and prints the following features from a Genbank record of a genome (use adeno12.gb):
1. Print all the JOURNAL lines.
2. Print all the JOURNAL lines, without the word JOURNAL, and until the first digit in the line (hint in white: match whatever is not a digit).
3. Find the JOURNAL lines and print only the page numbers.
4. Find lines of protein_id in that file and extract the ids (add to your script from the previous question).
5. Find lines of coding sequence annotation (CDS) and extract the separate coordinates (get each number into a separate variable). Try to match all CDS lines… (This question is part of home ex. 4).