
1.1

Perl Programming for Biology

G.S. Wise Faculty of Life Science
Tel Aviv University, Israel

October 2009

David Burstein and Ofir Cohen

1.2

Why do biologists need computers?

Collecting and managing data - http://www.ncbi.nlm.nih.gov/

Searching databases - http://www.ncbi.nlm.nih.gov/BLAST/

Interpreting data:

Protein function prediction - http://smart.embl-heidelberg.de/

Gene expression - http://www.bioconductor.org/

Browsing genomes - http://genome.ucsc.edu/

1.3

Why do biologists need to program?

(or: why are you here?)

1.4

Why do biologists need to program?

A real-life example

Proto-oncogene activation by retroviral insertional mutagenesis

c-Myc: a proto-oncogene that is activated by over- or misexpression. (In wild-type cells, c-Myc is a transcription factor expressed mainly during the G1 phase.)

1.5

A real-life example

Shmulik

>tumor1
TAGGAAGACTGCGGTAAGTCGTGATCTGAGCGGTTCCGTTACAGCTGCTACCCTCGGCGGGGAGAGGGAAGACGCCCTGCACCCAGTGCTG
...
>tumor157

Run BLAST (http://www.ncbi.nlm.nih.gov/BLAST/) and save the output to a text file:

                                                                    Score    E
Sequences producing significant alignments:                         (bits) Value

ref|NT_039621.4|Mm15_39661_34 Mus musculus chromosome 15 genomic...   186   1e-45
ref|NT_039353.4|Mm6_39393_34 Mus musculus chromosome 6 genomic c...    38   0.71
ref|NT_039477.4|Mm9_39517_34 Mus musculus chromosome 9 genomic c...    36   2.8
ref|NT_039462.4|Mm8_39502_34 Mus musculus chromosome 8 genomic c...    36   2.8
ref|NT_039234.4|Mm3_39274_34 Mus musculus chromosome 3 genomic c...    36   2.8
ref|NT_039207.4|Mm2_39247_34 Mus musculus chromosome 2 genomic c...    36   2.8

>ref|NT_039621.4|Mm15_39661_34 Mus musculus chromosome 15 genomic contig, strain C57BL/6J
 Length = 64849916

 Score = 186 bits (94), Expect = 1e-45
 Identities = 100/102 (98%)
 Strand = Plus / Plus

Query: 1        taggaagactgcggtaagtcgtgatctgagcggttccgttacagctgctaccctcggcgg 60
                ||||||||||||||| ||||||||||||||||||||||| ||||||||||||||||||||
Sbjct: 23209391 taggaagactgcggtgagtcgtgatctgagcggttccgtaacagctgctaccctcggcgg 23209450

...

...

1.6

A Perl script can do it for you

Shmulik writes a simple Perl script to parse the BLAST results and find all hits that are in the Myc locus, or up to 10 kbp from it:

• Use the "Blast reading" package

• Open and read file “mice.blast”

• Iteration – for each blast result:

• If we hit the genomic sequence “Mm15_39661_34”

• In the coordinates of the Myc locus (±10kbp) (23,198,120 .. 23,223,004)

• Then print this hit (hit number and position in locus)

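1.7

A Perl script can do it for you

use strict;
use warnings;
use Bio::SearchIO;    # BioPerl's reader for BLAST reports

# Open the BLAST report saved in "mice.blast"
my $blast_report = Bio::SearchIO->new('-format' => 'blast',
                                      '-file'   => 'mice.blast');

# Go over the results, one result per query sequence
while (my $result = $blast_report->next_result)
{
    print "Checking query ", $result->query_name, "...\n";

    # Take the first hit and its first alignment (HSP);
    # skip queries that have no hits at all
    my $hit = $result->next_hit() or next;
    my $hsp = $hit->next_hsp()    or next;

    # Is the hit on contig Mm15_39661_34, inside the Myc locus (±10 kbp)?
    if ($hit->name() =~ m/Mm15_39661_34/
        && $hsp->hit->start() > 23198120
        && $hsp->hit->end()   < 23223004)
    {
        print " hit ", $hit->name();
        print " (at position ", $hsp->hit->start(), ")\n";
    }
}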


1.8

A Perl script can do it for you

Checking query tumor1...

hit ref|NT_039621.4|Mm15_39661_34 (at position 23209391)

Checking query tumor2...

Checking query tumor3...

Checking query tumor4...

hit ref|NT_039621.4|Mm15_39661_34 (at position 23211826)

Checking query tumor5...

Checking query tumor6...

Checking query tumor7...

hit ref|NT_039621.4|Mm15_39661_34 (at position 23210877)

Checking query tumor8...

Checking query tumor9...

Checking query tumor10...

Checking query tumor11...

hit ref|NT_039621.4|Mm15_39661_34 (at position 23213713)

Checking query tumor12...

1.9

What is Perl?

• Perl was created by Larry Wall. (Read his foreword to the book “Learning Perl”.)

Perl = Practical Extraction and Report Language
(or: Pathologically Eclectic Rubbish Lister)

• Perl is an Open Source project

• Perl is a cross-platform programming language.

1.10

Why Perl?

• Perl is an Open Source project
• Perl is a cross-platform programming language
• Perl is a very popular programming language, especially for bioinformatics
• Perl allows a rapid development cycle
• Perl is strong in text manipulation
• Perl can easily handle files and directories
• Perl can easily run other programs

1.11

Perl & biology

BioPerl: “An international association of developers of open source Perl tools for bioinformatics, genomics and life science research”

http://bioperl.org/

Many smaller projects, and millions of little pieces of biological Perl code (which should be used as references: google and find them!)

1.12

This course

No prior knowledge expected: intended for students with no experience in programming whatsoever.

Time consuming: requires more hours than your average seminar…

For you: oriented towards programming tasks for molecular biology.

1.13

Some formalities…

Use the course web page: http://ibis.tau.ac.il/perluser/2010/

Presentations will be available on the morning of the class.

There will be 5-7 exercises, amounting to 30% of your grade. You get full points if you do the whole exercise and genuine effort is evident, even if some of your answers are wrong.

Exercises are for individual practice. DO NOT submit exercises in pairs or copy exercises from anyone.

1.14

Some formalities…

Submit your exercises by email to your teacher (either Dudu [email protected] or Ofir [email protected]) and you will receive feedback by reply.

There will be a final exam on computers.

Both learning groups will be taught the same material each week.

Presentations are in English; lessons are given in Hebrew.

1.15

Email list for the course

Everybody, please send us an email ([email protected] and [email protected]) saying that you’re taking the course (even if you are not enrolled yet).

Please let us know:

To which group you belong

Whether you are an undergraduate student, a graduate (M.Sc. / Ph.D.) student or other

Whether you have any programming background

1.16

Example exercises

Ex. 1: Write a script that prints "I will submit my homework on time" 100 times (by the end of this lesson!)

Ex. 3: Read a GenBank file and print the coordinates of ORFs

Ex. 5: Write a module of functions for reading sequence files and identifying palindromes

1.17

A first Perl script

print "Hello world!";

A Perl statement must end with a semicolon “;”

The print function outputs some information to the terminal screen

Compare this to Java's "Hello world":

public class HelloWorld {

public static void main(String[] args) {

System.out.println("Hello World!!");

}

}

1.18

Data types

Data type          Description
scalar             A single number or string value
                   9   -17   3.1415   "hello"
array              An ordered list of scalar values
                   (9,-15,3.5)
associative array  Also known as a “hash”. Holds an unordered collection of key-value pairs.
                   ('dudu' => '[email protected]',
                    'ofir' => '[email protected]')
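As a minimal sketch of the three data types in the table, the script below stores one value of each kind and reads a piece of it back. The variable names ($count, @values, %email_of) and the example.org addresses are placeholders chosen for illustration; they do not come from the course material.

```perl
use strict;
use warnings;

# Scalar: a single number or string.
my $count = 9;

# Array: an ordered list of scalars; positions are counted from 0.
my @values = (9, -15, 3.5);

# Hash (associative array): key-value pairs in no particular order.
# The addresses are made-up placeholders. Single quotes keep the @
# from being read as the start of an array name.
my %email_of = (
    'dudu' => 'dudu@example.org',
    'ofir' => 'ofir@example.org',
);

my $second  = $values[1];         # the element at position 1
my $n_items = scalar @values;     # number of elements in the array
my $address = $email_of{'ofir'};  # look up a value by its key

print "count: $count\n";
print "second value: $second (out of $n_items)\n";
print "address of ofir: $address\n";
```

Arrays and hashes are only previewed here; the rest of this lesson uses scalars.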

1.19

Scalar Data

1.20

Scalar values

A scalar is either a string or a number.

Numerical values

3   -20   3.14152965

1.3e4 (= 1.3 × 10^4 = 13,000)

6.35e-14 (= 6.35 × 10^-14)
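To check how Perl reads the exponent notation, here is a small sketch. The variable names are arbitrary and chosen only for this illustration.

```perl
use strict;
use warnings;

my $thousands = 1.3e4;       # 1.3 times 10 to the power 4
my $tiny      = 6.35e-14;    # 6.35 times 10 to the power -14

print "$thousands\n";        # prints 13000

# Underscores may be used inside a number to make it easier to read.
print "1.3e4 equals 13_000\n" if $thousands == 13_000;

# A negative exponent gives a very small positive number, not a negative one.
print "tiny is positive\n" if $tiny > 0;
```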

1.21

Scalar values: Strings

Single-quoted strings

print 'hello world';
    hello world

Double-quoted strings

print "hello world";
    hello world

print "hello\tworld";
    hello   world

print 'a backslash-t: \t ';
    a backslash-t: \t

Backslash is an “escape” character that gives the next character a special meaning:

Construct   Meaning
\n          Newline
\t          Tab
\\          Backslash
\"          Double quote

print "a backslash: \\ ";
    a backslash: \

print "a double quote: \" ";
    a double quote: "

1.22

Operators

An operator takes some values (operands), operates on them, and produces a new value.

Numerical operators: + - * / ** (exponentiation) ++ -- (autoincrement and autodecrement; we will talk about them later)

print 1+1;
    2

print ((1+1)**3);
    8
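The following sketch extends the two examples above with a few more cases, to show that multiplication is done before addition unless parentheses say otherwise, and that division does not throw away the fraction. The variable names are made up for this illustration.

```perl
use strict;
use warnings;

my $sum      = 1 + 1;          # 2
my $cube     = (1 + 1) ** 3;   # 2 to the power 3 = 8
my $mixed    = 2 + 3 * 4;      # multiplication first: 2 + 12 = 14
my $grouped  = (2 + 3) * 4;    # parentheses first: 5 * 4 = 20
my $quotient = 10 / 4;         # the fraction is kept: 2.5

print "$sum $cube $mixed $grouped $quotient\n";   # prints 2 8 14 20 2.5
```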

1.23

Operators

An operator takes some values (operands), operates on them, and produces a new value.

String operators: . (concatenate) x (replicate)

e.g.

print ('swiss'.'prot');
    swissprot

print (('swiss'.'prot')x3);
    swissprotswissprotswissprot
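The same two operators are handy for building sequences. The sketch below is only an illustration: the codons and the variable names are invented, not taken from the course.

```perl
use strict;
use warnings;

my $start_codon = 'ATG';

# Concatenate three codons into one short "gene".
my $gene = $start_codon . 'GCC' . 'TAA';

# Replicate a single letter to build a poly-A tail of 10 bases.
my $poly_a = 'A' x 10;

# Operators can be combined; parentheses make the order explicit.
my $repeat = ('CA' x 3) . 'G';

print "$gene\n";      # prints ATGGCCTAA
print "$poly_a\n";    # prints AAAAAAAAAA
print "$repeat\n";    # prints CACACAG
```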

1.24

String or number?

Perl decides the type of a value depending on its context:

(9+5).'a'
14.'a'
'14'.'a'
'14a'

(9x2)+1
('9'x2)+1
'99'+1
99+1
100

Warning: When you use parentheses in print, make sure to put one pair of parentheses around the WHOLE expression:

print (9+5).'a'; # wrong: prints only 14

print ((9+5).'a'); # right: prints 14a

You will know that you have such a problem if you see this warning:

print (...) interpreted as function at ex1.pl line 3.
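The two chains of steps above can be checked with a short script. This is only a sketch with arbitrary variable names; the last two lines add one more pair of examples in which the same two values are treated as numbers by + and as strings by the dot.

```perl
use strict;
use warnings;

# The dot needs strings, so the number 14 is turned into '14'.
my $text = (9 + 5) . 'a';

# x needs a string, so 9 becomes '9'; + needs a number, so '99' becomes 99.
my $number = ('9' x 2) + 1;

# Same values, different operator, different result.
my $added  = '3' + '4';
my $joined = '3' . '4';

print "$text\n";      # prints 14a
print "$number\n";    # prints 100
print "$added\n";     # prints 7
print "$joined\n";    # prints 34
```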

1.25

Variables

Scalar variables can store scalar values.

Variable declaration
my $priority;

Numerical assignment
$priority = 1;

String assignment
$priority = 'high';

Assign the value of variable $b to $a
$a = $b;

Note: Here we make a copy of $b in $a.
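A short sketch of what "making a copy" means in practice: after the assignment the two variables are independent, so changing one does not change the other. The names $original and $copy are chosen for this illustration.

```perl
use strict;
use warnings;

my $original = 'high';
my $copy     = $original;   # $copy now holds its own copy of 'high'

$original = 'low';          # changing $original does not touch $copy

print "original: $original\n";   # prints original: low
print "copy: $copy\n";           # prints copy: high
```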

1.26

Variables - notes and tips

Tips:
• Give meaningful names to variables: e.g. $studentName is better than $n
• Always declare variables explicitly using my

Note: Variable names in Perl are case-sensitive. This means that the following variables are different (i.e. they refer to different values):
$varname = 1;
$VarName = 2;
$VARNAME = 3;

Note: Perl has a long list of special scalar variables ($_, $1, $2, …), so please don’t use these names for your own variables!

1.27

Variables - always use strict!

Always include the line:
use strict;
as the first line of every script.
• “Strict” mode forces you to declare all variables with my.
• This will help you avoid very annoying bugs, such as spelling mistakes in the names of variables.

my $varname = 1;
$varName++;

Error: Global symbol "$varName" requires explicit package name at ... line ...
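For comparison, here is a sketch of the same two lines with the spelling fixed, so that the script runs under strict mode. The misspelled line is kept as a comment; removing the leading # brings back the "Global symbol ... requires explicit package name" message and the script will not start.

```perl
use strict;
use warnings;

my $varname = 1;

$varname++;       # same spelling as in the declaration: accepted
# $varName++;     # different spelling: rejected before the script runs

print "$varname\n";   # prints 2
```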

1.28

Interpolating variables into strings

$a = 9.5;
print "a is $a!\n";

a is 9.5!

Reminder: single-quoted strings are not interpolated:
print 'a is $a!\n';

a is $a!\n
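Two practical details of interpolation are sketched below, under the assumption that the script runs in strict mode. First, curly braces mark where a variable name ends when text follows it directly. Second, an @ inside double quotes is read as the start of an array name, so an e-mail address is best written in single quotes or with the @ escaped. The gene name and the address are placeholders.

```perl
use strict;
use warnings;

my $gene = 'myc';

my $plain  = "gene: $gene";      # the variable is replaced by its value
my $braced = "${gene}_locus";    # braces end the name before "_locus";
                                 # "$gene_locus" would be a different variable
my $single = 'gene: $gene';      # single quotes: no interpolation

# An e-mail address: single quotes, or a backslash before the @.
my $email   = 'student@example.org';
my $escaped = "student\@example.org";

print "$plain\n";     # prints gene: myc
print "$braced\n";    # prints myc_locus
print "$single\n";    # prints gene: $gene
print "$email\n";     # prints student@example.org
```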

1.29

Command-line interface

1.30

Running Perl at the Command Line

Traditionally, Perl scripts are run from a command-line interface (similar to the old DOS).

(Start it by clicking: Start → Accessories → Command Prompt

or: Start → Run… → cmd)

Running a Perl script

perl -w YOUR_SCRIPT_NAME

(To check whether Perl is installed on your computer, use the ‘perl -v’ command.)

1.31

Running Perl at the Command Line

Common DOS commands:

d:            change to another drive (d in this case)
md my_dir     make a new directory
cd my_dir     change directory
cd ..         move one directory up
dir           list files (dir /p to view them page by page)
help          list all DOS commands
help dir      get help on a DOS command
<TAB>         (hopefully) auto-complete
<up/down>     go to previous/next command
<Ctrl>-c      emergency exit

More tips about the command line are found here.

1.32

Our first Perl script

print "Hello world!";

A Perl statement must end with a semicolon “;”

The print function outputs some information to the terminal screen

Try it yourself!

• Use Notepad to write the script in a file named “hello.pl” (save it in D:\perl_ex)

• Run it!

• Click Start → Accessories → Command Prompt, or: Start → Run… → cmd

• Change to the right drive ("D:") and change directory to the directory that holds the Perl script ("cd perl_ex").

• Type perl -w script_name.pl (replace script_name.pl with the name of the script)

1.33

Class exercise 1

• Create a directory in drive D: called "perl_ex".
• Open a new file (text file) called "perl_ex1.pl"
• Write a Perl script that prints the following lines:

1. The string “hello world! hello Perl!”

2. Use the operator “.” to concatenate the words “apple!”, “orange!!” and “banana!!!”

3*. Produce the line: “666:666:666:god help us!” without any 6 and with only one : in your script!

Like so:

hello world! hello Perl!
apple!orange!!banana!!!
666:666:666:god help us!

1.34

Reading input

<STDIN> allows us to get input from the user:

print "What is your name?\n";
my $name = <STDIN>;
print "Hello $name!";

Here is a test run:

What is your name?
Shmulik
Hello Shmulik
!

$name: "Shmulik\n"

1.35

$name: "Shmulik\n"

Reading input

Use the chomp function to remove the “new-line” from the end of the string (if there is any):

print "What is your name?\n";
my $name = <STDIN>;
chomp $name; # Remove the new-line
print "Hello $name!";

Here is a test run:

What is your name?
Shmulik
Hello Shmulik!

$name: "Shmulik"
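The effect of chomp can also be seen without typing any input. In the sketch below a fixed string stands in for the line that <STDIN> would return; that substitution, and the variable names, are assumptions made so that the example runs on its own.

```perl
use strict;
use warnings;

# Stand-in for a line read from <STDIN>: the text plus the new-line
# that the user's Enter key adds.
my $name = "Shmulik\n";

my $had_newline = ($name eq "Shmulik\n") ? 'yes' : 'no';

chomp $name;    # removes the new-line at the end
chomp $name;    # nothing left to remove, so the string stays the same

print "Had a new-line before chomp: $had_newline\n";   # prints ... yes
print "Hello $name!\n";                                # prints Hello Shmulik!
```

Calling chomp a second time is harmless, which is why the slide says "if there is any".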

1.36

The length function

The length function returns the length of a string:

print length("hi you");
    6

Actually print is also a function, so you could write:

print(length("hi you"));
    6
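A small sketch of length applied to a sequence. The 20 bases are the beginning of the tumor1 sequence shown earlier; the variable names are made up. The second call shows that a new-line at the end counts as one more character.

```perl
use strict;
use warnings;

my $seq = 'TAGGAAGACTGCGGTAAGTC';    # first 20 bases of tumor1

my $seq_length  = length($seq);      # number of characters in the string
my $with_nl     = "$seq\n";          # the same text followed by a new-line
my $nl_length   = length($with_nl);  # the new-line is counted too

print "$seq_length\n";   # prints 20
print "$nl_length\n";    # prints 21
```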

1.37

The substr function

The substr function extracts a substring out of a string. It receives 3 arguments:

substr(EXPR,OFFSET,LENGTH)

The offset is counted from 0, so offset 3 is the fourth character. For example:

$str = "university";
$sub = substr ($str, 3, 5);

$sub is now "versi", and $str remains unchanged.

Note: If length is omitted, everything to the end of the string is returned. You can use variables as the offset and length parameters.

The substr function can do a lot more; google it and you will see…
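The sketch below runs the example above and adds three more calls: one with the length omitted, one starting at offset 0, and one with a negative offset, which substr counts from the end of the string. The variable names other than $str and $sub are chosen for this illustration.

```perl
use strict;
use warnings;

my $str = "university";

my $sub    = substr($str, 3, 5);   # 5 characters starting at offset 3
my $tail   = substr($str, 3);      # length omitted: up to the end
my $first3 = substr($str, 0, 3);   # offset 0 is the first character
my $last3  = substr($str, -3);     # negative offset: counted from the end

print "$sub\n";       # prints versi
print "$tail\n";      # prints versity
print "$first3\n";    # prints uni
print "$last3\n";     # prints ity
print "$str\n";       # prints university (unchanged)
```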

1.38

Documentation of perl functions

Another good place to start is the list of all basic Perl functions on the Perl documentation site:
http://perldoc.perl.org/
Click the link “Functions” on the left (let's try it…)

1.39

Home exercise 1 – submit by email before the next class

1. Install Perl on your computer. Use Notepad to write scripts.
2. Write a script that prints "I will submit my homework on time" 100 times.
3. Write a script that assigns your e-mail address to the variable $email and then prints it.
4. Write a script that reads a line and prints its length.
5. Write a script that reads a line and prints the first 3 characters.
6*. Write a script that reads 4 inputs:
• text line
• number representing "start" position
• number representing "end" position
• number representing "copies"
and then prints the letters of the text between the "start" and "end" positions (including the "end"), duplicated "copies" times.

(an example is given in Ex1.doc on the course web site)

* Starred questions are a little tougher, and are not mandatory