Intermediate Perl Programming Todd Scheetz July 18, 2001

Page 1: Intermediate Perl Programming Todd Scheetz July 18, 2001

Intermediate Perl Programming

Todd Scheetz

July 18, 2001

Page 2: Intermediate Perl Programming Todd Scheetz July 18, 2001

Review of Perl Concepts

Data Types
    scalar
    array
    hash

Input/Output
    open(FILEHANDLE, "filename");
    $line = <FILEHANDLE>;
    print "$line";

Arithmetic Operations
    +, -, *, /, %

Logical Operations
    &&, ||, !

Page 3: Intermediate Perl Programming Todd Scheetz July 18, 2001

Review of Perl Concepts

Control Structures
    if
    if/else
    if/elsif/else
    foreach
    for
    while

Page 4: Intermediate Perl Programming Todd Scheetz July 18, 2001

Regular Expressions

General approach to the problem of pattern matching

REs are a compact method for representing a set of possible strings without explicitly specifying each alternative.

For this portion of the discussion, I will be using {} to represent the scope of a set.

{A}
{A, AA}

{Ø} = the set containing only the empty string (Ø stands for the empty string)

Page 5: Intermediate Perl Programming Todd Scheetz July 18, 2001

Regular Expressions

In addition, the [] will be used to denote possible alternatives.

[AB] = {A,B}

With just these semantics available, we can begin building simple Regular Expressions.

[AB][AB] = {AA, AB, BA, BB}
AA[AB]BB = {AAABB, AABBB}

Page 6: Intermediate Perl Programming Todd Scheetz July 18, 2001

Regular Expressions

Additional Regular Expression components
    * = 0 or more of the specified symbol
    + = 1 or more of the specified symbol

A+ = {A, AA, AAA, … }
A* = {Ø, A, AA, AAA, … }

AB* = {A, AB, ABB, ABBB, … }
[AB]* = {Ø, A, B, AA, AB, BA, BB, AAA, … }

Page 7: Intermediate Perl Programming Todd Scheetz July 18, 2001

Regular Expressions

What if we want a specific number of iterations?

A{2,4} = {AA, AAA, AAAA}
[AB]{1,2} = {A, B, AA, AB, BA, BB}

What if we want any character except one?

[^A] = {B}

What if we want to allow any symbol?

. = {A, B}

.* = {Ø, A, B, AA, AB, BA, BB, … }
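The set notation above maps directly onto Perl patterns. The following is a minimal sketch, not part of the original slides: the helper name in_set and the test strings are made up for illustration, and each pattern is anchored with ^ and $ so that the whole string has to belong to the set.

```perl
use strict;
use warnings;

# Each pattern is anchored so that the WHOLE string must be a member of the
# set, mirroring the set notation used on the slides.
my %patterns = (
    'A+'        => qr/^A+$/,
    'AB*'       => qr/^AB*$/,
    'A{2,4}'    => qr/^A{2,4}$/,
    '[AB]{1,2}' => qr/^[AB]{1,2}$/,
);

# Hypothetical helper: returns 1 if $string belongs to the set named $name.
sub in_set {
    my ($name, $string) = @_;
    return $string =~ $patterns{$name} ? 1 : 0;
}

# Print a small membership table for a few sample strings.
for my $name (sort keys %patterns) {
    for my $string ('', 'A', 'AA', 'ABB', 'BA', 'AAAAA') {
        printf "%-10s %-8s %s\n", $name, "'$string'",
            in_set($name, $string) ? 'yes' : 'no';
    }
}
```

Without the anchors, a pattern such as A+ would also accept 'BA', because Perl only requires the pattern to match somewhere in the string.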

Page 8: Intermediate Perl Programming Todd Scheetz July 18, 2001

Regular Expressions

All of these operations are available in Perl

Several “shortcuts”

\d = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
\w+\s\w+ = {…, Hello World, … }

Name             Definition                 Code
Whitespace       [space, tab, new-line]     \s
Word character   [a-zA-Z_0-9]               \w
Digit            [0-9]                      \d
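As a quick illustration of the shortcuts, here is a small sketch that pulls a two-word phrase and a number out of a string. The sample text and variable names are assumptions chosen for this example.

```perl
use strict;
use warnings;

my $text = "Hello World 2001";   # sample text, made up for illustration

# \w+\s\w+ : a word, one whitespace character, then another word.
my ($two_words) = $text =~ m/(\w+\s\w+)/;

# \d+ : one or more digits.
my ($number) = $text =~ m/(\d+)/;

print "Two words: $two_words\n";
print "Number:    $number\n";
```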

Page 9: Intermediate Perl Programming Todd Scheetz July 18, 2001

Pattern Matching

Perl supports built-in operations for pattern matching, substitution, and character replacement

Pattern Matching

if($line =~ m/Rn\.\d+/) {
    ...
}

(The dot is escaped as \. so that it matches only a literal period.)

In Perl, an RE can match a part of the string rather than the whole string. To tie a match to the ends of the string, use anchors:

^ - beginning of string
$ - end of string
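The effect of the anchors can be seen by running the same pattern with and without them. This is a sketch only; the sample line "ID Rn.1234" is a hypothetical value, not taken from a real data file.

```perl
use strict;
use warnings;

my $line = "ID Rn.1234";   # hypothetical input line

# Unanchored: succeeds because "Rn.1234" occurs somewhere in the string.
my $anywhere = ($line =~ m/Rn\.\d+/)   ? 1 : 0;

# ^ anchor: fails because the string begins with "ID", not "Rn".
my $at_start = ($line =~ m/^Rn\.\d+/)  ? 1 : 0;

# $ anchor: succeeds because the label runs to the end of the string.
my $at_end   = ($line =~ m/Rn\.\d+$/)  ? 1 : 0;

# Both anchors: fails because the whole string is not just the label.
my $whole    = ($line =~ m/^Rn\.\d+$/) ? 1 : 0;

print "anywhere=$anywhere at_start=$at_start at_end=$at_end whole=$whole\n";
```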

Page 10: Intermediate Perl Programming Todd Scheetz July 18, 2001

Pattern Matching

Back references…

if($line =~ m/(Rn\.\d+)/) {
    $UniGene_label = $1;
}
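More than one pair of parentheses can be used; the captured pieces are then available as $1, $2, and so on, numbered by opening parenthesis. A small sketch, assuming a hypothetical line of the form "ID Rn.1234":

```perl
use strict;
use warnings;

my $line = "ID Rn.1234";   # hypothetical input line

my ($organism, $cluster_number);
if ($line =~ m/^ID\s+(\w+)\.(\d+)/) {
    # $1 holds the organism code, $2 the numeric part of the label.
    ($organism, $cluster_number) = ($1, $2);
    print "Organism code: $organism, cluster number: $cluster_number\n";
}
```

Copying $1 and $2 into named variables right away is a common habit, since the next successful match overwrites them.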

Page 11: Intermediate Perl Programming Todd Scheetz July 18, 2001

Regular Expressions

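$file = "my_fasta_file";
open(IN, $file) or die "Cannot open $file: $!";
$line_count = 0;
while($line = <IN>) {
    # every FASTA record starts with a header line beginning with ">"
    if($line =~ m/^\>/) {
        $line_count++;
    }
}
close(IN);
print "There are $line_count FASTA sequences in $file.\n";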

Page 12: Intermediate Perl Programming Todd Scheetz July 18, 2001

Pattern Matching

UniGene data file

ID          Bt.1
TITLE       Cow casein kinase II alpha …
EXPRESS     ;placenta
PROTSIM     ORG=Caenorhabditis elegans; …
PROTSIM     ORG=Mus musculus; PROTGI=…
SCOUNT      2
SEQUENCE    ACC=M93665; NID=g162776; …
SEQUENCE    ACC=BF043619; NID=…
//
ID          Bt.2
TITLE       Bos taurus cyclin-dependent …
...

Page 13: Intermediate Perl Programming Todd Scheetz July 18, 2001

Pattern Matching

Let’s write a small Perl program to determine how many clusters there are in the Bos taurus UniGene file.
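One possible version of such a program is sketched below. It assumes that every cluster record starts with a line of the form "ID Bt.N", as in the excerpt on the previous slide. The subroutine name count_clusters is made up, and the sample data is a shortened, in-memory stand-in for the real file so that the sketch runs on its own.

```perl
use strict;
use warnings;

# Counts cluster records by counting the "ID Bt.N" lines on a filehandle.
sub count_clusters {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ m/^ID\s+Bt\.\d+/;
    }
    return $count;
}

# Shortened sample in the layout shown on the slide (not real file contents).
my $sample = <<'END';
ID          Bt.1
TITLE       Cow casein kinase II alpha ...
SCOUNT      2
//
ID          Bt.2
TITLE       Bos taurus cyclin-dependent ...
//
END

# Open the string as if it were a file.
open(my $fh, '<', \$sample) or die "Cannot open sample data: $!";
my $clusters = count_clusters($fh);
close($fh);

print "There are $clusters clusters.\n";
```

To run it on real data, open the UniGene data file by name instead of the in-memory string; the rest stays the same.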

Page 14: Intermediate Perl Programming Todd Scheetz July 18, 2001

Pattern Matching

Now we’ll build a Perl program that can write an HTML file containing some basic links based on the Bos taurus UniGene clustering.

Important:

http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=GID_HERE&dopt=GenBank
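A rough sketch of one way to approach this follows. It assumes the number after "NID=g" on a SEQUENCE line is the identifier to substitute for GID_HERE; that reading of the file format is an assumption. The subroutine names genbank_url and unigene_to_html, the HTML layout, and the one-record sample are all invented for illustration.

```perl
use strict;
use warnings;

# URL pattern from the slide; GID_HERE is replaced for each sequence.
my $url_template = 'http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi'
                 . '?cmd=Retrieve&db=Nucleotide&list_uids=GID_HERE&dopt=GenBank';

# Builds the link for one identifier.
sub genbank_url {
    my ($gid) = @_;
    (my $url = $url_template) =~ s/GID_HERE/$gid/;
    return $url;
}

# Reads UniGene records from $fh and returns a simple HTML page as a string.
sub unigene_to_html {
    my ($fh) = @_;
    my $html = "<html>\n<body>\n";
    while (my $line = <$fh>) {
        if ($line =~ m/^ID\s+(\S+)/) {
            $html .= "<h3>$1</h3>\n";              # one heading per cluster
        }
        elsif ($line =~ m/^SEQUENCE\s+ACC=(\w+);\s+NID=g(\d+)/) {
            my ($acc, $gid) = ($1, $2);
            my $url = genbank_url($gid);
            $url =~ s/&/&amp;/g;                   # escape & inside the HTML attribute
            $html .= qq{<a href="$url">$acc</a><br>\n};
        }
    }
    $html .= "</body>\n</html>\n";
    return $html;
}

# One-record sample in the layout shown earlier (not real file contents).
my $sample = <<'END';
ID          Bt.1
SEQUENCE    ACC=M93665; NID=g162776;
//
END

open(my $fh, '<', \$sample) or die "Cannot open sample data: $!";
my $html = unigene_to_html($fh);
close($fh);
print $html;
```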

Page 15: Intermediate Perl Programming Todd Scheetz July 18, 2001

Substitution

Pattern matching is useful for counting or indexing items, but to modify the data, substitution is required.

Substitution searches a string for a PATTERN and, if found, replaces it with REPLACEMENT.

$line =~ s/PATTERN/REPLACEMENT/;

The substitution returns the number of times the pattern was found and replaced (false if the pattern was not found).

$result = $line =~ s/PATTERN/REPLACEMENT/;

Page 16: Intermediate Perl Programming Todd Scheetz July 18, 2001

Substitution

Substitution can take several different options, specified after the final slash.

The most useful are
    g - global (can substitute at more than one location)
    i - case insensitive matching

$string = "One fish, Two fish, Red fish, Blue fish.";
$string =~ s/fish/dog/g;
print "$string\n";

Output:

One dog, Two dog, Red dog, Blue dog.
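The sketch below compares the same substitution with no options, with g, and with both g and i, and also captures the return value as a count. The mixed-case sample string and the variable names are assumptions chosen to make the difference visible.

```perl
use strict;
use warnings;

my $string = "One fish, Two Fish, Red fish, Blue FISH.";   # sample text

# No options: only the first lowercase "fish" is replaced.
(my $first_only = $string) =~ s/fish/dog/;

# g: every lowercase "fish" is replaced; the return value is the count.
my $global   = $string;
my $n_global = ($global =~ s/fish/dog/g);

# g and i together: "fish" is replaced regardless of case.
my $any_case   = $string;
my $n_any_case = ($any_case =~ s/fish/dog/gi);

print "$first_only\n";
print "$global ($n_global replaced)\n";
print "$any_case ($n_any_case replaced)\n";
```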

Page 17: Intermediate Perl Programming Todd Scheetz July 18, 2001

Substitution

Example: Removing leading and trailing white-space

$line =~ s/^\s*(.*?)\s*$/$1/;

A *? performs a minimal match: it stops at the first point at which the remainder of the expression can be matched.

$line =~ s/^\s*(.*)\s*$/$1/;

This statement will not remove trailing white-space. The greedy .* captures the trailing white-space, so it is retained in $1.
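The difference between the two forms is easy to check on a padded string. In this sketch the sample string and variable names are made up, and the two-step version at the end is a common alternative idiom rather than something from the slides.

```perl
use strict;
use warnings;

my $padded = "  hello world  ";   # two spaces on each side

# Minimal match: (.*?) gives the trailing spaces back to \s*.
(my $minimal = $padded) =~ s/^\s*(.*?)\s*$/$1/;

# Greedy match: (.*) swallows the trailing spaces, so they are kept.
(my $greedy = $padded) =~ s/^\s*(.*)\s*$/$1/;

# Alternative: trim each end with its own substitution.
(my $two_step = $padded) =~ s/^\s+//;
$two_step =~ s/\s+$//;

# Brackets make any remaining white-space visible.
print "[$minimal]\n[$greedy]\n[$two_step]\n";
```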

Page 18: Intermediate Perl Programming Todd Scheetz July 18, 2001

Character Replacement

A similar operation to substitution is character replacement.

$line =~ tr/a-z/A-Z/;

$count_CG = $line =~ tr/CG/CG/;

$line =~ tr/ACGT/TGCA/;

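$line =~ s/A/T/g;
$line =~ s/C/G/g;
$line =~ s/G/C/g;
$line =~ s/T/A/g;

Note that this chain of substitutions is not equivalent to the tr above: each step also changes the characters produced by the earlier steps, so every A and T ends up as A and every C and G ends up as C.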
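The following sketch runs both approaches on a short made-up sequence so the results can be compared, and also shows tr used purely for counting. The sequence "AACGT" and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $dna = "AACGT";   # sample sequence

# tr maps every character in a single pass, so nothing is translated twice.
(my $complement = $dna) =~ tr/ACGT/TGCA/;

# Reversing the complement gives the reverse complement.
my $reverse_complement = reverse $complement;

# Chained substitutions: later steps also hit the output of earlier steps.
my $chained = $dna;
$chained =~ s/A/T/g;
$chained =~ s/C/G/g;
$chained =~ s/G/C/g;
$chained =~ s/T/A/g;

# With an empty replacement list, tr only counts and leaves $dna unchanged.
my $count_CG = ($dna =~ tr/CG//);

print "complement:         $complement\n";
print "reverse complement: $reverse_complement\n";
print "chained s///:       $chained\n";
print "C+G count:          $count_CG\n";
```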

Page 19: Intermediate Perl Programming Todd Scheetz July 18, 2001

Character Replacement

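$count_CG = 0;
$count_AT = 0;
while($line = <IN>) {
    # add to the running totals; a plain = would keep only the last line's counts
    $count_CG += ($line =~ tr/CG/CG/);
    $count_AT += ($line =~ tr/AT/AT/);
}
$total = $count_CG + $count_AT;
$percent_CG = 100 * ($count_CG/$total);

print "The sequence was $percent_CG% CG-rich.\n";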

Page 20: Intermediate Perl Programming Todd Scheetz July 18, 2001

Subroutines

One of the most important aspects of programming is dealing with complexity. A program that is written in one large section is generally more difficult to debug. Thus a major strategy in program development is modularization.

Break the program up into smaller portions that can each be developed and tested independently.

Makes the program more readable, and easier to maintain and modify.

Page 21: Intermediate Perl Programming Todd Scheetz July 18, 2001

Subroutines

EXAMPLE: Reading in sequences from the UniGene.all.seq file

Multiple FASTA sequences in a single file, each annotated with the UniGene cluster they belong to.

GOAL: Make an output file consisting only of the longest sequence from each cluster.

Page 22: Intermediate Perl Programming Todd Scheetz July 18, 2001

Subroutines

ISSUES:
1. Want to design and implement a usable program.
2. Use subroutines where useful to reduce complexity.
3. Minimize the memory requirements.
   (human UniGene seqs > 2 GB)
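One way to meet these goals is sketched below. It rests on two assumptions that should be checked against the real file: that each FASTA header carries its cluster in a tag of the form /ug=Bt.1, and that all sequences of a cluster are stored next to each other. Under those assumptions only the current record and the best record of the current cluster need to be in memory. The subroutine names and the sample data are invented for illustration.

```perl
use strict;
use warnings;

# Extracts the cluster label from a FASTA header (assumes a "/ug=LABEL" tag).
sub cluster_of {
    my ($header) = @_;
    return $header =~ m{/ug=(\S+)} ? $1 : undef;
}

# Length of a sequence once line breaks and other white-space are removed.
sub sequence_length {
    my ($sequence) = @_;
    (my $bases = $sequence) =~ s/\s+//g;
    return length($bases);
}

# Streams FASTA records from $in and writes the longest record of each
# cluster to $out. Assumes records of one cluster are contiguous.
sub write_longest_per_cluster {
    my ($in, $out) = @_;
    my ($best_cluster, $best_record, $best_length);
    my ($header, $sequence) = (undef, '');

    # Compares the record just read against the best one of its cluster.
    my $finish_record = sub {
        return unless defined $header;
        my $cluster = cluster_of($header);
        return unless defined $cluster;            # skip untagged records
        my $length = sequence_length($sequence);
        if (defined $best_cluster && $cluster ne $best_cluster) {
            print {$out} $best_record;             # previous cluster is complete
            undef $best_cluster;
        }
        if (!defined $best_cluster || $length > $best_length) {
            ($best_cluster, $best_record, $best_length)
                = ($cluster, $header . $sequence, $length);
        }
    };

    while (my $line = <$in>) {
        if ($line =~ m/^>/) {
            $finish_record->();
            ($header, $sequence) = ($line, '');
        }
        else {
            $sequence .= $line;
        }
    }
    $finish_record->();
    print {$out} $best_record if defined $best_record;
}

# Tiny in-memory sample (made up) so the sketch runs without the real file.
my $sample = <<'END';
>seq1 /ug=Bt.1
ACGT
>seq2 /ug=Bt.1
ACGTACGT
ACGT
>seq3 /ug=Bt.2
AC
END

my $result = '';
open(my $in,  '<', \$sample) or die "Cannot open sample data: $!";
open(my $out, '>', \$result) or die "Cannot open output buffer: $!";
write_longest_per_cluster($in, $out);
close($in);
close($out);
print $result;
```

If the clusters turned out not to be contiguous, a hash keyed by cluster would be needed instead, at the cost of holding one record per cluster in memory.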