Introduction to Programming: Perl for Biologists

Timothy M. Kunau

Center for Biomedical Research Informatics, Academic Health Center, University of Minnesota, kunau@umn.edu

Bioinformatics Summer Institute 2007

Outline

•Art and Programming

•Getting Started

•Biology and Computer Science

•Bioinformatics Data

•Perl basics:

•Strings and Variables

•Math and Logic

•Looping, operators, and functions

Art and Programming

• Moving from Data to Story

• Systems = Beauty

“Science is what we understand well enough to explain to a computer. Art is all the rest.”

Donald Knuth

Edit -- Run -- Revise (and Save)

•As a programmer, most of your time will be spent planning, testing, and revising your program.

•Running is often incidental on today’s hardware.

•Carefully written programs can be productive tools for years.

•Programming is a method of communication: your code must be readable by both the computer and your users.

Errors and Debugging

• Rarely involves actual insects.

• If the task is well understood, errors are mostly typographical.

• The resulting error messages can be extraordinarily helpful.

• If the task is not well understood or the data is irregular, it may produce a ‘logical’ error and require more thought.

• Beware: a valid program can still produce the wrong result.

Programming

• Is an exercise in problem solving:

• iterative

• gradual

• often a solitary activity

• Social activity

• You are now part of a community of tool builders.

• A program does not often stand alone, but interacts with other programs that make up its environment. Each building on the others.

• Systematic and beautiful

Programming

• Is an economically valuable skill.

• Commercial and proprietary systems are built to protect their economic value.

• Open Source projects are different.

• Open Source software projects publish their source code so that it can be shared and improved by the community of users.

http://www.opensource.org/

Open Source Programs

•Firefox

•LINUX

•MySQL

•Apache web server

•Languages:

•Perl

•Ruby

•Python

Programming Strategies

•Break down into two major approaches:

1. Find a program written by someone else.

2. Write one yourself.

•The reality is usually somewhere in between.

Programming Strategies

• Open Source programming communities are often large and prolific.

• If you cannot find a program that does exactly what you need -- you can likely find one that does most of what you need.

• A little tweaking is often significantly quicker than rolling your own.

• “A day in the library can save you six months in the lab.” -- ancient adage

Programming Strategies

• It is important to become aware of the communities that use and support the tools you use.

• Some copyrights may apply but use is generally free.

• CPAN

What has been will be again, what has been done will be done again; there is nothing new under the sun.

(Ecclesiastes 1:9 NIV)

The Process

1. Identify the inputs, data, and specifications from the user.

2. Design the solution as a series of steps toward the desired result.

3. Decide on the output(s). Does the result print to the screen or to a file? How will this output be used? Does format matter?

4. Refine the design with increasing detail. (pseudocode)

5. Do appropriate code modules exist? (CPAN)

6. Write the program.

Pseudocode

• An informal outline of a program: details are left out and formal syntax is not followed.

• A quick and informal way to collect your ideas about solving the problem at hand.

get the name of DNA file from user

read in DNA from DNA file

for each element

if element is DNA, then add one to the count

print count
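The pseudocode above might translate into Perl roughly as follows. This is only a sketch under stated assumptions: to stay self-contained it takes the DNA from a string instead of asking the user for a file name, it treats A, C, G and T as "DNA", and the name count_bases is a hypothetical choice that does not come from the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the characters in a string that are DNA bases (A, C, G or T).
# count_bases is a hypothetical helper name chosen for this sketch.
sub count_bases {
    my ($dna) = @_;
    my $count = 0;

    # "for each element": walk through the string one character at a time
    for my $element ( split //, $dna ) {

        # "if element is DNA, then add one to the count"
        if ( $element =~ /^[ACGT]$/i ) {
            $count++;
        }
    }
    return $count;
}

# Stand-in for "read in DNA from DNA file" (made-up example data)
my $dna = "ACGTNNACGT";

# "print count"
my $count = count_bases($dna);
print "$count\n";
```

Each pseudocode line survives as a comment, which is one way to keep the design visible in the finished program.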

What is Perl?

•Scripting language by Larry Wall, circa 1987

•Born of AWK

•Practical Extraction and Report Language

•Pathologically Eclectic Rubbish Lister

•Disturbingly flexible in form, format, and usage.

•“Swiss Army chain-saw”

Why Perl?!

•An easy language to use, though sometimes hard to learn. Some choices were made to make things easier for the programmer at the expense of the student.

•Fast cross platform text processing.

•Good pattern matching. (regex)

•Many extensions for Life Sciences data types. (BioPerl)

•Many biologists already know Perl.

•Powerful

#!/usr/local/bin/perl -w

use SOAP::Lite;

print STDERR "Welcome to the SOAP demonstration\n";

my $res;

$servername = "inquiry.ccgb.umn.edu";

my $server = SOAP::Lite

-> uri("http://$servername/Backbeat")

-> proxy("http://$servername/cgi-bin/bipod/BIFX.pl");

$res = $server-> (SOAP::Data->name(USER)->value("kunau"),

SOAP::Data->name(PASSWORD)->value(” ));

my $ticket;

if ($res->result()) { $ticket = $res->result(); }

print STDERR "Got ticket $ticket\n";

my $id = "nt:ABY13260";

= $id;

=~ s/: ;

$res = $server-> (SOAP::Data->name(TICKET)->value($ticket),

SOAP::Data->name("BLOCKING")->value(1),

SOAP::Data->name("sequence")->value("$id"),

SOAP::Data->name( )->value("fasta"),

SOAP::Data->name("outseq")->value( ));

($res);

print STDERR "fetched file for $id\n";

$res = $server-> (SOAP::Data->name(TICKET)->value($ticket),

SOAP::Data->name("BLOCKING")->value(0),

SOAP::Data->name("blastall")->value("blastn"),

SOAP::Data->name("query")->value( ),

SOAP::Data->name( )->value("yeast.nt"),

SOAP::Data->name( )->value("yeast.nt"),

SOAP::Data->name( )->value( . ".blastx"));

($res);

my $jid = 0;

if ($res->result()) { $jid = $res->result(); }

print "Submitted BLAST for . Got job id $jid\n";

# Client side block

my $result = "";

while ($result ne "FINISHED") {

print "Checking status for job $jid\n";

$res = $server-> (

SOAP::Data->name("TICKET")->value($ticket),

$jid));

($res);

if ($res->result()) { $result = $res->result(); }

print "Got status $result\n";

if ($result ne "FINISHED") { sleep 3; }

}

$res = $server-> (SOAP::Data->name(TICKET)->value($ticket),

SOAP::Data->name(FILENAME)->value("blastall.txt"));

($res);

if ($res->result()) {

$result = $res->result();

print "Got status $result\n";

if ($result ne "FINISHED") { sleep 3; }

}

$res = $server-> (SOAP::Data->name(TICKET)->value($ticket),

SOAP::Data->name(FILENAME)->value("blastall.txt"));

($res);

if ($res->result()) { print $res->result(); }

###################### SUBROUTINES #####################

sub {

my $res = shift;

if (my $fault = $res->fault()) {

my %fault = %$fault;

while (my ($key, $val) = each (%fault)) {

print "$key $val\n";

}

}

}

Login

Get a ticket

Configure a service

Submit request

Check status (rinse, repeat)

Print result

Beginning Perl for Bioinformatics

• Hardcover: 400 pages

• Publisher: O'Reilly Media, Inc.; 1 edition (October 15, 2001)

• Language: English

• ISBN: 0596000804

• Product Dimensions: 9.2 x 7.1 x 0.9 inches

• Shipping Weight: 1.3 pounds.

• Average Customer Review: 4.5/5 based on 25 reviews.

Mastering Perl for Bioinformatics

• Hardcover: 377 pages

• Publisher: O'Reilly Media, Inc.; 1 edition (June, 2003)

• Language: English

• ISBN: 0596003072

• Product Dimensions: 9.4 x 6.8 x 0.9 inches

• Shipping Weight: 1.4 pounds.

• Average Customer Review: 4.5/5 based on 8 reviews.

Safari Books on-line

http://proquest.safaribooksonline.com/home

Safari: Perl

Safari: bioinformatics

Getting Started

•The programming rite of passage.

•Tidbits

•print "string";

•newline: "\n"

•tab: "\t"

•# comments

•All about context

A simple program

#!/usr/bin/perl -w

#

# a program to do the obvious

#

print "Hello, world!\n";

A simple result

% ./hello-world.pl

Hello, world!

How does it work?

#!/usr/bin/perl -w

#

# a program to do the obvious

#

print "Hello, world!\n";

• The first line: every Perl program begins with this line.

• Lines starting with #: comments.

• The ‘print’ function sends the quoted text to the default output device, the screen.

Theme and variation

#!/usr/bin/perl -w

#

# assign a value to $message

my $message = "Hello, world!\n";

# print the $message

print $message;

Store the value “Hello, world!” in a container called a variable.

Theme and variation

#!/usr/bin/perl -w

#

# assign a value to $message

my $message = qq{Hello, world!\n};

# print the $message

print $message;

Don’t let a change in form throw you.
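The two "Theme and variation" slides write the same string in two forms. A minimal sketch of how Perl's common quoting forms compare is given here; the variable names are illustrative only and are not taken from the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $name = "world";

# Double quotes and qq{} interpolate variables and escapes such as \n.
my $double = "Hello, $name!\n";
my $qq     = qq{Hello, $name!\n};

# Single quotes and q{} keep the text literally: no interpolation.
my $single = 'Hello, $name!\n';
my $q      = q{Hello, $name!\n};

print $double;          # Hello, world!
print $qq;              # Hello, world!
print $single, "\n";    # Hello, $name!\n
print $q, "\n";         # Hello, $name!\n
```

The brace forms are handy when the text itself contains quote characters, since nothing inside needs escaping.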

TMTOWTDI

•There’s More Than One Way To Do It

•This can be frustrating for new users.

•We’ll try to focus on what we’re doing. Don’t worry about all the possible ways to do it yet.

LAB: Let’s try it!

• Login to your workstation

• launch a terminal window

•mkdir bsi2007

•cd bsi2007

• launch a text editor: pico, vi, emacs

• create and save your “Hello, world!” program

• Run it

LAB: Let’s try it!

% mkdir bsi2007

% cd bsi2007

% pico hello-world.pl

% chmod +x hello-world.pl

% ./hello-world.pl

LAB: Let’s try it!

#!/usr/bin/perl -w

#

# a program to do the obvious

#

print "Hello, world!\n";

LAB: Let’s try a little variation.

#!/usr/bin/perl -w

#

# assign a value to $message

my $message = "Hello, world!\n";

# print the $message

print $message;

LAB: break it.

What happens when?:

1. You remove a semicolon?

2. You remove a dollar sign?

3. You change the shebang?

4. Can you change the shebang to something else that works?

lather --> rinse --> repeat

The goal of testing is to cause your code to fail. The goal of testing is not to cause your code to succeed.

D. Conway

LAB: A simple program

#!/usr/bin/perl -w

#

# a program to do the obvious

#

print "Hello, world!\n";

LAB: A simple result

% ./hello-world.pl

Hello, world!

Biology and Computer Science

• The Life Sciences and many of the Computer Sciences grew up together.

• Databases

• Languages

• Networks

• the World Wide Web

“It is better to use one’s head for a few minutes, than to use a computing machine for a few days.”

Francis Crick

A brief history

• 1950’s: Double helix structure of DNA

• 1960’s: Manual alignment using “edit distances”

• 1970’s: Optimal global alignment (Needleman & Wunsch)

• Substitution matrixes (Dayhoff)

• 1980’s: Optimal local alignment (Smith & Waterman)

• 1990’s: Heuristic local alignment search

• FASTA: (Pearson et al.),

• BLAST: (Altschul et al.)

Disconnects

•Social differences

•Managing expectations

•Developing a common vocabulary

•Conway’s Law

Conway’s Law states:

“Organizations which design systems are constrained to produce designs which are copies of the communication structures of their organizations.”

In other words:

Any piece of software reflects the organizational structure that produced it.

Social differences

• Tool building versus the great discovery:

• Computer scientists create new rules to engineer a solution. (“Inventing laws”)

• Life scientists look for the exception that breaks the rules. (“Discover laws”)

Social differences

                     Biologists                         Computer Scientists

Sharing results      Sit on it until ready to publish   Share but do not guarantee correctness

Reporting results    Peer reviewed papers               Talks at conferences; publish source code

Who’s who            Lab leader always last             Lab leader second, least involved last
(on publications)

Managing Expectations

What can we expect from each other?

Life Sciences are presenting the grand challenges of our time...

What does Computer Science have to offer Life Sciences research?

Developing a common vocabulary

Words in common but with different meanings:

Array, chip, clone, cluster, database, domain, insert, library, node, partitioning, root, sequence, transformation, tree, vector, virus

Isn’t it odd?

Biology is the only science in which multiplication means the same thing as division.

Developing a common vocabulary

• The importance of interpreters.

• Constrained and negotiated vocabularies, Ontologies:

• “gene expression” and “Gene Expression” and “gene regulation”

• “putative kinase” and “possibly a kinase” and “it may be something, but it isn’t a kinase”

• Metadata without guidelines will lead to entropy.

• Folksonomy: in-formalisms, tagging?

• You are becoming interpreters.

Developing a common vocabulary

BioBench-Bob: “The information is in the file, what’s the problem?”

Compu-Carla: “This file is a mess! How about some consistency and structure?”

What we have here is a failure to communicate.

Compu-Carla: “The information is all in the database, why are you complaining?”

BioBench-Bob: “How do I read it?”

Conway’s Law

“Organizations which design systems are constrained to produce designs which are copies of the communication structures of their organizations.”

Bioinformatics Data

“Quantity has a quality all its own”

Russian military axiom

GBREL.TXT Genetic Sequence Data Bank

April 15 2007

NCBI-GenBank Flat File Release 159.0

Distribution Release Notes

71,802,595 loci, 75,742,041,056 bases, from 71,802,595 reported sequences

Bioinformatics Data

• Often unstructured or semi-structured.

• Data appears as text strings:

• Protein sequences: FASTA flat-files, et alia.

• Annotation: often free-text

• Feudal states (Lincoln Stein)

FASTA

>ContigId:Contig1 AssemblyProcessId:MtSC AssemblyProcessVersion:6
GCTTTAATCTTGTAGGTTTGATGAAAGAATAAGTTCGTTTGCTGAGAAGA
AGTTTACAAGAGATGGTATAGAAGTTCAAACTGGATGCCGCGTTATGAGT
GTTGATGACAAGGAAATTACAGTGAAGGTGAAATCAACGGGAGAGGTTTG
CTCGGTTCCCCATGGATTGATTATCTGGTCTACTGGCATTTCTACTCTTC
CAGTTATAAGAGATTTTATGGAAGAAATTGGTCAGACTAAAAGGCATGTA
CTGGCAACCGATGAATGGTTGAGAGTGAAGGAATGTGAAGATGTGTTTGC
CATTGGTGATTGTTCATCAATAAATCAACGTAAAATCATGGATGATATCT
TGGACATATTTAAGGCTGCAGACAAAAATAACTCCGGTACCTTAACTGTG
TAAGAATGCGAAGAAGTGATGGATGAATGTATCTTAAGATATCCTGCAGT
GGAATGC

Medicago truncatula consensus sequence
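A FASTA record is a description line starting with ">" followed by lines of sequence. The sketch below shows one way such a record could be split into its header and sequence in Perl. The record is shortened from the one above, and parse_fasta is a hypothetical helper name; real projects would more likely reach for an existing module such as BioPerl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A small FASTA record held in a string (shortened example data).
my $fasta = <<'END';
>ContigId:Contig1 AssemblyProcessId:MtSC AssemblyProcessVersion:6
GCTTTAATCTTGTAGG
TTTGATGAAAG
END

# parse_fasta is a hypothetical helper: it returns the header (without
# the leading ">") and the sequence with its line breaks removed.
sub parse_fasta {
    my ($text) = @_;
    my ( $header, $sequence ) = ( '', '' );
    for my $line ( split /\n/, $text ) {
        if ( $line =~ /^>(.*)/ ) {
            $header = $1;                        # the description line
        }
        else {
            ( my $bases = $line ) =~ s/\s+//g;   # drop any stray whitespace
            $sequence .= $bases;                 # join the sequence lines
        }
    }
    return ( $header, $sequence );
}

my ( $header, $sequence ) = parse_fasta($fasta);
print "header: $header\n";
print "length: ", length($sequence), "\n";
```

This handles a single record only; a file with several records would need the header test to start a new entry each time.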

GenBank

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  PUBMED    7871890
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  PUBMED    8846915
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"

Approximately 71,802,595 loci,

75,742,041,056 bases, from 71,802,595

reported sequences in traditional

GenBank divisions as of April 2007.

GenBank

/note="plasma membrane glycoprotein" /codon_start=1 /function="required for axial budding pattern of S. cerevisiae" /product="Axl2p" /protein_id="AAA98666.1" /db_xref="GI:1293615" /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL VDFSNKSNVNVGQVKDIHGRIPEML" gene complement(3300..4037) /gene="REV7" CDS complement(3300..4037) /gene="REV7" /codon_start=1 /product="Rev7p" /protein_id="AAA98667.1" /db_xref="GI:1293616" /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ FVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVD KDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNR RVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEK LISGDDKILNGVYSQYEEGESIFGSLF"ORIGIN 1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg 61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct 121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa 181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg 241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa 301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa 361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat 421 aatacccatc gtaggtatgg ttaaagatag catctccaca 
acctcaaagc tccttgccga

GenBank

481 gagtcgccct cctttgtcga gtaattttca cttttcatat gagaacttat tttcttattc 541 tttactctca catcctgtag tgattgacac tgcaacagcc accatcacta gaagaacaga 601 acaattactt aatagaaaaa ttatatcttc ctcgaaacga tttcctgctt ccaacatcta 661 cgtatatcaa gaagcattca cttaccatga cacagcttca gatttcatta ttgctgacag 721 ctactatatc actactccat ctagtagtgg ccacgcccta tgaggcatat cctatcggaa 781 aacaataccc cccagtggca agagtcaatg aatcgtttac atttcaaatt tccaatgata 841 cctataaatc gtctgtagac aagacagctc aaataacata caattgcttc gacttaccga 901 gctggctttc gtttgactct agttctagaa cgttctcagg tgaaccttct tctgacttac 961 tatctgatgc gaacaccacg ttgtatttca atgtaatact cgagggtacg gactctgccg 1021 acagcacgtc tttgaacaat acataccaat ttgttgttac aaaccgtcca tccatctcgc 1081 tatcgtcaga tttcaatcta ttggcgttgt taaaaaacta tggttatact aacggcaaaa 1141 acgctctgaa actagatcct aatgaagtct tcaacgtgac ttttgaccgt tcaatgttca 1201 ctaacgaaga atccattgtg tcgtattacg gacgttctca gttgtataat gcgccgttac 1261 ccaattggct gttcttcgat tctggcgagt tgaagtttac tgggacggca ccggtgataa 1321 actcggcgat tgctccagaa acaagctaca gttttgtcat catcgctaca gacattgaag 1381 gattttctgc cgttgaggta gaattcgaat tagtcatcgg ggctcaccag ttaactacct 1441 ctattcaaaa tagtttgata atcaacgtta ctgacacagg taacgtttca tatgacttac 1501 ctctaaacta tgtttatctc gatgacgatc ctatttcttc tgataaattg ggttctataa 1561 acttattgga tgctccagac tgggtggcat tagataatgc taccatttcc gggtctgtcc 1621 cagatgaatt actcggtaag aactccaatc ctgccaattt ttctgtgtcc atttatgata 1681 cttatggtga tgtgatttat ttcaacttcg aagttgtctc cacaacggat ttgtttgcca 1741 ttagttctct tcccaatatt aacgctacaa ggggtgaatg gttctcctac tattttttgc 1801 cttctcagtt tacagactac gtgaatacaa acgtttcatt agagtttact aattcaagcc 1861 aagaccatga ctgggtgaaa ttccaatcat ctaatttaac attagctgga gaagtgccca 1921 agaatttcga caagctttca ttaggtttga aagcgaacca aggttcacaa tctcaagagc 1981 tatattttaa catcattggc atggattcaa agataactca ctcaaaccac agtgcgaatg 2041 caacgtccac aagaagttct caccactcca cctcaacaag ttcttacaca tcttctactt 2101 acactgcaaa aatttcttct acctccgctg ctgctacttc ttctgctcca gcagcgctgc 2161 cagcagccaa 
taaaacttca tctcacaata aaaaagcagt agcaattgcg tgcggtgttg 2221 ctatcccatt aggcgttatc ctagtagctc tcatttgctt cctaatattc tggagacgca 2281 gaagggaaaa tccagacgat gaaaacttac cgcatgctat tagtggacct gatttgaata 2341 atcctgcaaa taaaccaaat caagaaaacg ctacaccttt gaacaacccc tttgatgatg 2401 atgcttcctc gtacgatgat acttcaatag caagaagatt ggctgctttg aacactttga 2461 aattggataa ccactctgcc actgaatctg atatttccag cgtggatgaa aagagagatt 2521 ctctatcagg tatgaataca tacaatgatc agttccaatc ccaaagtaaa gaagaattat 2581 tagcaaaacc cccagtacag cctccagaga gcccgttctt tgacccacag aataggtctt 2641 cttctgtgta tatggatagt gaaccagcag taaataaatc ctggcgatat actggcaacc 2701 tgtcaccagt ctctgatatt gtcagagaca gttacggatc acaaaaaact gttgatacag 2761 aaaaactttt cgatttagaa gcaccagaga aggaaaaacg tacgtcaagg gatgtcacta 2821 tgtcttcact ggacccttgg aacagcaata ttagcccttc tcccgtaaga aaatcagtaa 2881 caccatcacc atataacgta acgaagcatc gtaaccgcca cttacaaaat attcaagact 2941 ctcaaagcgg taaaaacgga atcactccca caacaatgtc aacttcatct tctgacgatt 3001 ttgttccggt taaagatggt gaaaattttt gctgggtcca tagcatggaa ccagacagaa 3061 gaccaagtaa gaaaaggtta gtagattttt caaataagag taatgtcaat gttggtcaag

GenBank

3121 ttaaggacat tcacggacgc atcccagaaa tgctgtgatt atacgcaacg atattttgct 3181 taattttatt ttcctgtttt attttttatt agtggtttac agatacccta tattttattt 3241 agtttttata cttagagaca tttaatttta attccattct tcaaatttca tttttgcact 3301 taaaacaaag atccaaaaat gctctcgccc tcttcatatt gagaatacac tccattcaaa 3361 attttgtcgt caccgctgat taatttttca ctaaactgat gaataatcaa aggccccacg 3421 tcagaaccga ctaaagaagt gagttttatt ttaggaggtt gaaaaccatt attgtctggt 3481 aaattttcat cttcttgaca tttaacccag tttgaatccc tttcaatttc tgctttttcc 3541 tccaaactat cgaccctcct gtttctgtcc aacttatgtc ctagttccaa ttcgatcgca 3601 ttaataactg cttcaaatgt tattgtgtca tcgttgactt taggtaattt ctccaaatgc 3661 ataatcaaac tatttaagga agatcggaat tcgtcgaaca cttcagtttc cgtaatgatc 3721 tgatcgtctt tatccacatg ttgtaattca ctaaaatcta aaacgtattt ttcaatgcat 3781 aaatcgttct ttttattaat aatgcagatg gaaaatctgt aaacgtgcgt taatttagaa 3841 agaacatcca gtataagttc ttctatatag tcaattaaag caggatgcct attaatggga 3901 acgaactgcg gcaagttgaa tgactggtaa gtagtgtagt cgaatgactg aggtgggtat 3961 acatttctat aaaataaaat caaattaatg tagcatttta agtataccct cagccacttc 4021 tctacccatc tattcataaa gctgacgcaa cgattactat tttttttttc ttcttggatc 4081 tcagtcgtcg caaaaacgta taccttcttt ttccgacctt ttttttagct ttctggaaaa 4141 gtttatatta gttaaacagg gtctagtctt agtgtgaaag ctagtggttt cgattgactg 4201 atattaagaa agtggaaatt aaattagtag tgtagacgta tatgcatatg tatttctcgc 4261 ctgtttatgt ttctacgtac ttttgattta tagcaagggg aaaagaaata catactattt 4321 tttggtaaag gtgaaagcat aatgtaaaag ctagaataaa atggacgaaa taaagagagg 4381 cttagttcat cttttttcca aaaagcaccc aatgataata actaaaatga aaaggatttg 4441 ccatctgtca gcaacatcag ttgtgtgagc aataataaaa tcatcacctc cgttgccttt 4501 agcgcgtttg tcgtttgtat cttccgtaat tttagtctta tcaatgggaa tcataaattt 4561 tccaatgaat tagcaatttc gtccaattct ttttgagctt cttcatattt gctttggaat 4621 tcttcgcact tcttttccca ttcatctctt tcttcttcca aagcaacgat ccttctaccc 4681 atttgctcag agttcaaatc ggcctctttc agtttatcca ttgcttcctt cagtttggct 4741 tcactgtctt ctagctgttg ttctagatcc tggtttttct tggtgtagtt ctcattatta 4801 
gatctcaagt tattggagtc ttcagccaat tgctttgtat cagacaattg actctctaac 4861 ttctccactt cactgtcgag ttgctcgttt ttagcggaca aagatttaat ctcgttttct 4921 ttttcagtgt tagattgctc taattctttg agctgttctc tcagctcctc atatttttct 4981 tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc//

SWISS-Prot

• ID DMI1_MEDTR STANDARD; PRT; 882 AA.

• AC Q6RHR6;

• DT 29-MAR-2005, integrated into UniProtKB/Swiss-Prot.

• DT 05-JUL-2004, sequence version 1.

• DT 04-APR-2006, entry version 13.

• DE Putative ion channel DMI-1 (Does not make infections protein 1).

• GN Name=DMI1;

• OS Medicago truncatula (Barrel medic).

• OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;

• OC Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;

• OC rosids; eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae;

• OC Medicago.

• OX NCBI_TaxID=3880;

• RN [1]

• RP NUCLEOTIDE SEQUENCE [MRNA], INDUCTION, AND TISSUE SPECIFICITY.

• RC TISSUE=Root;

• RX PubMed=14963334; DOI=10.1126/science.1092986;

• RA Ane J.-M., Kiss G.B., Riely B.K., Penmetsa R.V., Oldroyd G.E.,

• RA Ayax C., Levy J., Debelle F., Baek J.-M., Kalo P., Rosenberg C.,

• RA Roe B.A., Long S.R., Denarie J., Cook D.R.;

• RT "Medicago truncatula DMI1 required for bacterial and fungal symbioses

• RT in legumes.";

• RL Science 303:1364-1367(2004).

• CC -!- FUNCTION: Required for early signal transduction events leading to

• CC endosymbioses. Acts early in a signal transduction chain leading

• CC from the perception of Nod factor to the activation of calcium

• CC spiking. Also involved in mycorrhizal symbiosis.

• CC -!- SUBCELLULAR LOCATION: Plastid; chloroplast; chloroplast membrane;

• CC multi-pass membrane protein (Potential).

• CC -!- TISSUE SPECIFICITY: Mainly expressed in roots. Also detected in

• CC pods, flowers, leaves, and stems.

• CC -!- INDUCTION: Not induced after bacterial or Nod factor treatment.

• CC -!- SIMILARITY: Belongs to the castor/pollux family.

• CC -----------------------------------------------------------------------

• CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms

• CC Distributed under the Creative Commons Attribution-NoDerivs License

• CC -----------------------------------------------------------------------

• DR EMBL; AY497771; AAS49490.1; -; mRNA.

• KW Chloroplast; Coiled coil; Ion transport; Ionic channel; Membrane;

• KW Plastid; Transmembrane; Transport.

• FT CHAIN 1 882 Putative ion channel DMI-1.

• FT /FTId=PRO_0000165855.

• FT TRANSMEM 129 149 Potential.

• FT TRANSMEM 192 212 Potential.

• FT TRANSMEM 255 275 Potential.

• FT TRANSMEM 307 327 Potential.

• FT COILED 378 403 Potential.

• FT COMPBIAS 78 96 Pro-rich.

• FT COMPBIAS 114 117 Poly-Ser.

Release 53.0 of 29-May-07 of UniProtKB/Swiss-Prot contains 269,293 sequence entries, comprising 98,902,758 amino acids abstracted from 156,204 references.
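Every line of a Swiss-Prot entry begins with a two-letter code (ID, AC, OS and so on), which makes the format easy to pick apart with a pattern match. The following is a rough sketch only: the four lines are copied from the entry above, and the %field hash is an illustrative name, not part of any Swiss-Prot tool.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few lines in the Swiss-Prot flat-file style, taken from the entry above.
my @lines = (
    'ID DMI1_MEDTR STANDARD; PRT; 882 AA.',
    'AC Q6RHR6;',
    'GN Name=DMI1;',
    'OS Medicago truncatula (Barrel medic).',
);

# Store the text of each line under its two-letter line code.
my %field;
for my $line (@lines) {
    if ( $line =~ /^([A-Z]{2})\s+(.*)$/ ) {
        my ( $code, $text ) = ( $1, $2 );
        if ( exists $field{$code} ) {
            $field{$code} .= " $text";    # a code may span several lines
        }
        else {
            $field{$code} = $text;
        }
    }
}

print "Accession: $field{AC}\n";
print "Organism:  $field{OS}\n";
```

Looking a value up by its code, as in $field{OS}, is exactly the name-value access that Perl hashes provide.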

Tower of Babel, Pieter Brueghel the Elder, 1563.

XML

<?xml version="1.0" encoding="UTF-8"?><uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd"><entry dataset="Swiss-Prot" created="2005-03-29" modified="2006-04-04" version="13"> <accession>Q6RHR6</accession> <name>DMI1_MEDTR</name> <protein> <name>Putative ion channel DMI-1</name> <name>Does not make infections protein 1</name> </protein> <gene> <name type="primary">DMI1</name> </gene> <organism key="1"> <name type="scientific">Medicago truncatula</name> <name type="common">Barrel medic</name> <dbReference type="NCBI Taxonomy" id="3880" key="2"/> <lineage> <taxon>Eukaryota</taxon> <taxon>Viridiplantae</taxon> <taxon>Streptophyta</taxon> <taxon>Embryophyta</taxon> <taxon>Tracheophyta</taxon> <taxon>Spermatophyta</taxon> <taxon>Magnoliophyta</taxon> <taxon>eudicotyledons</taxon> <taxon>core eudicotyledons</taxon> <taxon>rosids</taxon> <taxon>eurosids I</taxon> <taxon>Fabales</taxon> <taxon>Fabaceae</taxon> <taxon>Papilionoideae</taxon> <taxon>Trifolieae</taxon> <taxon>Medicago</taxon> </lineage> </organism> <reference key="3"> <citation type="journal article" date="2004" name="Science" volume="303" first="1364" last="1367"> <title>Medicago truncatula DMI1 required for bacterial and fungal symbioses in legumes.</title> <authorList> <person name="Ane J.-M."/> <person name="Kiss G.B."/> <person name="Riely B.K."/> <person name="Penmetsa R.V."/> <person name="Oldroyd G.E."/> <person name="Ayax C."/> <person name="Levy J."/>

Two principal problems in bioinformatics

•distribution: data is created and controlled by autonomous groups all over the world.

•biology is hard and messy: large collections of data, with many data types and tools, few of which talk to each other.

Perl is often the glue that binds these systems together.

Perl basics: Strings

•Primitives:

•Strings

•Numerics

TGACATGCTAGCTAGCTAGCTAT

1356

#@$!$!%@&&!@

Data types

• Scalar: a variable quantity that cannot be resolved into components.

• List: a collection of items, often stored in an array.

• Hash: a dish of cooked meat cut into small pieces and re-cooked, usually with potatoes.

Data types

• Scalar: a variable quantity that cannot be resolved into components.

• List: a collection of items, often stored in an array.

• Hash: like an array, but instead of indexing values by number, values are accessed by name. Think of them as name-value pairs.

Data types

• Scalar: my $var = "a";

•my $num = 10;

• List: my @fruit_list = ('apple','orange','banana');

• Hash:

my %ip2hostname = (
  "160.94.109.65" => "leaf.cbri.umn.edu",
  "160.94.109.55" => "blastoma.cbri.umn.edu",
  "160.94.109.211" => "kierkegaard.cbri.umn.edu"
);
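Once values are stored in a list or a hash, individual items are read back by position or by name. The sketch below reuses the variable names from the slide; the printed labels and the extra variables ($first, $host, $how_many) are illustrative assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @fruit_list = ( 'apple', 'orange', 'banana' );
my %ip2hostname = (
    "160.94.109.65" => "leaf.cbri.umn.edu",
    "160.94.109.55" => "blastoma.cbri.umn.edu",
);

# One element of an array or hash is a scalar, so it is written with $.
my $first = $fruit_list[0];                  # arrays count from 0
my $host  = $ip2hostname{"160.94.109.65"};   # hashes are indexed by key

# In scalar context an array gives its number of elements.
my $how_many = scalar @fruit_list;

print "first fruit: $first\n";
print "host: $host\n";
print "fruits: $how_many\n";

# keys returns the names stored in a hash (sorted here for stable output).
for my $ip ( sort keys %ip2hostname ) {
    print "$ip => $ip2hostname{$ip}\n";
}
```

Note the change of sigil: @fruit_list is the whole array, while $fruit_list[0] is one scalar taken from it.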

Math

• Standard arithmetic: +, -, *, /

• modulus operator: %

•4 % 2 = 0 and 5 % 2 = 1

• Operate in place: $num += 3;

• Increment and decrement variable: $i++, $a--

• power: 2**5

• Square-root: sqrt(9)

Some Math Code

# Pythagorean theorem
my $a = 3; my $b = 4;
my $c = sqrt($a**2 + $b**2);

# what's left over from the division
my $x = 22; my $y = 6;
my $div = int ( $x / $y );
my $mod = $x % $y;
print "output: ", $div, " ", $mod, "\n";

output: 3 4

Logic and Equality

•if / unless / elsif / else

•if( TEST ) { DO SOMETHING }
elsif( TEST ) { SOMETHING ELSE }
else { DO SOMETHING ELSE IN CASE }

• Equality: == (numbers) and eq (strings)

• Numeric Less/Greater than: <, <=, >, >=

• String (lexical) comparisons: lt, le, gt, ge

Testing equality

my $str1 = "mumbo";
my $str2 = "jumbo";

if( $str1 eq $str2 ) {
  print "strings are equal\n";
}

if( $str1 lt $str2 ) {
  print "less\n";
} else {
  print "more\n";
}

Testing equality

my $num1 = "10";
my $num2 = "100";

if( $num1 == $num2 ) {
  print "nums are equal\n";
}

if( $num1 < $num2 ) {
  print "less\n";
} else {
  print "more\n";
}
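Choosing the numeric or the string operator changes the answer, even for the same pair of values. A minimal sketch, with values picked for illustration (they are not from the slides):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $num1 = "10";
my $num2 = "9";

# Numeric comparison: 10 is greater than 9.
my $numeric_less = ( $num1 < $num2 ) ? "yes" : "no";

# String comparison goes character by character: "1" sorts before "9".
my $string_less = ( $num1 lt $num2 ) ? "yes" : "no";

print "numeric: is $num1 less than $num2? $numeric_less\n";    # no
print "string:  is $num1 less than $num2? $string_less\n";     # yes

# The same trap applies to equality.
my $same_number = ( "1.0" == "1" ) ? "yes" : "no";   # yes: both are the number 1
my $same_string = ( "1.0" eq "1" ) ? "yes" : "no";   # no: the text differs
print "1.0 == 1: $same_number, 1.0 eq 1: $same_string\n";
```

Perl decides how to treat a value from the operator used, not from how the value was written, so the quotes around "10" do not make the comparison a string comparison.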

Boolean Logic

• AND: && and

• OR: || or

• NOT: ! not

if( $a > 10 && $a <= 20) {
  # do something interesting here
}

Loops

•while( TEST ) { }

•until( ! TEST ) { }

•for( $i = 0 ; $i < 10; $i++ ) {}

•foreach $item ( @list ) { }

•for $item ( @list ) { }
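The loop forms listed above can be compared side by side. This is a small sketch with made-up values; the variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @list = ( 3, 1, 4 );

# foreach: visit each item of a list in turn
my $total = 0;
foreach my $item (@list) {
    $total += $item;
}

# C-style for: count with an explicit index
my $squares = 0;
for ( my $i = 0 ; $i < 4 ; $i++ ) {
    $squares += $i**2;    # 0 + 1 + 4 + 9
}

# while: repeat as long as the test is true
my $n      = 20;
my $halved = 0;
while ( $n > 1 ) {
    $n = int( $n / 2 );    # 20 -> 10 -> 5 -> 2 -> 1
    $halved++;
}

# until: repeat as long as the test is false
my $countdown = 3;
until ( $countdown == 0 ) {
    $countdown--;
}

print "total=$total squares=$squares halved=$halved countdown=$countdown\n";
```

In Perl, for and foreach are interchangeable keywords; which loop you get depends on what is inside the parentheses.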

Using logic

for( $i = 0; $i < 20; $i++ ) {
  if( $i == 0 ) {
    print "$i is 0\n";
  } elsif( $i / 2 == 0) {   # "/" is division, so this test is never true here
    print "$i is even\n";
  } else {
    print "$i is odd\n";
  }
}

Using logic: subtle

for( $i = 0; $i < 20; $i++ ) {
  if( $i == 0 ) {
    print "$i is 0\n";
  } elsif( $i % 2 == 0) {
    print "$i is even\n";
  } else {
    print "$i is odd\n";
  }
}

What is truth?

•True

•if( "zero" ) {}

•if( 23 || -1 || ! 0) {}

•$x = "0 or none"; if( $x )

•False

•if( 0 || undef || '' || "0" ) { }
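The rules above can be checked directly. The sketch below runs a handful of values through a boolean test; is_true is a hypothetical helper name, and the list of test values is chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# is_true is a hypothetical helper that reports how Perl judges a value.
sub is_true {
    my ($value) = @_;
    return $value ? "true" : "false";
}

# Each entry pairs a label with the value to be tested.
my @tests = (
    [ 'number 0',      0 ],
    [ 'string "0"',    "0" ],
    [ 'empty string',  '' ],
    [ 'undef',         undef ],
    [ 'string "zero"', "zero" ],
    [ 'string "0.0"',  "0.0" ],
    [ 'number -1',     -1 ],
    [ '"0 or none"',   "0 or none" ],
);

for my $test (@tests) {
    my ( $label, $value ) = @$test;
    printf "%-16s %s\n", $label, is_true($value);
}
```

The first four values come out false and the rest true. The string "0.0" is the one that tends to surprise: it is not exactly "0" and not empty, so it counts as true.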

Special variables

• This is why many people dislike Perl.

• Too many little silly things to remember.

• One of the trade-offs that make it harder to learn and ultimately easier to use.

• See perldoc perlvar for more detailed information.

Some special variables

•$! : error messages here

•$, : separator between the items when doing print @array; (print "@array" uses $" instead)

•$/ : record delimiter (“\n” usually)

•$a,$b : used in sorting

•$_ : implicit variable

•perldoc perlvar for more info
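A few of these variables are easiest to understand by watching them change the output. The following sketch is illustrative only; the file name is hypothetical and assumed not to exist, so that the open fails and sets $!.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @bases = ( 'A', 'C', 'G', 'T' );

# $" (the list separator) goes between the elements of an array that is
# interpolated into a double-quoted string. Its default is one space.
my $default = "@bases";

my $dashed;
{
    local $" = "-";          # changed only inside this block
    $dashed = "@bases";
    print "$dashed\n";       # A-C-G-T
}

{
    # $, is printed between the arguments of print, and $\ after the last.
    local $, = ",";
    local $\ = "\n";
    print @bases;            # A,C,G,T
}

# $! holds the system error message after a failed operation.
# The file name below is hypothetical and assumed not to exist.
open( my $in, '<', 'no-such-file.fasta' )
    or print "Could not open file: $!\n";
```

Using local inside a block restores the old value on the way out, which keeps a change to a global special variable from leaking into the rest of the program.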

The Implicit variable?

•Implicit variable is $_

•It is the last thing we were thinking about.

•Examples:

for ( @list ) { print $_ };

while(<IN>) { print $_};
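Many built-in functions fall back on $_ when given no argument, which is why it so often goes unwritten. A small sketch under that assumption, with made-up list contents:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @list = ( "contig\n", "seq\n", "phrap\n" );

# With no loop variable, for puts each item in $_.
# chomp, length and print all use $_ when given no argument.
my @lengths;
for (@list) {
    chomp;                     # same as chomp($_): strips the newline in place
    push @lengths, length;     # same as length($_)
    print;                     # same as print $_
    print "\n";
}

# A bare pattern match also tests $_ by default.
my @with_p = grep { /p/ } @list;

print "lengths: @lengths\n";
print "contain p: @with_p\n";
```

Because $_ is an alias for the current element, the chomp inside the loop changes the strings stored in @list itself.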

Some operators and built-in functions

• tr///: transliteration from one group of characters to another.

• lc, lcfirst

• uc, ucfirst

• chomp: removes the line endings from all elements of a list; returning the (total) number of chars removed.

• chop: chops off the last character on all elements of a list; returns the last chopped char.
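The functions listed above can be tried on a short sequence. This is a sketch with a made-up input string; using tr/// with an empty replacement list to count characters is one common idiom, shown here as an illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "acgGGAtt\n";

# chomp removes a trailing newline (if any) and returns how many
# characters it removed.
my $removed = chomp($line);     # $line is now "acgGGAtt"

# uc and lc return a changed copy; ucfirst changes only the first letter.
my $upper = uc($line);          # "ACGGGATT"
my $lower = lc($line);          # "acgggatt"
my $first = ucfirst($lower);    # "Acgggatt"

# tr/// returns the number of characters it matched. With an empty
# replacement list it counts them and leaves the string unchanged.
my $gc = ( $upper =~ tr/GC// );

# chop removes the last character, whatever it is, and returns it.
my $last = chop($lower);        # $lower is now "acgggat"

print "removed=$removed upper=$upper first=$first gc=$gc last=$last lower=$lower\n";
```

chomp is the safer choice for cleaning up input lines, because chop removes a character even when there is no newline to remove.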

LAB: Math

#!/usr/bin/perl -w
#
# assign values
my $num1 = 22;
my $num2 = 7;

my $result = $num1 / $num2;

# print the result
print $result;

% pico pi.pl

% chmod +x pi.pl

% ./pi.pl

LAB: Math

#!/usr/bin/perl -w
#
# assign values
my $num1 = 22;
my $num2 = 7;

my $result = int($num1 / $num2);

# print the result
print $result;

% pico pi.pl

% chmod +x pi.pl

% ./pi.pl

LAB: Math

#!/usr/bin/perl -w
#
# assign values
my $num1 = 22;
my $num2 = 7;

my $result = $num1 % $num2;

# print the result
print $result;

% pico pi.pl

% chmod +x pi.pl

% ./pi.pl

LAB: break it.

What happens when?:

1. You change the operation?

2. You change the values?

3. You put the numbers in quotes?

4. Add another number and multiply the result?

LAB: Loops and logic

#!/usr/bin/perl -w

for( $i = 0; $i < 20; $i++ ) {
  if( $i == 0 ) {
    print "$i is 0\n";
  } elsif( $i % 2 == 0) {
    print "$i is even\n";
  } else {
    print "$i is odd\n";
  }
}

% pico loops-and-logic.pl

% chmod +x loops-and-logic.pl

% ./loops-and-logic.pl

LAB: Loops and logic

% ./loops-and-logic.pl
0 is 0
1 is odd
2 is even
3 is odd
4 is even
5 is odd
6 is even
7 is odd
8 is even
9 is odd
10 is even
11 is odd
12 is even
13 is odd
14 is even
15 is odd
16 is even
17 is odd
18 is even
19 is odd

LAB: Loops and logic

#!/usr/bin/perl -w

foreach $item ( "contig", "seq", "phrap" ) {
  if( $item eq "phrap" ) {
    print "Is there a phred file for this '$item' file?\n";
  } elsif( $item eq "seq") {
    print "Is '$item' in FASTA format?\n";
  } else {
    print "'$item' is an unknown type.\n";
  }
}

% pico foreach.pl

% chmod +x foreach.pl

% ./foreach.pl

LAB: Loops and logic

#!/usr/bin/perl -w

my @items = ( "contig", "seq", "phrap" );

foreach $item ( @items ) {
  if( $item eq "phrap" ) {
    print "Is there a phred file for this '$item' file?\n";
  } elsif( $item eq "seq") {
    print "Is '$item' in FASTA format?\n";
  } else {
    print "'$item' is an unknown type.\n";
  }
}

% pico foreach.pl

% chmod +x foreach.pl

% ./foreach.pl

LAB: break it.

What happens when?:

1. You change the test?

2. You change the values?

3. Test with booleans?

LAB: operators and functions

#!/usr/bin/perl -w
# Transcribing DNA into RNA

# The DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";

print "$DNA\n\n";

# Transcribe the DNA to RNA by substituting all T's with U's.
my $RNA = $DNA;

$RNA =~ s/T/U/g;

# Print the RNA onto the screen
print "Here is the result of transcribing the DNA to RNA:\n\n";

print "$RNA\n";

% pico transcribe.pl

% chmod +x transcribe.pl

% ./transcribe.pl

LAB: break it.

What happens when?:

1. You change the case?

2. Change the case with different methods? (tr///, \L, \U, lc(), uc() )

3. You reverse the sequence?

If you remember nothing else

•biology is hard and messy.

•The key problems are social. Together we are smarter than any one of us.

•Technology is easy by comparison.

Parting Thoughts: an assignment.

1. Calculate the reverse complement of a DNA strand using the tr/// operation.

2. Read about file handling. (Safari on-line documentation is available.)

3. Read about Regular Expressions (regex). (Safari)

4. Find CPAN.ORG and locate a module that would be useful to you as a biologist.

5. Read about that module and email me (kunau@umn.edu) the following details:

1. Name of the module.

2. The name of the person who wrote it.

3. What it does.

4. How it would be useful to you?

Questions?

Thank You.
