
Upload: imogen-king

Post on 15-Jan-2016


Programming for Engineers in Python

Biopython

Classes

    class ClassName:
        statement_1
        ...
        statement_n

- The methods of a class get the instance as the first parameter, traditionally named self.
- The method __init__ is called upon object construction (if available).

Classes

- Reminder: type = data representation + behavior.
- Classes are user-defined types.
- Objects of a class are called class instances.
- A class body is like a mini-program: variables, function definitions, even arbitrary commands.

Classes: Attributes and Methods

Instance attributes: each instance has its own copy. Methods are defined inside the class body.

    class Vector2D:

        def __init__(self, x, y):
            self.x, self.y = x, y

        def size(self):
            return (self.x ** 2 + self.y ** 2) ** 0.5

Classes: Instantiate and Use

>>> v = Vector2D(3, 4)  # Make instance.
>>> v  # Echoes the instance's default representation.
>>> v.size()  # Call method on instance.
5.0

Example: Multimap

A dictionary with more than one value for each key. We already needed it once or twice and used:

>>> lst = d.get(key, [])
>>> lst.append(value)
>>> d[key] = lst

We will now write a new class that is a wrapper around a dict. The class will have methods that allow us to keep multiple values for each key.

Multimap (partial code)

    class Multimap:

        def __init__(self):
            '''Create an empty Multimap'''
            self.inner = {}

        def get(self, key):
            '''Return list of values associated with key'''
            return self.inner.get(key, [])

        def put(self, key, value):
            '''Adds value to the list of values associated with key'''
            value_list = self.get(key)
            if value not in value_list:
                value_list.append(value)
                self.inner[key] = value_list

Multimap: put_all and remove

        def put_all(self, key, values):
            for v in values:
                self.put(key, v)

        def remove(self, key, value):
            value_list = self.get(key)
            if value in value_list:
                value_list.remove(value)
                self.inner[key] = value_list
                return True
            return False

Multimap (partial code)

        def __len__(self):
            '''Returns the number of keys in the map'''
            return len(self.inner)

        def __str__(self):
            '''Converts the map to a string'''
            return str(self.inner)

        def __cmp__(self, other):
            '''Compares the map with another map'''
            # Python 2 only: compare the wrapped dicts with the built-in cmp().
            return cmp(self.inner, other.inner)

        def __contains__(self, key):
            '''Returns True if key exists in the map'''
            return key in self.inner

Multimap

Use case: a dictionary of countries and their cities.

>>> m = Multimap()
>>> m.put('Israel', 'Tel-Aviv')
>>> m.put('Israel', 'Jerusalem')
>>> m.put('France', 'Paris')
>>> m.put_all('England', ('London', 'Manchester', 'Moscow'))
>>> m.remove('England', 'Moscow')
>>> print m.get('Israel')
['Tel-Aviv', 'Jerusalem']
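For comparison, the same multimap behavior can be sketched on top of the standard library's collections.defaultdict, which removes the get/append/store steps. This is a minimal Python 3 sketch and not the course's implementation: the class name ListMultimap and the use of __eq__ in place of the Python 2 __cmp__ hook are assumptions made for illustration.

```python
from collections import defaultdict


class ListMultimap:
    """Hypothetical Python 3 variant of the Multimap wrapper."""

    def __init__(self):
        # Missing keys start out with an empty list automatically.
        self.inner = defaultdict(list)

    def get(self, key):
        # dict.get does not create an entry for a missing key.
        return list(self.inner.get(key, []))

    def put(self, key, value):
        if value not in self.inner[key]:
            self.inner[key].append(value)

    def put_all(self, key, values):
        for v in values:
            self.put(key, v)

    def remove(self, key, value):
        values = self.inner.get(key, [])
        if value in values:
            values.remove(value)
            return True
        return False

    def __len__(self):
        return len(self.inner)

    def __contains__(self, key):
        return key in self.inner

    def __eq__(self, other):
        return isinstance(other, ListMultimap) and self.inner == other.inner


m = ListMultimap()
m.put('Israel', 'Tel-Aviv')
m.put('Israel', 'Jerusalem')
m.put_all('England', ('London', 'Manchester', 'Moscow'))
m.remove('England', 'Moscow')
print(m.get('Israel'))
print(m.get('England'))
```

Returning a copy from get() is a design choice of this sketch: callers cannot change the stored lists by accident.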

Biopython

- An international association of developers of freely available Python (http://www.python.org) tools for computational molecular biology.
- Provides tools for:
  - Parsing files (FASTA, ClustalW, GenBank, ...)
  - Interfaces to common software
  - Operations on sequences
  - Simple machine learning applications
  - BLAST
  - And many more

Installing Biopython

- Go to http://biopython.org/wiki/Download
- Windows / Unix: select Python 2.7
- NumPy is required

SeqIO

- The standard sequence input/output interface for Biopython.
- Provides a simple, uniform interface to input and output assorted sequence file formats.
- Deals with sequences as SeqRecord objects.
- There is a sister interface, Bio.AlignIO, for working directly with sequence alignment files as Alignment objects.

Parsing a FASTA file

    # Parse a simple FASTA file
    from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
        print seq_record.id
        print repr(seq_record.seq)
        print len(seq_record)

Why repr and not str?
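To make the record-by-record iteration concrete, here is a minimal pure-Python sketch of what a FASTA parser does: it groups the lines under each ">" header into one record. It is an illustration only, not Biopython's implementation; the function name parse_fasta and the two-record input are hypothetical, and the code is written for Python 3.

```python
import io


def parse_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The identifier is the first word after '>'.
            fields = line[1:].split()
            identifier, chunks = (fields[0] if fields else ""), []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Hypothetical two-record input, used only for illustration.
example = io.StringIO(">seq1 first toy record\nACGT\nAC\n>seq2\nGGTTA\n")
records = list(parse_fasta(example))
for identifier, sequence in records:
    print(identifier, len(sequence))
```

Because the function is a generator, records are produced one at a time, which is also why a loop over SeqIO.parse can handle files that do not fit in memory.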

GenBank files

    # GenBank files
    from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
        print seq_record
        # added to print just one record example
        break

GenBank files

    from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
        print seq_record.id
        print repr(seq_record.seq)
        print len(seq_record)

Sequence objects

- Support similar methods as standard strings.
- Provide additional methods: translate, reverse complement.
- Support different alphabets: AGTAGTTAAA can be DNA or protein.

Sequences and alphabets

- Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, and additionally provides the ability to extend and customize the basic definitions.
- For example: adding ambiguous symbols, adding special new characters.

Example: generic (non-specific) alphabet

>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.alphabet
Alphabet()

Example: specific sequences

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
>>> my_seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> my_seq.alphabet
IUPACUnambiguousDNA()

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_prot = Seq("AGTACACTGGT", IUPAC.protein)
>>> my_prot
Seq('AGTACACTGGT', IUPACProtein())
>>> my_prot.alphabet
IUPACProtein()

Sequences act like strings

Access elements:

>>> my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
>>> print my_seq[0]  # first letter
G
>>> print my_seq[2]  # third letter
T
>>> print my_seq[-1]  # last letter
G

Count without overlaps:

>>> from Bio.Seq import Seq
>>> "AAAA".count("AA")
2
>>> Seq("AAAA").count("AA")
2

Calculate GC content

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> GC(my_seq)
46.875

Slicing

Simple slicing:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq[4:12]
Seq('GATGGGCC', IUPACUnambiguousDNA())

Start, stop, stride:

>>> my_seq[0::3]
Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())
>>> my_seq[1::3]
Seq('AGGCATGCATC', IUPACUnambiguousDNA())
>>> my_seq[2::3]
Seq('TAGCTAAGAC', IUPACUnambiguousDNA())

Concatenation

- Simple addition, as in Python.
- But the alphabets must fit.

>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> protein_seq + dna_seq
Traceback (most recent call last):
...

Changing case

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> dna_seq = Seq("acgtACGT", generic_dna)
>>> dna_seq
Seq('acgtACGT', DNAAlphabet())
>>> dna_seq.upper()
Seq('ACGTACGT', DNAAlphabet())
>>> dna_seq.lower()
Seq('acgtacgt', DNAAlphabet())

Changing case

Case is important for matching:

>>> "GTAC" in dna_seq
False
>>> "GTAC" in dna_seq.upper()
True

IUPAC names are upper case:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> dna_seq.lower()
Seq('acgt', DNAAlphabet())

Reverse complement

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())

Transcription

>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> template_dna = coding_dna.reverse_complement()
>>> messenger_rna = coding_dna.transcribe()
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

As you can see, all this does is switch T to U and adjust the alphabet.
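The three operations above are simple string transformations, which the following pure-Python sketch makes explicit. It is written for Python 3 and works on plain strings rather than Seq objects; the function names mirror the Biopython method names but are otherwise an assumption of this sketch, and alphabets are ignored.

```python
# Map each DNA base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    # Complement every base, then read the strand in the opposite direction.
    return complement(dna)[::-1]


def transcribe(coding_dna):
    # The mRNA equals the coding strand with uracil in place of thymine.
    return coding_dna.replace("T", "U").replace("t", "u")


my_seq = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(complement(my_seq))
print(reverse_complement(my_seq))
print(transcribe(coding_dna))
```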

Translation: simple example

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

The * marks a stop codon!

Translation from the DNA

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

Using different translation tables

- In several cases we may want to use different translation tables.
- Translation tables are given IDs in GenBank (standard = 1).
- Vertebrate Mitochondrial is table 2.
- More details: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
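To see why the choice of table matters, the sketch below builds NCBI tables 1 and 2 from their published amino-acid strings (codons in TCAG order) and translates the same coding sequence with each. It is a pure-Python 3 illustration on plain strings, not Biopython's translate(); the names TABLES and translate are chosen here for the example.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for all 64 codons in TCAG order; '*' marks a stop codon.
STANDARD = (
    "FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG"
)
VERTEBRATE_MITO = (
    "FFLLSSSSYY**CCWW" "LLLLPPPPHHQQRRRR"
    "IIMMTTTTNNKKSS**" "VVVVAAAADDEEGGGG"
)
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
TABLES = {
    1: dict(zip(CODONS, STANDARD)),
    2: dict(zip(CODONS, VERTEBRATE_MITO)),
}


def translate(dna, table=1):
    """Translate complete codons of an upper-case DNA string."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(TABLES[table][dna[i:i + 3]] for i in range(0, usable, 3))


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding_dna, table=1))
print(translate(coding_dna, table=2))
```

The two results differ at the eighth codon, TGA, which is a stop in the standard table and tryptophan (W) in the vertebrate mitochondrial table.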

Using different translation tables34>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)

>>> coding_dna.translate()Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

>>> coding_dna.translate(table="Vertebrate Mitochondrial")Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

>>> coding_dna.translate(table=2)Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*')) Translation tables in biopython35

Translate up to the first stop in frame

>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(to_stop=True)
Seq('MAIVMGR', IUPACProtein())
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table=2, to_stop=True)
Seq('MAIVMGRWKGAR', IUPACProtein())

Comparing sequences

Standard == comparison is done by comparing the references (!), hence:

>>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
>>> seq2 = Seq("ACGT", IUPAC.unambiguous_dna)
>>> seq1 == seq2
Warning (from warnings module):
FutureWarning: In future comparing Seq objects will use string comparison (not object comparison). Incompatible alphabets will trigger a warning (not an exception) please use str(seq1)==str(seq2) to make your code explicit and to avoid this warning.
False
>>> seq1 == seq1
True

Mutable vs. immutable

- Like strings, standard Seq objects are immutable.
- If you want a mutable object, either:
  - use the tomutable() method, or
  - use the MutableSeq constructor:

    mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

Unknown sequences example

In many biological cases we deal with unknown sequences.

>>> from Bio.Seq import Seq, UnknownSeq
>>> from Bio.Alphabet import IUPAC
>>> unk_dna = UnknownSeq(20, alphabet=IUPAC.ambiguous_dna)
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
>>> unk_dna + my_seq
Seq('NNNNNNNNNNNNNNNNNNNNGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACAmbiguousDNA())

MSA

Read MSA

- Use Bio.AlignIO.read(file, format)
  - file: the file path
  - format: supported formats include "stockholm", "fasta" and "clustal"
- Use help(AlignIO) for details.

Example

We want to parse this file from PFAM.
[File listing not preserved in the transcript.]

Example

    from Bio import AlignIO
    alignment = AlignIO.read("PF05371.sth", "stockholm")
    print alignment

Alignment object example

>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
>>> print alignment[1]
ID: Q9T0Q8_BPIKE/1-52
Name: Q9T0Q8_BPIKE
Description: Q9T0Q8_BPIKE/1-52
Number of features: 0
/start=1
/end=52
/accession=Q9T0Q8.1
Seq('AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA', SingleLetterAlphabet())

Alignment object example

>>> print "Alignment length %i" % alignment.get_alignment_length()
Alignment length 52
>>> for record in alignment:
...     print "%s - %s" % (record.seq, record.id)
...
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA - COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA - Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA - COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA - COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA - COATB_BPIF1/22-73

Cross-references example

Did you notice in the raw file above that several of the sequences include database cross-references to the PDB and the associated known secondary structure?

>>> for record in alignment:
...     if record.dbxrefs:
...         print record.id, record.dbxrefs
...
COATB_BPIKE/30-81 ['PDB; 1ifl ; 1-52;']
COATB_BPM13/24-72 ['PDB; 2cpb ; 1-49;', 'PDB; 2cps ; 1-49;']
Q9T0Q9_BPFD/1-49 ['PDB; 1nh4 A; 1-49;']
COATB_BPIF1/22-73 ['PDB; 1ifk ; 1-50;']

Comments

- Remember that almost all MSA formats are supported.
- When you have more than one MSA in your files, use AlignIO.parse().
- A common example is PHYLIP's output: use AlignIO.parse("resampled.phy", "phylip").
- The result is an iterator object that contains all MSAs.

Write alignment to file

    from Bio.Alphabet import generic_dna
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    from Bio.Align import MultipleSeqAlignment

    align1 = MultipleSeqAlignment([
        SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"),
        SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"),
        SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),
    ])

    from Bio import AlignIO
    AlignIO.write(align1, "my_example.phy", "phylip")

Example PHYLIP content (a file holding three alignments; align1 is the first):

 3 12
Alpha      ACTGCTAGCT AG
Beta       ACT-CTAGCT AG
Gamma      ACTGCTAGDT AG
 3 9
Delta      GTCAGC-AG
Epislon    GACAGCTAG
Zeta       GTCAGCTAG
 3 13
Eta        ACTAGTACAG CTG
Theta      ACTAGTACAG CT-
Iota       -CTACTACAG GTG

Slicing

Alignments work like NumPy matrices.

>>> print alignment[2,6]
T

# You can pull out a single column as a string like this:
>>> print alignment[:,6]
TTT---T

>>> print alignment[3:6,:6]
SingleLetterAlphabet() alignment with 3 rows and 6 columns
AEGDDP COATB_BPM13/24-72
AEGDDP COATB_BPZJ2/1-49
AEGDDP Q9T0Q9_BPFD/1-49

>>> print alignment[:,:6]
SingleLetterAlphabet() alignment with 7 rows and 6 columns
AEPNAA COATB_BPIKE/30-81
AEPNAA Q9T0Q8_BPIKE/1-52
DGTSTA COATB_BPI22/32-83
AEGDDP COATB_BPM13/24-72
AEGDDP COATB_BPZJ2/1-49
AEGDDP Q9T0Q9_BPFD/1-49
FAADDA COATB_BPIF1/22-73

External applications

- How do we call MSA algorithms on an unaligned set of sequences?
- Biopython provides wrappers.
- The idea:
  - Create a command line object with the algorithm options.
  - Invoke the command (Python uses subprocesses).
- The Bio.Align.Applications module:

>>> import Bio.Align.Applications
>>> dir(Bio.Align.Applications)
['ClustalwCommandline', 'DialignCommandline', 'MafftCommandline', 'MuscleCommandline', 'PrankCommandline', 'ProbconsCommandline', 'TCoffeeCommandline']

ClustalW example

- First step: download ClustalW from ftp://ftp.ebi.ac.uk/pub/software/clustalw2/2.1/
- Second step: install.
- Third step: look for the ClustalW exe files.
- Now you can run ClustalW from your Python code.

Run example

>>> import os
>>> from Bio.Align.Applications import ClustalwCommandline
>>> clustalw_exe = r"C:\Program Files\new clustal\clustalw2.exe"
>>> clustalw_cline = ClustalwCommandline(clustalw_exe, infile="opuntia.fasta")
>>> assert os.path.isfile(clustalw_exe), "Clustal W executable missing"
>>> stdout, stderr = clustalw_cline()

The command line is actually a function we can run!

ClustalW

>>> from Bio import AlignIO
>>> align = AlignIO.read("opuntia.aln", "clustal")
>>> print align
SingleLetterAlphabet() alignment with 7 rows and 906 columns
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273285|gb|AF191659.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273284|gb|AF191658.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273287|gb|AF191661.1|AF191
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273286|gb|AF191660.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273290|gb|AF191664.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273289|gb|AF191663.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAGA gi|6273291|gb|AF191665.1|AF191

ClustalW: tree

In case you are interested, the opuntia.dnd file ClustalW creates is just a standard Newick tree file, and Bio.Phylo can parse these:

>>> from Bio import Phylo
>>> tree = Phylo.read("opuntia.dnd", "newick")
>>> Phylo.draw_ascii(tree)

BLAST

Running BLAST over the internet

We use the function qblast() in the Bio.Blast.NCBIWWW module. This has three non-optional arguments:

- The BLAST program to use for the search, as a lower case string: works with blastn, blastp, blastx, tblastn and tblastx.
- The databases to search against. The options for this are available on the NCBI web pages at http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.shtml.
- A string containing your query sequence. This can either be the sequence itself, the sequence in FASTA format, or an identifier like a GI number.

qblast additional parameters

qblast can receive other parameters, analogous to the parameters of the actual server. Important examples:

- format_type: "HTML", "Text", "ASN.1", or "XML". The default is "XML", as that is the format expected by the parser (see next examples).
- expect: sets the expectation or e-value threshold.

Step 1: call BLAST

>>> from Bio.Blast import NCBIWWW

# Option 1: use a GI ID
>>> result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")

# Option 2: read a FASTA file
>>> fasta_string = open("m_cold.fasta").read()
>>> result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)

# Option 3: parse the file into a Seq object
>>> from Bio import SeqIO
>>> record = SeqIO.read(open("m_cold.fasta"), format="fasta")
>>> result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)

Step 2: parse the results

- read can be used only once!
- The blast_record object keeps the actual results.

>>> from Bio.Blast import NCBIXML
>>> blast_record = NCBIXML.read(result_handle)

Remarks

- Basically, Biopython supports reading BLAST results from HTML and text files. These methods are not stable and sometimes fail because the servers change the format.
- XML is stable.
- You can save XML files:
  - on the server, or
  - from result_handle objects (next slide).

Save results as XML

read can be used only once!

>>> save_file = open("my_blast.xml", "w")
>>> save_file.write(result_handle.read())
>>> save_file.close()
>>> result_handle.close()

BLAST records

A BLAST Record contains everything you might ever want to extract from the BLAST output. Example:

>>> E_VALUE_THRESH = 0.04
>>> for alignment in blast_record.alignments:
...     for hsp in alignment.hsps:
...         if hsp.expect < E_VALUE_THRESH:
...             print '****Alignment****'
...             print 'sequence:', alignment.title
...             print 'length:', alignment.length
...             print 'e value:', hsp.expect
...             print hsp.query[0:75] + '...'
...             print hsp.match[0:75] + '...'
...             print hsp.sbjct[0:75] + '...'

BLAST records
[Figure not preserved in the transcript.]

More functions

We cover here only very basic functions. To get more details use:

>>> import Bio.Blast.Record
>>> help(Bio.Blast.Record)
Help on module Bio.Blast.Record in Bio.Blast:

NAME
    Bio.Blast.Record - Record classes to hold BLAST output.

FILE
    d:\python27\lib\site-packages\bio\blast\record.py

DESCRIPTION
    Classes:
    Blast              Holds all the information from a blast search.
    PSIBlast           Holds all the information from a psi-blast search.
    Header             Holds information from the header.
    Description        Holds information about one hit description.
    Alignment          Holds information about one alignment hit.
    HSP                Holds information about one HSP.
    MultipleAlignment  Holds information about a multiple alignment.
    DatabaseReport     Holds information from the database report.
    Parameters         Holds information from the parameters.

Accessing NCBI's Entrez Databases

- Bio.Entrez: module for programmatic access to Entrez.
- Example: search PubMed or download GenBank records from within a Python script.
- Makes use of the Entrez Programming Utilities: http://www.ncbi.nlm.nih.gov/entrez/utils/
- Makes sure that the correct URL is used for the queries, and that not more than one request is made every three seconds, as required by NCBI.
- Note! If the NCBI finds you are abusing their systems, they can and will ban your access!

ESearch example

>>> from Bio import Entrez
>>> handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]")
>>> record = Entrez.read(handle)
# Each of the IDs is a GenBank identifier.
>>> print (record["IdList"])
['126789333', '442591189', '442591187', '442591185', '442591183', '442591181', '442591179', '442591177', '442591175', '442591173', '442591171', '442591169', '442591167', '442591165', '442591163', '442591161', '442591159', '442591157', '442591155', '442591153']

Explanation

Entrez.read transforms the actual results (retrieved as XML) into a usable object of type Bio.Entrez.Parser.DictionaryElement.

>>> record
{u'Count': '158', u'RetMax': '20', u'IdList': ['126789333', '442591189', '442591187', '442591185', '442591183', '442591181', '442591179', '442591177', '442591175', '442591173', '442591171', '442591169', '442591167', '442591165', '442591163', '442591161', '442591159', '442591157', '442591155', '442591153'], u'TranslationStack': [{u'Count': '2482', u'Field': 'Organism', u'Term': '"Cypripedioideae"[Organism]', u'Explode': 'Y'}, {u'Count': '71514', u'Field': 'Gene', u'Term': 'matK[Gene]', u'Explode': 'N'}, 'AND'], u'TranslationSet': [{u'To': '"Cypripedioideae"[Organism]', u'From': 'Cypripedioideae[Orgn]'}], u'RetStart': '0', u'QueryTranslation': '"Cypripedioideae"[Organism] AND matK[Gene]'}

Database options

'pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest', 'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap', 'domains', 'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene', 'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'pccompound', 'pcsubstance', 'snp', 'taxonomy', 'toolkit', 'unigene', 'unists'

Download a full record

>>> from Bio import Entrez
# Always tell NCBI who you are
>>> Entrez.email = "[email protected]"
# rettype: get a GenBank record
>>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text")
>>> print handle.read()

[Output not preserved in the transcript.]

Changing rettype from "gb" to "fasta" returns the record in FASTA format.

Read directly into a SeqIO record

>>> from Bio import Entrez, SeqIO
>>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text")
>>> record = SeqIO.read(handle, "genbank")
>>> handle.close()
>>> print record
ID: EU490707.1
Name: EU490707
Description: Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast.
Number of features: 3
...
Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA', IUPACAmbiguousDNA())

Download directly from a URL

- Suppose we know what the database URLs look like.
- Example: GEO (Gene Expression Omnibus): "http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE6609&format=file"

Use the urllib2 module

>>> import urllib2
>>> u = urllib2.urlopen('http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE6609&format=file')
>>> localFile = open('gse6609_raw.tar', 'wb')  # binary mode: the download is a tar archive
>>> for x in u:
...     localFile.write(x)
...
>>> localFile.close()

More details

- We covered only a few concepts.
- For more details on Biopython options, including dealing with specialized parsers, see chapter 9 of the tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:parsing-blast
- Look at the urllib2 manual: http://docs.python.org/2/library/urllib2.html

Sequence Motifs

Gene expression regulation

- Transcription is regulated mainly by transcription factors (TFs): proteins that bind to DNA subsequences called binding sites (BSs).
- TFBSs are located mainly in the gene's promoter: the DNA sequence upstream of the gene's transcription start site (TSS).
- TFs can promote or repress transcription.
- Other regulators: micro-RNAs (miRNAs).

Ab-initio motif discovery

- You are given a set of strings.
- You want to find a motif that is significantly represented in the strings.
- For example: a TF/miRNA binding site.

TFBS models

The BSs of a particular TF share a common pattern, or motif, which is often modeled using:

- A degenerate string: GGWATB (W = {A,T}, B = {C,G,T})
- A PWM (position weight matrix):

Position   1     2     3     4     5     6
A          0.1   0.8   0     0.7   0.2   0
C          0     0.1   0.5   0.1   0.4   0.6
G          0     0     0.5   0.1   0.4   0.1
T          0.9   0.1   0     0.1   0     0.3

Cutoff = 0.009

Sequences and their scores:

AGCTACACCCATTTAT   0.06
AGTAGAGCCTTCGTG    0.06
CGATTCTACAATATGA   0.01

ATCGGAATTCTGCAGGGCAATTCGGGAATGAGGTATTCTCAGATTA
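The scores listed with the PWM are consistent with multiplying the per-position probabilities over a 6-letter window and keeping the best window of each sequence. The sketch below works under that assumption; it is a Python 3 illustration, and the function names window_score and best_hit are chosen here rather than taken from any library.

```python
# Column probabilities of the PWM, position 1 first.
PWM = {
    "A": [0.1, 0.8, 0.0, 0.7, 0.2, 0.0],
    "C": [0.0, 0.1, 0.5, 0.1, 0.4, 0.6],
    "G": [0.0, 0.0, 0.5, 0.1, 0.4, 0.1],
    "T": [0.9, 0.1, 0.0, 0.1, 0.0, 0.3],
}
WIDTH = 6
CUTOFF = 0.009


def window_score(window):
    """Product of the PWM entries selected by each letter of the window."""
    score = 1.0
    for position, base in enumerate(window):
        score *= PWM[base][position]
    return score


def best_hit(sequence):
    """Return (score, start, window) for the highest-scoring window."""
    hits = [
        (window_score(sequence[i:i + WIDTH]), i, sequence[i:i + WIDTH])
        for i in range(len(sequence) - WIDTH + 1)
    ]
    # Ties on score keep the leftmost window.
    return max(hits, key=lambda hit: hit[0])


for seq in ("AGCTACACCCATTTAT", "AGTAGAGCCTTCGTG"):
    score, start, window = best_hit(seq)
    print(seq, window, start, round(score, 2), score >= CUTOFF)
```

For long motifs or many sequences, products of small probabilities underflow quickly, so practical tools usually sum log-odds scores instead of multiplying raw probabilities.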

Motif discovery: the typical two-step pipeline

1. Obtain a co-regulated gene set, for example from:
   - clustering of gene expression microarrays (Cluster I, Cluster II, Cluster III),
   - location analysis (ChIP-chip, ...),
   - a functional group (e.g., a GO term).
2. Run motif discovery on the promoter/3'UTR sequences of the co-regulated gene set.

Motif discovery: goals and challenges

- Goal: reverse-engineer the transcriptional regulatory network.
- Challenges:
  - BSs are short and degenerate (non-specific).
  - Promoters are long and complex (hard to model).
  - The search space is huge (motif and sequence).
  - Data is noisy.
  - What to look for? (enriched? localized? conserved?)
- The problem is still considered very difficult despite extensive research.

Biopython motif objects

    from Bio import motifs
    from Bio.Seq import Seq
    instances = [Seq("TACAA"), Seq("TACGC"), Seq("TACAC"), Seq("TACCC"),
                 Seq("AACCC"), Seq("AATGC"), Seq("AATGC")]
    m = motifs.create(instances)
    print m

TACAA
TACGC
TACAC
TACCC
AACCC
AATGC
AATGC

Biopython motif objects

>>> print m.counts
        0      1      2      3      4
A:   3.00   7.00   0.00   2.00   1.00
C:   0.00   0.00   5.00   2.00   6.00
G:   0.00   0.00   0.00   3.00   0.00
T:   4.00   0.00   2.00   0.00   0.00

Biopython motif objects

>>> m.consensus
Seq('TACGC', IUPACUnambiguousDNA())
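The counts matrix and the consensus can be reproduced with a few lines of plain Python, which shows what the motif object stores. This is a Python 3 sketch on plain strings, not Biopython's implementation; the helper names count_matrix and consensus are chosen for this example, and ties are broken in A, C, G, T order as an assumption.

```python
from collections import Counter

ALPHABET = "ACGT"
instances = ["TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC"]


def count_matrix(seqs):
    """Return {letter: [count in column 0, count in column 1, ...]}."""
    columns = [Counter(column) for column in zip(*seqs)]
    return {letter: [col[letter] for col in columns] for letter in ALPHABET}


def consensus(counts):
    """Most frequent letter per column (first in ACGT order on ties)."""
    length = len(counts["A"])
    return "".join(
        max(ALPHABET, key=lambda letter: counts[letter][i])
        for i in range(length)
    )


counts = count_matrix(instances)
for letter in ALPHABET:
    print(letter, counts[letter])
print(consensus(counts))
```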

# The anticonsensus sequence, corresponding to the smallest values in the columns of the .counts matrix:
>>> m.anticonsensus
Seq('GGGTG', IUPACUnambiguousDNA())

Motif database: JASPAR (http://jaspar.genereg.net/)
[Screenshots not preserved in the transcript.]

Read records

    from Bio import motifs
    arnt = motifs.read(open("Arnt.sites"), "sites")
    print arnt.counts

        0      1      2      3      4      5
A:   4.00  19.00   0.00   0.00   0.00   0.00
C:  16.00   0.00  20.00   0.00   0.00   0.00
G:   0.00   1.00   0.00  20.00   0.00  20.00
T:   0.00   0.00   0.00   0.00  20.00   0.00

MEME

- MEME is a tool for discovering motifs in a group of related DNA or protein sequences.
- It takes as input a group of DNA or protein sequences and outputs as many motifs as requested.
- Therefore, in contrast to JASPAR files, MEME output files typically contain multiple motifs.

Assumptions

- The number of motifs is known (assume this number is 1).
- The size of the motif is known (biologically, we have estimates for the size for TFs and miRNAs).

Missing information

- PWM of the motif
- PWM of the background
- Motif locations

Assumptions

- Given a sequence X and a PWM Y of the same length, we can calculate P(X|Y).
- Assume independence of motif positions.

Given a PWM, we can now calculate, for each position K in each sequence J, the probability that the motif starts at position K of sequence J.
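Under the independence assumption, P(X|Y) is the product of the PWM entries picked out by the letters of X. The probability that the motif starts at position K is then proportional to that product over the window, times the background probabilities of all letters outside it. The Python 3 sketch below illustrates this for one sequence, assuming exactly one site per sequence and a uniform prior over start positions; the 2-column PWM, the uniform background and the function names are hypothetical values chosen for the example.

```python
from math import prod

# Hypothetical background and 2-column motif model (illustrative values only).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},  # motif column 1
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},  # motif column 2
]


def likelihood_given_start(seq, k):
    """P(seq | motif starts at k): PWM inside the window, background outside."""
    width = len(PWM)
    inside = prod(PWM[j][seq[k + j]] for j in range(width))
    outside = prod(
        BACKGROUND[base]
        for i, base in enumerate(seq)
        if not k <= i < k + width
    )
    return inside * outside


def start_posteriors(seq):
    """Probability of each start position, normalized to sum to 1."""
    starts = range(len(seq) - len(PWM) + 1)
    weights = [likelihood_given_start(seq, k) for k in starts]
    total = sum(weights)
    return [w / total for w in weights]


posteriors = start_posteriors("GGATCC")
print([round(p, 3) for p in posteriors])
```

In this toy case the window "AT" at position 2 receives almost all of the probability mass, because both of its letters match the high-probability entries of the PWM.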

Expectation Maximization (EM) Algorithm

- Start with an initial guess for the PWMs.
- The EM algorithm consists of two steps, which are repeated consecutively.
- Step 1, the expectation step: estimate the probability of finding the site at any position in each of the sequences. These probabilities are used to provide new information as to the expected base or aa distribution for each column in the site.
- Step 2, the maximization step: the new counts for bases or aa for each position in the site found in step 1 are substituted for the previous set.

Expectation Maximization (EM) Algorithm

[Figure: sequences drawn as rows of O (background) with an XXXX block marking the motif.]

Columns defined by a preliminary alignment of the sequences provide initial estimates of the frequencies of aa in each motif column. Columns not in the motif provide background frequencies.

Bases   Background   Site column 1   Site column 2
G       0.27         0.4             0.1
C       0.25         0.4             0.1
A       0.25         0.2             0.1
T       0.23         0.2             0.7
Total   1.00         1.00            1.00

Expectation Maximization (EM) Algorithm

[Figure: the motif window XXXX placed at positions A and B of sequence 1.]

Use the previous estimates of aa or nucleotide frequencies for each column in the motif to calculate the probability of the motif in this position, and multiply by the background frequencies in the remaining positions. The resulting score gives the likelihood that the motif matches positions A, B or other in seq 1. Repeat for all other positions and find the most likely locator. Then repeat for the remaining seqs.

The site probabilities for each seq calculated at the 1st step are then used to create a new table of expected values for base counts for each of the site positions, using the site probabilities as weights. Suppose that P(site 1 in seq 1) = P_site1,seq1 / (P_site1,seq1 + P_site2,seq1 + ... + P_site78,seq1) = 0.01 and P(site 2 in seq 1) = 0.02. Then these values are added to the previous table as shown in the table below. This procedure is repeated for every other possible first column in seq 1, and then the process continues for all other sequences, resulting in a new version of the table. The expectation and maximization steps are repeated until the estimates of base frequencies do not change.

EM Algorithm, 2nd optimisation step: calculations

Bases            Background   Site column 1   Site column 2
G                0.27 + ...   0.4 + ...       0.1 + ...
C                0.25 + ...   0.4 + ...       0.1 + ...
A                0.25 + ...   0.2 + 0.01      0.1 + ...
T                0.23 + ...   0.2 + ...       0.7 + 0.02
Total/weighted   1.00         1.00            1.00

Run MEME (http://meme.nbcr.net/meme/cgi-bin/meme.cgi)

Results
[Screenshot not preserved in the transcript.]

Parse results

>>> from Bio import motifs
>>> handle = open("meme.dna.oops.txt")
>>> record = motifs.parse(handle, "meme")
>>> handle.close()
>>> len(record)
2
>>> motif = record[0]
>>> print motif.consensus
TTCACATGCCGC
>>> print motif.degenerate_consensus
TTCACATGSCNC

Motif attributes

>>> motif.num_occurrences
7
>>> motif.length
12
>>> evalue = motif.evalue
>>> print "%3.1g" % evalue
0.2
>>> motif.name
'Motif 1'

Where the motif was found

>>> motif = record['Motif 1']
# Each motif has an attribute .instances with the sequence instances in which the motif was found, providing some information on each instance
>>> len(motif.instances)
7
>>> motif.instances[0]
Instance('TTCACATGCCGC', IUPACUnambiguousDNA())
>>> motif.instances[0].start
620
>>> motif.instances[0].strand
'-'
>>> motif.instances[0].length
12
>>> pvalue = motif.instances[0].pvalue
>>> print "%5.3g" % pvalue
1.85e-08

Amadeus

- Advanced algorithms improve upon MEME.
- Amadeus is an algorithm for motif finding.
- It appears to be one of the top algorithms in many tests.
- Java-based tool with an easy-to-use GUI.
- Supports analysis of TFs and miRNAs.
- Developed here in TAU.

Amadeus: A Motif Algorithm for Detecting Enrichment in mUltiple Species

- Supports diverse motif discovery tasks:
  - Finding over-represented motifs in one or more given sets of genes.
  - Identifying motifs with global spatial features given only the genomic sequences.
  - Simultaneous inference of motifs and their associated expression profiles given genome-wide expression datasets.
- How?
  - A general pipeline architecture for enumerating motifs.
  - Different statistical scoring schemes of motifs for different motif discovery tasks.

Input: ~350 genes expressed in the human G2+M cell-cycle phases [Whitfield et al. 02]

[Figure: motifs found, labeled CHR and NF-Y (CCAAT-box).]

Pairs analysis
[Figure not preserved in the transcript.]

Clustering analysis

Clustering: reminder

- Cluster analysis is the grouping of items into clusters based on the similarity of the items to each other.
- Bio.Cluster module:
  - K-means
  - SOM
  - Hierarchical clustering
  - PCA

K-means clustering (MacQueen, 65)

- Input: a set of observations (x1, x2, ..., xn). For example, each observation is a gene, and x is its values.
- Goal: partition the observations into K clusters S = {S1, S2, ..., Sk}.
- Objective function: $\arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$, where $\mu_i$ is the mean of the observations in $S_i$.

K-means clustering (MacQueen, 65)

- Initialize an arbitrary partition P into k clusters C1, ..., Ck.
- For cluster Cj and element i not in Cj, E_P(i, Cj) = cost of the solution if i is moved to cluster Cj.
- Pick a move E_P(r, Cs) and apply it if the new partition is better.
- Repeat until no improvement is possible.
- Requires knowledge of k.

K-means variations

- Compute a centroid cp for each cluster Cp, e.g., gravity center = average vector.
- Solution cost: the sum, over clusters p and over elements i in cluster p, of d(vi, cp).
- Parallel version: move each element to the cluster with the closest centroid simultaneously.
- Sequential version: one at a time.
- "Moving centers" approach.
- Objective = homogeneity only (k fixed).

[Figures not preserved in the transcript.] The centers of the clusters that were deserted should move too; this is not shown.

Data representation

- The data to be clustered are represented by an n x m Numerical Python array data.
- Within the context of gene expression data clustering, typically the rows correspond to different genes whereas the columns correspond to different experimental conditions.
- The clustering algorithms in Bio.Cluster can be applied both to rows (genes) and to columns (experiments).

Distance/similarity functions

- 'e': Euclidean distance
- 'c': Pearson correlation coefficient
- 'a': Absolute value of the Pearson correlation coefficient
- 'u': Cosine of the angle between two data vectors
- 'x': Absolute uncentered Pearson correlation
- 's': Spearman's rank correlation
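As a sketch of how the pieces above fit together (centroids as average vectors, the parallel update, Euclidean distance), here is a small pure-Python 3 version of k-means. It is not Bio.Cluster's kcluster: the function names, the choice of the first k rows as initial centers and the toy data are assumptions made for this illustration.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def centroid(points):
    """Gravity center: the coordinate-wise average vector."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def kmeans(data, k, max_iter=100):
    """Parallel k-means; returns (cluster ids, within-cluster sum of distances)."""
    # Assumption of this sketch: the first k rows serve as initial centers.
    centers = [list(point) for point in data[:k]]
    assignment = None
    for _ in range(max_iter):
        # Move every element to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(point, centers[c]))
            for point in data
        ]
        if new_assignment == assignment:
            break  # no element changed cluster: converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(data, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = centroid(members)
    error = sum(euclidean(p, centers[a]) for p, a in zip(data, assignment))
    return assignment, error


# Hypothetical toy data: two well-separated groups of 2-D points.
data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
ids, error = kmeans(data, k=2)
print(ids, round(error, 3))
```

The result depends on the initial centers in general, which is why clustering routines commonly offer repeated runs from different starting points.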

116Calculating distance matrices>>> from Bio.Cluster import distancematrix >>> matrix = distancematrix(data)data- requiredAdditional options:transpose(default:0)Determines if the distances between the rows ofdataare to be calculated (transpose==0), or between the columns ofdata(transpose==1).dist(default:'e', Euclidean distance)

Distance matrix
Because the distance matrix is symmetric, Biopython saves space by keeping only one triangle of it: row i holds the distances from item i to items 0 to i-1.
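The triangular storage can be illustrated with a small pure-Python sketch. The helper names are hypothetical, and the ordinary Euclidean distance used here is an assumption for the illustration; the values need not match what Bio.Cluster returns, since its scaling of the distance may differ.

```python
import math


def euclidean(u, v):
    """Ordinary Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def lower_triangle_distances(rows, dist=euclidean):
    """Row i holds the distances from item i to items 0..i-1 (row 0 is empty)."""
    return [[dist(rows[i], rows[j]) for j in range(i)] for i in range(len(rows))]


def lookup(triangle, i, j):
    """Distance between items i and j, read from the triangular storage."""
    if i == j:
        return 0.0  # the diagonal is not stored
    if i < j:
        i, j = j, i  # the matrix is symmetric, so swap into the stored half
    return triangle[i][j]


if __name__ == "__main__":
    data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
    triangle = lower_triangle_distances(data)
    print(triangle)
    print(lookup(triangle, 0, 2))
```

For n items this stores n(n-1)/2 values instead of n*n, which is the space saving the slide refers to.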

Partitioning algorithms
Algorithms that receive the number of clusters K as an argument:
K-means
K-medians
Often referred to as EM variations.

Analysis example

# Read the data
import csv
# Open in text mode, as the csv module expects
file = open('ge_data_example.txt', newline='')
data = csv.reader(file, delimiter='\t')
table = [row for row in data]

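>>> len(table)
100
>>> table[1][1]
'9.412'
>>> table[0][0]
'sample'
>>> len(table[1])
17

# Transform the data to a numpy matrix
from numpy import *
# Note: table[1:][1:] is the same as table[2:], i.e. it drops the first two rows.
# To drop the header row and the first column instead, use [row[1:] for row in table[1:]].
mat = matrix(table[1:][1:], dtype='float')
print(len(mat))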

# Create the distance matrix
from Bio.Cluster import distancematrix
dist_matrix = distancematrix(mat)


# Cluster
from Bio.Cluster import kcluster
clusterid, error, nfound = kcluster(mat)

clusterid: array with the cluster assignment of each row
error: the within-cluster sum of distances
nfound: the number of times the returned solution was found

>>> clusterid
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> error
15988.118370804612
>>> nfound
1

kcluster: other options
nclusters (default: 2): the number of clusters k.
transpose (default: 0): determines whether rows (transpose is 0) or columns (transpose is 1) are to be clustered.
npass (default: 1): the number of times the k-means/k-medians clustering algorithm is performed.
method (default: 'a'): describes how the center of a cluster is found:
    method=='a': arithmetic mean (k-means clustering);
    method=='m': median (k-medians clustering).
dist (default: 'e', Euclidean distance)
initialid (default: None): specifies the initial clustering to be used for the algorithm.

Hierarchical clustering
from Bio.Cluster import treecluster
tree1 = treecluster(mat)
# Can be applied to a precalculated distance matrix
tree2 = treecluster(distancematrix=dist_matrix)
# Get the cluster assignments
clusterid = tree1.cut(3)

Hierarchical clustering using SciPy
Better visualizations!
import numpy
import pylab
import scipy.cluster.hierarchy as sch

# Create a distance matrix (city-block distance between rows)
x = numpy.asarray(mat)
D = numpy.zeros([len(x), len(x)])
for i in range(len(x)):
    for j in range(len(x)):
        D[i, j] = numpy.sum(numpy.abs(x[i] - x[j]))

# Compute and plot the first dendrogram.
fig = pylab.figure(figsize=(8, 8))
# Add an axes at position rect [left, bottom, width, height], where all quantities are fractions of the figure width and height.
ax1 = fig.add_axes([0.09, 0.1, 0.2, 0.6])
# Clustering analysis
# Note: linkage() interprets a square 2-D array as observation vectors, not as distances;
# to pass distances, convert D to condensed form first (scipy.spatial.distance.squareform).
Y = sch.linkage(D, method='centroid')
Z1 = sch.dendrogram(Y, orientation='right')
ax1.set_xticks([])
ax1.set_yticks([])

# Plot distance matrix.
axmatrix = fig.add_axes([0.3, 0.1, 0.6, 0.6])
idx1 = Z1['leaves']
D = D[idx1, :]
im = axmatrix.matshow(D, aspect='auto', origin='lower', cmap=pylab.cm.YlGnBu)
axmatrix.set_xticks([])
axmatrix.set_yticks([])

# Plot colorbar.
axcolor = fig.add_axes([0.91, 0.1, 0.02, 0.6])
pylab.colorbar(im, cax=axcolor)
fig.show()
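Both kcluster and the cut method of a tree, as used earlier in this example, return one cluster id per clustered row. A small helper can turn such an array into lists of row indices per cluster. The helper name members_by_cluster is hypothetical and the sketch uses only the standard library, so it works on any sequence of integer ids.

```python
from collections import defaultdict


def members_by_cluster(clusterid):
    """Map each cluster id to the list of row indices assigned to it."""
    groups = defaultdict(list)
    for row_index, cid in enumerate(clusterid):
        groups[int(cid)].append(row_index)
    return dict(groups)


if __name__ == "__main__":
    # A made-up assignment, in the same shape as the clusterid array above
    example_ids = [0, 0, 1, 0, 1]
    for cid, rows in sorted(members_by_cluster(example_ids).items()):
        print(cid, rows)
```

With the gene-expression table, the row indices could then be used to look up the corresponding row names in the original table.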

Phylogenetic trees
Remember the Newick format?
Simple example without branch lengths:
(((A,B),(C,D)),(E,F,G));
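To make the nesting in the Newick example concrete, here is a minimal sketch of a parser for this restricted form of the format: leaf names only, with no branch lengths and no labels on internal nodes. The function names are hypothetical, and for real data Bio.Phylo should be preferred.

```python
def parse_newick(text):
    """Parse a names-only Newick string into nested lists of leaf names."""
    text = text.strip()
    if text.endswith(';'):
        text = text[:-1]  # the terminating semicolon is not part of the tree
    pos = 0

    def peek():
        # Current character, or '' once the input is exhausted
        return text[pos] if pos < len(text) else ''

    def parse_node():
        nonlocal pos
        if peek() == '(':
            pos += 1  # consume '('
            children = [parse_node()]
            while peek() == ',':
                pos += 1  # consume ','
                children.append(parse_node())
            if peek() != ')':
                raise ValueError('expected ")" at position %d' % pos)
            pos += 1  # consume ')'
            return children
        # Otherwise this is a leaf: read its name up to the next delimiter
        start = pos
        while pos < len(text) and text[pos] not in '(),':
            pos += 1
        return text[start:pos].strip()

    tree = parse_node()
    if pos != len(text):
        raise ValueError('unexpected text at position %d' % pos)
    return tree


def leaf_names(tree):
    """Leaf names of a parsed tree, in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    names = []
    for child in tree:
        names.extend(leaf_names(child))
    return names


if __name__ == "__main__":
    tree = parse_newick("(((A,B),(C,D)),(E,F,G));")
    print(tree)
    print(leaf_names(tree))
```

Each pair of parentheses becomes one inner list, so the root here has two children: the clade containing A to D and the clade containing E, F and G.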

Visualizing trees
>>> localFile.close()
>>> from Bio import Phylo
>>> tree = Phylo.read("simple.dnd", "newick")
>>> print(tree)
Tree(weight=1.0, rooted=False)
    Clade(branch_length=1.0)
        Clade(branch_length=1.0)
            Clade(branch_length=1.0)
                Clade(branch_length=1.0, name='A')
                Clade(branch_length=1.0, name='B')
            Clade(branch_length=1.0)
                Clade(branch_length=1.0, name='C')
                Clade(branch_length=1.0, name='D')
        Clade(branch_length=1.0)
            Clade(branch_length=1.0, name='E')
            Clade(branch_length=1.0, name='F')
            Clade(branch_length=1.0, name='G')

Use matplotlib
>>> import matplotlib
>>> tree.rooted = True
>>> Phylo.draw(tree)

Phylo IO
Phylo.read() reads a file that contains exactly one tree.
If you have many trees, loop over the object returned by Phylo.parse().
Write to a file using Phylo.write(treeObj, file, format).
Popular formats: Newick ("newick") and phyloXML ("phyloxml").
Convert tree formats using Phylo.convert:
Phylo.convert("tree1.xml", "phyloxml", "tree1.dnd", "newick")