
Post on 20-Dec-2015


Sequence formats

Say we get this protein sequence in fasta format from a database:

>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

Now we need to compare this sequence to all sequences in some other database. Unfortunately this database uses the phylip format, so we need to translate:

Phylip Format:

The first line of the input file contains the number of sequences and their length (all should have the same length) separated by blanks.

The next line starts with the sequence name, followed on the same line and on the following lines by the sequence itself in blocks of 10 characters. The remaining sequences follow in the same way.


Sequence formats

So we copy and paste and reformat the sequence:

Fasta

>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS
GGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSY
TSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

Phylip

1 338
FOSB_MOUSE MFQAFPGDYD SGSRCSSSPS AESQYLSSVD SFGSPPTAAA SQECAGLGEM
PGSFVPTVTA ITTSQDLQWL VQPTLISSMA QSQGQPLASQ PPAVDPYDMP
GTSYSTPGLS AYSTGGASGS GGPSTSTTTS GPVSARPARA RPRRPREETL
TPEEEEKRRV RRERNKLAAA KCRNRRRELT DRLQAETDQL EEEKAELESE
IAELQKEKER LEFVLVAHKP GCKIPYEEGP GPGPLAEVRD LPGSTSAKED
GFGWLLPPPP PPPLPFQSSR DAPPNLTASL FTHSEVQVLG DPFPVVSPSY
TSSFVLTCPE VSAFAGAQRT SGSEQPSDPL NSPSLLAL

and all is well.

Then our boss says “Do it for these 5000 sequences.”


We need an automatic filter!

• A program that reads any number of fasta sequences and converts them into phylip format (want to run sequences through a filter)

• Program structure:

1. Open fasta file

2. Parse file to extract needed information

3. Create and save phylip file

• We will use this definition for the fasta format:

– The header starts with ">".

– The word immediately following the ">" is a unique ID; the next two words are the name of the sequence; the rest of the header is a description.

– All lines of text are shorter than 80 characters.
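The header definition above can be sketched in a few lines of Python. This is only an illustration, not the course's own code: the names FastaHeader and parse_header are hypothetical, and the sketch assumes the words of the header are separated by blanks.

```python
from typing import NamedTuple


class FastaHeader(NamedTuple):
    """Parts of a fasta header (hypothetical helper type)."""
    seq_id: str       # the word immediately after ">"
    name: str         # the next two words
    description: str  # whatever remains of the header line


def parse_header(line: str) -> FastaHeader:
    """Split a header line according to the definition above."""
    if not line.startswith(">"):
        raise ValueError("a fasta header must start with '>'")
    words = line[1:].split()
    if not words:
        raise ValueError("header has no sequence ID")
    return FastaHeader(
        seq_id=words[0],
        name=" ".join(words[1:3]),
        description=" ".join(words[3:]),
    )


header = parse_header(">FOSB_MOUSE Protein fosB. 338 bp")
print(header.seq_id)       # FOSB_MOUSE
print(header.name)         # Protein fosB.
print(header.description)  # 338 bp
```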


Pseudo-code fasta→phylip filter

1. Open and parse fasta file

2. From each header extract sequence ID and name

   1. Open phylip file

   2. Write “1” followed by sequence length

   3. Write sequence ID

   4. Write sequence in blocks of 10

   5. Close file
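A minimal sketch of this pseudo-code in Python might look as follows. It is not the original filter. The function names, the output file name (sequence ID plus ".phylip") and the choice of five blocks per line (copied from the example output shown earlier) are all assumptions made for illustration, and the sketch assumes every header has at least one word after ">".

```python
from pathlib import Path


def read_fasta(path):
    """Return a list of (sequence ID, sequence) pairs from a fasta file."""
    records = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The ID is the first word after ">"
                records.append([line[1:].split()[0], ""])
            elif records:
                # Sequence lines are appended to the most recent record
                records[-1][1] += line
    return [(seq_id, sequence) for seq_id, sequence in records]


def phylip_text(seq_id, sequence, blocks_per_line=5):
    """Format one sequence as the text of a single-sequence phylip file."""
    blocks = [sequence[i:i + 10] for i in range(0, len(sequence), 10)]
    rows = [" ".join(blocks[i:i + blocks_per_line])
            for i in range(0, len(blocks), blocks_per_line)]
    body = "\n".join(rows)
    # First line: "1" followed by the sequence length; then ID and blocks
    return f"1 {len(sequence)}\n{seq_id} {body}\n"


def fasta_to_phylip(fasta_path, out_dir):
    """Write one phylip file per fasta sequence; return the paths written."""
    written = []
    for seq_id, sequence in read_fasta(fasta_path):
        target = Path(out_dir) / f"{seq_id}.phylip"  # assumed naming scheme
        target.write_text(phylip_text(seq_id, sequence))
        written.append(target)
    return written
```

Calling fasta_to_phylip("proteins.fasta", "out") (both names are placeholders) would write one file per sequence into an existing directory "out".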


The other way too: pseudo-code phylip→fasta filter

1. Open phylip file

2. Find first non-empty line, ignore!

3. Parse next line and extract first word (sequence ID)

   1. Read rest of line and following lines to get the sequence, skipping blanks

   2. Read next sequences

4. Open fasta file, and for each sequence:

   1. Write “>” followed by sequence name

   2. Write sequence in lines of 80

5. Close files
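The reverse direction can be sketched the same way. One deliberate departure from the pseudo-code: instead of ignoring the first line, this sketch reads the sequence count and length from it, because without the length there is no simple way to tell where one sequence ends and the next name begins. It also assumes a sequential (non-interleaved) layout and names without blanks. All function names are hypothetical.

```python
def parse_phylip(text):
    """Return (sequence ID, sequence) pairs from sequential phylip text."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    # Unlike the pseudo-code, the first line is used rather than ignored
    count, length = (int(word) for word in lines[0].split()[:2])
    records, seq_id, chunks, have = [], None, [], 0
    for line in lines[1:]:
        words = line.split()
        if seq_id is None:
            # The first word of a new record is the sequence ID
            seq_id, words = words[0], words[1:]
        chunks.extend(words)  # joining the blocks later drops the blanks
        have += sum(len(word) for word in words)
        if have >= length:
            records.append((seq_id, "".join(chunks)))
            seq_id, chunks, have = None, [], 0
    if seq_id is not None or len(records) != count:
        raise ValueError("file does not match its first line")
    return records


def fasta_text(records, width=80):
    """Format records as fasta: a ">" header, then lines of `width`."""
    out = []
    for seq_id, sequence in records:
        out.append(f">{seq_id}")
        out.extend(sequence[i:i + width]
                   for i in range(0, len(sequence), width))
    return "".join(line + "\n" for line in out)


def phylip_to_fasta(phylip_path, fasta_path, width=80):
    """Convert one phylip file into one fasta file."""
    with open(phylip_path) as handle:
        records = parse_phylip(handle.read())
    with open(fasta_path, "w") as handle:
        handle.write(fasta_text(records, width))
```

The default width of 80 follows the pseudo-code; a smaller value such as 60, as in the fasta example earlier, keeps every line shorter than 80 characters, as the fasta definition asks.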


More formats?

• Boss: “Great! What about EMBL and GDE formats?”

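Coding, coding, ...: 12 filters!

[Figure: the formats drawn as nodes (fasta, phylip, ...), with one filter per direction between each pair, e.g. fasta-phylip and phylip-fasta]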


Still more formats?

• Boss: “Super. And GenBank and ClustalW..?”

Coding, coding, coding, ..: 30 filters

• Next new format = 12 new filters!

• This doesn’t scale.


Intermediate format

• Use an internal format as intermediate step:

• Two formats: four filters

[Figure: fasta and phylip each connected to the internal format by two filters: fasta-internal and internal-fasta, phylip-internal and internal-phylip]


Intermediate format

• Six formats: 12 filters (not 30)

• New format: always two new filters only
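The counts quoted on these slides follow from simple arithmetic: with direct filters every ordered pair of formats needs its own filter, so n formats need n(n-1) filters, while with an intermediate format each format needs one filter in and one filter out, so 2n in total. A small sketch (the function names are made up for illustration):

```python
def direct_filters(n_formats):
    """One filter for every ordered pair of distinct formats."""
    return n_formats * (n_formats - 1)


def hub_filters(n_formats):
    """One filter into and one out of the internal format, per format."""
    return 2 * n_formats


for n in (2, 4, 6, 7):
    print(n, direct_filters(n), hub_filters(n))
# 2 2 4
# 4 12 8
# 6 30 12
# 7 42 14
```

Going from six to seven formats costs 42 - 30 = 12 new direct filters, but only two new filters with the intermediate format. For two formats alone the direct approach is still cheaper (2 filters instead of 4); the intermediate format pays off from four formats onwards.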

[Figure: all formats connected through the central i-format]


Let’s build a structured set of filters!

• Each x2internal filter module: parse file in x format, extract information, return sequence(s) in internal format

• Each internal2y filter module: save i-format sequences in (separate) file(s) in y format

• Example: Overall phylip-fasta filter:

– import phylip2i and i2fasta modules

– obtain filenames to load from and save to from command line

– call parse_file method of the phylip2i module

– call the save_to_files method of the i2fasta module


Internal representation of a sequence

Isequence.py (part 1)

Attributes: type (DNA/protein), name, and a unique ID number
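The code of Isequence.py did not survive in this transcript, so the following is only a guess at what such a class could look like, built from the three attributes listed above. The field names, the extra sequence field and the use of a running counter for the unique ID number are all assumptions.

```python
from dataclasses import dataclass, field
from itertools import count

# Running counter that hands out unique ID numbers (an assumed scheme)
_next_id = count(1)


@dataclass
class Isequence:
    """Hypothetical internal representation of one sequence."""
    seq_type: str       # "DNA" or "protein"
    name: str
    sequence: str = ""  # not in the attribute list above; assumed
    seq_id: int = field(default_factory=lambda: next(_next_id))

    def __post_init__(self):
        if self.seq_type not in ("DNA", "protein"):
            raise ValueError("seq_type must be 'DNA' or 'protein'")

    def __len__(self):
        return len(self.sequence)


fosb = Isequence("protein", "fosB", "MFQAFPGDYD")
other = Isequence("DNA", "example", "ACGT")
print(fosb.name, len(fosb), fosb.seq_id != other.seq_id)
```

Every x2internal filter would then return a list of such objects, and every internal2y filter would accept one.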

Isequence.py (part 2)


Example: fasta/phylip filter

First fasta2internal. Each x2internal filter module: parse file in x format, extract information, return sequence(s) in internal format

fasta2i.py


Then internal2phylip. Each internal2y filter module: save each i-format sequence in separate file in y format

i2phylip.py


Putting the parts together: Fasta/phylip filter

fasta2phylip.py

1. Import parse_file method from fasta2i module

2. Import save_to_files method from i2phylip module

3. Obtain filenames to load from and save to from command line

4. Call parse_file method

5. Call the save_to_files method

NB: below the two import statements, nothing in the code refers to phylip or fasta.
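The five steps can be sketched as a small driver. The modules fasta2i and i2phylip are not reproduced in this transcript, so the import lines appear only as comments and the demonstration uses stand-in functions. The call signatures parse_file(filename) and save_to_files(sequences, filename), and the name run_filter, are assumptions.

```python
# In the real filter the two format-specific lines would be (assumed):
#   from fasta2i import parse_file
#   from i2phylip import save_to_files


def run_filter(parse_file, save_to_files, argv):
    """Format-agnostic driver: load with one module, save with another."""
    if len(argv) != 3:
        raise SystemExit(f"usage: {argv[0]} <load_filename> <save_filename>")
    load_filename, save_filename = argv[1], argv[2]
    sequences = parse_file(load_filename)    # step 4
    save_to_files(sequences, save_filename)  # step 5
    return sequences


# Stand-ins that only record how they were called, for demonstration
calls = []


def fake_parse_file(filename):
    calls.append(("parse", filename))
    return ["seq1", "seq2"]


def fake_save_to_files(sequences, filename):
    calls.append(("save", filename, len(sequences)))


run_filter(fake_parse_file, fake_save_to_files,
           ["fasta2phylip.py", "in.fasta", "out.phylip"])
print(calls)
```

In a real script the last call would pass the imported functions and sys.argv. Because the driver never mentions a format, swapping the two imports is all it takes to build a different filter.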


Sketch for i2embl filter module

Use the i2phylip filter as a template; much of the code can be reused.

Only these parts have to be rewritten

NB: Same method name save_to_files


Complete fasta/embl filter

fasta2embl.py

Almost the same code as the fasta2phylip filter: the only change is that the method save_to_files is imported from the new module (assuming we have the i2embl filter).


.. on to the exercises