protein functions prediction


Upload: susan

Post on 12-Jan-2016

Page 1: Protein functions prediction

Swiss Institute of Bioinformatics / Institut Suisse de Bioinformatique

LF-2004.08

Protein functions prediction

Page 2: Protein functions prediction

Introduction

- Signal peptides
- Transmembrane regions and topology
- PTM (post-translational modifications)
- Low complexity and biased regions
- Repeats
- Coils
- Secondary structure
- Antigenic peptides
- Domain/Motifs
- Tools
- The EMBOSS package

Page 3: Protein functions prediction

Different techniques

Algorithms
- Sliding window, Nearest Neighbor
- Patterns, regular expression
- Weight matrices
- HMM, profiles
- Neural Networks
- Rules

Page 4: Protein functions prediction

Sliding window

THISISATESTSEQVENCETHATDISPLAYSTHESLIDINGWINDQW

Score 1, Score 2, ..., Score n

Width or Size=11, Step=5

Results are usually displayed as a graph (the example plot from the slide is not reproduced here).
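The sliding-window idea above can be sketched in a few lines of Python. This is only an illustration: the slide does not say which per-residue scale is used, so the Kyte-Doolittle hydropathy scale is assumed here, and the function name `sliding_window` is hypothetical. The window width (11) and step (5) are the values given on the slide.

```python
# Kyte-Doolittle hydropathy values (an assumption: the slide names no scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sliding_window(seq, width=11, step=5, scale=KD):
    """Return a list of (start, mean score) pairs, one per full window.

    start is the 1-based position of the first residue in the window.
    Windows that would run past the end of the sequence are skipped.
    """
    results = []
    for i in range(0, len(seq) - width + 1, step):
        window = seq[i:i + width]
        score = sum(scale[aa] for aa in window) / width
        results.append((i + 1, score))
    return results


if __name__ == "__main__":
    # Test sequence taken from the slide.
    seq = "THISISATESTSEQVENCETHATDISPLAYSTHESLIDINGWINDQW"
    for start, score in sliding_window(seq):
        print(f"{start:3d}  {score:6.2f}")
```

Plotting the second column against the first gives the kind of profile graph the slide refers to. A smaller step gives a smoother curve at the cost of more windows.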

Page 5: Protein functions prediction

Patterns / regular expression

Pattern: <A-x-[ST](2)-x(0,1)-{V}

Regexp: ^A.[ST]{2}.?[^V]

Text: The sequence must start with an alanine, followed by any amino acid, followed by a serine or a threonine, two times, followed by any amino acid or nothing, followed by any amino acid except a valine.

Only the syntax differs.
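Because the pattern and the regular expression above differ only in syntax, the translation can be automated. The following is a minimal sketch, not a complete PROSITE parser: the function name `prosite_to_regex` is hypothetical, and it only handles the elements that appear in the slide's example (anchors, `x`, `[...]`, `{...}`, and repeat counts).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (0,1).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                  # any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any amino acid except these
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("<A-x-[ST](2)-x(0,1)-{V}")
    print(regex)
    print(bool(re.match(regex, "ATSSGL")))
    print(bool(re.match(regex, "ATSSV")))
```

The sketch emits `.{0,1}` where the slide writes `.?`; the two forms are equivalent.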

Page 6: Protein functions prediction

Weight matrices (PSSM)

Page 7: Protein functions prediction

HMM / profiles

Page 8: Protein functions prediction

Neural Networks

General principle: Example:

Page 9: Protein functions prediction

Signals found in proteins

N-ter
- exportation - secretion
- mitochondria
- chloroplast

internal
- NLS (nuclear localization signal)

C-ter
- GPI-anchor (Glycosyl Phosphatidyl Inositol)
- other membrane anchors (see PTM)

other
- unknown ?

Page 10: Protein functions prediction

Signals detection tools

- SignalP, MitoProt, ChloroP, Predotar, PSort, TargetP, Sigcleave (EMBOSS), Phobius
- Big-PI, DGPI

Page 11: Protein functions prediction

Transmembrane regions

- Detection (signal peptide, hydropathy, helices)
- Organisation (topology)

Page 12: Protein functions prediction

Transmembrane detection tools

- TMHMM, TMPred, TopPred2, DAS, HMMTop, Tmap (EMBOSS)
- Mixture of tools: Phobius, ConPred II

Page 13: Protein functions prediction

Post-translational modifications

Modification: target residues
- Phosphorylation: S, T, Y
- N-glycosylation: N
- O-glycosylation: S, T, (HO)K
- Acetylation, methylation: D, E, K
- Sulfation: Y
- Farnesylation, myristoylation, palmitoylation, geranylgeranylation, GPI-anchor: C, N-ter, C-ter
- Ubiquitination and family: K, N-ter

Inteins (protein splicing) Pre-translational

Selenoprotein C

Page 14: Protein functions prediction

PTM detection

Pattern prediction (PROSITE)
- Short or weak signal
- Frequent hit producer
- Best method is experimental MS/MS detection

Most methods use "rules" that combine pattern detection with prior knowledge to predict sites.

NetOGlyc - Prediction of mucin-type O-glycosylation sites in mammalian proteins

DictyOGlyc - Prediction of GlcNAc O-glycosylation sites in Dictyostelium

YinOYang - O-beta-GlcNAc attachment sites in eukaryotic protein sequences

NetPhos - Prediction of Ser, Thr and Tyr phosphorylation sites in eukaryotic proteins

NMT - Prediction of N-terminal N-myristoylation

Sulfinator - Prediction of tyrosine sulfation sites
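The "short or weak signal, frequent hit producer" point can be made concrete with a small scan. The sketch below assumes the PROSITE N-glycosylation pattern N-{P}-[ST]-{P} (entry PS00001), which is not shown on the slide; the function name `find_sequons` and the test peptide are made up for illustration. A four-residue pattern like this matches often by chance, which is why the dedicated tools listed above add rules on top of pattern detection.

```python
import re

# N-{P}-[ST]-{P}: assumed PROSITE N-glycosylation pattern (PS00001).
# The lookahead lets overlapping candidate sites be reported as well.
SEQUON = re.compile(r"(?=N[^P][ST][^P])")


def find_sequons(seq):
    """Return 1-based positions of the Asn in each candidate sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(seq.upper())]


if __name__ == "__main__":
    # Made-up peptide: N at 2 and 10 match, N at 6 is followed by P.
    print(find_sequons("MNGSANPTQNVTK"))
```

A hit only marks a candidate site; as the slide notes, experimental MS/MS detection remains the best method.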

Page 15: Protein functions prediction

Low complexity regions

- repeats
- compositional bias
- PEST

Page 16: Protein functions prediction

Low complexity / Repeats

Tool: approach
- DUST (DNA) / SEG: de novo detection
- RepeatMasker (DNA): search collection
- REP: search collection
- REPRO, Radar: de novo detection
- PEST, PESTFind: de novo detection
- EMBOSS (DNA): einverted, equicktandem, etandem, palindrome
- EMBOSS (protein): oddcomp

Page 17: Protein functions prediction

Coils

Helix of helix coiled-coil

Leu-zipper

Page 18: Protein functions prediction

Coils detection

Tool: method
- COILS: Weight matrices
- Paircoil, Multicoil: Pairwise correlation
- Marcoil: HMM
- Pepcoil (EMBOSS): Weight matrices

Page 19: Protein functions prediction

Secondary structure

Structure to predict
- Alpha-helices
- Beta-sheets
- Turns
- Random coil

Tools
- Garnier (EMBOSS), PHD, DSC, PREDATOR, NNSSP, Jpred, Jnet
- Many others

Page 20: Protein functions prediction

Antigenic peptide

Peptides binding to MHC
- class I: 8, 9, 10 mers
- class II: 15 mers (3+9+3)
- Depend highly on MHC type

Use of experimental knowledge
- Databases of known peptides

Tools
- SYFPEITHI
- HLA_Bind (BIMAS)
- MAPPP combined expert
- Antigenic (EMBOSS)
- Many more

Prediction of proteasome cleavage sites
- NetChop
- PaProc
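The peptide lengths quoted above translate directly into how candidate peptides are enumerated before any scoring. The sketch below only generates candidates and does no binding prediction; the function names and the 20-residue test string are hypothetical. The 3+9+3 split for class II follows the slide: a 9-residue core flanked by 3 residues on each side.

```python
def candidate_peptides(seq, lengths):
    """Return {length: [all overlapping peptides of that length]}."""
    return {
        k: [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for k in lengths
    }


def split_class2(peptide):
    """Split a 15-mer into (N-flank, 9-mer core, C-flank), i.e. 3+9+3."""
    if len(peptide) != 15:
        raise ValueError("expected a 15-mer")
    return peptide[:3], peptide[3:12], peptide[12:]


if __name__ == "__main__":
    seq = "ACDEFGHIKLMNPQRSTVWY"  # made-up sequence, one of each residue
    class1 = candidate_peptides(seq, (8, 9, 10))
    class2 = candidate_peptides(seq, (15,))
    print({k: len(v) for k, v in class1.items()})
    print(split_class2(class2[15][0]))
```

Each candidate would then be scored against a model specific to one MHC type, since binding depends highly on the MHC type.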

Page 21: Protein functions prediction

Domain / Motif

All the protein domain descriptors
- PROSITE, PFAM, SMART, PRODOM, BLOCKS, PRINTS, TIGRfam …
- Federation: InterPro

Many techniques
- Patterns, Regexp
- PSSM (PSI-BLAST)
- Profiles
- HMM

Page 22: Protein functions prediction

Other Tools

You can find some of them on our servers www.ch.embnet.org

Or on ExPASy server www.expasy.org/tools

Or ask Google!! www.google.com

Page 23: Protein functions prediction

European Molecular Biology Open Software Suite

Page 24: Protein functions prediction

How to use EMBOSS/Jemboss at SIB

Page 25: Protein functions prediction

- Free, Open Source (for most Unix platforms)
- GCG successor (compatible with GCG file format)
- More than 150 programs (ver. 2.9.0)
- Easy to install locally, but no interface, requires local databases
- Unix command-line only

Interfaces
- Jemboss, www2gcg, w2h, wemboss… (with account)
- Pise, EMBOSS-GUI, SRSWWW (no account)
- Staden, Kaptain, CoLiMate, Jemboss (local)

Access: www.emboss.org or emboss.sourceforge.net

Page 26: Protein functions prediction

Some details

Format: USA (Uniform Sequence Address)
- 'asis' :: Sequence [start : end : reverse]
- Format :: '@' ListFile [start : end : reverse]
- Format :: 'list' : ListFile [start : end : reverse]
- Format :: Database : Entry [start : end : reverse]
- Format :: Database - SearchField : Word [start : end : reverse]
- Format :: File : Entry [start : end : reverse]
- Format :: File : SearchField : Word [start : end : reverse]
- Format :: Program Program-parameters '|' [start : end : reverse]

Example: fasta::Swissprot:UBP5_HUMAN[200:300]

Databases: any can be added; use showdb to display the available databases.
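To make the structure of a USA concrete, the sketch below takes apart the slide's example string. It is not EMBOSS code: the function name `parse_usa` is hypothetical, only the "Format :: Database : Entry [start : end : reverse]" form is handled, and writing the reverse flag as a trailing `:r` is an assumption.

```python
import re

# Handles only: [format::]database:entry[[start:end[:r]]]
USA_RE = re.compile(
    r"^(?:(?P<format>\w+)::)?"
    r"(?P<database>[^:\[\]]+):(?P<entry>[^:\[\]]+)"
    r"(?:\[(?P<start>-?\d+):(?P<end>-?\d+)(?P<reverse>:r)?\])?$"
)


def parse_usa(usa):
    """Return the parts of a 'database:entry' style USA as a dict."""
    m = USA_RE.match(usa.replace(" ", ""))
    if m is None:
        raise ValueError(f"unsupported USA form: {usa!r}")
    parts = m.groupdict()
    parts["start"] = int(parts["start"]) if parts["start"] else None
    parts["end"] = int(parts["end"]) if parts["end"] else None
    parts["reverse"] = parts["reverse"] is not None
    return parts


if __name__ == "__main__":
    print(parse_usa("fasta::Swissprot:UBP5_HUMAN[200:300]"))
```

Read this way, the slide's example asks for residues 200 to 300 of the Swissprot entry UBP5_HUMAN in FASTA format.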

Page 27: Protein functions prediction

Databases

showdb
Displays information on the currently available databases

# Name          Type ID  Qry All Comment
# ====          ==== ==  === === =======
ipr_fetch       P    OK  OK  OK  InterPro current by fetch
ipi_fetch       P    OK  OK  OK  IPI current by fetch
refseq_fetch    P    OK  OK  OK  refseq current by fetch
repbase_fetch   P    OK  OK  OK  repbase current by fetch
swiss_fetch     P    OK  OK  OK  SwissProt current by fetch
swissprot       P    OK  OK  OK  SWISSPROT sequences
trembl          P    OK  OK  OK  TREMBL sequences
trembl_fetch    P    OK  OK  OK  trembl current by fetch
tremblnew       P    OK  OK  OK  TREMBL New sequences
ug_fetch        P    OK  OK  OK  Unigene by fetch
embl            N    OK  OK  OK  EMBL release
emhum           N    OK  OK  OK  EMBL release, Human section by emboss index
emrod           N    OK  OK  OK  EMBL release, Rodent section by emboss index
emvrt           N    OK  OK  OK  EMBL release, Vertebrate (nonhuman, nonrodent)

- seqret (seqretall, seqretset, seqretsplit)
- entret (for complete untouched entry, e.g., for unigene, interpro, swissprot…)
- Possible to define your own ".embossrc" file

Page 28: Protein functions prediction

Some tools for DNA

redata     Search REBASE for enzyme name, references, suppliers etc
remap      Display a sequence with restriction cut sites, translation etc
restover   Finds restriction enzymes that produce a specific overhang
restrict   Finds restriction enzyme cleavage sites
showseq    Display a sequence with features, translation etc
silent     Silent mutation restriction enzyme scan
cirdna     Draws circular maps of DNA constructs
lindna     Draws linear maps of DNA constructs
revseq     Reverse and complement a sequence
…

Page 29: Protein functions prediction

Example: remap

ECLAC E.coli lactose operon with lacI, lacZ, lacY and lacA genes.

(The enzyme labels drawn above and below the sequence, Hin6I, TaqI, HhaI, Bsc4I, Bsu6I, BssKI, AciI and BsiSI, lost their alignment in this transcript and are omitted.)

GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGT
        10        20        30        40        50        60
----:----|----:----|----:----|----:----|----:----|----:----|
CTGTGGTAGCTTACCGCGTTTTGGAAAGCGCCATACCGTACTATCGCGGGCCTTCTCTCA

# Enzymes that cut  Frequency  Isoschizomers
AciI                1
Bsc4I               1
BsiSI               1
BssKI               1
Bsu6I               1
HhaI                2
Hin6I               2          HinP1I,HspAI
TaqI                1

# Enzymes that do not cut
AclI BamHI BceAI Bse1I BshI ClaI EcoRI EcoRII Hin4I HindII HindIII HpyCH4IV KpnI NotI
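The frequency column in the remap output above can be checked by hand with a plain string search. This sketch is not how remap itself works (remap reads the REBASE data and reports cut positions); it only looks for recognition sites. The two recognition sequences, HhaI = GCGC and TaqI = TCGA, are supplied from general knowledge rather than from the slide, and the function name `find_sites` is hypothetical.

```python
# Recognition sequences (assumed, not taken from the slide). Both are
# palindromic, so searching one strand is sufficient.
SITES = {"HhaI": "GCGC", "TaqI": "TCGA"}


def find_sites(seq, site):
    """Return 1-based start positions of every occurrence of site in seq."""
    seq, site = seq.upper(), site.upper()
    hits = []
    pos = seq.find(site)
    while pos != -1:
        hits.append(pos + 1)
        pos = seq.find(site, pos + 1)  # step by 1 to allow overlaps
    return hits


if __name__ == "__main__":
    # First 60 bases of the ECLAC sequence shown in the remap example.
    eclac = ("GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCAT"
             "GATAGCGCCCGGAAGAGAGT")
    for enzyme, site in SITES.items():
        print(enzyme, find_sites(eclac, site))
```

The hit counts agree with the table: two HhaI sites and one TaqI site in the first 60 bases.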

Page 30: Protein functions prediction

Example: cirdna

File: ../../data/data.cirp

Start 1001
End 4270
group
label
Block 1011 1362 3
ex1
endlabel
label
Tick 1610 8
EcoR1
endlabel
label
Block 1647 1815 1
endlabel
label
Tick 2459 8
BamH1
endlabel
label
Block 4139 4258 3
ex2
endlabel
endgroup
group
label
Range 2541 2812 [ ] 5
Alu
endlabel
label
Range 3322 3497 > < 5
MER13
endlabel
endgroup

Page 31: Protein functions prediction

Example: plotorf

Page 32: Protein functions prediction

EMBOSS format input/output

UFO (Universal Feature Object)
- gff, swissprot, embl, pir, nbrf (with or without sequence)

Alignments
- Multiple and pairwise, many flavors (FASTA, MSF, SRS…)

Reports
- Feature (UFO), SRS, motif, seqtable, excel, diffseq, listfile (USA), etc…

Sequences (compatible with USA)
- Many!!! E.g., fasta, clustal, gcg, paup, gff, embl, swissprot, acedb, abi, etc…

Page 33: Protein functions prediction

Web interfaces

PISE (Pasteur Institute Software Environment) http://www-alt.pasteur.fr/~letondal/Pise/

wEMBOSS (Belgium&Argentina) (not yet at SIB) http://www.wemboss.org

Page 34: Protein functions prediction

Pise a tool to generate Web interfaces for Molecular Biology programs

http://emboss.ch.embnet.org/Pise

Page 35: Protein functions prediction

http://www.wemboss.org

Page 36: Protein functions prediction

Page 37: Protein functions prediction

Launch Jemboss

http://emboss.ch.embnet.org/Jemboss

Page 38: Protein functions prediction

Launch Jemboss

First time only…

Each time…

Page 39: Protein functions prediction

Jemboss windows

Page 40: Protein functions prediction

Jemboss windows other systems

Page 41: Protein functions prediction

Summary

- Anonymous web access through Pise
- Registered access through Jemboss
- Registered access through command-line (requires UNIX skills)

Please report problems!

Page 42: Protein functions prediction

Exercises

DEA Exercises: web-based sequence analysis

The goal of this exercise is to use web-based tools for protein sequence analysis.

a) Take this TrEMBL sequence (Q9X252) and try a BLAST against swissprot with the complete protein or with the first 70 residues. Explain the difference. Use TMPred, SignalP, and COILS to help you.

b) Pass this sequence through PFSCAN and search all databases. Compare with this command on ludwig-sun1/2: hits -b "prf pat pfam" tr:Q9X252

c) Use the different profile, motif, and pattern databases to get more information about the domain(s) you found.

d) How do you evaluate the PRINTS tropomyosin annotation in this TrEMBL entry (Q9WZH0)?

List of useful links:
- basic BLAST or advanced BLAST or PSI-BLAST
- TMPred: prediction tool for transmembrane regions (or TMHMM)
- COILS: prediction tool for coiled-coil regions
- SignalP: prediction tool for signal-peptide cleavage site

Profile, domain, motifs databases and search sites:
- PFSCAN
- InterPro (Pfam, PRINTS, PROSITE, SMART)
- HITS