Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)

Upload: roxanne-ford

Post on 11-Jan-2016


Page 1:

Wellcome Trust Workshop

Working with Pathogen Genomes

Module 3 Sequence and Protein Analysis (Using web-based tools)

Page 2:

PSU Projects

[Diagram labels: Organism; Annotated genome; Finished genome; Database entry; Artemis & ACT]

Page 3:

[Annotation workflow diagram]

Primary DNA sequence

Analyses: Dotter, BlastN, BlastX, gene finders, tRNA scan

Features: repeats, pseudogenes, rRNA, CDSs, tRNA

Stages: preannotation, manual curation

Page 4:

[Annotation workflow diagram, continued]

Primary DNA sequence

Analyses: Dotter, BlastN, BlastX, gene finders, tRNA scan

Features: repeats, pseudogenes, rRNA, CDSs, tRNA

Protein analyses: Fasta, BlastP, Pfam, Prosite, Psort, SignalP, TMHMM

Stages: preannotation, manual curation

Annotated sequence

Page 5:

Gene model annotation Protein function

Page 6:

Annotation of protein-coding genes (from gene model to protein function)

- Search programs: local (BLAST) and global (FASTA) alignments, EST hits

- Protein domains and motifs: InterPro (Pfam, Prosite, SMART etc.)

- Transmembrane / signal peptide prediction (TMHMM, SignalP)

- Base annotation on characterised proteins where possible (manually curated UniProt entry)

- Read the literature (PubMed)

Use several lines of evidence!

Page 7:

Annotation of non-protein-coding genes (tRNAs, rRNAs, snRNAs, other ncRNAs)

- Initial searches:
  - BlastN, GC-plots
  - tRNA scan
  - sno scan
  - Others

- Search in specialised databases:
  - Rfam scan
  - microRNAdb etc.

- Comparative ncRNA prediction tools:
  - RNAZ
  - Evofold
  - QRNA etc.

- Structure prediction of ncRNAs:
  - MFOLD
  - Others

Use several lines of evidence

Structural conservation of ncRNAs

Page 8:

FASTA (Global)

BLAST (Local)
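The difference between a global and a local alignment can be illustrated with a small dynamic-programming sketch. This is not how BLAST or FASTA are implemented (both are heuristic search tools); it only shows the two scoring schemes in their textbook form. The function name and the match, mismatch and gap values are assumptions chosen for illustration.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Score-only dynamic programming with a linear gap penalty.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of subsequences.
    The scoring values are illustrative assumptions, not tool defaults.
    """
    # First row: a global alignment pays for leading gaps, a local one does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never go below 0.
                score = max(score, 0)
                best = max(best, score)
            cur.append(score)
        prev = cur
    return best if local else prev[-1]


if __name__ == "__main__":
    query, subject = "LKH", "AAALKHAAA"
    print("local :", alignment_score(query, subject, local=True))
    print("global:", alignment_score(query, subject, local=False))
```

With a short motif embedded in a longer sequence, the local score rewards the shared region only, while the global score is dragged down by the gaps needed to span the full length.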

Page 9:

Statistical significance of database hits

• P-value: the estimated probability that the observed match could have occurred by chance

or

• E-value: the number of hits with this score or better expected by chance (assuming a specific distribution of residues).

An E-value of 5 means that you would expect 5 alignments with an equivalent or higher score to occur by random chance.

Both are more reliable than the percentage identity (% ID).

Statistical estimates like these are strongly influenced by the size and composition of both the search sequence and the database.

Caution:
• Repeat regions
• Transitive annotation of non-curated protein sequences
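As a rough illustration of how the E-value depends on query and database size, the sketch below uses the Karlin-Altschul form E = K·m·n·e^(−λS) that underlies BLAST statistics, together with P = 1 − e^(−E). The function names are hypothetical, and the λ and K values are illustrative assumptions; real values depend on the scoring matrix and gap penalties and are reported by the search program.

```python
import math


def e_value(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Expected number of chance hits scoring at least raw_score.

    lam and k are illustrative assumptions; in practice they are
    estimated for a given scoring matrix and gap penalty.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e):
    """Probability of at least one chance hit, given E expected hits."""
    return -math.expm1(-e)


if __name__ == "__main__":
    for db_len in (1_000_000, 100_000_000):
        e = e_value(raw_score=80, query_len=300, db_len=db_len)
        print(f"db residues={db_len:>11,}  E={e:.3g}  P={p_value(e):.3g}")
```

The same raw score becomes less significant as the database grows, which is one reason E-values from searches against different databases are not directly comparable.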

Page 10:

Sequence similarity searching: BLAST (Basic Local Alignment Search Tool) analysis

Nucleotide sequences:

blastn: nucleotide query vs nucleotide database

blastx: nucleotide query translated in all 6 frames and compared to a protein database

tblastx: translated nucleotide query vs translated nucleotide database (all 6 frames)

Protein sequences:

blastp: protein query vs protein database

tblastn: protein query vs nucleotide database translated in all 6 frames
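The "6 frame translations" used by the translated searches above can be sketched as follows: three reading frames on the forward strand and three on the reverse complement. This is a minimal illustration using the standard genetic code; the function names are hypothetical, and real BLAST implementations also handle ambiguity codes and alternative genetic codes.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCAAGTAA").items():
        print(frame, protein)
```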

FastA:

Provides sequence similarity and homology searching against nucleotide and protein databases using the Fasta programs. Fasta can be very specific when identifying long regions of low similarity, especially for highly diverged sequences.

Page 11:

Protein profiling

Aligned sequences:

..HMPLKHPLHP..
..RMLLKHRPHP..
..GMRLKHGHHP..
..PMGLKHAGHP..

Profile:

..-M-LKH--HP..
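The simple profile on this slide keeps a residue only where every aligned sequence agrees and writes "-" elsewhere. A minimal sketch of that rule, with a hypothetical function name:

```python
def strict_consensus(aligned, placeholder="-"):
    """Keep a residue only in columns where all sequences agree."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return "".join(
        column[0] if len(set(column)) == 1 else placeholder
        for column in zip(*aligned)
    )


if __name__ == "__main__":
    aligned = ["HMPLKHPLHP", "RMLLKHRPHP", "GMRLKHGHHP", "PMGLKHAGHP"]
    print(strict_consensus(aligned))
```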

Page 12:

Aligned sequences:

..HMPLKHRLHP..
..RMPLKHRPHP..
..GMRLKHRHHP..
..PMGLKHAGHP..

Profile:

..-MPLKHR-HP..

More sophisticated protein profiles score each amino acid in the motif.

Hidden Markov Models (HMMs): The HMM is a statistical model that considers all possible combinations of matches, mismatches and gaps to generate an alignment of a set of sequences.
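Scoring each amino acid in a motif can be sketched with per-column residue frequencies. This is a much simpler model than an HMM (it has no gap or transition states) and is offered only as an illustration. The function names and the 0.5 majority threshold are assumptions; the threshold was picked because it reproduces the profile shown on this slide.

```python
from collections import Counter


def column_frequencies(aligned):
    """Per-column residue frequencies for equal-length aligned sequences."""
    n = len(aligned)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*aligned)
    ]


def majority_profile(aligned, min_fraction=0.5, placeholder="-"):
    """Show the most common residue where it reaches min_fraction (assumed 0.5)."""
    profile = []
    for freqs in column_frequencies(aligned):
        residue, fraction = max(freqs.items(), key=lambda item: item[1])
        profile.append(residue if fraction >= min_fraction else placeholder)
    return "".join(profile)


def profile_score(freqs, candidate):
    """Sum the observed frequency of each candidate residue, column by column."""
    return sum(column.get(residue, 0.0) for column, residue in zip(freqs, candidate))


if __name__ == "__main__":
    aligned = ["HMPLKHRLHP", "RMPLKHRPHP", "GMRLKHRHHP", "PMGLKHAGHP"]
    freqs = column_frequencies(aligned)
    print(majority_profile(aligned))
    print(profile_score(freqs, "HMPLKHRLHP"), profile_score(freqs, "AAAAAAAAAA"))
```

Unlike a yes/no consensus, the score degrades gradually as a candidate sequence diverges from the alignment.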

Page 13:

Profile based predictors of protein domains / motifs

Motif database in the form of regular expressions. Not necessarily the whole domain.

K-x(12)-[DE] = lysine, any 12 residues, then aspartic acid or glutamic acid. A match returns 1 or 0 (hit or no hit), i.e. it is very rigid and can be very inaccurate for small simple motifs.

Motif search tools based on Prosite but with multiple alignment profiling

Collection of HMMs usually covering the whole domain
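The pattern K-x(12)-[DE] above can be turned into an ordinary regular expression to show why such a match is all-or-nothing. The converter below is a minimal sketch with a hypothetical function name; it covers only the common pattern elements (single residues, x, [..], {..}, repeat counts and the < > anchors) and is not the PROSITE scanning software.

```python
import re

ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern to a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("K-x(12)-[DE]")
    print(regex)
    print(bool(re.search(regex, "MK" + "A" * 12 + "E")))  # spacing of 12: hit
    print(bool(re.search(regex, "MK" + "A" * 11 + "E")))  # spacing of 11: no hit
```

A single residue of difference in spacing flips the result from hit to no hit, which is the rigidity the slide warns about.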

Page 14:

Functional assignment: domain architecture

[Figure: proteins drawn as ordered combinations of domains A, B and C]

Page 21:

InterPro Server:

• The 'one-stop shop' for accessing all major protein databases
• InterPro provides an integrated view of the commonly used signature databases, and has an interface for text- and sequence-based searches.

Page 22:

InterPro: member databases

Page 23:

Retrieving a sequence using SRS

Page 24:

Retrieving a sequence using SRS

Page 25:

The SignalP 3.0 Server:

Page 26:

The SignalP 3.0 output:

Page 27:

The TMHMM v2.0 Server:

Page 28:

The TMHMM v2.0 output:

Tabular part

Graphical part

Page 31:

Module 3 Exercises:

Section A:

• Sequence retrieval of a P. falciparum protein (cyclophilin) using SRS
• BLAST and Fasta searches by cutting & pasting the sequence

Section B:

Exercise 1 Part I (row 1):
• Search the PROSITE server by cutting & pasting the cyclophilin sequence

Exercise 1 Part II (row 2):
• Pfam server

Exercise 1 Part III (row 3):
• SMART server

Exercise 1 Part IV (row 4):
• InterPro server

Exercise 2:
• Sequence retrieval of P. falciparum PFC0125w protein using SRS
• TMHMM v2.0 server
• SignalP v3.0 server

Section C:

• Other web resources