bioinformatics – sequence analysis


Post on 12-Sep-2021

Bioinformatics – Sequence analysis

Magnus Alm Rosenblad Cell and Molecular Biology/GU

Free electronic book at UB WWW: ”Bioinformatics and Functional Genomics” Author: Jonathan Pevsner

Computational biology

Databases Sequence analysis

Structural bioinformatics

Microarray analysis

Systems biology

Bioinformatics

Systems biology tries to create mathematical models of biological systems and processes. It uses data from bioinformatics and functional genomics.


Databases?

•  A database is a structured collection of records (data) that is stored in a computer system. (Could be a simple text file.)

•  Records have one or more identifiers (“keys”) plus several columns (“fields”) with data

•  Each database has its own set of identifiers but may also contain identifiers used in other databases (to make links between databases)

•  Primary databases: original data Ex. Genbank/EMBL/DDBJ, SWISSPROT

•  Secondary databases: data from other databases + “analyses” Ex. Pfam, PROSITE

International Sequence Database Collaboration

[Figure: GenBank (NCBI, USA; searched with Entrez), EMBL (EBI, Europe; searched with SRS) and DDBJ (CIB, Japan; searched with getentry). Each database receives submissions and updates, and the three exchange their data.]


NCBI Databases: identifiers
•  Pubmed: scientific papers (identifier = “PMID”)
•  Taxonomy: all organisms and groups (TAXID)
•  Nucleotide: nucleotide sequences (GI + acc)
•  Genome: complete genome sequences (use nucleotide ID)
•  Protein: protein sequences (GI + acc)

Example: “Signal sequences get active.” PMID: 19219017
Example: gi|219681188|dbj|CJ999201.1 (from DDBJ/Japan)
Example: gi|66547259|ref|XP_396445.2 (Refseq, protein, 306 aa)
Example: gi|48119433|ref|XP_396445.1 (Refseq, old version, 413 aa)

“GI” = Genbank identifier
“acc” = accession number, an identifier type with versions: xxx.1, xxx.2 etc.
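The identifier layout above (GI, source database, accession with version) can be taken apart with a few lines of Python. This is only a sketch: the function name parse_ncbi_id and the returned field names are made up for illustration and are not an NCBI API.

```python
def parse_ncbi_id(defline):
    """Split 'gi|<GI>|<db>|<accession.version>' into its parts."""
    fields = defline.strip().split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"not a gi-style identifier: {defline!r}")
    # The accession carries the version after the dot: XP_396445.2 -> version 2
    accession, _, version = fields[3].partition(".")
    return {
        "gi": int(fields[1]),
        "db": fields[2],          # e.g. 'ref' (Refseq) or 'dbj' (DDBJ)
        "accession": accession,
        "version": int(version) if version else None,
    }


new = parse_ncbi_id("gi|66547259|ref|XP_396445.2")
old = parse_ncbi_id("gi|48119433|ref|XP_396445.1")
print(new)
# Same accession, different version and different GI number
print(new["accession"] == old["accession"], new["version"] > old["version"])
```

Note that the two Refseq examples share the accession but differ in both version and GI.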

SEQUENCE ANALYSIS

Where and why ?

Sequencing projects, assembly of sequence data
Gene prediction
Identification of functional elements in sequences
Sequence comparison
Classification of proteins
Comparative genomics
RNA structure prediction
Protein structure prediction
Evolutionary history


Sequence analysis

•  Biological sequences.* Central dogma.
•  Analysis of primary, secondary, not tertiary ... structures
•  Similarities (orthologs, paralogs)
•  Methods, algorithms (alignments, models)
•  Databases (primary, secondary)

* Because DNA, RNA and protein molecules are polymers, we can treat them as strings of characters and use methods from computer science!

Sequences: DNA, RNA, protein ...

Genome: DNA
   => transcription
Primary transcript: pre-mRNA, pre-ncRNA
   => processing (splicing*, cleavage)
Processed transcript: mRNA, ncRNA (tRNA, rRNA ...)
   => translation, modification
[a] Translated sequence: protein (amino acids). [b] Mature ncRNA
   => protein cleavage ...
Mature protein.

[ ESTs are nucleotide sequences, might be unspliced, spliced ...]

* Splicing only occurs in Eukaryotes (almost true).


Genome sizes, overview

Viruses          5 Kb    to  1.2 Mb    (HIV-1: 10 Kb)
Mitochondria*    6 Kb    to  2.5 Mb    (some have no genome)
Plastids*        35 Kb   to  230 Kb    (most are photosynthetic)
Bacteria**       160 Kb  to  10 Mb     (E. coli: 4.6-5.5 Mb)
Fungi            3 Mb    to  120 Mb    (S. cerevisiae: 12 Mb)
Plants           12 Mb   to  100 Gb !
Mammals***       ~3 Gb                 (~ 1000 times E. coli)

Protein size     20 aa   to  35,000 aa!

* Mitochondria and plastids are bacterial endosymbionts once engulfed by a eukaryote.
** Bacteria have 1-3 chromosomes. Bacteria also have plasmids. Carsonella ruddii = 160 Kb
*** Human non-coding DNA: ~98.5% of the human genome is non-protein-coding, as opposed to 11% of the genome of the bacterium E. coli.

Sequencing, assembly, dbs

Sequencing center such as JGI, Broad ...

Stage                                          Where to find the data
Sequencing: sequence reads                     NCBI TraceDB (not always)
Assembly: reads => contigs                     Sequencing center www (not always)
Assembly: contigs => scaffolds/supercontigs    Sequencing center www (not always)
+ preliminary gene/protein predictions         Sequencing center www (not always)
Draft assembly: chromosomes                    NCBI Nucleotide?, NCBI Protein?
Finished: chromosomes, proteins                NCBI Nucleotide, Genome?, Protein


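FlyBase, gene example (1/2)

[Screenshot: FlyBase gene report with these fields marked: gene symbol, gene full name, type, chromosome, map, species, IDs, sequence location.]

FlyBase, gene example (2/2)

[Screenshot: FlyBase gene report with the gene, mRNA and CDS tracks and the mRNA info marked.]

How many transcripts? How are they different? Exons? How many proteins? 5’? UTR? Introns?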


NCBI Genbank entry for contig (Oamb)

     gene            complement(165253..190776)
                     /gene="Oamb"
                     /locus_tag="Dmel_CG3856"
                     /db_xref="FLYBASE:FBgn0024944"
     mRNA            complement(join(165253..167342,174570..175092,
                     176330..176566,176829..177120,189554..189985,
                     190715..190776))
                     /product="CG3856-RC, transcript variant C"
     mRNA            complement(join(165253..167342,174570..175092,
                     176330..176566,176829..177120,182196..182621,
                     189554..189985,190715..190776))
                     /product="CG3856-RA, transcript variant A"
     CDS             complement(join(166417..167342,174570..175092,
                     176330..176566,176829..177080))
                     /note="CG3856 gene product from transcript CG3856-RA"
                     /product="CG3856-PA, isoform A"
                     /protein_id="AAF55798.2"
     CDS             complement(join(166417..167342,174570..175092,
                     176330..176566,176829..177080))
                     /note="CG3856 gene product from transcript CG3856-RC"
                     /product="CG3856-PC, isoform C"
                     /protein_id="AAF55796.1"
     mRNA            complement(join(168640..170083,174570..175092,
                     176330..176566,176829..177120,182196..182621,
                     189554..189985,190715..190776))
                     /product="CG3856-RB, transcript variant B"
     CDS             complement(join(169182..170083,174570..175092,
                     176330..176566,176829..177080))
                     /note="CG3856 gene product from transcript CG3856-RB"
                     /product="CG3856-PB, isoform B"
                     /protein_id="AAF55797.1"

Compare mRNA & CDS. Two or three 5’ UTR exons? All have 4 coding exons ...

transcript B, 5’ UTR: 177081-177120 + ... 3’ UTR: 168640-169181
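The UTRs follow from the coordinates alone: they are the parts of the mRNA exons that lie outside the CDS. Below is a small Python sketch that does this for transcript B. The function name utr_segments is an illustration only, and the exon lists are typed in by hand from the entry rather than parsed from a GenBank file.

```python
def utr_segments(mrna_exons, cds_exons, strand):
    """Return (five_prime_utr, three_prime_utr) as lists of (start, end).

    Coordinates are 1-based and inclusive, as in GenBank join(...) locations.
    """
    cds_start = min(start for start, _ in cds_exons)
    cds_end = max(end for _, end in cds_exons)
    below, above = [], []
    for start, end in sorted(mrna_exons):
        if start < cds_start:                 # exon part before the CDS
            below.append((start, min(end, cds_start - 1)))
        if end > cds_end:                     # exon part after the CDS
            above.append((max(start, cds_end + 1), end))
    # On the complement strand the 5' end has the highest coordinates.
    return (below, above) if strand == "+" else (above, below)


# Transcript variant B (CG3856-RB) and its CDS, copied from the entry.
mrna_b = [(168640, 170083), (174570, 175092), (176330, 176566),
          (176829, 177120), (182196, 182621), (189554, 189985),
          (190715, 190776)]
cds_b = [(169182, 170083), (174570, 175092), (176330, 176566),
         (176829, 177080)]

five_prime, three_prime = utr_segments(mrna_b, cds_b, strand="-")
print("5' UTR:", five_prime)
print("3' UTR:", three_prime)
```

Because the gene is on the complement strand, the segment with the lowest coordinates is the 3’ UTR.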

AE003731, 230001 bp DNA Drosophila melanogaster chromosome 3R, section 69 of 118

UCSC Genome Browser

[Screenshot: UCSC Genome Browser view with the location, gene, mRNA and EST tracks marked.]

How is the direction shown? Exons, introns? UTRs? CDS? Transcripts?


Similarity -- Function

•  Molecules that have the same function often have similar sequences

•  Molecules that have the same or similar sequence often have the same function

Sequence analysis can give a lot of information about the function

Biological problem, sequence analysis

Common biological problem:
We have a novel protein sequence. What can we infer from this sequence about the biological function of the protein?

* Sequence homology - BLAST, FASTA, SSEARCH
  Simple example: a human protein is highly similar to a protein with known function from another organism
  => The human protein has a related function (it’s a homolog: ortholog or paralog)

* Pattern/profile search - PROSITE, Pfam (known motifs?)
** Secondary structure prediction (proteins, ncRNA)
** Prediction of transmembrane domains (~ 25 % of all proteins are membrane bound!)


Example of similarity: 2 proteins
QUERY: Mus musculus Signal recognition particle receptor subunit beta (SR-beta)
SUBJECT: unknown human protein
Identities = 245/271 amino acids (90%), Gaps = 2/271

Query    1  MASANTRRVGDG--AGGAFQPYLDSLRQELQQRDPTLLSVAVALLAVLLTLVFWKFIWSR  58
            MASA++RRV DG  AGG FQPYLD+LRQELQQ DPTLLSV VA+LAVLLTLVFWK I SR
Sbjct    1  MASADSRRVADGGGAGGTFQPYLDTLRQELQQTDPTLLSVVVAVLAVLLTLVFWKLIRSR  60

Query   59  KSSQRAVLFVGLCDSGKTLLFVRLLTGQYRDTQTSITDSSAIYKVNNNRGNSLTLIDLPG  118
            +SSQRAVL VGLCDSGKTLLFVRLLTG YRDTQTSITDS A+Y+VNNNRGNSLTLIDLPG
Sbjct   61  RSSQRAVLLVGLCDSGKTLLFVRLLTGLYRDTQTSITDSCAVYRVNNNRGNSLTLIDLPG  120

Query  119  HESLRFQLLDRFKSSARAVVFVVDSAAFQREVKDVAEFLYQVLIDSMALKNSPSLLIACN  178
            HESLR Q L+RFKSSARA+VFVVDSAAFQREVKDVAEFLYQVLIDSM LKN+PS LIACN
Sbjct  121  HESLRLQFLERFKSSARAIVFVVDSAAFQREVKDVAEFLYQVLIDSMGLKNTPSFLIACN  180

Query  179  KQDIAMAKSAKLIQQQLEKELNTLRVTRSAAPSTLDSSSTAPAQLGKKGKEFEFSQLPLK  238
            KQDIAMAKSAKLIQQQLEKELNTLRVTRSAAPSTLDSSSTAPAQLGKKGKEFEFSQLPLK
Sbjct  181  KQDIAMAKSAKLIQQQLEKELNTLRVTRSAAPSTLDSSSTAPAQLGKKGKEFEFSQLPLK  240

Query  239  VEFLECSAKGGRGDTGSADIQDLEKWLAKIA  269
            VEFLECSAKGGRGD GSADIQDLEKWLAKIA
Sbjct  241  VEFLECSAKGGRGDVGSADIQDLEKWLAKIA  271

Unknown = SR-beta?

Similarity, homology? Comparing non-identical sequences
Protein sequence comparison - basic concepts

When two protein sequences are compared and the similarity is considered statistically significant, it is highly likely that the two proteins are evolutionarily related. There are two kinds of biological relationships:

Orthologs: proteins that carry out the same function in different species

Paralogs: proteins that perform different but related functions within one organism

Proteins are homologous if they are related by divergence from a common ancestor.


Homology: orthologs & paralogs

Orthology describes genes in different species that derive from a common ancestor (= MouseA, ChickA, FrogA that come from the alpha-chain gene in the common ancestor). Speciation!

Paralogy describes homologous genes within a single species that diverged by gene duplication (= MouseA and MouseB). (Example globin duplication: basal vertebrate)

Example: Homology, domain architecture

[Figure: domain architectures of Sr54_arcfu and Ftsy_aquae. Labels: shared domain(s) with same function; RNA binding domain; common ancestry, different function.]

Orthologs/paralogs?


Protein similarity: yeasts, chordates

Mouse and human proteins are very similar.

Candida glabrata is “closest” to S.cerevisiae but not as similar as we may think.

Even in closely related organisms some orthologs are quite different. For more distant species these may be hard to identify.

Proteins evolve at different rates!

Dujon B. Trends Genet. 2006 Jul;22(7):375-87.

Orthologous sequences compared.

“Evolution”: yeasts, chordates

Yeasts illustrate the molecular mechanisms of eukaryotic genome evolution. Dujon B. Trends Genet. 2006 Jul;22(7):375-87.

Average protein identity between orthologs


Methods in sequence analysis

•  Simple transformation/extraction
   a) Translation: RNA > protein
   b) Reverse translation: protein > RNA
   c) Splicing (removing introns in pre-mRNA, pre-rRNA ...)

•  Comparison of primary sequences
   a) Identity: finding sites, pattern matches
   b) Alignments: non-identical seqs (pair/multiple/phylogeny)

•  Analyzing for other properties
   a) statistical composition (GC%, CpG islands, ...)
   b) profile analysis (PSI-Blast, Pfam HMMs)
   c) predicting transmembrane domains (TMHMM)
   d) higher order structure (secondary structure in RNA/prot)

Translation of sequences: mRNA => protein

•  Different nucleotide sequences may translate into identical amino acid sequences.

•  A nucleotide sequence may yield different amino acid seqs. (6 reading frames)

•  Reverse translation does not give a unique nucleotide sequence.

•  Different splicing of pre-mRNA => 1 gene, several proteins! (Eukaryotes only)


The (degenerate) Genetic code

UUU Phe F    UCU Ser S    UAU Tyr Y    UGU Cys C
UUC Phe F    UCC Ser S    UAC Tyr Y    UGC Cys C
UUA Leu L    UCA Ser S    UAA Stop*    UGA Stop*
UUG Leu L    UCG Ser S    UAG Stop*    UGG Trp W

CUU Leu L    CCU Pro P    CAU His H    CGU Arg R
CUC Leu L    CCC Pro P    CAC His H    CGC Arg R
CUA Leu L    CCA Pro P    CAA Gln Q    CGA Arg R
CUG Leu L    CCG Pro P    CAG Gln Q    CGG Arg R

AUU Ile I    ACU Thr T    AAU Asn N    AGU Ser S
AUC Ile I    ACC Thr T    AAC Asn N    AGC Ser S
AUA Ile I    ACA Thr T    AAA Lys K    AGA Arg R
AUG Met M    ACG Thr T    AAG Lys K    AGG Arg R

GUU Val V    GCU Ala A    GAU Asp D    GGU Gly G
GUC Val V    GCC Ala A    GAC Asp D    GGC Gly G
GUA Val V    GCA Ala A    GAA Glu E    GGA Gly G
GUG Val V    GCG Ala A    GAG Glu E    GGG Gly G

Translation:

AUGUUGGGUUGA = MLG*
||| | || | |
AUGCUAGGAUAA = MLG*

Reverse translation:

MLG* = AUG UUA GGU UAA    1
       AUG UUA GGU UAG    2
       AUG UUA GGU UGA    3
       ...
       AUG CUG GGG UGA   72   (1 x 6 x 4 x 3 possible seqs)

3rd position is less important!
71 other nucleotide sequences also give the MLG* sequence.
The nucleotide sequences for a protein in different organisms may be very different!
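Both directions can be checked with a short Python sketch. The codon table is the standard one shown above, packed into a string; the names translate and count_reverse_translations are chosen here for illustration only.

```python
from collections import Counter
from math import prod

BASES = "UCAG"
# Standard genetic code, ordered UUU, UUC, UUA, UUG, UCU, ... GGG
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# How many codons encode each amino acid (or Stop, written '*')
DEGENERACY = Counter(AMINO)


def translate(rna):
    """Translate frame 1; an incomplete trailing codon is ignored."""
    return "".join(CODON_TABLE[rna[i:i + 3]]
                   for i in range(0, len(rna) - 2, 3))


def count_reverse_translations(protein):
    """Number of RNA sequences that encode the given protein string."""
    return prod(DEGENERACY[aa] for aa in protein)


print(translate("AUGUUGGGUUGA"), translate("AUGCUAGGAUAA"))
print(count_reverse_translations("MLG*"))   # 1 x 6 x 4 x 3
```

The same translate function also shows the effect of a point mutation or a frameshift, for example translate("AUGUGGGUUGA") after deleting one U.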


Changes that affect translation

AUGUUGGGUUGA = MLG*
AUGUUAGGUUGA = MLG*
AUGUUCGGUUGA = MFG*
AUGUGAGGUUGA = M*G* (= M*!)
AUG-UGGGUUGA = MWV (+GA.)    Frameshift => new AA seq. Last example: no Stop!

Substitution: a single aa is changed (maybe), unless STOP
Insertion/deletion (“indel”): the rest of the aa-sequence is changed


Open Reading Frame (ORF)

Forward reading frames (frames 1-3):
AUGUUGGGUUGA = MLG*
.UGUUGGGUUGA = CWV
..GUUGGGUUGA = VGL
...UUGGGUUGA =  LG*

Backward reading frames (frames 4-6, on the reverse (minus) strand):
AUGUUGGGUUGA    original
AGUUGGGUUGUA    reversed
UCAACCCAACAU    reversed + complemented = STQH, QPN, ...

Example unknown RNA:

 1 AUGUUCCGUCUCACGCUCACCAAACGGCUAGCCCGCGCUUCUGCACACGUCACUCCGUCG 60
   ------------------------------------------------------------
   UACAAGGCAGAGUGCGAGUGGUUUGCCGAUCGGGCGCGAAGACGUGUGCAGUGAGGCAGC

Frames 1-3:
M F R L T L T K R L A R A S A H V T P S
C S V S R S P N G * P A L L H T S L R R
V P S H A H Q T A S P R F C T R H S V A
------------------------------------------------------------
Frames 4-6 (written under the complementary strand, so they read from right to left):
H E T E R E G F P * G A S R C V D S R R
T G D * A * W V A L G R K Q V R * E T A
N R R V S V L R S A R A E A C T V G D G
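A six-frame translation like the one in this example can be sketched in Python as follows. The helper names (translate, reverse_complement, six_frames) are illustrative; the codon table is the standard code packed into a string.

```python
BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def translate(rna, offset=0):
    """Translate from the given offset; a partial last codon is dropped."""
    return "".join(CODON_TABLE[rna[i:i + 3]]
                   for i in range(offset, len(rna) - 2, 3))


def reverse_complement(rna):
    return rna.translate(COMPLEMENT)[::-1]


def six_frames(rna):
    """Frames 1-3 on the given strand, frames 4-6 on the reverse strand."""
    minus = reverse_complement(rna)
    return ([translate(rna, k) for k in range(3)]
            + [translate(minus, k) for k in range(3)])


for number, protein in enumerate(six_frames("AUGUUGGGUUGA"), start=1):
    print(number, protein)

unknown = "AUGUUCCGUCUCACGCUCACCAAACGGCUAGCCCGCGCUUCUGCACACGUCACUCCGUCG"
print(six_frames(unknown)[0])   # frame 1 of the example RNA
```

Frames 4-6 are returned in the 5’ to 3’ direction of the reverse strand, so they appear reversed compared with a figure that writes them under the complementary strand.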

Pairwise alignments:

Global alignment (Needleman-Wunsch, ClustalW)
Considers similarity across the full extent of the sequences
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
|| | ||||||| | | |
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Local alignment (most common, Smith-Waterman, BLAST)
Considers regions of similarity in parts of the sequences only.
xxxxxxx
|||||||
xxxxxxx    region of similarity


Problem with finding best alignment

•  Idea: (a) construct a system to score similarity*, and then (b) calculate the score for every possible alignment. (c) highest score = best alignment

•  Number of alignments for two sequences of length N: ~ 2^(2N) / sqrt(πN).
   Example: N = 300 gives about 10^179 alignments!

•  (a) and (c) are good parts, but we have to find another way of finding the highest scoring alignment!

* Match is good, mismatch not so good, gap is bad!
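The size of this number is easy to check. The sketch below assumes the usual textbook count of global alignments of two length-N sequences, the binomial coefficient C(2N, N), whose approximation is 2^(2N) / sqrt(πN); the function names are illustrative.

```python
from math import comb, log10, pi


def log10_alignments_exact(n):
    """log10 of C(2n, n), the assumed number of global alignments."""
    return log10(comb(2 * n, n))


def log10_alignments_approx(n):
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * log10(2) - 0.5 * log10(pi * n)


for n in (10, 100, 300):
    print(n, round(log10_alignments_exact(n), 2),
          round(log10_alignments_approx(n), 2))
```

For N = 300 both values are a little above 179, so scoring every alignment one by one is out of the question.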

New idea

•  Strategy: find a way to solve the smallest problems and then use these to get the rest

•  Recursive strategy OK, but too slow
•  Idea for algorithm: Save the results for the previous problems so that we don’t have to calculate them again.

•  Strategy = “Dynamic programming”. Biological algorithm = “Needleman-Wunsch”


Needleman-Wunsch algorithm (1)
Used scores: match = +2; mismatch = -1; gap = -2

1. The sequences are written outside a matrix of size (m+1) x (n+1), one residue per column/row. Fill column/row 0.

2. Start at (1,1). For each cell, 3 scores for alignment of two residues/gap are calculated (white squares). The best score in each cell is saved (grey squares)! Which move gives the best score?

3. Save direction to best score in a “trace matrix”.

Note: The last residues may not align.

Sequences to be aligned (m,n).

Needleman-Wunsch algorithm (2)

Here we show best scores and trace direction for all cells. Traceback may be calculated from scoring matrix if no trace matrix. Start at (m,n), how did we get that score?

This is a “dynamic programming” algorithm. Easily modified to get local alignment.

Running time is proportional to the lengths multiplied (m x n). Very slow algorithm, cannot be used in big database searches!

Used scores: match +5, mismatch -2, gap -6
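The steps on these two slides can be written down compactly. The following Python sketch is one straightforward way to do it, not the code behind the slides: it uses the scores match = +2, mismatch = -1, gap = -2 as defaults, a linear gap cost, and recomputes the trace from the score matrix instead of storing a separate trace matrix.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # column 0: a aligned to gaps only
        score[i][0] = i * gap
    for j in range(1, n + 1):          # row 0: b aligned to gaps only
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,   # align residues
                              score[i - 1][j] + gap,        # gap in b
                              score[i][j - 1] + gap)        # gap in a
    # Traceback from (m, n): ask which move produced each score.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

The two nested loops fill (m x n) cells, which is where the running time proportional to m x n comes from.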


Smith-Waterman finds better local alignment

In the example the global alignment misses the exact match.

Needleman-Wunsch may be easily modified to get the best local alignment: Smith-Waterman algorithm.
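The modification is small: scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at a zero. A Python sketch under the same assumptions as before (linear gap cost; default scores +2, -1, -2; function name chosen for illustration):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment; returns (score, aligned_a, aligned_b)."""
    m, n = len(a), len(b)
    h = [[0] * (n + 1) for _ in range(m + 1)]   # row/column 0 stay at 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,                        # start a new alignment
                          h[i - 1][j - 1] + pair,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Traceback from the best cell until a cell with score 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGGACGTGGG"))
```

Only the shared ACGT region is reported; the unrelated flanks are left out of the alignment.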

Problems with Needleman-Wunsch and Smith-Waterman algorithms

•  The running time is proportional to m x n: for sequences of length 10m and 10n, time = 100 x mn. What if we want to compare 1,000,000 sequences ...

•  We always calculate the whole matrix and do the traceback, even when the sequences have no sequence similarity whatsoever.

•  Homologous sequences often share regions with high similarity ... can we use this?


BLAST lists all matching “words”*

For each short match, the program tries to extend in both directions. This way, we only align regions that have some sequence identity!

* A word is 7-11 nucleotides or 2-3 amino acids

[Figure: Query and Subject sequences with the words ATCGGAT, CTCAGAG and CCCGGCC marked.]

BLAST and FastA and BLAT

Searching databases with BLAST:
Initial search is for short words. Word hits are then extended in either direction.
=> we only extend words that are in both sequences
=> fast, but the gap between two close words can’t be long

Searching databases with FastA (FastP for proteins):
Initial search for short words. Words are extended, but also linked if they are close!
=> slower, but longer alignments (good for nucleotide seqs)

Using BLAT to search genomes (UCSC Genome browser):
Tables with all “words” are already calculated for genomes!
=> very fast, but precalculation must have been done


BLAST “word filtering” example

1. ATGGCGATGT    words: ATG: pos 1,7   TGG: pos 2   GGC: pos 3   GCG: pos 4   CGA: pos 5   GAT: pos 6   TGT: pos 8

2. ATCGCAATAC

3. ATCGCGCAT

“Nucleotide BLAST”. If word size = 3, will BLAST align seq 1 with seq 2 or 3?

1. ATGGCGATGT
   || || ||
2. ATCGCAATAC
Pretty similar. 6/10 matches, but word hit??

1. ATGGCGATGT
   || |||
3. ATCGCGCAT
Less similar but ...

BLAST creates a lookup table for “words” of size 3. Is any word found in seq 2 and 3?

Note: BLAST only reports alignments > 18 nts
Note: NCBI BLAST word size is min 7.

1. ATGGCG-ATgt
   || ||| ||
3. ATCGCGCAT
Word hit is extended: 7 matches

Which will BLAST try to align ... ?
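The word lookup for this example can be reproduced with a few lines of Python. This sketch covers only the seeding step (exact shared words); the extension of hits and the scoring used by real BLAST are left out, and the function names are illustrative.

```python
from collections import defaultdict


def word_index(seq, k):
    """Map every word of length k to its 1-based start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos + 1)
    return dict(index)


def word_hits(query, subject, k):
    """All (word, query_pos, subject_pos) where the same word occurs."""
    query_index = word_index(query, k)
    hits = []
    for word, subject_positions in word_index(subject, k).items():
        for q_pos in query_index.get(word, []):
            for s_pos in subject_positions:
                hits.append((word, q_pos, s_pos))
    return sorted(hits)


seq1, seq2, seq3 = "ATGGCGATGT", "ATCGCAATAC", "ATCGCGCAT"
print(word_index(seq1, 3))
print("seq1 vs seq2:", word_hits(seq1, seq2, 3))
print("seq1 vs seq3:", word_hits(seq1, seq3, 3))
```

With word size 3 only seq 3 shares a word (GCG) with seq 1, so only that pair would be passed on to the extension step.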

A nucleotide alignment that NCBI BLAST can’t find!

  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

There are no matching regions > 6 nts. Smaller word size must be used.


RNA: Conserved secondary structure
AU, GC base pairing creates “hairpins”

CAGGAAACUG seq1
...|.||...
GCUGCAAAGC seq2
|||||||...
GCUGCAACUG seq3

[Figure: secondary structure drawings of seq1, seq2 and seq3.]

Seq1 and seq2 are not similar, but they both have a hairpin structure, which is not shared by seq3!

The alignment of the primary sequences (structure) doesn’t give us any information.
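A very simple structural check makes the same point. The Python sketch below counts how many A-U / G-C pairs can be stacked from the two ends of a sequence inwards; it is a toy test for a single hairpin, not a folding program, and the minimum loop length of 3 is an assumption made for this example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def stem_length(rna, min_loop=3):
    """Number of consecutive base pairs formed from the ends inwards."""
    i, j, stem = 0, len(rna) - 1, 0
    # Stop when the bases no longer pair or the loop would get too short.
    while j - i - 1 >= min_loop and (rna[i], rna[j]) in PAIRS:
        stem += 1
        i += 1
        j -= 1
    return stem


for name, seq in [("seq1", "CAGGAAACUG"),
                  ("seq2", "GCUGCAAAGC"),
                  ("seq3", "GCUGCAACUG")]:
    print(name, seq, "stem of", stem_length(seq), "base pairs")
```

Seq1 and seq2 each give a stem of three base pairs, while seq3, which is the most similar to seq2 at the sequence level, gives none.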

ncRNAs are often not annotated

NC_006270.1      -TTGCCGTGCTAAGCGGGGAGGTAGCGGTGCCCTGTACTCGCAATCCGCTCGAGCGAGGC
X06802|BAC.SUB.  NTTGCCGTGCTAAGCGGGGAGGTAGCGGTGCCCTGTACCTGCAATCCGCTCTAGCAGGGC
                  *************************************  *********** ***  ***

NC_006270.1      CGAATCCCTTTCTCGAGGTTCGTTTACTTTAAGGTCTGCCTTAAGCAAGTGGTGTTGACG
X06802|BAC.SUB.  CGAATCCCTT-CTCGAGGTTCGTTTACTTTAAGGCCTGCCTTAAGTAAGTGGTGTTGACG
                 ********** *********************** ********** **************

NC_006270.1      CTTGGGTCCTGCGCAATGGGAATCCATGAACCATGTCAGGTCCGGAAGGAAGCAGCATTA
X06802|BAC.SUB.  TTTGGGTCCTGCGCAATGGGAATTCATGAACCATGTCAGGTCCGGAAGGAAGCAGCATTA
                  ********************** ************************************

NC_006270.1      AGTGGAACCTTCCATGTGCCGCAGGGTTGCCTGGGCTGAGCTAACTGCTTAAGTAACGCT
X06802|BAC.SUB.  AGTGAAACCTCTCATGTGCCGCAGGGTTGCCTGGGCCGAGCTAACTGCTTAAGTAACGCT
                 **** *****  ************************ ***********************

NC_006270.1      TAGGGTAGCGAATCGACAGAAGGTGCACGGTA
X06802|BAC.SUB.  TAGGGTAGCGAATCGACAGAAGGTGCACGGTA
                 ********************************

Sequence alignment of annotated SRP RNA from Bacillus subtilis and identified SRP RNA from the newly sequenced and “fully” annotated Bacillus licheniformis. Sequence identity = 94%! Still no SRP RNA is annotated. SRPDB is needed.


ncRNAs in the 3 Kingdoms of Life

Rfam: annotating non-coding RNAs in complete genomes. Sam Griffiths-Jones, Simon Moxon, Mhairi Marshall, Ajay Khanna, Sean R. Eddy and Alex Bateman. Nucleic Acids Res. 2005 33:D121-D124.

Examples BLAST can’t find

•  Proteins: frequent substitutions (1 aa / 3 aa)
   => use word size 2 (BLAST default is 3)
   => make sure DB has close relatives (or use profile methods)

•  mRNA: many synonymous substitutions (1/7)
   => search on protein level

•  ncRNA genes with “compensatory base changes” that preserve secondary structure
   => use methods that allow small “word size”, close relatives
   (OR search with secondary structure motif)


How to score aligned nucleotides
Substitution matrices

Unitary matrix          BLAST matrix
   A  T  C  G              A  T  C  G
A  1  0  0  0           A  1 -3 -3 -3
T  0  1  0  0           T -3  1 -3 -3
C  0  0  1  0           C -3 -3  1 -3
G  0  0  0  1           G -3 -3 -3  1

All nucleotides considered equally probable in both.
BLAST matrix: more matches needed to allow mismatches.

(Gaps should score worse than mismatch score.)
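The difference between the two matrices shows up as soon as an ungapped alignment is scored. A minimal Python sketch, with the two sequences borrowed from the BLAST word example and a function name chosen for illustration:

```python
def ungapped_score(a, b, match, mismatch):
    """Sum of match/mismatch scores for two equally long sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


seq_a = "ATGGCGATG"
seq_b = "ATCGCGCAT"   # 5 matches and 4 mismatches against seq_a

print("unitary matrix (1/0): ", ungapped_score(seq_a, seq_b, 1, 0))
print("BLAST matrix (+1/-3): ", ungapped_score(seq_a, seq_b, 1, -3))
```

With +1/-3 every mismatch cancels three matches, so an alignment needs more than 75% identity to reach a positive score.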

Properties of Amino acids
While nucleotides are quite similar to each other, amino acids have different chemical properties. They may be hydrophilic (polar, or even charged), hydrophobic or neutral etc.

These properties are important for the structure and function of the protein.

If an amino acid is changed, is it probable that the “new” aa has a different type of property?

Should a D/E or S/T change get zero score?


Are there better/worse substitutions?

•  From comparisons of known proteins, it is known that some changes/mutations are more frequent than others.

•  Also, not all amino acids* are common ... If a rare amino acid is matched, it is more significant than if a common amino acid is matched

•  How can we give a score to a mismatch/match that is biologically significant?
   => substitution matrices for proteins

* There are 20 amino acids, but only 4 nucleotides!

Scores for aligned amino acids (1/3)

•  We want to have a score for a match/mismatch of two aligned amino acids (a and b) that is based on the probability of finding these aligned, as different pairs are more or less likely. Prob(a,b) (We assume for the moment we can get that value somehow.)

•  But that is not enough, if we want to know if it is biologically significant. We need to compare Prob(a,b) to the probability that a and b are uncorrelated (occurring independently), so we want the probabilities of observing a and b on average in a protein (frequencies), and multiply them to get the probability for an alignment by chance f(a) * f(b).

•  We now have a formula for comparing the probabilities (“odds-ratio”): Prob(a,b) / ( f(a) * f(b) )

   If we expect that the amino acids are aligned more often in homologous sequences, then Prob(a,b) > f(a) * f(b), and the odds-ratio is > 1

•  In statistics usually the logarithm of this is used, to get a “log-odds” score**: log ( Prob(a,b) / ( f(a) * f(b) ) )

   If Prob(a,b) > f(a) * f(b), then this expression is positive (since log 1 = 0), otherwise negative.
•  To get a “nice” score we multiply this log-odds value by a constant (C) and round it off.

   Ex. If C is 3 and the log-odds is 1.371, we get: 3 * 1.371 = 4.113, rounded to 4; the score is thus +4.
•  Finished formula for calculating the score for all amino acid pairs:

   score (a,b) = C * log ( Prob(a,b) / ( f(a) * f(b) ) )

** By using log we may also add all scores instead of multiplying the odds ratios.


Scores for aligned amino acids (2/3)

•  Wonderful, we now have a score formula: score (a,b) = C * log ( Prob(a,b) / ( f(a) * f(b) ) )

•  But what exactly are Prob(a,b) and f(a), f(b) ... ?
•  Frequencies are easy to calculate for each aa: f(X) = occurrences of X / #all amino acids.

   Ex. Leucine is common: f(L) = 0.099; Tryptophan is rare: f(W) = 0.013, etc.
   Aligning L and L by chance: 0.099 * 0.099 = 0.01 (1%); L and W: 0.099 * 0.013 = 0.0013, etc.

•  Now we only need Prob(a,b) for all amino acid pairs!
•  We want the Prob(a,b) values to be as “biologically significant” as possible.

   Idea:
   –  Get a lot of sequences that we know are homologous (same “ancestor”).
   –  Make alignments and then count how often all the different amino acid pairs occur!
   –  For “BLOSUM” matrices, the BLOCKS database was used to get the alignments.
   –  To get different sets of scores depending on whether we are comparing very similar sequences or less similar sequences, only use alignments with a minimum identity.
   –  BLOSUM matrices have variants: BLOSUM50, BLOSUM62 ... where the number is the min id.

•  Now we can calculate scores for all possible aligned amino acids, and put the values in a 20x20 substitution matrix to be used in alignment programs.

* The Pevsner book has older values.
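The counting idea can be tried on a toy example. The Python sketch below counts amino acid pairs column by column in a small made-up block of aligned sequences and turns the counts into log-odds values. It is a simplification: the block is invented, and the clustering of sequences by identity that the real BLOSUM procedure applies is left out.

```python
from collections import Counter
from itertools import combinations
from math import log


def pair_frequencies(block):
    """Observed frequency of each unordered residue pair in the columns."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def log_odds(block):
    """ln( observed pair frequency / frequency expected by chance )."""
    observed = pair_frequencies(block)
    freq = Counter()                     # residue frequencies f(a)
    for (a, b), q in observed.items():
        freq[a] += q / 2
        freq[b] += q / 2
    scores = {}
    for (a, b), q in observed.items():
        expected = freq[a] * freq[b] if a == b else 2 * freq[a] * freq[b]
        scores[(a, b)] = log(q / expected)
    return scores


toy_block = ["AAL", "AAL", "ALL"]        # three aligned toy sequences
for pair, value in sorted(log_odds(toy_block).items()):
    print(pair, round(value, 2))
```

In this toy block the identical pairs come out positive and the A/L pair negative, which is the pattern expected of a substitution matrix.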

Scores for aligned amino acids (3/3)

Ex 1. What is the score for aligning the rare W with W (Tryptophan)?
•  Calculate score with values: Prob(W,W) = 0.0065; f(W) = 0.013; C = 1 / 0.347 (0.347 is the “λ value”)
•  log ( Prob(W,W) / ( f(W) * f(W) ) ) = log ( 0.0065 / (0.013 * 0.013) ) = 3.65
   score (W,W) = (1 / 0.347) * 3.65 = 10.52, rounded to +11

Ex 2. Score for aligning the common L with L (Leucine)?
•  Calculate score with values: Prob(L,L) = 0.0371; f(L) = 0.099; C = 1 / 0.347
•  log ( Prob(L,L) / ( f(L) * f(L) ) ) = log ( 0.0371 / 0.099^2 ) = 1.33
   score (L,L) = (1 / 0.347) * 1.33 = 3.84, rounded to +4

Ex 3. Score for aligning A with L? Prob(A,L) = 0.0044; f(A) = 0.074.
•  score (A,L) = (1 / 0.347) * log ( 0.0044 / (0.074 * 0.099) ) = -0.51 / 0.347 = -1.47, rounded to -1

Note: log = natural logarithm, ln.
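The three worked examples can be recomputed directly from the formula. A short Python sketch, using the probabilities and the λ value quoted in the examples (the function name is illustrative):

```python
from math import log

LAMBDA = 0.347   # the "lambda value" from the examples; C = 1 / LAMBDA


def log_odds_score(pair_prob, freq_a, freq_b, lam=LAMBDA):
    """C * ln( Prob(a,b) / (f(a) * f(b)) ) with C = 1 / lambda."""
    return log(pair_prob / (freq_a * freq_b)) / lam


examples = {
    "W-W": (0.0065, 0.013, 0.013),
    "L-L": (0.0371, 0.099, 0.099),
    "A-L": (0.0044, 0.074, 0.099),
}
for name, (prob, fa, fb) in examples.items():
    value = log_odds_score(prob, fa, fb)
    print(name, round(value, 2), "->", round(value))
```

The rounded values are the BLOSUM62 entries for these three pairs: +11, +4 and -1.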


BLOSUM 62 scores

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

Common amino acids have low weights

Rare amino acids have high weights

Negative for less likely substitutions

Positive for more likely substitutions

PAM matrices: “accepted point mutation”

•  What point mutations are accepted by evolution?
•  Dayhoff et al. examined 71 groups of proteins and counted all mutations = “accepted mutations”
•  Calculated mutation probabilities for all amino acids.
•  One “PAM” = 1% of the amino acids mutated (1 accepted point mutation per 100 residues)
•  PAM1 matrix for closely related proteins/organisms
•  PAM100, PAM 250* for less similar proteins.
•  PAMs use a log-odds score similar to that of the BLOSUM matrices to construct the substitution matrix.
•  PAM matrices are less used than BLOSUM, but all are distributed with BLAST.
•  Read more in the book (page 50-) if you are interested.

* 250 mutations / 100 is possible because each position may be mutated several times.


Substitution matrices (summary)

Unitary matrices (nucleotide, protein)
All matches get ’10’, all mismatches ’0’ (BLAST UNIT), or +1, -3 (BLASTN). Used for nucleotide seqs. Bad protein hits due to identities by chance.

Point Accepted Mutation, PAM (proteins)
PAM30, PAM70 ... matrices. Based on evolutionary distance: 1 PAM = 1 point mutation / 100 residues. Can’t handle distant relationships well.

Blocks Substitution Matrix, BLOSUM (prots)
BLOSUM50, BLOSUM62 ... matrices. Based on alignments in the BLOCKS db. Sequence segments above a certain identity are clustered. The most used matrices. BLOSUM62 is the default in BLAST (>62% identity).

Remember: Any substitution matrix is making a statement about the probability of observing a pair of aligned residues in real alignments!

BLAST: Raw score, bit score, E-value

•  For each alignment a “raw score” is calculated based on the chosen substitution matrix and gap costs (gap open, gap extend).
•  The raw score has several limitations:
   –  Shorter alignments with high identity get a lower score than longer ones with less identity, but it is the high identity alignments that are biologically more significant.
   –  The raw score does not tell us anything about the probability of finding an alignment by chance.

•  Solution:
   –  Calculate a normalized score: bit score = (λ * raw score – ln K) / ln 2. (λ and K are the “Karlin-Altschul” parameters)
   –  Calculate a value for how many alignments with this score or better we would find by chance in a database of this size: Expect-value = search space * 2^(- bit score); search space = query length * db size

•  Relation between score and E-value:
   –  High score means a low E-value (very few expected hits by chance)
   –  If database size gets smaller, E-value gets lower (better), score is the same.

•  E-value and P-value
   –  P-value is the probability of a chance alignment (E-value is the number of alignments)
   –  E-value is similar to P-value for E < 0.1. (P-values are not reported by BLAST.)
   –  Equation: P = 1 – e^(-E)
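The three formulas translate directly into code. In the Python sketch below the function names are illustrative, and the values λ = 0.267 and K = 0.041 are an assumption: they are the parameters commonly reported by BLAST for gapped BLOSUM62 searches and are not given in these notes. The raw score of 114 is taken from the BLOSUM62 scoring example.

```python
from math import exp, log


def bit_score(raw_score, lam, k):
    """Normalized score: (lambda * raw - ln K) / ln 2."""
    return (lam * raw_score - log(k)) / log(2)


def expect_value(query_length, db_size, bits):
    """Expected number of chance hits: search space * 2^(-bit score)."""
    return query_length * db_size * 2.0 ** (-bits)


def p_value(e_value):
    """Probability of at least one chance hit: 1 - e^(-E)."""
    return 1.0 - exp(-e_value)


# Assumed Karlin-Altschul parameters (gapped BLOSUM62), see note above.
bits = bit_score(114, lam=0.267, k=0.041)
print(round(bits, 1), "bits")

# Hypothetical search: 100-residue query against 1,000,000 residues.
e_small_db = expect_value(100, 1_000_000, bits)
e_large_db = expect_value(100, 2_000_000, bits)
print(e_small_db, e_large_db)            # doubling the db doubles E

for e in (0.001, 0.05, 1.0, 10.0):
    print(e, round(p_value(e), 4))       # P is close to E only for small E
```

With these parameters the raw score 114 gives about 48.5 bits, the same value BLAST reports in the example.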


BLOSUM62 scoring example

# Matrix made by matblas from blosum62.iij
# * column uses minimum score
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

Query  1  MTPSLSSFLNSLILGAVIVVVPITLALLFVSQKDRTIRS  39
          MTPSL SF  SL+LG +IVV+P+T+AL+ +SQ D+  R+
          MTPSLISFFYSLLLGTIIVVLPLTIALILISQTDKLKRN

How many identities? How many gaps? Scores for mismatches: Positive/negative?

Scores for mismatches:

SLNIAVVILLFVKRTIS
   + +++++ + +  +
IFYLTILLIILITKLKN
-2 0 -2 2 0 3 1 2 2 2 0 3 -1 2 -1 -3 1    Sum 9

Scores for matches:
MTPSLSFSLLGIVVPTALSQDR
5+5+7+3*4+6+3*4+6+3*4+7+5+3*4+5+6+5    Sum 105

Matrix is symmetrical!

BLOSUM62 scoring example: BLAST result

>Cyanidium_caldarium_chl_NC_001840
          Length = 164921

 Score = 48.5 bits (114), Expect = 9e-08
 Identities = 22/39 (56%), Positives = 31/39 (79%)
 Frame = +1

Query: 1      MTPSLSSFLNSLILGAVIVVVPITLALLFVSQKDRTIRS 39
              MTPSL SF  SL+LG +IVV+P+T+AL+ +SQ D+  R+
Sbjct: 122065 MTPSLISFFYSLLLGTIIVVLPLTIALILISQTDKLKRN 122181

Scores for 17 mismatches:

SLNIAVVILLFVKRTIS
   + +++++ + +  +
IFYLTILLIILITKLKN
-2 0 -2 2 0 3 1 2 2 2 0 3 -1 2 -1 -3 1    Sum 9

Scores for 22 matches:
MTPSLSFSLLGIVVPTALSQDR
5+5+7+3*4+6+3*4+6+3*4+7+5+3*4+5+6+5    Sum 105

Positive scores = more probable than by chance ...
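The counting exercise above can be reproduced in a few lines. This is a sketch, not how BLAST is implemented: it embeds the 20 standard amino-acid rows and columns of the BLOSUM62 matrix shown earlier and scores the ungapped alignment position by position. The helper names `parse_matrix` and `score_ungapped` are made up for this example.

```python
BLOSUM62_TEXT = """
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""

def parse_matrix(text):
    """Parse a whitespace-separated matrix into {(row, col): score}."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    columns = lines[0]
    return {(row[0], col): int(val)
            for row in lines[1:]
            for col, val in zip(columns, row[1:])}

def score_ungapped(query, subject, matrix):
    """Return (raw score, identities, positives) for an ungapped alignment."""
    assert len(query) == len(subject)
    scores = [matrix[(q, s)] for q, s in zip(query, subject)]
    identities = sum(q == s for q, s in zip(query, subject))
    positives = sum(sc > 0 for sc in scores)  # identities count as positives too
    return sum(scores), identities, positives

MATRIX = parse_matrix(BLOSUM62_TEXT)
query = "MTPSLSSFLNSLILGAVIVVVPITLALLFVSQKDRTIRS"
subject = "MTPSLISFFYSLLLGTIIVVLPLTIALILISQTDKLKRN"
raw, identities, positives = score_ungapped(query, subject, MATRIX)
print(raw, identities, positives)
```

The raw score is the sum of the match scores (105) and the mismatch scores (9), which is the value BLAST reports in parentheses after the bit score.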

E-value depends on database size:

Database: chloroplast_genomes_29.fas
29 sequences; 3,993,082 total letters
Expect = 9e-08

Database: nt
6,385,943 sequences; 22,801,566,233 total letters
Expect = 5e-04

4 Mb vs. about 23,000 Mb: roughly a 6000-fold difference in size.
6000 * 9e-08 = 5.4e-04, i.e. about 5e-04
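The same back-of-the-envelope calculation as a short sketch. It assumes that the E-value scales linearly with the total number of letters in the database and ignores the effective-length corrections that BLAST applies; `rescale_evalue` is an illustrative name.

```python
def rescale_evalue(e_value, old_db_letters, new_db_letters):
    """Scale an E-value linearly with database size (total letters)."""
    return e_value * new_db_letters / old_db_letters

small_db = 3_993_082        # chloroplast_genomes_29.fas
large_db = 22_801_566_233   # nt

ratio = large_db / small_db
e_large = rescale_evalue(9e-08, small_db, large_db)
print(f"size ratio: about {ratio:.0f}x")
print(f"rescaled E-value: about {e_large:.0e}")
```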

Page 29: Bioinformatics – Sequence analysis


Output from Blast

BLASTP 2.0.11 [Jan-20-2000]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= ramp4.seq (75 letters)

Database: nr 457,798 sequences; 140,871,481 total letters

Searching..................................................done

                                                                    Score     E
Sequences producing significant alignments:                         (bits)  Value

gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associated membr...    126  2e-29
gi|3851666 (AF100470) ribosome attached membrane protein 4 [Rat...    126  2e-29
gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder;...     74  1e-13
gi|3935169 (AC004557) F17L21.12 [Arabidopsis thaliana]                 46  3e-05
gi|3935171 (AC004557) F17L21.14 [Arabidopsis thaliana]                 36  0.048
gi|5921764|sp|O13394|CHS5_USTMA CHITIN SYNTHASE 5 (CHITIN-UDP A...     29  3.6

E-value: expected number of hits by chance in a database of this size.

>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene; cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from this gene; cDNA EST EMBL:C0...
          Length = 65

 Score = 74.1 bits (179), Expect = 1e-13
 Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QR+ +AN++ SKN+  RGNVAK+ + A E+K    PWL+ LF+FVVCGSA+F+II+ ++M
Sbjct: 5  QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKM 63

Query: 74 G 74
          G
Sbjct: 64 G 64

In protein alignments some mismatches are marked “similar” (+).

The alignment is called “High Scoring Pair” (HSP). There may be several HSPs for each sequence

Page 30: Bioinformatics – Sequence analysis


BLASTP: info at end of report

Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1

Number of Hits to DB: 212,176,541
Number of Sequences: 5470121
Number of extensions: 6292746
Number of successful extensions: 36982

Number of sequences better than 10.0: 9
Number of HSP's better than 10.0 without gapping: 6
Number of HSP's successfully gapped in prelim test: 3
Number of HSP's that attempted gapping in prelim test: 36976
Number of HSP's gapped (non-prelim): 9

length of query: 70
length of database: 1,894,087,724
effective HSP length: 42
effective length of query: 28
effective length of database: 1,664,342,642
effective search space: 46601593976
effective search space used: 46601593976

Used substitution matrix Gap costs

Number of “word” hits in DB. Number of sequences in DB. Number of extensions of word.

Number of sequences E < 10 HSP gap info

Number of HSPs ( = 1 / seq) Query length

Calculated search space (effective length of query * effective length of db) used in E-value calculation.

BLASTN: info at end of report

Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2

Number of Hits to DB: 95,310
Number of Sequences: 29
Number of extensions: 95310
Number of successful extensions: 67

Number of sequences better than 10.0: 26
Number of HSP's better than 10.0 without gapping: 26
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 0
Number of HSP's gapped (non-prelim): 67

length of query: 108
length of database: 3,993,082
effective HSP length: 14
effective length of query: 94
effective length of database: 3,992,676
effective search space: 375311544
effective search space used: 375311544

Scoring: identity +1, mismatch -3 Gap costs

Number of “word” hits in DB. Number of sequences in DB. Number of extensions of word.

Number of sequences E < 10 HSP gap info

Number of HSPs (>1 / seq) Query length

Calculated search space (effective length of query * effective length of db) used in E-value calculation.
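The numbers in the two reports are consistent with a simple relation: the effective HSP length is subtracted once from the query and once per database sequence, and the two effective lengths are multiplied. The sketch below encodes that relation as inferred from the reports; real BLAST applies further safeguards, and the function name is illustrative.

```python
def effective_search_space(query_len, db_len, n_seqs, hsp_len):
    """Return (effective query length, effective db length, search space)."""
    eff_query = query_len - hsp_len
    eff_db = db_len - n_seqs * hsp_len  # one correction per database sequence
    return eff_query, eff_db, eff_query * eff_db

# Values copied from the BLASTP and BLASTN reports above.
blastp = effective_search_space(70, 1_894_087_724, 5_470_121, 42)
blastn = effective_search_space(108, 3_993_082, 29, 14)
print("BLASTP:", blastp)
print("BLASTN:", blastn)
```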

Page 31: Bioinformatics – Sequence analysis


BLAST search variants

Program    Query         Output/Align   Database
blastn     DNA           dna/dna        DNA
blastp     Protein       prot/prot      Protein
tblastn*   Protein       prot/prot      DNA (6 frames)
blastx*    DNA (6 fr)    prot/prot      Protein
tblastx*   DNA (6 fr)    prot/prot      DNA (6 frames)

* = “translated BLAST”

Example 1: Searching a new genome assembly for a protein homolog. Input: protein. Database: DNA (genome sequences). Program: tblastn

Example 2: We have DNA sequences and want to find out if they code for a similar protein. ... ??
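The table of BLAST variants can be read as a small decision rule. The sketch below encodes only that table; the function and parameter names are illustrative and not part of any BLAST package.

```python
def choose_blast_program(query_type, db_type, compare_as_protein=False):
    """Pick a BLAST variant from the query and database types.

    query_type, db_type: "dna" or "protein".
    compare_as_protein: only relevant for DNA vs DNA; if True, both sides
    are translated in 6 frames and compared at the protein level.
    """
    if query_type == "protein" and db_type == "protein":
        return "blastp"
    if query_type == "protein" and db_type == "dna":
        return "tblastn"
    if query_type == "dna" and db_type == "protein":
        return "blastx"
    if query_type == "dna" and db_type == "dna":
        return "tblastx" if compare_as_protein else "blastn"
    raise ValueError("types must be 'dna' or 'protein'")

# Example 1: protein query against genome sequences.
print(choose_blast_program("protein", "dna"))
# Example 2: DNA query against a protein database.
print(choose_blast_program("dna", "protein"))
```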

Conserved synteny ... missing gene?

apcD accD psbV petJ tRNA psbX

apcD accD psbV petJ tRNA ?

apcD accD psbV petJ tRNA psbX

C.merolae

The annotations of three chloroplast genomes from red algae were compared. In a conserved gene cluster, one species lacked a gene.

C.caldarium

P.purpurea

How do we check if the gene is really missing using BLAST?

Query? Searched database? Program? Alternatives?

apcD tRNA psbX accD psbV petJ

Page 32: Bioinformatics – Sequence analysis


Search results psbX (TBLASTN)

     gene            121234..121725
                     /gene="apcD"
                     /locus_tag="CycaCp138"
                     /db_xref="GeneID:800290"
     CDS             121234..121725
                     /gene="apcD"
                     /locus_tag="CycaCp138"
                     /note="allphycocyanin gamma chain"
                     /codon_start=1
                     /transl_table=11
                     /product="allophycocyanin gamma subunit"
                     /protein_id="NP_045154.1"
     gene            complement(121897..121979)
                     /locus_tag="CycaCt226"
                     /db_xref="GeneID:1457232"
     tRNA            complement(121897..121979)
                     /locus_tag="CycaCt226"
                     /product="tRNA-Leu"
                     /db_xref="GeneID:1457232"
     gene            122225..123028
                     /gene="accD"
                     /locus_tag="CycaCp139"
                     /db_xref="GeneID:800114"
     CDS             122225..123028
                     /gene="accD"
                     /locus_tag="CycaCp139"
                     /codon_start=1
                     /transl_table=11
                     /product="acetyl-CoA carboxylase beta subunit"
                     /protein_id="NP_045155.1"

C.caldarium chloroplast: “psbX region”

C.caldarium matching region:
>122065-122187 C. caldarium chloroplast
ATGACACCAAGTTTGATTTCTTTTTTCTATAGTTTACTTTTA
GGAACTATCATTGTCGTTTTACCATTAACAATAGCGCTTATA
TTAATTAGCCAAACTGATAAGCTAAAAAGAAAT
TTTTAG

BLAST result:

Query= gi|122194748|sp|Q1XDU0.1|PSBX_PORYE
Length=39

>ref|NC_001840.1| Cyanidium caldarium chloroplast, complete genome
Length=164921

 Score = 47.0 bits (110), Expect = 1e-08
 Identities = 22/39 (56%), Positives = 31/39 (79%), Gaps = 0/39 (0%)
 Frame = +1

Query  1       MTPSLSSFLNSLILGAVIVVVPITLALLFVSQKDRTIRS  39
               MTPSL SF  SL+LG +IVV+P+T+AL+ +SQ D+  R+
Sbjct  122065  MTPSLISFFYSLLLGTIIVVLPLTIALILISQTDKLKRN  122181

 Score = 20.0 bits (40), Expect = 1.4
 Identities = 7/20 (35%), Positives = 15/20 (75%), Gaps = 0/20 (0%)
 Frame = +3

Query  4      SLSSFLNSLILGAVIVVVPI  23
              +LS FLN ++   ++++VP+
Sbjct  89931  TLSCFLNEMLESLILLLVPL  89990

Full protein sequence?: TTT = F, TAG = ?
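To see the full reading frame, the matching region can be translated codon by codon. The sketch below builds the standard codon table from its compact 64-letter form; translation table 11 (given in the annotation above) assigns the same amino acids to codons as the standard code and differs only in the permitted start codons. The function name `translate` is illustrative.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codons are enumerated in TCAG order for each of the three positions.
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(dna):
    """Translate a DNA string in frame 1; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

# Region 122065-122187 of the C. caldarium chloroplast genome (from above).
region = (
    "ATGACACCAAGTTTGATTTCTTTTTTCTATAGTTTACTTTTA"
    "GGAACTATCATTGTCGTTTTACCATTAACAATAGCGCTTATA"
    "TTAATTAGCCAAACTGATAAGCTAAAAAGAAAT"
    "TTTTAG"
)
protein = translate(region)
print(len(region), protein)
```

The translation extends the BLAST hit by one residue and then ends at the TAG codon.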

BLAST output, with many HSPs

gb|CM000011.1| Canis familiaris chromosome 11, whole genome shot...     86  9e-15

>gb|CM000011.1| Canis familiaris chromosome 11, whole genome shotgun sequence
          Length = 75769841

 Score = 85.7 bits (43), Expect = 9e-15
 Identities = 89/102 (87%), Gaps = 3/102 (2%)
 Strand = Plus / Minus

Query: 4        cgtgctgaaggcctgtatcctaggctacacactgaggactctgttcctcccctttccgcc 63
                |||||||||||||||| |||||||||||| || ||||||| |||||||    ||| ||||
Sbjct: 53542401 cgtgctgaaggcctgtttcctaggctacagacggaggact-tgttcctta--tttgcgcc 53542345

Query: 64       taggggaaagtccccggacctcgggcagagagtgccacgtgc 105
                ||||||||||||||||||||   ||||||||||||| |||||
Sbjct: 53542344 taggggaaagtccccggacccttggcagagagtgccgcgtgc 53542303

 Score = 75.8 bits (38), Expect = 9e-12
 Identities = 75/86 (87%), Gaps = 1/86 (1%)
 Strand = Plus / Minus

Query: 181      ggggcgtcatccgtcagctccctctagttacgcaggcagtgcgtgtcc-gcgcaccaacc 239
                |||||||| ||||||| |||  ||||||||||||||||| |||  |   |||| ||||||
Sbjct: 53542216 ggggcgtcgtccgtcaactctatctagttacgcaggcagcgcgcctggtgcgcgccaacc 53542157

Query: 240      acacggggctcattctcagcgcggct 265
                ||||||||||||||||||||||||||
Sbjct: 53542156 acacggggctcattctcagcgcggct 53542131

 Score = 36.2 bits (18), Expect = 7.7
 Identities = 18/18 (100%)
 Strand = Plus / Minus

Query: 25       aggctacacactgaggac 42
                ||||||||||||||||||
Sbjct: 42727936 aggctacacactgaggac 42727919

Note: Only the best HSP is shown in the list before the alignments. Check the positions to understand in which order the HSPs match. The strand must be the same!

Page 33: Bioinformatics – Sequence analysis


Aligning two sequences - Gap extension penalty. Alignment of genomic sequence with mRNA (Global alignment!)

Alignment of the following two sequences: V00594 (Human mRNA for metallothionein) and J00271 (corresponding genomic sequence).

Default setting: Extend gap = 3

In a global alignment all residues are matched.

New settings: Extend gap = 0

Exon 1

Exon 2

Exon 3

Page 34: Bioinformatics – Sequence analysis


Rules of database searches (like BLAST)

•  Database sequence searches involving proteins should be carried out at the protein level and not at the DNA level. *
•  Use the smallest possible database (not too small though ... homologs?).
•  Sequence statistics, rather than percent identity/similarity, should be used as the criterion for homology: E-values < 1e-03. (But distant homologs ...)
•  Consider different scoring matrices and gap penalties.

* 1) DNA sequences encoding the same protein sequence can be very different, due to the degeneracy of the genetic code.

TTTCGATTCTCAACAAGAAGC
**  * **    **  *   *
TTCAGGTTTAGCACGCGGTCC
 F  R  F  S  T  R  S

2) For nucleotide-nucleotide searches, it is often good to set the word size low (-W 7)

1 MSAAPVQDKDTLSNAERAKNVNGLLQVLMDINTLNGGSSDTADKIRIHAKNFEAALFAKS 60

61 SSKKEYMDSMNEKVAVMRNTYNTRKNAVTAAAANNNIKPVEQHHINNLKNSGNSANNMNV 120

121 NMNLNPQMFLNQQAQARQQVAQQLRNQQQQQQQQQQQQRRQLTPQQQQLVNQMKVAPIPK 180

181 QLLQRIPNIPPNINTWQQVTALAQQKLLTPQDMEAAKEVYKIHQQLLFKARLQQQQAQAQ 240

241 AQANNNNNGLPQNGNINNNINIPQQQQMQPPNSSANNNPLQQQSSQNTVPNVLNQINQIF 300

301 SPEEQRSLLQEAIETCKNFEKTQLGSTMTEPVKQSFIRKYINQKALRKIQALRDVKNNNN 360

361 ANNNGSNLQRAQNVPMNIIQQQQQQNTNNNDTIATSATPNAAAFSQQQNASSKLYQ

Low complexity sequence tends to (1) increase the number of non-specific hits to database sequences (2) correspond to regions in proteins not associated with a known biological function (typically unstructured parts of the protein)

Therefore, low complexity parts are filtered out by default in BLAST searches. (Don’t use filtering if you want exact matches.)
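The idea behind low-complexity filtering can be illustrated with a sliding-window entropy measure. This is a deliberately simplified stand-in, not the SEG or DUST algorithms that BLAST uses; the window size and threshold are arbitrary assumptions chosen for the example.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every window with entropy below the threshold by X's."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = "X" * window
    return "".join(masked)

# Residues 121-180 of the example protein shown above.
segment = "NMNLNPQMFLNQQAQARQQVAQQLRNQQQQQQQQQQQQRRQLTPQQQQLVNQMKVAPIPK"
print(segment)
print(mask_low_complexity(segment))
```

The glutamine-rich stretch is replaced by X's, which is roughly what a masked query looks like in BLAST output.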

Page 35: Bioinformatics – Sequence analysis


Databases at NCBI available for BLAST searches

Protein sequence databases

nr All non-redundant GenBank CDS translations +PDB+SwissProt+PIR+PRF

swissprot the last major release of SWISS-PROT

uniprot swissprot + TrEMBL (translated EMBL DNA sequences)

DNA sequence Databases

nr All Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)

dbest Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions

You may also blast against single genomes ...

How can the sequences of protein homologs be so different?

ATGGCAAAACTTGAAAAACTGAATCAAGCAGGCCTGATGGTCGCTGGT
M A K L E K L N Q A G L M V A G

60% nucleotide identity

ATGGCTAGGTTGGAGAAGATAAACCAAGCTGGGATAATAGTTGCAGGA
M V R L E K I N Q A G L L V A G    69% amino acid identity

M V R I Q K I N E K G A L L A G    38%

Q V R I Q K I Y E K G A L L A A    19% (‘twilight zone’)

Q V R I Q K I Y E K T A L L F A    6% (‘midnight zone’)

Evolution of protein genes: secondary and tertiary structure conserved

Page 36: Bioinformatics – Sequence analysis


BLAST at NCBI What kind of BLAST will you perform?

BLASTP search page 1. INPUT: Sequence or accession number

2. DATABASE: Choose non-redundant nr, SwissProt ...

3. RESTRICT SEARCH? Input organism or organism group to be searched, other sequences are neglected

Sequence

Database Organism

Page 37: Bioinformatics – Sequence analysis


BLAST output at NCBI

1 perfect hit, some hits with parts of sequence matched

Alignments below

“HSP” (high scoring pair) – there may be several!

Best hit

Next best hit

Page 38: Bioinformatics – Sequence analysis


Sequence analysis II

•  Finding distant homologues with BLink
•  Multiple sequence alignments
•  PSSM, HMM, profiles
•  Domain databases
•  Secondary structure
•  Transmembrane prediction
•  Signal peptide prediction

Eukaryote phylogeny

Baldauf et al. Science 2000

Based on 4 concatenated protein sequences.

Maximum parsimony.

Humans are more like fungi than plants!

Page 39: Bioinformatics – Sequence analysis


Protein similarity: yeasts, chordates

Mouse and human proteins are very similar.

Candida glabrata is “closest” to S.cerevisiae, but not as similar as one may think.

Orthologous sequences.

“Evolution”: yeasts, chordates

Yeasts illustrate the molecular mechanisms of eukaryotic genome evolution. Dujon B. Trends Genet. 2006 Jul;22(7):375-87.

Average protein identity between orthologs

Page 40: Bioinformatics – Sequence analysis


NCBI protein entry, BLink

Name, length, IDs

Sources Keywords Organism and taxonomy References

The NCBI protein database entry for SRP21 from yeast which we searched for.

As many researchers want to make BLAST searches, a precomputed BLAST page is accessible: BLink (“BLAST Link”)

SRP21 YEAST

Finding homologs by iterative BLAST

Fitzpatrick et al., BMC Evolutionary Biology (2006) 6:99

Shown is a phylogenetic tree of 42 fungal species. At the bottom are the Saccharomycotina (2 major groups, red), above them the Pezizomycotina (green), and the last group with only S.pombe (orange).

Starting with a Saccharomyces cerevisiae protein, how many homologs can we find within (Ascomycota) fungi?

Is the S.cerevisiae protein similar enough for BLAST to find the homologs ...

Page 41: Bioinformatics – Sequence analysis


Saccharomyces SRP21 BLAST result

Chosen protein: SRP21

The SRP21 protein from Saccharomyces cerevisiae is used in a BLAST search to find homologs in other organism groups. But the complete BLAST (BLink) list only contains SRP21 from the two Saccharomycotina groups.

Idea: Pick a distantly related SRP21 and use it as new query ...

BLink is the precomputed BLAST search result available at NCBI.

Debaryomyces SRP21 BLAST result (BLink)

Aspergillus SRP21 and other Pezizomycotina

Saccharomycotina SRP21

Debaryomyces belongs to Saccharomycotina, so other SRP21 from the 2 groups are easily found.

Now we also find Pezizomycotina homologs!

But no S.pombe ...

Use Aspergillus SRP21!

Page 42: Bioinformatics – Sequence analysis


Aspergillus SRP21 BLAST

S.pombe SRP21 !

Saccharomycotina SRP21

Pezizomycotina SRP21 and other sequences

Aspergillus belongs to the Pezizomycotina group, so those homologs are easily found. The result also includes some Saccharomycotina.

Now we have finally found the S.pombe homolog.

Finding homologs by iterative BLAST

Fitzpatrick et al., BMC Evolutionary Biology (2006) 6:99

S.cerevisiae SRP21 (Saccharomycotina - A)

Debaryomyces SRP21 (Saccharomycotina - B)

Aspergillus SRP21 (Pezizomycotina)

S.pombe SRP21 !

Accumulated mutations during evolution made the proteins too different. By “jumping” from group to group we bridged the gap.
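The “jumping” strategy is in effect a breadth-first search over hit lists: every new homolog becomes a query in the next round. The sketch below uses a toy dictionary that only mirrors which group found which in the slides above; it is not real BLAST output, and `iterative_search` is an illustrative name.

```python
from collections import deque

# Toy hit lists (assumption): query -> homologs a BLAST search can detect.
HITS = {
    "S.cerevisiae SRP21": ["Debaryomyces SRP21"],
    "Debaryomyces SRP21": ["S.cerevisiae SRP21", "Aspergillus SRP21"],
    "Aspergillus SRP21": ["Debaryomyces SRP21", "S.pombe SRP21"],
    "S.pombe SRP21": ["Aspergillus SRP21"],
}

def iterative_search(start, hits):
    """Return {protein: number of search rounds needed to reach it}."""
    rounds = {start: 0}
    queue = deque([start])
    while queue:
        query = queue.popleft()
        for hit in hits.get(query, []):
            if hit not in rounds:  # a new homolog becomes the next query
                rounds[hit] = rounds[query] + 1
                queue.append(hit)
    return rounds

for protein, n in iterative_search("S.cerevisiae SRP21", HITS).items():
    print(n, protein)
```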

Page 43: Bioinformatics – Sequence analysis


Check that S.pombe SRP21 finds ...

Debaryomyces SRP21 (Saccharomycotina)

Other Pezizomycotina SRP21

Aspergillus SRP21 (Pezizomycotina)

If protein is not annotated – check if reciprocal search is successful

SRP21 aligned to SRP9 &14

Unaligned box 21

9

14

Secondary structure prediction by PSI-Pred also showed the conserved αβββα structure.

SRP9/14 αβββα secondary structure (Birse et al.) shown as cylinders (alpha helices) and arrows (beta strands).

The most conserved residues are in secondary structure elements. SRP9, SRP21 more similar.

Residues marked according to similarity in sequence and chemical properties.

21

9

14

Page 44: Bioinformatics – Sequence analysis


Sr54_arcfu

Ftsy_aquae

Shared domain(s)

Example: Homology, domain architecture

Common ancestry, different function

Orthologs/paralogs?

RNA binding domain

N-terminal

C-terminal

Two different proteins (4+4 sequences ) are aligned. They share a domain. They are paralogs.

Page 45: Bioinformatics – Sequence analysis


Pfam: TreeFam (homologs) FA9_HUMAN example (cont.):

In the displayed tree, gene duplications (red dots at nodes, leading to paralogs) and speciations (blue dots, leading to orthologs) are shown.

FA9 and FA10 arose by a gene duplication. Some nodes are hard to decide upon (duplications?).

FA10

FA9

Multiple alignments - applications

Identify conserved motifs - patterns (PROSITE)
Profiles (Pfam, PROSITE)
Phylogenetic studies
Prediction of protein secondary structure
Experimental: design of probes

Page 46: Bioinformatics – Sequence analysis


Multiple alignment software

Pileup (GCG)

Clustalw / Clustalx

T-coffee

Muscle/MAFFT

Multiple alignment editors/viewers

SeqLab (GCG)
Jalview
CINEMA
Genedoc
Bioedit
Boxshade

How to find homologs with low sequence identity

•  Similarity gets low if evolutionary distance gets big.
•  Many amino acid positions change.
•  An amino acid may be substituted differently in different species.
•  If we have many known homologs, we can use all of them as queries, but the unknown sequence may have yet another set of substitutions compared to the known homologs. Solution: align known sequences and make a “profile”.
•  The profile has a different substitution matrix for each position in the alignment ...

Page 47: Bioinformatics – Sequence analysis


Multiple alignment: env, ClustalX

frequency plot (no gaps!)

information (bits) plot

The probability of an amino acid in a sequence is dependent on the position!
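An information (bits) plot can be computed column by column from an alignment. The sketch below uses the common definition log2(20) minus the Shannon entropy of the column; it assumes a uniform background, ignores gaps and applies no small-sample correction. The toy alignment is hypothetical and not taken from the env example.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column."""
    residues = [c for c in column if c != "-"]  # gaps are ignored
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = sum((n / total) * math.log2(total / n) for n in counts.values())
    return math.log2(alphabet_size) - entropy

# Hypothetical toy alignment (assumption, for illustration only).
alignment = ["GDSGGP", "GDSGGP", "GDAGGP", "GNSGGA"]
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
for pos, col in enumerate(columns, start=1):
    print(pos, col, round(column_information(col), 2))
```

Fully conserved columns reach the maximum of log2(20), about 4.32 bits; variable columns score lower.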

Position Specific Substitution Rates

Active site serine Typical serine

Page 48: Bioinformatics – Sequence analysis


Position Specific Score Matrix (PSSM)

         A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D    0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G   -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V   -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I   -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 S   -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S    4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C   -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N   -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G   -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D   -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S   -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G   -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G   -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P   -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L   -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N   -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C    0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q    0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A   -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3

Serine scored differently in these two positions

Active site nucleophile

Example sequence. How does Serine score in positions 211 and 216?
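The question can be answered by a table lookup. The sketch below copies the rows for positions 211 and 216 from the PSSM above (scores in the column order A R N D ... V); the helper names are illustrative.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows 211 and 216 of the PSSM above, in AMINO_ACIDS column order.
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}

def pssm_score(position, residue):
    """Score of a residue at an alignment position."""
    return PSSM_ROWS[position][AMINO_ACIDS.index(residue)]

def tolerated(position):
    """Residues with a positive score at this position."""
    return [aa for aa in AMINO_ACIDS if pssm_score(position, aa) > 0]

for pos in (211, 216):
    print(pos, "S scores", pssm_score(pos, "S"), "tolerated:", tolerated(pos))
```

At position 211 serine scores the same as in BLOSUM62 and alanine or threonine are also acceptable, whereas at position 216 only serine gets a positive score.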

Amino acids

Pfam uses HMMs

An HMM (hidden Markov model) is a statistical model for identifying features in a primary sequence. It is built from a multiple sequence alignment, from which the probabilities are calculated. HMMs are also used in gene prediction (GenScan) ...

Splice site “toy” example. We are looking for the start of the intron.
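A toy splice-site HMM of this kind can be decoded with the Viterbi algorithm. The slide does not give any parameters, so all probabilities and the example sequence below are assumptions chosen for illustration: an exon state E with uniform base composition, a 5' splice site state "5" that almost always emits G, and an AT-rich intron state I.

```python
import math

STATES = ("E", "5", "I")  # exon, 5' splice site, intron

# Assumed toy parameters (illustrative only, not from the slide).
START = {"E": 1.0}
TRANS = {"E": {"E": 0.9, "5": 0.1}, "5": {"I": 1.0}, "I": {"I": 0.9}}
END = {"I": 0.1}
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(seq):
    """Return (log probability, state path) of the most probable path."""
    # best[s] = (log probability, path) of the best path ending in state s
    best = {s: (log(START.get(s, 0.0)) + log(EMIT[s][seq[0]]), s) for s in STATES}
    for base in seq[1:]:
        step = {}
        for s in STATES:
            score, path = max(
                (best[p][0] + log(TRANS[p].get(s, 0.0)), best[p][1]) for p in STATES
            )
            step[s] = (score + log(EMIT[s][base]), path + s)
        best = step
    return max((best[s][0] + log(END.get(s, 0.0)), best[s][1]) for s in STATES)

seq = "CTTCATGTGAAAGCAGACGTAAGTCA"
logp, path = viterbi(seq)
print(seq)
print(path)
print(f"log P = {logp:.2f}; splice site at position {path.index('5') + 1}")
```

The state path labels each base as exon, splice site or intron; the single "5" marks the first base of the predicted intron.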

Page 49: Bioinformatics – Sequence analysis


Pfam-Trypsin: “Summary” The summary for the domain contains background, literature references and links to other databases plus much more ....

(... next slide ...)

... Pfam-Trypsin: “Summary” ...

The summary also groups the domain into a “clan” which has several other members, in this case the clan is called Peptidase_PA *.

Gene Ontology terms associated with the domain are listed: proteolysis.

Similarity to other domains by using PRC, a domain-domain search program. Most are in the same clan.

Links

* This clan contains a diverse set of peptidases with the trypsin fold.

Page 50: Bioinformatics – Sequence analysis


Pfam-Trypsin: “Interactions”

The “Interaction” section lists the domains that have been shown to interact with the domain.

For instance, coagulase domain is present in a bacterial protein that binds to the trypsin domain of human prothrombin.

Pfam-Trypsin : “Domain organisation” For each domain, Pfam lists all types of domain organisations that contain the domain.

Shown are the first 5 domain organisations; the organisation shared by FA10 and FA9 is at the bottom, with its member proteins displayed.

Gla-EGF-EGF-Trypsin

List: FA9 etc.

Trypsin-PDZ

Trypsin-PDZ-PDZ

Trypsin-Trypsin

Page 51: Bioinformatics – Sequence analysis


Pfam-Trypsin: “Curation”, seed Information about how the domain was created.

Where the initial (seed) alignment came from, in this case SCOP and PROSITE, and how many sequences it contained (71).

The “full” alignment is the seed plus all found proteins (6237!) that score above a certain threshold. Seed and Full alignments are found under “Alignments”.

Pfam – protein domains DB

•  From multiple alignments of many related proteins, profiles (profile HMMs) are made

•  Curated and highly trusted Pfam-A (red, green, yellow) and automatic temporary Pfam-B (striped).

•  Input a sequence, match to all families/HMMs.

•  UniProt sequences are in Pfam database. (shown is FA9_HUMAN)

Pfam DB: Karolinska Inst., Sanger (UK), St. Louis (USA), Pasteur (F)

http://pfam.sanger.ac.uk/

Pfam-A

Page 52: Bioinformatics – Sequence analysis


Pfam: “Features”, FA9_HUMAN

Disulphide bonds

InterPro domains

Pfam superfamilies

Signal peptide prediction

In “Features” for a sequence, Pfam incorporates other info than Pfam domains, some imported from other databases. + Prediction of signal peptides.

Holding the cursor over part gives info.

FA9 search at PROSITE

Symbols for active sites (red) and disulphide bridges (grey lines), also in Pfam.

Hold pointer over domain and the sequence will be highlighted.

Almost the same result as Pfam. Second EGF domain in Pfam is not in PROSITE.

PROSITE also contains matches to patterns!

PROSITE does not have an equivalent to Pfam-B models.

http://www.expasy.org/

Page 53: Bioinformatics – Sequence analysis


NCBI Conserved Domains

In CDD also Pfam domains are listed.

The search is conducted in a similar but not identical way.

SRS: InterPro – search all domain DB

InterPro domains are also listed in Pfam for each protein

PROSITE Pfam PRODOM PRINTS SMART ... ...

Seq input

Page 54: Bioinformatics – Sequence analysis


PSI-BLAST creates profiles automatically

When no more new sequences are found, search terminates.

Problem: If bad sequences enter the profile, it finds only trash!

PSI-BLAST example (POP)

•  The particle RNaseP exists both in yeast and metazoans, but their protein components differ

•  However, several of them have been shown to be homologs:

Metazoa    Yeast
Rpp25      POP6
Rpp38      POP3

•  Two of them (Rpp14, POP8) do not have a homolog in the other organism. Or?

Page 55: Bioinformatics – Sequence analysis


PSI-BLAST example (POP8)

QUERY: protein from Neurospora crassa (a fungus), a probable POP8 protein.

QUESTION: Can we see a distant relationship to metazoan proteins? (We are interested in Rpp14.)

BLAST did not give us any new information, only other fungal homologs. A more sensitive method is needed.

We try with PSI-BLAST instead, and look at the results after round 1-5: _________________________________________________________________________________

Results from round 1
                                                                     Score     E
Sequences producing significant alignments:                          (bits)  Value

emb|CAD70969.1| hypothetical protein [Neurospora crassa] >gi|324...   221  5e-57
gb|EAA77253.1| hypothetical protein FG07394.1 [Gibberella zeae P...   126  2e-28
ref|XP_370445.1| hypothetical protein MG06942.4 [Magnaporthe gri...    96  4e-19
ref|XP_659856.1| hypothetical protein AN2252.2 [Aspergillus nidu...    62  4e-09
emb|CAD99064.1| SPBC1709.20 [Schizosaccharomyces pombe] >gi|6801...    39  0.069
PGUG_01796.1 | Candida guilliermondii predicted protein (transla...    34  1.8

/.../ Comment: Only fungal proteins in the first round as in BLAST. No profile yet.

PSI-BLAST example (2)

Results from round 2
                                                                     Score     E
Sequences producing significant alignments:                          (bits)  Value

Sequences used in model and found again:

gb|EAA77253.1| hypothetical protein FG07394.1 [Gibberella zeae P...   218  5e-56
ref|XP_370445.1| hypothetical protein MG06942.4 [Magnaporthe gri...   205  3e-52
emb|CAD70969.1| hypothetical protein [Neurospora crassa] >gi|324...   203  1e-51
ref|XP_659856.1| hypothetical protein AN2252.2 [Aspergillus nidu...   144  7e-34

Sequences not found previously or not previously below threshold:

emb|CAD99064.1| SPBC1709.20 [Schizosaccharomyces pombe] >gi|6801...    50  2e-05
gb|AAS53388.1| AFR017Cp [Ashbya gossypii ATCC 10895] >gi|4519853...    46  5e-04
ref|ZP_00744890.1| COG0008: Glutamyl- and glutaminyl-tRNA synthe...    42  0.006
ref|ZP_00755396.1| COG0008: Glutamyl- and glutaminyl-tRNA synthe...    41  0.009
gb|AAF93762.1| glutamyl-tRNA synthetase-related protein [Vibrio ...    41  0.009
ref|ZP_00748964.1| COG0008: Glutamyl- and glutaminyl-tRNA synthe...    41  0.009
ref|ZP_00751512.1| COG0008: Glutamyl- and glutaminyl-tRNA synthe...    41  0.009
ref|ZP_00434095.1| COG0625: Glutathione S-transferase [Burkholde...    41  0.012
ref|XP_454424.1| unnamed protein product [Kluyveromyces lactis] ...    40  0.034
713_pichia_stipitis_FM1.aa.fasta                                       38  0.083
emb|CAE28370.1| conserved hypothetical protein [Rhodopseudomonas...    37  0.18
ref|XP_341296.2| PREDICTED: similar to ribonuclease P 14kDa subu...    37  0.19

/../ Comment: The constructed profile found more sequences, even a Rpp14.

Page 56: Bioinformatics – Sequence analysis


PSI-BLAST example (3)

Results from round 4
                                                                     Score     E
Sequences producing significant alignments:                          (bits)  Value

Sequences used in model and found again:

gb|EAA77253.1| hypothetical protein FG07394.1 [Gibberella zeae P...   179  3e-44
ref|XP_370445.1| hypothetical protein MG06942.4 [Magnaporthe gri...   175  3e-43
emb|CAD70969.1| hypothetical protein [Neurospora crassa] >gi|324...   173  1e-42
ref|XP_454424.1| unnamed protein product [Kluyveromyces lactis] ...   143  2e-33
713_pichia_stipitis_FM1.aa.fasta                                      140  2e-32
gb|AAS53388.1| AFR017Cp [Ashbya gossypii ATCC 10895] >gi|4519853...   133  2e-30
emb|CAA84837.1| unnamed protein product [Saccharomyces cerevisia...   127  1e-28
emb|CAG88296.1| unnamed protein product [Debaryomyces hansenii C...   123  2e-27
emb|CAD99064.1| SPBC1709.20 [Schizosaccharomyces pombe] >gi|6801...   122  3e-27
ref|XP_659856.1| hypothetical protein AN2252.2 [Aspergillus nidu...   119  3e-26
gb|AAH95792.1| Hypothetical protein LOC553721 [Danio rerio] >gi|...   119  3e-26
ref|XP_447894.1| unnamed protein product [Candida glabrata] >gi|...   107  8e-23

Sequences not found previously or not previously below threshold:

ref|XP_593187.1| PREDICTED: similar to ribonuclease P 14kDa subu...    85  7e-16
ref|NP_080214.1| ribonuclease P 14kDa subunit [Mus musculus] >gi...    82  6e-15

/.../ Comment: Rpp14 sequences now have good E-values.

PSI-BLAST example (end)

Results from round 5
                                                                     Score     E
Sequences producing significant alignments:                          (bits)  Value

Sequences used in model and found again:

ref|XP_370445.1| hypothetical protein MG06942.4 [Magnaporthe gri...   165  4e-40
gb|EAA77253.1| hypothetical protein FG07394.1 [Gibberella zeae P...   162  3e-39
ref|NP_008973.1| ribonuclease P 14kDa subunit [Homo sapiens] >gi...   156  2e-37
gb|AAX36190.1| ribonuclease P 14kDa subunit [synthetic construct]     156  2e-37
emb|CAD70969.1| hypothetical protein [Neurospora crassa] >gi|324...   155  4e-37
ref|XP_849188.1| PREDICTED: similar to ribonuclease P 14kDa subu...   152  3e-36
ref|NP_080214.1| ribonuclease P 14kDa subunit [Mus musculus] >gi...   152  4e-36
ref|XP_593187.1| PREDICTED: similar to ribonuclease P 14kDa subu...   151  5e-36
ref|XP_341296.2| PREDICTED: similar to ribonuclease P 14kDa subu...   150  2e-35
gb|AAS53388.1| AFR017Cp [Ashbya gossypii ATCC 10895] >gi|4519853...   144  6e-34
gb|AAH95792.1| Hypothetical protein LOC553721 [Danio rerio] >gi|...   140  1e-32
ref|NP_001017048.1| ribonuclease P 14kDa subunit [Xenopus tropic...   135  4e-31
ref|XP_526214.1| PREDICTED: similar to ribonuclease P 14kDa subu...   133  2e-30
ref|XP_454424.1| unnamed protein product [Kluyveromyces lactis] ...   133  2e-30
emb|CAG88296.1| unnamed protein product [Debaryomyces hansenii C...   133  3e-30
/.../

In the last round we have a mix of fungal (POP8) sequences and Rpp14! This is a good indication that these proteins are related (homologs).

Page 57: Bioinformatics – Sequence analysis


Protein secondary structure elements

•  Alpha helix
•  Beta strand
•  Coil

•  Connected beta strands: beta sheet

•  Beta sheet forming a closed structure that may span a cell membrane (porin): beta-barrel

Page 58: Bioinformatics – Sequence analysis


PSIPRED prediction compared to 3D structure

Structure 2W9J fragment pos 1-91 of SRP14 S.pombe

Not in structure

N-term

Integral membrane proteins

The transmembrane regions are mostly α-helices (Beta-barrels also exist ...)

Length of TM domain approx 20 aa.

GPCRs

Page 59: Bioinformatics – Sequence analysis


Transmembrane prediction

•  25% of all proteins are membrane bound
•  By comparing known transmembrane proteins, programs like TMHMM make predictions about where the transmembrane regions are. “inside”, “TM” and “outside” = HMMs. Probabilities for these 3 are compared.
•  Similar approaches are used for signal peptide and transit peptide predictions, and even for gene predictions ...

TMHMM output (GPCR input)

Seven clearly defined TM domains.

C-terminal is inside cell.

Page 60: Bioinformatics – Sequence analysis


TMHMM output

RF47_[Guillardia   len=68  ExpAA=37.41  First60=32.65  PredHel=2  Topology=i2-19o47-64i
ORF74_[Odontella   len=74  ExpAA=39.05  First60=32.92  PredHel=2  Topology=i2-24o48-65i
ORF71_[Porphyra    len=71  ExpAA=36.0   First60=26.14  PredHel=2  Topology=i7-24o53-70i
ORF70_[Chlorella   len=70  ExpAA=38.67  First60=32.40  PredHel=2  Topology=i2-21o45-67i

-------------------------------------------------------------------------

PredHel=2 (= 2 TM dom) Topology=i2-21o45-67i inside-TRANSMEMBRANE-outside-TRANSMEMBRANE-inside
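The compact topology string can be split into helices and loop locations with a regular expression. This sketch relies only on the format visible in the output above (i = inside, o = outside, start-end ranges for the helices); `parse_topology` is an illustrative name, not part of TMHMM.

```python
import re

def parse_topology(topology):
    """Split a topology string such as 'i2-21o45-67i'.

    Returns (helices, loops): helices is a list of (start, end) tuples,
    loops lists the side of each loop in order from N- to C-terminus.
    """
    sides = {"i": "inside", "o": "outside"}
    helices = [(int(a), int(b)) for a, b in re.findall(r"(\d+)-(\d+)", topology)]
    loops = [sides[c] for c in re.findall(r"[io]", topology)]
    return helices, loops

helices, loops = parse_topology("i2-21o45-67i")
print("TM helices:", helices)
print("loops:", " / ".join(loops))
```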

Example in which scores for first TM domain are too low :

SignalP, signal peptide data

                          Eukaryotes
Total length (average)    22.6 aa
n-regions                 only slightly Arg-rich        pos charged
h-regions                 short, very hydrophobic       hydrophobic
c-regions                 short, no pattern             neutral, polar
-3,-1 positions           small and neutral residues
+1 to +5 region           no pattern

Blue: Positively charged residues
Red: Negatively charged residues
Green: Neutral polar residues
Black: Hydrophobic residues

Page 61: Bioinformatics – Sequence analysis


Signal peptide/anchor prediction by SignalP

>TXN4_HUMAN
Prediction: Signal peptide
Signal peptide probability: 0.984
Signal anchor probability: 0.015
Max cleavage site probability: 0.962 between pos. 29 and 30

Scores for the n-region, h-region, and c-region of the signal peptide plus cleavage prediction

SignalP: No signal peptide, but anchor

>sp_Q93127_GPR18_BALAM
Prediction: Signal anchor
Signal peptide probability: 0.000
Signal anchor probability: 0.969

No h- and c-regions no cleaved peptide

TM domain = anchor?

Page 62: Bioinformatics – Sequence analysis


SignalP: Non-secretory protein

>BM2K_HUMAN
Prediction: Non-secretory protein
Signal peptide probability: 0.157
Signal anchor probability: 0.023
Max cleavage site probability: 0.027 between pos. 28 and 29

TargetP: transit-peptide, signalpep

http://www.cbs.dtu.dk/services/TargetP/

Nuclear encoded proteins destined for a mitochondrion or plastid have a “transit-peptide” that directs the protein to the organelle.

Page 63: Bioinformatics – Sequence analysis


TargetP: transit-peptide prediction mito, chloro ...?