Sequence similarity search Glance to the protein world

Post on 21-Dec-2015

TRANSCRIPT

Page 1: Sequence similarity search Glance to the protein world

Sequence similarity search

Glance to the protein world

Page 2: Sequence similarity search Glance to the protein world

WHAT'S TODAY?

• BLASTing Proteins

- Similarity scores for protein sequences

- Advanced BLAST (PSI-BLAST)

Page 3: Sequence similarity search Glance to the protein world

Protein Sequence Alignment

Rule of thumb:
• Proteins are homologous if 25% identical (length > 100)
• DNA sequences are homologous if 70% identical

Page 4: Sequence similarity search Glance to the protein world

Protein Pairwise Sequence Alignment

• The alignment tools are similar to the DNA alignment tools
  – BLASTN for nucleotides
  – BLASTP for proteins

• Main difference: instead of scoring match (+2) and mismatch (-1) we have similarity scores:
  – Score s(i,j) > 0 if amino acids i and j have similar properties
  – Score s(i,j) ≤ 0 otherwise

• How should we score s(i,j)?

Page 5: Sequence similarity search Glance to the protein world

The 20 Amino Acids

Page 6: Sequence similarity search Glance to the protein world

Chemical Similarities Between Amino Acids

Acids & Amides DENQ (Asp, Glu, Asn, Gln)

Basic HKR (His, Lys, Arg)

Aromatic FYW (Phe, Tyr, Trp)

Hydrophilic ACGPST (Ala, Cys, Gly, Pro, Ser, Thr)

Hydrophobic ILMV (Ile, Leu, Met, Val)

Page 7: Sequence similarity search Glance to the protein world

Sequence Alignment based on AA similarity

TQSPSSLSASVGDTVTITCRASQSISTYLNWYQQKP----GKAPKLLIYAASSSQSGVPS
||    +    |||| +|| |||  |   +|         |     |    |       |
TQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADS

RFSGSGSGTDFTLTINSLQPEDFATYYCQ---------------QSYSTPHFSQGTKLEI
| |    | +| | | +|+ ||  || |+                + |  |  ||  | +
RRSLWDQG-NFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTL

---KRTVAAPSVFIFPPSDEQLKSGTASVVCLLN---------NFYPREAKVQWKVD
       ++|||    |  + ++ |    |  |               + ||++|+|
TLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKID

| = identity, + = similarity
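The | and + marks in an alignment like the one above can be produced mechanically. The sketch below is a simplification that assumes "similar" means "in the same chemical group" from the previous slide; BLAST itself prints + where the substitution-matrix score is positive. The names GROUPS and match_line are illustrative, not taken from any tool.

```python
# Chemical groups as listed on the "Chemical Similarities" slide.
GROUPS = ["DENQ", "HKR", "FYW", "ACGPST", "ILMV"]
GROUP_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}


def match_line(top, bottom):
    """Return the middle line of a pairwise alignment.

    '|' marks identical residues, '+' marks residues from the same
    chemical group, and a space marks everything else (including gaps).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            marks.append(" ")
        elif a == b:
            marks.append("|")
        elif a in GROUP_OF and GROUP_OF[a] == GROUP_OF.get(b):
            marks.append("+")
        else:
            marks.append(" ")
    return "".join(marks)


top = "GDTVTITC"
bottom = "GDTVELTC"
print(top)
print(match_line(top, bottom))
print(bottom)
```

The two short fragments are taken from the first alignment row; I and L fall in the same hydrophobic group, so that column gets a +.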

Page 8: Sequence similarity search Glance to the protein world

Amino Acid Substitutions Matrices

• When scoring protein sequence alignments it is common to use a matrix of 20 × 20, representing all pairwise comparisons:

-Score Matrix

-Substitution Matrix

Page 9: Sequence similarity search Glance to the protein world

Scoring Matrices

• Scoring Matrix - match/mismatch score
  – Not bad for similar sequences
  – Does not show distantly related sequences

• Substitution matrix
  – Scores residues dependent upon likelihood substitution is found in nature
  – More applicable for amino acid sequences

Page 10: Sequence similarity search Glance to the protein world

Given an alignment of closely related sequences we can score the relation between amino acids based on how frequently they substitute each other

M G Y D E
M G Y D E
M G Y E E
M G Y D E
M G Y Q E
M G Y D E
M G Y E E
M G Y E E

In the fourth column, E & D are found in 7/8 of the sequences

Substitution Matrix
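As a small illustration of the counting idea, the sketch below tallies the residues in one column of the eight-sequence alignment from this slide. The helper name column_counts is hypothetical.

```python
from collections import Counter

# The eight aligned sequences shown on the slide.
ALIGNMENT = [
    "MGYDE",
    "MGYDE",
    "MGYEE",
    "MGYDE",
    "MGYQE",
    "MGYDE",
    "MGYEE",
    "MGYEE",
]


def column_counts(alignment, index):
    """Count how often each residue occurs in one alignment column."""
    return Counter(sequence[index] for sequence in alignment)


counts = column_counts(ALIGNMENT, 3)  # fourth column, zero-based index 3
share_d_or_e = (counts["D"] + counts["E"]) / len(ALIGNMENT)
print(dict(counts), share_d_or_e)
```

Columns where two residues frequently replace each other, as D and E do here, are the evidence a substitution matrix turns into a high score.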

Page 11: Sequence similarity search Glance to the protein world

[Figure: chemical structures of aspartate (Asp, D) and glutamate (Glu, E)]

D / E

Page 12: Sequence similarity search Glance to the protein world

PAM - Point Accepted Mutations

• Developed by Margaret Dayhoff, 1978.
• Analyzed very similar protein sequences
• “Accepted” mutations – do not negatively affect a protein’s fitness
• Used global alignment.
• Counted the number of substitutions (i,j) per amino acid pair: Many i<->j substitutions => high score s(i,j)

Page 13: Sequence similarity search Glance to the protein world

Basic matrix: normalized probabilities multiplied by 10000

   Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
R    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
N    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
D    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
C    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Q    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
E   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
G   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
H    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
I    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
L    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
K    2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
M    1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
F    1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
P   13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
S   28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
T   22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
W    0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Y    1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
V   13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901

Page 14: Sequence similarity search Glance to the protein world

Log Odds Matrices

• PAM matrices converted to log-odds matrix
  – Calculate odds ratio for each substitution
    • Taking scores in previous matrix
    • Divide by frequency of amino acid
  – Convert ratio to log10 and multiply by 10
  – Take average of log odds ratio for converting A to B and converting B to A
  – Result: Symmetric matrix
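The conversion steps listed above can be sketched in a few lines. This is an illustration under stated assumptions, not Dayhoff's exact procedure: the function name, the choice to divide each probability by the frequency of the replacement residue, and the numbers in the example are all hypothetical.

```python
from math import log10


def log_odds_score(prob_a_to_b, prob_b_to_a, freq_a, freq_b):
    """Symmetric log-odds score for the amino acid pair (A, B).

    prob_a_to_b, prob_b_to_a: substitution probabilities (matrix entries
    divided by 10000). freq_a, freq_b: background frequencies.
    Assumption: each probability is divided by the frequency of the
    residue that appears after the substitution.
    """
    odds_a_to_b = prob_a_to_b / freq_b
    odds_b_to_a = prob_b_to_a / freq_a
    # Convert each ratio to log10, multiply by 10, then average both directions.
    average = (10 * log10(odds_a_to_b) + 10 * log10(odds_b_to_a)) / 2
    return round(average)


# Hypothetical values, chosen only to show the arithmetic.
score = log_odds_score(prob_a_to_b=0.10, prob_b_to_a=0.08, freq_a=0.05, freq_b=0.05)
print(score)
```

A positive score means the pair is seen more often than chance would predict; a negative score means it is seen less often.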

Page 15: Sequence similarity search Glance to the protein world

PAM250 Log odds matrix

Entry (i,i) is greater than any entry (i,j), j ≠ i.

Entry (i,j): the score of aligning amino acid i against amino acid j.

Similar amino acids have high scores

Page 16: Sequence similarity search Glance to the protein world

Selecting a PAM Matrix

• There are different PAM matrices (PAM1 to PAM250). PAM N is derived by multiplying the PAM1 matrix by itself N times

• Low PAM numbers: short sequences, strong local similarities.

• High PAM numbers: long sequences, weak similarities.
  – PAM120 recommended for general use (40% identity)
  – PAM60 for close relations (60% identity)
  – PAM250 for distant relations (20% identity)

• If uncertain, try several different matrices
  – PAM40, PAM120, PAM250 recommended
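Deriving a higher PAM matrix from PAM1 is repeated matrix multiplication. The sketch below uses a toy two-letter alphabet so the arithmetic stays visible; the matrix values and the names mat_mul and pam_n are hypothetical. Columns sum to 1, mirroring the probability matrix on the earlier slide, whose columns sum to 10000.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def pam_n(pam1, n):
    """Return pam1 multiplied by itself n times in total (n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Toy PAM1-like matrix for a two-letter alphabet (hypothetical values).
TOY_PAM1 = [
    [0.99, 0.02],
    [0.01, 0.98],
]

pam2 = pam_n(TOY_PAM1, 2)
pam250 = pam_n(TOY_PAM1, 250)
print(pam2[0][0], pam250[0][0])
```

The diagonal shrinks as N grows, which is why high PAM numbers suit distant relationships: after many steps a residue is less likely to have stayed the same.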

Page 17: Sequence similarity search Glance to the protein world

BLOSUM

• Blocks Substitution Matrix

– Steven Henikoff and Jorja G. Henikoff (1992)

• Based on BLOCKS database (www.blocks.fhcrc.org)

– Families of proteins with identical function

– Highly conserved protein domains

• Ungapped local alignment to identify motifs
  – Each motif is a block of local alignment
  – Counts amino acids observed in same column
  – Symmetrical model of substitution

AABCDA… BBCDA
DABCDA. A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA… BBCCC
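The column-counting step can be sketched as follows. This is a simplified illustration of BLOSUM-style scoring: it counts unordered residue pairs within each column of one ungapped block and converts observed versus expected pair frequencies to a score of 2·log2(observed/expected). It omits the clustering of similar sequences that distinguishes the different BLOSUMn matrices, and the toy block and function name are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def blosum_like_scores(block):
    """Score residue pairs from one ungapped block (list of equal-length strings)."""
    pair_counts = Counter()
    for column in zip(*block):
        # Count every unordered pair of residues seen in the same column.
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total_pairs = sum(pair_counts.values())
    observed = {pair: count / total_pairs for pair, count in pair_counts.items()}

    # Residue frequencies implied by the pair frequencies.
    residue_freq = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            residue_freq[a] += freq
        else:
            residue_freq[a] += freq / 2
            residue_freq[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        if a == b:
            expected = residue_freq[a] ** 2
        else:
            expected = 2 * residue_freq[a] * residue_freq[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


toy_block = ["AB", "AB", "AA"]
scores = blosum_like_scores(toy_block)
print(scores)
```

Because pairs are unordered, the score for (A, B) is the same as for (B, A), which is the symmetry mentioned in the last bullet.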

Page 18: Sequence similarity search Glance to the protein world

BLOSUM Matrices

• Different BLOSUMn matrices are calculated independently from BLOCKS

• BLOSUMn is based on blocks that are at most n percent identical.

Page 19: Sequence similarity search Glance to the protein world

Selecting a BLOSUM Matrix

• For BLOSUMn, higher n suitable for sequences which are more similar
  – BLOSUM62 recommended for general use
  – BLOSUM80 for close relations
  – BLOSUM45 for distant relations

Page 20: Sequence similarity search Glance to the protein world

Summary:

• BLOSUM matrices are based on the replacement patterns found in more highly conserved regions of the sequences, without gaps = local alignment

• PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions

BLAST uses BLOSUM62 as a default

REMEMBER! You can always change it

Page 21: Sequence similarity search Glance to the protein world

Gap penalty in protein alignments

• Scoring for gap opening & for extension

Depends on the substitution matrix used

• Default gap parameters are given for each matrix:

– PAM30: open=9, extension=1

– PAM250: open=14, extension=2
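To see what these parameters mean for a single gap, the sketch below computes an affine gap cost from the two defaults listed above. It assumes the convention that a gap of length k costs open + k × extension; other tools define the opening cost to include the first residue, so treat the formula as an assumption. The names GAP_DEFAULTS and gap_cost are illustrative.

```python
# Default (open, extension) pairs from the slide.
GAP_DEFAULTS = {
    "PAM30": (9, 1),
    "PAM250": (14, 2),
}


def gap_cost(matrix_name, gap_length):
    """Affine gap cost, assuming cost = open + gap_length * extension."""
    gap_open, gap_extend = GAP_DEFAULTS[matrix_name]
    return gap_open + gap_extend * gap_length


for name in GAP_DEFAULTS:
    print(name, [gap_cost(name, length) for length in (1, 3, 10)])
```

Opening a gap is expensive and extending it is cheap, so one long gap costs less than several short ones of the same total length.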

Page 22: Sequence similarity search Glance to the protein world

Remote homologues

• Sometimes BLAST isn’t enough.

• Large protein family, and BLAST only gives close members. We want more distant members

PSI-BLAST

Page 23: Sequence similarity search Glance to the protein world

PSI-BLAST

[1] Select a query and search it against a protein database

[2] PSI-BLAST constructs a multiple sequence alignment, then creates a “profile” or specialized position-specific scoring matrix (PSSM)

Page 138

Page 24: Sequence similarity search Glance to the protein world

R,I,K C D,E,T K,R,T N,L,Y,G

Page 25: Sequence similarity search Glance to the protein world

       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 M  -1 -2 -2 -3 -2 -1 -2 -3 -2  1  2 -2  6  0 -3 -2 -1 -2 -1  1
 2 K  -1  1  0  1 -4  2  4 -2  0 -3 -3  3 -2 -4 -1  0 -1 -3 -2 -3
 3 W  -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2  1 -4 -3 -3 12  2 -3
 4 V   0 -3 -3 -4 -1 -3 -3 -4 -4  3  1 -3  1 -1 -3 -2  0 -3 -1  4
 5 W  -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2  1 -4 -3 -3 12  2 -3
 6 A   5 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0
 7 L  -2 -2 -4 -4 -1 -2 -3 -4 -3  2  4 -3  2  0 -3 -3 -1 -2 -1  1
 8 L  -1 -3 -3 -4 -1 -3 -3 -4 -3  2  2 -3  1  3 -3 -2 -1 -2  0  3
 9 L  -1 -3 -4 -4 -1 -2 -3 -4 -3  2  4 -3  2  0 -3 -3 -1 -2 -1  2
10 L  -2 -2 -4 -4 -1 -2 -3 -4 -3  2  4 -3  2  0 -3 -3 -1 -2 -1  1
11 A   5 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0
12 A   5 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0
13 W  -2 -3 -4 -4 -2 -2 -3 -4 -3  1  4 -3  2  1 -3 -3 -2  7  0  0
14 A   3 -2 -1 -2 -1 -1 -2  4 -2 -2 -2 -1 -2 -3 -1  1 -1 -3 -3 -1
15 A   2 -1  0 -1 -2  2  0  2 -1 -3 -3  0 -2 -3 -1  3  0 -3 -2 -2
16 A   4 -2 -1 -2 -1 -1 -1  3 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2 -1
...
37 S   2 -1  0 -1 -1  0  0  0 -1 -2 -3  0 -2 -3 -1  4  1 -3 -2 -2
38 G   0 -3 -1 -2 -3 -2 -2  6 -2 -4 -4 -2 -3 -4 -2  0 -2 -3 -3 -4
39 T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -3 -2  0
40 W  -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2  1 -4 -3 -3 12  2 -3
41 Y  -2 -2 -2 -3 -3 -2 -2 -3  2 -2 -1 -2 -1  3 -3 -2 -2  2  7 -1
42 A   4 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0
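Unlike a 20 × 20 substitution matrix, a PSSM has one row of 20 scores per query position, so the same residue can score differently at different positions. The sketch below scores a three-residue segment against the first three rows of the matrix above; the function name pssm_score is hypothetical and gaps are not handled.

```python
# Column order of the matrix on the slide.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows 1-3 of the PSSM (query residues M, K, W).
PSSM_ROWS = [
    [-1, -2, -2, -3, -2, -1, -2, -3, -2, 1, 2, -2, 6, 0, -3, -2, -1, -2, -1, 1],
    [-1, 1, 0, 1, -4, 2, 4, -2, 0, -3, -3, 3, -2, -4, -1, 0, -1, -3, -2, -3],
    [-3, -3, -4, -5, -3, -2, -3, -3, -3, -3, -2, -3, -2, 1, -4, -3, -3, 12, 2, -3],
]


def pssm_score(segment, rows=PSSM_ROWS):
    """Sum the position-specific scores of an ungapped segment."""
    if len(segment) != len(rows):
        raise ValueError("segment length must match the number of PSSM rows")
    return sum(row[AMINO_ACIDS.index(residue)] for row, residue in zip(rows, segment))


for segment in ("MKW", "MEW", "AAA"):
    print(segment, pssm_score(segment))
```

At position 2 the profile scores E (4) above the query's own K (3), which reflects what the aligned family members contain at that position rather than the query residue alone.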


Page 27: Sequence similarity search Glance to the protein world

PSI-BLAST

[1] Select a query and search it against a protein database

[2] PSI-BLAST constructs a multiple sequence alignment, then creates a “profile” or specialized position-specific scoring matrix (PSSM)

[3] The PSSM is used as a query against the database

[4] PSI-BLAST estimates statistical significance (E values)

[5] Repeat steps [3] and [4] iteratively, typically 5 times. At each new search, a new profile is used as the query.

Page 138
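Step [2], turning a multiple sequence alignment into a profile, can be sketched as below. This is a deliberately simplified illustration: it assumes a uniform background frequency of 1/20, a fixed pseudocount and no sequence weighting, whereas PSI-BLAST uses more elaborate weighting and pseudocount schemes. The three aligned fragments are the conserved lipocalin region shown later in the deck; build_pssm is a hypothetical name.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, pseudocount=1.0):
    """Build a toy PSSM: one dict of residue -> score per alignment column."""
    background = 1 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            frequency = (counts[aa] + pseudocount) / total
            row[aa] = round(2 * log2(frequency / background))
        pssm.append(row)
    return pssm


aligned_hits = ["GTWYEI", "GKWYEV", "GTWYAM"]
pssm = build_pssm(aligned_hits)
print(pssm[0]["G"], pssm[1]["T"], pssm[1]["K"], pssm[1]["A"])
```

In a real iteration the resulting matrix would replace the query in the next database search, and the hits from that search would feed the next profile.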

Page 28: Sequence similarity search Glance to the protein world

Searching for remote homology using PSI-BLAST

Page 29: Sequence similarity search Glance to the protein world

The universe of lipocalins (each dot is a protein)

retinol-binding protein

odorant-binding protein

apolipoprotein D

β-lactoglobulin

Page 30: Sequence similarity search Glance to the protein world

Score = 46.2 bits (108), Expect = 2e-04
Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)

Query: 27   VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86
            V+ENFD  ++ G WY + +K P      + I A +S+ E G +    K         ++
Sbjct: 33   VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK---------ELS 82

Query: 87   ADMVGTF---------TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137
             D  GT          ++  +PAK +++++ +          +WI+ TDY+ YA+ YSC
Sbjct: 83   PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135

Query: 138  ----LLNLDGTCADSYSFVFSRDPNGLPPE 163
                L ++D      + ++  R+P  LPPE
Sbjct: 136  TFFWLFHVD------FFWILGRNPY-LPPE 158

PSI-BLAST alignment of RBP (retinol binding protein) and β-lactoglobulin: iteration 1

Example is taken from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.

Page 31: Sequence similarity search Glance to the protein world

PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 2

Score = 140 bits (353), Expect = 1e-32
Identities = 45/176 (25%), Positives = 78/176 (43%), Gaps = 33/176 (18%)

Query: 4    VWALLLLAAWAAAERDCRVSSF--------RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55
            V  L+ LA  A      +  +F         V+ENFD  ++ G WY + +K P      +
Sbjct: 2    VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60

Query: 56   NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF 112
             I A +S+ E G +    K       + D   + V      ++  +PAK +++++ +
Sbjct: 61   CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL--- 112

Query: 113  LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA 164
                   +WI+ TDY+ YA+ YSC     L ++D      + ++  R+P  LPPE
Sbjct: 113  --MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET 159

Page 32: Sequence similarity search Glance to the protein world

PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 3

Score = 159 bits (404), Expect = 1e-38
Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)

Query: 3    WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54
             V  L+ LA  A              +  S  V+ENFD  ++ G WY + K
Sbjct: 1    MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59

Query: 55   DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114
            + I A +S+ E G +    K          V  +     ++  +PAK +++++ +
Sbjct: 60   NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112

Query: 115  KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164
                 +WI+ TDY+ YA+ YSC            + ++  R+P  LPPE
Sbjct: 113  MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159

Page 33: Sequence similarity search Glance to the protein world

Iteration 3: Score = 159 bits (404), Expect = 1e-38, Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)

Iteration 1: Score = 46.2 bits (108), Expect = 2e-04, Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)

Page 34: Sequence similarity search Glance to the protein world

The universe of lipocalins (each dot is a protein)

retinol-binding protein

odorant-binding protein

apolipoprotein D

Page 35: Sequence similarity search Glance to the protein world

Scoring matrices let you focus on the big (or small) picture

retinol-binding protein

Page 36: Sequence similarity search Glance to the protein world

Scoring matrices let you focus on the big (or small) picture

retinol-binding protein

retinol-binding protein

PAM250

PAM30

Blosum45

Blosum80

Page 37: Sequence similarity search Glance to the protein world

PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM

retinol-binding protein

retinol-binding protein

Page 38: Sequence similarity search Glance to the protein world

PSI-BLAST

- PSI-BLAST is useful to detect weak but biologically meaningful relationships between proteins.

- The main source of false positives is the spurious amplification of sequences not related to the query.

- Once even a single spurious protein is included in a PSI-BLAST search above threshold, it will not go away.

Page 144

Page 39: Sequence similarity search Glance to the protein world

PSI-BLAST

Three approaches to prevent false positive results:

[1] Apply filtering

[2] Adjust E value to a lower value

[3] Visually inspect the output from each iteration. Remove suspicious hits.
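Approaches [2] and [3] both come down to deciding which hits are allowed into the next profile. The sketch below shows that selection step in isolation; the hit names, E values, thresholds and the function select_for_profile are all hypothetical and do not reflect PSI-BLAST defaults.

```python
def select_for_profile(hits, inclusion_threshold, excluded=()):
    """Keep hits whose E value is at or below the threshold and that were
    not removed by hand after inspection."""
    return [
        name
        for name, e_value in hits
        if e_value <= inclusion_threshold and name not in excluded
    ]


# Hypothetical search results: (hit name, E value).
hits = [("hit_a", 1e-38), ("hit_b", 2e-4), ("hit_c", 0.5)]

print(select_for_profile(hits, inclusion_threshold=0.005))
print(select_for_profile(hits, inclusion_threshold=1e-10))
print(select_for_profile(hits, inclusion_threshold=0.005, excluded={"hit_b"}))
```

A stricter threshold admits fewer sequences into the profile, trading sensitivity for a lower risk of a spurious hit being amplified in later iterations.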

Page 144

Page 40: Sequence similarity search Glance to the protein world

PHI-BLAST

Searching a specific sequence pattern with local alignments surrounding the match.

Page 145

PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology.

EXAMPLE: Search for a short sequence motif in the lipocalin family

Page 41: Sequence similarity search Glance to the protein world

PHI-BLAST

Given
1) protein sequence S
2) pattern P occurring in S,
PHI-BLAST helps answer the question: What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences?

Page 145

Page 42: Sequence similarity search Glance to the protein world

        1                                                   50
ecblc   MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc      MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD
hsrbp   ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD

Align three lipocalins (RBP and two bacterial lipocalins)

Page 43: Sequence similarity search Glance to the protein world

        1                                                   50
ecblc   MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc      MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD
hsrbp   ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD

GTWYEI
 K  AV
     M

Concentrate on the conserved region of interest and see which amino acid residues are used

Page 44: Sequence similarity search Glance to the protein world

        1                                                   50
ecblc   MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc      MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD
hsrbp   ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD

GTWYEI
 K  AV
     M

GXW[YF][EA][IVLM]

Create a pattern using the appropriate syntax
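The pattern occurrence step can be checked with an ordinary regular expression. The sketch below covers only the small subset of pattern syntax used on this slide (X for any residue, square brackets for alternatives) and only finds the pattern; it does not perform the local alignment around the match that PHI-BLAST adds. The function names are hypothetical.

```python
import re

# The three lipocalin fragments from the slide, with alignment padding kept.
ALIGNED = {
    "ecblc": "MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD",
    "vc": "MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD",
    "hsrbp": "~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD",
}


def ungap(aligned_sequence):
    """Remove spaces and the gap characters '.' and '~'."""
    return re.sub(r"[ .~]", "", aligned_sequence)


def pattern_to_regex(pattern):
    """Translate the slide's pattern syntax to a regex: X becomes '.'."""
    return re.compile(pattern.replace("X", "."))


motif = pattern_to_regex("GXW[YF][EA][IVLM]")
found = {}
for name, aligned_sequence in ALIGNED.items():
    match = motif.search(ungap(aligned_sequence))
    found[name] = match.group() if match else None

print(found)
```

All three sequences contain the motif, which is expected because the pattern was built from exactly these residues.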

Page 45: Sequence similarity search Glance to the protein world

Results