
Sequence motifs, correlations and structural mapping of

evolutionary data

Eran Eyal, March 2011

Talk overview

• Sequence profiles – position specific scoring matrix

• PSI-BLAST: an automated way to create and use sequence profiles in similarity searches

• Sequence patterns and sequence logos

• Bioinformatic tools which employ sequence profiles: PFAM, BLOCKS, PROSITE, PRINTS, InterPro

• Correlated mutations and structural insight

• Mapping sequence data on structures: conservation, correlations

PSSM – position specific scoring matrix

• A position-specific scoring matrix (PSSM) is a commonly used representation of motifs (patterns) in biological sequences.

• A PSSM lets us represent a multiple sequence alignment as a mathematical entity that we can work with.

• PSSMs enable the scoring of multiple alignments against sequences or against other PSSMs.

PSSM – position specific scoring matrix

Assuming a string S of length n:

$$S = s_1 s_2 s_3 \ldots s_n$$

If we want to score this string against our PSSM of length n (with n lines):

$$\text{alignment\_score} = \sum_{j=1}^{n} m_{s_j,\,j}$$

where m is the PSSM matrix and s_j are the string elements. PSSMs can also be incorporated into both dynamic programming algorithms and heuristic algorithms (like PSI-BLAST).
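The scoring rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `build_pssm` and `alignment_score` are hypothetical, the PSSM entries are log-odds values computed against a uniform background, and a pseudocount of 1 is added to every residue count. None of these choices comes from the slides.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(alignment, pseudocount=1.0):
    """Build one log-odds column per position of an ungapped alignment.

    Assumptions (not from the slides): uniform background frequencies
    and a simple additive pseudocount.
    """
    background = 1.0 / len(ALPHABET)
    total = len(alignment) + pseudocount * len(ALPHABET)
    pssm = []
    for j in range(len(alignment[0])):
        counts = Counter(seq[j] for seq in alignment)
        column = {}
        for aa in ALPHABET:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def alignment_score(pssm, s):
    """Sum the matrix entries m[s_j, j] over all positions j of the string s."""
    if len(s) != len(pssm):
        raise ValueError("string and PSSM must have the same length")
    return sum(pssm[j][s[j]] for j in range(len(s)))


# A tiny alignment built around the word PVAILL.
pssm = build_pssm(["PVAILL", "PVAIVL", "PLAILL"])
print(alignment_score(pssm, "PVAILL"), alignment_score(pssm, "GGGGGG"))
```

A string that resembles the aligned sequences gets a positive score, while an unrelated string such as GGGGGG gets a negative one.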

PSI-BLAST

• For a query sequence, use BLAST to find matching sequences.

• Construct a multiple sequence alignment from the hits to find the common regions (consensus).

• Use the “consensus” to search the database again and get a new set of matching sequences.

• Repeat the process!

Sequence space

Position-Specific-Iterated-BLAST

• Intuition

– Substitution matrices should be specific to sites and not global.

– Example: penalize alanine→glycine more in a helix

• Idea

– Use BLAST with high stringency to get a set of closely related sequences.

– Align those sequences to create a new substitution matrix for each position.

– Then use that matrix to find additional sequences.

• Cycling/iterative method

– Gives increased sensitivity for detecting distantly related proteins

– Can give insight into functional relationships

– Very refined statistical methods

• Fast and simple

Position-Specific-Iterated-BLAST

PSI-BLAST Principle

• First, a standard blastp search is performed.

• The highest scoring hits are used to generate a multiple alignment.

• A PSSM is generated from the multiple alignment.

• Another similarity search is performed, this time using the new PSSM.

• The previous steps are repeated until convergence (no new sequences appear after an iteration).
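The iterate-until-convergence loop can be illustrated with a toy profile search. This is not BLAST: it assumes ungapped sequences of equal length, a raw score threshold instead of E-values, a uniform background and a pseudocount of 1. All names (`build_pssm`, `score`, `iterative_search`) are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(seqs, pseudocount=1.0):
    """Log-odds PSSM from ungapped, equal-length sequences (uniform background)."""
    background = 1.0 / len(ALPHABET)
    total = len(seqs) + pseudocount * len(ALPHABET)
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in ALPHABET})
    return pssm


def score(pssm, s):
    """Sum of the PSSM entries selected by the residues of s."""
    return sum(col[c] for col, c in zip(pssm, s))


def iterative_search(query, database, threshold, max_rounds=10):
    """Rebuild the profile from the current hits until no new sequence appears."""
    hits = {query}
    for _ in range(max_rounds):
        pssm = build_pssm(sorted(hits))
        new_hits = hits | {s for s in database if score(pssm, s) >= threshold}
        if new_hits == hits:  # convergence: nothing new was pulled in
            break
        hits = new_hits
    return hits


database = ["ACDEFH", "ACDEYH", "WWWWWW"]
print(sorted(iterative_search("ACDEFG", database, threshold=4.0, max_rounds=1)))
print(sorted(iterative_search("ACDEFG", database, threshold=4.0)))
```

With a single round only the closest sequence (ACDEFH) passes the threshold. Once it is included in the profile, the more distant ACDEYH passes as well, which mirrors how iteration increases sensitivity.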

Sequence space

Example: Aminoacyl tRNA Synthetases

• Each is very different

– Aminoacyl tRNA synthetases differ widely in size, multimeric state, etc.

– But all bind to their own tRNAs and amino acids with high specificity.

• TrpRS and TyrRS share only 13% sequence identity

– Yet the structures of TrpRS and TyrRS are similar

– Structure-function relationship (see the ellipsoid slide from the previous lecture)

Same SCOP family based on catalytic domain

Overall structure similarity noted

• Given structural similarities, we would expect to find sequence similarity…

• However, blastp of E.coli TyrRS against bacterial sequences in SwissProt does NOT show similarity with TrpRS at e-value cutoff of 10

No TrpRS!

TrpRS Similarity to TyrRS!

After a few iterations…

PSI-BLAST

– Be sure to inspect and think about the results included in the PSSM build

– include/exclude sequences on basis of biological knowledge: you are in the driving seat!

– PSI-BLAST performance varies according to choice of matrix, filter, statistics and nature of data just like any other alignment tool.

Using PSI-BLAST

• PSI-BLAST available from BLAST web sites

• Query form just like for blastp

– BUT: one extra formatting option must be used

– A special e-value cutoff used to determine which alignments will be used for PSSM build.

– PSI-BLAST also available from the stand alone versions of BLAST.

Why (not) PSI-BLAST

• If the sequences used to construct the Position Specific Scoring Matrices (PSSMs) are true homologs, the sensitivity at a given specificity improves significantly.

• However, if non-homologous sequences are included in the PSSMs, the PSSMs are “corrupted”: they pull in further non-homologous sequences and amplify the errors in the next rounds.

• If all hits in the first rounds are highly similar, the predictive power of the new PSSM will not be significantly better than that of the original substitution matrix.

Query

Does the query really have a relationship with

the results?

PSI-BLAST caveat

• Increased ability to find distant homologues

• Cost: additional care is required to prevent non-homologous sequences from being included in the PSSM calculation

– When in doubt, leave it out!

– Examine sequences with moderate similarity carefully.

• Be particularly cautious about matches to sequences with highly biased amino acid content

– Low complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology

– Screen them out of your query sequences!

PSI-BLAST on the command line

• As with simple BLAST searches, using PSI-BLAST on the command line gives the user more power

• Opens up additional options, e.g.

– PSI-BLASTing over nucleotide databases

– automating the number of iterations

– trying out lots of different settings in parallel

– inputting multiple sequences

PFAM – Database of Protein families represented by HMM

The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. For each Pfam-A family Pfam builds a single curated profile hidden Markov model (HMM) from a seed alignment (a small set of representative members of the family). Pfam-B families have no associated annotation or literature reference and are of much lower quality than Pfam-A families.

Release 24.0 has 11912 families

For each Pfam accession there is a family page, which can be accessed in several ways.

• The HMMs are generated using the HMMER3 program, a new and efficient HMM builder.

• There is a new option to search single DNA sequences against the library of Pfam HMMs

• HMM models can be downloaded, as well as the multiple alignments of the seed and full alignments used to create the models.

Prosite

http://www.expasy.org/prosite

• PROSITE is a method for determining the function of uncharacterized proteins translated from genomic or cDNA sequences. It consists of a database of biologically significant sites and patterns, with which one can rapidly and reliably identify the known protein family (if any) to which a new sequence belongs.

• In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment, but it can be identified by the occurrence in its sequence of a particular cluster of residues which is variously known as a pattern, motif, signature, or fingerprint.

http://expasy.org/prosite/

What is a sequence pattern?

A pattern in our context is a protein WORD conserved in many sequences:

PVAILL

A pattern lets you identify a protein family.

Prosite patterns can describe complex signatures

[RK]-x-[ST]

This reads as follows:

“an Arginine or a Lysine, followed by any one residue, followed by a Serine or a Threonine”

C-[DES]-x-C-x(3)-I-x(3)-R-x(4)-P-x(4)-C-x(2)-C is a signature for Zn finger proteins which bind DNA

MALRAGLVLG FHTLMTLLSP QEAGATKADH MGSYGPAFYQ SYGASGQFTH EFDEEQLFSV DLKKSEAVWR LPEFGDFARF DPQGGLAGIA AIKAHLDILV ERSNRSRAIN VPPRVTVLPK SRVELGQPNI LICIVDNIFP PVINITWLRN GQTVTEGVAQ TSFYSQPDHL FRKFHYLPFV
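PROSITE patterns of the kind shown above map almost directly onto regular expressions. The sketch below is a minimal illustration, not ScanProsite's own implementation. The function name `prosite_to_regex` is hypothetical, and only the elements used on these slides (plus `{...}` exclusions and `x(m,n)` ranges) are handled. The anchors `<` and `>` are not supported.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as [RK]-x-[ST] into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split an element into its core and an optional repeat, e.g. x(3) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                       # x: any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # {AB}: any residue except A or B
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


print(prosite_to_regex("[RK]-x-[ST]"))
zn_finger = "C-[DES]-x-C-x(3)-I-x(3)-R-x(4)-P-x(4)-C-x(2)-C"
print(prosite_to_regex(zn_finger))

# Scan a made-up sequence for the short pattern.
for match in re.finditer(prosite_to_regex("[RK]-x-[ST]"), "AAKGSAA"):
    print(match.start(), match.group())
```

Short patterns such as [RK]-x-[ST] match very often by chance, which is one reason longer, more specific signatures are preferred for family assignment.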

Using PrositeScan

http://expasy.org/tools/scanprosite/

Using PROSITE-Scan: Structure

Prints

http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php

• PRINTS is a compendium of protein fingerprints, which are conserved motifs used to characterize a protein family.

• Release 41.1 of PRINTS contains 2050 entries, encoding 12,121 individual motifs.

• Two types of fingerprint are represented in the database: simple and composite. Simple fingerprints are essentially single motifs, while composite fingerprints encode multiple motifs.

• Most entries are of the latter type because discrimination power is greater for multi-component searches, and results are easier to interpret.

Direct PRINTS access:

• By accession number

• By PRINTS code

• By database code

• By text

• By sequence

• By title

• By number of motifs

• By author

• By query language

Sequence logos

• A sequence logo is a graphical representation of aligned sequences where at each position the size of each residue is proportional to its frequency in that position and the total height of all the residues in the position is proportional to the conservation (information content) of the position
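The letter heights of a logo follow from the column frequencies and the column information content. The sketch below assumes ungapped protein columns, measures information in bits as log2(20) minus the Shannon entropy, and applies no small-sample correction. The function name `logo_heights` is hypothetical.

```python
import math
from collections import Counter


def logo_heights(alignment, alphabet_size=20):
    """Return, for each column, a dict mapping residue -> letter height in bits."""
    heights = []
    for column in zip(*alignment):
        n = len(column)
        freqs = {aa: count / n for aa, count in Counter(column).items()}
        entropy = -sum(p * math.log2(p) for p in freqs.values())
        information = math.log2(alphabet_size) - entropy  # total stack height
        # Each letter takes a share of the stack proportional to its frequency.
        heights.append({aa: p * information for aa, p in freqs.items()})
    return heights


for position, column in enumerate(logo_heights(["PVAILL", "PVAIVL", "PLAILL"]), start=1):
    print(position, {aa: round(h, 2) for aa, h in column.items()})
```

A fully conserved column produces a single letter with the maximum height of log2(20) bits, while a variable column produces a shorter stack split between several letters.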

Blocks

• Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.

• The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins.

• Blocks is no longer updated. The last version of the database (14.3) is from 2007.

InterPro

http://www.ebi.ac.uk/interpro/

• InterPro is an integrated database of predictive protein "signatures" used for the classification and automatic annotation of proteins and genomes.

• It facilitates prediction for the occurrence of functional domains, repeats and important sites.

• InterPro combines a number of databases (referred to as member databases) that use different methodologies to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated database and diagnostic tool (InterProScan).

The member databases use a number of approaches:

- ProDom: provider of sequence clusters built from UniProtKB using PSI-BLAST.

- PROSITE patterns: provider of simple regular expressions.

- PROSITE and HAMAP profiles: providers of sequence matrices.

- PRINTS: provider of fingerprints, which are groups of aligned, un-weighted Position Specific Sequence Matrices (PSSMs).

- PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: providers of hidden Markov models (HMMs).

Entries typed Family contain signatures that cover all domains in the matching proteins and span >80% of the protein length with no adjacent signatures of type Domain or Region in >90% of the entry protein set. Entries typed Domain identify biological units with defined boundaries, which includes structural and functional domains as well as defined sub-domains.

Correlated mutations

• Approaches to detect residue coupling

• Applications

• Some debates regarding correlated mutations

[Figure: example MSA columns illustrating a conserved position, a tree determinant, and coupled positions (“correlated mutations”)]

Information extracted from multiple sequence alignment (MSA)

$$C_i = \frac{N_i^{\text{consensus}}}{N^{\text{sequences}}}$$

$$C_i = -\sum_{aa=1}^{20} p_i^{aa} \ln p_i^{aa}$$

So what can correlated mutations tell us, and where are they useful?

Basically every evolutionary constrained

Very useful for RNA folding

In proteins:

• Contact prediction

• Analysis of important interactions

• Analysis of allosteric paths and energetically coupled residues

• Correlation coefficient – Göbel et al. (1994), Proteins

• OMES – Kass and Horovitz (2002), Proteins

• SCA – Lockless and Ranganathan (1999), Science

• Mutual Information

http://bip.weizmann.ac.il/correlated_mutations/

[Figure: matrix of 400 × 400 amino acid pairs]

i, j – positions; w_s, w_t – sequence weights; s, t – sequences; s_i – amino acid found at position i of sequence s

P2P

[Figure: example MSA columns]

Instead of calculating correlations, we can derive universal scores for substitutions between amino acid pairs

Score (R,E → E,K)?

How to derive such a substitution matrix?

Eyal et al. (2007) Proteins, 67, 142-153

[Figure: example block of aligned sequences with sequence weights w1 … w7]

Blocks: small un-gapped multiple alignments

Advantages:

• Accurate alignments

• No gaps

• Sequence weights

Representative structures for each block

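The pair-to-pair (P2P) substitution matrix

$$M[xy][uv] = \ln\frac{f^{con}_{obs}[xy][uv]}{f^{con}_{exp}[xy][uv]} - \ln\frac{f^{nocon}_{obs}[xy][uv]}{f^{nocon}_{exp}[xy][uv]}$$

The first term (pairs in contact) is labeled Signal and the second term (pairs not in contact) is labeled Noise.

$$f^{con}_{obs}[xy][uv] = \frac{n^{con}_{obs}[xy][uv]}{\sum_{abcd} n^{con}_{obs}[ab][cd]}$$

$$f^{con}_{exp}[xy][uv] = \frac{n_{obs}[x][u]}{\sum_{ab} n_{obs}[a][b]} \cdot \frac{n_{obs}[y][v]}{\sum_{ab} n_{obs}[a][b]}$$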

Flipped pairs: XY → YX

Invariant pairs: XY → XY

[Figure: regions of the matrix for the substitution classes XY → XY, XY → XZ, XY → YX and XY → WZ]

What is the matrix useful for?

• Detect contact between amino acids when there is no structural data

• Evaluate structures

• Detect functionally/structurally important regions

P2P for contact prediction

Eyal et al. (2007) Proteins, 67, 142-153

P2P for contact prediction

Contact prediction in smaller proteins is easier

Galectin-7 (1bkz): all predictions are between the 2 β-sheets

β-lactamase II (1bc2): most predictions are around the metal binding site

P2P as a scoring function for structure evaluation

[Figure: residues i and j in contact, scored as S_ij using P2P]

Overall score:

$$S = \sum_{i,j\ \text{in contact}} S_{ij}$$

Advantages of the P2P method over other methods

• No need for large MSAs

• No need to construct evolutionary trees

• Naturally handles conservation and correlations

• Interactive web implementation


Considering correlations together improves contact prediction – the GARP approach

Frankel et al. (2007) BMC Bioinformatics, 67, 142-153

Considering also neighbors and “windows” of correlations may improve predictions of primary correlated mutations methods

Methods based on the evolutionary tree

ADSDDFGRLIILM

ADSDDFGRLIILL ADSDLFGVLIILM

ADSDLFGVLIILL ADTDLFGVLIILM ADSDDFGRLIILL GDTDDFGRLIILM

2 mutation events

Methods based on the evolutionary tree

ADSDDFGRLIILM

ADSDDFGRLIILL ADSDDFGRLIILM

ADSDLFGVLIILL ADTDDFGRLIILM ADSDLFGVLIILL GDTDDFGRLIILM

2 mutation events

2 mutation events

GDTDDFGRLIILM ADSDDFGRLIILL ADTDLFGVLIILM ADSDLFGVLIILL

Although the same multiple alignment is obtained in both cases, an evolutionary history of multiple independent events is a much stronger indication of real coupling.

Methods based on evolutionary tree:

• Pagel M. (1994) Proc R Soc Lond

• Pollock D, Taylor W. (1997) Protein Eng

• Pollock D et al. (1999) J Mol Biol

• Tuffery and Darlu (2000) Mol Biol Evol

• Fleishman S et al. (2004) J Mol Biol

• Noivirt et al. (2005) Protein Eng

Statistical coupling

Lockless and Ranganathan (1999), Science, 286, 295-299

[Figure: the full alignment (MSA) next to the sub-alignment MSA|δj, which keeps only the sequences selected by a perturbation at position j; columns i and j are marked]

For every selected j we can measure the coupling to all other sites i:

$$\Delta\Delta G^{stat}_{i,j} = kT^{*} \sqrt{\sum_{aa=1}^{20} \left( \ln\frac{p^{aa}_{i|\delta j}}{p^{aa}_{MSA|\delta j}} - \ln\frac{p^{aa}_{i}}{p^{aa}_{MSA}} \right)^{2}}$$

Statistical coupling: PDZ domain

Lockless and Ranganathan (1999), Science, 286, 295-299

Cooperativity in WW domains

Russ et al. (2005), Nature, 437, 579-583

Studies using SCA

• Estabrook et al. (2005), PNAS – methyltransferases

• Marcelino et al. (2006), Proteins – intracellular lipid binding proteins (iLBPs)

• Swain et al. (2006), Curr Opin Str Biol – HSP70 chaperones

• Chen et al. (2006), JBC – Cys loop ligand-gated ion channels

• Dima and Thirumalai (2006), Protein Sci – Selectins

• Ferguson et al. (2007), PNAS – TonB-dependent transporters

• Yu et al. (2007), Biophys J – DNA helicases

• Lee et al. (2008), Science – PAS-DHFR

• Hsu and Traugh (2010), PLoS One – protein kinase Pak2

Is correlated mutations analysis really meaningful?

Which are the leading methods?

Fodor and Aldrich (2004), Proteins

Halperin et al. (2006), Proteins

Can correlated mutations reveal allosteric pathways?

[Figure: panels A and B, networks of residues 1–6 shown with the corresponding 6 × 6 correlation matrices]

Fodor and Aldrich (2004), JBC, 279, 19046-19050

Can CM detect interactions between different molecules?

Halperin et al. (2006), Proteins

intra inter

The main problems in detecting inter protein coupling

• Correct selection of paralogs

• Basic assumptions: are interfaces conserved? Do all pairs interact?

• Smaller number of protein complexes and data about interfaces for testing/training

Covariance analysis of Glutamate transporter

• Covariance analysis was performed using different methods

• 989 sequences were extracted from PFAM for the sodium-dicarboxylate symporter family (PF00375).

• The alignment was modified so that the reference numbering is that of the human EAAT1 protein.

• Hierarchical clustering was used to analyze the matrices

[Figure: clustered covariance matrices. Cluster labels: residues from TM4c, residues from TM2, residues from TM4a, residues in the core region. Region labels: core region, interface regions]

Is there a real connection between CM and energetically coupled residues?

Works for PDZ domain

Not so fast….

Bouncing back

Different correlated mutations methods are appropriate for different tasks

Did Fodor implement SCA appropriately?

Dima and Thirumalai (2006), Protein Sci 15, 258-268

• The number of sequences remaining after the perturbation should be considered.

• The matrix is not symmetrical, but was treated as such by Fodor.

[Figure panels: McBASC, P2P, OMES, MI, SCA]

Correlations in HIV-1 Protease

• Cleavage of premature polypeptides to form the proteins required by the virus

• A major drug target in AIDS therapies

• Exhibits multi-drug resistance

• Large amount of data available

– sequence databases

– many solved structures

– clinical information and known drug resistant mutations

Dataset and MSAs

Data source: http://hivdb.stanford.edu/

IDV = indinavir, protease inhibitor

NFV = nelfinavir, protease inhibitor

Mutual Information

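• Mutual information measures the dependence between two random variables

• Suppose Xi and Xj are two random variables. The mutual information between Xi and Xj is defined as

$$I(X_i, X_j) = \sum_{x_i} \sum_{x_j} P(x_i, x_j) \log \frac{P(x_i, x_j)}{P(x_i)\,P(x_j)}$$

where xi and xj are specific amino acid types, P(xi, xj) is the joint probability and P(xi), P(xj) are the singlet (marginal) probabilities.

Two Extreme Cases

• When Xi and Xj are independent, P(xi, xj) = P(xi) P(xj) and I(Xi, Xj) = 0.

• When Xi and Xj are perfectly coupled (each value of Xi determines the value of Xj and vice versa), I(Xi, Xj) reaches its maximum, the entropy of Xi.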

Mutual Information Matrix

• By calculating I(Xi, Xj) for all pairs of i, j, we obtain an N×N mutual information matrix W with element I(Xi, Xj).

• In our case N = 99.
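A mutual information matrix of this kind can be computed directly from the alignment columns. The sketch below estimates probabilities as plain column frequencies, uses natural logarithms, and applies no sequence weighting or finite-sample correction. These are simplifying assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """I(Xi, Xj) estimated from two alignment columns of equal length."""
    n = len(col_i)
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in count_ij.items():
        joint = c / n
        mi += joint * math.log(joint / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


def mi_matrix(msa):
    """N x N matrix W with W[i][j] = I(Xi, Xj) for an MSA with N columns."""
    columns = list(zip(*msa))
    n = len(columns)
    return [[mutual_information(columns[i], columns[j]) for j in range(n)]
            for i in range(n)]


# Columns 0 and 1 change together (R with E, K with D); column 2 is invariant.
W = mi_matrix(["REV", "REV", "KDV", "KDV"])
for row in W:
    print([round(value, 3) for value in row])
```

In this toy alignment the coupled pair of columns reaches ln 2, the entropy of either column, while any pair involving the invariant column gives zero, matching the two extreme cases.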

Clustering

Why do clustering?

– The origin of the correlations is not always pairwise, but most available statistical methods are based on pairwise metrics. Clustering helps in detecting more integrated patterns.

– Enhance signal over noise (S/N ratio) (Noivirt et al., Protein Eng. Des. Sel., 2005)

Spectral Clustering

Scheme A Scheme B

• A graph segmentation algorithm

(Shi and Malik, IEEE Trans, 2000)

Cut = S….

Spectral Clustering

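• Minimize the normalized cut between two groups A and B:

$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)}$$

• assoc(A, V) is the total weight of the connections from the nodes in A to all nodes in the graph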

Back to the Protein

• Each column in the MSA corresponds to a residue, which in turn is represented as a node in the graph.

• The mutual information is the weight of the edge between nodes i and j.

mutual information matrix = weight matrix

Residue i Residue j

mutual information between Xi and Xj

Spectral Clustering

• The problem reduces to solving a generalized eigenvalue problem

$$(D - W)\,y = \lambda D\,y$$

where D is a diagonal matrix with elements $D_{ii} = \sum_j W_{ij}$ and W is the mutual information weight matrix

• The eigenvector with the first nonzero eigenvalue (the second smallest eigenvalue overall) is used to bi-partition the nodes
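The bi-partition step can be sketched without any numerical library. Instead of calling an eigen-solver, the illustration below uses power iteration with deflation of the trivial eigenvector. That choice is a substitution made here to keep the example self-contained, not the method of Shi and Malik or of the study on these slides. It assumes a symmetric weight matrix with positive row sums, and the function name `spectral_bipartition` is hypothetical.

```python
import math


def spectral_bipartition(W, iterations=200):
    """Split nodes in two using the second generalized eigenvector of (D - W) y = lambda D y."""
    n = len(W)
    d = [sum(row) for row in W]                 # D_ii = sum_j W_ij
    s = [1.0 / math.sqrt(x) for x in d]         # diagonal of D^(-1/2)
    # B = I + D^(-1/2) W D^(-1/2) has the eigenvectors of the normalized
    # Laplacian, with eigenvalues 2 - lambda, so small lambda becomes large.
    B = [[(1.0 if i == j else 0.0) + s[i] * W[i][j] * s[j] for j in range(n)]
         for i in range(n)]
    # Trivial eigenvector (lambda = 0) is D^(1/2) 1; it is projected out below.
    z0 = [math.sqrt(x) for x in d]
    norm0 = math.sqrt(sum(x * x for x in z0))
    z0 = [x / norm0 for x in z0]

    def project(v):
        dot = sum(a * b for a, b in zip(v, z0))
        v = [a - dot * b for a, b in zip(v, z0)]
        norm = math.sqrt(sum(a * a for a in v))
        return [a / norm for a in v]

    z = project([float(k) for k in range(n)])   # deterministic starting vector
    for _ in range(iterations):
        z = project([sum(B[i][j] * z[j] for j in range(n)) for i in range(n)])
    y = [s[i] * z[i] for i in range(n)]         # back to the generalized problem
    return [value >= 0 for value in y]


# Two tightly connected groups of three nodes with weak links between them.
strong, weak = 1.0, 0.01
W = [[0.0 if i == j else (strong if (i < 3) == (j < 3) else weak) for j in range(6)]
     for i in range(6)]
print(spectral_bipartition(W))
```

The sign of each entry of the eigenvector assigns the node to one of the two groups. For a real mutual information matrix a library eigen-solver would normally be used, and convergence of the power iteration depends on the gap between the second and third eigenvalues.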

Sequence Correlation Matrix and its Permutation Based on Clustering

[Figure panels: treated data, untreated data]

Results

• Two clusters were distinguished based on the spectral clustering procedure

• One of the clusters (blue) contains residues known to be involved in multi-drug resistance

• The other cluster (red) contains residues that exhibit substantial sequence variability between subtypes of HIV.

Gonzales et al., J. Infect. Dis., 2001

Cooperative Coupling

Relation between Sequence Variability and Protein Dynamics (GNM)

Relation between Sequence Variability and Protein Dynamics (cont)

[Figure: the two clustering partitions compared with residue mobilities]

• Covariance analysis detects the drug-resistance mutations and their cooperativity in HIV-1 protease in agreement with experimental data.

• Clustering techniques can be applied to analyse the data.

• A relationship is elucidated between coevolving residue clusters and the collective dynamics of the protease.

Correlated mutations - summary

• Correlated mutations analysis is a simple tool to detect coupling between residues.

• The tremendous amount of available sequences makes it more attractive

• Current methods can assist in the detection of close tertiary contacts

• Depending on the application, different CM methods should be applied

• A relation between sets of correlated paths has been suggested, but not always in a consistent and convincing way.

• A relation between CM and free energy has been suggested, but it has been shown not to hold on a consistent basis

Correlated mutations - future directions

• Improve MSA – filtering out sequences

• N-body correlations instead of pair-wise correlations

• Improved clustering techniques

• ConSurf is a tool developed at TAU for mapping conservation scores onto protein (and, recently, nucleic acid) structures.

• Detailed understanding of the mechanism of biological processes requires the identification of functionally important amino acids at the protein surface that are responsible for these interactions

• ConSurf server is a useful and user-friendly tool that enables the identification of functionally important regions on the surface of a protein of known three-dimensional structure, based on the evolutionary analysis.

ConSurf – mapping conservation scores on 3D structures

Ashkenazy H., Erez E., Martz E., Pupko T. and Ben-Tal N. (2010) ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids. Nucl. Acids Res.

http://consurf.tau.ac.il/

• Given the 3D structure of a protein or a domain as input, ConSurf extracts the sequence from the PDB.

• It then carries out a search for close homologous sequences of the protein of known structure using PSI-BLAST.

• Multiple sequence alignment is done using MUSCLE or CLUSTALW. The multiple sequence alignment is used to build a phylogenetic tree.

• Conservation scores are calculated based on Bayesian or Maximum Likelihood method.

• The protein, with the conservation scores color-coded onto its surface, can finally be visualized on-line using Jmol.

ConSurf – mapping conservation scores on 3D structures
