1-month practical course genome analysis 2008 lecture 3: profiles: representing sequence alignment

Post on 11-Jan-2016


1-month Practical Course Genome Analysis 2008

Lecture 3: Profiles: representing sequence alignment

Centre for Integrative Bioinformatics VU (IBIVU)
Vrije Universiteit Amsterdam
The Netherlands
ibivu.nl
heringa@cs.vu.nl


Alignment input parameters: scoring alignments

Amino acid exchange matrix (20 × 20)

Gap penalties (open, extension), e.g. 10 and 1

A number of different schemes have been developed to compile residue exchange matrices

However, there are no formal concepts to calculate corresponding gap penalties

Empirically determined values are recommended for PAM250, BLOSUM62, etc.

But how can we align blocks of sequences?

[Figure: sequence blocks AB, CD and ABCD, and a single sequence E, with a question mark over how to align them]

The dynamic programming algorithm performs well for pairwise alignment (two axes).

So we should try to treat the blocks as a “single” sequence …

How to represent a block of sequences

Historically: consensus sequence, a single sequence that best represents the amino acids observed at each alignment position.

Modern methods: alignment profile, a representation that retains the information about the frequencies of amino acids observed at each alignment position.

Consensus sequence

Problem: loss of information

For larger blocks of sequences it “punishes” more distant members

Sequence 1  F A T N M G T S D P P T H T R L R K L V S Q

Sequence 2  F V T N M N N S D G P T H T K L R K L V S T

Consensus   F * T N M * * S D * P T H T * L R K L V S *
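The consensus line above can be reproduced with a few lines of Python. This is a minimal sketch, not the course's own code: the function name `consensus` and the rule "emit the residue only when all sequences agree, otherwise `*`" are assumptions chosen to match the two-sequence example.

```python
def consensus(sequences, wildcard="*"):
    """Return a consensus string for equal-length, already aligned sequences.

    Assumed rule: a column contributes its residue only when every
    sequence has the same residue there; otherwise the wildcard is used.
    """
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must have equal length")
    out = []
    for column in zip(*sequences):  # iterate over alignment columns
        out.append(column[0] if len(set(column)) == 1 else wildcard)
    return "".join(out)


seq1 = "FATNMGTSDPPTHTRLRKLVSQ"
seq2 = "FVTNMNNSDGPTHTKLRKLVST"
print(consensus([seq1, seq2]))  # F*TNM**SD*PTHT*LRKLVS*
```

Every `*` discards which residues were actually observed at that position, which is the loss of information noted above.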

Alignment profiles

Advantage: full representation of the sequence alignment (more information retained)

Not only used in alignment methods, but also in sequence-database searching (to detect distant homologues)

In BLAST, such a profile is called a PSSM (position-specific scoring matrix)

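Multiple alignment profiles

[Figure: profile matrix with one row per amino acid (A, C, D, ..., W, Y) and a gap row (-); each alignment position i stores the residue frequencies fA, fC, fD, ..., fW, fY together with its own gap penalties (gapo, gapx)]

Position-dependent gap penalties

[Figure: two core regions separated by a gapped region, each with its own residue frequencies and gap penalties (gapo, gapx)]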

Profile building

Example: each amino acid is represented as a frequency, and gap penalties as weights.

Residue   Position 1   Position 2   Position 3
A         0.3          0.5          0
C         0.1          0            0.5
D         0            0            0.2
W         0.3          0            0.1
Y         0.3          0.5          0.2

Position-dependent gap penalties: 0.5, 1.0, 1.0
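A frequency profile of this kind can be derived directly from the columns of a multiple alignment. The sketch below is illustrative only: `build_profile` and the four-sequence toy alignment are hypothetical, and counting the gap character as its own symbol is an assumption (the slides do not say how gaps enter the frequencies). Gap penalty weights are not computed here.

```python
from collections import Counter


def build_profile(alignment):
    """Return one {symbol: frequency} dict per alignment column.

    Assumption: the gap character '-' is counted like any other symbol.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({sym: n / len(column) for sym, n in counts.items()})
    return profile


# Hypothetical toy alignment, not taken from the lecture.
toy_alignment = ["ACDW", "ACDY", "A-DW", "YCDW"]
profile = build_profile(toy_alignment)
for i, column in enumerate(profile, start=1):
    print(i, column)
```

Each column's frequencies sum to 1, so the profile keeps the full residue distribution that a consensus sequence would collapse to a single character.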

Profile-sequence alignment

[Figure: a profile (residue rows A, C, D, ..., V, W, Y) aligned against a sequence]

Sequence to profile alignment

Alignment column: A A V V L

Profile position frequencies: 0.4 A, 0.2 L, 0.4 V

Score of amino acid L in a sequence that is aligned against this profile position:

Score = 0.4 * s(L, A) + 0.2 * s(L, L) + 0.4 * s(L, V)
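As a worked instance of this formula, the sketch below scores residue L against the profile position with frequencies 0.4 A, 0.2 L, 0.4 V. The function names and the small `EXCHANGE` table are assumptions made for illustration; the three exchange values are chosen to resemble BLOSUM62 entries and should be checked against whichever matrix is actually used.

```python
# Illustrative exchange scores s(x, y). Assumed values resembling BLOSUM62;
# verify them against the real matrix before relying on the result.
EXCHANGE = {("L", "A"): -1, ("L", "L"): 4, ("L", "V"): 1}


def s(x, y):
    """Symmetric lookup in the (partial) exchange table."""
    if (x, y) in EXCHANGE:
        return EXCHANGE[(x, y)]
    return EXCHANGE[(y, x)]


def residue_vs_profile(residue, column_freqs):
    """Frequency-weighted exchange score of one residue against one profile position."""
    return sum(freq * s(residue, aa) for aa, freq in column_freqs.items())


column = {"A": 0.4, "L": 0.2, "V": 0.4}
score = residue_vs_profile("L", column)
print(round(score, 2))  # 0.8 with the assumed exchange values
```

With these assumed values the sum is 0.4 * (-1) + 0.2 * 4 + 0.4 * 1 = 0.8.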

Profile-profile alignment

[Figure: two profiles aligned against each other, with residue rows A, C, D, ..., V, W, Y]

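General function for profile-profile scoring

At each position (column) we have different residue frequencies for each amino acid (rows).

For pairwise alignment the score is simply S = s(aa1, aa2). For comparing two profile positions we instead take:

[Figure: Profile 1 and Profile 2, each with residue rows A, C, D, ..., Y]

$$S = \sum_{i=1}^{20} \sum_{j=1}^{20} f_{aa_i} \, f_{aa_j} \, s(aa_i, aa_j)$$

where f_{aa_i} is the frequency of amino acid aa_i at the position in profile 1 and f_{aa_j} is the frequency of amino acid aa_j at the position in profile 2.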

Profile to profile alignment

Profile 1 position: 0.4 A, 0.2 L, 0.4 V

Profile 2 position: 0.75 G, 0.25 S

Match score of these two alignment columns using the amino acid frequencies at the corresponding profile positions:

Score = 0.4*0.75*s(A,G) + 0.2*0.75*s(L,G) + 0.4*0.75*s(V,G)
      + 0.4*0.25*s(A,S) + 0.2*0.25*s(L,S) + 0.4*0.25*s(V,S)

s(x,y) is the value in the amino acid exchange matrix (e.g. PAM250, BLOSUM62) for the amino acid pair (x,y).
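The column-against-column score can be computed as a double sum over the residues present at the two profile positions. The following is a minimal sketch under stated assumptions: `column_vs_column` and the `EXCHANGE` table are hypothetical names, and the six exchange values are chosen to resemble BLOSUM62 entries rather than taken from the lecture.

```python
# Illustrative exchange scores s(x, y). Assumed values resembling BLOSUM62;
# verify them against the real matrix before relying on the result.
EXCHANGE = {
    ("A", "G"): 0, ("L", "G"): -4, ("V", "G"): -3,
    ("A", "S"): 1, ("L", "S"): -2, ("V", "S"): -2,
}


def s(x, y):
    """Symmetric lookup in the (partial) exchange table."""
    if (x, y) in EXCHANGE:
        return EXCHANGE[(x, y)]
    return EXCHANGE[(y, x)]


def column_vs_column(freqs1, freqs2):
    """Sum of f1(a) * f2(b) * s(a, b) over all residue pairs (a, b).

    Residues with zero frequency contribute nothing, so only the residues
    listed in each dict need to be visited.
    """
    return sum(
        f1 * f2 * s(a, b)
        for a, f1 in freqs1.items()
        for b, f2 in freqs2.items()
    )


profile1_pos = {"A": 0.4, "L": 0.2, "V": 0.4}
profile2_pos = {"G": 0.75, "S": 0.25}
score = column_vs_column(profile1_pos, profile2_pos)
print(round(score, 2))  # -1.7 with the assumed exchange values
```

With these assumed values the six terms are 0, -0.6, -0.9, 0.1, -0.1 and -0.2, giving -1.7. A position holding a single residue with frequency 1 in each profile reduces the double sum to the ordinary pairwise score s(x,y).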
