
Reminder: Structure of a genome

Human genome: 3 x 10^9 bp

~30,000 genes, ~200,000 exons, ~23 Mb coding, ~15 Mb noncoding

[Figure: a gene → (transcription) → pre-mRNA → (splicing) → mature mRNA → (translation) → protein]

Sequence Alignment

We assume a link between the linear information stored in DNA, RNA or amino-acid sequence and the protein function determined by its three dimensional structure.

We want to compare the linear sequences of various genes in order to deduce function, phylogeny, structure, origin…

Similarity that reflects descent from a common ancestor is called homology

The Problem

Biological problem: finding a way to compare and represent similarity or dissimilarity between biomolecular sequences (DNA, RNA or amino acid)

Computational problem: finding a way to perform inexact or approximate matching of subsequences within strings of characters

Statistical problem: how to estimate the validity of our results

Course plan (for the next three weeks)

• Details of biology
• Estimate of computation time
• Dynamic programming algorithm for full and local alignment
• Statistical analysis of results
• Dot matrices and heuristics for alignment
• Distance matrices and information theory (MSA)

Homology

Similarity due to descent from a common ancestor

Homologous sequences can be identified through sequence alignment

Thus, possible to predict/infer structure or function from primary sequence analysis

Gaps

Sequences may have diverged from a common ancestor through mutations:
• Substitution (AAGC → AAGT)
• Insertion (AAG → AAGT)
• Deletion (AAGC → AAG)

The latter two operations result in gaps ( _ )

K contiguous spaces = gap of length K ( K > 0 )

Similarity and Alignment

Similarity has two aspects:

Quantitative aspect: a similarity measure. A number that represents the degree of similarity. Example: a score indicating a 10% match between 2 DNA sequences.

Qualitative aspect: an alignment. A mutual arrangement of two sequences that shows where the two sequences are similar and where they differ. An optimal alignment is one that exhibits the most correspondences and the fewest differences.

Example:

a b c d e _ h z
a b w d e f h _
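To make the quantitative aspect concrete, here is a minimal sketch that turns an alignment into a percentage of matching columns. The function name `percent_identity` and the choice of dividing by the full alignment length (gap columns included) are assumptions for illustration; other conventions divide by the shorter sequence length instead.

```python
def percent_identity(top: str, bottom: str, gap_chars: str = "-_") -> float:
    """Percentage of alignment columns in which both rows carry the same letter.

    Assumption: the denominator is the number of alignment columns,
    so columns containing a gap count as differences.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    if not top:
        raise ValueError("alignment must contain at least one column")
    matches = sum(
        1 for a, b in zip(top, bottom) if a == b and a not in gap_chars
    )
    return 100.0 * matches / len(top)


# The example alignment from the slide, written without the spaces.
print(percent_identity("abcde_hz", "abwdefh_"))  # 5 matching columns out of 8
```

For the slide's example, five of the eight columns match (a, b, d, e, h), so the measure is 62.5%.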

The Edit Distance between two strings

Definition:

The edit distance between two strings is defined as the minimum number of edit operations – insertions, deletions, & substitutions – needed to transform the first string into the second. For emphasis, note that matches are not counted.

Example: AATT and AATG

Distance = 1 (edit operation of substitution)
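The definition can be checked with a short sketch. It uses the standard table-filling approach (the dynamic programming idea introduced later in these notes) with unit cost for every insertion, deletion and substitution; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b.

    Matches cost nothing, as in the definition above.
    """
    # prev[j] is the distance between the prefix of a handled so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(edit_distance("AATT", "AATG"))  # one substitution, as in the example
```

Only two rows of the table are kept at a time, which is enough when the distance itself is wanted and not the sequence of operations.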

String alignment

An edit transcript is a way to represent a particular transformation of one string into another. It emphasizes point mutations in the model.

An alignment displays a relationship between two strings. Global alignment means that, for each string, the entire string is involved in the alignment.

Examples:

(1)  A A G C A
     A A _ C _

(2)  GSAQVKGHGKKVADAL ….
     ++ ++++H+ KV +   ….
     NNPELQAHAGKVFKLV ….

Alignment vs. Edit Transcript

Essentially equivalent:
• Two opposing characters that differ in an alignment = a substitution in the edit transcript
• A gap or space in the first string of an alignment = an insertion of the opposing character
• A gap or space in the second string = a deletion of the opposing character

Alignment vs. edit transcript: product vs. process

Gap cost or penalty functions

Observation: a gap of length k is more probable than k gaps of length 1
• Cause might be a single mutational event
• Separated gaps probably arose due to different events

Gap penalty functions:
• Linear cost: treats both cases uniformly
• Common alternative: a higher cost h for opening a gap and a lower cost g for extending it
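The two penalty schemes can be compared with a small sketch. The function names are hypothetical, the default values 8 and 3 are borrowed from the example that follows, and the convention that the opening cost h covers the first gap position is an assumption (some texts charge h plus g for the first position).

```python
def linear_gap_cost(k: int, d: int = 8) -> int:
    """Linear cost: every gap position costs d, so one long gap costs
    the same as k separate gaps of length 1."""
    if k < 1:
        raise ValueError("gap length must be positive")
    return d * k


def affine_gap_cost(k: int, h: int = 8, g: int = 3) -> int:
    """Open/extend cost: h for the first position, g for each further one.

    Assumption: h already pays for the first gap position.
    """
    if k < 1:
        raise ValueError("gap length must be positive")
    return h + g * (k - 1)


for k in (1, 2, 3):
    print(k, linear_gap_cost(k), affine_gap_cost(k), k * affine_gap_cost(1))
```

With an opening cost larger than the extension cost, one gap of length k is always cheaper than k separate gaps of length 1, which is the behaviour the observation asks for.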

Pairwise Sequence Alignment

Example

Which one is better?

Sequences: HEAGAWGHEE and PAWHEAE

Alignment 1:
HEAGAWGHE-E
P-A--W-HEAE

Alignment 2:
HEAGAWGHE-E
--P-AW-HEAE

Example

Scoring matrix:

      A    E    G    H    W
A     5   -1    0   -2   -3
E    -1    6   -3    0   -3
H    -2    0   -2   10   -3
P    -1   -1   -2   -2   -4
W    -3   -3   -3   -3   15

• Gap penalty: -8

• Gap extension: -3

HEAGAWGHE-E
P-A--W-HEAE

HEAGAWGHE-E
--P-AW-HEAE

Score of the second alignment, column by column, charging -8 for every gap position:

(-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1

Exercise: Calculate for
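The column-by-column scoring of this example can be sketched in a few lines. The sketch assumes every gap position is charged -8 (the extension value -3 is not used) and sums all eleven columns of the alignment; `SCORES` and `alignment_score` are names introduced here for illustration.

```python
# Scoring matrix from the example: outer key = letter of the lower sequence,
# inner key = letter of the upper sequence.
SCORES = {
    "A": {"A": 5, "E": -1, "G": 0, "H": -2, "W": -3},
    "E": {"A": -1, "E": 6, "G": -3, "H": 0, "W": -3},
    "H": {"A": -2, "E": 0, "G": -2, "H": 10, "W": -3},
    "P": {"A": -1, "E": -1, "G": -2, "H": -2, "W": -4},
    "W": {"A": -3, "E": -3, "G": -3, "H": -3, "W": 15},
}
GAP = -8  # assumption: the same score for every gap position


def alignment_score(top: str, bottom: str) -> int:
    """Sum the column scores of an alignment written with '-' for gaps."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += GAP
        else:
            total += SCORES[b][a]
    return total


print(alignment_score("HEAGAWGHE-E", "--P-AW-HEAE"))
```

Calling the same function on the other alignment gives a direct way to check a hand calculation.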

Formal Description

Problem: PairSeqAlign

Input:
• Two sequences x, y
• Scoring matrix s
• Gap penalty d
• Gap extension penalty e

Output: the optimal sequence alignment

How Difficult Is This?

Given two sequences of length n and m:
• How many alignments are there? f(n,m)
• How many non-equivalent alignments are there? g(n,m)

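F(n,m)

$$f(n,m) = f(n-1,m) + f(n,m-1) + f(n-1,m-1)$$

$$f(0,0) = f(n,0) = f(0,m) = 1$$

n \ m    0    1    2     3     4      5
0        1    1    1     1     1      1
1        1    3    5     7     9     11
2        1    5   13    25    41     61
3        1    7   25    63   129    231
4        1    9   41   129   321    681
5        1   11   61   231   681   1683
6        1   13   85   377  1289   3653
7        1   15  113   575  2241   7183
8        1   17  145   833  3649  13073
9        1   19  181  1159  5641  22363

Each cell is the sum of its three neighbours:

f(n-1,m-1)   f(n,m-1)
f(n-1,m)     f(n,m)

Growth for two sequences of equal length, with c a constant:

$$f(n,n) \sim c\,\frac{(1+\sqrt{2})^{2n+1}}{\sqrt{n}} \quad (n \to \infty)$$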

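G(n,m)

$$g(n,m) = g(n-1,m) + g(n,m-1)$$

$$g(0,0) = g(n,0) = g(0,m) = 1$$

n \ m    0    1    2     3     4      5
0        1    1    1     1     1      1
1        1    2    3     4     5      6
2        1    3    6    10    15     21
3        1    4   10    20    35     56
4        1    5   15    35    70    126
5        1    6   21    56   126    252
6        1    7   28    84   210    462
7        1    8   36   120   330    792
8        1    9   45   165   495   1287
9        1   10   55   220   715   2002

Each cell is the sum of its upper and left neighbours:

g(n-1,m-1)   g(n,m-1)
g(n-1,m)     g(n,m)

Closed form and growth:

$$g(n,m) = \binom{n+m}{n}, \qquad g(n,n) = \binom{2n}{n} \sim \frac{2^{2n}}{\sqrt{\pi n}} \quad (n \to \infty)$$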
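Both counting recurrences can be evaluated by filling a table, as in this minimal sketch. The function names are introduced here for illustration; the binomial check relies on the standard identity that the two-term recurrence with boundary values 1 produces binomial coefficients.

```python
from math import comb


def count_alignments(n: int, m: int) -> int:
    """f(n,m): three-term recurrence with f(n,0) = f(0,m) = 1."""
    f = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = f[i - 1][j] + f[i][j - 1] + f[i - 1][j - 1]
    return f[n][m]


def count_nonequivalent(n: int, m: int) -> int:
    """g(n,m): two-term recurrence with g(n,0) = g(0,m) = 1."""
    g = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g[i][j] = g[i - 1][j] + g[i][j - 1]
    return g[n][m]


print(count_alignments(5, 5), count_nonequivalent(5, 5))
print(count_nonequivalent(20, 20) == comb(40, 20))
```

The table-filling pattern used here is the same one the alignment algorithm relies on: each cell is computed once from cells that are already known.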

So what?

So at n = 20, we already have over 120 billion non-equivalent alignments (g(20,20) = 137,846,528,820)

We want to be able to align much, much longer sequences:
• Some proteins have 1000 amino acids
• Genes can have several thousand base pairs

Dynamic Programming

General algorithmic development technique Reuses the results of previous computations

Store intermediate results in a table for reuse

Look up in table for earlier result to build from

Global Alignment

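Needleman-Wunsch, 1970

Idea: build up the optimal alignment from optimal alignments of subsequences

Alignment of the subsequences HEAG and P, score -25:

HEAG
--P-

Extend it in one of three ways and add the score from the table:

Top and bottom (A aligned with A, +5), score -20:
HEAGA
--P-A

Gap with bottom (-8), score -33:
HEAGA
--P--

Gap with top (-8), score -33:
HEAG-
--P-A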

Global Alignment

Notation:
• x_i: ith letter of string x
• y_j: jth letter of string y
• x_{1..i}: prefix of x from letters 1 through i
• F: matrix of optimal scores; F(i,j) represents the optimal score for lining up x_{1..i} with y_{1..j}
• d: gap penalty
• s: scoring matrix

Global Alignment

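The work is to build up F

Initialize: F(0,0) = 0, F(i,0) = i·d, F(0,j) = j·d (here d is the gap score, a negative number such as -8)

Fill from top left to bottom right using the recursive relation

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ \max_{k \ge 1} \left[ F(i-k,j) + \mathrm{gap}(k) \right] \\ \max_{k \ge 1} \left[ F(i,j-k) + \mathrm{gap}(k) \right] \end{cases}$$

where gap(k) is the (negative) score of a gap of length k. For a linear gap score, gap(k) = k·d, only k = 1 needs to be considered.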

Global Alignment

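Each cell F(i,j) is computed from three neighbours:

F(i-1,j-1)   F(i,j-1)
F(i-1,j)     F(i,j)

• From F(i-1,j-1), add s(x_i,y_j): move ahead in both
• From F(i-1,j), add d: x_i aligned to gap
• From F(i,j-1), add d: y_j aligned to gap

While building the table, keep track of where each optimal score came from (the arrows); the traceback follows these arrows in reverse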
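The fill step with a linear gap score can be sketched as follows. This is an illustrative sketch and not the original Needleman-Wunsch code: `fill_table` and `SCORES` are names chosen here, and the table is stored with one row per letter of y and one column per letter of x, the layout used in the worked example, so the first index runs over y.

```python
# Scoring matrix of the worked example: SCORES[letter of y][letter of x].
SCORES = {
    "A": {"A": 5, "E": -1, "G": 0, "H": -2, "W": -3},
    "E": {"A": -1, "E": 6, "G": -3, "H": 0, "W": -3},
    "H": {"A": -2, "E": 0, "G": -2, "H": 10, "W": -3},
    "P": {"A": -1, "E": -1, "G": -2, "H": -2, "W": -4},
    "W": {"A": -3, "E": -3, "G": -3, "H": -3, "W": 15},
}


def fill_table(x: str, y: str, d: int = -8):
    """Fill the score table F and remember which neighbour produced each cell.

    Assumption: linear gap score d (negative) for every gap position.
    """
    rows, cols = len(y), len(x)
    F = [[0] * (cols + 1) for _ in range(rows + 1)]
    arrow = [[None] * (cols + 1) for _ in range(rows + 1)]
    for j in range(1, cols + 1):  # first row: x aligned against gaps only
        F[0][j], arrow[0][j] = j * d, "left"
    for i in range(1, rows + 1):  # first column: y aligned against gaps only
        F[i][0], arrow[i][0] = i * d, "up"
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            candidates = [
                (F[i - 1][j - 1] + SCORES[y[i - 1]][x[j - 1]], "diag"),
                (F[i - 1][j] + d, "up"),
                (F[i][j - 1] + d, "left"),
            ]
            # max keeps the first best candidate, so ties prefer the diagonal
            F[i][j], arrow[i][j] = max(candidates, key=lambda c: c[0])
    return F, arrow


F, arrow = fill_table("HEAGAWGHEE", "PAWHEAE")
print(F[-1][-1])  # score of the optimal global alignment
```

Storing the arrows costs a second table of the same size; an alternative is to re-derive them from F during the traceback.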

Example

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16
W  -24
H  -32
E  -40
A  -48
E  -56

Completed Table

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Score: Gap -8, Error -2, Fit +6

Traceback

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Resulting alignment:

HEAGAWGHE-E
--P-AW-HEAE

Trace arrows back from the lower right to the top left

• Diagonal: both letters aligned
• Up: gap in the upper sequence
• Left: gap in the lower sequence
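The traceback can be sketched without stored arrows by re-deriving, at each cell, which neighbour explains its score. This is a minimal illustration under the same assumptions as the worked example (linear gap score -8, the scoring matrix shown earlier); the function name `global_align` and the tie-breaking order (diagonal, then up, then left) are choices made here.

```python
SCORES = {  # SCORES[letter of lower sequence y][letter of upper sequence x]
    "A": {"A": 5, "E": -1, "G": 0, "H": -2, "W": -3},
    "E": {"A": -1, "E": 6, "G": -3, "H": 0, "W": -3},
    "H": {"A": -2, "E": 0, "G": -2, "H": 10, "W": -3},
    "P": {"A": -1, "E": -1, "G": -2, "H": -2, "W": -4},
    "W": {"A": -3, "E": -3, "G": -3, "H": -3, "W": 15},
}


def global_align(x: str, y: str, d: int = -8):
    """Return (score, upper row, lower row) of one optimal global alignment."""
    n, m = len(x), len(y)
    # Fill step: rows follow y, columns follow x.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        F[0][j] = j * d
    for i in range(1, m + 1):
        F[i][0] = i * d
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + SCORES[y[i - 1]][x[j - 1]],
                F[i - 1][j] + d,
                F[i][j - 1] + d,
            )
    # Traceback: walk from the lower right corner to the top left corner.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + SCORES[y[i - 1]][x[j - 1]]:
            top.append(x[j - 1]); bottom.append(y[i - 1])  # diagonal: both
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            top.append("-"); bottom.append(y[i - 1])  # up: gap in upper row
            i -= 1
        else:
            top.append(x[j - 1]); bottom.append("-")  # left: gap in lower row
            j -= 1
    return F[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, upper, lower = global_align("HEAGAWGHEE", "PAWHEAE")
print(score)
print(upper)
print(lower)
```

Several cells in this example have tied predecessors, so a different tie-breaking order can return a different alignment with the same optimal score.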

Summary

Uses recursion to fill in a table of intermediate results

Uses O(nm) space and time, i.e. an O(n^2) algorithm for two sequences of similar length

Feasible for moderate-sized sequences, but not for aligning whole genomes
