challenging algorithms in bioinformatics...imppgortant challenges in bioinformatics s l thf t l hll...

Challenging algorithms in bioinformatics

T bjø RTorbjørn [email protected]

Department of Informatics, UiODepartment of Informatics, UiO23 September 2010

Wh t i bi i f ti ?What is bioinformatics?Definition:

Bioinformatics is the development and use of computational and Bioinformatics is the development and use of computational and mathematical methods to gather, process and interpret molecular biological data.

Aim of research:Aim of research:

To increase our understanding of the connections between biological processes at different levels while developing better theories and methods in computer science and statistics.methods in computer science and statistics.

An interdisciplinary subject:

Computer science/statistics + biology/medicineComputer science/statistics + biology/medicine

Main directions within bioinformatics

Transcription analysis

Literature analysis Proteomics

p y

Evolutionary biology Systems biologyBioinformatics

Sequence analysis Genetic variationSequence analysis

Structural biologyStructural biology

Bi i f ti t l l i bi lBioinformatics at many levels in biology

DNA RNA Protein

Cell Organism PopulationMolecules

Organ

Genomics Transcript-omics

Sequence analysis

Population genetics

Systems biology Personal

di iAssembly

Gene finding

Sequence analysis

RNomics

Microarrays

Structural biology

Proteomics

SNP analysis

Biobanks

Simulations Neuro-informatics ...

medicine

analysis

Annotation RNA folding

Proteomics

Drug design

All levels: mining the literature (natural language processing)

Important challenges in bioinformaticsp g

S l f th t l h ll i bi i f ti h Some examples of the central challenges in bioinformatics where improvement in algorithms are needed (hardest problems first):

Protein structure prediction Whole-genome sequence assembly from short sequences Alignment of multiple sequences Alignment of multiple sequences Sequence database searching Short read mapping

5

Some basic molecular biologygy

The genome is our complete genetic materialg p g The genome consists of DNA

From 2 to 3000 million nucleotides DNA is built from nucleotides DNA is built from nucleotides

4 nucleotides or bases: A(denine), C(ytosine), G(uanine), T(hymine) Genes are pieces of DNA DNA is transcribed into RNA when genes are expressed RNA is similar to DNA, but with a different backbone and uracil

4 nucleotides or bases: A, C, G, U RNA is translated into proteins based on codons (3 bases) Proteins consist of amino acids (on average 350 aa) There are 20 different natural amino acids: There are 20 different natural amino acids:

A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y Proteins form 3D structures that act as molecular machines or as

structural building blocksstructural building blocks

6

Protein structure predictionp

Hardest problem (“Holy grail”): predict 3D t i t t di tl f protein structure directly from sequence

“ab initio“ “homology modelling” “threading”

Protein secondary structure prediction (easier) Predict helixes, strands and loops Not 3D

“Folding@Home”

7

Protein 3D structure and designg

MPARALLPRRMGHRTLASTPALWASIPCPR Structure predictionS S CSELRLDLVLPSGQSFRWREQSPAHWSGVLA DQVWTLTQTEEQLHC

Structure prediction

DQVWTLTQTEEQLHCTVYRGDKSQASRPTPDELEAVRKYFQLDVTLAQLYHHWGSVD

Protein design

LAQLYHHWGSVD...

Whole genome sequence assemblyg q y

Genome sequencing results in Genome sequencing results in millions of small pieces of the full genome

The challenge is to puzzle The challenge is to puzzle these together in the right orderG i i f Genome size ranging from 2Mbp (bacteria) to 3Gbp (human)R d i f 30 b t 1000 Read size from 30 bp to 1000 bp

Sequencing errors Natural variation (allels) Repeats and similar regions

Alignment algorithmsg g

Global Local

Dynamic programming Dynamic programmingPairwise

y p g g(Needleman-Wunsch)

y p g g(Smith-Waterman)

FASTABLASTBLASTPARALIGN

Dynamic programming Dynamic programming

Multiple

Dynamic programming

ClustalWT-Coffee

Dynamic programming

MEMET CoffeeMUSCLEMAFFT

N t d i i i ti l l ti b t i t l

10

Note: dynamic programming gives an optimal solution, but is extremelyspace and time demanding with many sequences (n > 2)

Multiple sequence alignmentp q g

Align three or more sequences Show corresponding amino acids in the different proteins Place gaps at correct positions Impossible to solve optimally by brute force for more than a few Impossible to solve optimally by brute force for more than a few

short sequences

11

Pairwise sequence alignmentPairwise sequence alignment

E li AlkA H OGG1E.coli AlkAHollis et al. (2000) EMBO J. 19, 758-766 (PDB ID 1DIZ)

Human OGG1Source: Bruner et al. (2000) Nature 403, 859-866 (PDB ID 1EBM)

E.c. AlkA 127 SVAMAAKLTARVAQLYGERLDDFPE--YICFPTPQRLAAADPQA-LKALGMPLKRAEALI 183++| + |+ | +| || + | ||+ | || + +| |+ ||+ || +

H.s. OGG1 151 NIARITGMVERLCQAFGPRLIQLDDVTYHGFPSLQALAGPEVEAHLRKLGLGY-RARYVS 209

E.c. AlkA 184 HLANAALE-----GTLPMTIPGDVEQAMKTLQTFPGIGRWTANYFAL 225| | || | |+| | | ||+| |+ || | || | |+| | | ||+| |+ |

H.s. OGG1 210 ASARAILEEQGGLAWLQQLRESSYEEAHKALCILPGVGTKVADCICL 256

Common alignment scoring systemg g y

Substitution score matrixS f li i t id t h th

BLOSUM62 amino acid substituition score matrix

A R N D C Q E G H I L K M F P S T W Y V Score for aligning any two residues to each other Identical residues have large positive scores Similar residues have small positive scores Very different residues have large negative scores

A R N D C Q E G H I L K M F P S T W Y V

A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 Very different residues have large negative scores

Gap penalties Penalty for opening a gap in a sequence (Q)

H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0W 3 3 4 4 2 2 3 2 2 3 2 3 1 1 4 3 2 11 2 3 Penalty for opening a gap in a sequence (Q)

Penalty for extending a gap (R) Typical gap function: G = Q + R * L, where L is length of gap Example: Q=11, R=1

W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

E.c. AlkA 127 SVAMAAKLTARVAQLYGERLDDFPE--YICFPTPQRLAAADPQA-LKALGMPLKRAEALI 183++| + |+ | +| || + | ||+ | || + +| |+ ||+ || +

H.s. OGG1 151 NIARITGMVERLCQAFGPRLIQLDDVTYHGFPSLQALAGPEVEAHLRKLGLGY-RARYVS 209

E.c. AlkA 184 HLANAALE-----GTLPMTIPGDVEQAMKTLQTFPGIGRWTANYFAL 225| | || | |+| | | ||+| |+ || | || | |+| | | ||+| |+ |

H.s. OGG1 210 ASARAILEEQGGLAWLQQLRESSYEEAHKALCILPGVGTKVADCICL 256

How to find the best alignment(s)?g ( )

Th t ibl li t f t t There are too many possible alignments of two sequences to enable examination of every possible alignment

There is an algorithm to identify the best (optimal) alignment(s), i e the one(s) with the highest scorei.e the one(s) with the highest score

The algorithm is of the dynamic programming (DP) type The algorithm was initially described by Needleman and Wunsch

in 1970in 1970 Two steps:

First, identify the highest possible score using DP Then identify the alignment(s) with the highest score (using Then, identify the alignment(s) with the highest score (using

temporary results from the initial step) Dynamic programming:

General method for solving recursive problems by storing temporary g p y g p yresults from smaller problems along the way

Used to solve many problems in bioinformatics

14

Dynamic programmingy p g g

Assume that we are aligning the sequences x and y. We may denote the cells using coordinates:

y y2 y3 y y1 y2 y3 y4

(0,0) (0,1) (0,2) (0,3) (0,4)

x1 (1,0) (1,1) (1,2) (1,3) (1,4)x1 (1,0) (1,1) (1,2) (1,3) (1,4)

x2 (2,0) (2,1) (2,2) (2,3) (2,4)

x3 (3,0) (3,1) (3,2) (3,3) (3,4)x3 (3,0) (3,1) (3,2) (3,3) (3,4)

x4 (4,0) (4,1) (4,2) (4,3) (4,4)

• Any path to (s,t) corresponds to a global alignment of (x1:s, y1:t). • Any such alignment (w,z) has a score S(w,z).

15

• Let B(s,t) be the highest score (= optimal score) among these


Assume a score function S(...) with gap penalty g.Assume a score function S(...) with gap penalty g.We are going to fill in the optimal score in each cell:

y1 y2 y3 y4 B(0,0) B(0,1) B(0,2) B(0,3) B(0,4)

x1 B(1,0) B(1,1) B(1,2) B(1,3) B(1,4)

x2 B(2,0) B(2,1) B(2,2) B(2,3) B(2,4)

x3 B(3,0) B(3,1) B(3,2) B(3,3) B(3,4)

x4 B(4,0) B(4,1) B(4,2) B(4,3) B(4,4)Optimal score forthe alignmentof x and y

1

(0,0) 0( ,0) ( , ) ... ( , )(0 ) ( ) ( )

i

BB i S x S x igB j S S j

1(0, ) ( , ) ... ( , )

( 1, 1) ( , )( , ) max ( 1, ) ( , )

j

i j

i

B j S y S y jg

B i j S x yB i j B i j S x

16

( , ) max ( 1, ) ( , )( , 1) ( , )

i

j

B i j B i j S xB i j S y


Score function:• S(a,a) = 2• S(a,b) = 0 for a ≠ b• S(a,-) = -1

SSequences:x = TDWAXy = ZTDX

Computing B(i j) values:Computing B(i,j) values:

Z T D X

T0 -1 -2 -3

-1-4

0 1 0 -1DWA

-2-3-4

-1 0 3 2-2 -1 2 3

3 2 1 2

17

AX

4-5

-3 -2 1 230-3-4


Z T D X

T0 -1 -2 -3

1-4

0 1 0 1TDW

-1-2-3

0 1 0-1 0 3 2

-1

-2 -1 2 3

AX

-4-5

-3 -2 1 230-3-4

ZTD--X-TDWAX

S = (-1) + 2 + 2 + (-1) + (-1) + 2 = 3

18

( ) ( ) ( )

Algorithmic complexityg p y

Assume that we are aligning two sequences of length m and n Assume that we are aligning two sequences of length m and n

Memory: O(nm)A fixed number of tables (one or two) with n*m cells: constant * nmA fixed number of additional variables: constantLittle memory needed if we are only interested in the best score

Time: O(nm)Calculate B(i,j) and P(i,j) for n*m cells in the table: constant * nmPerform traceback: constant * (n+m)Perform traceback: constant (n+m)

Assume for example that we use 8 bytes of memory per cell in the t bl ( di t t i t i bl i J ) Ali t f t table (corresponding to two int-variables in Java). Alignment of two DNA sequences of length 10 000 requires 10 000 * 10 000 * 8 = 800 Mb memory

19

length 100 000 requires 100 000 * 100 000 * 8 = 80 Gb memory

Other implementationsp

Ali t t l ith ffi lti (A Alignment gets more complex with affine gap penalties (A + B*gap length).

To find the actual alignment, a divide-and-conquer type of algorithm exists. It uses only a linear amount of memory, O(m+n). It uses somewhat more time than the other algorithm. Suggested by Hirschberg as well as Huang, Hardison and Miller.

Implementations exploiting parallel computing technology have Implementations exploiting parallel computing technology have speeded up Smith-Waterman alignments substantially.

Many different heuristic algorithms have been suggested E g Many different heuristic algorithms have been suggested. E.g. BLAST which is widely used for searching databases.

20

Searching sequence databases

• Goal: Identify which sequences in a database are significantly

g q

Goal: Identify which sequences in a database are significantly similar to a given DNA, RNA or protein sequence.

• How: The query sequence is compared (aligned) with each of the database sequences and the amount of similarity is the database sequences, and the amount of similarity is determined for each database sequence.

Example:Query sequence: acgatcgattagcca

Database sequences:Identical (trivial): acgatcgattagccaIdentical (trivial): acgatcgattagccaVery similar (easy): acgaccgatgagccaSimilar (moderate): atgacggatgagcgaV di d (h d)

21

Very diverged (hard): atgacgggatgagcga

Searching databasesg

B d l l li t f ith d t b Based on local alignments of query sequence with every database sequence

Exhaustive / optimal / brute-force: Smith-Waterman

Heuristic: BLAST FASTA PARALIGN SSAHA PatternHunter Heuristic: BLAST, FASTA, PARALIGN, SSAHA, PatternHunter, BLAT, ...

Speed difference: about 50X Speed difference: about 50X Sensitivity Specificity

22

Heuristic search algorithmsg

M h i ti l ith t i d i t ith th Many heuristic algorithms creates an index into either the query sequence or the database sequences

The index / hash key is short for sensitive algorithms and longer for quicker methods

Allows fast lookup of matching positions Software:

BLAST, FASTA, SSAHA, BLAT, PatternHunter Index key size

2-7 amino acids (BLASTP) 2 7 amino acids (BLASTP) 1-2 amino acids (FASTA) 1-6 nucleotides (FASTA) 11-28 nucleotides (BLASTN MEGABLAST) 11-28 nucleotides (BLASTN, MEGABLAST) 10 nucleotides (SSAHA) 11 nucleotides (BLAT)

23

Short read mappingShort read mapping

• Input:– A reference genome– A collection of many 25-100bp tags– User-specified parameters

• Output:– One or more genomic coordinates for each tagg g

• In practice only 70-75% of tags • In practice, only 70 75% of tags successfully map to the reference genomegenome.

ACGTACGTACGTACGTACGTACGTACGTACGTACGT

TCAGTGCATTGCACCCTGGGTGTGATCGTGCATGTG

CACGTACGTGGTTTTGCAACGTGCTGATGCTAGCAT

The problemp

I tInput: Tens of millions of reads 25-100bp long readsp g Approx 1Gbp per run Sequencing errors

Reference genome: 3 Gbp sequence Human genome variants; genome repeats

Requirements:Requirements: Speed Accuracy

Sh t d l i ftShort read analysis software

Short read mapping algorithmspp g g

Q filt i Q-gram filtering At least one relatively long match A few mismatches allowed

“Spaced seeds” Multiple short matches Multiple mismatches allowed

Burrows wheeler Exact match

28

Q-gram filteringQ g g

I t t i t hi Inexact string matching Filtering algorithm by Baeza-Yates and Perleberg Using reads of length n and allowing for up to k mismatches

h d d d d k d Each read divided into k+1 continous seeds Each seed has a length of n / (k+1) There must be at least one matching seed for each read

Used by RMAP (10 nucleotide seeds)

Q-gram filtering

G

Q g g

TCAGTGCATTGCACCCTGGGTGTGATCCTGCATGTG Read (36bp)

...TCAGTGCCTTGCACCCTGGGTGTGATCGTGCATGTG... Genome

TCAGTGCATTGC ACCCTGGGTGTG ATCCTGCATGTG

Read (36bp)

Seeds (3 x 12bp)ACCCTGGGTGTG ATCCTGCATGTG Seeds (3 x 12bp)

At least one of these must match even with two mismatches

Q-gram filteringQ g g

All d di id d i t Seed Read PosAll reads are divided into seeds and a hash table is built containg information about all seeds

AAAAAAAAAAAA 38723 25

...about all seeds

The table contains the read id and the position within the

d f ll d

ACCCTGGGTGTG 1 13

...read for all seeds

Fast retrieval ATCCTGCATGTG 1 25...

CTATTGCACGTC 17 1

...

GTGGTATTTGCT 4343 13

...

TCAGTGCATTGC 1 1

...

S d d li tSpaced seed alignment Tags and tag-sized pieces of

efe en e e t into m ll reference are cut into small “seeds.”

Pairs of spaced seeds are t d i i dstored in an index.

Look up spaced seeds for each tag.

For each “hit,” confirm the remaining positions.

Report results to the user.p

B Wh lBurrows-Wheeler Store entire reference genome. Align tag base by base from

the end. When tag is traversed, all

active locations are reported. If no match is found, then back

up and try a substitution.

B Wh l t fBurrows-Wheeler transform

D t t tData structures

C iComparison

Burrows-Wheeler Spaced seeds Burrows-Wheeler• Requires

challenging algorithms in bioinformatics...imppgortant challenges in bioinformatics s l thf t l hll...

Documents