challenging algorithms in bioinformatics...imppgortant challenges in bioinformatics s l thf t l hll...
TRANSCRIPT
-
Challenging algorithms in bioinformatics
T bjø RTorbjørn [email protected]
Department of Informatics, UiODepartment of Informatics, UiO23 September 2010
-
Wh t i bi i f ti ?What is bioinformatics?Definition:
Bioinformatics is the development and use of computational and Bioinformatics is the development and use of computational and mathematical methods to gather, process and interpret molecular biological data.
Aim of research:Aim of research:
To increase our understanding of the connections between biological processes at different levels while developing better theories and methods in computer science and statistics.methods in computer science and statistics.
An interdisciplinary subject:
Computer science/statistics + biology/medicineComputer science/statistics + biology/medicine
-
Main directions within bioinformatics
Transcription analysis
Literature analysis Proteomics
p y
Evolutionary biology Systems biologyBioinformatics
Sequence analysis Genetic variationSequence analysis
Structural biologyStructural biology
-
Bi i f ti t l l i bi lBioinformatics at many levels in biology
DNA RNA Protein
Cell Organism PopulationMolecules
Organ
Genomics Transcript-omics
Sequence analysis
Population genetics
Systems biology Personal
di iAssembly
Gene finding
Sequence analysis
RNomics
Microarrays
Structural biology
Proteomics
SNP analysis
Biobanks
Simulations Neuro-informatics ...
medicine
analysis
Annotation RNA folding
Proteomics
Drug design
All levels: mining the literature (natural language processing)
-
Important challenges in bioinformaticsp g
S l f th t l h ll i bi i f ti h Some examples of the central challenges in bioinformatics where improvement in algorithms are needed (hardest problems first):
Protein structure prediction Whole-genome sequence assembly from short sequences Alignment of multiple sequences Alignment of multiple sequences Sequence database searching Short read mapping
5
-
Some basic molecular biologygy
The genome is our complete genetic materialg p g The genome consists of DNA
From 2 to 3000 million nucleotides DNA is built from nucleotides DNA is built from nucleotides
4 nucleotides or bases: A(denine), C(ytosine), G(uanine), T(hymine) Genes are pieces of DNA DNA is transcribed into RNA when genes are expressed RNA is similar to DNA, but with a different backbone and uracil
4 nucleotides or bases: A, C, G, U RNA is translated into proteins based on codons (3 bases) Proteins consist of amino acids (on average 350 aa) There are 20 different natural amino acids: There are 20 different natural amino acids:
A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y Proteins form 3D structures that act as molecular machines or as
structural building blocksstructural building blocks
6
-
Protein structure predictionp
Hardest problem (“Holy grail”): predict 3D t i t t di tl f protein structure directly from sequence
“ab initio“ “homology modelling” “threading”
Protein secondary structure prediction (easier) Predict helixes, strands and loops Not 3D
“Folding@Home”
7
-
Protein 3D structure and designg
MPARALLPRRMGHRTLASTPALWASIPCPR Structure predictionS S CSELRLDLVLPSGQSFRWREQSPAHWSGVLA DQVWTLTQTEEQLHC
Structure prediction
DQVWTLTQTEEQLHCTVYRGDKSQASRPTPDELEAVRKYFQLDVTLAQLYHHWGSVD
Protein design
LAQLYHHWGSVD...
-
Whole genome sequence assemblyg q y
Genome sequencing results in Genome sequencing results in millions of small pieces of the full genome
The challenge is to puzzle The challenge is to puzzle these together in the right orderG i i f Genome size ranging from 2Mbp (bacteria) to 3Gbp (human)R d i f 30 b t 1000 Read size from 30 bp to 1000 bp
Sequencing errors Natural variation (allels) Repeats and similar regions
-
Alignment algorithmsg g
Global Local
Dynamic programming Dynamic programmingPairwise
y p g g(Needleman-Wunsch)
y p g g(Smith-Waterman)
FASTABLASTBLASTPARALIGN
Dynamic programming Dynamic programming
Multiple
Dynamic programming
ClustalWT-Coffee
Dynamic programming
MEMET CoffeeMUSCLEMAFFT
N t d i i i ti l l ti b t i t l
10
Note: dynamic programming gives an optimal solution, but is extremelyspace and time demanding with many sequences (n > 2)
-
Multiple sequence alignmentp q g
Align three or more sequences Show corresponding amino acids in the different proteins Place gaps at correct positions Impossible to solve optimally by brute force for more than a few Impossible to solve optimally by brute force for more than a few
short sequences
11
-
Pairwise sequence alignmentPairwise sequence alignment
E li AlkA H OGG1E.coli AlkAHollis et al. (2000) EMBO J. 19, 758-766 (PDB ID 1DIZ)
Human OGG1Source: Bruner et al. (2000) Nature 403, 859-866 (PDB ID 1EBM)
E.c. AlkA 127 SVAMAAKLTARVAQLYGERLDDFPE--YICFPTPQRLAAADPQA-LKALGMPLKRAEALI 183++| + |+ | +| || + | ||+ | || + +| |+ ||+ || +
H.s. OGG1 151 NIARITGMVERLCQAFGPRLIQLDDVTYHGFPSLQALAGPEVEAHLRKLGLGY-RARYVS 209
E.c. AlkA 184 HLANAALE-----GTLPMTIPGDVEQAMKTLQTFPGIGRWTANYFAL 225| | || | |+| | | ||+| |+ || | || | |+| | | ||+| |+ |
H.s. OGG1 210 ASARAILEEQGGLAWLQQLRESSYEEAHKALCILPGVGTKVADCICL 256
-
Common alignment scoring systemg g y
Substitution score matrixS f li i t id t h th
BLOSUM62 amino acid substituition score matrix
A R N D C Q E G H I L K M F P S T W Y V Score for aligning any two residues to each other Identical residues have large positive scores Similar residues have small positive scores Very different residues have large negative scores
A R N D C Q E G H I L K M F P S T W Y V
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 Very different residues have large negative scores
Gap penalties Penalty for opening a gap in a sequence (Q)
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0W 3 3 4 4 2 2 3 2 2 3 2 3 1 1 4 3 2 11 2 3 Penalty for opening a gap in a sequence (Q)
Penalty for extending a gap (R) Typical gap function: G = Q + R * L, where L is length of gap Example: Q=11, R=1
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
E.c. AlkA 127 SVAMAAKLTARVAQLYGERLDDFPE--YICFPTPQRLAAADPQA-LKALGMPLKRAEALI 183++| + |+ | +| || + | ||+ | || + +| |+ ||+ || +
H.s. OGG1 151 NIARITGMVERLCQAFGPRLIQLDDVTYHGFPSLQALAGPEVEAHLRKLGLGY-RARYVS 209
E.c. AlkA 184 HLANAALE-----GTLPMTIPGDVEQAMKTLQTFPGIGRWTANYFAL 225| | || | |+| | | ||+| |+ || | || | |+| | | ||+| |+ |
H.s. OGG1 210 ASARAILEEQGGLAWLQQLRESSYEEAHKALCILPGVGTKVADCICL 256
-
How to find the best alignment(s)?g ( )
Th t ibl li t f t t There are too many possible alignments of two sequences to enable examination of every possible alignment
There is an algorithm to identify the best (optimal) alignment(s), i e the one(s) with the highest scorei.e the one(s) with the highest score
The algorithm is of the dynamic programming (DP) type The algorithm was initially described by Needleman and Wunsch
in 1970in 1970 Two steps:
First, identify the highest possible score using DP Then identify the alignment(s) with the highest score (using Then, identify the alignment(s) with the highest score (using
temporary results from the initial step) Dynamic programming:
General method for solving recursive problems by storing temporary g p y g p yresults from smaller problems along the way
Used to solve many problems in bioinformatics
14
-
Dynamic programmingy p g g
Assume that we are aligning the sequences x and y. We may denote the cells using coordinates:
y y2 y3 y y1 y2 y3 y4
(0,0) (0,1) (0,2) (0,3) (0,4)
x1 (1,0) (1,1) (1,2) (1,3) (1,4)x1 (1,0) (1,1) (1,2) (1,3) (1,4)
x2 (2,0) (2,1) (2,2) (2,3) (2,4)
x3 (3,0) (3,1) (3,2) (3,3) (3,4)x3 (3,0) (3,1) (3,2) (3,3) (3,4)
x4 (4,0) (4,1) (4,2) (4,3) (4,4)
• Any path to (s,t) corresponds to a global alignment of (x1:s, y1:t). • Any such alignment (w,z) has a score S(w,z).
15
• Let B(s,t) be the highest score (= optimal score) among these
-
Dynamic programmingy p g g
Assume a score function S(...) with gap penalty g.Assume a score function S(...) with gap penalty g.We are going to fill in the optimal score in each cell:
y1 y2 y3 y4 B(0,0) B(0,1) B(0,2) B(0,3) B(0,4)
x1 B(1,0) B(1,1) B(1,2) B(1,3) B(1,4)
x2 B(2,0) B(2,1) B(2,2) B(2,3) B(2,4)
x3 B(3,0) B(3,1) B(3,2) B(3,3) B(3,4)
x4 B(4,0) B(4,1) B(4,2) B(4,3) B(4,4)Optimal score forthe alignmentof x and y
1
(0,0) 0( ,0) ( , ) ... ( , )(0 ) ( ) ( )
i
BB i S x S x igB j S S j
1(0, ) ( , ) ... ( , )
( 1, 1) ( , )( , ) max ( 1, ) ( , )
j
i j
i
B j S y S y jg
B i j S x yB i j B i j S x
16
( , ) max ( 1, ) ( , )( , 1) ( , )
i
j
B i j B i j S xB i j S y
-
Dynamic programmingy p g g
Score function:• S(a,a) = 2• S(a,b) = 0 for a ≠ b• S(a,-) = -1
SSequences:x = TDWAXy = ZTDX
Computing B(i j) values:Computing B(i,j) values:
Z T D X
T0 -1 -2 -3
-1-4
0 1 0 -1DWA
-2-3-4
-1 0 3 2-2 -1 2 3
3 2 1 2
17
AX
4-5
-3 -2 1 230-3-4
-
Dynamic programmingy p g g
Z T D X
T0 -1 -2 -3
1-4
0 1 0 1TDW
-1-2-3
0 1 0-1 0 3 2
-1
-2 -1 2 3
AX
-4-5
-3 -2 1 230-3-4
ZTD--X-TDWAX
S = (-1) + 2 + 2 + (-1) + (-1) + 2 = 3
18
( ) ( ) ( )
-
Algorithmic complexityg p y
Assume that we are aligning two sequences of length m and n Assume that we are aligning two sequences of length m and n
Memory: O(nm)A fixed number of tables (one or two) with n*m cells: constant * nmA fixed number of additional variables: constantLittle memory needed if we are only interested in the best score
Time: O(nm)Calculate B(i,j) and P(i,j) for n*m cells in the table: constant * nmPerform traceback: constant * (n+m)Perform traceback: constant (n+m)
Assume for example that we use 8 bytes of memory per cell in the t bl ( di t t i t i bl i J ) Ali t f t table (corresponding to two int-variables in Java). Alignment of two DNA sequences of length 10 000 requires 10 000 * 10 000 * 8 = 800 Mb memory
19
length 100 000 requires 100 000 * 100 000 * 8 = 80 Gb memory
-
Other implementationsp
Ali t t l ith ffi lti (A Alignment gets more complex with affine gap penalties (A + B*gap length).
To find the actual alignment, a divide-and-conquer type of algorithm exists. It uses only a linear amount of memory, O(m+n). It uses somewhat more time than the other algorithm. Suggested by Hirschberg as well as Huang, Hardison and Miller.
Implementations exploiting parallel computing technology have Implementations exploiting parallel computing technology have speeded up Smith-Waterman alignments substantially.
Many different heuristic algorithms have been suggested E g Many different heuristic algorithms have been suggested. E.g. BLAST which is widely used for searching databases.
20
-
Searching sequence databases
• Goal: Identify which sequences in a database are significantly
g q
Goal: Identify which sequences in a database are significantly similar to a given DNA, RNA or protein sequence.
• How: The query sequence is compared (aligned) with each of the database sequences and the amount of similarity is the database sequences, and the amount of similarity is determined for each database sequence.
Example:Query sequence: acgatcgattagcca
Database sequences:Identical (trivial): acgatcgattagccaIdentical (trivial): acgatcgattagccaVery similar (easy): acgaccgatgagccaSimilar (moderate): atgacggatgagcgaV di d (h d)
21
Very diverged (hard): atgacgggatgagcga
-
Searching databasesg
B d l l li t f ith d t b Based on local alignments of query sequence with every database sequence
Exhaustive / optimal / brute-force: Smith-Waterman
Heuristic: BLAST FASTA PARALIGN SSAHA PatternHunter Heuristic: BLAST, FASTA, PARALIGN, SSAHA, PatternHunter, BLAT, ...
Speed difference: about 50X Speed difference: about 50X Sensitivity Specificity
22
-
Heuristic search algorithmsg
M h i ti l ith t i d i t ith th Many heuristic algorithms creates an index into either the query sequence or the database sequences
The index / hash key is short for sensitive algorithms and longer for quicker methods
Allows fast lookup of matching positions Software:
BLAST, FASTA, SSAHA, BLAT, PatternHunter Index key size
2-7 amino acids (BLASTP) 2 7 amino acids (BLASTP) 1-2 amino acids (FASTA) 1-6 nucleotides (FASTA) 11-28 nucleotides (BLASTN MEGABLAST) 11-28 nucleotides (BLASTN, MEGABLAST) 10 nucleotides (SSAHA) 11 nucleotides (BLAT)
23
-
Short read mappingShort read mapping
• Input:– A reference genome– A collection of many 25-100bp tags– User-specified parameters
• Output:– One or more genomic coordinates for each tagg g
• In practice only 70-75% of tags • In practice, only 70 75% of tags successfully map to the reference genomegenome.
-
ACGTACGTACGTACGTACGTACGTACGTACGTACGT
TCAGTGCATTGCACCCTGGGTGTGATCGTGCATGTG
CACGTACGTGGTTTTGCAACGTGCTGATGCTAGCAT
-
The problemp
I tInput: Tens of millions of reads 25-100bp long readsp g Approx 1Gbp per run Sequencing errors
Reference genome: 3 Gbp sequence Human genome variants; genome repeats
Requirements:Requirements: Speed Accuracy
-
Sh t d l i ftShort read analysis software
-
Short read mapping algorithmspp g g
Q filt i Q-gram filtering At least one relatively long match A few mismatches allowed
“Spaced seeds” Multiple short matches Multiple mismatches allowed
Burrows wheeler Exact match
28
-
Q-gram filteringQ g g
I t t i t hi Inexact string matching Filtering algorithm by Baeza-Yates and Perleberg Using reads of length n and allowing for up to k mismatches
h d d d d k d Each read divided into k+1 continous seeds Each seed has a length of n / (k+1) There must be at least one matching seed for each read
Used by RMAP (10 nucleotide seeds)
-
Q-gram filtering
G
Q g g
TCAGTGCATTGCACCCTGGGTGTGATCCTGCATGTG Read (36bp)
...TCAGTGCCTTGCACCCTGGGTGTGATCGTGCATGTG... Genome
TCAGTGCATTGC ACCCTGGGTGTG ATCCTGCATGTG
Read (36bp)
Seeds (3 x 12bp)ACCCTGGGTGTG ATCCTGCATGTG Seeds (3 x 12bp)
At least one of these must match even with two mismatches
-
Q-gram filteringQ g g
All d di id d i t Seed Read PosAll reads are divided into seeds and a hash table is built containg information about all seeds
AAAAAAAAAAAA 38723 25
...about all seeds
The table contains the read id and the position within the
d f ll d
ACCCTGGGTGTG 1 13
...read for all seeds
Fast retrieval ATCCTGCATGTG 1 25...
CTATTGCACGTC 17 1
...
GTGGTATTTGCT 4343 13
...
TCAGTGCATTGC 1 1
...
-
S d d li tSpaced seed alignment Tags and tag-sized pieces of
efe en e e t into m ll reference are cut into small “seeds.”
Pairs of spaced seeds are t d i i dstored in an index.
Look up spaced seeds for each tag.
For each “hit,” confirm the remaining positions.
Report results to the user.p
-
B Wh lBurrows-Wheeler Store entire reference genome. Align tag base by base from
the end. When tag is traversed, all
active locations are reported. If no match is found, then back
up and try a substitution.
-
B Wh l t fBurrows-Wheeler transform
-
D t t tData structures
-
C iComparison
Burrows-Wheeler Spaced seeds Burrows-Wheeler• Requires