challenging algorithms in bioinformatics...imppgortant challenges in bioinformatics s l thf t l hll...

36
Challenging algorithms in bioinformatics T bjø R Torbjørn Rognes [email protected] Department of Informatics, UiO Department of Informatics, UiO 23 September 2010

Upload: others

Post on 28-Jan-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

  • Challenging algorithms in bioinformatics

    T bjø RTorbjørn [email protected]

    Department of Informatics, UiODepartment of Informatics, UiO23 September 2010

  • Wh t i bi i f ti ?What is bioinformatics?Definition:

    Bioinformatics is the development and use of computational and Bioinformatics is the development and use of computational and mathematical methods to gather, process and interpret molecular biological data.

    Aim of research:Aim of research:

    To increase our understanding of the connections between biological processes at different levels while developing better theories and methods in computer science and statistics.methods in computer science and statistics.

    An interdisciplinary subject:

    Computer science/statistics + biology/medicineComputer science/statistics + biology/medicine

  • Main directions within bioinformatics

    Transcription analysis

    Literature analysis Proteomics

    p y

    Evolutionary biology Systems biologyBioinformatics

    Sequence analysis Genetic variationSequence analysis

    Structural biologyStructural biology

  • Bi i f ti t l l i bi lBioinformatics at many levels in biology

    DNA RNA Protein

    Cell Organism PopulationMolecules

    Organ

    Genomics Transcript-omics

    Sequence analysis

    Population genetics

    Systems biology Personal

    di iAssembly

    Gene finding

    Sequence analysis

    RNomics

    Microarrays

    Structural biology

    Proteomics

    SNP analysis

    Biobanks

    Simulations Neuro-informatics ...

    medicine

    analysis

    Annotation RNA folding

    Proteomics

    Drug design

    All levels: mining the literature (natural language processing)

  • Important challenges in bioinformaticsp g

    S l f th t l h ll i bi i f ti h Some examples of the central challenges in bioinformatics where improvement in algorithms are needed (hardest problems first):

    Protein structure prediction Whole-genome sequence assembly from short sequences Alignment of multiple sequences Alignment of multiple sequences Sequence database searching Short read mapping

    5

  • Some basic molecular biologygy

    The genome is our complete genetic materialg p g The genome consists of DNA

    From 2 to 3000 million nucleotides DNA is built from nucleotides DNA is built from nucleotides

    4 nucleotides or bases: A(denine), C(ytosine), G(uanine), T(hymine) Genes are pieces of DNA DNA is transcribed into RNA when genes are expressed RNA is similar to DNA, but with a different backbone and uracil

    4 nucleotides or bases: A, C, G, U RNA is translated into proteins based on codons (3 bases) Proteins consist of amino acids (on average 350 aa) There are 20 different natural amino acids: There are 20 different natural amino acids:

    A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y Proteins form 3D structures that act as molecular machines or as

    structural building blocksstructural building blocks

    6

  • Protein structure predictionp

    Hardest problem (“Holy grail”): predict 3D t i t t di tl f protein structure directly from sequence

    “ab initio“ “homology modelling” “threading”

    Protein secondary structure prediction (easier) Predict helixes, strands and loops Not 3D

    “Folding@Home”

    7

  • Protein 3D structure and designg

    MPARALLPRRMGHRTLASTPALWASIPCPR Structure predictionS S CSELRLDLVLPSGQSFRWREQSPAHWSGVLA DQVWTLTQTEEQLHC

    Structure prediction

    DQVWTLTQTEEQLHCTVYRGDKSQASRPTPDELEAVRKYFQLDVTLAQLYHHWGSVD

    Protein design

    LAQLYHHWGSVD...

  • Whole genome sequence assemblyg q y

    Genome sequencing results in Genome sequencing results in millions of small pieces of the full genome

    The challenge is to puzzle The challenge is to puzzle these together in the right orderG i i f Genome size ranging from 2Mbp (bacteria) to 3Gbp (human)R d i f 30 b t 1000 Read size from 30 bp to 1000 bp

    Sequencing errors Natural variation (allels) Repeats and similar regions

  • Alignment algorithmsg g

    Global Local

    Dynamic programming Dynamic programmingPairwise

    y p g g(Needleman-Wunsch)

    y p g g(Smith-Waterman)

    FASTABLASTBLASTPARALIGN

    Dynamic programming Dynamic programming

    Multiple

    Dynamic programming

    ClustalWT-Coffee

    Dynamic programming

    MEMET CoffeeMUSCLEMAFFT

    N t d i i i ti l l ti b t i t l

    10

    Note: dynamic programming gives an optimal solution, but is extremelyspace and time demanding with many sequences (n > 2)

  • Multiple sequence alignmentp q g

    Align three or more sequences Show corresponding amino acids in the different proteins Place gaps at correct positions Impossible to solve optimally by brute force for more than a few Impossible to solve optimally by brute force for more than a few

    short sequences

    11

  • Pairwise sequence alignmentPairwise sequence alignment

    E li AlkA H OGG1E.coli AlkAHollis et al. (2000) EMBO J. 19, 758-766 (PDB ID 1DIZ)

    Human OGG1Source: Bruner et al. (2000) Nature 403, 859-866 (PDB ID 1EBM)

    E.c. AlkA 127 SVAMAAKLTARVAQLYGERLDDFPE--YICFPTPQRLAAADPQA-LKALGMPLKRAEALI 183++| + |+ | +| || + | ||+ | || + +| |+ ||+ || +

    H.s. OGG1 151 NIARITGMVERLCQAFGPRLIQLDDVTYHGFPSLQALAGPEVEAHLRKLGLGY-RARYVS 209

    E.c. AlkA 184 HLANAALE-----GTLPMTIPGDVEQAMKTLQTFPGIGRWTANYFAL 225| | || | |+| | | ||+| |+ || | || | |+| | | ||+| |+ |

    H.s. OGG1 210 ASARAILEEQGGLAWLQQLRESSYEEAHKALCILPGVGTKVADCICL 256

  • Common alignment scoring systemg g y

    Substitution score matrixS f li i t id t h th

    BLOSUM62 amino acid substituition score matrix

    A R N D C Q E G H I L K M F P S T W Y V Score for aligning any two residues to each other Identical residues have large positive scores Similar residues have small positive scores Very different residues have large negative scores

    A R N D C Q E G H I L K M F P S T W Y V

    A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 Very different residues have large negative scores

    Gap penalties Penalty for opening a gap in a sequence (Q)

    H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0W 3 3 4 4 2 2 3 2 2 3 2 3 1 1 4 3 2 11 2 3 Penalty for opening a gap in a sequence (Q)

    Penalty for extending a gap (R) Typical gap function: G = Q + R * L, where L is length of gap Example: Q=11, R=1

    W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

    E.c. AlkA 127 SVAMAAKLTARVAQLYGERLDDFPE--YICFPTPQRLAAADPQA-LKALGMPLKRAEALI 183++| + |+ | +| || + | ||+ | || + +| |+ ||+ || +

    H.s. OGG1 151 NIARITGMVERLCQAFGPRLIQLDDVTYHGFPSLQALAGPEVEAHLRKLGLGY-RARYVS 209

    E.c. AlkA 184 HLANAALE-----GTLPMTIPGDVEQAMKTLQTFPGIGRWTANYFAL 225| | || | |+| | | ||+| |+ || | || | |+| | | ||+| |+ |

    H.s. OGG1 210 ASARAILEEQGGLAWLQQLRESSYEEAHKALCILPGVGTKVADCICL 256

  • How to find the best alignment(s)?g ( )

    Th t ibl li t f t t There are too many possible alignments of two sequences to enable examination of every possible alignment

    There is an algorithm to identify the best (optimal) alignment(s), i e the one(s) with the highest scorei.e the one(s) with the highest score

    The algorithm is of the dynamic programming (DP) type The algorithm was initially described by Needleman and Wunsch

    in 1970in 1970 Two steps:

    First, identify the highest possible score using DP Then identify the alignment(s) with the highest score (using Then, identify the alignment(s) with the highest score (using

    temporary results from the initial step) Dynamic programming:

    General method for solving recursive problems by storing temporary g p y g p yresults from smaller problems along the way

    Used to solve many problems in bioinformatics

    14

  • Dynamic programmingy p g g

    Assume that we are aligning the sequences x and y. We may denote the cells using coordinates:

    y y2 y3 y y1 y2 y3 y4

    (0,0) (0,1) (0,2) (0,3) (0,4)

    x1 (1,0) (1,1) (1,2) (1,3) (1,4)x1 (1,0) (1,1) (1,2) (1,3) (1,4)

    x2 (2,0) (2,1) (2,2) (2,3) (2,4)

    x3 (3,0) (3,1) (3,2) (3,3) (3,4)x3 (3,0) (3,1) (3,2) (3,3) (3,4)

    x4 (4,0) (4,1) (4,2) (4,3) (4,4)

    • Any path to (s,t) corresponds to a global alignment of (x1:s, y1:t). • Any such alignment (w,z) has a score S(w,z).

    15

    • Let B(s,t) be the highest score (= optimal score) among these

  • Dynamic programmingy p g g

    Assume a score function S(...) with gap penalty g.Assume a score function S(...) with gap penalty g.We are going to fill in the optimal score in each cell:

    y1 y2 y3 y4 B(0,0) B(0,1) B(0,2) B(0,3) B(0,4)

    x1 B(1,0) B(1,1) B(1,2) B(1,3) B(1,4)

    x2 B(2,0) B(2,1) B(2,2) B(2,3) B(2,4)

    x3 B(3,0) B(3,1) B(3,2) B(3,3) B(3,4)

    x4 B(4,0) B(4,1) B(4,2) B(4,3) B(4,4)Optimal score forthe alignmentof x and y

    1

    (0,0) 0( ,0) ( , ) ... ( , )(0 ) ( ) ( )

    i

    BB i S x S x igB j S S j

    1(0, ) ( , ) ... ( , )

    ( 1, 1) ( , )( , ) max ( 1, ) ( , )

    j

    i j

    i

    B j S y S y jg

    B i j S x yB i j B i j S x

    16

    ( , ) max ( 1, ) ( , )( , 1) ( , )

    i

    j

    B i j B i j S xB i j S y

  • Dynamic programmingy p g g

    Score function:• S(a,a) = 2• S(a,b) = 0 for a ≠ b• S(a,-) = -1

    SSequences:x = TDWAXy = ZTDX

    Computing B(i j) values:Computing B(i,j) values:

    Z T D X

    T0 -1 -2 -3

    -1-4

    0 1 0 -1DWA

    -2-3-4

    -1 0 3 2-2 -1 2 3

    3 2 1 2

    17

    AX

    4-5

    -3 -2 1 230-3-4

  • Dynamic programmingy p g g

    Z T D X

    T0 -1 -2 -3

    1-4

    0 1 0 1TDW

    -1-2-3

    0 1 0-1 0 3 2

    -1

    -2 -1 2 3

    AX

    -4-5

    -3 -2 1 230-3-4

    ZTD--X-TDWAX

    S = (-1) + 2 + 2 + (-1) + (-1) + 2 = 3

    18

    ( ) ( ) ( )

  • Algorithmic complexityg p y

    Assume that we are aligning two sequences of length m and n Assume that we are aligning two sequences of length m and n

    Memory: O(nm)A fixed number of tables (one or two) with n*m cells: constant * nmA fixed number of additional variables: constantLittle memory needed if we are only interested in the best score

    Time: O(nm)Calculate B(i,j) and P(i,j) for n*m cells in the table: constant * nmPerform traceback: constant * (n+m)Perform traceback: constant (n+m)

    Assume for example that we use 8 bytes of memory per cell in the t bl ( di t t i t i bl i J ) Ali t f t table (corresponding to two int-variables in Java). Alignment of two DNA sequences of length 10 000 requires 10 000 * 10 000 * 8 = 800 Mb memory

    19

    length 100 000 requires 100 000 * 100 000 * 8 = 80 Gb memory

  • Other implementationsp

    Ali t t l ith ffi lti (A Alignment gets more complex with affine gap penalties (A + B*gap length).

    To find the actual alignment, a divide-and-conquer type of algorithm exists. It uses only a linear amount of memory, O(m+n). It uses somewhat more time than the other algorithm. Suggested by Hirschberg as well as Huang, Hardison and Miller.

    Implementations exploiting parallel computing technology have Implementations exploiting parallel computing technology have speeded up Smith-Waterman alignments substantially.

    Many different heuristic algorithms have been suggested E g Many different heuristic algorithms have been suggested. E.g. BLAST which is widely used for searching databases.

    20

  • Searching sequence databases

    • Goal: Identify which sequences in a database are significantly

    g q

    Goal: Identify which sequences in a database are significantly similar to a given DNA, RNA or protein sequence.

    • How: The query sequence is compared (aligned) with each of the database sequences and the amount of similarity is the database sequences, and the amount of similarity is determined for each database sequence.

    Example:Query sequence: acgatcgattagcca

    Database sequences:Identical (trivial): acgatcgattagccaIdentical (trivial): acgatcgattagccaVery similar (easy): acgaccgatgagccaSimilar (moderate): atgacggatgagcgaV di d (h d)

    21

    Very diverged (hard): atgacgggatgagcga

  • Searching databasesg

    B d l l li t f ith d t b Based on local alignments of query sequence with every database sequence

    Exhaustive / optimal / brute-force: Smith-Waterman

    Heuristic: BLAST FASTA PARALIGN SSAHA PatternHunter Heuristic: BLAST, FASTA, PARALIGN, SSAHA, PatternHunter, BLAT, ...

    Speed difference: about 50X Speed difference: about 50X Sensitivity Specificity

    22

  • Heuristic search algorithmsg

    M h i ti l ith t i d i t ith th Many heuristic algorithms creates an index into either the query sequence or the database sequences

    The index / hash key is short for sensitive algorithms and longer for quicker methods

    Allows fast lookup of matching positions Software:

    BLAST, FASTA, SSAHA, BLAT, PatternHunter Index key size

    2-7 amino acids (BLASTP) 2 7 amino acids (BLASTP) 1-2 amino acids (FASTA) 1-6 nucleotides (FASTA) 11-28 nucleotides (BLASTN MEGABLAST) 11-28 nucleotides (BLASTN, MEGABLAST) 10 nucleotides (SSAHA) 11 nucleotides (BLAT)

    23

  • Short read mappingShort read mapping

    • Input:– A reference genome– A collection of many 25-100bp tags– User-specified parameters

    • Output:– One or more genomic coordinates for each tagg g

    • In practice only 70-75% of tags • In practice, only 70 75% of tags successfully map to the reference genomegenome.

  • ACGTACGTACGTACGTACGTACGTACGTACGTACGT

    TCAGTGCATTGCACCCTGGGTGTGATCGTGCATGTG

    CACGTACGTGGTTTTGCAACGTGCTGATGCTAGCAT

  • The problemp

    I tInput: Tens of millions of reads 25-100bp long readsp g Approx 1Gbp per run Sequencing errors

    Reference genome: 3 Gbp sequence Human genome variants; genome repeats

    Requirements:Requirements: Speed Accuracy

  • Sh t d l i ftShort read analysis software

  • Short read mapping algorithmspp g g

    Q filt i Q-gram filtering At least one relatively long match A few mismatches allowed

    “Spaced seeds” Multiple short matches Multiple mismatches allowed

    Burrows wheeler Exact match

    28

  • Q-gram filteringQ g g

    I t t i t hi Inexact string matching Filtering algorithm by Baeza-Yates and Perleberg Using reads of length n and allowing for up to k mismatches

    h d d d d k d Each read divided into k+1 continous seeds Each seed has a length of n / (k+1) There must be at least one matching seed for each read

    Used by RMAP (10 nucleotide seeds)

  • Q-gram filtering

    G

    Q g g

    TCAGTGCATTGCACCCTGGGTGTGATCCTGCATGTG Read (36bp)

    ...TCAGTGCCTTGCACCCTGGGTGTGATCGTGCATGTG... Genome

    TCAGTGCATTGC ACCCTGGGTGTG ATCCTGCATGTG

    Read (36bp)

    Seeds (3 x 12bp)ACCCTGGGTGTG ATCCTGCATGTG Seeds (3 x 12bp)

    At least one of these must match even with two mismatches

  • Q-gram filteringQ g g

    All d di id d i t Seed Read PosAll reads are divided into seeds and a hash table is built containg information about all seeds

    AAAAAAAAAAAA 38723 25

    ...about all seeds

    The table contains the read id and the position within the

    d f ll d

    ACCCTGGGTGTG 1 13

    ...read for all seeds

    Fast retrieval ATCCTGCATGTG 1 25...

    CTATTGCACGTC 17 1

    ...

    GTGGTATTTGCT 4343 13

    ...

    TCAGTGCATTGC 1 1

    ...

  • S d d li tSpaced seed alignment Tags and tag-sized pieces of

    efe en e e t into m ll reference are cut into small “seeds.”

    Pairs of spaced seeds are t d i i dstored in an index.

    Look up spaced seeds for each tag.

    For each “hit,” confirm the remaining positions.

    Report results to the user.p

  • B Wh lBurrows-Wheeler Store entire reference genome. Align tag base by base from

    the end. When tag is traversed, all

    active locations are reported. If no match is found, then back

    up and try a substitution.

  • B Wh l t fBurrows-Wheeler transform

  • D t t tData structures

  • C iComparison

    Burrows-Wheeler Spaced seeds Burrows-Wheeler• Requires