sequence comparison with dot matrices

Upload: preetirupa-saikia

Post on 06-Apr-2018

  • 8/2/2019 Sequence Comparison With Dot Matrices

    1/30

    Computational Biology, Part 2

    Sequence Comparison with Dot Matrices

    Robert F. Murphy

    Copyright 1996, 1999-2006. All rights reserved.

    Sequence Alignment

    Definition: Procedure for comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences

    Pair-wise alignment: compare two sequences

    Multiple sequence alignment: compare more than two sequences

    Example sequence alignment

    Task: align abcdef with abdgf

    Write second sequence below the first

    abcdef
    abdgf

    Move sequences to give maximum match between them

    Show characters that match using vertical bar

    Example sequence alignment

    abcdef
    ||
    abdgf

    Insert gap between b and d on lower sequence to allow d and f to align

    Example sequence alignment

    abcdef
    || | |
    ab-dgf

    Note that e and g don't match
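
The vertical-bar line in the example can be generated mechanically once the gap has been placed. The following is a minimal sketch, assuming the two rows are already gapped to equal length and that `-` marks a gap; the function name `match_line` is hypothetical and not part of the original slides.

```python
def match_line(top, bottom, gap="-"):
    """Return a line with '|' under every column where both rows agree.

    Assumes the rows are already aligned (same length); gap columns
    never count as matches.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "|" if a == b and a != gap else " "
        for a, b in zip(top, bottom)
    )


top, bottom = "abcdef", "ab-dgf"
print(top)                      # abcdef
print(match_line(top, bottom))  # || | |
print(bottom)                   # ab-dgf
```

The blank under e/g reflects the mismatch noted on the slide, and the blank under c reflects the inserted gap.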

    Matching Similarity vs. Identity

    Alignments can be based on finding only identical characters, or (more commonly) can be based on finding similar characters

    More on how to define similarity later

    Global vs. Local Alignment

    We distinguish

    Global alignment algorithms, which optimize overall alignment between two sequences

    Local alignment algorithms, which seek only relatively conserved pieces of sequence

    Local alignment stops at the ends of regions of strong similarity

    Local alignment favors finding conserved patterns in otherwise different pairs of sequences

    Global vs. Local Alignment

    Global

    LGPSSKQTGKGS-SRIWDN
    |    |  |||   |  |
    LN-ITKSAGKGAIMRLGDA

    Local

    --------GKG--------
            |||
    --------GKG--------

    Why do sequence alignments?

    To find whether two (or more) genes or proteins are evolutionarily related to each other

    To find structurally or functionally similar regions within proteins

    Origin of similar genes

    Similar genes arise by gene duplication

    Copy of a gene inserted next to the original

    Two copies mutate independently

    Each can take on separate functions

    All or part can be transferred from one part of genome to another

    Methods for Pairwise Alignment

    Dot matrix analysis

    Dynamic Programming

    Word or k-tuple methods (FASTA and BLAST)

    Sequence comparison with dot matrices

    Goal: Graphically display regions of similarity between two sequences (e.g., domains in common between two proteins of suspected similar function)

    Sequence comparison with dot matrices

    Basic Method: For two sequences of lengths M and N, lay out an M by N grid (matrix) with one sequence across the top and one sequence down the left side. For each position in the grid, compare the sequence elements at the top (column) and to the left (row). If and only if they are the same, place a dot at that position.
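
The basic method translates almost directly into code. The sketch below is illustrative only: the names `dot_matrix` and `render` are assumptions, and it reuses the two toy sequences from the earlier alignment example rather than real protein data.

```python
def dot_matrix(top, left):
    """Build the grid: grid[row][col] is True when left[row] == top[col]."""
    return [[r == c for c in top] for r in left]


def render(top, left, grid):
    """Draw the grid as text with `top` across the top and `left` down the side."""
    lines = ["  " + top]
    for ch, row in zip(left, grid):
        lines.append(ch + " " + "".join("." if hit else " " for hit in row))
    return "\n".join(lines)


top, left = "abcdef", "abdgf"
grid = dot_matrix(top, left)
print(render(top, left, grid))
```

For these inputs the dots for a and b lie on one diagonal, while d and f lie on a second diagonal shifted one column to the right, which is the offset that the gap in ab-dgf accounts for.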

    Examples for protein sequences

    (Demonstration A6, Sequence 1 vs. 2)

    (Demonstration A6, Sequence 2 vs. 3)

    Interpretation of dot matrices

    Regions of similarity appear as diagonal runs of dots

    Reverse diagonals (perpendicular to diagonal) indicate inversions

    Reverse diagonals crossing diagonals (Xs) indicate palindromes

    (Demonstration A6, Sequence 4 vs. 4)
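
Diagonal and reverse-diagonal runs can also be picked out programmatically instead of by eye. This is a rough sketch under stated assumptions: the function `diagonal_runs` and its `min_len` parameter are hypothetical, matching is by identity only, and the test string is a literal (text) palindrome. A nucleic-acid palindrome in the reverse-complement sense would need the comparison to be made against the complemented sequence instead.

```python
def diagonal_runs(top, left, min_len=3, reverse=False):
    """Return (row, col, length) for each maximal run of matching cells.

    Forward runs step (row+1, col+1); reverse runs step (row+1, col-1).
    Runs shorter than `min_len` are dropped.
    """
    step = -1 if reverse else 1
    n_rows, n_cols = len(left), len(top)
    runs = []
    for row in range(n_rows):
        for col in range(n_cols):
            if left[row] != top[col]:
                continue
            # Skip cells that merely continue a run started one step earlier.
            prev_r, prev_c = row - 1, col - step
            if prev_r >= 0 and 0 <= prev_c < n_cols and left[prev_r] == top[prev_c]:
                continue
            length, r, c = 0, row, col
            while r < n_rows and 0 <= c < n_cols and left[r] == top[c]:
                length += 1
                r += 1
                c += step
            if length >= min_len:
                runs.append((row, col, length))
    return runs


seq = "qrabcdcbast"  # contains the literal palindrome "abcdcba"
print(diagonal_runs(seq, seq))                # [(0, 0, 11)]
print(diagonal_runs(seq, seq, reverse=True))  # [(2, 8, 7)]
```

The forward run is the main diagonal of the self-comparison; the reverse run crosses it at the centre of the palindrome, giving the X shape described above.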

    Interpretation of dot matrices

    Can link or "join" separate diagonals to form alignment with "gaps"

    Each a.a. or base can only be used once

    Can't trace vertically or horizontally

    Can't double back

    A gap is introduced by each vertical or horizontal skip
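
The joining rules can be expressed as a check on a path of dots. The sketch below is one possible reading, not the course's own procedure: a path is a list of (row, column) cells in order, every step must advance in both directions, and the number of gapped positions at each step is taken to be the difference between the row advance and the column advance. The function name `gaps_along_path` is hypothetical.

```python
def gaps_along_path(path):
    """Validate a path of (row, col) dots and count gapped positions.

    Each step must move strictly down AND strictly right, so no residue
    is reused, nothing is traced vertically or horizontally, and the
    path never doubles back. A step with unequal row and column advance
    implies a gap of that difference (an assumption of this sketch).
    """
    total = 0
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        dr, dc = r1 - r0, c1 - c0
        if dr < 1 or dc < 1:
            raise ValueError(f"illegal step from {(r0, c0)} to {(r1, c1)}")
        total += abs(dr - dc)
    return total


# Dots for abcdef (columns) vs. abdgf (rows): a, b, d, f
print(gaps_along_path([(0, 0), (1, 1), (2, 3), (4, 5)]))  # 1
```

The single gap counted here corresponds to the dash in ab-dgf; the last step advances two rows and two columns, which is the e/g mismatch rather than a gap.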

    Uses for dot matrices

    Can use dot matrices to align two proteins or two nucleic acid sequences

    Can use to find amino acid repeats within a protein by comparing a protein sequence to itself

    Repeats appear as a set of diagonal runs stacked vertically and/or horizontally

    (Demonstration A6, Sequence 5 vs. 6)

    Uses for dot matrices

    Can use to find self base-pairing of an RNA (e.g., tRNA) by comparing a sequence to itself complemented and reversed

    Excellent approach for finding sequence transpositions
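
For the RNA case, the comparison partner is the reverse complement of the sequence itself. The following is a minimal sketch assuming only Watson-Crick pairs (A-U, G-C) and an upper-case RNA alphabet; G-U wobble pairs are ignored, and the names `reverse_complement` and `pairing_matrix` as well as the toy hairpin are illustrative.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Complement every base, then reverse the string."""
    return rna.translate(COMPLEMENT)[::-1]


def pairing_matrix(rna):
    """Dot matrix of `rna` (rows) against its reverse complement (columns).

    A dot at (i, j) means base i can pair with base len(rna) - 1 - j.
    """
    rc = reverse_complement(rna)
    return [[a == b for b in rc] for a in rna]


hairpin = "GGGAAAACCC"  # toy stem-loop: GGG pairs with CCC
print(reverse_complement(hairpin))  # GGGUUUUCCC
grid = pairing_matrix(hairpin)
print([grid[i][i] for i in range(len(hairpin))])
```

In this toy example the first three and last three cells of the main diagonal are dots, marking the stem, while the loop positions in between are empty.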

    Filtering to remove noise

    A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches (e.g., a single A matching another A)

    Solution: use a window and a threshold

    Compare character by character within a window (have to choose window size)

    Require a certain fraction of matches within the window in order to display it with a dot
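
The window-and-threshold filter can be sketched as follows. This is an illustration under assumptions rather than the method used by any particular program: the dot is placed at the start of each window (some tools place it at the centre), the threshold is given as a match count (`stringency`) rather than a fraction, and the function name `windowed_dot_matrix` is hypothetical.

```python
def windowed_dot_matrix(top, left, window=11, stringency=7):
    """Filtered dot matrix.

    grid[row][col] is True when at least `stringency` of the `window`
    positions starting at (row, col) match along the diagonal.
    """
    n_rows = max(len(left) - window + 1, 0)
    n_cols = max(len(top) - window + 1, 0)
    grid = []
    for row in range(n_rows):
        line = []
        for col in range(n_cols):
            matches = sum(left[row + k] == top[col + k] for k in range(window))
            line.append(matches >= stringency)
        grid.append(line)
    return grid


seq = "ACGTACGTAC"
grid = windowed_dot_matrix(seq, seq, window=4, stringency=4)
for row in grid:
    print("".join("." if hit else " " for hit in row))
```

With a window of 1 and a stringency of 1 this reduces to the unfiltered matrix; larger windows suppress isolated single-character matches.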

    Example spreadsheet with window

    (Demonstration A7)

    How do we choose a window size?

    Window size changes with goal of analysis

    size of average exon

    size of average protein structural element

    size of gene promoter

    size of enzyme active site

    How do we choose a threshold value?

    Threshold based on statistics

    using shuffled actual sequence

    find average (m) and s.d. (σ) of match scores of shuffled sequence

    convert original (unshuffled) scores (x) to Z scores

    Z = (x - m)/σ

    use threshold Z of 3 to 6

    using analysis of other sets of sequences

    provides objective standard of significance
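
The shuffling approach can be sketched in a few lines. Everything beyond the Z-score formula itself is an assumption of this sketch: scores are match counts per window, only the second sequence is shuffled, 20 shuffles are pooled, the population standard deviation is used, and the names `window_scores`, `z_scores`, `n_shuffles` and `seed` are hypothetical.

```python
import random
import statistics


def window_scores(a, b, window):
    """Match count for every window-sized diagonal segment of a vs. b."""
    return [
        sum(a[i + k] == b[j + k] for k in range(window))
        for i in range(len(a) - window + 1)
        for j in range(len(b) - window + 1)
    ]


def z_scores(a, b, window=11, n_shuffles=20, seed=0):
    """Convert window scores x to Z = (x - m) / sigma.

    m and sigma are the mean and s.d. of the scores obtained when b is
    replaced by shuffled copies of itself (same composition, random order).
    """
    rng = random.Random(seed)
    background = []
    for _ in range(n_shuffles):
        shuffled = list(b)
        rng.shuffle(shuffled)
        background.extend(window_scores(a, shuffled, window))
    m = statistics.mean(background)
    sigma = statistics.pstdev(background)
    if sigma == 0:
        raise ValueError("background scores have zero spread")
    return [(x - m) / sigma for x in window_scores(a, b, window)]


seq = "ACGTTGCAAGCTTAGGCTAACGT"
z = z_scores(seq, seq, window=5)
print(sum(score >= 3 for score in z), "windows reach Z >= 3")
```

Windows whose Z score reaches the chosen threshold would then be drawn as dots; shuffling preserves composition, so the background reflects what chance matches look like for these particular sequences.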

    Dot matrix analysis with DNA Strider (Mount, Fig 3.4)

    Get phage λ cI and phage P22 c2 repressor sequences from GenBank (X00166 and V01153 respectively)

    Use DNA Strider 1.4 (contact TA to get a copy)

    Use window size of 11 and stringency of 7

    Dot matrix (Mount Fig 3.4)

    [Figure: dot matrix; axes ticked every 100 positions, one axis to 600 and the other to 700]

    Note set of diagonals in lower right that do not line up due to insertion near 475 on cI

    Dot matrix analysis with DNA Strider (Mount, Fig 3.6)

    Get human LDL receptor protein sequence from GenBank (P01130)

    Use weighting Identity

    Use window size of 1 and stringency of 1

    Use window size of 23 and stringency of 7

    Dot matrix (Mount Fig 3.6)

    W=1 S=1

    [Figure: dot matrix; both axes ticked every 100 positions from 100 to 800]

    Note set of stacked diagonals in upper left

    Dot matrix (Mount Fig 3.6)

    W=23 S=7

    [Figure: dot matrix; both axes ticked every 100 positions from 100 to 800]

    Note set of stacked diagonals in upper left

    Reading for next class

    Mount, Chapter 3 through page 93

    Look over paper by Needleman and Wunsch on web site

    (03-510/710) Durbin et al, pp 17-32