pattern and string matching tools biology 162 computational genetics todd vision 9 sep 2004


Pattern and string matching tools

Biology 162 Computational Genetics

Todd Vision

9 Sep 2004

Some more pattern and string matching tools

• Simple signatures
  – Logos
  – Position-specific Scoring Matrices
  – PSI-BLAST
• Regular expressions
• Suffix trees

Sequence logos

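• Entropy of column j is denoted H_j
• Information content of column j is denoted I_j
• How to draw a logo
  – Height of column j is given by I_j
  – Height of each symbol = f_ij × I_j, where f_ij is the frequency of symbol i in column j

$$H_j = -\sum_i f_{ij} \log_2 f_{ij}$$

$$I_j = \log_2 20 - H_j$$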
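The logo heights can be sketched in a few lines of Python. This is only an illustration of the two formulas for H_j and I_j: the function name `logo_column` and the `alphabet_size` parameter are assumptions made here (20 for proteins, as on the slide), and no small-sample correction is applied.

```python
import math
from collections import Counter


def logo_column(column, alphabet_size=20):
    """Return (H_j, I_j, symbol heights) for one alignment column.

    `column` is a string holding the residues observed in column j.
    """
    counts = Counter(column)
    n = len(column)
    # f_ij: observed frequency of each symbol i in this column
    freqs = {sym: c / n for sym, c in counts.items()}
    # H_j = -sum_i f_ij * log2(f_ij); unseen symbols contribute nothing
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    # I_j = log2(alphabet size) - H_j
    info = math.log2(alphabet_size) - entropy
    # Each symbol is drawn with height f_ij * I_j
    heights = {sym: f * info for sym, f in freqs.items()}
    return entropy, info, heights


if __name__ == "__main__":
    h, i, heights = logo_column("MMMMLLIV")
    print(f"H = {h:.2f} bits, I = {i:.2f} bits")
    for sym, height in sorted(heights.items(), key=lambda kv: -kv[1]):
        print(sym, f"{height:.2f}")
```

A perfectly conserved column has H_j = 0 and therefore reaches the full height of log2 20 bits.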

Information content

• Information/uncertainty is expressed in bits
  – There is a natural relationship to log base 2
• Imagine 64 shells, under one of which is a ball
  – 6 yes/no guesses, each halving the remaining shells, are required to find the ball
  – In this case, maximal uncertainty is log2 64 = 6 bits
• In the case of 20 amino acids, maximal uncertainty is log2 20 ≈ 4.32 bits
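The shell game can be checked with a short sketch. It assumes each guess is a yes/no question that rules out half of the remaining options; the function name `questions_needed` is made up for this illustration.

```python
import math


def questions_needed(n_options):
    """Count the yes/no questions needed to single out one of n_options."""
    questions = 0
    while n_options > 1:
        # In the worst case the larger half of the options remains
        n_options = math.ceil(n_options / 2)
        questions += 1
    return questions


if __name__ == "__main__":
    for n in (64, 20):
        print(n, questions_needed(n), round(math.log2(n), 2))
```

For 64 shells the count equals log2 64 exactly. For 20 amino acids the count is log2 20 rounded up, because a question cannot be asked fractionally.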

Position-Specific Scoring Matrix

• Constructed from conserved columns of a MSA

• Log odds scores for each residue in each column, based on
  – Frequency of residue within column
  – Background frequency of residues
• Takes advantage of the fact that columns differ in
  – Composition
  – Levels of conservation

Position Specific Scoring Matrix

pos con    A   R   N   D   C  …    A   R   N   D   C  …   Inf   Pseu
  1  M    -1  -3  -3  -4  -1  …    0   0   0   0   0  …   0.50  0.16
  2  W    -3  -3  -4  -5  -3  …    0   0   0   0   0  …   2.32  0.26
  3  I    -1  -3  -2  -3   7  …    0   0   0  36   0  …   0.71  0.26
  4  L    -2  -3  -2  -3  -3  …    0   0   0   0   0  …   0.47  0.35
  5  A     4  -2  -2  -2  -2  …   56   0   0   0   0  …   0.52  0.35

PSI-BLAST PSSM for DSCAM

Pseudocounts

• If a residue is never seen in a particular column of a MSA
  – What is the probability of ever seeing it there?
  – Not really zero…
• Pseudocounts are added to actual counts to account for uncertainty in column frequencies
• Many methods
  – Laplace’s Rule
    • Add one to every count
    • Pseudocounts grow less important as sample size gets large
  – Methods related to Bayesian priors - we will see later
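Laplace's rule is simple enough to sketch directly. The function name `laplace_frequencies` and the alphabet ordering are assumptions made for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def laplace_frequencies(column, alphabet=AMINO_ACIDS):
    """Estimate residue frequencies in one column, adding one to every count."""
    counts = Counter(column)
    # One pseudocount per symbol is added to the denominator as well
    total = len(column) + len(alphabet)
    return {sym: (counts[sym] + 1) / total for sym in alphabet}


if __name__ == "__main__":
    small = laplace_frequencies("MMMM")
    large = laplace_frequencies("M" * 1000)
    print(round(small["M"], 3), round(small["W"], 3))
    print(round(large["M"], 3), round(large["W"], 3))
```

With 4 sequences the unseen residue W still gets probability 1/24; with 1000 sequences it drops to 1/1020, which illustrates how the pseudocounts matter less as the sample grows.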

Calculating scores in a PSSM

• Sij is score for residue i at position j

• xij is position-specific count of residue i

• fi is background frequency of residue i

• bij are pseudocounts

• N sequences in alignment

$$S_{ij} = \log_2 \left[ \frac{(x_{ij} + b_{ij}) \left( N + \sum_i b_{ij} \right)^{-1}}{f_i} \right]$$
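As a rough sketch, the score for one column can be computed as below. The function name, the dictionary layout, and the four-letter toy alphabet in the usage example are all assumptions for illustration; the scores reported by PSI-BLAST itself are scaled and rounded differently.

```python
import math


def pssm_column_scores(counts, pseudocounts, background, n_sequences):
    """Log-odds scores S_ij for every residue i in one column j.

    counts       -- x_ij, observed count of each residue in the column
    pseudocounts -- b_ij, pseudocount for each residue in the column
    background   -- f_i, background frequency of each residue
    n_sequences  -- N, number of sequences in the alignment
    """
    denominator = n_sequences + sum(pseudocounts.values())
    scores = {}
    for residue, f_i in background.items():
        # Pseudocount-adjusted frequency of the residue in this column
        q_ij = (counts.get(residue, 0) + pseudocounts[residue]) / denominator
        scores[residue] = math.log2(q_ij / f_i)
    return scores


if __name__ == "__main__":
    alphabet = "ACGT"  # toy alphabet, to keep the example short
    scores = pssm_column_scores(
        counts={"A": 8},
        pseudocounts={s: 1 for s in alphabet},
        background={s: 0.25 for s in alphabet},
        n_sequences=8,
    )
    for residue, score in scores.items():
        print(residue, round(score, 3))
```

Residues seen more often than the background predicts get positive scores, and unseen residues get a finite negative score instead of minus infinity because of the pseudocounts.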

PSI-BLAST

• Can identify more distant homologs than possible via pairwise BLAST

• Iterative BLAST
  – After 1st iteration, multiple alignment is computed for query and top matches
  – PSSM generated from alignment
  – PSSM used for subsequent iterations
  – PSSM refined each iteration

PSI-BLAST

• Once high-scoring words are generated from the PSSM, the algorithm proceeds as before
  – Still very fast
• λ and K must be recalculated for each iteration

Regular Expressions (regex)

• Can be thought of as a non-probabilistic rule for generating (or matching) a pattern

• Used for
  – DNA/Protein signatures (e.g. Prosite)
  – Text parsing (e.g. in Perl)

Prosite regexes

ID   CBD_FUNGAL; PATTERN.
AC   PS00562;
DT   DEC-1991 (CREATED); NOV-1997 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE   Cellulose-binding domain, fungal type.
PA   C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C

In Perl regex syntax:

CGG\w{4,7}G\w{3}C\w{5}C\w{3,5}[NHG]\w[FYWM]\w{2}QC

In words:

C followed by G followed by G followed by any 4 to 7 letters followed by G followed by any 3 letters followed by C followed by any 5 letters followed by C followed by any 3 to 5 letters followed by one of N, H or G, followed by any letter followed by one of F, Y, W, or M followed by any two letters followed by Q followed by C
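The translation from Prosite notation to a regex is mechanical, so it can be sketched as a small converter. The function name `prosite_to_regex` is an assumption, and the sketch covers only a subset of the Prosite syntax: single residues, `x`, `[...]` choices, `{...}` exclusions and `(n)` or `(n,m)` repeats. Anchors and other extensions are not handled. Python's `re` module accepts the same syntax as Perl for these constructs.

```python
import re

# One Prosite element: a core (x, a residue, [..] or {..}) plus an optional repeat
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple Prosite pattern into a Perl/Python-style regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported Prosite element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = r"\w"                      # any residue, as on the slide
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


if __name__ == "__main__":
    pa = "C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C"
    regex = prosite_to_regex(pa)
    print(regex)
    # A made-up sequence that satisfies the pattern, for demonstration only
    print(bool(re.search(regex, "CGGAAAAGAAACAAAAACAAANAFAAQC")))
```

Applied to the CBD_FUNGAL pattern, the converter reproduces the Perl regex given on the slide.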

Perl regex metacharacters

• [ ] - character class (e.g. [abc] = a, b or c)
• {min, max} - quantifiers
• {exactly}
• * - repetition, zero or more
• + - repetition, one or more
• ? - optional, zero or one
• . - wildcard (any character)
• ( ) - capture or delimit substrings
• | - alternation (e.g. (a|b) = either a or b)

Regular expressions

Pattern     Matches
a[bc]d      abd, acd
ab{2,5}c    abbc, abbbc, …, abbbbbc
ab*c        ac, abc, abbc, …
ab+c        abc, abbc, …
ab?c        ac, abc
a(bc|de)    abc, ade
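These patterns can be tried out directly. The sketch below uses Python's `re` module, whose syntax for these metacharacters is the same as Perl's; the helper name `matches` and the choice of test strings are assumptions for illustration. Note that `ab{2,5}c` requires at least two b's, so `abc` does not match it.

```python
import re

# Pattern -> a few strings it should match in full
EXAMPLES = {
    r"a[bc]d": ["abd", "acd"],
    r"ab{2,5}c": ["abbc", "abbbc", "abbbbbc"],
    r"ab*c": ["ac", "abc", "abbc"],
    r"ab+c": ["abc", "abbc"],
    r"ab?c": ["ac", "abc"],
    r"a(bc|de)": ["abc", "ade"],
}


def matches(pattern, text):
    """True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    for pattern, texts in EXAMPLES.items():
        for text in texts:
            print(f"{pattern:10} {text:8} {matches(pattern, text)}")
    # Near misses
    print(matches(r"ab{2,5}c", "abc"), matches(r"ab+c", "ac"))
```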

Regular expressions: limitations

• Non-probabilistic: all matches match equally well
  – Hidden Markov models improve upon this
• Cannot model dependencies among different positions
  – Neither can HMMs
  – For RNA matches, where dependencies matter, we need to allow more complex rules

Chomsky hierarchy of transformational grammars: a preview

• General theory for modelling strings of symbols used in linguistics
  – Regular grammars
  – Context-free grammars
  – Context-sensitive grammars
  – Unrestricted grammars
• Regular grammars (like regexes) are easy to parse, but are structurally limited
• We will see context-sensitive grammars for modelling RNA sequences

Suffix Trees

• Data structure used for fast matching of sequence patterns

• Helps to explain how BLAST can find word matches so fast

• Commonly used for
  – Exact matching
  – Identifying repeated sequences

Suffix Trees

• Rooted, directed tree for string S
• |S| = m leaves, labeled 1..m
• Edges labelled with substrings of S
• Internal node has at most one edge for each symbol in alphabet
• Concatenation of edge labels on path from root to leaf i equals suffix S[i..m]

Suffix Trees: An Example

S = ‘gatgac’

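[Figure: suffix tree for S = ‘gatgac’. From the root: edge ‘tgac’ to leaf 3; edge ‘c’ to leaf 6; edge ‘a’ to an internal node with edges ‘c’ (leaf 5) and ‘tgac’ (leaf 2); edge ‘ga’ to an internal node with edges ‘c’ (leaf 4) and ‘tgac’ (leaf 1).]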
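The tree for ‘gatgac’ can be rebuilt with a naive construction that inserts one suffix at a time and splits an edge wherever a new suffix diverges from an existing label. This is a sketch only: it runs in O(m^2) time rather than the linear time of the specialised algorithms, the names `Node`, `build_suffix_tree` and `leaf_paths` are made up here, and it assumes the last symbol of S occurs nowhere else (otherwise a terminator such as `$` has to be appended).

```python
class Node:
    """One node of a suffix tree."""

    def __init__(self):
        self.children = {}  # first symbol of edge label -> (edge label, child)
        self.leaf = None    # 1-based start of the suffix; set on leaves only


def build_suffix_tree(s):
    """Naive suffix tree construction: insert each suffix in turn."""
    if not s or s.count(s[-1]) != 1:
        raise ValueError("last symbol must be unique; append e.g. '$'")
    root = Node()
    for start in range(len(s)):
        node, rest = root, s[start:]
        while True:
            entry = node.children.get(rest[0])
            if entry is None:
                # No edge starts with this symbol: hang a new leaf here
                leaf = Node()
                leaf.leaf = start + 1
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = entry
            # Length of the common prefix of the edge label and the suffix
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Suffix diverges inside the edge: split it at position k
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaf_paths(node, prefix=""):
    """Map each leaf number to the concatenated edge labels leading to it."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    paths = {}
    for label, child in node.children.values():
        paths.update(leaf_paths(child, prefix + label))
    return paths


if __name__ == "__main__":
    tree = build_suffix_tree("gatgac")
    for leaf, path in sorted(leaf_paths(tree).items()):
        print(leaf, path)
```

Each leaf i is reached by a path that spells out the suffix starting at position i, and the root has one outgoing edge per distinct first symbol, including the shared edge ‘ga’ for leaves 1 and 4.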

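Least common ancestor

• The LCA of two leaves corresponds to the longest shared prefix of their suffixes (e.g. the path labeled ‘ga’ for leaves 1 and 4)
• After linear-time preprocessing of the tree, the LCA can be retrieved in constant time

[Figure: the suffix tree for S = ‘gatgac’ from the example.]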

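If suffix trees are the answer, what is the question?

• Rapid word matching
• Find all occurrences of ‘ga’ in S = ‘gatgac’
  – Follow the path ‘ga’ from the root; the leaves below it (1 and 4) give the starting positions

[Figure: the suffix tree for S = ‘gatgac’ from the example.]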
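Word matching can be sketched as follows. For brevity the example uses an uncompressed suffix trie (one symbol per edge, quadratic space) built from nested dictionaries, not a true suffix tree; the lookup logic is the same. The function names and the use of `None` as the leaf key are assumptions for this illustration.

```python
def build_suffix_trie(s):
    """Nested-dict trie of all suffixes of s; key None stores the 1-based start."""
    root = {}
    for start in range(len(s)):
        node = root
        for symbol in s[start:]:
            node = node.setdefault(symbol, {})
        node[None] = start + 1
    return root


def find_occurrences(trie, pattern):
    """Return the sorted 1-based start positions of pattern in the indexed string."""
    node = trie
    for symbol in pattern:
        if symbol not in node:
            return []  # the pattern does not occur at all
        node = node[symbol]
    # Every suffix ending below this node begins with the pattern
    starts, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, value in current.items():
            if key is None:
                starts.append(value)
            else:
                stack.append(value)
    return sorted(starts)


if __name__ == "__main__":
    trie = build_suffix_trie("gatgac")
    print(find_occurrences(trie, "ga"))
```

The walk down the trie takes time proportional to the pattern length, independent of the length of S; collecting the leaves then costs time proportional to the size of the subtree below the matched node.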

If suffix trees are the answer, what is the question?

• Longest common substring problem
• Find the starting positions, length and identity of the longest substring that occurs in both S1 and S2

S1 = ‘gatgac’

S2 = ‘gatcac’

[Figure: joint suffix tree for S1 and S2.]
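A sketch of the longest common substring search: all suffixes of both strings are inserted into one joint structure, and the deepest node visited by suffixes of both strings spells out the answer. As a simplification, the example uses an uncompressed joint suffix trie, which takes quadratic time and space, where a real joint suffix tree would be linear. The names `TrieNode`, `first_start` and `longest_common_substring` are assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # symbol -> TrieNode
        self.first_start = {}  # string index (0 or 1) -> 1-based start position


def longest_common_substring(s1, s2):
    """Return (substring, start in s1, start in s2), 1-based, or ('', None, None)."""
    root = TrieNode()
    best = ("", None, None)
    for which, s in enumerate((s1, s2)):
        for start in range(len(s)):
            node = root
            for depth, symbol in enumerate(s[start:], 1):
                node = node.children.setdefault(symbol, TrieNode())
                # Remember the first suffix of each string passing through here
                node.first_start.setdefault(which, start + 1)
                # A node reached from both strings spells a common substring
                if len(node.first_start) == 2 and depth > len(best[0]):
                    best = (s[start:start + depth],
                            node.first_start[0], node.first_start[1])
    return best


if __name__ == "__main__":
    print(longest_common_substring("gatgac", "gatcac"))
```

For the two strings on the slide the result is ‘gat’, length 3, starting at position 1 in both. When several substrings tie for the longest, this sketch reports the first one encountered.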

If suffix trees are the answer, what is the question?

• Find all direct palindromes (a substring concatenated with its reverse) in S = ‘agattagct’
• Observation
  – Let Sr = ‘tcgattaga’, the reverse of S, and let m = |S|
  – If a palindrome is centered between q and q+1 of S, then it is also centered between m-q and m-q+1 of Sr
• Solution
  – Construct the joint suffix tree for S and Sr, and find the least common ancestor for all pairs of suffixes q+1 of S and m-q+1 of Sr
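The observation can be tested with a short sketch. In place of the constant-time least common ancestor query on a joint suffix tree, the example compares the two suffixes symbol by symbol, which gives the same answers more slowly. The function names are assumptions for this illustration.

```python
def common_prefix_length(a, b):
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def direct_palindromes(s):
    """Maximal palindromes centered between positions q and q+1 (1-based).

    Returns a list of (q, palindrome) pairs.
    """
    m = len(s)
    sr = s[::-1]
    found = []
    for q in range(1, m):
        # Suffix q+1 of S against suffix m-q+1 of Sr (the reverse of S[1..q])
        k = common_prefix_length(s[q:], sr[m - q:])
        if k > 0:
            found.append((q, s[q - k:q + k]))
    return found


if __name__ == "__main__":
    print(direct_palindromes("agattagct"))
```

For S = ‘agattagct’ the only center with a match is q = 4, where the shared prefix ‘tag’ gives the palindrome ‘gattag’.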

Myriad uses for suffix trees

• Direct and inverted repeats
  – Microsatellites
  – Transposons
• Inverted palindromes
  – Restriction enzyme recognition sites
• Imperfect matches
• Algorithmic efficiency
  – Many efficient algorithms for traversing suffix trees
  – The trees themselves can be constructed in O(m) time


Reading assignment (for Tuesday and Thursday)

• Durbin et al. (1998) pgs. 46-79 in Biological Sequence Analysis.
  – Markov chains
  – Hidden Markov models
