Computer Science Basics for Bioinformatics
Wen-Lian Hsu
許聞廉, Institute of Information Science
http://www.iis.sinica.edu.tw/IASL/hsu/index.html
Outline
1. Information Retrieval
2. Algorithms
   1. The basics
   2. Divide and conquer
   3. Dynamic programming
   4. NP-complete problems
3. Probabilistic Models
   1. The basics
   2. Markov Process
   3. Hidden Markov Model (HMM)
Information Retrieval
• Indexing – designing a spelling checker
  – Suppose a user types "absurb" (mistyped) when "absorb" was really wanted
  – How fast can the system retrieve all potential candidates (absurd, absorb, …)?
  – What if the word was mistyped as "cbsorb"?
Indexing
• Create a database that lists, for each letter, all the words containing it:
  a: absorb, absurd, …
  b: boy, by, absorb, …
  s: sing, absorb, school, …
  o: go, absorb, origin, …
  r: acquire, absorb, …
• Which words contain "a, b, s, o, r"?
• When someone types "cbsorb":
  – Which words contain "c, b, s, o, r", or at least 4 out of these 5 letters? (See the sketch below.)
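A minimal sketch of such a letter index, assuming a toy word list; the 4-out-of-5 scoring rule follows the example above, and all names are illustrative:

```python
# Letter-based inverted index for a spelling checker (a toy sketch).
from collections import defaultdict

words = ["absorb", "absurd", "boy", "by", "sing", "school", "go", "origin", "acquire"]

# Build the index: letter -> set of words containing that letter.
index = defaultdict(set)
for w in words:
    for letter in set(w):
        index[letter].add(w)

def candidates(typo, min_hits=4):
    """Return words sharing at least `min_hits` distinct letters with the typo."""
    hits = defaultdict(int)
    for letter in set(typo):
        for w in index.get(letter, ()):
            hits[w] += 1
    return [w for w, h in hits.items() if h >= min_hits]

print(candidates("cbsorb"))   # ['absorb']: it shares b, s, o, r with the typo
```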
Indexing for Chinese Texts (I)
• Treating each web page as an entity, one can index pages by characters:
  陳: p1, p5, p9, p13, …
  水: p4, p5, p10, p13, p20, …
  扁: p5, p8, p13, p25, …
• If someone wants a page containing 「陳水扁」 (Chen Shui-bian), he might find it in p5, p13
  – However, he could also get a page with 「陳萬水乘坐扁舟」 ("Chen Wan-shui rode a small boat"), which contains all three characters but not the name
Indexing for Chinese Texts (II)
• Indexing by two characters (bi-grams):
  陳水: p5, p54, p125, …
  水扁: p54, p89, p236, …
• However, if someone wants 「台大」 (short for National Taiwan University), he could still get 「這台大冰箱」 ("this big refrigerator"), where the two characters are adjacent only by accident (see the sketch below)
• There are also many other problems:
  – Synonyms: 阿扁 / 陳總統 ("A-bian" / "President Chen"); 台灣大學 / 臺灣大學 (two spellings of "National Taiwan University")
  – Misspelling: 程總統 (程 mistyped for 陳)
  – Semantics: 陳萬水的先生 ("Chen Wan-shui's husband", a person referred to but never named)
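A minimal sketch of a bi-gram index, assuming a few made-up pages; it also reproduces the 「台大」 false positive:

```python
# Bi-gram index for Chinese text (a toy sketch; the pages are made up).
from collections import defaultdict

pages = {
    "p5":  "陳水扁昨日出訪",
    "p54": "陳水扁接見外賓",
    "p7":  "這台大冰箱很便宜",   # contains the bi-gram 台大 by accident
}

index = defaultdict(set)
for pid, text in pages.items():
    for i in range(len(text) - 1):
        index[text[i:i + 2]].add(pid)     # every adjacent character pair

def search(query):
    """Pages containing every bi-gram of the query (may include false hits)."""
    grams = [query[i:i + 2] for i in range(len(query) - 1)]
    result = index[grams[0]].copy()
    for g in grams[1:]:
        result &= index[g]
    return result

print(search("陳水扁"))   # {'p5', 'p54'}
print(search("台大"))     # {'p7'}: the refrigerator page, a false positive
```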
Semantic Annotations
• To facilitate search, we need better "indexing schemes" – schemes that are semantically oriented
• Example: 陳水扁於昨日搭機赴南美訪問友好國家 ("Chen Shui-bian flew yesterday to South America to visit friendly nations") would be annotated as 總統 (president) + 出國訪問 (overseas visit)
Treating Genomic/Proteomic Data as a Language
• An analogy of exons and introns:

Onlyaksjcbakamcnabddfkjsmallddkdfjwos
perddtrudjfdksjascdcentagedkjfdkdfjgaof
humanzidkenkdjfDNAisbelskdfjactuallyof
Snadkfjkjdmeandkfjdkslasdkingful

(The sentence "Only a small percentage of human DNA is actually meaningful" is buried in random letters, just as exons are buried among introns.)
Decoding an Unknown Language
• For proteomic data, the analogy is:
  amino acid ↔ alphabet letter
  motif ↔ word
  protein ↔ sentence
  protein structure ↔ sentence meaning
• Finding the interrelationships of data
  – Data Mining, Knowledge Discovery
DNA Intron-Exon Structure
[Diagram: promoter → transcription start site (轉錄起始點) → 5' UTR → exons and introns separated by donor/acceptor splice sites → start codon … stop codon → 3' UTR → polyA]
Matching by Templates – Boundaries of Splicing Sites
• Aligned examples:
  cydeggis
  cyedggis
  cyeeggit
  cyhgdggs
  cyrgdgnt
• Regular expression: c - y - [x2] - [dg] - g - [x] - [st]
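The template can be written directly as a regular expression; a small sketch using the five aligned sequences above:

```python
# The splice-site template c-y-[x2]-[dg]-g-[x]-[st] as a regular expression.
import re

pattern = re.compile(r"cy..[dg]g.[st]")   # '.' = any letter, [..] = choice

for seq in ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs", "cyrgdgnt"]:
    print(seq, bool(pattern.fullmatch(seq)))   # all five match the template
```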
Biological Patterns – Conserved area in multiple alignment
Matching by examples
• Existing sentences in the database (understood):
  – His old father gave me a book.
  – Joan loves Andy.
• Understanding a new sentence:
  – Mary's lovely daughter does not like John.
• Techniques:
  – Corpus analysis
  – Pattern discovery and matching: sequence, semantics (classification, transformation)
  – Structure prediction
Procedure Automation – Protein Structure Prediction
• Given a sequence, predict its structure automatically (a toy sketch follows):
  1. Find homologous (> 25% identity) sequences.
  2. If we can find one whose structure is known, carry out automated homology modeling.
  3. Otherwise, transform our sequence into another representation (secondary or super-secondary structure), e.g. IAMHSUWENLAI → HHHCCBBBB.
  4. Align the transformed sequence.
  5. If none works, go back to the "ab initio" approach.
  6. With a structure available, scan the catalytic fragments and ligand-binding sites (needs a 3D active-site database).
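A toy sketch of steps 1–2, assuming a hypothetical structure database and a naive position-wise identity measure (no real alignment is done):

```python
# Decide between homology modeling and the ab initio fallback (a toy sketch).

known_structures = {                      # hypothetical sequence -> structure
    "MKTAYIAKQR": "HHHHCCBBBB",
    "MKTWYIALQR": "HHHHCCCBBB",
}

def identity(a, b):
    """Fraction of matching positions; a stand-in for a real alignment score."""
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / n

def predict(query, threshold=0.25):
    best = max(known_structures, key=lambda s: identity(query, s))
    if identity(query, best) > threshold:
        return known_structures[best]     # step 2: homology modeling
    return None                           # step 5: fall back to "ab initio"

print(predict("MKTAYIAKQQ"))              # HHHHCCBBBB (90% identity to a known entry)
```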
Information Food Chain (Knowledge Management)
[Diagram: software robots harvest web sources (GenBank, DDBJ, EMBL, PubMed, SGD); reconciliation agents merge the results into GenomeInfo, NucleotideInfo, and ProteinInfo agents, which feed a knowledge map]
Outline (recap) – next: 2. Algorithms, 1. The basics
Basic Notations
• The big-O notation
  – "O" stands for "order of magnitude"; O(n) reads "order n"
  – f(n) = O(n) means f(n) ≤ c·n for some constant c
• Time complexity of an algorithm
  – The time needed by an algorithm, in terms of its input size (usually denoted by n)
  – For example, O(n), O(n²)
Outline (recap) – next: 2. Algorithms, 2. Divide and conquer
[Diagram: the original problem is divided into subproblems, each solved recursively; their answers are combined level by level into the final combined answer]
Divide-and-Conquer
• Break the problem into several smaller, similar subproblems. Solve the subproblems recursively, and then combine these solutions to create a solution to the original problem.
• Each level of the recursion consists of three steps: divide, conquer, and combine.
Merge Sort Example
32, 44, 15, 6, 28, 43, 17, 53
• Divide: 32, 44 | 15, 6 | 28, 43 | 17, 53
• Sort each half recursively: 32, 44 | 6, 15 | 28, 43 | 17, 53
• Merge pairs: 6, 15, 32, 44 | 17, 28, 43, 53
• Final merge: 6, 15, 17, 28, 32, 43, 44, 53
Philosophy
• Divide:
  – divide the n-element sequence into two subsequences of n/2 elements each.
• Conquer:
  – sort the two subsequences recursively using merge sort.
• Combine:
  – merge the two sorted subsequences to produce the sorted answer.
Merge Sort Analysis
• Time analysis:
  – Total time = (time required for the subproblems) + (combination time)
  – T(n) = O(1)             if n < 2
    T(n) = 2·T(n/2) + O(n)  if n ≥ 2
• Solving this recurrence gives T(n) = O(n log n) (see the implementation below)
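The three steps translate directly into code; a sketch of merge sort in Python:

```python
# Merge sort, following the divide / conquer / combine steps described above.

def merge_sort(seq):
    if len(seq) < 2:                      # base case: T(n) = O(1)
        return list(seq)
    mid = len(seq) // 2                   # divide
    left = merge_sort(seq[:mid])          # conquer: two subproblems of size n/2
    right = merge_sort(seq[mid:])
    return merge(left, right)             # combine: O(n)

def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]     # append whichever side remains

print(merge_sort([32, 44, 15, 6, 28, 43, 17, 53]))
# [6, 15, 17, 28, 32, 43, 44, 53]
```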
Outline (recap) – next: 2. Algorithms, 3. Dynamic programming
Dynamic Programming
[Figure: a layered network from source S through intermediate nodes A–H to sink T, with weighted edges; the shortest S-to-T distance is computed stage by stage]
Dynamic Programming
• Decompose a large problem into sub-problems
• Each sub-problem is identical to the original problem, except that its size is smaller
• Use the same strategy to solve the sub-problems, and store their answers in a table
• Combine solutions of the sub-problems by table "look-up" (a sketch of this idea follows)
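A minimal sketch of this idea on a shortest-path problem like the S-to-T figure; the graph is made up, since the original edge weights are not recoverable, and the memoization cache plays the role of the look-up table:

```python
# Dynamic programming for a shortest path: solve each subproblem once,
# store the answer, and combine answers by table look-up.
from functools import lru_cache

edges = {                 # node -> {successor: edge weight} (made-up graph)
    "S": {"A": 1, "B": 4},
    "A": {"C": 2, "D": 6},
    "B": {"C": 3, "D": 1},
    "C": {"T": 5},
    "D": {"T": 2},
}

@lru_cache(maxsize=None)  # the cache is the look-up table
def dist(node):
    """Shortest distance from `node` to T."""
    if node == "T":
        return 0
    return min(w + dist(nxt) for nxt, w in edges[node].items())

print(dist("S"))          # 7, via S -> B -> D -> T
```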
Example: Multiplying a Chain of Matrices
M = A B
    [10 × 20] [20 × 50]
# of multiplications = 10 · 20 · 50 = 10,000
• How should we multiply
  M = A B C D
      [13 × 5] [5 × 89] [89 × 3] [3 × 34] ?
One Possible Scenario
• M = A B C D
      [13 × 5] [5 × 89] [89 × 3] [3 × 34]
  M = ((A B) C) D
  (AB):     # of multiplications = 13 · 5 · 89 = 5,785
  (AB)C:    # of multiplications = 13 · 89 · 3 = 3,471
  ((AB)C)D: # of multiplications = 13 · 3 · 34 = 1,326
  Total # of multiplications = 10,582
All Cases
M = A B C D
    [13 × 5] [5 × 89] [89 × 3] [3 × 34]
# of multiplications:
  ((AB)C)D: 10,582
  (AB)(CD): 54,201
  (A(BC))D:  2,856  ← best
  A((BC)D):  4,055
  A(B(CD)): 26,418
The Matrix-Chain Multiplication Problem
• How do we find the best way to multiply Mi Mi+1 … Mj (denote the cost by mij), where Mi has dimensions pi-1 × pi?
• Split the chain at some k: (Mi … Mk) (Mk+1 … Mj)
  # of multiplications = mik + mk+1,j + pi-1 pk pj
• Therefore, mij = min over i ≤ k < j of { mik + mk+1,j + pi-1 pk pj }
Table Look-Up (Example)
M = M1 M2 M3 M4
    [10 × 20] [20 × 50] [50 × 1] [1 × 100]

m11 = 0,  m22 = 0,  m33 = 0,  m44 = 0
m12 = 10 · 20 · 50 = 10,000
m23 = 20 · 50 · 1 = 1,000
m34 = 50 · 1 · 100 = 5,000
m13 = min { m11 + m23 + 10 · 20 · 1,  m12 + m33 + 10 · 50 · 1 } = min { 1,200 ; 10,500 } = 1,200
m24 = min { m22 + m34 + 20 · 50 · 100,  m23 + m44 + 20 · 1 · 100 } = min { 105,000 ; 3,000 } = 3,000
m14 = min { m11 + m24 + 10 · 20 · 100,  m12 + m34 + 10 · 50 · 100,  m13 + m44 + 10 · 1 · 100 }
    = min { 23,000 ; 65,000 ; 2,200 } = 2,200
Table Look-Up (General)
M = M1 M2 … Mn
• Fill the table diagonal by diagonal:
  m11, m22, m33, …, mnn (all zero)
  m12, m23, …, m(n-1)n
  m13, m24, …
  …
  m1(n-1), m2n
  m1n
(A sketch of this table-filling order follows below.)
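A sketch of the table-filling order in code; `p` lists the matrix dimensions, so `p = [10, 20, 50, 1, 100]` encodes the example above:

```python
# Dynamic programming for matrix-chain multiplication: fill the m-table
# diagonal by diagonal, exactly as in the slides above.

def matrix_chain(p):
    """M_i has dimensions p[i-1] x p[i]; returns the minimum # of multiplications."""
    n = len(p) - 1
    m = [[0] * (n + 1) for _ in range(n + 1)]       # m[i][j], 1-indexed
    for length in range(2, n + 1):                  # one diagonal at a time
        for i in range(1, n - length + 2):
            j = i + length - 1
            m[i][j] = min(m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j]
                          for k in range(i, j))
    return m[1][n]

print(matrix_chain([10, 20, 50, 1, 100]))   # 2200, matching m14 in the example
print(matrix_chain([13, 5, 89, 3, 34]))     # 2856, the best order (A(BC))D
```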
Outline (recap) – next: 2. Algorithms, 4. NP-complete problems
NP-Complete Problems
• Complexity of problems
  – Polynomial: O(n^k)
    • Merge sort
    • Longest common subsequence
  – NP-complete
    • Not known to be polynomially solvable
    • e.g., the TSP
  – Exponential
The Traveling Salesman Problem (TSP)
• A salesman spends his time visiting n cities cyclically.
• In one tour he visits each city exactly once, and finishes up where he started.
• In what order should he visit the cities to minimize the distance traveled?
The Number of Candidate Tours Grows Exponentially
• 3 cities: 1 possible tour
• 10 cities: 181,440 possible tours
• n cities: (n-1)!/2 possible tours (see the brute-force sketch below)
• An optimal solution of the (n-1)-city problem could be useless for the n-city problem.
[Figure: optimal tours for 4 cities vs. 5 cities]
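A brute-force sketch that examines all (n-1)! orders of the remaining cities (each tour counted twice, once per direction); the distance matrix is made up:

```python
# Brute-force TSP: feasible only for very small n.
from itertools import permutations

dist = [                  # made-up symmetric distance matrix for 4 cities
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 8],
    [10, 4, 8, 0],
]
n = len(dist)

def tour_length(order):
    cities = (0,) + order + (0,)          # fix city 0 as the starting point
    return sum(dist[a][b] for a, b in zip(cities, cities[1:]))

best = min(permutations(range(1, n)), key=tour_length)
print(best, tour_length(best))            # (1, 3, 2) 23: only 3! orders here
```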
Reduction is difficult
[Figure: two small weighted graphs showing that the optimal 4-city tour need not appear inside the optimal 5-city tour]
Approximation Algorithms
• Instead of finding a best solution, one could settle for a sub-optimal solution (see the sketch below)
• Other types of algorithms:
  – Branch-and-bound, genetic algorithms, non-linear programming, numerical algorithms
  – For classification: neural nets, support vector machines
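One classic sub-optimal strategy is the nearest-neighbor heuristic, sketched here; it runs in O(n²) instead of examining (n-1)!/2 tours, but gives no optimality guarantee:

```python
# Nearest-neighbor heuristic for the TSP (a sub-optimal but fast tour).

def nearest_neighbor(dist, start=0):
    n = len(dist)
    tour, unvisited = [start], set(range(n)) - {start}
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda c: dist[last][c])   # greedy choice
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

dist = [                  # the same made-up 4-city matrix as before
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 8],
    [10, 4, 8, 0],
]
print(nearest_neighbor(dist))   # [0, 1, 3, 2]; greedy, so not guaranteed optimal
```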
A 50x50 matrix with error rate 5%
[Figure: two 50×50 matrices filled with 1s, in which roughly 5% of the entries are corrupted (shown as N, F, or P), illustrating noisy data for classification]
Outline (recap) – next: 3. Probabilistic Models, 1. The basics
Random Variables
• Suppose we toss 3 fair coins. If we let X denote the number of heads appearing, then X is a random variable taking on one of the values 0, 1, 2, 3 with respective probabilities:
P{ X=0 } = P{(T, T, T)} = 1/8P{ X=1 } = P{(T, T, H), (T, H, T), (H, T, T)} = 3/8P{ X=2 } = P{(T, H, H), (H, T, H), (H, H, T)} = 3/8P{ X=3 } = P{(H, H, H)} = 1/8
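These probabilities can be checked by enumerating the 2³ equally likely outcomes:

```python
# Enumerate all 8 outcomes of 3 fair coin tosses and count heads.
from itertools import product
from collections import Counter

counts = Counter(outcome.count("H")
                 for outcome in map("".join, product("HT", repeat=3)))
for x in sorted(counts):
    print(f"P{{X={x}}} = {counts[x]}/8")   # 1/8, 3/8, 3/8, 1/8
```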
Joint Distribution
• Suppose that 2 balls are randomly selected from an urn containing 2 red, 3 white, and 5 blue balls.
• If we let X and Y denote the number of red and white balls chosen, then the joint probability density function of X and Y is
p(i,j) = P{X=i, Y=j}
Joint Distribution
• red: 2, white: 3, blue: 5
p(0, 0) = C(5,2) / C(10,2) = 10/45
p(0, 1) = C(3,1) · C(5,1) / C(10,2) = 15/45
p(0, 2) = C(3,2) / C(10,2) = 3/45
p(1, 0) = C(2,1) · C(5,1) / C(10,2) = 10/45
p(1, 1) = C(2,1) · C(3,1) / C(10,2) = 6/45
p(2, 0) = C(2,2) / C(10,2) = 1/45
(A counting sketch follows below.)
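The same table by counting: enumerate all C(10,2) = 45 equally likely draws:

```python
# Verify the joint distribution p(i, j) by exhaustive enumeration.
from itertools import combinations
from collections import Counter

urn = ["red"] * 2 + ["white"] * 3 + ["blue"] * 5
joint = Counter((pair.count("red"), pair.count("white"))
                for pair in combinations(urn, 2))

for (i, j), c in sorted(joint.items()):
    print(f"p({i},{j}) = {c}/45")
# p(0,0)=10/45, p(0,1)=15/45, p(0,2)=3/45, p(1,0)=10/45, p(1,1)=6/45, p(2,0)=1/45
```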
Conditional Probabilities
• An urn contains 2 red, 3 white, and 5 blue balls.
• A ball is chosen at random from the urn, and it is noted that it is not one of the blue balls. What is the probability that it is white?
Conditional Probabilities
• red: 2, white: 3, blue: 5
P(White | Not Blue)
= P(White) / P(Not Blue)
= (3/10) / (5/10)
= 3/5
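The same answer by direct counting over the reduced sample space:

```python
# Among the 5 non-blue balls, 3 are white.
urn = ["red"] * 2 + ["white"] * 3 + ["blue"] * 5
not_blue = [b for b in urn if b != "blue"]
print(not_blue.count("white") / len(not_blue))   # 0.6 = 3/5
```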
Outline1. Information Retrieval2. Algorithms
1. The basics2. Divide and conquer3. Dynamic programming4. NP-complete problems
3. Probabilistic Models1. The basics2. Markov Process3. Hidden Markov Model (HMM)
Markov Process
• A "study process" as a sequence of random variables x1, x2, x3, …:
  • x1 = the primary school one studies in
  • x2 = junior high school
  • x3 = senior high school
  • x4 = university
  • x5 = graduate school
• The Markov property: the next state depends only on the current state:
  P(Xi = xi | Xi-1 = xi-1, Xi-2 = xi-2, Xi-3 = xi-3, …) = P(Xi = xi | Xi-1 = xi-1)
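A minimal simulation sketch of the Markov property; the two-state chain and its transition probabilities are made up for illustration:

```python
# Simulate a Markov chain: the next state depends only on the current one.
import random

transitions = {                         # made-up transition probabilities
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def simulate(state, steps):
    path = [state]
    for _ in range(steps):
        state = random.choices(list(transitions[state]),
                               weights=list(transitions[state].values()))[0]
        path.append(state)              # earlier history is never consulted
    return path

random.seed(0)
print(simulate("sunny", 10))
```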
Outline (recap) – next: 3. Probabilistic Models, 3. Hidden Markov Model (HMM)
Hidden Markov Model – An Example
• An occasionally dishonest casino, switching between a fair die (F) and an unfair one (U):
  Σ = {1, 2, 3, 4, 5, 6},  Q = {F, U}
  Transitions: aFF = 0.95, aFU = 0.05, aUU = 0.9, aUF = 0.1
  Emissions:
    eF(1) = eF(2) = … = eF(6) = 1/6
    eU(1) = eU(2) = … = eU(5) = 1/10,  eU(6) = 1/2
Observation Prediction
Rolls   246446644245311321631164152133625144543631656626566666
Hidden  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUUUUUUUUU
Predict FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUUUUUU

Rolls   651166453112456366646316366631623264552362666666251516
Hidden  UUUUUUFFFFFFFFUUUUUUUUUUUUUUUUFFFUUUUUUUUUUUUUUFFFFFFF
Predict UUUUUUFFFFFFFFUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUFFFFFF

Rolls   222555441666566563564324364131513465146126414626253356
Hidden  FFFFFFFFUUUUUUUUUUUUUFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFUU
Predict FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFU
Definition
• A Hidden Markov Model (HMM) is a triple M = (Σ, Q, Θ), where:
  – Σ is an alphabet of symbols
  – Q is a finite set of states capable of emitting symbols from Σ
  – Θ is a set of probabilities, comprised of
    • state transition probabilities akl, for k, l ∈ Q
    • state emission probabilities ek(b), for k ∈ Q, b ∈ Σ
• A state path π = (π1, ..., πL) in the model M is a sequence of states.
• For a sequence x = (x1, ..., xL) ∈ Σ*, the probability that x was generated by M along the state path π is

The Joint Probability P(x, π)

P(x, π) = aπ0,π1 · Π (i = 1 to L) [ eπi(xi) · aπi,πi+1 ]

(where π0 denotes the begin state)
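A sketch of Viterbi decoding for the dishonest-casino model above: it finds the state path maximizing P(x, π). The uniform start distribution is an assumption, and log probabilities are used to avoid underflow:

```python
# Viterbi decoding for the fair/unfair casino HMM.
import math

states = ["F", "U"]
a = {("F", "F"): 0.95, ("F", "U"): 0.05, ("U", "U"): 0.9, ("U", "F"): 0.1}
e = {"F": {r: 1 / 6 for r in "123456"},
     "U": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2}}

def viterbi(rolls):
    start = {"F": 0.5, "U": 0.5}       # assumed initial distribution
    v = {k: math.log(start[k]) + math.log(e[k][rolls[0]]) for k in states}
    back = []                          # back-pointers for the traceback
    for x in rolls[1:]:
        new_v, ptrs = {}, {}
        for k in states:
            prev = max(states, key=lambda l: v[l] + math.log(a[(l, k)]))
            ptrs[k] = prev
            new_v[k] = v[prev] + math.log(a[(prev, k)]) + math.log(e[k][x])
        back.append(ptrs)
        v = new_v
    path = [max(states, key=v.get)]    # best final state
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return "".join(reversed(path))

print(viterbi("266666626"))            # the run of 6s pulls the decoding toward U
```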
A Simple HMM – An Alternating Sequence of Exons and Introns
Emission probabilities (hidden states E = exon, I = intron):
  Exon:   A 0.4,  C 0.1, G 0.1, T 0.4
  Intron: A 0.05, C 0.4, G 0.5, T 0.05

Hidden:   … E E E E I I I E E E …
Observed:   A T C A A G G C G T

[State diagram: transitions between the two states with probabilities 0.9, 0.1, 0.01, 0.99]
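A small sketch that generates (hidden, observed) pairs from this exon/intron model; reading the state diagram as "E stays with 0.9, I stays with 0.99" is an assumption:

```python
# Sample a hidden state path and an observed DNA sequence from the HMM.
import random

emit = {"E": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "I": {"A": 0.05, "C": 0.4, "G": 0.5, "T": 0.05}}
stay = {"E": 0.9, "I": 0.99}           # assumed self-transition probabilities

def generate(length, state="E"):
    hidden, observed = [], []
    for _ in range(length):
        hidden.append(state)
        observed.append(random.choices("ACGT",
                        weights=[emit[state][b] for b in "ACGT"])[0])
        if random.random() > stay[state]:          # occasionally switch states
            state = "I" if state == "E" else "E"
    return "".join(hidden), "".join(observed)

random.seed(1)
h, o = generate(30)
print(h)   # hidden exon/intron labels
print(o)   # observed bases
```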