Computer Science Basics for Bioinformatics
Wen-Lian Hsu
許聞廉, Institute of Information Science
http://www.iis.sinica.edu.tw/IASL/hsu/index.html
Outline
1. Information Retrieval
2. Algorithms
   1. The basics
   2. Divide and conquer
   3. Dynamic programming
   4. NP-complete problems
3. Probabilistic Models
   1. The basics
   2. Markov Process
   3. Hidden Markov Model (HMM)
Information Retrieval
• Indexing – designing a spelling checker
  – Suppose a user types "absurb" (mistyped) when "absorb" was really wanted
  – How fast can the system retrieve all potential candidates (absurd, absorb, …)?
  – What if the word was mistyped as "cbsorb"?
Indexing
• Create a database that lists, for each letter, all the words containing it:
  a: absorb, absurd, …
  b: boy, by, absorb, …
  s: sing, absorb, school, …
  o: go, absorb, origin, …
  r: acquire, absorb, …
• Which words contain "a, b, s, o, r"?
• When someone types "cbsorb":
  – Which words contain "c, b, s, o, r", or at least 4 out of these 5 letters? (See the sketch below.)
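A minimal sketch of such a letter index, assuming a toy word list; the 4-out-of-5 scoring rule follows the example above, and all names are illustrative:

```python
# Letter-based inverted index for a spelling checker (a toy sketch).
from collections import defaultdict

words = ["absorb", "absurd", "boy", "by", "sing", "school", "go", "origin", "acquire"]

# Build the index: letter -> set of words containing that letter.
index = defaultdict(set)
for w in words:
    for letter in set(w):
        index[letter].add(w)

def candidates(typo, min_hits=4):
    """Return words sharing at least `min_hits` distinct letters with the typo."""
    hits = defaultdict(int)
    for letter in set(typo):
        for w in index.get(letter, ()):
            hits[w] += 1
    return [w for w, h in hits.items() if h >= min_hits]

print(candidates("cbsorb"))   # ['absorb']: it shares b, s, o, r with the typo
```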
Indexing for Chinese Texts (I)
• Treating each web page as an entity, one can index pages by characters:
  陳: p1, p5, p9, p13, …
  水: p4, p5, p10, p13, p20, …
  扁: p5, p8, p13, p25, …
• If someone wants a page containing 「陳水扁」 (Chen Shui-bian), he might find it in p5, p13
  – However, he could also get a page with 「陳萬水乘坐扁舟」 ("Chen Wan-shui rode a small boat"), which contains all three characters but not the name
Indexing for Chinese Texts (II)
• Indexing by two characters (bi-grams):
  陳水: p5, p54, p125, …
  水扁: p54, p89, p236, …
• However, if someone wants 「台大」 (short for National Taiwan University), he could still get 「這台大冰箱」 ("this big refrigerator"), where the two characters are adjacent only by accident (see the sketch below)
• There are also many other problems:
  – Synonyms: 阿扁 / 陳總統 ("A-bian" / "President Chen"); 台灣大學 / 臺灣大學 (two spellings of "National Taiwan University")
  – Misspelling: 程總統 (程 mistyped for 陳)
  – Semantics: 陳萬水的先生 ("Chen Wan-shui's husband", a person referred to but never named)
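A minimal sketch of a bi-gram index, assuming a few made-up pages; it also reproduces the 「台大」 false positive:

```python
# Bi-gram index for Chinese text (a toy sketch; the pages are made up).
from collections import defaultdict

pages = {
    "p5":  "陳水扁昨日出訪",
    "p54": "陳水扁接見外賓",
    "p7":  "這台大冰箱很便宜",   # contains the bi-gram 台大 by accident
}

index = defaultdict(set)
for pid, text in pages.items():
    for i in range(len(text) - 1):
        index[text[i:i + 2]].add(pid)     # every adjacent character pair

def search(query):
    """Pages containing every bi-gram of the query (may include false hits)."""
    grams = [query[i:i + 2] for i in range(len(query) - 1)]
    result = index[grams[0]].copy()
    for g in grams[1:]:
        result &= index[g]
    return result

print(search("陳水扁"))   # {'p5', 'p54'}
print(search("台大"))     # {'p7'}: the refrigerator page, a false positive
```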
Semantic Annotations
• To facilitate search, we need better "indexing schemes" – schemes that are semantically oriented
• Example: 陳水扁於昨日搭機赴南美訪問友好國家 ("Chen Shui-bian flew yesterday to South America to visit friendly nations") would be annotated as 總統 (president) + 出國訪問 (overseas visit)
Treating Genomic/Proteomic Data as a Language
• An analogy of exons and introns:

Onlyaksjcbakamcnabddfkjsmallddkdfjwos
perddtrudjfdksjascdcentagedkjfdkdfjgaof
humanzidkenkdjfDNAisbelskdfjactuallyof
Snadkfjkjdmeandkfjdkslasdkingful

(The sentence "Only a small percentage of human DNA is actually meaningful" is buried in random letters, just as exons are buried among introns.)
Decoding an Unknown Language
• For proteomic data, the analogy is:
  amino acid ↔ alphabet letter
  motif ↔ word
  protein ↔ sentence
  protein structure ↔ sentence meaning
• Finding the interrelationships of data
  – Data Mining, Knowledge Discovery
DNA Intron-Exon Structure
[Diagram: promoter → transcription start site (轉錄起始點) → 5' UTR → exons and introns separated by donor/acceptor splice sites → start codon … stop codon → 3' UTR → polyA]
Matching by Templates – Boundaries of Splicing Sites
• Aligned examples:
  cydeggis
  cyedggis
  cyeeggit
  cyhgdggs
  cyrgdgnt
• Regular expression: c - y - [x2] - [dg] - g - [x] - [st]
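The template can be written directly as a regular expression; a small sketch using the five aligned sequences above:

```python
# The splice-site template c-y-[x2]-[dg]-g-[x]-[st] as a regular expression.
import re

pattern = re.compile(r"cy..[dg]g.[st]")   # '.' = any letter, [..] = choice

for seq in ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs", "cyrgdgnt"]:
    print(seq, bool(pattern.fullmatch(seq)))   # all five match the template
```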
Biological Patterns – Conserved area in multiple alignment
Matching by examples
• Existing sentences in the database (understood):
  – His old father gave me a book.
  – Joan loves Andy.
• Understanding a new sentence:
  – Mary's lovely daughter does not like John.
• Techniques:
  – Corpus analysis
  – Pattern discovery and matching: sequence, semantics (classification, transformation)
  – Structure prediction
Procedure Automation – Protein Structure Prediction
• Given a sequence, predict its structure automatically (a toy sketch follows):
  1. Find homologous (> 25% identity) sequences.
  2. If we can find one whose structure is known, carry out automated homology modeling.
  3. Otherwise, transform our sequence into another representation (secondary or super-secondary structure), e.g. IAMHSUWENLAI → HHHCCBBBB.
  4. Align the transformed sequence.
  5. If none works, go back to the "ab initio" approach.
  6. With a structure available, scan the catalytic fragments and ligand-binding sites (needs a 3D active-site database).
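A toy sketch of steps 1–2, assuming a hypothetical structure database and a naive position-wise identity measure (no real alignment is done):

```python
# Decide between homology modeling and the ab initio fallback (a toy sketch).

known_structures = {                      # hypothetical sequence -> structure
    "MKTAYIAKQR": "HHHHCCBBBB",
    "MKTWYIALQR": "HHHHCCCBBB",
}

def identity(a, b):
    """Fraction of matching positions; a stand-in for a real alignment score."""
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / n

def predict(query, threshold=0.25):
    best = max(known_structures, key=lambda s: identity(query, s))
    if identity(query, best) > threshold:
        return known_structures[best]     # step 2: homology modeling
    return None                           # step 5: fall back to "ab initio"

print(predict("MKTAYIAKQQ"))              # HHHHCCBBBB (90% identity to a known entry)
```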
Information Food Chain (Knowledge Management)
[Diagram: software robots harvest web sources (GenBank, DDBJ, EMBL, PubMed, SGD); reconciliation agents merge the results into GenomeInfo, NucleotideInfo, and ProteinInfo agents, which feed a knowledge map]
Outline (recap) – next: 2. Algorithms, 1. The basics
Basic Notations
• The big-O notation
  – "O" stands for "order of magnitude"; O(n) reads "order n"
  – f(n) = O(n) means f(n) ≤ c·n for some constant c
• Time complexity of an algorithm
  – The time needed by an algorithm, in terms of its input size (usually denoted by n)
  – For example, O(n), O(n²)
Outline (recap) – next: 2. Algorithms, 2. Divide and conquer
[Diagram: the original problem is divided into subproblems, each solved recursively; their answers are combined level by level into the final combined answer]
Divide-and-Conquer
• Break the problem into several smaller, similar subproblems. Solve the subproblems recursively, and then combine these solutions to create a solution to the original problem.
• Each level of the recursion consists of three steps: divide, conquer, and combine.
Merge Sort Example
32, 44, 15, 6, 28, 43, 17, 53
• Divide: 32, 44 | 15, 6 | 28, 43 | 17, 53
• Sort each half recursively: 32, 44 | 6, 15 | 28, 43 | 17, 53
• Merge pairs: 6, 15, 32, 44 | 17, 28, 43, 53
• Final merge: 6, 15, 17, 28, 32, 43, 44, 53
Philosophy
• Divide:
  – divide the n-element sequence into two subsequences of n/2 elements each.
• Conquer:
  – sort the two subsequences recursively using merge sort.
• Combine:
  – merge the two sorted subsequences to produce the sorted answer.
Merge Sort Analysis
• Time analysis:
  – Total time = (time required for the subproblems) + (combination time)
  – T(n) = O(1)             if n < 2
    T(n) = 2·T(n/2) + O(n)  if n ≥ 2
• Solving this recurrence gives T(n) = O(n log n) (see the implementation below)
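The three steps translate directly into code; a sketch of merge sort in Python:

```python
# Merge sort, following the divide / conquer / combine steps described above.

def merge_sort(seq):
    if len(seq) < 2:                      # base case: T(n) = O(1)
        return list(seq)
    mid = len(seq) // 2                   # divide
    left = merge_sort(seq[:mid])          # conquer: two subproblems of size n/2
    right = merge_sort(seq[mid:])
    return merge(left, right)             # combine: O(n)

def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]     # append whichever side remains

print(merge_sort([32, 44, 15, 6, 28, 43, 17, 53]))
# [6, 15, 17, 28, 32, 43, 44, 53]
```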
Outline (recap) – next: 2. Algorithms, 3. Dynamic programming
Dynamic Programming
[Figure: a layered network from source S through intermediate nodes A–H to sink T, with weighted edges; the shortest S-to-T distance is computed stage by stage]
Dynamic Programming
• Decompose a large problem into sub-problems
• Each sub-problem is identical to the original problem, except that its size is smaller
• Use the same strategy to solve the sub-problems, and store their answers in a table
• Combine solutions of the sub-problems by table "look-up" (a sketch of this idea follows)
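A minimal sketch of this idea on a shortest-path problem like the S-to-T figure; the graph is made up, since the original edge weights are not recoverable, and the memoization cache plays the role of the look-up table:

```python
# Dynamic programming for a shortest path: solve each subproblem once,
# store the answer, and combine answers by table look-up.
from functools import lru_cache

edges = {                 # node -> {successor: edge weight} (made-up graph)
    "S": {"A": 1, "B": 4},
    "A": {"C": 2, "D": 6},
    "B": {"C": 3, "D": 1},
    "C": {"T": 5},
    "D": {"T": 2},
}

@lru_cache(maxsize=None)  # the cache is the look-up table
def dist(node):
    """Shortest distance from `node` to T."""
    if node == "T":
        return 0
    return min(w + dist(nxt) for nxt, w in edges[node].items())

print(dist("S"))          # 7, via S -> B -> D -> T
```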
Example: Multiplying a Chain of Matrices
M = A B
    [10 × 20] [20 × 50]
# of multiplications = 10 · 20 · 50 = 10,000
• How should we multiply
  M = A B C D
      [13 × 5] [5 × 89] [89 × 3] [3 × 34] ?
One Possible Scenario
• M = A B C D
      [13 × 5] [5 × 89] [89 × 3] [3 × 34]
  M = ((A B) C) D
  (AB):     # of multiplications = 13 · 5 · 89 = 5,785
  (AB)C:    # of multiplications = 13 · 89 · 3 = 3,471
  ((AB)C)D: # of multiplications = 13 · 3 · 34 = 1,326
  Total # of multiplications = 10,582
All Cases
M = A B C D
    [13 × 5] [5 × 89] [89 × 3] [3 × 34]
# of multiplications:
  ((AB)C)D: 10,582
  (AB)(CD): 54,201
  (A(BC))D:  2,856  ← best
  A((BC)D):  4,055
  A(B(CD)): 26,418
The Matrix-Chain Multiplication Problem
• How do we find the best way to multiply Mi Mi+1 … Mj (denote the cost by mij), where Mi has dimensions pi-1 × pi?
• Split the chain at some k: (Mi … Mk) (Mk+1 … Mj)
  # of multiplications = mik + mk+1,j + pi-1 pk pj
• Therefore, mij = min over i ≤ k < j of { mik + mk+1,j + pi-1 pk pj }
Table Look-Up (Example)
M = M1 M2 M3 M4
    [10 × 20] [20 × 50] [50 × 1] [1 × 100]

m11 = 0,  m22 = 0,  m33 = 0,  m44 = 0
m12 = 10 · 20 · 50 = 10,000
m23 = 20 · 50 · 1 = 1,000
m34 = 50 · 1 · 100 = 5,000
m13 = min { m11 + m23 + 10 · 20 · 1,  m12 + m33 + 10 · 50 · 1 } = min { 1,200 ; 10,500 } = 1,200
m24 = min { m22 + m34 + 20 · 50 · 100,  m23 + m44 + 20 · 1 · 100 } = min { 105,000 ; 3,000 } = 3,000
m14 = min { m11 + m24 + 10 · 20 · 100,  m12 + m34 + 10 · 50 · 100,  m13 + m44 + 10 · 1 · 100 }
    = min { 23,000 ; 65,000 ; 2,200 } = 2,200
Table Look-Up (General)
M = M1 M2 … Mn
• Fill the table diagonal by diagonal:
  m11, m22, m33, …, mnn (all zero)
  m12, m23, …, m(n-1)n
  m13, m24, …
  …
  m1(n-1), m2n
  m1n
(A sketch of this table-filling order follows below.)
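A sketch of the table-filling order in code; `p` lists the matrix dimensions, so `p = [10, 20, 50, 1, 100]` encodes the example above:

```python
# Dynamic programming for matrix-chain multiplication: fill the m-table
# diagonal by diagonal, exactly as in the slides above.

def matrix_chain(p):
    """M_i has dimensions p[i-1] x p[i]; returns the minimum # of multiplications."""
    n = len(p) - 1
    m = [[0] * (n + 1) for _ in range(n + 1)]       # m[i][j], 1-indexed
    for length in range(2, n + 1):                  # one diagonal at a time
        for i in range(1, n - length + 2):
            j = i + length - 1
            m[i][j] = min(m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j]
                          for k in range(i, j))
    return m[1][n]

print(matrix_chain([10, 20, 50, 1, 100]))   # 2200, matching m14 in the example
print(matrix_chain([13, 5, 89, 3, 34]))     # 2856, the best order (A(BC))D
```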
Outline (recap) – next: 2. Algorithms, 4. NP-complete problems
NP-Complete Problems
• Complexity of problems
  – Polynomial: O(n^k)
    • Merge sort
    • Longest common subsequence
  – NP-complete
    • Not known to be polynomially solvable
    • e.g., the TSP
  – Exponential
The Traveling Salesman Problem (TSP)
• A salesman spends his time visiting n cities cyclically.
• In one tour he visits each city exactly once, and finishes up where he started.
• In what order should he visit the cities to minimize the distance traveled?
The Number of Candidate Tours Grows Exponentially
• 3 cities: 1 possible tour
• 10 cities: 181,440 possible tours
• n cities: (n-1)!/2 possible tours (see the brute-force sketch below)
• An optimal solution of the (n-1)-city problem could be useless for the n-city problem.
[Figure: optimal tours for 4 cities vs. 5 cities]
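A brute-force sketch that examines all (n-1)! orders of the remaining cities (each tour counted twice, once per direction); the distance matrix is made up:

```python
# Brute-force TSP: feasible only for very small n.
from itertools import permutations

dist = [                  # made-up symmetric distance matrix for 4 cities
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 8],
    [10, 4, 8, 0],
]
n = len(dist)

def tour_length(order):
    cities = (0,) + order + (0,)          # fix city 0 as the starting point
    return sum(dist[a][b] for a, b in zip(cities, cities[1:]))

best = min(permutations(range(1, n)), key=tour_length)
print(best, tour_length(best))            # (1, 3, 2) 23: only 3! orders here
```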
Reduction is difficult
[Figure: two small weighted graphs showing that the optimal 4-city tour need not appear inside the optimal 5-city tour]
Approximation Algorithms
• Instead of finding a best solution, one could settle for a sub-optimal solution (see the sketch below)
• Other types of algorithms:
  – Branch-and-bound, genetic algorithms, non-linear programming, numerical algorithms
  – For classification: neural nets, support vector machines
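One classic sub-optimal strategy is the nearest-neighbor heuristic, sketched here; it runs in O(n²) instead of examining (n-1)!/2 tours, but gives no optimality guarantee:

```python
# Nearest-neighbor heuristic for the TSP (a sub-optimal but fast tour).

def nearest_neighbor(dist, start=0):
    n = len(dist)
    tour, unvisited = [start], set(range(n)) - {start}
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda c: dist[last][c])   # greedy choice
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

dist = [                  # the same made-up 4-city matrix as before
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 8],
    [10, 4, 8, 0],
]
print(nearest_neighbor(dist))   # [0, 1, 3, 2]; greedy, so not guaranteed optimal
```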
A 50x50 matrix with error rate 5%
[Figure: two 50×50 matrices filled with 1s, in which roughly 5% of the entries are corrupted (shown as N, F, or P), illustrating noisy data for classification]
Outline (recap) – next: 3. Probabilistic Models, 1. The basics
Random Variables
• Suppose we toss 3 fair coins. If we let X denote the number of heads appearing, then X is a random variable taking on one of the values 0, 1, 2, 3 with respective probabilities:
P{ X=0 } = P{(T, T, T)} = 1/8P{ X=1 } = P{(T, T, H), (T, H, T), (H, T, T)} = 3/8P{ X=2 } = P{(T, H, H), (H, T, H), (H, H, T)} = 3/8P{ X=3 } = P{(H, H, H)} = 1/8
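These probabilities can be checked by enumerating the 2³ equally likely outcomes:

```python
# Enumerate all 8 outcomes of 3 fair coin tosses and count heads.
from itertools import product
from collections import Counter

counts = Counter(outcome.count("H")
                 for outcome in map("".join, product("HT", repeat=3)))
for x in sorted(counts):
    print(f"P{{X={x}}} = {counts[x]}/8")   # 1/8, 3/8, 3/8, 1/8
```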
Joint Distribution
• Suppose that 2 balls are randomly selected from an urn containing 2 red, 3 white, and 5 blue balls.
• If we let X and Y denote the number of red and white balls chosen, then the joint probability density function of X and Y is
p(i,j) = P{X=i, Y=j}
Joint Distribution
• red: 2, white: 3, blue: 5
p(0, 0) = C(5,2) / C(10,2) = 10/45
p(0, 1) = C(3,1) · C(5,1) / C(10,2) = 15/45
p(0, 2) = C(3,2) / C(10,2) = 3/45
p(1, 0) = C(2,1) · C(5,1) / C(10,2) = 10/45
p(1, 1) = C(2,1) · C(3,1) / C(10,2) = 6/45
p(2, 0) = C(2,2) / C(10,2) = 1/45
(A counting sketch follows below.)
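The same table by counting: enumerate all C(10,2) = 45 equally likely draws:

```python
# Verify the joint distribution p(i, j) by exhaustive enumeration.
from itertools import combinations
from collections import Counter

urn = ["red"] * 2 + ["white"] * 3 + ["blue"] * 5
joint = Counter((pair.count("red"), pair.count("white"))
                for pair in combinations(urn, 2))

for (i, j), c in sorted(joint.items()):
    print(f"p({i},{j}) = {c}/45")
# p(0,0)=10/45, p(0,1)=15/45, p(0,2)=3/45, p(1,0)=10/45, p(1,1)=6/45, p(2,0)=1/45
```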
Conditional Probabilities
• An urn contains 2 red, 3 white, and 5 blue balls.
• A ball is chosen at random from the urn, and it is noted that it is not one of the blue balls. What is the probability that it is white?
Conditional Probabilities
• red: 2, white: 3, blue: 5
P(White | Not Blue)
= P(White) / P(Not Blue)
= (3/10) / (5/10)
= 3/5
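The same answer by direct counting over the reduced sample space:

```python
# Among the 5 non-blue balls, 3 are white.
urn = ["red"] * 2 + ["white"] * 3 + ["blue"] * 5
not_blue = [b for b in urn if b != "blue"]
print(not_blue.count("white") / len(not_blue))   # 0.6 = 3/5
```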
Outline1. Information Retrieval2. Algorithms
1. The basics2. Divide and conquer3. Dynamic programming4. NP-complete problems
3. Probabilistic Models1. The basics2. Markov Process3. Hidden Markov Model (HMM)
Markov Process
• A "study process" as a sequence of random variables x1, x2, x3, …:
  • x1 = the primary school one studies in
  • x2 = junior high school
  • x3 = senior high school
  • x4 = university
  • x5 = graduate school
• The Markov property: the next state depends only on the current state:
  P(Xi = xi | Xi-1 = xi-1, Xi-2 = xi-2, Xi-3 = xi-3, …) = P(Xi = xi | Xi-1 = xi-1)
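A minimal simulation sketch of the Markov property; the two-state chain and its transition probabilities are made up for illustration:

```python
# Simulate a Markov chain: the next state depends only on the current one.
import random

transitions = {                         # made-up transition probabilities
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def simulate(state, steps):
    path = [state]
    for _ in range(steps):
        state = random.choices(list(transitions[state]),
                               weights=list(transitions[state].values()))[0]
        path.append(state)              # earlier history is never consulted
    return path

random.seed(0)
print(simulate("sunny", 10))
```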
Outline (recap) – next: 3. Probabilistic Models, 3. Hidden Markov Model (HMM)
Hidden Markov Model – An Example
• An occasionally dishonest casino, switching between a fair die (F) and an unfair one (U):
  Σ = {1, 2, 3, 4, 5, 6},  Q = {F, U}
  Transitions: aFF = 0.95, aFU = 0.05, aUU = 0.9, aUF = 0.1
  Emissions:
    eF(1) = eF(2) = … = eF(6) = 1/6
    eU(1) = eU(2) = … = eU(5) = 1/10,  eU(6) = 1/2
Observation Prediction
Rolls   246446644245311321631164152133625144543631656626566666
Hidden  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUUUUUUUUU
Predict FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUUUUUU

Rolls   651166453112456366646316366631623264552362666666251516
Hidden  UUUUUUFFFFFFFFUUUUUUUUUUUUUUUUFFFUUUUUUUUUUUUUUFFFFFFF
Predict UUUUUUFFFFFFFFUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUFFFFFF

Rolls   222555441666566563564324364131513465146126414626253356
Hidden  FFFFFFFFUUUUUUUUUUUUUFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFUU
Predict FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFU
Definition
• A Hidden Markov Model (HMM) is a triple M = (Σ, Q, Θ), where:
  – Σ is an alphabet of symbols
  – Q is a finite set of states capable of emitting symbols from Σ
  – Θ is a set of probabilities, comprised of
    • state transition probabilities akl, for k, l ∈ Q
    • state emission probabilities ek(b), for k ∈ Q, b ∈ Σ
• A state path π = (π1, ..., πL) in the model M is a sequence of states.
• For a sequence x = (x1, ..., xL) ∈ Σ*, the probability that x was generated by M along the state path π is

The Joint Probability P(x, π)

P(x, π) = aπ0,π1 · Π (i = 1 to L) [ eπi(xi) · aπi,πi+1 ]

(where π0 denotes the begin state)
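A sketch of Viterbi decoding for the dishonest-casino model above: it finds the state path maximizing P(x, π). The uniform start distribution is an assumption, and log probabilities are used to avoid underflow:

```python
# Viterbi decoding for the fair/unfair casino HMM.
import math

states = ["F", "U"]
a = {("F", "F"): 0.95, ("F", "U"): 0.05, ("U", "U"): 0.9, ("U", "F"): 0.1}
e = {"F": {r: 1 / 6 for r in "123456"},
     "U": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2}}

def viterbi(rolls):
    start = {"F": 0.5, "U": 0.5}       # assumed initial distribution
    v = {k: math.log(start[k]) + math.log(e[k][rolls[0]]) for k in states}
    back = []                          # back-pointers for the traceback
    for x in rolls[1:]:
        new_v, ptrs = {}, {}
        for k in states:
            prev = max(states, key=lambda l: v[l] + math.log(a[(l, k)]))
            ptrs[k] = prev
            new_v[k] = v[prev] + math.log(a[(prev, k)]) + math.log(e[k][x])
        back.append(ptrs)
        v = new_v
    path = [max(states, key=v.get)]    # best final state
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return "".join(reversed(path))

print(viterbi("266666626"))            # the run of 6s pulls the decoding toward U
```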
A Simple HMM – An Alternating Sequence of Exons and Introns
Emission probabilities (hidden states E = exon, I = intron):
  Exon:   A 0.4,  C 0.1, G 0.1, T 0.4
  Intron: A 0.05, C 0.4, G 0.5, T 0.05

Hidden:   … E E E E I I I E E E …
Observed:   A T C A A G G C G T

[State diagram: transitions between the two states with probabilities 0.9, 0.1, 0.01, 0.99]
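A small sketch that generates (hidden, observed) pairs from this exon/intron model; reading the state diagram as "E stays with 0.9, I stays with 0.99" is an assumption:

```python
# Sample a hidden state path and an observed DNA sequence from the HMM.
import random

emit = {"E": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "I": {"A": 0.05, "C": 0.4, "G": 0.5, "T": 0.05}}
stay = {"E": 0.9, "I": 0.99}           # assumed self-transition probabilities

def generate(length, state="E"):
    hidden, observed = [], []
    for _ in range(length):
        hidden.append(state)
        observed.append(random.choices("ACGT",
                        weights=[emit[state][b] for b in "ACGT"])[0])
        if random.random() > stay[state]:          # occasionally switch states
            state = "I" if state == "E" else "E"
    return "".join(hidden), "".join(observed)

random.seed(1)
h, o = generate(30)
print(h)   # hidden exon/intron labels
print(o)   # observed bases
```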