Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science http://www.iis.sinica.edu.tw/IASL/hsu/index.html


Post on 31-Dec-2015


Page 1: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Computer Science Basics for Bioinformatics

Wen-Lian Hsu

許聞廉

Institute of Information Science

http://www.iis.sinica.edu.tw/IASL/hsu/index.html

Page 2: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Outline

1. Information Retrieval
2. Algorithms
   1. The basics
   2. Divide and conquer
   3. Dynamic programming
   4. NP-complete problems
3. Probabilistic Models
   1. The basics
   2. Markov Process
   3. Hidden Markov Model (HMM)

Page 3: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Information Retrieval

• Indexing
  – Designing a spelling checker
      mistyped: absurb    really wanted: absorb
  – How fast can the system retrieve all potential candidates? absurd, absorb, …
  – What if he mistyped this as cbsorb?

Page 4: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Indexing

• Create a database that lists, for each letter, all words containing it:
    a   absorb, absurd, …
    b   boy, by, absorb, …
    s   sing, absorb, school, …
    o   go, absorb, origin, …
    r   acquire, absorb, …
• Which words contain “a, b, s, o, r”?
• When someone typed “cbsorb”
  – Which words contain “c, b, s, o, r”, or at least 4 out of these 5 letters?
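The letter index above can be sketched in a few lines of Python. This is only an illustration: the word list, the function names `build_letter_index` and `candidates`, and the `max_missing` parameter are assumptions made for the example, not part of the original slides.

```python
from collections import Counter, defaultdict

# Hypothetical toy dictionary; a real spelling checker would load a full word list.
WORDS = ["absorb", "absurd", "boy", "by", "sing", "school", "go", "origin", "acquire"]


def build_letter_index(words):
    """Map each letter to the set of words containing it."""
    index = defaultdict(set)
    for word in words:
        for letter in set(word):
            index[letter].add(word)
    return index


def candidates(typed, index, max_missing=1):
    """Return words containing all but at most `max_missing` of the typed letters."""
    letters = set(typed)
    hits = Counter()
    for letter in letters:
        for word in index.get(letter, ()):
            hits[word] += 1
    needed = len(letters) - max_missing
    return sorted(word for word, count in hits.items() if count >= needed)


INDEX = build_letter_index(WORDS)
print(candidates("cbsorb", INDEX))  # words sharing at least 4 of the 5 typed letters
print(candidates("absurb", INDEX))
```

Only the posting lists of the typed letters are scanned, so the lookup never touches words that share no letter with the query.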

Page 5: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Indexing for Chinese Texts (I)

• Treating each web page as an entity, one can index pages by the characters they contain:
    陳   p1, p5, p9, p13, …
    水   p4, p5, p10, p13, p20, …
    扁   p5, p8, p13, p25, …
• If someone wants a page containing 「陳水扁」 (Chen Shui-bian), he might find it in p5 or p13
  – However, he could also get a page with 「陳萬水乘坐扁舟」 ("Chen Wan-shui rides in a small boat"), which contains all three characters but not the name

Page 6: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

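Indexing for Chinese Texts (II)

• Indexing by two characters (bi-gram)
    陳水   p5, p54, p125, …
    水扁   p54, p89, p236, …
• However, if someone wants 「台大」 (short for National Taiwan University), he could get 「這台大冰箱」 ("this big refrigerator")
• There are also many other problems:
  – Synonyms: 阿扁 (A-bian), 陳總統 (President Chen); 台灣大學, 臺灣大學 (two spellings of National Taiwan University)
  – Misspelling: 程總統 (a homophone typo for 陳總統)
  – Semantics: 陳萬水的先生 ("Chen Wan-shui's husband")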
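A minimal sketch of character and bi-gram indexing follows, assuming three hypothetical pages whose text is built from the examples on these slides. The names `PAGES`, `build_ngram_index`, and `search` are illustrative only.

```python
from collections import defaultdict

# Hypothetical pages, chosen to mirror the examples on the slides.
PAGES = {
    "p1": "陳水扁出國訪問",
    "p2": "陳萬水乘坐扁舟",
    "p3": "這台大冰箱",
}


def build_ngram_index(pages, n):
    """Map every n-character substring to the set of pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for i in range(len(text) - n + 1):
            index[text[i:i + n]].add(page_id)
    return index


def search(query, index, n):
    """Return pages that contain every n-gram of the query (position is not checked)."""
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    result = None
    for gram in grams:
        pages = index.get(gram, set())
        result = set(pages) if result is None else result & pages
    return sorted(result or [])


UNIGRAMS = build_ngram_index(PAGES, 1)
BIGRAMS = build_ngram_index(PAGES, 2)
print(search("陳水扁", UNIGRAMS, 1))  # single characters also match p2
print(search("陳水扁", BIGRAMS, 2))   # bi-grams rule p2 out
print(search("台大", BIGRAMS, 2))     # still matches the refrigerator page
```

The bi-gram index removes the first false match but not the second, which is the limitation the slide points out.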

Page 7: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Semantic Annotations

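• In order to facilitate our search, we need to find better “indexing schemes”
  – Schemes that are semantically oriented
• Example sentence: 陳水扁於昨日搭機赴南美訪問友好國家 ("Chen Shui-bian flew to South America yesterday to visit friendly nations")
  – Semantic annotations: 總統 (president), 出國訪問 (overseas visit)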

Page 8: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Treating Genomic/Proteomic data as a Language

• An analogy of exons and introns

Onlyaksjcbakamcnabddfkjsmallddkdfjwos

perddtrudjfdksjascdcentagedkjfdkdfjgaof

humanzidkenkdjfDNAisbelskdfjactuallyof

Snadkfjkjdmeandkfjdkslasdkingful

Page 9: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Decoding an unknown language

• For proteomic data:

    Amino acid → motif → protein
    Alphabet → word → sentence

    Sentence meaning ↔ Protein structure

• Finding the interrelationships of data– Data Mining, Knowledge Discovery

Page 10: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

DNA intron-exon structure

[Figure: DNA intron-exon structure. Labels: promoter, transcription start site, 5’ UTR, start codon, exon, intron, donor site, acceptor site, splice site, stop codon, 3’ UTR, polyA.]

Page 11: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Matching by templates

Boundaries of Splicing Sites

Page 12: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Biological Patterns

Conserved area in multiple alignment

    cydeggis
    cyedggis
    cyeeggit
    cyhgdggs
    cyrgdgnt

regular expression

    c - y - [x2] - [dg] - g - [x] - [st]

(x stands for any residue; x2 for any two residues)
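The pattern can be tried out with Python's `re` module. In this sketch the five sequences are one reading of the aligned rows on the slide (eight residues each), `x` is taken to mean "any residue", and `LONGER` is a made-up sequence used only to show searching inside a longer string.

```python
import re

# c - y - [x2] - [dg] - g - [x] - [st] written as a Python regular expression:
# "." matches any single residue, ".{2}" any two residues.
PATTERN = re.compile(r"cy.{2}[dg]g.[st]")

# One reading of the aligned rows on the slide.
ALIGNED = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs", "cyrgdgnt"]

for seq in ALIGNED:
    print(seq, bool(PATTERN.fullmatch(seq)))

# Hypothetical longer sequence: report where the motif occurs.
LONGER = "aaacyeeggitaaa"
match = PATTERN.search(LONGER)
print(match.start(), match.group())
```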

Page 13: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Matching by examples

• Existing sentences in database (understood):
  – His old father gave me a book.
  – Joan loves Andy
• Understanding a new sentence
  – Mary’s lovely daughter does not like John
• Techniques
  – Corpus analysis
  – Pattern discovery and matching
    • Sequence, semantics (classification, transformation)
  – Structure prediction

Page 14: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Procedure Automation: Protein structure prediction

• Given a sequence, predict its structure automatically
  1. Find homologous (> 25%) sequences
  2. If we can find one whose structure is known, then carry out an automated homology modeling
  3. Otherwise, transform our sequence into another representation (secondary or super-secondary structure)
     • IAMHSUWENLAI -----> HHHCCBBBB
  4. Align the transformed sequence
  5. If none works, go back to the “ab initio” approach
  6. With structure available, scan the catalytic fragments, ligand binding sites (need 3D active site database)

Page 15: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

[Figure: Information food chain (Knowledge Management). Components shown: web sources (GenBank, DDBJ, EMBL, PubMed, SGD), software robots, reconciliation agents, the GenomeInfo, NucleotideInfo, and ProteinInfo agents, and a Knowledge Map.]

Page 16: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Outline

1. Information Retrieval
2. Algorithms
   1. The basics
   2. Divide and conquer
   3. Dynamic programming
   4. NP-complete problems
3. Probabilistic Models
   1. The basics
   2. Markov Process
   3. Hidden Markov Model (HMM)

Page 17: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Basic Notations

• The big-O notation
• “O” stands for “order of magnitude”
• O(n) reads “order n”
• $f(n) = O(n)$ means $f(n) \le c \cdot n$ for some constant $c$ and all sufficiently large $n$
• Time complexity of an algorithm
  – The time needed by an algorithm in terms of its input size (usually denoted by n)
  – For example, $O(n)$, $O(n^2)$

Page 18: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Outline

1. Information Retrieval
2. Algorithms
   1. The basics
   2. Divide and conquer
   3. Dynamic programming
   4. NP-complete problems
3. Probabilistic Models
   1. The basics
   2. Markov Process
   3. Hidden Markov Model (HMM)

Page 19: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

[Figure: the original problem is split into two subproblems and then into four; the four answers are merged into two combined answers and then into the final combined answer.]

Page 20: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Divide-and-Conquer

• Break the problem into several smaller, similar subproblems. Solve the subproblems recursively, and then combine these solutions to create a solution to the original problem.

• Each level of the recursion consists of three steps: divide, conquer, and combine.

Page 21: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Merge Sort Example

32, 44, 15, 6, 28, 43, 17, 53

32, 44, 15, 6 | 28, 43, 17, 53

32, 44 | 15, 6 | 28, 43 | 17, 53

Page 22: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Merge Sort Example (cont.)

Before sorting the pairs:   32, 44 | 15, 6 | 28, 43 | 17, 53

After sorting the pairs:    32, 44 | 6, 15 | 28, 43 | 17, 53

Merge 32, 44 with 6, 15:    6 → 6, 15 → 6, 15, 32 → 6, 15, 32, 44

Merge 28, 43 with 17, 53:   17 → 17, 28 → 17, 28, 43 → 17, 28, 43, 53

Merge 6, 15, 32, 44 with 17, 28, 43, 53:

6, 15, 17, 28, 32, 43, 44, 53

Page 23: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Philosophy

• Divide:
  – Divide the n-element sequence into two subsequences of n/2 elements each.
• Conquer:
  – Sort the two subsequences recursively using merge sort.
• Combine:
  – Merge the two sorted subsequences to produce the sorted answer.
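The three steps can be sketched in Python as follows. The function names `merge` and `merge_sort` are illustrative, and the version below returns a new list rather than sorting in place.

```python
def merge(left, right):
    """Combine: merge two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of the two lists still has elements left over.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Divide the list in half, sort each half recursively, then merge."""
    if len(items) < 2:          # zero or one element: already sorted
        return list(items)
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))


print(merge_sort([32, 44, 15, 6, 28, 43, 17, 53]))
```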

Page 24: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Merge Sort Analysis

• Time Analysis:
  – Total_time = (Time required for each subproblem) + (Combination time)

$$T(n) = \begin{cases} O(1), & \text{if } n < 2 \\ 2\,T(n/2) + O(n), & \text{if } n \ge 2 \end{cases}$$

• $T(n) = O(n \log n)$
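One way to see why the recurrence $T(n) = 2\,T(n/2) + O(n)$ gives $O(n \log n)$ is to unroll it. The sketch below assumes that $n$ is a power of two and that the combination step costs at most $c\,n$ for some constant $c$; both are simplifying assumptions, not statements from the slides.

```latex
\begin{align*}
T(n) &\le 2\,T(n/2) + c\,n \\
     &\le 4\,T(n/4) + 2\,c\,n \\
     &\le 8\,T(n/8) + 3\,c\,n \\
     &\;\;\vdots \\
     &\le 2^{k}\,T(n/2^{k}) + k\,c\,n
      && \text{after } k \text{ levels} \\
     &= n\,T(1) + c\,n\log_2 n
      && \text{taking } k = \log_2 n \\
     &= O(n \log n)
\end{align*}
```

Each of the $\log_2 n$ levels of recursion does at most $c\,n$ work in total, which is where the $n \log n$ product comes from.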

Page 25: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Outline

1. Information Retrieval
2. Algorithms
   1. The basics
   2. Divide and conquer
   3. Dynamic programming
   4. NP-complete problems
3. Probabilistic Models
   1. The basics
   2. Markov Process
   3. Hidden Markov Model (HMM)

Page 26: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Dynamic Programming

[Figure: a weighted graph with nodes S, A, B, C, D, E, F, G, H, and T.]

Page 27: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Dynamic Programming

• Decompose a large problem into sub-problems
• Each sub-problem is identical to the original problem except the size is smaller
• Use the same strategy to solve sub-problems and store answers in a table
• Combine solutions of the sub-problems by table “look-up”

Page 28: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Example: Multiplying chain matrices

M = A × B, with dimensions [10 × 20] [20 × 50]

# of multiplications = 10 × 20 × 50 = 10,000

• How to multiply

M = A × B × C × D, with dimensions [13 × 5] [5 × 89] [89 × 3] [3 × 34]?

Page 29: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

One Possible Scenario

• M = A × B × C × D, with dimensions [13 × 5] [5 × 89] [89 × 3] [3 × 34]

M = ((A B) C) D
    (AB):      # of multiplications = 13 × 5 × 89 = 5,785
    (AB)C:     # of multiplications = 13 × 89 × 3 = 3,471
    ((AB)C)D:  # of multiplications = 13 × 3 × 34 = 1,326

total # of multiplications = 10,582

Page 30: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

All cases

M = A × B × C × D, with dimensions [13 × 5] [5 × 89] [89 × 3] [3 × 34]

# of multiplications

    ((AB)C) D:   10,582
    (AB) (CD):   54,201
    (A(BC)) D:    2,856
    A ((BC)D):    4,055
    A (B(CD)):   26,418
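The five totals can be reproduced by enumerating every way to parenthesize the product. The generator `all_orders` below is a hypothetical helper written for this illustration; it is a brute-force check, not the dynamic-programming method.

```python
def all_orders(names, dims):
    """Yield (expression, cost, rows, cols) for every parenthesization.

    dims[i] is the (rows, cols) shape of the matrix called names[i].
    """
    if len(names) == 1:
        yield names[0], 0, dims[0][0], dims[0][1]
        return
    for split in range(1, len(names)):
        for l_expr, l_cost, rows, inner in all_orders(names[:split], dims[:split]):
            for r_expr, r_cost, _, cols in all_orders(names[split:], dims[split:]):
                # Multiplying a (rows x inner) by an (inner x cols) matrix
                # takes rows * inner * cols scalar multiplications.
                cost = l_cost + r_cost + rows * inner * cols
                yield f"({l_expr}{r_expr})", cost, rows, cols


DIMS = [(13, 5), (5, 89), (89, 3), (3, 34)]
COSTS = {expr: cost for expr, cost, _, _ in all_orders("ABCD", DIMS)}
for expr, cost in sorted(COSTS.items(), key=lambda item: item[1]):
    print(f"{expr}: {cost:,}")
```

The number of parenthesizations grows very quickly with the chain length, so this enumeration is only practical for short chains.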

Page 31: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

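The matrix-chain multiplication problem

• How do we find the best way to multiply $M_i M_{i+1} \cdots M_j$ (denote the cost by $m_{ij}$), where $M_i$ is a $p_{i-1} \times p_i$ matrix?
• Split the chain at position $k$: $(M_i \cdots M_k)\,(M_{k+1} \cdots M_j)$

  # of multiplications = $m_{ik} + m_{k+1,j} + p_{i-1}\, p_k\, p_j$

• Therefore, $m_{ij} = \min_{i \le k < j} \{\, m_{ik} + m_{k+1,j} + p_{i-1}\, p_k\, p_j \,\}$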

Page 32: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

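Table look-up (example)

M = M1 M2 M3 M4, with dimensions [10 × 20] [20 × 50] [50 × 1] [1 × 100]

    m11 = 0        m22 = 0       m33 = 0       m44 = 0
    m12 = 10,000   m23 = 1,000   m34 = 5,000
    m13 = 1,200    m24 = 3,000
    m14 = 2,200

$$m_{13} = \min \begin{cases} m_{11} + m_{23} + 10 \times 20 \times 1 \\ m_{12} + m_{33} + 10 \times 50 \times 1 \end{cases}$$

$$m_{24} = \min \begin{cases} m_{22} + m_{34} + 20 \times 50 \times 100 \\ m_{23} + m_{44} + 20 \times 1 \times 100 \end{cases}$$

$$m_{14} = \min \begin{cases} m_{11} + m_{24} + 10 \times 20 \times 100 \\ m_{12} + m_{34} + 10 \times 50 \times 100 \\ m_{13} + m_{44} + 10 \times 1 \times 100 \end{cases}$$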
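A minimal Python sketch of the table-filling procedure, using the recurrence for m_ij and the dimensions of this example. The function name `matrix_chain_cost` and the 1-based table layout are choices made for the illustration.

```python
def matrix_chain_cost(p):
    """Return the table m, where m[i][j] is the minimum number of scalar
    multiplications needed for Mi..Mj and Mi has shape p[i-1] x p[i]."""
    n = len(p) - 1
    # 1-based indexing to match the slides; row and column 0 are unused.
    m = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):              # chains of 2, 3, ..., n matrices
        for i in range(1, n - length + 2):
            j = i + length - 1
            # Look up the already computed shorter chains for every split point k.
            m[i][j] = min(
                m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j]
                for k in range(i, j)
            )
    return m


table = matrix_chain_cost([10, 20, 50, 1, 100])
print(table[1][2], table[2][3], table[3][4])
print(table[1][3], table[2][4])
print(table[1][4])
```

The table is filled diagonal by diagonal, so every entry on the right-hand side of the recurrence is already available when it is needed.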

Page 33: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Table look-up (general)

M = M1 M2 … Mn

    m11   m22   m33   …   mnn
    m12   m23   …   m(n-1)n
    m13   m24   …
    …   …
    m1(n-1)   m2n
    m1n

Page 34: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Outline

1. Information Retrieval
2. Algorithms
   1. The basics
   2. Divide and conquer
   3. Dynamic programming
   4. NP-complete problems
3. Probabilistic Models
   1. The basics
   2. Markov Process
   3. Hidden Markov Model (HMM)

Page 35: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

NP-Complete Problems

• Complexity of problems
  – Polynomial: $O(n^k)$
    • Merge sort
    • Longest common subsequence
  – NP-complete
    • Not known to be polynomially solvable
    • The TSP
  – Exponential

Page 36: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

The Traveling Salesman Problem

Page 37: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

The Traveling Salesman Problem (TSP)

• A salesman spends his time visiting n cities cyclically.

• In one tour he visits each city exactly once, and finishes up where he started.

• In what order should he visit the cities to minimize the distance traveled?

Page 38: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Candidates grow exponentially

• 3 cities → 1 solution
• 10 cities → 181,440 possible tours
• n cities → (n-1)!/2 possible tours
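The (n-1)!/2 count can be checked by brute force for small n. In the sketch below, `count_tours` is a hypothetical helper: it fixes city 0 as the starting point and treats a tour and its reverse as the same cycle.

```python
from itertools import permutations
from math import factorial


def count_tours(n):
    """Count distinct round trips through n cities by explicit enumeration."""
    tours = set()
    for rest in permutations(range(1, n)):      # city 0 is the fixed start
        tour = (0,) + rest
        reverse = (0,) + rest[::-1]
        # A tour and its reverse describe the same cycle; keep one canonical form.
        tours.add(min(tour, reverse))
    return len(tours)


for n in range(3, 8):
    print(n, count_tours(n), factorial(n - 1) // 2)

print(factorial(10 - 1) // 2)   # number of tours for 10 cities
```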

Page 39: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Reduction is difficult

• An optimal solution of the (n-1)-city problem could be useless for the n-city problem.

[Figure: a 4-city instance and a 5-city instance with weighted edges.]

Page 40: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Approximation Algorithms

• Instead of finding a best solution, one could settle for a sub-optimal solution
• Other types of algorithms:
  – Branch-and-bound, genetic algorithm, non-linear programming, numerical algorithms
  – For classification: neural net, support vector machine

Page 41: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

A 50x50 matrix with error rate 5%

[Figure: two panels showing a 50x50 matrix whose filled cells are 1s; in the first panel the erroneous entries are marked N and P, in the second panel they are marked F.]

Page 42: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science
Page 43: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Outline

1. Information Retrieval
2. Algorithms
   1. The basics
   2. Divide and conquer
   3. Dynamic programming
   4. NP-complete problems
3. Probabilistic Models
   1. The basics
   2. Markov Process
   3. Hidden Markov Model (HMM)

Page 44: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Random Variables

• Suppose we toss 3 fair coins. If we let X denote the number of heads appearing, then X is a random variable taking on one of the values 0, 1, 2, 3 with respective probabilities:

    P{ X=0 } = P{(T, T, T)} = 1/8
    P{ X=1 } = P{(T, T, H), (T, H, T), (H, T, T)} = 3/8
    P{ X=2 } = P{(T, H, H), (H, T, H), (H, H, T)} = 3/8
    P{ X=3 } = P{(H, H, H)} = 1/8
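These probabilities can be reproduced by listing all eight equally likely outcomes and counting heads. The variable names in this short sketch are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2^3 = 8 equally likely outcomes of tossing three fair coins.
outcomes = list(product("HT", repeat=3))

# X = number of heads in an outcome; count how often each value occurs.
heads = Counter(outcome.count("H") for outcome in outcomes)
dist = {x: Fraction(count, len(outcomes)) for x, count in sorted(heads.items())}

for x, prob in dist.items():
    print(f"P{{X={x}}} = {prob}")
```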

Page 45: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Joint Distribution

• Suppose that 2 balls are randomly selected from an urn containing 2 red, 3 white, and 5 blue balls.

• If we let X and Y denote the number of red and white balls chosen, then the joint probability mass function of X and Y is

p(i,j) = P{X=i, Y=j}

Page 46: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Joint Distribution

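• red: 2, white: 3, blue: 5

$$p(0,0) = \binom{5}{2} \Big/ \binom{10}{2} = 10/45$$

$$p(0,1) = \binom{3}{1}\binom{5}{1} \Big/ \binom{10}{2} = 15/45$$

$$p(0,2) = \binom{3}{2} \Big/ \binom{10}{2} = 3/45$$

$$p(1,0) = \binom{2}{1}\binom{5}{1} \Big/ \binom{10}{2} = 10/45$$

$$p(1,1) = \binom{2}{1}\binom{3}{1} \Big/ \binom{10}{2} = 6/45$$

$$p(2,0) = \binom{2}{2} \Big/ \binom{10}{2} = 1/45$$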
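The joint probabilities for the urn with 2 red, 3 white, and 5 blue balls can be computed by counting combinations. The constant names and the function `p` in this sketch are illustrative; `Fraction` prints values in lowest terms, so 10/45 appears as 2/9.

```python
from fractions import Fraction
from math import comb

RED, WHITE, BLUE, DRAWN = 2, 3, 5, 2


def p(i, j):
    """P{X = i, Y = j}: i red and j white balls among the DRAWN balls."""
    k = DRAWN - i - j                    # the remaining balls drawn are blue
    if k < 0:
        return Fraction(0)
    ways = comb(RED, i) * comb(WHITE, j) * comb(BLUE, k)
    return Fraction(ways, comb(RED + WHITE + BLUE, DRAWN))


for i in range(DRAWN + 1):
    for j in range(DRAWN + 1 - i):
        print(f"p({i}, {j}) = {p(i, j)}")
```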

Page 47: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Conditional Probabilities

• An urn contains 2 red, 3 white, and 5 blue balls.

• A ball is chosen at random from the urn, and it is noted that it is not one of the blue balls. What is the probability that it is white?

Page 48: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Conditional Probabilities

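• red: 2, white: 3, blue: 5

Every white ball is also not blue, so P(White and Not Blue) = P(White):

$$P(\text{White} \mid \text{Not Blue}) = \frac{P(\text{White})}{P(\text{Not Blue})} = \frac{3/10}{5/10} = \frac{3}{5}$$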

Page 49: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Outline

1. Information Retrieval
2. Algorithms
   1. The basics
   2. Divide and conquer
   3. Dynamic programming
   4. NP-complete problems
3. Probabilistic Models
   1. The basics
   2. Markov Process
   3. Hidden Markov Model (HMM)

Page 50: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Markov Process

• A study process
  – x1 = primary schools that one studies in
  – x2 = junior high schools
  – x3 = senior high schools
  – x4 = universities
  – x5 = graduate schools
• A sequence of random variables X1, X2, X3, … in which each value depends only on the one immediately before it:

$$P(X_i = x_i \mid X_{i-1} = x_{i-1},\, X_{i-2} = x_{i-2},\, X_{i-3} = x_{i-3}) = P(X_i = x_i \mid X_{i-1} = x_{i-1})$$

Page 51: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Outline

1. Information Retrieval
2. Algorithms
   1. The basics
   2. Divide and conquer
   3. Dynamic programming
   4. NP-complete problems
3. Probabilistic Models
   1. The basics
   2. Markov Process
   3. Hidden Markov Model (HMM)

Page 52: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Hidden Markov Model: An example

• An occasionally dishonest casino (F = fair die, U = unfair die)

    Σ = {1, 2, 3, 4, 5, 6}
    Q = {F, U}

    aFF = 0.95    aFU = 0.05
    aUU = 0.9     aUF = 0.1

    eF(1) = 1/6   eU(1) = 1/10
    eF(2) = 1/6   eU(2) = 1/10
    eF(3) = 1/6   eU(3) = 1/10
    eF(4) = 1/6   eU(4) = 1/10
    eF(5) = 1/6   eU(5) = 1/10
    eF(6) = 1/6   eU(6) = 1/2

[Figure: two states, F and U, each listing its emission probabilities, with arrows F→F 0.95, F→U 0.05, U→F 0.1, U→U 0.9.]

Page 53: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Observation Prediction

Rolls    246446644245311321631164152133625144543631656626566666
Hidden   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUUUUUUUUU
Predict  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFUUUUUUUUUUUU

Rolls    651166453112456366646316366631623264552362666666251516
Hidden   UUUUUUFFFFFFFFUUUUUUUUUUUUUUUUFFFUUUUUUUUUUUUUUFFFFFFF
Predict  UUUUUUFFFFFFFFUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUFFFFFF

Rolls    222555441666566563564324364131513465146126414626253356
Hidden   FFFFFFFFUUUUUUUUUUUUUFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFUU
Predict  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFU
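The slides do not say how the "Predict" rows were produced. One common way to compute the most probable hidden path is the Viterbi algorithm, sketched here with the casino parameters (fair state F, loaded state U). The equal start probabilities in `START` are an assumption, since the slides give none, and the function expects a non-empty string of rolls.

```python
import math

STATES = "FU"  # F = fair die, U = unfair (loaded) die
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.1, "U": 0.9}}
EMIT = {
    "F": {str(k): 1 / 6 for k in range(1, 7)},
    "U": {**{str(k): 1 / 10 for k in range(1, 6)}, "6": 1 / 2},
}
START = {"F": 0.5, "U": 0.5}  # assumption: start probabilities are not on the slides


def viterbi(rolls):
    """Return the most probable state path for a non-empty string of die rolls."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for symbol in rolls[1:]:
        pointers, new_score = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][symbol]))
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("246446644245311321631164152133625144543631656626566666"))
```

Log-probabilities are used so that long roll sequences do not underflow. With different start probabilities the decoded path may differ near the beginning of the sequence.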

Page 54: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

Definition

• A Hidden Markov Model (HMM) is a triple M = (Σ, Q, Θ), where:
  – Σ is an alphabet of symbols.
  – Q is a finite set of states capable of emitting symbols from Σ.
  – Θ is a set of probabilities, comprised of
    • State transition probabilities (akl, k, l ∈ Q).
    • State emission probabilities (ek(b), k ∈ Q, b ∈ Σ)

Page 55: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

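The Joint Probability P(x, Π)

• A state path $\Pi = (\pi_1, \ldots, \pi_L)$ in the model M is a sequence of states.

• For a sequence $x = (x_1, \ldots, x_L) \in \Sigma^*$, the probability that x was generated by M based on the state path $\Pi$ is

$$P(x, \Pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$

  where state 0 marks the beginning and the end of the path, so $\pi_{L+1} = 0$.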
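The joint probability can be evaluated directly for the casino model. This sketch makes two assumptions that the slides do not state: the start probabilities a_{0k} are taken to be 1/2 each, and there is no explicit end state, so the final transition factor is treated as 1. The names `START` and `joint_probability` are illustrative.

```python
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.1, "U": 0.9}}
EMIT = {
    "F": {str(k): 1 / 6 for k in range(1, 7)},
    "U": {**{str(k): 1 / 10 for k in range(1, 6)}, "6": 1 / 2},
}
START = {"F": 0.5, "U": 0.5}  # assumption: a_{0k} is not given on the slides


def joint_probability(x, path):
    """P(x, path): start probability times every emission and transition.

    Assumption: no explicit end state, so the last transition factor is 1.
    """
    if len(x) != len(path) or not x:
        raise ValueError("x and path must be non-empty and of equal length")
    prob = START[path[0]]
    for i, (symbol, state) in enumerate(zip(x, path)):
        prob *= EMIT[state][symbol]
        if i + 1 < len(path):
            prob *= TRANS[state][path[i + 1]]
    return prob


print(joint_probability("66", "UU"))   # 0.5 * 1/2 * 0.9 * 1/2
print(joint_probability("66", "FF"))   # 0.5 * 1/6 * 0.95 * 1/6
```

Two sixes in a row are noticeably more probable under the loaded path than under the fair one, which is the kind of comparison that path decoding relies on.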

Page 56: Computer Science Basics for Bioinformatics Wen-Lian Hsu 許聞廉 Institute of Information Science

A Simple HMM: An alternating sequence of exons and introns

Emission probabilities:

    Exon     A 0.4    C 0.1   G 0.1   T 0.4
    Intron   A 0.05   C 0.4   G 0.5   T 0.05

Transition probabilities labeled in the diagram: 0.9, 0.1, 0.01, 0.99

    hidden        … E E E E I I I E E E …
    observation     A T C A A G G C G T