CpG Island Identification with Hidden Markov Models - Kshitij Tayal

Upload: kshitij-tayal

Post on 12-Jan-2017

TRANSCRIPT

Page 1: CpG Island Identification with Hidden Markov Models

CpG Island Identification with Hidden Markov Models

- Kshitij Tayal

Page 2: CpG Island Identification with Hidden Markov Models

CpG Island

• A region of the genome with a higher frequency of CpG sites than the rest of the genome.

• Formal definition: a CpG island is a region of at least 200 bp with a GC percentage greater than 50%. The commonly cited Gardiner-Garden and Frommer criteria additionally require an observed-to-expected CpG ratio greater than 0.6.

• CpG is shorthand for "C-phosphate-G", that is, a cytosine followed by a guanine on the same strand, separated by only one phosphate.
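The definition above can be turned into a simple check. The following is a minimal sketch, not a method from the slides: the function names are hypothetical, and the observed-to-expected CpG threshold of 0.6 is an assumption taken from the commonly cited Gardiner-Garden and Frommer criteria rather than from this presentation.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected from the C and G frequencies."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    # Expected number of CG dinucleotides is roughly (#C * #G) / length.
    return seq.count("CG") * len(seq) / (c * g)


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the length and GC criteria, plus the assumed obs/exp criterion."""
    if not seq or len(seq) < min_len:
        return False
    return gc_fraction(seq) > min_gc and cpg_obs_exp(seq) > min_obs_exp
```

A fixed-threshold check like this only classifies a window that has already been chosen; it does not say where an island starts or ends, which is the gap the probabilistic models later in the deck are meant to fill.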

Page 3: CpG Island Identification with Hidden Markov Models

The genome is ~3 billion characters long. How do we find a gene?

Page 4: CpG Island Identification with Hidden Markov Models

Importance of CpG Islands

• CpG islands act as a proxy for identifying genes.

• They often occur at the start of a gene.

• Cytosines in CpG dinucleotides can be methylated (have a methyl group attached) to form 5-methylcytosine.

Page 6: CpG Island Identification with Hidden Markov Models

Importance of Methylation

• Our body consists of trillions of cells. Every cell contains the same copy of DNA, with the same genetic blueprint, so how do cells decide which function to perform?

• How does a heart cell know it is a heart cell?

• How does a skin cell know it is a skin cell?

• They need instructions from outside the DNA sequence, supplied by small carbon-hydrogen groups called methyl groups.

• Methylation also helps explain how characteristics can change across generations without changes to the DNA sequence itself.

Page 7: CpG Island Identification with Hidden Markov Models

Epigenetics & CpG Islands

• The literal meaning of "epigenetics" is "above genetics". Epigenetic mechanisms determine the methylation of CpG islands.

• CpG islands regulate the expression of nearby genes.

• Proteins involved in gene expression can be repelled or attracted by the methyl group.

Page 8: CpG Island Identification with Hidden Markov Models

Background: Epigenetics

• Environmental factors such as what we do, what we eat, what we smoke, and how stressed we are influence where methyl groups bind.

• A bad diet can lead methyl groups to bind in the wrong place; with these bad instructions, cells can become abnormal and diseased.

• Epigenetics is also controlled by histones. Histones are proteins that act as spools around which DNA winds. Histones can change how tightly or loosely the DNA is wound around them.

• If the DNA is wound loosely, the gene is expressed more.

• If the DNA is wound tightly, the gene is expressed less.

Page 10: CpG Island Identification with Hidden Markov Models

Background: Epigenetics

• So the methyl group is more like a "switch" and histones are more like a "knob".

• Every cell of your body has a distinct methylation and histone pattern that gives the cell its marching orders.

• DNA can be thought of as the body's "hardware", and the epigenome is more like software that tells the hardware what work to do, which fits its literal meaning, "above genetics".

Page 11: CpG Island Identification with Hidden Markov Models

Now Some Computer Science...

• Task: design a method that, given a candidate string (k-mer), scores it according to how confident we are that it came from a CpG island.

• Approach: apply a sequence model, a probabilistic model that associates probabilities with sequences.

Page 12: CpG Island Identification with Hidden Markov Models

Sequence Models

• Sequence models learn from examples.

• Say we have sampled 100K 5-mers from inside CpG islands and 100K 5-mers from outside.

• Can we guess whether CGCGC came from a CpG island?

# CGCGC inside: 315

# CGCGC outside: 12

• P(inside) = 315 / (315 + 12) ≈ 0.96
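The count-based estimate from this slide can be sketched in a few lines. The function names `count_kmers` and `p_inside` are hypothetical, and the two count tables are filled in by hand with the slide's numbers rather than computed from real training data. Taking the plain ratio of counts relies on the slide's setup of equally many samples (100K) from inside and outside.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every length-k substring across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def p_inside(kmer, inside_counts, outside_counts):
    """Estimate P(inside | kmer) as inside count over total count.

    Returns None when the k-mer was never seen, since the ratio is undefined.
    """
    n_in = inside_counts.get(kmer, 0)
    n_out = outside_counts.get(kmer, 0)
    if n_in + n_out == 0:
        return None
    return n_in / (n_in + n_out)


# Counts taken from the slide.
inside_counts = {"CGCGC": 315}
outside_counts = {"CGCGC": 12}
p = p_inside("CGCGC", inside_counts, outside_counts)
```

The `None` case is the practical weakness of this approach: a k-mer that never appears in the training data gets no score at all.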

Page 13: CpG Island Identification with Hidden Markov Models

Sequence Models

• To estimate P(inside | x), we count the number of times x appears in the training set labelled INSIDE and divide by the total number of times x appears in the training set.

• But for sufficiently large k, we might see very few occurrences of x, or none at all. To overcome this limitation, we model the joint probability distribution over the symbols of the sequence.

• P(X) = P(x_k, x_{k-1}, ..., x_1), where P(X) is the probability of sequence X.
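The joint probability can be factored exactly with the chain rule. Assuming a first-order Markov property (each symbol depends only on its immediate predecessor, an assumption introduced here because the intermediate slides did not survive in the transcript), it reduces to a product of pairwise conditional probabilities:

```latex
\begin{align}
P(X) &= P(x_k, x_{k-1}, \ldots, x_1) \\
     &= P(x_k \mid x_{k-1}, \ldots, x_1)\, P(x_{k-1} \mid x_{k-2}, \ldots, x_1) \cdots P(x_2 \mid x_1)\, P(x_1) \\
     &\approx P(x_1) \prod_{i=2}^{k} P(x_i \mid x_{i-1})
\end{align}
```

Each factor P(x_i | x_{i-1}) can be estimated from dinucleotide counts, which stay plentiful in the training data even when whole k-mers are rare.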

Page 17: CpG Island Identification with Hidden Markov Models

• P(x) now equals the product of all the Markov chain edge weights along the walk through the chain driven by our string.

• Node labels are symbols, and transition labels are conditional probabilities.
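The product-of-edge-weights idea can be sketched as follows. This is an illustrative example under stated assumptions, not the presenter's code: the training strings are made-up toy data, the function names are hypothetical, add-one pseudocounts and a uniform probability for the first symbol are choices made here, and sums of logs replace the raw product to avoid numerical underflow. One chain is trained on "inside" examples and one on "outside" examples, and a string is scored by the difference of its log-probabilities.

```python
import math

ALPHABET = "ACGT"


def train_transitions(sequences, pseudocount=1.0):
    """Estimate P(next symbol | current symbol) from dinucleotide counts."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    probs = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in ALPHABET}
    return probs


def log_prob(seq, trans, initial=0.25):
    """Log of P(x): initial probability times every edge weight on the walk."""
    lp = math.log(initial)
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(trans[prev][cur])
    return lp


def log_odds(seq, inside, outside):
    """Positive values favour the inside model, negative the outside model."""
    return log_prob(seq, inside) - log_prob(seq, outside)


# Toy training data (made up for illustration only).
inside_model = train_transitions(["CGCGCGGCGCCGCG", "GCGCGCGACGCG"])
outside_model = train_transitions(["ATATTTAAGCATTA", "TTAACATGATTAAT"])

cg_score = log_odds("CGCGC", inside_model, outside_model)
at_score = log_odds("ATATA", inside_model, outside_model)
```

With these toy models, `cg_score` comes out positive and `at_score` negative, so the sign of the log-odds acts as the inside/outside call.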

Page 21: CpG Island Identification with Hidden Markov Models

Hidden Markov Model

• In simpler Markov models (like a Markov chain), the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters.

• In a hidden Markov model, the state is not directly visible, but the output, dependent on the state, is visible. Each state has a probability distribution over the possible output tokens. The adjective 'hidden' refers to the state sequence through which the model passes.
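To make the distinction concrete, here is a minimal sketch of a two-state HMM with an "island" and a "background" state. All numbers are invented for illustration and are not taken from the slides; the names `START`, `TRANS`, `EMIT`, and `joint_log_prob` are hypothetical. The function computes the joint log-probability of an observed sequence together with one particular hidden state path, which is the quantity an HMM assigns once both transitions and emissions are specified.

```python
import math

# Hypothetical parameters, chosen only to illustrate the structure.
START = {"island": 0.5, "background": 0.5}
TRANS = {
    "island": {"island": 0.9, "background": 0.1},
    "background": {"island": 0.1, "background": 0.9},
}
EMIT = {
    "island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "background": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def joint_log_prob(symbols, path):
    """Log of P(symbols, path): start, then alternate transition and emission."""
    if not symbols or len(symbols) != len(path):
        raise ValueError("symbols and path must be non-empty and equally long")
    lp = math.log(START[path[0]]) + math.log(EMIT[path[0]][symbols[0]])
    for i in range(1, len(symbols)):
        lp += math.log(TRANS[path[i - 1]][path[i]])
        lp += math.log(EMIT[path[i]][symbols[i]])
    return lp


seq = "CGCG"
in_island = joint_log_prob(seq, ["island"] * len(seq))
in_background = joint_log_prob(seq, ["background"] * len(seq))
```

In practice the path is not known, which is what "hidden" means here: only `seq` is observed, and the state labels have to be inferred.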

Page 28: CpG Island Identification with Hidden Markov Models

Hidden Markov Model - Viterbi Algorithm

• Given a sequence of coin flips, can we say when the dealer was using the loaded coin?

• We want to find p*, the most likely path given the emissions.

• The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed events.
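The loaded-coin question can be sketched with a small log-space Viterbi implementation. The parameter values below (a fair coin "F", a loaded coin "L" that lands heads 90% of the time, and a 10% chance of switching coins between flips) are assumptions made for this example, since the slides that defined the original model did not survive in the transcript. The function expects a non-empty observation sequence.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state path and its log-probability."""
    # best[t][s]: log-probability of the best path that ends in state s at step t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Assumed parameters for the fair (F) and loaded (L) coin.
states = ("F", "L")
start_p = {"F": 0.5, "L": 0.5}
trans_p = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
emit_p = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.9, "T": 0.1}}

heads_path, heads_logp = viterbi("HHHHHHHHHH", states, start_p, trans_p, emit_p)
tails_path, tails_logp = viterbi("TTTTT", states, start_p, trans_p, emit_p)
```

Under these assumed parameters, a run of ten heads is decoded as all loaded and a run of five tails as all fair. Replacing the coin states with island and background states, and H/T with nucleotides, gives the same decoding procedure for CpG island identification.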

[Pages 40-46: figure slides titled "Hidden Markov Model" and "EMISSIONS"; no other text survives in the transcript]
Page 47: CpG Island Identification with Hidden Markov Models

THANK YOU
