Hidden Markov Chains
Applications to genomics
Mercè Farré
Seminar of the Servei d'Estadística, UAB, 24 October 2008
Applications
Hidden Markov models (HMMs) are applied, among other research areas, to speech recognition and to bioinformatics tasks related to DNA sequencing and other areas of genomics.
The bibliography only lists references for these two applications. The appendix to the bibliography contains excerpts from articles and texts on other applications: computer vision, econometrics and finance, information transmission, etc.
Index
1 Preliminaries
  The DNA macromolecule
  Nucleotides, codons and amino acids
  Translation problems
2 Markov models
  Stochastic process
  Markov chain
  Homogeneous case: transition probabilities
  Stationary law. Stable system and ergodicity
3 Hidden models
  What is an HMM? An example
  The parameters of an HMM
  Hidden models in DNA
4 Three problems. Algorithms
  Usual notation for HMMs
  The three basic problems of HMMs
  Efficient algorithms
  Questions
5 References
6 Appendix
  Other applications
The DNA double helix
http://ca.wikipedia.org/wiki/ADN
We introduce the vocabulary with the help of some auxiliary pages.
Ref: [RRS] Robin, Rodolphe and Schbath, DNA, Words and Models. Statistics of Exceptional Words.
Chain of nucleotides (bases) (Ref: [RRS])
Codons and amino acids (Ref: [RRS])
Translation, synonyms, consensus, blocks, gaps, ... (Ref: [RRS])
Vocabulary. Random models.
Nucleotides (bases), codons, amino acids, protein
Word, motif, palindrome (a c g t)
Genetic code: the map that assigns to each codon an amino acid or the "stop signal"
Genes: coding regions (exons and introns)
Intergenic (non-coding) regions
The complexity of transcription, the different regions in DNA, the presence of "erroneously" inserted or deleted bases, etc., mean that more classical models do not fit the experimental situation well enough.
With a good model available, one can, for example, find where coding regions begin and end, detect genes, or decide whether certain words are exceptional (too frequent or too rare).
Process
A stochastic process models something that evolves over time (or along a sequence of steps) and that, at each instant, has a non-deterministic (random or stochastic) response:
instant: $t$; wind speed: $X_t \in [0, V_{\max}] = S$
position in a DNA sequence: $n$; nucleotide: $X_n \in \{a, c, g, t\} = S$
This does not mean that the responses are completely unpredictable, because they follow a law: a probability distribution on the state space $S$ of the process.
For example, at a given instant some speeds are more likely than others, depending on the speeds at earlier instants. Some nucleotides are more frequent than others, depending on whether or not we are in a coding region of the DNA.
Markov chain
The word Markov means dependence, in probabilistic terms, on the most recent past. A discrete-time Markov model of order m (Mm) indicates dependence on the m preceding instants. If only the present state determines the law of the future, the model is M1. If each instant is completely independent of the whole past, the model is M0, or Bernoulli.
Chain
A Markov random process with a countable number of states (we assume it is finite, $S = \{S_i,\ 1 \le i \le N\}$) is a Markov chain. We further assume that time is discrete and that the chain is of type M1.
Time sequence: $\dots,\ t-1,\ t,\ t+1,\ \dots$
Process: $\dots,\ X_{t-1},\ X_t,\ X_{t+1},\ \dots$, with $X_t \in S$.
Homogeneous case: transition probabilities
Markov processes, and chains in particular, are easy to study when they are homogeneous in time: the transitions from one state to another always have the same probability (they do not depend on t). Homogeneity is a strong restriction on the model, a handicap that HMMs avoid.
Graph of a homogeneous chain
The (first-order) transition probabilities of a finite chain can be represented by a graph like the following one (case with N = 4 states):
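[Figure: directed graph on the states S1, S2, S3, S4, with arrows labelled by the transition probabilities: self-loops p11, p22, p33, p44 and edges p12, p13, p14, p21, p24, p31, p34, p41, p42, p43.]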
An example
The hectic life of Doudou
The behaviour of the hamster Doudou, who every minute either changes state or stays in one of three states (eating, sleeping and exercising), can be modelled with a Markov chain:
"http://fr.wikipedia.org/wiki/Cha%C3%AEne de Markov"
Law of the chain
Initial distribution and transition probability matrix
Initial distribution of the states and transition matrix (case N = 4):
\[
\pi_0' = \begin{pmatrix} \pi_{01} & \pi_{02} & \pi_{03} & \pi_{04} \end{pmatrix},
\qquad
P := \begin{pmatrix}
p_{11} & p_{12} & p_{13} & p_{14} \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44}
\end{pmatrix}
\]
The rows of the matrix $P$ are the transition probabilities between states (conditional laws): each row sums to 1.
Property. Law of the chain after n steps
The law is determined by $\pi_0$ and $P$. In particular, the marginal distribution of the state probabilities after 1 and after $n$ transitions is, respectively:
\[
\pi_1' = \pi_0' P, \qquad \pi_n' = \pi_0' P^n
\]
The joint distributions at k instants can also be obtained.
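As an illustration of the property above, the following minimal sketch computes $\pi_n' = \pi_0' P^n$ by repeated multiplication. The 3-state matrix and the initial distribution are hypothetical values chosen only for the example; they do not come from the talk.

```python
def step(pi, P):
    """One transition: returns the row vector pi' P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

def distribution_after(pi0, P, n_steps):
    """Marginal law of the chain after n_steps transitions: pi_0' P^n."""
    pi = list(pi0)
    for _ in range(n_steps):
        pi = step(pi, P)
    return pi

# Hypothetical 3-state homogeneous chain (each row sums to 1).
P = [[0.5, 0.3, 0.2],
     [0.1, 0.8, 0.1],
     [0.2, 0.2, 0.6]]
pi0 = [1.0, 0.0, 0.0]  # the chain starts in the first state

pi1 = distribution_after(pi0, P, 1)    # coincides with the first row of P
pi10 = distribution_after(pi0, P, 10)
print(pi1, pi10)
```

Starting from a point mass on one state, the law after one step is simply the corresponding row of $P$, which is a quick sanity check for the function.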
Stationary law. Stable system.
Definition
A distribution of the states $\pi_*$ is stationary if
\[
\pi_*' = \pi_*' P
\]
A system is stable, or stationary, if, regardless of the initial distribution, the law of the chain converges to the stationary distribution:
\[
\lim_{n \to \infty} \pi' P^n = \pi_*', \quad \forall \pi
\]
Remark
The stability of M1 systems depends on the type of transition matrix.
\[
\text{Reducible: }
\begin{pmatrix} 0.5 & 0.5 & 0 \\ 0.5 & 0.5 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\qquad
\text{Periodic: }
\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\]
A chain is irreducible if every state can be reached from every other state in a finite number of steps.
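The two matrices in the remark can be checked numerically. The sketch below (helper names are illustrative) iterates $\pi' P^n$ and shows why neither system is stable: in the periodic chain the law keeps oscillating, and in the reducible chain the limit depends on the initial distribution.

```python
def step(pi, P):
    """One transition: returns the row vector pi' P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

def iterate(pi0, P, n_steps):
    """Law of the chain after n_steps transitions."""
    pi = list(pi0)
    for _ in range(n_steps):
        pi = step(pi, P)
    return pi

periodic = [[0.0, 1.0],
            [1.0, 0.0]]
reducible = [[0.5, 0.5, 0.0],
             [0.5, 0.5, 0.0],
             [0.0, 0.0, 1.0]]

# Periodic chain: the law alternates between (1, 0) and (0, 1).
after_even = iterate([1.0, 0.0], periodic, 100)
after_odd = iterate([1.0, 0.0], periodic, 101)

# Reducible chain: two different starting points give two different limits.
from_first = iterate([1.0, 0.0, 0.0], reducible, 100)
from_third = iterate([0.0, 0.0, 1.0], reducible, 100)
print(after_even, after_odd, from_first, from_third)
```

Note that the periodic chain does have a stationary law, (0.5, 0.5); what fails is convergence to it from an arbitrary starting distribution.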
Stability and ergodicity
Stability theorem
A homogeneous Markov chain with a finite state space that is aperiodic and irreducible has a unique stationary law $\pi_*$ and, regardless of the initial law, the law of the chain converges to the stationary law (it is a stable system).
Ergodicity (SLLN)
In a stable system, after a long run, the process moves according to the stationary law.
If the sequence is long enough, the (relative) frequency of visits to a state approximates the probability of being in that state at any given instant. The histogram of a particular sequence (with enough steps) approximates the stationary law.
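The ergodic statement can be illustrated by simulation. The sketch below uses a hypothetical irreducible, aperiodic 2-state chain (the matrix is an assumption made for the example), approximates its stationary law by iterating $\pi' P$, and compares it with the visit frequencies of one long simulated trajectory.

```python
import random

def stationary(P, n_steps=1000):
    """Approximates the stationary law by iterating pi' P from the uniform law."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(n_steps):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def visit_frequencies(P, start, n_steps, rng):
    """Relative frequency of visits to each state along one trajectory."""
    counts = [0] * len(P)
    state = start
    for _ in range(n_steps):
        counts[state] += 1
        state = rng.choices(range(len(P)), weights=P[state])[0]
    return [c / n_steps for c in counts]

# Hypothetical chain; solving pi = pi P by hand gives (5/6, 1/6).
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi_star = stationary(P)
freq = visit_frequencies(P, start=1, n_steps=100_000, rng=random.Random(0))
print(pi_star, freq)
```

The empirical frequencies should agree with the stationary law up to sampling noise, whichever state the trajectory starts from.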
Example: the invisible friend
Suppose the weather in some faraway place follows a homogeneous M1 chain with two states:
PL (plou, it rains) and FS (fa sol, it is sunny)
These states are hidden; they can only be observed indirectly.
All we can learn, by e-mail, is the mood of our friend, which has three basic observables:
I (irritable), T (tranquil, calm) and E (euphoric)
The mood on one day does not directly influence the mood on the next day; it does so only through the weather (the hidden states).
Concept
An HMM is a doubly stochastic process, made up of a hidden Markov process and a visible, observable process.
In the simplest case of an HMM, one assumes a finite number of hidden (unobservable) states, governed by a homogeneous M1 Markov chain. At each instant these hidden states emit symbols or observable states, also called residues.
NO model of direct dependence between the observable states is postulated, only their dependence on the hidden states and the Markov model on the hidden states.
"An HMM is a piecewise homogeneous model in which the segmentation into homogeneous parts is random". "It is a finite probabilistic automaton".
Parameters
The model is completely determined once its parameters (all of them probabilities) are known.
Consider $N$ hidden states $(S_1, \dots, S_N)$ and $K$ observables $(a_1, \dots, a_K)$, the so-called alphabet of $K$ symbols.
Parameters of the hidden chain, the initial distribution and the transition probability matrix:
$\pi$: $N \times 1$ ⇒ $(N - 1)$ free parameters
$P$: $N \times N$ ⇒ $(N^2 - N)$ free parameters
Parameters of the relationship between hidden states and observables, the emission probabilities:
$b_{ij}$ := probability that the hidden state $S_i$ emits the symbol $a_j$
$B$: $N \times K$ ⇒ $(NK - N)$ free parameters
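Adding up the three counts gives the total number of free parameters of the model. The subtractions come from the constraints that $\pi$ and each row of $P$ and $B$ sum to 1. As a worked check, for a model with $N = 2$ hidden states and $K = 3$ symbols:

\[
(N - 1) + (N^2 - N) + (NK - N) = N^2 + NK - N - 1,
\qquad
N = 2,\ K = 3:\quad 1 + 2 + 4 = 7 .
\]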
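Graph of a hidden model
Parameters: an example
In the invisible-friend example, $N = 2$ and $K = 3$, and we assume (hidden states in the order PL, FS; observables in the order I, T, E):
\[
\pi = \begin{pmatrix} 1/3 \\ 2/3 \end{pmatrix},
\qquad
P = \begin{pmatrix} 0.3 & 0.7 \\ 0.2 & 0.8 \end{pmatrix},
\qquad
B = \begin{pmatrix} 0.4 & 0.4 & 0.2 \\ 0.1 & 0.6 & 0.3 \end{pmatrix}
\]
Graph of an HMM: an example
The graph is also called the architecture or topology of the HMM.
We show the one for the previous example (we will see more complex ones for DNA):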
[Figure: graph of the HMM. Hidden states PL and FS, with transitions PL→PL 0.3, PL→FS 0.7, FS→PL 0.2, FS→FS 0.8. Emissions from PL: I 0.4, T 0.4, E 0.2. Emissions from FS: I 0.1, T 0.6, E 0.3.]
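To make the generative reading of the graph concrete, here is a minimal simulation sketch of the invisible-friend model. The parameter values are those of the example; the function name, the seed and the sequence length are arbitrary choices made for illustration.

```python
import random

STATES = ["PL", "FS"]          # hidden weather states
SYMBOLS = ["I", "T", "E"]      # observable moods
PI = [1 / 3, 2 / 3]            # initial distribution
P = [[0.3, 0.7],               # transition probabilities
     [0.2, 0.8]]
B = [[0.4, 0.4, 0.2],          # emission probabilities
     [0.1, 0.6, 0.3]]

def simulate(length, rng):
    """Draws a hidden path and the observations it emits."""
    hidden, observed = [], []
    state = rng.choices(range(len(STATES)), weights=PI)[0]
    for _ in range(length):
        hidden.append(STATES[state])
        symbol = rng.choices(range(len(SYMBOLS)), weights=B[state])[0]
        observed.append(SYMBOLS[symbol])
        state = rng.choices(range(len(STATES)), weights=P[state])[0]
    return hidden, observed

hidden, observed = simulate(10, random.Random(2008))
print("weather (hidden):", " ".join(hidden))
print("mood (observed): ", " ".join(observed))
```

In practice only the second printed line would be available to us; the first is what the segmentation problem tries to recover.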
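A model for recognising the start of a gene
[Figure (from Eddy, 2004, listed in the References): toy HMM with states Start, E (exon), 5 (5′ splice site), I (intron) and End.
Emission probabilities:
E: A = 0.25, C = 0.25, G = 0.25, T = 0.25
5: A = 0.05, C = 0, G = 0.95, T = 0
I: A = 0.4, C = 0.1, G = 0.1, T = 0.4
Transition probabilities: Start→E 1.0, E→E 0.9, E→5 0.1, 5→I 1.0, I→I 0.9, I→End 0.1
Sequence: CTTCATGTGAAAGCAGACGTAAGTCA
Parsing (best state path): EEEEEEEEEEEEEEEEEE5IIIIIII, with log P = -41.22
The figure also lists the log P of alternative state paths (-43.90, -43.45, -43.94, -42.58, -41.71) and posterior decoding values (11%, 46%, 28%).]
Figure 1 A toy HMM for 5′ splice site recognition. See text for explanation.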
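The log P reported in the figure for the best state path can be reproduced by hand. The sketch below assumes the transition layout Start→E (1.0), E→E (0.9), E→5 (0.1), 5→I (1.0), I→I (0.9), I→End (0.1) and a best path with 18 exon positions, one splice-site position and 7 intron positions. Both are readings of the extracted figure rather than statements from the slide text, so treat them as assumptions.

```python
import math

# Emission probabilities of the three emitting states.
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
# Transition probabilities (assumed layout, see the lead-in).
TRANS = {
    ("Start", "E"): 1.0, ("E", "E"): 0.9, ("E", "5"): 0.1,
    ("5", "I"): 1.0, ("I", "I"): 0.9, ("I", "End"): 0.1,
}

def path_log_prob(sequence, path):
    """Natural log of the joint probability of a state path and a sequence."""
    assert len(sequence) == len(path)
    log_p = 0.0
    previous = "Start"
    for symbol, state in zip(sequence, path):
        log_p += math.log(TRANS[(previous, state)])
        log_p += math.log(EMIT[state][symbol])
        previous = state
    return log_p + math.log(TRANS[(previous, "End")])

SEQ = "CTTCATGTGAAAGCAGACGTAAGTCA"
PATH = "E" * 18 + "5" + "I" * 7

log_p = path_log_prob(SEQ, PATH)
print(round(log_p, 2))
```

Under these assumptions the value agrees with the -41.22 shown in the figure, which suggests the logarithms there are natural logarithms.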
A complex model (Ref: [KMH])
[Figure: HMM architecture for a parser for E. coli DNA with a simple intergenic model. Between Begin and End, the model combines a total of 61 triplet (codon) models (AAA, AAC, AAG, ..., TTT), start codons, stop codons and an intergene model. The original caption is not recoverable from the extracted text.]
A more complex model (Ref: [KMH])
[Figure: HMM architecture for a parser for E. coli DNA with a complex intergenic model. Labelled parts: model of the coding region, start codons, stop codons, overlap models, and intergene models for long and short intergenic regions. The original caption is not recoverable from the extracted text.]
Notation for sequences
$q_t$ are the random variables giving the hidden states.
$O_t$ are the random variables giving the observable residues.
time step: $\dots,\ t-1,\ t,\ t+1,\ \dots$
hidden states: $\dots,\ q_{t-1},\ q_t,\ q_{t+1},\ \dots$
observable symbols: $\dots,\ O_{t-1},\ O_t,\ O_{t+1},\ \dots$
$T$ (a natural number) is the size or length of the sequence.
$O = (O_1 \cdots O_T)$ is the observed sequence.
$q = (q_1 \cdots q_T)$ is the sequence of hidden states (query sequence).
Notation for the model
The HMM
We denote by $\lambda$ the (multidimensional) parameter that represents all the probabilities (initial, transition and emission) and that characterises the HMM.
Model (HMM): $\lambda = (\pi, P, B)$
Remark
Given a model, it can be used to simulate (generate sequences), to quantify the probability of certain observables, to recognise or predict hidden states, to identify observables (checking whether an observed sequence can be aligned or identified with another known one), etc.
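Notation for the probabilities of interest, given the model $\lambda$
Prior probability of the hidden sequence:
\[
P_\lambda(q) = \pi_{q_1} \prod_{t=2}^{T} p_{q_{t-1} q_t}
\]
"Natural" conditional (writing $b_i(a_j) := b_{ij}$):
\[
P_\lambda(O \mid q) = \prod_{t=1}^{T} P_\lambda(O_t \mid q_t) = \prod_{t=1}^{T} b_{q_t}(O_t)
\]
Joint probability of observables and hidden states:
\[
P_\lambda(q, O) = P_\lambda(q)\, P_\lambda(O \mid q)
\]
Total probability (a sum of $N^T$ terms, each a product of $2T$ factors):
\[
P_\lambda(O) = \sum_{q} P_\lambda(q, O)
= \underbrace{\sum_{q_1, \dots, q_T}}_{N^T}\
\underbrace{\pi_{q_1} \prod_{t=2}^{T} p_{q_{t-1} q_t} \prod_{t=1}^{T} b_{q_t}(O_t)}_{2T}
\]
Posterior probability of the hidden sequence (posterior decoding):
\[
P_\lambda(q \mid O) = \frac{P_\lambda(q, O)}{P_\lambda(O)}
\qquad \text{(Bayes, "inverse" conditioning)}
\]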
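These definitions translate directly into code. The following sketch evaluates them by brute force for the invisible-friend model (parameter values from the example; function names are illustrative). It enumerates all $N^T$ hidden paths, so it is only usable for very short sequences, which is precisely the motivation for the efficient algorithms discussed later.

```python
from itertools import product

PI = [1 / 3, 2 / 3]            # hidden states: 0 = PL, 1 = FS
P = [[0.3, 0.7],
     [0.2, 0.8]]
B = [[0.4, 0.4, 0.2],          # observables: 0 = I, 1 = T, 2 = E
     [0.1, 0.6, 0.3]]

def joint(q, obs):
    """P(q, O): prior of the hidden path times the emission probabilities."""
    prob = PI[q[0]] * B[q[0]][obs[0]]
    for t in range(1, len(obs)):
        prob *= P[q[t - 1]][q[t]] * B[q[t]][obs[t]]
    return prob

def total(obs):
    """P(O): sum of the joint probability over all N^T hidden paths."""
    n = len(PI)
    return sum(joint(q, obs) for q in product(range(n), repeat=len(obs)))

def posterior(q, obs):
    """P(q | O), by Bayes' rule."""
    return joint(q, obs) / total(obs)

obs = [0, 1, 2]  # moods I, T, E on three consecutive days
print(total(obs))
print(posterior((0, 1, 1), obs))  # path PL, FS, FS
```

Two useful sanity checks are that the posteriors of all paths add up to 1 and that `total` adds up to 1 over all observation sequences of a fixed length.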
Problem 1
Problem 1: evaluation (scoring)
(I) Evaluation problem: assuming the model $\lambda$ is known, and given a sequence of observables $O$, compute the probability of this sequence:
\[
P_\lambda(O) = P_\lambda(O_1 \cdots O_T) = \sum_{q} P_\lambda(q, O)
\]
In applications, one may want to choose between models by computing the score of the observed sequence under each of them and seeing which model gives the higher score.
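A standard way to compute this score without enumerating all hidden paths is the forward recursion. The sketch below is a minimal version for the invisible-friend parameters; the variable name `alpha` follows common usage in the HMM literature and is not notation from the talk.

```python
def forward(obs, pi, P, B):
    """Returns P(O) using the forward recursion, in O(N^2 T) operations."""
    n = len(pi)
    # alpha[i] = P(O_1 ... O_t, q_t = S_i), initialised for t = 1.
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for symbol in obs[1:]:
        alpha = [sum(alpha[i] * P[i][j] for i in range(n)) * B[j][symbol]
                 for j in range(n)]
    return sum(alpha)

PI = [1 / 3, 2 / 3]            # hidden states: 0 = PL, 1 = FS
P = [[0.3, 0.7],
     [0.2, 0.8]]
B = [[0.4, 0.4, 0.2],          # observables: 0 = I, 1 = T, 2 = E
     [0.1, 0.6, 0.3]]

score = forward([0, 1, 2], PI, P, B)   # moods I, T, E
print(score)
```

For long sequences the products underflow, so practical implementations rescale `alpha` at each step or work with logarithms. To compare two candidate models, the same observed sequence is scored under each set of parameters.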
Problem 2
Problem 2: segmentation (prediction, optimality, parsing)
(II) Segmentation problem: assuming $\lambda$ is known, and given a sequence of observables $O$, determine the conditionally most probable sequence (trajectory, path) of hidden states. We write $q^* = (q^*_1, \dots, q^*_T)$, such that
\[
q^* := \arg\max_{q} P_\lambda(q \mid O) = \arg\max_{q} P_\lambda(q, O)
\]
The second equality holds because $P_\lambda(O)$ does not depend on $q$.
Note: there are other probabilities one may choose to maximise.
In DNA, we will want, for example, to predict whether we are in a coding zone or in another region, or whether a gene or a protein can be identified, etc.
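The maximisation can be carried out by dynamic programming. The following is a minimal sketch of the Viterbi recursion for the invisible-friend parameters; function and variable names are illustrative, and the probabilities are multiplied directly, which is only safe for short sequences.

```python
def viterbi(obs, pi, P, B):
    """Returns the most probable hidden path and its joint probability."""
    n = len(pi)
    # delta[i] = largest probability of a path ending in state i at time t.
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    back = []  # back[t][j] = best predecessor of state j
    for symbol in obs[1:]:
        pointers, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * P[i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] * P[best_i][j] * B[j][symbol])
        back.append(pointers)
        delta = new_delta
    # Backtracking from the best final state.
    state = max(range(n), key=lambda i: delta[i])
    best_prob = delta[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_prob

PI = [1 / 3, 2 / 3]            # hidden states: 0 = PL, 1 = FS
P = [[0.3, 0.7],
     [0.2, 0.8]]
B = [[0.4, 0.4, 0.2],          # observables: 0 = I, 1 = T, 2 = E
     [0.1, 0.6, 0.3]]

path, prob = viterbi([0, 1, 2], PI, P, B)   # moods I, T, E
print(path, prob)
```

For the moods I, T, E the most probable weather path is PL, FS, FS, with joint probability 0.01344, which can be confirmed by enumerating the eight possible paths.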
Problem 3
Problem 3: estimation (training)
(III) Estimation problem: estimate $\lambda$, the set of parameters that maximises the probability of the observed sequence:
\[
\lambda^* := \arg\max_{\lambda} P_\lambda(O) \qquad \text{(maximum likelihood)}
\]
Training and validation data are available.
The goal of HMM methodology is to solve the three problems efficiently, with the lowest possible algorithmic complexity.
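As a rough illustration of how this maximisation is usually approached, here is a compact sketch of Baum-Welch re-estimation (an EM method) for a single observed sequence. The observation sequence is made up for the example, the starting values are the invisible-friend parameters, and the scaling scheme is one common choice among several. It is a sketch under these assumptions, not a production implementation.

```python
import math

def forward_backward(obs, pi, P, B):
    """Scaled forward/backward variables, scaling constants and log P(O)."""
    n, T = len(pi), len(obs)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [pi[i] * B[i][obs[0]] for i in range(n)]
        else:
            row = [sum(alpha[t - 1][i] * P[i][j] for i in range(n)) * B[j][obs[t]]
                   for j in range(n)]
        c = sum(row)
        scale.append(c)
        alpha.append([x / c for x in row])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(P[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   / scale[t + 1] for i in range(n)]
    return alpha, beta, scale, sum(math.log(c) for c in scale)

def baum_welch_step(obs, pi, P, B):
    """One EM update; also returns log P(O) under the input parameters."""
    n, K, T = len(pi), len(B[0]), len(obs)
    alpha, beta, scale, log_lik = forward_backward(obs, pi, P, B)
    # gamma[t][i] = P(q_t = S_i | O)
    gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]
    new_pi = gamma[0][:]
    new_P = []
    for i in range(n):
        denom = sum(gamma[t][i] for t in range(T - 1))
        new_P.append([
            sum(alpha[t][i] * P[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / scale[t + 1]
                for t in range(T - 1)) / denom
            for j in range(n)])
    new_B = []
    for i in range(n):
        denom = sum(gamma[t][i] for t in range(T))
        new_B.append([sum(gamma[t][i] for t in range(T) if obs[t] == k) / denom
                      for k in range(K)])
    return new_pi, new_P, new_B, log_lik

# Hypothetical observed moods (0 = I, 1 = T, 2 = E).
obs = [0, 1, 1, 2, 1, 0, 0, 1, 2, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 0]
pi, P, B = [1 / 3, 2 / 3], [[0.3, 0.7], [0.2, 0.8]], [[0.4, 0.4, 0.2], [0.1, 0.6, 0.3]]

log_liks = []
for _ in range(10):
    pi, P, B, log_lik = baum_welch_step(obs, pi, P, B)
    log_liks.append(log_lik)
print(log_liks)
```

Each iteration cannot decrease the likelihood, so the printed log-likelihoods should be non-decreasing. The algorithm converges to a local maximum that depends on the starting values.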
Four algorithms for three problems
To solve problem 1 (scoring) one can use a forward algorithm or, alternatively, a backward algorithm.
To solve problem 2 (segmentation) one can use the Viterbi algorithm, a forward pass followed by backtracking. It is based on dynamic programming: optimisation of sequential decision processes.
To solve problem 3 (estimation, training) one can use an iterative re-estimation algorithm known as Baum-Welch, a method of EM (expectation-maximization) type.
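To illustrate that the forward and backward passes are interchangeable for problem 1, the following sketch computes $P_\lambda(O)$ both ways for the invisible-friend parameters. Function names are illustrative, and no rescaling is applied, so it is meant for short sequences only.

```python
def forward(obs, pi, P, B):
    """P(O) accumulated from the first observation to the last."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for symbol in obs[1:]:
        alpha = [sum(alpha[i] * P[i][j] for i in range(n)) * B[j][symbol]
                 for j in range(n)]
    return sum(alpha)

def backward(obs, pi, P, B):
    """P(O) accumulated from the last observation back to the first."""
    n = len(pi)
    # beta[i] = P(O_{t+1} ... O_T | q_t = S_i), initialised for t = T.
    beta = [1.0] * n
    for symbol in reversed(obs[1:]):
        beta = [sum(P[i][j] * B[j][symbol] * beta[j] for j in range(n))
                for i in range(n)]
    return sum(pi[i] * B[i][obs[0]] * beta[i] for i in range(n))

PI = [1 / 3, 2 / 3]            # hidden states: 0 = PL, 1 = FS
P = [[0.3, 0.7],
     [0.2, 0.8]]
B = [[0.4, 0.4, 0.2],          # observables: 0 = I, 1 = T, 2 = E
     [0.1, 0.6, 0.3]]

obs = [0, 1, 2, 1, 1]
p_forward = forward(obs, PI, P, B)
p_backward = backward(obs, PI, P, B)
print(p_forward, p_backward)
```

Both passes cost on the order of $N^2 T$ operations, compared with the $N^T$ terms of the direct sum. Combining the two sets of intermediate quantities gives the posterior probability of each hidden state at each position, which is what Baum-Welch re-estimation relies on.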
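Viterbi algorithm
[Figure: trellis with the hidden states S1, ..., Si, ..., SN as rows and the observations O1, O2, O3 as columns, with the quantities δ1, δ2, δ3 under the columns. An asterisk marks one state per column: Si at O1, SN at O2 and SN at O3.]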
Questions
1 Where do coding regions begin and end? How can gene promoters be located?
2 Are there erroneously inserted or deleted nucleotides?
3 Exceptional words (too frequent or too rare)?
4 Study new genetic codes by identifying the common part (by alignment).
5 Which genes are expressed (activated, synthesising proteins, increasing RNA levels)?
6 Given many samples of huge numbers of genes (from one or several cells, identical or different genes, ...): find clusters of genes that are activated simultaneously, find clusters of samples that have the same genes activated, etc.
7 Other tools: compound Poisson processes (3; word counts with overlaps, ...), BLAST (Basic Local Alignment Search Tool) (4, 5, 6), microarrays (5, 6).
References
W. J. Ewens and G. R. Grant. Statistical Methods in Bioinformatics: An Introduction. Second edition. Statistics for Biology and Health. Springer, 2005.
[RRS] S. Robin, F. Rodolphe and S. Schbath. DNA, Words and Models: Statistics of Exceptional Words. Cambridge University Press, 2005.
L. E. Baum, T. Petrie, G. Soules and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164-171, 1970.
H. P. Chan, N. R. Zhang and L. H. Y. Chen. Importance sampling of word patterns in DNA and protein sequences. Preprint, 2007.
S. R. Eddy. What is a hidden Markov model? Nature Biotechnology, 22(10):1315-1316, 2004.
S. T. Jensen, X. S. Liu, Q. Zhou and J. S. Liu. Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective. Statistical Science, 19(1):188-204, 2004.
B. H. Juang and L. R. Rabiner. Hidden Markov Models for Speech Recognition. Technometrics, 33(3):251-272, 1991.
[KMH] A. Krogh, S. Mian and D. Haussler. A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Research, 22(22):4768-4778, 1994.
L. R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
Let us look at the auxiliary pages ...
Appendix
Hidden Markov Models: Applications to Financial Economics. R. Bhar and S. Hamori. Series: Advanced Studies in Theoretical and Applied Econometrics, Vol. 40. 2004, XVIII, 160 p., hardcover. ISBN: 978-1-4020-7899-6.
Applications to computer vision:
Hidden Markov Models and applications to speech modeling
[Bahl and Jelinek, 1975] L. Bahl and F. Jelinek. Decoding for channels with insertions, deletions and substitutions with application to speech recognition. IEEE Transactions on Information Theory, IT-21:404-411, 1975.
[Bahl et al., 1987] L. Bahl, P. Brown, P. de Souza, and R. Mercer. Estimating Hidden Markov Models so as to maximize speech recognition accuracy. Technical Report RC 13121, IBM T.J. Watson Research Center, 1987.
[Bourlard et al., 1985] H. Bourlard, Y. Kamp, and C. Wellekens. Speaker dependent connected speech recognition via phonemic Markov models. In International Conference on Acoustic, Speech and Signal Processing, pages 1213-1216, Tampa, 1985.
[Brown et al., 1983] P. Brown, C. Lee, and J. Spohrer. Bayesian adaptation in speech recognition. In International Conference on Acoustic, Speech and Signal Processing, pages 761-764, Boston, 1983.
[Jelinek and Mercer, 1980] F. Jelinek and R. Mercer. Interpolated estimation of Markov source parameters from sparse data. In E. Gelsema and L. Kanal, editors, Pattern Recognition in Practice, pages 381-397. North-Holland, Amsterdam, 1980.
[Lee et al., 1990] C.H. Lee, C.H. Lin, and B.H. Juang. A study on speaker adaptation of continuous density HMM parameters. In International Conference on Acoustic, Speech and Signal Processing, pages 145-148, 1990.
[Lee et al., 1991] C.H. Lee, C.H. Lin, and B.H. Juang. A study on speaker adaptation of continuous density Hidden Markov Models. IEEE Transactions on Signal Processing, 39(4):806-814, 1991.
[Leggetter and Woodland, 1994] C.J. Leggetter and P.C. Woodland. Speaker adaptation of continuous density HMMs using multivariate linear regression. In International Conference on Spoken Language Processing, pages 451-454, Yokohama, 1994.
[Levinson et al., 1983] S. Levinson, R. Rabiner, and M. Sondhi. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell Systems Technical Journal, 62:1035-1074, 1983.
[Liporace, 1982] L. Liporace. Maximum likelihood estimation for multivariate observation of Markov sources. IEEE Transactions on Information Theory, IT-28:729-734, 1982.
[Poritz, 1988] A. Poritz. Hidden Markov Models: a guided tour. In International Conference on Acoustic, Speech and Signal Processing, pages 7-13, 1988.
[Prat, 1995] F. Prat. Distorsión estocástica de cadenas de símbolos mediante dos métodos basados en modelos de Markov ocultos. Research report, Universidad Politécnica de Valencia, 1995.
[Rabiner and Juang, 1986] L. Rabiner and B. Juang. An introduction to Hidden Markov Models. IEEE Acoustic, Speech, and Signal Processing Magazine, 3(1):4-16, 1986.
[Rabiner et al., 1983] L. Rabiner, S. Levinson, and M. Sondhi. On the application of vector quantization and Hidden Markov Models to speaker independent, isolated word recognition. Bell Systems Technical Journal, 62:1075-1105, 1983.
[Rabiner, 1989] L. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
Application of Hidden Markov Models and Hidden Semi-Markov Models to Financial Time Series
Bulla, Jan (2006): Application of Hidden Markov Models and Hidden Semi-Markov Models to Financial Time Series. Published in:
Abstract Hidden Markov Models (HMMs) and Hidden Semi-Markov Models (HSMMs) provide flexible, general-purpose models for univariate and multivariate time series. Although interest in HMMs and HSMMs has continuously increased during the past years, and numerous articles on theoretical and practical aspects have been published, several gaps remain. This thesis addresses some of them, divided into three main topics.
1. Computational issues in parameter estimation of stationary HMMs. The parameters of a HMM can be estimated by direct numerical maximization (DNM) of the log-likelihood function or, more popularly, using the Expectation-Maximization (EM) algorithm. We show how the EM algorithm could be modified to fit stationary HMMs. We propose a hybrid algorithm that is designed to combine the advantageous features of the EM and DNM algorithms, and compare the performance of the three algorithms (EM, DNM and the hybrid). We then describe the results of an experiment to assess the true coverage probability of bootstrap-based confidence intervals for the parameters.
2. A Markov switching approach to model time-varying Beta risk of pan-European Industry portfolios. The motive to take up this topic was the development of a joint model for many financial time series. We study two Markov switching models in a Capital Asset Pricing Model framework, and compare their forecast performances to three models, namely a bivariate t-GARCH(1,1) model, two Kalman filter based approaches and a bivariate stochastic volatility model.
3. Stylized facts of financial time series and HSMMs. The ability of a HMM to reproduce several stylized facts of daily return series was illustrated by Ryden et al. (1998). However, they point out that one stylized fact cannot be reproduced by a HMM, namely the slowly decaying autocorrelation function of squared returns. We present two HSMM-based approaches to model eighteen series of daily sector returns with about 5.000 observations. The key result is that, compared to a HMM, the slowly decaying autocorrelation function is significantly better described by a HSMM with negative binomial sojourn time and Normal conditional distributions.
Inference in Hidden Markov Models
Book by Olivier Cappé, Eric Moulines, and Tobias Rydén, published by Springer in July 2005. MathSciNet features a complete review of the book. The book was also reviewed in the August 2006 issue of the ISI Short Book Reviews and in the November 2006 issue of Technometrics. Below is what the back cover says about the book.
Hidden Markov models have become a widely used class of statistical models with applications in diverse areas such as communications engineering, bioinformatics, finance and many more. This book is a comprehensive treatment of inference for hidden Markov models, including both algorithms and statistical theory. Topics range from filtering and smoothing of the hidden Markov chain to parameter estimation, Bayesian methods and estimation of the number of states.
In a unified way the book covers both models with finite state spaces, which allow for exact algorithms for filtering, estimation etc., and models with continuous state spaces (also called state-space models) requiring approximate simulation-based algorithms that are also described in detail. Simulation in hidden Markov models is addressed in five different chapters which cover both Markov chain Monte Carlo and sequential Monte Carlo approaches. Many examples illustrate the algorithms and theory. The book also carefully treats Gaussian linear state-space models and their extensions and it contains a chapter on general Markov chain theory and probabilistic aspects of hidden Markov models.
This volume will suit anybody with an interest in inference for stochastic processes, and it is meant to be useful for researchers and practitioners in areas such as statistics, signal processing, communications engineering, control theory, econometrics, finance and more. The algorithmic parts of the book do not require an advanced mathematical background, while the more theoretical parts require knowledge of probability theory at the measure-theoretical level.
Olivier Cappé is Researcher for the French National Center for Scientific Research (CNRS). He received the Ph.D. degree in 1993 from Ecole Nationale Supérieure des Télécommunications, Paris, France, where he is currently a Research Associate. Most of his current research concerns computational statistics and statistical learning.
Eric Moulines is Professor at Ecole Nationale Supérieure des Télécommunications (ENST), Paris, France. He graduated from Ecole Polytechnique, France, in 1984 and received the Ph.D. degree from ENST in 1990. He has authored more than 150 papers in applied probability, mathematical statistics and signal processing.
Tobias Rydén is Professor of Mathematical Statistics at Lund University, Sweden, where he also received his Ph.D. in 1993. His publications include papers ranging from statistical theory to algorithmic developments for hidden Markov models.
G. Mendel Museum, Brno
G. Mendel's garden, Brno