

Arthur Kunkle

ECE 5526

HW #4


Problem 1

Each HMM was used to generate and visualize a sample sequence, X. These are the outputs from each HMM.

[Figures: sample sequences generated by HMM1 through HMM6]

Questions:

1. The following properties characterize a correct transition matrix:

a. It is square, with dimensions equal to the number of states.

b. The first column is all 0 (the initial state cannot be transitioned to).

c. The last row is all 0 except the final entry, which is 1 (the final state can only transition to itself).

d. The entries in each row sum to 1.
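A minimal sketch of how these four properties could be checked programmatically (my own illustration, not part of the original write-up; the function name and example values are placeholders):

    import numpy as np

    def is_valid_transition_matrix(A, tol=1e-8):
        """Check the four transition-matrix properties listed above."""
        A = np.asarray(A, dtype=float)
        n_rows, n_cols = A.shape
        square = n_rows == n_cols                      # (a) N x N for N states
        no_entry_to_start = np.allclose(A[:, 0], 0.0)  # (b) first column all zero
        final_absorbing = (np.allclose(A[-1, :-1], 0.0)
                           and abs(A[-1, -1] - 1.0) < tol)  # (c) final state only loops to itself
        rows_stochastic = np.allclose(A.sum(axis=1), 1.0, atol=tol)  # (d) rows sum to 1
        return square and no_entry_to_start and final_absorbing and rows_stochastic

    # Example: a 4-state left-to-right matrix that satisfies all four properties.
    A = np.array([[0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.6, 0.4, 0.0],
                  [0.0, 0.0, 0.7, 0.3],
                  [0.0, 0.0, 0.0, 1.0]])
    print(is_valid_transition_matrix(A))  # True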

2. The transition matrix will affect the "duration" of emissions within particular classes or groups of classes. In the above output visualizations, especially for HMMs 4-6, the sample chains tend to occur in clumps of the same class.
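For reference (a standard HMM property, not stated in the original write-up): if a_ii is the self-transition probability of state i, the number of consecutive emissions produced in that state follows a geometric distribution, P(d) = a_ii^(d-1) * (1 - a_ii), with expected duration E[d] = 1 / (1 - a_ii). A larger a_ii therefore produces the longer single-class clumps visible in the plots above.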

3. Without a final state, the observation sequence length would be unbounded.

4. A single HMM is specified by:

a. A D-dimensional mean vector for each state.

b. A DxD-dimensional covariance matrix for each state.

c. An NxN-dimensional transition matrix for all transitions.

d. An N-dimensional initial state probability vector.

Total parameters: N*D + N*D^2 + N^2 + N (means, covariances, transitions, and initial probabilities, respectively).
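As a quick check (not part of the original answer): with N = 3 states and D = 2 feature dimensions, this gives 3*2 + 3*4 + 3^2 + 3 = 6 + 12 + 9 + 3 = 30 parameters.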

5. A word would use a left-to-right model. The sequence of phones is fixed, and self-transitions allow a phone to repeat, so longer utterances of the same phone are also supported by this model type.

More Questions:

1. Starting from the identity log(a + b) = log(a) + log(1 + e^(log(b) - log(a))), expanding the right-hand side confirms it:

log(a) + log(1 + e^(log(b) - log(a)))
= log(a * (1 + e^(log(b/a))))
= log(a + a * e^(log(b/a)))
= log(a + a * (b/a))
= log(a + b)


If log(a) > log(b), the second implementation would be better suited: the exponent log(b) - log(a) is then negative, so e^(log(b) - log(a)) stays in (0, 1] and the argument of the outer log remains at or above 1, away from the region near zero where ln(x) falls off steeply (and its derivative blows up).

[Figure: plot of ln(x), showing the steep slope as x approaches zero]
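A minimal numerical sketch of this log-add (my own illustration; the function name is arbitrary):

    import math

    def log_add(log_a, log_b):
        """Compute log(a + b) from log(a) and log(b) without leaving the log domain."""
        # Keep the larger value outside the exponential so the exponent is <= 0.
        if log_a < log_b:
            log_a, log_b = log_b, log_a
        return log_a + math.log1p(math.exp(log_b - log_a))

    # Example: log(0.001 + 0.002) computed entirely from the individual logs.
    print(log_add(math.log(0.001), math.log(0.002)))  # ~ log(0.003) = -5.809...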

2. The forward recursion alpha_t+1(j) = b_j(x_t+1) * sum_i(alpha_t(i) * a_ij) becomes, in the log domain,

log(alpha_t+1(j)) = log(b_j(x_t+1)) + log(sum_i(alpha_t(i) * a_ij))

where the log of the sum cannot be split into a sum of logs; instead, each term enters as log(alpha_t(i)) + log(a_ij) and the terms are accumulated with the log-add identity from Question 1.

The biggest performance gain from this conversion is the ability to perform repeated additions instead of multiplications. Because the number of state transitions can be very large for some HMMs, this is a critical gain.
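A minimal sketch of the forward recursion in the log domain (my own illustration; the logA/logB layout and the assumption that the model starts in state 0 are mine, not the assignment's):

    import numpy as np

    def log_forward(logA, logB):
        """Forward algorithm in the log domain.
        logA: (N, N) log transition probabilities
        logB: (T, N) log emission probabilities, logB[t, j] = log b_j(x_t)
        Returns log P(X | model)."""
        T, N = logB.shape
        log_alpha = np.full(N, -np.inf)
        log_alpha[0] = logB[0, 0]          # assume the model starts in state 0
        for t in range(1, T):
            new_alpha = np.full(N, -np.inf)
            for j in range(N):
                # log of sum_i alpha_t(i) * a_ij, accumulated with log-add
                terms = log_alpha + logA[:, j]
                new_alpha[j] = logB[t, j] + np.logaddexp.reduce(terms)
            log_alpha = new_alpha
        return np.logaddexp.reduce(log_alpha)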


Problem 2

1. Bayesian classification is based on calculating the posterior probability. In order to use it (as was done in Problem 6 of HW #3), the prior probabilities of all possible state sequences need to be known. Previously, the probabilities of the classes q_k were given or calculated ahead of time. In the case of the forward algorithm, we are generating all the possible joint probabilities for each sequence. The most significant assumption that must be made for Bayesian classification is the independence of the feature vectors. Also, the general form of the PDF of the distribution of classes (or state sequences) must be known.
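For reference (the standard decision rule, not taken from the assignment): with models lambda_1 ... lambda_6 and priors P(lambda_k), Bayesian classification selects k* = argmax_k P(lambda_k | X) = argmax_k P(X | lambda_k) * P(lambda_k). With equal priors this reduces to picking the model with the largest forward likelihood P(X | lambda_k), which is what the log-likelihood tables in Problem 3 do.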


Problem 3

The plotted sample points for sequences X1 through X6:


Log probabilities were obtained using the logfwd routine. For each sequence, the most likely model was chosen as the one with the greatest log probability. The values shown have been divided by 10^3.

Sequence   logP(X|1)   logP(X|2)   logP(X|3)   logP(X|4)   logP(X|5)   logP(X|6)   Most Likely Model
X1         -0.5594     -0.6212     -1.4398     -1.4211     -1.2945     -0.9932     1
X2         -0.1160     -0.1175     -0.1114     -0.1145     -0.2462     -0.1402     3
X3         -0.8265     -0.7878     -1.1802     -1.1543     -0.7410     -1.3291     5
X4         -0.8789     -0.8233     -0.8551     -0.8203     -1.1583     -1.0779     4
X5         -0.7765     -0.7609     -0.8116     -0.7889     -0.9978     -0.6035     6
X6         -1.3968     -1.3228     -1.9053     -1.8520     -2.1105     -2.0234     2
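As a quick check (my own addition, using the values from the table above), the Most Likely Model column is simply a row-wise argmax:

    import numpy as np

    # logp[i, k] = log P(X_{i+1} | HMM_{k+1}), values from the table above (divided by 10^3)
    logp = np.array([
        [-0.5594, -0.6212, -1.4398, -1.4211, -1.2945, -0.9932],
        [-0.1160, -0.1175, -0.1114, -0.1145, -0.2462, -0.1402],
        [-0.8265, -0.7878, -1.1802, -1.1543, -0.7410, -1.3291],
        [-0.8789, -0.8233, -0.8551, -0.8203, -1.1583, -1.0779],
        [-0.7765, -0.7609, -0.8116, -0.7889, -0.9978, -0.6035],
        [-1.3968, -1.3228, -1.9053, -1.8520, -2.1105, -2.0234],
    ])
    most_likely = np.argmax(logp, axis=1) + 1   # 1-based HMM index per sequence
    print(most_likely)                          # [1 3 5 4 6 2]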


Problem 4

1. In the previously discussed forward algorithm, the alpha vector is updated by multiplying the existing alpha by the transition probability to each state. Obviously, if an ergodic model with a large number of states is being used, this operation is very expensive and must be performed at every transition time, t. Performing the log equivalent of this operation reduces it to less expensive additions, but the number of computations needed is still infeasible. With the Viterbi best-path approximation, the delta vector is updated by multiplying the existing vector entry by only the maximum transition probability into the next state. If these probabilities are pre-sorted during model generation, only one multiplication (or addition, in log form) must be done per recursion.

2. From delta_t+1(j) = max_i(delta_t(i) * a_ij) * b_j(x_t+1), and using log(a*b) = log(a) + log(b):

log(delta_t+1(j)) = log(max_i(delta_t(i) * a_ij)) + log(b_j(x_t+1))
                  = max_i(log(delta_t(i)) + log(a_ij)) + log(b_j(x_t+1))

since log is monotonically increasing, the log of the maximum equals the maximum of the logs.
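A minimal log-domain Viterbi sketch (my own illustration, using the same assumed logA/logB layout and start-in-state-0 convention as the forward sketch above):

    import numpy as np

    def log_viterbi(logA, logB):
        """Viterbi best-path log-likelihood and state sequence in the log domain.
        logA: (N, N) log transition probabilities
        logB: (T, N) log emission probabilities."""
        T, N = logB.shape
        log_delta = np.full(N, -np.inf)
        log_delta[0] = logB[0, 0]                  # assume the model starts in state 0
        backptr = np.zeros((T, N), dtype=int)
        for t in range(1, T):
            scores = log_delta[:, None] + logA     # scores[i, j] = log_delta(i) + log a_ij
            backptr[t] = np.argmax(scores, axis=0)
            log_delta = np.max(scores, axis=0) + logB[t]
        # Backtrack from the best final state to recover the best path.
        path = [int(np.argmax(log_delta))]
        for t in range(T - 1, 0, -1):
            path.append(int(backptr[t, path[-1]]))
        return float(np.max(log_delta)), path[::-1]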

3. The most likely models found earlier were used to find the best path for each of X1-X6, plotted below:

[Figures: Viterbi best-path alignments for X1/HMM1, X2/HMM3, X3/HMM5, X4/HMM4, X5/HMM6, X6/HMM2]


As shown by the yellow outlines in the data, there are very few differences between the original and Viterbi alignments. To show the opposite case, the following shows the alignment difference for an HMM with a lower probability. Notice the significant differences:

[Figure: Viterbi alignment of X6 under HMM6]

4. Likelihoods along the best path, computed using the Viterbi algorithm. Notice that the most likely model chosen is the same as with the forward algorithm.

Sequence   logP(X|1)   logP(X|2)   logP(X|3)   logP(X|4)   logP(X|5)   logP(X|6)   Most Likely Model
X1         -0.5594     -0.6222     -1.4398     -1.4211     -1.2945     -0.9932     1
X2         -0.1160     -0.1175     -0.1114     -0.1145     -0.2462     -0.1402     3
X3         -0.8267     -0.7879     -1.3478     -1.3169     -0.7410     -1.3760     5
X4         -0.8789     -0.8233     -0.8551     -0.8203     -1.6207     -1.0779     4
X5         -0.7768     -0.7610     -0.8116     -0.7890     -0.9978     -0.6035     6
X6         -1.3968     -1.3228     -3.1193     -3.0588     -3.3987     -2.0438     2

Difference between the log-likelihoods along the best path and those from the forward algorithm. The largest differences are highlighted in red.

Sequence   HMM1     HMM2     HMM3     HMM4     HMM5     HMM6
X1         0.0001   0.0009   0.0000   0.0000   0.0000   0.0000
X2         0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
X3         0.0002   0.0001   0.1676   0.1626   0.0000   0.0469
X4         0.0000   0.0000   0.0000   0.0000   0.4624   0.0000
X5         0.0003   0.0001   0.0000   0.0000   0.0000   0.0000
X6         0.0001   0.0000   1.2140   1.2068   1.2882   0.0204

5. Given the extremely low error between the “real” likelihood generated using the forward algorithm and the likelihood along the best path given by the Viterbi algorithm, this method is a very good approximation of the real likelihood.


Problem 5

1. A left-to-right HMM would be the best choice for the word /aiy/. This is because the sequence of phones that make up the word is fixed and does not make sense to repeat. For instance, if an ergodic model were chosen, a sequence such as "ay-eh-ay-eh" could be generated, which is not a valid realization of /aiy/. The duration of each of the phones (ay and eh) would be modeled by self-transitions. There would be two additional start and end states (see the state diagram at the end of this document).
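A sketch of what the corresponding transition matrix might look like (my own illustration; the 0.8/0.2 self-transition values are arbitrary placeholders, not values from the assignment):

    import numpy as np

    # States: 0 = Start (non-emitting), 1 = ay, 2 = eh, 3 = End (non-emitting).
    # Self-loops on ay and eh model phone duration; 0.8 is an arbitrary initial guess.
    A = np.array([[0.0, 1.0, 0.0, 0.0],   # Start always enters ay
                  [0.0, 0.8, 0.2, 0.0],   # ay: repeat or move on to eh
                  [0.0, 0.0, 0.8, 0.2],   # eh: repeat or move on to End
                  [0.0, 0.0, 0.0, 1.0]])  # End is absorbing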

2. To generate the HMM parameters, we have to estimate the emission probabilities for each state as well as the transition probabilities between states.

a. Because the phone labeling for each sample value in the data is already done, we can simply use the existing data to estimate the mean and the covariance for the two classes (or more, depending on how the word /aiy/ is broken down). This was the procedure used in Problem 2 of HW #3. This determines the B matrix (emission model) for our HMM.
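A sketch of this supervised estimate (my own illustration; the data and label arrays are placeholders):

    import numpy as np

    def estimate_gaussians(X, labels):
        """Per-class sample mean and covariance from labeled frames.
        X: (T, D) feature vectors; labels: length-T array of class ids (e.g. 'ay', 'eh')."""
        params = {}
        for c in np.unique(labels):
            Xc = X[labels == c]
            params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False))
        return params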

b. Choose an initial A matrix defining the probabilities of transitioning between states. Three different probabilities will be chosen (a01 and a23 are predefined to be 1). Now we can use Baum-Welch re-estimation with multiple observations (discussed in Chapter 5). This procedure iteratively refines the HMM parameters. After each iteration and recalculation of the parameters, check against the previous model to determine whether the likelihood has stopped improving (a stationary point has been reached). If so, we have the maximum-likelihood HMM. We can use all of the observations given in the training data for this (or only some, if convergence is reached early).

3. If the training data is not labeled prior to training, some form of unsupervised initialization will be needed, as was the case in Problem 8 of HW #3. Any of the three methods (K-means, Viterbi-EM, or EM) can be used on the training data. If we know the number of classes we are using (which we assume to be 2: ay and eh), one of these methods will provide the Gaussian model parameters for each state. As above, we can then choose initial state transition probabilities and apply Baum-Welch re-estimation with multiple observations to refine the parameters over all the training data.

[State diagram for Problem 5.1: Start -> ay -> eh -> End, with self-transitions on ay and eh]