Appendix A Dialogue Acts Formats

Dialogue acts provide a representation of the underlying meaning of a user's utterance. Various dialogue act taxonomies have been developed (see Traum (2000) for a discussion on various taxonomies). The dialogue acts used in this thesis follow the CUED dialogue act taxonomy.

CUED dialogue acts are represented as the combination of a dialogue act type followed by a (possibly empty) sequence of dialogue act items,

    acttype(a = x, b = y, ...),

where the bracketed sequence a = x, b = y, ... constitutes the act items.

The acttype denotes the type of dialogue act, for example request, inform, or confirm. The act items, a = x, b = y, etc., will be either attribute-value pairs such as type = Chinese or simply an attribute name or value. Examples of the latter cases include request(addr), meaning "What is the address?", and inform(=dontcare), meaning "I don't care".
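
The format is compact enough to represent directly in code. The following Python sketch is purely illustrative (it is not part of the thesis; the class and function names are invented) and parses acts such as those above into an act type and a list of attribute-value items:

```python
import re
from dataclasses import dataclass, field

@dataclass
class DialogueAct:
    """A CUED-style dialogue act: an act type plus a list of act items.

    Each item is an (attribute, value) pair; either element may be None,
    which covers forms such as request(addr) and inform(=dontcare)."""
    act_type: str
    items: list = field(default_factory=list)

def parse_act(text):
    """Parse e.g. 'inform(type=Chinese)' into a DialogueAct."""
    match = re.fullmatch(r"(\w+)\((.*)\)", text.strip())
    if match is None:
        raise ValueError(f"not a dialogue act: {text!r}")
    act_type, body = match.groups()
    items = []
    for part in filter(None, (p.strip() for p in body.split(","))):
        if "=" in part:
            attr, _, value = part.partition("=")
            items.append((attr.strip() or None, value.strip() or None))
        else:
            items.append((part, None))  # bare attribute, e.g. request(addr)
    return DialogueAct(act_type, items)

print(parse_act("inform(type=Chinese, =dontcare)"))
print(parse_act("request(addr)"))
```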

A complete listing of all dialogue act types and their meaning is given in Table A.1.


Table A.1 The CUED dialogue act set, reproduced from Schatzmann (2008)

Act                         System  User  Description
hello()                     √       √     start dialogue
hello(a=x, b=y, ...)        ×       √     start dialogue and give information a=x, b=y, ...
silence()                   ×       √     the user was silent
thankyou()                  ×       √     non-specific positive response from the user
ack()                       ×       √     back-channel, e.g. "uh uh", "ok", etc.
bye()                       √       √     end dialogue
hangup()                    ×       √     user hangs up
inform(a=x, b=y, ...)       √       √     give information a=x, b=y, ...
inform(name=none)           √       ×     inform that no suitable entity can be found
inform(a!=x, ...)           ×       √     inform that a is not equal to x
inform(a=dontcare, ...)     ×       √     inform that a is a "don't care" value
request(a)                  √       √     request value of a
request(a, b=x, ...)        √       √     request value for a given b=x, ...
reqalts()                   ×       √     request alternative solution
reqalts(a=x, ...)           ×       √     request alternative consistent with a=x, ...
reqalts(a=dontcare, ...)    ×       √     request alternative relaxing constraint a
reqmore()                   √       ×     inquire if user wants anything more
reqmore(a=dontcare)         √       ×     inquire if user would like to relax a
reqmore()                   ×       √     request more information about current solution
reqmore(a=x, b=y, ...)      ×       √     request more info given a=x, b=y, ...
confirm(a=x, b=y, ...)      √       √     confirm a=x, b=y, ...
confirm(a!=x, ...)          √       √     confirm a!=x, etc.
confirm(name=none)          ×       √     confirm that no suitable entity can be found
confreq(a=x, ..., c=z, d)   √       ×     confirm a=x, ..., c=z and request value of d
select(a=x, a=y)            √       ×     select either a=x or a=y
affirm()                    √       √     simple yes response
affirm(a=x, b=y, ...)       √       √     affirm and give further info a=x, b=y, ...
negate()                    √       √     simple no
negate(a=x)                 √       √     negate and give corrected value for a
negate(a=x, b=y, ...)       √       √     negate(a=x) and give further info b=y, ...
deny(a=x, b=y)              ×       √     no, a!=x, and give further info b=y, ...
repeat()                    √       √     request to repeat last act
help()                      ×       √     request for help
restart()                   ×       √     request to restart
null()                      √       √     null act; does nothing


Appendix B Proof of Grouped Loopy Belief Propagation

Consider the update for the portion of a factor graph shown in Fig. B.1. The variables under consideration are X = (X_1, X_2, ..., X_m), the variable values are x = (x_1, x_2, ..., x_m), the approximating function is q(x_i), and the cavity distribution is q\(x_i). Note that the identification of the factor, β, has been removed for simplicity.

Now assume a partitioning of the values of each variable. The partition for X_i is denoted here by the sets Z_{i,0}, Z_{i,1}, ..., Z_{i,k_i}, where the sets are mutually exclusive and their union is the set of all values for X_i. In the k-best approach discussed in Chap. 4 these would consist of a series of singleton sets along with a set of all remaining options.

The Expectation Propagation algorithm is now applied using an approximation where q(x) = ∏_i q_i(x_i) and q_i(x_i) = q_{i,η} for all x_i ∈ Z_{i,η}. The q_{i,η} are the parameters of this approximation.

The target and approximating distributions are defined as with standard LBP,

    p*(x) = f(x) ∏_i q\(x_i),
    q(x) = ∏_i q_i(x_i).

The quantity that needs to be minimized is the KL-divergence,

    KL(p*||q) = ∑_x p*(x) log ( p*(x) / q(x) )
              = α − ∑_x [ f(x) ∏_i q\(x_i) ] log ∏_i q_i(x_i),

where α is a constant.


Fig. B.1 A portion of a factor graph: variables X_1, X_2, X_3, ..., X_m connected to a single factor

The KL-divergence must be minimized subject to the constraints ∑_{η=0}^{k_i} |Z_{i,η}| q_{i,η} = 1 for each i. Adding Lagrange multipliers, λ_i, gives the following objective:

    ζ = KL(p*||q) + ∑_i λ_i ( 1 − ∑_{η=0}^{k_i} |Z_{i,η}| q_{i,η} ).    (B.1)

Setting the derivative with respect to q_{l,η} to 0 gives

    λ_l |Z_{l,η}| q_{l,η} = ∑_{x : x_l ∈ Z_{l,η}} f(x) ∏_i q\(x_i).    (B.2)

Many of the q\(x_i) factors in the above sum will be the same, by virtue of being in the same group. These values can therefore be factored out of the sum. To write this as an equation some more notation is defined. First, a set is needed to define all vectors of partition indices, called combinations of the groups:

    C = { c = (c_1, c_2, ..., c_m) : c_i = 0, 1, ..., k_i for each i }.    (B.3)

When only a given group, Z_{l,η}, is going to be considered for one variable while the other variables roam freely, one obtains the set:

    C_{l,η} = { c ∈ C : c_l = η }.    (B.4)

Given a combination of groups, c ∈ C, it will be necessary to sum all x values that correspond to c. The iteration and resulting sum are defined by,

    Y_c = { x : x_i ∈ Z_{i,c_i} for every i },    (B.5)
    f(c) = ∑_{x ∈ Y_c} f(x).    (B.6)

Using this new notation, Eq. B.2 can be simplified by rearranging and factoring out the constant q\(Z_{i,c_i}) terms:


    λ_l |Z_{l,η}| q_{l,η} = ∑_{c ∈ C_{l,η}} ∏_i q\(Z_{i,c_i}) ∑_{x ∈ Y_c} f(x)    (B.7)
                          = ∑_{c ∈ C_{l,η}} ∏_i q\(Z_{i,c_i}) f(c).    (B.8)

Hence,

    q_{l,η} ∝ (1/|Z_{l,η}|) ∑_{c ∈ C_{l,η}} ∏_i q\(Z_{i,c_i}) f(c),    (B.9)
    q_l(Z_{l,η}) ∝ (1/|Z_{l,η}|) ∑_{c ∈ C_{l,η}} ∏_{i ≠ l} q\(Z_{i,c_i}) f(c).    (B.10)

The update equation that results from minimizing the KL-divergence using this approximation is very similar to the standard LBP equations. Instead of iterating over all the combinations of values, the iteration is over the combinations of groups.
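
As a concrete illustration of Eq. B.10, the sketch below (illustrative Python, not the thesis implementation) computes the grouped update for one variable, assuming the blocked factor sums f(c) of Eq. B.6 have already been evaluated:

```python
import itertools
import numpy as np

def grouped_update(f_c, cavity, sizes_l, l):
    """Grouped LBP update for variable l, following Eq. B.10 (a sketch).

    f_c     : array indexed by a tuple of group indices c, holding the
              factor pre-summed over the block Y_c.
    cavity  : list of arrays; cavity[i][eta] is the cavity value shared
              by every value in the group Z_{i,eta}.
    sizes_l : array of group sizes |Z_{l,eta}| for variable l."""
    n_groups = [len(q) for q in cavity]
    out = np.zeros(n_groups[l])
    # Iterate over combinations of groups (the set C), not of raw values.
    for c in itertools.product(*(range(n) for n in n_groups)):
        weight = np.prod([cavity[i][c[i]]
                          for i in range(len(cavity)) if i != l])
        out[c[l]] += weight * f_c[c]
    out /= sizes_l                        # the 1/|Z_{l,eta}| factor
    return out / out.sum()                # normalise to a distribution

# Toy example: two variables, each partitioned into two groups.
f_c = np.array([[1.0, 0.5], [0.2, 0.3]])
cavity = [np.array([0.6, 0.4]), np.array([0.7, 0.3])]
print(grouped_update(f_c, cavity, sizes_l=np.array([1, 3]), l=0))
```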


Appendix C Experimental Model for Testing Belief Updating Optimisations

In order to do an experimental analysis of the effect of the number of time-slices in the loopy belief propagation algorithm, a set of simple Bayesian networks were built. These give an abstraction of the type of structures that are present in dialogue systems. Each time-slice of the network consists of a tree of goal nodes with depth D and branching factor B, along with an observation node for each goal node. Each node may take B + 1 different values, labeled "N/A", 1, 2, ..., B.

The observation nodes each depend on their associated goal node, with observation probability given by:

    p(o|g) = 5/(B+5)  if o = g,
             1/(B+5)  otherwise.    (C.1)

Each goal node has an associated parent goal, denoted g_p. Each value of the parent goal is associated with exactly one child node. The child goal will only take the values 1, ..., B in cases when the parent goal has this value, labeled a. In all other cases the child goal's value will be "N/A". The probability of moving to a particular goal g' given the previous goal g is then:

    p(g'|g, g_p ≠ a) = 1  if g' = "N/A",
                       0  otherwise,    (C.2)

    p(g'|g, g_p = a) = 0        if g' = "N/A",
                       5/(B+4)  if g' = g ≠ "N/A",
                       1/(B+4)  otherwise.    (C.3)

In the first turn, a simple uniform distribution is used instead of the last two probabilities above, i.e. p(g'|g, g_p = a) = 1/B for g' ≠ "N/A".


This abstraction allows several tests of how the different algorithms compare in speed for tasks of various complexity. Computation times using a depth of 0 and 2 are given in Figs. 4.2 and 4.3.
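
For reference, the conditional probability tables of Eqs. C.1-C.3 for a single goal node can be constructed as in the following sketch (illustrative code, not from the thesis; the convention of using index 0 for "N/A" is an assumption made here):

```python
import numpy as np

def build_cpts(B):
    """CPTs for one goal node of the Appendix C test network (a sketch).

    Rows are indexed by the conditioning value, columns by the new value;
    index 0 stands for "N/A" and indices 1..B are the ordinary values."""
    V = B + 1
    # Observation model, Eq. C.1: p(o|g).
    p_obs = np.full((V, V), 1.0 / (B + 5))
    np.fill_diagonal(p_obs, 5.0 / (B + 5))
    # Transition when the parent goal does not select this child, Eq. C.2.
    p_inactive = np.zeros((V, V))
    p_inactive[:, 0] = 1.0               # g' = "N/A" with certainty
    # Transition when the parent goal selects this child, Eq. C.3.
    p_active = np.full((V, V), 1.0 / (B + 4))
    p_active[:, 0] = 0.0                 # g' = "N/A" is impossible
    for g in range(1, V):
        p_active[g, g] = 5.0 / (B + 4)   # staying on the same goal
    # Note: the g = "N/A" row of p_active sums to B/(B+4); Eq. C.3 as
    # printed does not specify a normalised distribution for that corner.
    # First turn: uniform over the B ordinary values.
    p_first = np.full(V, 1.0 / B)
    p_first[0] = 0.0
    return p_obs, p_inactive, p_active, p_first

p_obs, p_inact, p_act, p_first = build_cpts(B=3)
print(p_act[1])   # row p(g'|g=1, parent active); sums to 1
```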


Appendix D The Simulated Confidence Scorer

The simulated confidence scorer is based on a few key assumptions:

• There are A possible dialogue acts.
• The size of the output list is known and is denoted N_e.
• For each item in the list:
  – There is a constant probability of confusion, e_r, called the confusion rate.
  – If confused, then an act is equally likely to be confused into any other act, i.e. the probability is 1/(A − 1) for every other act. (Note that this is a gross assumption and is not how the confusion model operates.)

The confidence score is computed using Bayes' theorem on the number of times the act is generated. Suppose the true act was a. Let a denote the observed sequence of N_e acts, where a_1 is observed n_1 times, a_2 is observed n_2 times, ..., and a_m is observed n_m times. Then:

    P(a | a = a_k) = (1 − e_r)^{n_k} ( e_r/(A−1) )^{N_e − n_k},    (D.1)
    P(a | a ∉ {a_1, ..., a_m}) = ( e_r/(A−1) )^{N_e}.    (D.2)

The confidence score generator will assign confidence equal to P(a_k | a). Assuming a uniform prior probability for all actions gives:

    P(a ∉ {a_1, ..., a_m}) = (A − m)/A,    (D.3)
    P(a = a_k) = 1/A,  k ∈ 1 ... m.    (D.4)

Bayes' rule can then be used with the above equations to give P(a_k | a):

    P(a_k | a) = P(a | a_k) P(a_k) / P(a)    (D.5)


    = [ (1/A) (1 − e_r)^{n_k} ( e_r/(A−1) )^{N_e − n_k} ] / [ (1/A) ∑_{i=1}^{m} (1 − e_r)^{n_i} ( e_r/(A−1) )^{N_e − n_i} + ((A − m)/A) ( e_r/(A−1) )^{N_e} ].    (D.6)

In all simulations, the error rate for simulating confidence scores was set to 0.2, in order to force the simulator to behave as if it did not have knowledge of the environment.
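
The computation in Eq. D.6 is straightforward to implement. The sketch below is illustrative only (the function name and interface are invented) and returns the normalised confidence of each distinct act in a simulated N-best list, together with the mass assigned to acts outside the list:

```python
from collections import Counter

def confidence_scores(observed_acts, A, er=0.2):
    """Simulated confidence scores following Eq. D.6 (a sketch).

    observed_acts : simulated N-best list of length Ne, possibly with repeats.
    A             : number of possible dialogue acts.
    er            : confusion rate."""
    Ne = len(observed_acts)
    counts = Counter(observed_acts)         # n_k for each distinct act a_k
    m = len(counts)
    # Unnormalised posteriors P(list | a = a_k) P(a = a_k), Eqs. D.1 and D.4.
    post = {a: (1 - er) ** n * (er / (A - 1)) ** (Ne - n) / A
            for a, n in counts.items()}
    # Mass for acts never observed, Eqs. D.2 and D.3.
    p_out = (er / (A - 1)) ** Ne * (A - m) / A
    norm = sum(post.values()) + p_out
    return {a: p / norm for a, p in post.items()}, p_out / norm

scores, p_rest = confidence_scores(["inform", "inform", "request"], A=10)
print(scores, p_rest)
```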


Appendix E Matching the Dirichlet Distribution

To run the Expectation Propagation algorithm with a Dirichlet approximation, as in Chap. 7, one must find parameters α* (of dimension N_α) to minimize KL(p*||q), where:

    q(θ) ∝ ∏_{j=1}^{N_α} θ_j^{α*_j − 1},    (E.1)
    p*(θ) ∝ w_0 Dir(θ; α) + ∑_{j=1}^{N_α} w_j θ_j Dir(θ; α),    (E.2)
    Dir(θ; α) = ( Γ(∑_{j=1}^{N_α} α_j) / ∏_{j=1}^{N_α} Γ(α_j) ) ∏_{j=1}^{N_α} θ_j^{α_j − 1}.    (E.3)

Here, Dir(θ;α) denotes the Dirichlet distribution and Γ is the gamma function:

    Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt.    (E.4)

The gamma function has one useful property which will be used here, obtained by applying integration by parts to the definition:

    Γ(z) = (z − 1) Γ(z − 1).    (E.5)

Denoting by δ_j the zero vector with a 1 at position j, one can rewrite the components of p* as follows:

    w_j θ_j Dir(θ; α) ∝ w_j θ_j ( Γ(∑_i α_i) / ∏_i Γ(α_i) ) ∏_i θ_i^{α_i − 1}    (E.6)


    ∝ w_j ( Γ(∑_i α_i) / ∏_i Γ(α_i) ) θ_j^{α_j} ∏_{i ≠ j} θ_i^{α_i − 1}    (E.7)
    ∝ w_j ( Γ(∑_i α_i) / ∏_i Γ(α_i) ) ( Γ(α_j + 1) ∏_{i ≠ j} Γ(α_i) / Γ(∑_i α_i + 1) ) Dir(θ; α + δ_j)    (E.8)
    ∝ w_j ( Γ(∑_i α_i) / Γ(α_j) ) ( Γ(α_j + 1) / Γ(∑_i α_i + 1) ) Dir(θ; α + δ_j)    (E.9)
    ∝ w_j ( Γ(∑_i α_i) / Γ(α_j) ) ( α_j Γ(α_j) / (∑_i α_i) Γ(∑_i α_i) ) Dir(θ; α + δ_j)    (E.10)
    ∝ ( w_j α_j / ∑_i α_i ) Dir(θ; α + δ_j)    (E.11)

Hence p* can be written as a mixture of Dirichlet distributions:

    p*(θ) = w*_0 Dir(θ; α) + ∑_j w*_j Dir(θ; α + δ_j),    (E.12)

where,

    w*_0 ∝ w_0,    (E.13)
    w*_j ∝ w_j α_j / ∑_i α_i,    (E.14)
    ∑_{i=0}^{N_α} w*_i = 1.    (E.15)

Suppose then that q(θ) ∼ Dir(α*). The function that must be minimized is:

    KL(p*||q) = −∫_θ p*(θ) log ( q(θ) / p*(θ) ) dθ,    (E.16)
              = k_1 − ∫_θ p*(θ) log q(θ) dθ,    (E.17)
              = k_1 − w*_0 ∫_θ Dir(θ; α) log Dir(θ; α*) dθ    (E.18)
                − ∑_i w*_i ∫_θ Dir(θ; α + δ_i) log Dir(θ; α*) dθ,    (E.19)

where k_1 is an arbitrary constant. Now,


    ∂/∂α*_j ∫_θ Dir(θ; α) log Dir(θ; α*) dθ    (E.20)
    = ∂/∂α*_j ∫_θ Dir(θ; α) [ log Γ(∑_{i=1}^{N_α} α*_i) − ∑_{i=1}^{N_α} log Γ(α*_i) + ∑_{i=1}^{N_α} (α*_i − 1) log θ_i ] dθ    (E.21)
    = ∂/∂α*_j [ log Γ(∑_{i=1}^{N_α} α*_i) − ∑_{i=1}^{N_α} log Γ(α*_i) + ∑_{i=1}^{N_α} (α*_i − 1) E_{Dir(θ;α)} log θ_i ].    (E.22)

A well-known property of the Dirichlet distribution, Dir(θ;α), is that:

    E(log θ_j) = Ψ(α_j) − Ψ(∑_{i=1}^{N_α} α_i),

where Ψ is the digamma function,

    Ψ(z) = (d/dz) log Γ(z).

Hence:

    ∂/∂α*_j ∫_θ Dir(θ; α) log Dir(θ; α*) dθ = Ψ(∑_{i=1}^{N_α} α*_i) − Ψ(α*_j) − Ψ(∑_{i=1}^{N_α} α_i) + Ψ(α_j).    (E.23)

Setting the derivative of the full KL divergence (Eq. E.19) with respect to α*_j to 0 gives:

    0 = ∑_{i=0}^{N_α} w*_i ( Ψ(α*_j) − Ψ(∑_{k=1}^{N_α} α*_k) ) − w*_0 ( Ψ(α_j) − Ψ(∑_{i=1}^{N_α} α_i) ) − ∑_{i=1}^{N_α} w*_i ( Ψ(α_j + δ_{ij}) − Ψ(∑_{k=1}^{N_α} α_k + 1) ).

Using the fact that ∑_{i=0}^{N_α} w*_i = 1, one obtains for every j that:

    Ψ(α*_j) − Ψ(∑_{k=1}^{N_α} α*_k) = w*_0 ( Ψ(α_j) − Ψ(∑_{i=1}^{N_α} α_i) ) + ∑_{i=1}^{N_α} w*_i ( Ψ(α_j + δ_{ij}) − Ψ(∑_{k=1}^{N_α} α_k + 1) ).    (E.24)


The digamma function, Ψ(z), has a useful recurrence relation which will now be used to simplify the above equation:

    Ψ(z + 1) = (∂/∂z) log Γ(z + 1) = (∂/∂z) ( log Γ(z) + log z ) = Ψ(z) + 1/z.    (E.25)

After using this property, Eq. E.24 becomes:

    Ψ(α*_j) − Ψ(∑_{k=1}^{N_α} α*_k) = w*_0 ( Ψ(α_j) − Ψ(∑_{i=1}^{N_α} α_i) ) + ∑_{i=1}^{N_α} w*_i ( Ψ(α_j) + δ_{ij}/α_j − Ψ(∑_{k=1}^{N_α} α_k) − 1/∑_{k=1}^{N_α} α_k )
                                    = Ψ(α_j) − Ψ(∑_{k=1}^{N_α} α_k) + w*_j/α_j − (1 − w*_0)/∑_{k=1}^{N_α} α_k.

A suitable α* can be found to match this using standard techniques. Further details are given in Chap. 7, and also in Sect. 3.3.3 of Paquet (2007) and in Minka (2003).
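
As an illustration of those standard techniques, the sketch below applies the inverse-digamma fixed point of Minka (2003) to the final equation above. The code is not from the thesis and assumes SciPy for the digamma and trigamma functions:

```python
import numpy as np
from scipy.special import digamma, polygamma

def inv_digamma(y, iters=5):
    """Invert the digamma function by Newton's method (cf. Minka 2003)."""
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - digamma(1.0)))
    for _ in range(iters):
        x = x - (digamma(x) - y) / polygamma(1, x)
    return x

def match_dirichlet(alpha, w0, w, iters=50):
    """Moment-match a Dirichlet to the mixture p* of Appendix E (a sketch).

    Solves psi(a*_j) - psi(sum_k a*_k) = b_j, with b_j the right-hand
    side of the final equation above."""
    a_sum = alpha.sum()
    w_star = np.concatenate(([w0], w * alpha / a_sum))
    w_star /= w_star.sum()                     # Eqs. E.13-E.15
    b = (digamma(alpha) - digamma(a_sum)
         + w_star[1:] / alpha - (1 - w_star[0]) / a_sum)
    a_star = alpha.copy()                      # initialise at alpha
    for _ in range(iters):                     # fixed-point iteration
        a_star = inv_digamma(b + digamma(a_star.sum()))
    return a_star

alpha = np.array([2.0, 3.0, 1.0])
print(match_dirichlet(alpha, w0=0.5, w=np.array([0.2, 0.2, 0.1])))
```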


Appendix F Confidence Score Quality

Effective confidence scores are a key requirement for handling uncertainty in any spoken dialogue system. Partially observable systems are particularly susceptible to the effects of confidence score quality. When a confidence score is low, a partially observable model will place very little belief in the corresponding hypothesis. If the confidence scores are frequently low for correct hypotheses then a sophisticated model using the scores may be worse off than one which ignores them.

Unfortunately, the use of multiple hypotheses in spoken dialogue systems is a relatively new area of research. As a result, many speech recognisers give unsatisfactory confidence scores when asked to provide an N-best list of speech recognition hypotheses. Even if the speech recognition confidence scores are useful, it is not obvious how to join confidence scores from speech recognition hypotheses giving the same semantics.

This appendix will introduce various evaluation metrics for evaluating semantic confidence scores with multiple hypotheses. The use of these metrics allows one to ensure that the system is given confidence scores that are beneficial to performance and not harmful. The metrics in this appendix are based on similar ideas to the log-loss and Brier scores discussed in Bohus (2007). The appendix begins by introducing a few common confidence scoring techniques and then lays out a framework for evaluation metrics. A series of metrics for evaluating such annotation schemes are introduced and, in the process, two key characteristics are established as prerequisites for a suitable metric. Four new metrics are proposed which have both of these characteristics. The appendix concludes with an example evaluation of various confidence annotation schemes.

Note that throughout this appendix, the symbols used will have different meanings to the rest of the thesis. In particular a, B, h and u should not be confused with the system action, system beliefs, history and user action as defined before. In this appendix only, they will denote different concepts.


F.1 Generating Confidence Scores

In a spoken dialogue system, the list of dialogue acts is computed by the semantic decoder, and depends on the list of Speech To Text (STT) hypotheses. The confidence scores of these acts must clearly depend on the confidences of the STT hypotheses. Assigning confidences to speech recognition outputs is an active area of research and there are several approaches which can be used.

One technique for choosing speech recognition confidence scores is to first construct the confusion network from lattices output by the speech recogniser (Evermann and Woodland 2000). Each word arc in the confusion network has an associated log posterior, which is used in a dynamic programming search to construct an N-best list. The summation of these log posteriors is called the inference evidence and, after exponentiating and renormalising, this is used for the sentence-level score. Confidences on the semantics can then be calculated by summing the sentence-level scores for all sentences which are parsed as the same dialogue act. This approach will be denoted by InfEv.

An alternative approach, denoted by AvgWord, is to calculate the average of the word-level confidence scores in the speech recognition lattice path resulting in the given speech hypothesis. The appendix will also discuss the results from a third baseline approach, denoted Const, which simply assigns a constant confidence to all hypotheses. The reader should note that many other confidence annotation schemes have been developed and the above are used only as a demonstration of how the metrics developed can be used for comparison. A review of confidence annotation schemes is given in Jiang (2005).

Once confidence scores have been assigned to the STT hypotheses, suitable confidence scores must be found for the output of the semantic decoder. Two possible approaches are discussed here. One adds the confidences from STT hypotheses which result in the same semantics (Sum), while the other chooses the maximum STT confidence (Max). In both cases the resulting confidences are then rescaled so that the sum of all confidence scores is 1. In association with the three STT confidence scoring schemes, this results in five different overall schemes, presented below (the Const-Max scheme is not considered).

AvgWord-Sum: The average of all word-level confidence scores is used for each STT hypothesis, with confidences summed from hypotheses resulting in the same semantics.
AvgWord-Max: The average of all word-level confidence scores is used for each STT hypothesis, with the maximum confidence score used for hypotheses resulting in the same semantics.
Const-Sum: A count of the number of STT hypotheses resulting in the same semantics is used for each possible semantic hypothesis.
InfEv-Sum: Exponentiated inference evidence is used for each STT hypothesis, with confidences summed from hypotheses resulting in the same semantics.
InfEv-Max: Exponentiated inference evidence is used for each STT hypothesis, with the maximum confidence score used for hypotheses resulting in the same semantics.

Note that in all cases, the scores are renormalised to sum to 1. The probability that the hypothesis is not in the list is left for the dialogue manager to handle. In future work it will be necessary to estimate a probability for this as well.
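
A minimal sketch of the Sum and Max combination step is given below; the code is illustrative (the parse function stands in for the semantic decoder):

```python
from collections import defaultdict

def semantic_confidences(stt_hyps, parse, combine="sum"):
    """Combine STT confidence scores into semantic confidences (a sketch).

    stt_hyps : list of (sentence, score) pairs from the recogniser.
    parse    : function mapping a sentence to its dialogue act.
    combine  : 'sum' adds the scores of sentences with the same semantics,
               'max' keeps the largest score."""
    acc = defaultdict(float)
    for sentence, score in stt_hyps:
        act = parse(sentence)
        if combine == "sum":
            acc[act] += score
        else:
            acc[act] = max(acc[act], score)
    total = sum(acc.values())                 # renormalise to sum to 1
    return {act: s / total for act, s in acc.items()}

hyps = [("an expensive hotel", 0.6),
        ("an inexpensive hotel", 0.3),
        ("uh expensive hotel", 0.1)]
parse = lambda s: ("inform(pricerange=inexpensive)" if "inexpensive" in s
                   else "inform(pricerange=expensive)")
print(semantic_confidences(hyps, parse, combine="sum"))
```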

F.2 An Evaluation Framework

The characteristics of the different metrics that will be defined can be observed by evaluating the results of different speech recognition and semantic decoding configurations for a fixed data set. To this end, evaluations in this appendix use a corpus of 648 dialogues with semantic and speech recognition transcriptions. A detailed description of the trial used to produce this corpus is given in Chap. 6. All experiments are off-line and use the same corpus, which is called TownInfo. In order to test the effects of decreases in recognition performance, synthetic noise was added to the data offline. Three versions of the corpus were computed, corresponding to Signal to Noise Ratios (SNR) of 3.5 dB (high noise), 10.2 dB (medium noise) and 35.3 dB (no noise).

Evaluation of a semantic parser is similar to the evaluation of any other classifier with multiple outputs. In evaluating a speech recogniser, for example, one compares words with a reference transcription. In the case of a semantic parser either the dialogue acts as a whole or the semantic items are compared.

Central to the evaluation is the format used for dialogue acts. This typically depends on the task and there is no generally accepted standard. One approach, which will be used here, is the one introduced in Sect. 2.1.2. Dialogue acts are composed of a series of semantic items whose order is unimportant. These semantic items might represent attribute-value pairs or more abstract dialogue act types, which distinguish for example whether the utterance was requesting or giving information. An example utterance, along with reference dialogue acts and semantic items as well as act and item hypothesis lists, is given in Table F.1. A detailed description of the dialogue act formats used throughout the thesis is given in Appendix A.

The use of either exact matches of dialogue acts or partial matches given by counting the matching semantic items gives rise to two sets of metrics. Matches at a dialogue act level may be more appropriate if there are strong dependencies between semantic items, whereas item-level matching may give a better overall evaluation of the semantic parser. If the confidences are given only at an act level, they are converted to an item-level score by summing the confidences over acts containing the item.

When defining the item-level metrics it is simpler to consider the set of all semantic items rather than just those hypothesised. Semantic decoding then becomes a task of choosing whether the semantic item is correct for a given utterance. In practice,


Table F.1 Example utterance with the reference dialogue act (Ref. Act), reference semantic items (Ref. Items), example act hypothesis list (Hyp. Acts) and semantic item hypothesis list (Hyp. Items). Confidence scores are shown in the right-hand column.

Utterance:  I'd like um an expensive hotel please
Ref. Act:   inform(type=hotel, pricerange=expensive)
Ref. Items: inform, type=hotel, pricerange=expensive
Hyp. Acts:  inform(type=hotel, pricerange=expensive)    0.9
            inform(type=hotel, pricerange=inexpensive)  0.1
Hyp. Items: inform                                      1.0
            type=hotel                                  1.0
            pricerange=expensive                        0.9
            pricerange=inexpensive                      0.1

implementation may restrict calculations to the semantic items actually hypothesised or in the reference, but conceptually matches are compared by summing over all possibilities.

Most of the notation that will be used in definitions is common to all metrics. Starting with an item-based approach, let the number of utterances be U and let W denote the number of all available semantic items. Given u = 1...U and w = 1...W, let:

    c_{uw} = the confidence assigned to the hypothesis that the wth semantic item is part of utterance u, or 0 if none was assigned,
    δ_{uw} = 1 if the wth item is in the reference for u, and 0 otherwise,
    N_w = the total number of reference semantic items = ∑_{u,w} δ_{uw}.

In the example from Table F.1, the confidences c_{uw} are all zero except for those corresponding to the semantic items "inform", "type=hotel", "pricerange=expensive" and "pricerange=inexpensive", which are 1.0, 1.0, 0.9 and 0.1, respectively. In the case of metrics defined at an act level, a slight variation in notation is used. Let the number of hypothesised or reference acts be H and denote for h = 1...H:

    c_{uh} = the confidence assigned to the hth act being the correct parse for utterance u, or 0 if none was assigned,
    δ_{uh} = 1 if the hth act is the correct parse for u, and 0 otherwise.


F.3 Confidence Scoring Metrics

F.3.1 Weighted Confidence Scores

A simple possibility for evaluating confidence scores is to weight traditional metrics to take account of the confidence. Whatever error function is used is replaced with an expected value over the confidence scores. Similarly, the number of hypothesised items is replaced with an expected number.

One example of this approach is to convert the semantic error rate into a confidence-weighted form. For each act h hypothesised for utterance u, the items contained in the act are matched with the items contained in the reference, and the sum of the item substitutions, deletions and insertions is calculated and denoted e_{uh}. The confidence-weighted semantic error rate is then:

    WSER = (1/N_w) ∑_{u,h} c_{uh} e_{uh}.    (F.1)
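
As a sketch (not the thesis implementation), the metric can be computed from dictionaries holding the confidences c_uh and the per-act error counts e_uh:

```python
def wser(conf, errors, n_ref_items):
    """Confidence-weighted semantic error rate, Eq. F.1 (a sketch).

    conf, errors : dicts keyed by (utterance, hypothesis), holding the
                   confidence c_uh and item error count e_uh."""
    return sum(c * errors[k] for k, c in conf.items()) / n_ref_items

conf = {("u1", "h1"): 0.9, ("u1", "h2"): 0.1}
errors = {("u1", "h1"): 0, ("u1", "h2"): 1}   # subs + dels + ins per act
print(wser(conf, errors, n_ref_items=3))      # 0.1/3, approximately 0.033
```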

When using confidence-weighted metrics for evaluation, it soon becomes obvious that good confidence scores are not necessarily reflected in an improved score. As shown in Fig. F.1, confidence-weighted error rates actually increase with the number of hypotheses. This is counter-intuitive, since the larger list has more information and should perform better.

A theoretical explanation for this issue comes by examining the choices made by the confidence scorer. Suppose that the scorer has some beliefs B about the semantics of each utterance and aims to optimise the expected value of the metric under its beliefs. Under the error rate metric this corresponds to optimising:

Fig. F.1 Plot of confidence weighted semantic error (WSER) against size of the STT N-best list. Confidence scores are calculated using the InfEv-Sum scheme


    E( ∑_{u,h} c_{uh} e_{uh} | B ) = ∑_{u,h} c_{uh} E(e_{uh}|B).    (F.2)

Given the constraints ∑_h c_{uh} = 1 and c_{uh} ≥ 0, the optimum is achieved by setting c_{uh*} = 1 for the hypothesis h* with minimum expected error. Added hypotheses will always result in worse expected semantic error rates. This suggests severe deficiencies in the metric, as no credit is being given to the accuracy of the confidence scores. Confidence-weighted recall, precision or F-scores can also be defined but suffer from similar problems. Figure F.2 shows how the confidence-weighted recall and precision both degrade with large N-best lists.

Fig. F.2 Plot of confidence weighted recall (WRcl) and precision (WPrc) against size of the STT N-best list. Confidence scores are calculated using the InfEv-Sum scheme


F.3.2 NCE Scores, Oracle Rates and Other Metrics

One common metric for evaluating speech recognition confidences is the normalized cross entropy (NCE). This was the method used for several NIST evaluations and details of its application to other natural language processing tasks may be found in Gandrabur et al. (2006). An equation for the item-level form of NCE is:

    NCE = ( H_base + ∑_{u,w} log(δ_{uw} c_{uw} + (1 − δ_{uw})(1 − c_{uw})) ) / H_base,

where H_base = n_c log p_c + (N_h − n_c) log(1 − p_c), p_c = n_c/N_h, n_c is the number of correct semantic items from the list of hypotheses, and N_h is the number of hypothesised items (hypothesised items are those with c_{uw} > 0).

The reason for normalising by H_base is to adjust for the overall probability of correctness, to enable comparisons between data sets. H_base gives the entropy that would be obtained by simply using the constant probability, p_c. This normalisation term, however, depends on the number of hypothesised items. The score can be increased by simply adding more hypothesised items with very low probability. NCE is thus a suitable metric for evaluating the accuracy of probability estimates given a set of hypotheses, but it does not necessarily test the overall correctness of the output.

A useful measure of correctness is the oracle error rate, which measures the error rate that would be achieved if an oracle chose the best option from each hypothesised list of dialogue acts. This gives a lower bound on the error that could be achieved for a given list of hypotheses. Unfortunately, it is clearly not appropriate as an overall metric since confidence scores are ignored.

The problems with these two metrics can be observed in Table F.2, which shows the value of the metrics on various confidence scoring configurations. Comparing the results on high and low noise with large and small N-best lists (lines 1 and 2) shows that the system's oracle error rate degrades significantly in the presence of noise. This effect is completely disregarded by the NCE score, which suggests that the results in the first line are preferable. This is because it evaluates only the confidence scores

Table F.2 Comparison of normalised cross entropy (NCE) and oracle error rates (ORA) for speech understanding with multiple hypotheses

Noise (dB)  Conf. Calc.  N-Best  ORA   NCE
3.5         InfEv-Sum    100     25.8  0.212
35.3        InfEv-Sum    1       16.4  −0.556
35.3        Const-Sum    100     8.2   –
35.3        InfEv-Sum    100     8.2   –

ORA shows improvements as decreases in the metric value, while NCE shows improvements as increases.


and not the overall correctness. In fact, one can show that two systems which both give a constant confidence score equal to the overall average correctness will receive the same NCE score, regardless of what that average level of correctness is. The problem with the oracle error rate can also be seen in the table. Although lines 3 and 4 use completely different confidence scorers, they receive the same oracle error rate. The oracle error rate depends only on the acts in the N-best list and ignores all confidence scores.

Another commonly used tool for the joint evaluation of confidence scores and correctness is the receiver operating characteristic (ROC) curve (Gandrabur et al. 2006). One considers a classifier based on the confidence score which accepts or rejects hypotheses depending on a confidence threshold. The ROC curve then plots the number of correct rejections and acceptances. The problem with this approach is that only the first hypothesis and its confidence are ever evaluated.

F.3.3 Cross Entropy

The traditional metrics discussed above give a way to evaluate either the confidence scores or the overall correctness, but not both. An ideal metric should incorporate both factors, as well as giving a good indication of the effect on dialogue performance. This leads to the proposal of a new metric, based on the cross entropy between the probability density from the confidences and the optimal density given by delta functions at the correct values. This is very similar to the NCE metric, but does not normalise for the average probability of correctness. Both Item-level Cross Entropy (ICE) and Act-level Cross Entropy (ACE) metrics can be defined, as is done below.

    ICE = (1/N_w) ∑_{u,w} −log(δ_{uw} c_{uw} + (1 − δ_{uw})(1 − c_{uw})),    (F.3)
    ACE = (1/U) ∑_{u,h} −log(δ_{uh} c_{uh} + (1 − δ_{uh})(1 − c_{uh})).    (F.4)

Consider now the decisions that the confidence scorer makes to optimise the ICE metric, similar to the process in Sect. F.3.1. Assuming that the total number of reference items N_w is fixed, the scorer must aim to optimise the expected value of the metric:

    E(ICE|B) = −(1/N_w) ∑_{u,w} [ p_{uw} log(c_{uw}) + (1 − p_{uw}) log(1 − c_{uw}) ],    (F.5)

where p_{uw} = P(δ_{uw} = 1 | B). Setting the derivative with respect to c_{uw} equal to zero gives

    (p_{uw} − c_{uw}) / [ c_{uw} (1 − c_{uw}) ] = 0,    (F.6)


and so the minimum is achieved when c_{uw} = p_{uw}. When substituting this optimum into (F.5), the expected value of the metric is the average entropy of the beliefs B. The metric therefore penalises systems for bad confidence scores as well as giving credit for bolder predictions. The Act-level Cross Entropy is similarly optimised by choosing c_{uh} = p_{uh}. The difference is only that the ICE metric evaluates at an item level while ACE evaluates at the act level.

Implementations of these cross entropy metrics must take care to avoid numerical instabilities and undefined values. Whenever a confidence score of 0 is assigned to the correct hypothesis, the metric will be undefined because of the log 0 term which then appears. Very low confidence scores for the correct hypothesis will tend to give numerical instabilities for the same reason.

In practice this metric must therefore be adjusted so as to avoid this instability. Any computation of the log in Eqs. F.3 and F.4 is replaced with a computation which first makes sure that the value inside the log is above a threshold. In all experiments in this thesis the threshold was chosen to be 0.001.
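
Combining Eq. F.3 with the flooring just described gives a very short implementation. The sketch below is illustrative; ACE (Eq. F.4) is identical with acts in place of items and U as the divisor:

```python
import numpy as np

def ice(delta, conf, n_ref, floor=0.001):
    """Item-level cross entropy, Eq. F.3, floored to avoid log(0) (a sketch).

    delta : 0/1 reference indicators over (utterance, item) pairs.
    conf  : the matching confidence scores c_uw.
    n_ref : total number of reference items, N_w."""
    delta = np.asarray(delta, float)
    conf = np.asarray(conf, float)
    p = delta * conf + (1 - delta) * (1 - conf)
    return -np.log(np.maximum(p, floor)).sum() / n_ref

delta = [1, 0, 1, 0]
conf = [0.9, 0.1, 0.0, 0.3]     # a zero on a correct item gets floored
print(ice(delta, conf, n_ref=2))
```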

Fig. F.3 Plots of item cross entropy (ICE) and act cross entropy (ACE) against size of the STT N-best list. Confidence scores are calculated using the InfEv-Sum scheme


Fig. F.4 Plot of the item L2 norm (IL2) against the size of the STT N-best list. Confidence scores are calculated using the InfEv-Sum scheme

Figure F.3 shows how the ICE and ACE metrics do give the desired performance on real data. Both metrics degrade with added noise and improve when the number of STT hypotheses increases.

F.3.4 L2 Norms

Metrics based on cross-entropy are not the only option for evaluating both the overall correctness and the validity of confidence scores simultaneously. An alternative is to use an l2 norm between the hypothesised and reference semantics. Item-level l2 (IL2) and act-level l2 (AL2) metrics would be given by:

    IL2 = (1/N_w) ∑_{u,w} (δ_{uw} − c_{uw})^2,    (F.7)
    AL2 = (1/U) ∑_{u,h} (δ_{uh} − c_{uh})^2.    (F.8)

These metrics are also optimised by choosing confidence scores equal to the posterior probability of the item or act. To optimise the AL2 metric, for example, the scorer must minimise the expected value

    E(AL2|B) = (1/U) ∑_{u,h} ( p_{uh} − 2 p_{uh} c_{uh} + c_{uh}^2 ),    (F.9)
             = (1/U) ∑_{u,h} [ (p_{uh} − c_{uh})^2 + p_{uh}(1 − p_{uh}) ].    (F.10)


The expected value is optimised by setting c_{uh} = p_{uh}, and produces an expected metric of

    (1/U) ∑_{u,h} p_{uh}(1 − p_{uh}).    (F.11)

This value again gives a measure of the overall correctness of the system. The system will only achieve the minimum score of 0 when it is both always correct and always assigns the correct hypothesis a confidence score of 1.
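
The l2 metrics are equally short to implement; this illustrative sketch mirrors the ICE code given earlier:

```python
import numpy as np

def il2(delta, conf, n_ref):
    """Item-level l2 metric, Eq. F.7 (a sketch). AL2 (Eq. F.8) is the same
    sum taken over acts and divided by the number of utterances U."""
    delta = np.asarray(delta, float)
    conf = np.asarray(conf, float)
    return ((delta - conf) ** 2).sum() / n_ref

print(il2([1, 0, 1, 0], [0.9, 0.1, 0.0, 0.3], n_ref=2))
```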

Similar to the cross-entropy based metrics, the advantages of the two L2-based metrics can be seen in experiments. Figure F.4 shows the IL2 metric values for different sizes of N-best list and noise level, using the TownInfo corpus. As expected, increased noise degrades performance while performance improves with the size of the STT N-best list. Changes in this metric are, however, relatively small, and for this reason the cross-entropy based metrics will be preferred here.

F.4 Comparison of Confidence Scorers

The purpose of developing evaluation metrics for semantic hypotheses is to compare different confidence scoring techniques and to ensure that the extra hypotheses do give improved performance. This section gives an example comparison using the metrics to decide on a confidence scoring method. Several simple confidence scoring techniques are evaluated. Each STT hypothesis is assigned a sentence-level confidence score which is normalised so that the sum over the N-best list is 1, and the sentence is then passed to the semantic parser. The parser determines the most likely dialogue act for each sentence, then groups together sentences which produce the same dialogue act. All five methods discussed in Sect. F.1 are evaluated.

The results of the comparison are given in Table F.3. The metrics are computed on the TownInfo corpus, using no added noise (35.3 dB) and different size N-best lists. It is clear from all metrics that the use of inference evidence outperforms the alternatives tested here. The average word-level confidence score appears to be reasonably constant, with little difference to using a constant value for the STT score. The difference between using a maximum and summing the inference evidence scores is minimal, with slightly improved performance achieved when using a maximum.

The evaluated system of Chap. 6 uses the InfEv-Sum scheme for computing confidence scores in an online system. The results of Table F.3 show that this approach should give reasonable estimates of the probabilities of the available semantic hypotheses. This in turn should enable improved performance of the spoken dialogue system.


Table F.3 Comparison of different confidence scoring methods using the ICE, ACE, NCE and TAcc metrics

ICE metric
Conf. Calc.    N=1     N=5     N=10    N=50    N=100
AvgWord-Max    1.283   1.150   1.106   1.096   1.129
AvgWord-Sum    1.283   1.114   1.049   1.000   1.002
Const-Sum      1.283   1.116   1.051   0.999   1.004
InfEv-Sum      1.283   1.079   1.017   0.956   0.951
InfEv-Max      1.283   1.077   1.004   0.931   0.923

ACE metric
Conf. Calc.    N=1     N=5     N=10    N=50    N=100
AvgWord-Max    2.071   1.762   1.709   1.836   1.938
AvgWord-Sum    2.071   1.621   1.516   1.502   1.537
Const-Sum      2.071   1.618   1.512   1.501   1.537
InfEv-Sum      2.071   1.550   1.447   1.437   1.460
InfEv-Max      2.071   1.570   1.444   1.406   1.428

NCE metric
Conf. Calc.    N=1     N=5     N=10    N=50    N=100
AvgWord-Sum    −0.556  0.159   0.266   0.395   0.422
AvgWord-Max    −0.556  0.048   0.150   0.234   0.240
Const-Sum      −0.556  0.142   0.257   0.392   0.420
InfEv-Sum      −0.556  0.205   0.323   0.459   0.495
InfEv-Max      −0.556  0.185   0.320   0.464   0.497

TAcc metric
Conf. Calc.    N=1     N=5     N=10    N=50    N=100
AvgWord-Sum    83.4    81.4    80.9    79.7    78.5
AvgWord-Max    83.4    75.3    73.0    66.6    64.2
Const-Sum      83.4    81.7    81.6    79.7    78.5
InfEv-Sum      83.4    83.4    83.2    83.0    82.9
InfEv-Max      83.4    83.5    83.4    83.4    83.4

The TAcc metric gives the accuracy (1 − semantic error) of the most likely hypothesis. The ACE and ICE metrics show improvements as decreases in the metric value, while the NCE and TAcc metrics show improvements as increases.

References

Bohus D (2007) Error awareness and recovery in conversational spoken language interfaces. PhD thesis, Carnegie Mellon University, Pittsburgh

Evermann G, Woodland PC (2000) Large vocabulary decoding and confidence estimation using word posterior probabilities. In: Proceedings of ICASSP, 2000

Gandrabur S, Foster G, Lapalme G (2006) Confidence estimation for NLP applications. ACM Trans Speech Lang Process 3(3):1–29

Jiang H (2005) Confidence measures for speech recognition: a survey. Speech Commun 45(4):455–470


Minka T (2003) Estimating a Dirichlet distribution. Technical report, MIT. http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/

Paquet U (2007) Bayesian inference for latent variable models. PhD thesis, University of Cambridge
Schatzmann J (2008) Statistical user modeling for dialogue systems. PhD thesis, University of Cambridge
Traum DR (2000) 20 questions on dialogue act taxonomies. J Semant 17(1):7–30


Author Biography

Blaise Thomson is a Research Fellow at St John's College in the University of Cambridge. He obtained a Bachelors degree in Pure Mathematics, Computer Science, Statistics and Actuarial Science at the University of Cape Town, South Africa, before completing an MPhil at the University of Cambridge in 2006 and a PhD in Statistical Dialogue Modelling in 2010. He has published around 35 peer-reviewed journal and conference papers, focusing largely on the topics of dialogue management, automatic speech recognition, speech synthesis, natural language understanding and collaborative filtering. In 2008 he was awarded the IEEE Student Spoken Language Processing award for his paper at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), and in 2010 he co-authored best papers at both the IEEE Spoken Language Technologies workshop and Interspeech. He was co-chair of the 2009 ACL Student Research Workshop and co-presented a tutorial on POMDP dialogue management at Interspeech 2009.

In his spare time, he enjoys playing guitar and dancing, and represented England at the 2010, 2011 and 2012 world formation Latin championships.


Index

A
Agenda-based dialogue managers, 14
ATK, 8

B
Basis functions, 61, 62
Bayes' theorem, 15
Bayesian methods, 2
Bayesian network, 28
Belief state, 10
Belief state transition function, 10

C
Call-flow, 13
Cavity distribution, 38
Concepts, 30
Confusion rate, 68

D
Dialogue act, 9
Dialogue act items, 9
Dialogue act tag, 9
Dialogue act type, 9
Dialogue cycle, 7, 11
Dialogue history, 10
Dialogue manager, 10, 13
Dialogue policy, 10, 57
Dialogue state, 10
Dirichlet, 89
Divergence measure, 46
Dynamic Bayesian networks (DBNs), 29

E
Eliza, 1
Error simulator, 12
Expectation maximization, 83
Expectation propagation, 46, 83

F
Factor graph, 34, 45
Features, 61
Form-filling dialogue manager, 14
Frame-based dialogue manager, 14
Function approximation, 61

G
Grid-based features, 63
Grouped LBP, 51

H
Hand-crafted, 13
Hidden information state, 40
Hidden Markov model, 8
History nodes, 31

I
Information-seeking dialogues, 14
Information state model, 14

L
Limited-domain dialogue systems, 1
Logic programming, 14
Loopy belief propagation (LBP), 36

M
Marginal distributions, 35
Markov assumption, 15, 27, 41
Markov decision process (MDP), 17
Master actions, 20
Master space, 20
Master states, 20
Mostly constant factors, 51

N
Natural actor critic, 64
Natural gradient, 65
Natural language generator, 11
N-best list, 8

O
Observation, 10

P
Partially observable Markov decision process (POMDP), 17
Partitions, 47
Plan-based dialogue managers, 14
Policy learning, 16

Q
Q-function, 58

R
Reinforcement learning, 16

S
Semantic decoding, 9
Slots, 30
Speech act, 9
Speech recognition engine, 7
Spoken dialogue systems, 1
Stationary distribution, 15
System act, 10
System action, 10
System state, 10

T
Text-to-speech (TTS) engine, 11
Time-slices, 29, 41
TownInfo, 33, 68, 84
Turing test, 1
Turn, 7

U
Unit selection, 11
User simulator, 67

V
Validity node, 30

W
Word-lattice, 8