
Extending Hidden Markov Models to Set-Valued Observations: A Comparison of Different Approaches on Sequences of Facial Expressions

Master's Thesis

in the study programme Computing in the Humanities,
Faculty of Wirtschaftsinformatik und Angewandte Informatik,
Otto-Friedrich-Universität Bamberg

Author: Mark Gromowski (matriculation number: 1576483)

Reviewer: Prof. Dr. Ute Schmid

The present thesis describes the implementation of a classification system based on the concept of Hidden Markov Models. Combinations of parallel observations that occur in the course of an observation sequence are handled either by representing them as separate symbols in the alphabet of possible observations or by using an updated HMM concept processing them as sets of observations. The classification system, which is realized as an extension of the data mining software RapidMiner, is evaluated by applying it to sequences of Action Units representing facial expressions. The system achieves modest results in classifying facial expressions decoding pain and other sentiments. Generalising or completely removing AUs that are irrelevant for the classification of pain leads to a higher recall for the class pain but can result in a decrease of the overall performance. Processes applying the updated HMM concept require more computation time but fewer states and iterations of the training process in order to achieve their optimal results.

Contents

1. Introduction

2. Theoretical background
   2.1. Analysing facial expressions of pain
        2.1.1. The facial expression of pain as a means of communication
        2.1.2. Facial expression analysis
        2.1.3. The Facial Action Coding System
        2.1.4. Previous approaches of pain classification using FACS
   2.2. Classification via Hidden Markov Models
        2.2.1. The principles of Machine Learning
        2.2.2. The structure of Hidden Markov Models
        2.2.3. Probability calculation: the Forward and the Backward Algorithm
        2.2.4. Learning: the Baum-Welch Algorithm
        2.2.5. HMMs as classifiers

3. The implementation of a HMM-based classification system including the processing of parallel observations
   3.1. Realizing HMM-based classification in RapidMiner
        3.1.1. Requirements
        3.1.2. RapidMiner
        3.1.3. The jahmm library
        3.1.4. Implementation
   3.2. Approaches of handling parallel observations
        3.2.1. Extending the alphabet of possible observations
        3.2.2. Processing sets of observations

4. Evaluation of the HMM-based classification system
   4.1. Test setup
        4.1.1. The data basis and its representations
        4.1.2. Process buildup
   4.2. Results
        4.2.1. Evaluating the basic processes
        4.2.2. Further variations of parameter values

5. Conclusion

A. Alphabets

B. Additional evaluation results

C. Media content

List of Tables

4.1. Distribution of classes within the different representations of the data basis
4.2. Overview of the RapidMiner processes used for evaluation
4.3. Alphabet size, possible combinations and length of the longest sequence
4.4. Example of confidence values achieved by two different classifiers
4.5. Selected performance values for the local random seed 1992
4.6. Optimal parameter values for the local random seed 1992
4.7. Processing time of the evaluation processes
4.8. Selected results for the extended processes

B.1. Selected performance values for the local random seed 1225
B.2. Optimal parameter values for the local random seed 1225
B.3. Selected performance values for the local random seed 1564
B.4. Optimal parameter values for the local random seed 1564

List of Figures

2.1. The process of automated facial expression analysis [FL03, p. 262]
2.2. The Forward Algorithm [Fin14, p. 81]
2.3. The Backward Algorithm [Fin14, p. 91]

A.1. Alphabets of possible observations for the different evaluation approaches

1. Introduction

The development of artificial intelligence is one of the major issues of this day and age, covering a wide range of research and application fields. One core task is the establishment of a natural social interaction between humans and computers, which not only includes verbal communication but also non-verbal communicative behaviour like the expression and perception of emotions: the research area of Affective Computing “deals with the design of systems and devices which can recognize, interpret, and process emotions” [MKR09, p. 60].

The present paper will focus on the recognizing part of this approach and its application to a special kind of “emotion”, namely pain: the International Association for the Study of Pain (IASP) defines pain as an “unpleasant sensory and emotional experience associated with actual or potential tissue damage, or described in terms of such damage” [IAS86, p. 217]. The definition emphasizes the subjective character of pain, as the understanding and the usage of the term are strongly dependent on a person’s individual experience. While it is a sensation associated with concrete parts of the body, it can still be considered an “emotional experience” [IAS86]. Therefore, although it may not be possible to definitely categorize pain as an emotion in the narrow sense, it will be treated as an emotion-like state in the course of the present thesis, at least in its capacity as a certain sentiment that can be expressed and recognized in a non-verbal way.¹

Teaching machines to recognise pain by applying techniques of machine learning is a responsible and challenging task, as the experience of pain is a critical situation, often indicating an immediate threat to a person’s well-being or even physical existence and therefore making it especially important to detect pain as quickly as possible and with a very high degree of certainty.

An exemplary application field where the recognition of non-verbal signals of pain is important is hospital treatment: patients suffering from certain diseases might not be able to express their current sentiments via language, while non-verbal basic reactions are still intact. In this scenario, the application of machine learning techniques could support the medical staff, for example by creating a system that is able to detect non-verbal communication signals from a person’s facial reactions [SSS13].

The process of automatically detecting emotional information expressed by a person non-verbally includes the usage of passive sensors collecting data about the person’s condition and behaviour, and methods for extracting adequate “emotional cues” from the collected data. Possible relevant information might be “facial expressions, body posture and gestures, [...] skin temperature and galvanic resistance” [MKR09, p. 61]. The present thesis will focus on facial expressions as one of the most frequently considered sources of emotional information expressed non-verbally. The concept of facial expression analysis, which realizes the effective recognition and interpretation of facial expressions encoding information about a person’s emotional condition, will be introduced in detail in the course of the present paper. In this context, special emphasis will be put on the Facial Action Coding System (FACS), a system developed for modelling facial expressions by describing them as a number of visible movements observed in the face, called Action Units (AUs).

¹ Deyo et al. state that “it is not implausible to expect that aspects of pain expression and the ability to perceive its expression would be regulated in ways that resemble the regulation of comparable emotional processes” [DPM04, p. 20].

The project of the present thesis is to apply a concrete machine learning concept in order to learn a model that is able to recognize sentiments of pain decoded in facial expressions. The effectiveness of the model will be evaluated by applying it to a given set of concrete facial expressions that are already decoded into groups of AUs via FACS. Therefore, the major task of the model is the correct classification of these examples by correctly predicting the sentiments they are decoding.

It would be possible to simply represent facial expressions as sets of AUs without any information concerning the succession of their occurrence, but under the assumption that the order in which AUs appear could be a relevant factor for the correct interpretation of a given facial expression, this might result in an undesirable loss of information. The assumption that sequential information in a facial expression shown over a period of time is important for its interpretation is supported by the theory of Scherer, who points out the relevance of the chronological aspect of emotion: “In order to deal with the dynamic nature of emotional behaviour, we have to conceptualize emotion as a process rather than a steady state” [Sch82, p. 555]. In consideration of this idea, the decoded facial expressions will be represented as sequences of AUs ordered by the time of their appearance instead of sets without any sequential information. Consequently, the applied machine learning technique shall be able to process sequential data. For this purpose, Hidden Markov Models (HMMs) are chosen.

HMMs interpret sequences of observations by calculating the probability with which a certain HMM would produce a given observation sequence. The concept of HMMs can be applied to classification tasks by creating one HMM for each possible class and classifying each example according to the HMM achieving the highest probability for the example.
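This classification scheme can be sketched in a few lines: the forward algorithm computes the likelihood of a sequence under each class model, and the class whose model scores highest wins. All model parameters and the two-symbol toy alphabet below are invented for illustration; they are not the models trained in this thesis.

```python
# Sketch of HMM-based classification: one HMM per class, argmax likelihood.
# All parameters here are illustrative, not the trained models of the thesis.

def forward_likelihood(pi, A, B, obs):
    """P(obs | model) via the forward algorithm.
    pi: initial state probabilities, A[i][j]: transition probabilities,
    B[i][o]: probability of emitting symbol o in state i."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

def classify(models, obs):
    """Return the class whose HMM assigns the highest probability."""
    return max(models, key=lambda c: forward_likelihood(*models[c], obs))

# Two toy 2-state models over the symbols 0 and 1 (pi, A, B per class):
models = {
    "pain":    ([0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.9, 0.1], [0.2, 0.8]]),
    "neutral": ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.1, 0.9], [0.3, 0.7]]),
}
print(classify(models, [0, 0, 1, 0]))  # a sequence dominated by symbol 0
```

In practice the products of many small probabilities underflow, so real implementations (such as the scaled variants discussed in chapter 2.2.3) work with normalized or logarithmic values; the plain form above is kept only for readability.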

Trying to apply the concept of HMMs to sequences of AUs representing facial expressions results in a problem: as the AUs are ordered by the (onset) time of their appearance, two or more AUs appearing at the exact same point in time have to be represented as a combination of parallel observations. However, HMMs are only able to process observations one at a time in their basic implementation. This problem has to be solved before the performance of a HMM-based model in classifying facial expressions of pain can be measured. The present thesis will pursue two major objectives described in the following:

1. The concept of Hidden Markov Models has to be extended in order to enable HMMs to process observation sequences containing parallel observations. Two approaches will be considered:


• A simple approach would be to combine observations occurring in parallel into a new observation symbol representing a self-contained entity independent of its original components. This approach does not change the algorithms of the HMM concept but rather the representation of the example set that is processed.

• An alternative approach will redefine the structure of HMMs in order to enable them to process sets of single observations. Each underlying probability distribution over all possible observation values is replaced by a set of separate probability distributions, each being associated with only one observation value. This way, arbitrary combinations of parallel observations can be calculated dynamically based on their components.

2. In order to measure the effectiveness of HMMs as a classifier for emotional facial expressions, especially expressions of pain, a HMM-based classification model will be designed and applied in a classification task containing facial expressions decoded via FACS. The two different approaches for processing parallel observations will both be tested and compared to each other. Additionally, different representations of the available AU sequences will be tested in order to measure their impact on the learned model’s performance:

• The standard representation contains all AUs that were observed in the process of decoding the given facial expressions, without any alterations.

• The first alternative representation will concentrate on AUs which were identified as relevant for the classification of pain in the literature concerned with facial expression analysis: while these AUs remain unchanged in the example sequences, all other AUs are replaced by a single surrogate AU generalising them in one symbol.

• The second alternative representation follows the same pattern as the first, but this time all pain-irrelevant AUs are removed completely from the examples.
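The two approaches named under the first objective can be contrasted on a toy sequence of parallel AU observations. The AU names and probabilities below are invented, and the factorized set probability in the second half is one plausible reading of "one separate probability distribution per observation value"; the exact formulation used in the thesis may differ.

```python
# Two illustrative ways of handling parallel observations, sketched on a toy
# sequence of AU sets. AU names and probability values are made up.

sequence = [{"AU4"}, {"AU6", "AU7"}, {"AU4"}]

# Approach 1: extend the alphabet -- every combination becomes one new
# symbol, treated as an entity unrelated to its components.
joint_symbols = ["+".join(sorted(obs)) for obs in sequence]
print(joint_symbols)  # ['AU4', 'AU6+AU7', 'AU4']

# Approach 2: one separate probability per observation value. Each AU gets
# an independent per-state occurrence probability, and the probability of a
# whole set is computed dynamically from its components.
def set_probability(p_occur, obs_set, alphabet):
    """p_occur[au]: probability that au is active in the current state."""
    prob = 1.0
    for au in alphabet:
        prob *= p_occur[au] if au in obs_set else (1.0 - p_occur[au])
    return prob

alphabet = ["AU4", "AU6", "AU7"]
p_occur = {"AU4": 0.8, "AU6": 0.5, "AU7": 0.4}  # invented values
print(set_probability(p_occur, {"AU6", "AU7"}, alphabet))
```

The contrast is visible in the alphabet size: approach 1 needs a symbol for every combination that occurs, while approach 2 keeps one distribution per single AU and covers unseen combinations for free.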

The remainder of this work will be structured as follows: the second chapter will introduce the research area of facial expression analysis, especially emphasizing the automation of facial expression analysis through machine learning applications, the concept of the Facial Action Coding System and its application in several research projects concerned with the recognition of facial expressions of pain. Additionally, it will outline the general idea of machine learning and give a detailed overview of the concept of Hidden Markov Models. The third chapter will present the implementation of a HMM-based classification system realized in the data mining software RapidMiner; it will also describe the different approaches of processing parallel observations and their integration into the aforementioned implementation. The fourth chapter will report the results of the evaluation of the given implementation, which was tested in a classification task featuring facial expressions represented as AU sequences. The conclusion will finally sum up the findings of the present thesis and give an outlook for possible applications and future research projects.


2. Theoretical background

This chapter will introduce all the important concepts and techniques that are relevant in the context of the present paper. The first part will focus on the analysis of emotional facial expressions, especially those of pain, comprising general assumptions about the facial expression of pain, an introduction to the research area of facial expression analysis, the presentation of the Facial Action Coding System and an overview of several studies trying to detect pain from facial expressions via FACS. The second part will introduce the machine learning concept of Hidden Markov Models, including the general idea of machine learning and the important components and algorithms of the HMM approach.

2.1. Analysing facial expressions of pain

2.1.1. The facial expression of pain as a means of communication

It is not always easy to recognise whether a person is in pain. The verbal communication of pain seems to be a sufficient means of pain assessment in many situations, but as a matter of fact, self-report measures of pain can be subject to various restrictions: to begin with, they are “idiosyncratic, depending as they do on preconditions and past experience” as well as “susceptible to suggestion, impression management, and deception” [Coh10]. Additionally, they cannot be applied in cases of “young children, individuals with certain types of neurological impairment, many patients in postoperative care or transient states of consciousness, and those with severe disorders requiring assisted breath” [Coh10]. One example is people who suffer from dementia and therefore have difficulties actively reporting ratings of the pain they are experiencing; in this case it is important not to underestimate the amount of pain they are feeling [KSH+07].

Finding an alternative method of pain assessment is a difficult task: although it can be assumed that most people have experienced pain in certain situations and therefore have a general idea of this common sentiment, it is anything but trivial to recognize and measure the pain someone else is experiencing: “Sensations, feelings, and thoughts may be salient for the sufferer, but they are incomprehensible to others without observable manifestations” [CPE01, p. 153]. Fortunately, these observable manifestations of pain can take many different forms going far beyond the direct verbalization: non-verbal expressions of pain could be paralinguistic vocalizations (e.g. crying, moaning), non-verbal qualities of speech (e.g. volume, hesitance, timbre), physiological activity (e.g. pallor, flushing, sweating, muscle tension), bodily activity (reflexes or purposeful action) and finally facial expressions. While some of these examples serve concrete causes, like avoiding or terminating injury, others, especially facial expressions, serve primarily the cause of communication [CPE01].


“The facial expression of pain is considered to be one of the most prominent non-verbal pain behaviors because of its reflexive nature, its salience and because it can be distinctively differentiated from other affective states” [KSH+07, p. 221]. Therefore, the interpretation of a person’s facial expression seems to be a promising alternative for assessing sentiments of pain in cases where the verbal communication of pain cannot be applied properly or is not trusted by an observer.

The basic premise of this approach is the idea that it is possible to systematically identify the sentiment of pain only by analysing the expressions on a person’s face. Deyo et al. [DPM04] discovered that the ability to detect pain in the facial expressions of others increases with age. While younger children are already able to determine the presence of pain from certain facial features, older children show an even higher sensitivity towards the occurrence and intensity of pain visible in a person’s face. It is assumed that “the older children may employ a more effective algorithm for weighing the array of information available in the facial displays” [DPM04, p. 19]. It is this algorithm that will have to be implemented in order to enable machines to learn the detection of pain from a person’s facial expressions as well.

An important question in this context is whether the facial expression of pain is sufficiently universal across different circumstances to enable machines to identify pain from facial expressions with the necessary certainty. “The notion of Universality implies the existence of a pain signal that is consistent across stimulus conditions and cultures” [Prk92, p. 298].

A study by Ekman & Friesen [EF71] investigated whether facial expressions of certain emotions are universal across cultures or rather culture-specific. They found that even members of preliterate cultures with only little contact to people or media from literate cultures associated facial expressions with the same emotions as members of any other culture. In general, factors like age, gender or culture do not seem to influence the capability of associating certain emotions with given facial expressions in a universal way. Although the facial expression of pain was not part of this study, it will be assumed that these findings can be generalized to a variety of other emotional states and therefore also applied to the facial expression of pain in the context of the present paper.

Prkachin [Prk92] examined another aspect of the universality of the facial expression of pain by comparing different pain stimulus conditions. These included the induction of pain via electric shock, cold, pressure and ischemia. By analysing the different resulting facial expressions, a general facial expression of pain was extracted, described as a number of Action Units featured in the Facial Action Coding System (cf. chapter 2.1.3) which appear across all different kinds of pain stimuli. The usage of such a set of pain-relevant Action Units could also reduce the effort of decoding facial expressions via FACS with respect to pain; this idea will be taken into account in the evaluation part of the present thesis (cf. chapter 4).
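Restricting attention to a pain-relevant AU subset amounts to a simple preprocessing step over the decoded sequences, in the two variants evaluated later in the thesis: generalising irrelevant AUs into one surrogate symbol, or removing them entirely. The AU subset and the surrogate name below are placeholders for illustration, not Prkachin's actual set.

```python
# Sketch of the two reduced sequence representations: generalising
# pain-irrelevant AUs into one surrogate symbol, or dropping them entirely.
# The AU subset below is an illustrative placeholder, not Prkachin's set.

PAIN_RELEVANT = {"AU4", "AU6", "AU7", "AU9", "AU10", "AU43"}

def generalise(seq, relevant=PAIN_RELEVANT, surrogate="AUx"):
    """Replace every pain-irrelevant AU by a single surrogate symbol."""
    return [au if au in relevant else surrogate for au in seq]

def remove_irrelevant(seq, relevant=PAIN_RELEVANT):
    """Drop pain-irrelevant AUs from the sequence completely."""
    return [au for au in seq if au in relevant]

seq = ["AU4", "AU12", "AU6", "AU25"]
print(generalise(seq))         # ['AU4', 'AUx', 'AU6', 'AUx']
print(remove_irrelevant(seq))  # ['AU4', 'AU6']
```

Note the trade-off the two variants encode: generalising preserves the timing of irrelevant events (only their identity is lost), while removal also shortens the sequences.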

As the human face seems to carry a lot of important information about the emotional state of a person, including the presence of pain, and as its expressions seem to be universal enough across different cultures and pain stimuli to draw reliable conclusions from interpreting them, the analysis of facial expressions is the method of choice in many attempts of pain assessment, especially in contexts where the verbal communication of pain is not possible for various reasons. The following chapter will introduce the general idea of facial expression analysis and different approaches to its realisation, with emphasis on automating the whole process through the use of machines.

2.1.2. Facial expression analysis

The idea of analysing facial expressions reaches from first considerations formulated by Darwin up to modern approaches comprising the fully automated processing of facial expressions. With the advance of correlated technologies like face detection, face tracking and face recognition, and the decrease of the costs of computational power, the research field of facial expression analysis enjoys ongoing interest and is continuously extended by new approaches [FL03].

Facial expressions are the result of contractions of facial muscles which lead to a deformation of certain facial features like eyelids, eyebrows, nose, lips or skin. They are described by their location, intensity and dynamics. The duration of facial expressions is determined by three temporal parameters: “onset (attack), apex (sustain), offset (relaxation)” [FL03, p. 260]. As the facial anatomy and the concrete form of facial expressions differ from one person to another, it is quite difficult to reliably recognise and interpret facial expressions without the neutral face as a benchmark; additionally, it is harder to analyse spontaneous than posed facial expressions, as the latter are usually exaggerated [FL03].

In general, facial expression analysis can be divided into two methodological approaches [FL03]: judgment-based approaches try to directly map facial expressions to a certain number of emotions or mental activities agreed on by a group of experts. Sign-based approaches take an intermediate step by abstracting facial actions and describing them by their location and intensity, without directly assigning a certain sentiment to them in an act of interpretation; therefore they are purely appearance-based. For sign-based approaches, the Facial Action Coding System (cf. chapter 2.1.3) is the most popular tool for describing facial actions. The interpretation is carried out by using certain dictionaries after the description of the facial expressions is completed. The differentiation between recognition and interpretation of facial expressions makes it possible to measure their performance separately; it increases objectivity and is therefore a key advantage of sign-based approaches.

In the research field of facial expression analysis, it is a major concern to enable machines to perform facial expression analysis automatically. Automatic facial expression classification can be divided into two stages: feature extraction and feature classification. The first step is especially important, as the second step cannot perform well if inadequate features are provided. The feature extraction can be performed either by geometric feature-based methods or appearance-based methods. Geometric feature-based methods extract components or feature points in order to form a vector representing a face’s geometry, given as shape and location of certain facial components such as mouth, eyes, eyebrows and nose. Appearance-based methods detect changes in the appearance of the face or certain regions of the face, i.e. the skin texture being altered by wrinkles and furrows. In general, choosing too many features will result in an unnecessarily complex classification process, while choosing too few features might lead to a decrease of the performance of a classifier [LH12].

Figure 2.1.: The process of automated facial expression analysis [FL03, p. 262]

Automatic facial expression analysis is especially difficult due to the individual physiognomy of every single face and additional circumstantial factors; the facial appearance can be influenced by factors like “age, ethnicity, gender, facial hair, cosmetic products and occluding objects” [FL03]. Uncertainties may arise as well from factors like pose and lighting changes.

Taking into account all these challenges, automatic facial expression analysis has to be conceptualized as an elaborate process consisting of several steps described by Fasel & Luettin [FL03] and represented in figure 2.1:

Face acquisition The first step is the identification of faces in a complex scene via an automatic face detector. Depending on the method being used, the position of the face has to be determined more or less exactly; for example, Active Appearance Models perform well knowing only the rough position of the face. Some approaches even allow real-time tracking of faces. Face normalization is a helpful step to put factors like pose and illumination into perspective, but it is not ultimately necessary as long as the extracted feature parameters are normalized before classification.

Feature extraction and representation The extraction of features can be realized in several ways, depending on which factors should be focussed on in the process:

• Holistic approaches process the face as a whole, while local approaches focus on certain facial features or areas for recognizing facial action. Facial features can be divided into intransient features, like the eyes or the mouth, and transient features, like wrinkles or bulges that only appear under certain conditions. Face segmentation is another optional step that isolates transient and intransient facial features.

• Motion-based approaches lay emphasis on any kind of change in the facial appearance, while deformation-based approaches work with the neutral face as foundation and detect any kind of relevant deviation facial features show compared to their neutral position.

• Image-based approaches extract information from images without any background knowledge: therefore they are fast and simple, but can be unreliable in more complex scenarios containing many different views of an object. Model-based approaches describe the structure of a face in a more elaborate way, using 2D- or even 3D-models estimating motion in a physically correct way, but they rely on very complex mapping procedures and require considerable computational power to be applicable.

• Appearance-based approaches focus on the visible effects of facial muscle activities, while muscle-based approaches try to infer these muscle activities indirectly from the visual information, e.g. via mapping a discovered optical flow to muscle actions represented in a 3D model.

Feature classification The final step interprets all the information previously extracted in order to draw conclusions about sentiments or other messages decoded in the given facial expressions. Traditional approaches use rules and representations of emotional states that were manually created by experts in order to interpret facial expressions directly. They might miss distinct expressions on a more subtle level and individual variations in the expressions of certain sentiments. An alternative is the usage of a facial expression coding scheme like FACS. Such approaches take an intermediate step of recognizing facial actions in an objective way and translating them with the help of rules or dictionaries afterwards. These approaches offer the possibility to describe facial action without automatically interpreting it, thereby also opening new possible application fields like the accurate animation of synthetic faces.
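The three stages above can be summarized as a minimal pipeline skeleton. Every function below is a hypothetical placeholder (none of them performs real vision work); the sketch only makes the data flow between the stages explicit.

```python
# Skeletal sketch of the three-stage analysis process described by Fasel &
# Luettin: face acquisition -> feature extraction -> feature classification.
# All stage functions are hypothetical placeholders.

def acquire_face(frame):
    """Detect and (optionally) normalize the face region in a frame."""
    return {"face": frame}           # placeholder: would return a face crop

def extract_features(face):
    """Derive geometric or appearance features from the face region."""
    return [0.0, 0.0]                # placeholder feature vector

def classify_features(features):
    """Map the features to facial actions or an emotional category."""
    return "neutral"                 # placeholder label

def analyse(frame):
    """Run one frame through all three stages in order."""
    face = acquire_face(frame)
    features = extract_features(face)
    return classify_features(features)

print(analyse("frame_0"))  # → 'neutral' with the placeholder stages
```

The point of the separation is that each stage can be swapped independently, e.g. a holistic against a local feature extractor, without touching the other stages.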

A lot of facial expression analysis systems directly map facial expressions to emotional categories; therefore they are relatively inflexible and unable to interpret actions associated with non-emotional activities. FACS solves this problem by taking the intermediate step of simply describing facial action without any interpretation [FL03]. The following chapter will introduce FACS and the motives behind its creation.

2.1.3. The Facial Action Coding System

The Facial Action Coding System (FACS) created by Ekman & Friesen is the most commonly used facial expression description system in the behavioural sciences [SES13]. The main goal of FACS was the creation of a comprehensive system that is able to recognize all visibly distinguishable movements that can occur in the human face. In contrast to previous approaches, which only examined facial movements according to a given set of emotions that were intended to be discovered, the set of movements featured in FACS was supposed to be exhaustive, including facial movements unrelated to any kind of emotional expression. In order to be complete and objective, it was designed to work independently of any assumptions about possible interpretations of facial expressions. Instead of using an inductive approach, the system was based on an analysis of the facial anatomy [Ekm88].

As movement in the face is always a result of muscle tension or relaxation, it is possible to define a comprehensive set of minimal movements by considering all visible changes in the face caused by activities of facial muscles; those minimal movements are called Action Units (AUs). The final set of AUs featured in FACS is restricted to those movements that are clearly visible and distinguishable with the human eye. As AUs focus on describing minimal movements, an AU might be the result of the tension of more than one muscle, and one muscle can cause more than one AU [Ekm88]. AUs can be rated in their intensity from A (least intense) to E (maximum strength) [KSH+07]. The 44 basic AUs can be combined into over 7000 different combinations that have already been observed [ALC+09].
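For illustration, a decoded AU together with its A-to-E intensity grade might be represented in software as follows. The compact code strings (e.g. "4B") and the parsing logic are assumptions made for this sketch, not a serialization prescribed by FACS.

```python
# Illustrative representation of a FACS observation: an AU number plus an
# intensity grade from A (least intense) to E (maximum strength).
# The textual code format used here (e.g. "4B") is an assumption.

INTENSITIES = "ABCDE"

def parse_au(code):
    """Split a code like '4B' into (au_number, intensity_rank 1..5).
    Codes without an intensity grade (e.g. '43') yield None for the rank."""
    if code[-1] in INTENSITIES:
        return int(code[:-1]), INTENSITIES.index(code[-1]) + 1
    return int(code), None

# A combination of parallel AUs decoded at one point in time:
combination = [parse_au(c) for c in ["4B", "6C", "43"]]
print(combination)  # [(4, 2), (6, 3), (43, None)]
```

In the evaluation data of this thesis the intensity grades are not the focus, so such records could be reduced to the bare AU numbers before building the observation alphabet.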

FACS was designed to create a common terminology for the research area of facial expression analysis in order to enable a versatile usage of the acquired knowledge and to establish a common basis for comparing the findings of different studies. In this context, it is important to separate the description and the interpretation of facial movements in order to minimize the amount of possible biases influencing the results. Additional advantages of FACS are that it is able to describe asymmetrical facial expressions where both sides of the face show different AUs, and that it differentiates between various levels of intensity with which certain AUs occur. As it provides profound knowledge about the muscular basis, it can also help to offset physiognomic deviations that can occur from one face to another [Ekm88].

By applying FACS, complex facial expressions can be divided into sets of AUs and therefore be described comprehensively as a number of visible facial movements. Although this can be done for single images of facial expressions, FACS is mostly used to analyse facial movements that occur over a period of time, usually recorded on video. The description of a person’s facial behaviour via FACS is carried out in four steps [Ekm88]:

1. Recognizing which AUs cause a certain movement in the face.

2. Measuring the intensity of the AUs (for those AUs that are associated with different degrees of intensity).

3. Recognizing if a movement is asymmetrical or one-sided.

4. Recognizing the position of the head and the eyes during the movement.
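
The four coding steps suggest a simple record structure for a single coded movement. The following Python sketch is purely illustrative - the class and field names are assumptions for this thesis' context, not part of FACS itself:

```python
from dataclasses import dataclass

# Hypothetical record for one coded facial movement; field names are
# illustrative, not prescribed by FACS.
@dataclass
class AUEvent:
    au: int                  # Action Unit number, e.g. 4 for "Brow lower"
    intensity: str = "A"     # FACS intensity rating from A (least) to E (max)
    left: bool = True        # which side(s) of the face show the AU
    right: bool = True

    @property
    def asymmetrical(self) -> bool:
        # step 3: a movement is asymmetrical/one-sided if only one side shows it
        return self.left != self.right

event = AUEvent(au=4, intensity="C", left=True, right=False)
```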

As mentioned in the previous chapter, the Facial Action Coding System can be integrated into the process of automated facial expression analysis. For example, Lien et al. [LKCL98] based their computer vision system for automated facial expression recognition on FACS, thereby separating expressions into upper and lower face action. While facial recognition systems had so far analysed prototypic expressions and classified them into categories of emotions, they chose to represent facial expressions as sets of AUs in order to be able to differentiate expressions with a finer granularity.

Strupp et al. [SSB08] used FACS to create a “camera-based system for detecting emotions of a human interaction partner” [SSB08, p. 362], including the detection of facial features based on feature points chosen according to FACS and the subsequent interpretation of those features according to FACS as well.

The exact description of facial expressions via FACS may not only be used to analyse given facial expressions, but also to generate them in the first place: Fabri et al. [FMH04] used the system to refine the expressions of an avatar in order to make it appear more realistic. They assumed that emotional context containing important non-verbal communicative information is often lost during communication over distance through media tools - therefore they investigated whether an avatar’s facial expressions created by applying FACS can be used for non-verbal communication in the context of a “collaborative virtual environment” [FMH04, p. 66]. The results showed that it is generally possible to simulate emotions reasonably well (with some exceptions like disgust) using a FACS-based avatar, therefore opening new areas of application for FACS.

2.1.4. Previous approaches to pain classification using FACS

This chapter will introduce several studies which attempted to realize the recognition and measurement of pain with the help of the Facial Action Coding System. An early approach was a study conducted by LeResche [LeR82], who raised the question whether pain assessment via a person’s non-verbal behaviour is possible. Attempting to find a certain pattern of facial behaviour shown by people experiencing pain, different photographs showing people in painful situations like surgery or birth were analysed via FACS (in order to avoid using any general or inferential terms). Looking for commonalities in all the examined photographs, a consistently reappearing combination of Action Units was found which was suspected to be indicative of pain. Based on these findings, a high degree of regularity in facial expressions associated with pain was assumed.

Prkachin [Prk92] compared possible sets of pain-relevant AUs suggested by LeResche and other scientists and tried to extract those AUs that are universally relevant across different pain stimuli. For this purpose, facial expressions of a number of subjects were analysed under different pain stimulus conditions and checked for consistently reappearing AUs independent of the type of the stimulus. The resulting set of AUs is repeatedly referred to in the literature and consists of the following AUs: AU4 (Brow lower), AU6 (Cheek raise), AU7 (Lid tighten), AU9 (Nose wrinkle), AU10 (Upper lip raise) and AU43 (Eyes close).

Prkachin et al. [PBM94] found out that observers tend to underestimate the pain a person is experiencing compared to the person’s self-report about the intensity of their pain. While both the correct detection of pain through human observers and a subject’s self-report about their pain correlate with the appearance of the basic pain-relevant AUs discovered earlier [Prk92], the facial expressions seem to represent the experienced pain more accurately than the judgements of untrained human observers.


Prkachin & Craig [PC95] assume that the facial actions associated with pain are specific enough to differentiate expressions of pain from other affective states. They support Prkachin’s [Prk92] theory of a “prototypical” pain expression: “information about pain is conveyed by a relatively discrete set of changes in facial expressions” [PC95, p. 202]. Those facial expressions seem to occur especially in situations where the pain experienced by a patient is extremely high and therefore indicates an acute need for treatment.

A study conducted by Solomon et al. [SPF97] examined the effect of training previously untrained observers by increasing their sensitivity towards the occurrence of the pain-relevant AUs identified by Prkachin [Prk92] in order to make their judgements in pain assessment more accurate. It turned out that even a short training had the effect of improving an observer’s ability to correctly interpret a person’s facial expression of pain, especially for lower levels of pain - at higher levels, it seems to be more difficult to overcome the underestimation bias.

Kunz et al. [KSH+07] examined the utility of facial expressions as a pain indicator in cases of patients with dementia. They found out that the frequency and intensity of facial responses to noxious stimuli was even increased in demented patients, especially for the pain-relevant AUs. The facial responses of patients with dementia seemed to encode the intensity of the pain stimulus at least as well as those of healthy control subjects. On the other hand, with a decrease of cognitive functioning, patients with dementia were less able to provide self-report ratings of pain. As frequency and intensity of pain-relevant AUs increase more significantly than frequency and intensity of pain-irrelevant AUs, typical facial expressions of pain are even more visible in patients with dementia - therefore, decoding pain from facial expressions works even better for patients with dementia than for average subjects, showing that facial expression analysis is a reasonable way to assess pain in demented patients that are not able to report pain verbally. As people are increasingly less able to communicate their sentiments of pain with an increasing degree of cognitive impairment, it is especially important to find alternative “observational pain assessment tools” [KSH+07, p. 226]. Another study by Kunz et al. [KMS+09] comparing different approaches of pain assessment confirmed these findings, stating the increased responsiveness of facial expressions in demented patients to be a reliable pain indicator, especially compared to other indicators like self-report ratings of pain or heart rate responses, which become less reliable in the case of demented patients. A study by Lintz-Martindale et al. [LMHBG07] arrived at a similar conclusion when examining facial reactions of patients with Alzheimer’s disease to given pain stimuli, therefore showing FACS to be a valid tool to differentiate between several levels of pain intensity independent of a person’s ability to communicate that pain verbally.

While facial expressions decoded via FACS are usually represented as sets of Action Units without taking into account the succession in which they appear, Schmid et al. [SSS+12] raised the question “whether the structural patterns of AU appearances contain diagnostically relevant information” [SSS+12, p. 183]. Assuming that a sequence of AUs might contain more information than a mere set of the same AUs without any temporal information, a learning algorithm able to process sequences of observations as training examples had to be applied, which was supposed to be based on “grammatical inference (GI) for structural pattern recognition” [SSS+12, p. 183]. The data basis was the same as in the study of Kunz et al. [KSH+07]. Within an episode of pain induction, the occurring AUs were separated according to their onset time and represented as a sequence. If more than one AU had the exact same onset time, the resulting AU-compound was treated as a self-contained symbol, resulting in an alphabet of 76 different AUs and AU-compounds. The chosen learning algorithm was ABL, realising “unsupervised learning of context-free grammars by aligning sequences” [SSS+12, p. 184]. As there were only positive instances of pain sequences, the performance of the learned grammar had to be evaluated in a special way: via cross-validation, a grammar was learned from the training set; afterwards, it was tested how many instances of the test set were accepted by the grammar. The learned grammar performed reasonably well in generalizing to unseen pain sequences, but trying to reduce the number of learned rules led to a severe decrease of the performance. Stocker et al. [SSS13] tried to reduce the complexity of the learned grammar by using genetic algorithms. The correctness of the learned grammar could not be fully evaluated due to a lack of negative examples.
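
The described translation of AU onsets into an ordered sequence with self-contained compounds can be sketched as follows; the function name and the "+" joining convention for compounds are illustrative assumptions, not the notation of the cited studies:

```python
from itertools import groupby

def aus_to_sequence(au_onsets):
    """Turn (AU, onset_time) pairs into an onset-ordered symbol sequence.

    AUs sharing the exact same onset time are merged into one compound
    symbol, as described for the grammar-inference studies above.
    """
    ordered = sorted(au_onsets, key=lambda pair: pair[1])
    sequence = []
    # groupby merges consecutive pairs with identical onset times
    for onset, group in groupby(ordered, key=lambda pair: pair[1]):
        aus = sorted(au for au, _ in group)
        sequence.append("+".join(f"AU{au}" for au in aus))
    return sequence

seq = aus_to_sequence([(4, 0.5), (6, 1.2), (7, 1.2), (43, 2.0)])
# → ["AU4", "AU6+AU7", "AU43"]
```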

A study conducted by Siebers et al. [SES13] investigated the relevance of sequential information for the correct classification of facial expressions through human observers. Sentiments of pain and disgust were displayed by modelling the facial expressions as executions of AUs performed by an avatar and had to be interpreted by the subjects. Sentiments were identified more easily from video sequences than from images; a sequential modelling of the AUs led to better performances in identifying disgust compared to a simultaneous modelling, while there were no significant differences in the case of pain, except that the confusion rate between the two sentiments was reduced.

A study by Siebers et al. [SSS+16] extended the approach of Schmid et al. [SSS+12]: again, grammar inference for structural pattern recognition was applied in order to examine the relevance of sequential information for facial expressions represented as sequences of AUs. Episodes of pain induction were translated into sequences of AUs by stringing AUs together in order of their onset time, treating AUs with the same onset time as self-contained AU-compounds. The basic idea of the study was to replace all occurrences of pain-irrelevant AUs with the “wild card” Action Unit I and to leave only the pain-relevant AUs unchanged. The results showed that a reduced alphabet of AUs leads to a significantly higher recall while precision and accuracy only decrease slightly at the same time - therefore it seems to be a promising approach to consider the aggregation of pain-irrelevant AUs and to concentrate on pain-relevant AUs only.

Of course the usage of FACS is not the only way to perform systematic pain assessment. As the chapter about facial expression analysis (cf. 2.1.2) showed, there are other approaches that do not draw such a clear distinction between describing and interpreting facial expressions which can be applied for pain assessment as well. One example is the work of Ashraf et al. [ALC+09], who successfully created an automatic system for pain assessment based on Active Appearance Models. They used first-order classifiers, recognizing pain directly from shape and appearance features without the intermediate step of action unit detection, arguing that second-order classifiers might be vulnerable to error during the process of action unit detection - but also conceding that they might have an advantage in revealing more detailed information about facial actions. Siebers et al. [SKLS09] measured the performance of several classifiers recognizing facial expressions of pain on the basis of feature point annotation inspired by but not directly using FACS, achieving good results for individual classifiers trained only on one person’s data as well as for global classifiers abstracting from several persons’ facial expressions.

Like the studies by Schmid et al. [SSS+12] and Siebers et al. [SSS+16], the present thesis will examine the relevance of sequential information for a successful classification of facial expressions showing pain. Like the aforementioned studies, it will use data that are already given as sequences of AUs manually decoded via FACS - therefore, this work will only focus on the classification step within the process of facial expression analysis. The chosen machine learning technique for processing sequential data will be the concept of Hidden Markov Models, which will be introduced in the following.

2.2. Classification via Hidden Markov Models

2.2.1. The principles of Machine Learning

The general idea behind the creation of machines was “to liberate people from certain every-day tasks” [Fin14, p. 1]. What started with simple calculators soon led to the idea of artificial intelligence being able to outperform humans in some ways. A considerable milestone was the chess computer Deep Blue defeating world champion Kasparov in 1997 [Fin14].

A crucial aspect of artificial intelligence is the ability to learn, that is, “to improve automatically with experience” [Mit97, p. 1]. Numerous application areas are imaginable, including computers learning adequate treatments from medical records or intelligent houses optimizing energy costs based on the analysis of usage patterns [Mit97]. Mitchell defines the process of machine learning as follows:

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” [Mit97, p. 2]

In order to specify a “well-defined learning problem”, the three components T, P and E have to be clearly identified [Mit97]. For the example of facial expression analysis, the learning problem could be defined as follows:

• Task T : recognizing and classifying facial expressions.

• Performance measure P: the percentage of correctly classified facial expressions; possibly also the average confidence of correct classifications.

• Training experience E: a number of facial expressions that have already been classified.
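
The performance measure P above can be made concrete with a minimal sketch; the function and the class labels are purely hypothetical:

```python
def accuracy(predicted, actual):
    """Performance measure P: percentage of correctly classified examples."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Hypothetical labels for four facial-expression sequences.
print(accuracy(["pain", "pain", "disgust", "neutral"],
               ["pain", "disgust", "disgust", "neutral"]))  # 75.0
```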

The general idea behind machine learning is the concept of pattern recognition: “The field of pattern recognition is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories” [Bis06, p. 1]. Patterns are information measured by certain sensors, like visual or acoustic signals; the rules to describe them are not manually specified, but automatically extracted in the pattern recognition process [Fin14].

Pattern recognition includes several learning approaches: the procedure of supervised learning requires that the target values of the training examples are known to the learning system. If these values are taken out of a finite set of discrete categories, the learning problem is a classification problem. If they consist of continuous variables, it is a regression problem instead. If no target values are known at all, the procedure of unsupervised learning has to be applied: possible approaches in this field are the discovery of groups of similar examples in the training set (clustering), the investigation of the distribution of data (density estimation), or the reduction of the data’s dimensionality (visualization). Finally, the approach of reinforcement learning searches for optimal policies for scenarios in which a certain action has to be chosen in a given situation in order to maximize a reward which is pursued by the learning system [Bis06].

In the present thesis, only the approach of supervised learning dealing with classification problems will be considered. Classification learning takes an input x and assigns it to “one of K discrete classes Ck where k = 1, ..., K” [Bis06, p. 179]. The classes appearing in the course of this work will be assumed to be disjoint, meaning that every example is assigned to exactly one class [Bis06].

The process of classification learning consists of the adaptation of the parameters of a certain model in order to fit a given training set containing examples belonging to a certain application domain. The categories or classes of the training examples are known in advance, therefore allowing a learning algorithm to draw inferences about possible connections between the properties of a given example and its assigned class. The processing of a machine learning algorithm during the training or learning phase results in a function that is able to inspect a new example and predict its category. The learned system can apply this function to previously unseen examples whose category is not available to the system; a number of such examples used to test the performance of the system is called test set. The ability of a trained system to successfully categorize previously unseen examples is called generalization, which is the central goal of pattern recognition [Bis06].

In order to reduce the complexity of a number of given examples and to highlight important features that should be considered by the learning algorithm, or in order to simply reduce the required computational power, examples are often preprocessed before the training phase: this process is called feature extraction [Bis06] - a term that already appeared in chapter 2.1.2 in conjunction with the process of facial expression analysis. An example would be video sequences of facial expressions that are decoded via FACS and represented as sequences of Action Units - in this case, a learning algorithm would only concentrate on the sequence of AUs instead of the raw video sequence of a facial expression containing many times more information. If feature extraction is applied to the training set, it has to be applied in the same way to all examples that are to be processed by the system afterwards in order to enable the application of the learned function [Bis06].

14

A machine learning process can be seen as searching a space of hypotheses in order to find the hypothesis that fits the training examples best while at the same time being able to generalize to unseen examples in the future. In this context, an important question is whether the set of training examples represents the actual distribution of examples in the application area in which the learned system is supposed to be applied [Mit97]. If the set of training examples is too specific and does not represent the entirety of examples present in a given application domain, the learning system will only be optimized for a subset of relevant examples and is more likely to perform poorly on all other kinds of examples - this effect is called overfitting:

“Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h′ ∈ H, such that h has smaller error than h′ over the training examples, but h′ has smaller error than h over the entire distribution of instances.” [Mit97, p. 67]

2.2.2. The structure of Hidden Markov Models

The basic assumption behind Hidden Markov Models is the idea that a “real world process produces a sequence of observable symbols” [RJ86, p. 4], and that it is possible to create a model that explains the given sequence as well as additional sequences of observations consisting of the same symbols as the original one [RJ86]. Scherer’s description of emotion as “characterized by sequential intraorganismic information processing and extremely complex interactions between the various components” [Sch82, p. 556] resembles the idea of such a natural process quite well, with the visible Action Units representing the observable symbols - therefore, HMMs might be a suitable method for modelling emotional processes expressed through facial action.

“In the field of pattern recognition, signals are frequently thought of as the product of sources that act statistically” [Fin14, p. 71]. A machine learning algorithm analysing these signals should be able to recognize and model the statistical properties of these assumed sources as accurately as possible. In this context, the conception of HMMs focusses on the evolution of a signal over time, therefore emphasizing the importance of the signal’s sequential information [Fin14].

In general, the goal of algorithms processing sequential information is to predict future observations on the basis of a number of previous observations, thereby exploiting sequential patterns in the data. Assuming that the probability of an observation depends on the entirety of all previous observations would result in an incredibly complex modelling process. Therefore, in order to guarantee the practicability of considering sequential information within the scope of pattern recognition, Markov Models limit the number of previous observations that are considered for analysing an element of an observation sequence - this restriction is called the Markov Property. Only considering the immediately preceding observation results in first-order Markov chains [Bis06]. Though the usage of higher-order Markov chains is possible as well, the present thesis will only concentrate on first-order Markov chains in the following.

15

The precursor concept of Hidden Markov Models are Markov Chain Models, which were developed by Andrej Andrejewitsch Markov (1856-1922), a Russian mathematician. A Markov Chain Model consists of several states representing the single values out of a set of discrete values, e. g. all words that exist in a certain lexicon. Given a sequence of such values, e. g. a text sequence consisting of a number of words, the model states the probability of each value given the whole sequence as context (or, to be precise, a part of the sequence depending on the Markov Property). Combining all these probabilities results in the overall probability of the whole sequence, therefore enabling the model to assign a probability to every possible combination of values - in the lexicon example, an adequate model would assign greater probability to plausible sentences of the underlying language of the lexicon than to implausible ones. The decision whether a sequence belongs to a model is therefore probabilistic and not deterministic: a sequence could be represented by different models with different probabilities [Fin14].

Hidden Markov Models extend this concept in a way that the states are not associated with single values any more but instead output one value out of a set of possible output values or observations. The visible output of a HMM process is a sequence of observations, while the underlying sequence of states producing the observation sequence is hidden from the observer [Fin14].

HMMs are most often used in the application fields of automatic speech recognition, automatic recognition of handwritten texts and the analysis of biological sequences of RNA (ribonucleic acid) or proteins. Detailed explanations of the concrete usage of HMMs in these fields can be found in Fink [Fin14].

A HMM consists of a finite number (N) of single states. At an arbitrary point in time t, a new state is activated, chosen according to a transition probability distribution depending on the previous active state (with regard to the Markov Property). The newly activated state may also remain the previous active state, as each state can have a transition probability to itself. After the transition, an output or observation is produced according to yet another probability distribution depending on the currently activated state. Therefore, each one of the N states is associated with an individual transition probability distribution and an observation probability distribution - both of them remain constant over time [RJ86]. They therefore represent stationary sequential distributions, meaning that the data evolves over time while the underlying distribution remains the same. The alternative would be nonstationary sequential distributions, where the underlying distribution itself changes with time just like the data produced [Bis06].

HMMs describe a stochastic process consisting of two stages: the first stage realizes the concept of Markov Chain Models as a finite state space with probabilistically described state transitions. Whether a state St is visited in the course of a process at a point in time t only depends on the direct precursor state St−1 according to the Markov Property [Fin14]:

P (St | S1, S2, ..., St−1) = P (St | St−1). [Fin14] (2.1)


In the second stage, an output or observation named Ot is created for every point in time t of a process. The probability distribution responsible for the generation of that observation is only dependent on the state St which is active at the time the observation is created - this is called the output independence assumption [Fin14]:

P (Ot | O1, ..., Ot−1, S1, ..., St) = P (Ot | St). [Fin14] (2.2)

While an observation sequence O1, O2, ..., OT created during a process can be perceived by an observer, the underlying state sequence S1, S2, ..., ST is hidden (leading to the term Hidden Markov Model). A time interval of length T specifies the length of a sequence to be observed. In order to determine the state at time t = 1, an additional probability distribution over all possible states with the only purpose of picking a starting state is established, as no precursor state can be determined at this time. A HMM, usually denoted by λ, consists of the following components [Fin14]:

• a finite set of states referred to by indices:

{s | 1 ≤ s ≤ N}. [Fin14] (2.3)

• a matrix of state-transition probabilities:

A = {aij | aij = P (St = j | St−1 = i)}. [Fin14] (2.4)

• a vector of start probabilities:

π = {πi | πi = P (S1 = i)}. [Fin14] (2.5)

• a number of observation probability distributions, specific for each single state and depending on the type of observation to be processed by the HMM:

– discrete HMMs process observations taken from a finite set of discrete values, represented in a matrix of output probabilities:

B = {bjk | bjk = P (Ot = ok | St = j)}. [Fin14] (2.6)

– continuous HMMs process observations that are vector-valued quantities, describing the output distributions on the basis of continuous probability density functions:

{bj(x) | bj(x) = p(x | St = j)}. [Fin14] (2.7)
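
A minimal discrete HMM λ = (π, A, B) can thus be written down directly; the following sketch uses N = 2 states, an alphabet of 3 symbols and purely illustrative probability values:

```python
# A discrete HMM λ = (π, A, B) with N = 2 states and an alphabet of 3
# observation symbols; all probability values are illustrative toy numbers.
pi = [0.6, 0.4]            # start probabilities π_i, eq. (2.5)
A = [[0.7, 0.3],           # state-transition probabilities a_ij, eq. (2.4)
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],      # output probabilities b_jk, eq. (2.6)
     [0.1, 0.3, 0.6]]

# π and every row of A and B are probability distributions and sum to one.
assert abs(sum(pi) - 1.0) < 1e-12
assert all(abs(sum(row) - 1.0) < 1e-12 for row in A + B)
```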

In the following, the present thesis will only concentrate on discrete HMMs. Assuming that a HMM models a real process generating a sequence of signals, the creation of such an observation sequence O can be described as follows [RJ86]:


1. For t = 1, the initial active state is chosen according to the distribution of starting states π.

2. The observation Ot is chosen according to the currently active state’s observation probability distribution.

3. • If t + 1 < T, the next active state is chosen according to the currently active state’s transition probability distribution; t is set to t + 1; afterwards, steps 2 and 3 are repeated.

• Otherwise, the procedure is terminated.
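
The three generation steps can be sketched as follows (the model values are illustrative toy numbers; a seeded random generator keeps the sketch reproducible):

```python
import random

def sample_sequence(pi, A, B, T, rng=random.Random(0)):
    """Generate an observation sequence of length T following the
    three-step procedure described by Rabiner & Juang."""
    def draw(dist):
        # pick an index with probability proportional to its weight
        return rng.choices(range(len(dist)), weights=dist)[0]
    state = draw(pi)                          # step 1: initial state from π
    observations = []
    for t in range(T):
        observations.append(draw(B[state]))   # step 2: emit from b_state
        if t + 1 < T:
            state = draw(A[state])            # step 3: transition via a_state,j
    return observations

# illustrative toy model
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = sample_sequence(pi, A, B, T=5)
```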

Rabiner & Juang [RJ86] formulate three major problems that have to be solved in order to successfully apply HMMs to real world scenarios:

1. The calculation of the probability with which a given HMM λ produces a given observation sequence O: P (O | λ).

2. The detection of an optimal sequence of underlying states associated with a given observation sequence O for a given HMM λ.

3. The optimization of the parameters A, B and π of a HMM λ in order to maximize the probability P (O | λ) for a given observation sequence O.

2.2.3. Probability calculation: the Forward and the Backward Algorithm

The first problem formulated by Rabiner & Juang addresses the calculation of the total output probability P (O | λ) for a given sequence O. For every possible observation sequence O1, O2, ..., OT, an underlying state sequence s = s1, s2, ..., sT of the same length has to be assumed. Given a particular state sequence, the output probability is the product of the probabilities of the given observations defined for the corresponding states generating them [Fin14]:

P (O | s, λ) = ∏_{t=1}^{T} b_{s_t}(O_t). [Fin14] (2.8)

The probability of that state sequence itself can easily be calculated as the product of the corresponding state transition probabilities (assuming that a_{0i} := π_i and s_0 := 0) [Fin14]:

P (s | λ) = π_{s_1} ∏_{t=2}^{T} a_{s_{t−1},s_t} = ∏_{t=1}^{T} a_{s_{t−1},s_t}. [Fin14] (2.9)

A combination of the equations 2.8 and 2.9 results in the probability of the observation sequence O occurring along a certain state sequence s:

P (O, s | λ) = P (O | s, λ) P (s | λ) = ∏_{t=1}^{T} a_{s_{t−1},s_t} b_{s_t}(O_t). [Fin14] (2.10)


Let αt(i) = P (O1, O2, ..., Ot, st = i | λ)

1. Initialization:
   α_1(i) := π_i b_i(O_1)

2. Recursion, for all times t, t = 1, ..., T − 1:
   α_{t+1}(j) := [∑_i α_t(i) a_{ij}] b_j(O_{t+1})

3. Termination:
   P (O | λ) = ∑_{i=1}^{N} α_T(i)

Figure 2.2.: The Forward Algorithm [Fin14, p. 81]

Because of the specific structure of HMMs, a given observation sequence might be generated by every possible state sequence of matching length - each combination of states will result in another output probability for the given sequence. As the underlying state sequence producing an observation sequence is hidden and cannot be determined, all possible state sequences of length T have to be considered in order to fully determine the total output probability of an observation sequence O for a given HMM λ: it is calculated by summing up the probabilities computed for all possible state sequences [Fin14]:

P (O | λ) = ∑_s P (O, s | λ) = ∑_s P (O | s, λ) P (s | λ). [Fin14] (2.11)
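
Equation 2.11 can be evaluated directly by enumerating all N^T state sequences, which also makes the exponential cost tangible; the model values below are illustrative toy numbers:

```python
from itertools import product

def total_probability_bruteforce(pi, A, B, O):
    """P(O | λ) by summing P(O, s | λ) over all N^T state sequences
    (eq. 2.11); exponential in T, for illustration only."""
    N, T = len(pi), len(O)
    total = 0.0
    for s in product(range(N), repeat=T):
        # P(O, s | λ) along this particular state sequence, eq. (2.10)
        p = pi[s[0]] * B[s[0]][O[0]]
        for t in range(1, T):
            p *= A[s[t - 1]][s[t]] * B[s[t]][O[t]]
        total += p
    return total

# illustrative toy model and observation sequence
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
p = total_probability_bruteforce(pi, A, B, [0, 2, 1])
```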

This method results in an exponential growth of the required computational power with a complexity of O(T N^T). In order to reduce this effort, an alternative method of computing the total output probability is applied which is called the Forward Algorithm (figure 2.2). It exploits the Markov Property stating that the values calculated for a state are only influenced by the direct precursor state in the course of any arbitrary process: therefore, the time at which a state is visited and the combination of states visited before its direct predecessor state are completely irrelevant to the values calculated for that state. With each state only considering its N (the total number of states in the HMM) possible direct precursor states, the effort of the process grows only linearly with the length of the sequence [Fin14].

The Forward Algorithm uses the forward variable αt(i), representing the probability of a process executed by a HMM λ arriving at state i at time t, while creating a given observation sequence up to Ot on the way [Fin14]:

αt(i) = P (O1, O2, ..., Ot, st = i | λ). [Fin14] (2.12)

For the very first step of a sequence, no precursor state to evaluate is available - therefore, for every possible state i, the starting probability πi is multiplied with the state’s output probability bi(O1) of the first observation present in the given sequence.


Let βt(i) = P (Ot+1, Ot+2, ..., OT | st = i, λ)

1. Initialization:
   β_T(i) := 1

2. Recursion, for all times t, t = T − 1, ..., 1:
   β_t(i) := ∑_j a_{ij} b_j(O_{t+1}) β_{t+1}(j)

3. Termination:
   P (O | λ) = ∑_{i=1}^{N} π_i b_i(O_1) β_1(i)

Figure 2.3.: The Backward Algorithm [Fin14, p. 91]

For every subsequent point in time t + 1 after the first one, the forward variable can be computed for every possible state j: the transition probability incoming from every possible precursor state i is multiplied with the forward variable αt(i) of that precursor state for the previous point in time (which is always available through the recursive structure of the algorithm); these products are summed up over all possible precursor states i and afterwards multiplied with the output probability for the corresponding observation at the current point in time t + 1. After this procedure has been carried out for the last step of the sequence at time T, the final forward variables of all possible states can be summed up, resulting in the total output probability for the sequence O given the HMM λ.
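
The initialization, recursion and termination steps just described can be condensed into a short sketch (the model values are illustrative toy numbers):

```python
def forward(pi, A, B, O):
    """Forward Algorithm (figure 2.2): total output probability P(O | λ),
    avoiding the enumeration of all state sequences."""
    N = len(pi)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]          # initialization
    for t in range(1, len(O)):                              # recursion
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    return sum(alpha)                                       # termination

# illustrative toy model and observation sequence
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
p = forward(pi, A, B, [0, 2, 1])
```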

While the Forward Algorithm calculates the total output probability by going through the sequence of observations from the beginning, the alternative Backward Algorithm (figure 2.3) starts at the last observation of the sequence and works through the whole sequence backwards. For this purpose it uses the backward variable βt(j), the counterpart of the forward variable: it represents the probability of a process executed by a HMM λ finishing a given observation sequence after leaving a given state j at time t [Fin14]:

βt(j) = P (Ot+1, Ot+2, ..., OT | st = j, λ). [Fin14] (2.13)

The probability of not generating any further observations after the last step of the observation sequence is simply initialized as 1 for every state. Afterwards, every previous point in time is examined iteratively: for every possible state i that could have been active at a previous point in time t, the backward variable sums up, over all possible successor states j, the outgoing transition probability aij multiplied by the output probability with which the reached state j would create the observation found at the subsequent point in time t + 1 and by the backward variable βt+1(j) calculated for that state at the subsequent point in time (which is already computed due to the recursive nature of the algorithm). In the last step, the backward variables for the very first point in time of the process are summed up over all states, each of them multiplied by the respective state's starting probability and output probability for the first observation, resulting in the total output probability of the sequence O given a certain HMM λ.

The Forward and the Backward Algorithm both reach exactly the same result for the total output probability when applied to a whole sequence; therefore, only one of them is needed for probability calculation. However, both of them will be necessary in the procedure of optimizing HMMs in the course of the learning process described in the next chapter.
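Both recursions can be sketched in plain Java (an illustrative re-implementation for this chapter's notation, not the jahmm code used later in this thesis); checking that both terminations yield the same P(O | λ) mirrors the equivalence stated above:

```java
import java.util.Arrays;

// Sketch of the Forward and Backward Algorithms for a discrete HMM.
// pi[i] = starting probabilities, a[i][j] = transition probabilities,
// b[i][k] = output probability of symbol k in state i, obs = O1..OT.
public class ForwardBackward {

    // alpha[t][i] = P(O1..Ot, s_t = i | lambda)
    static double[][] forward(double[] pi, double[][] a, double[][] b, int[] obs) {
        int T = obs.length, N = pi.length;
        double[][] alpha = new double[T][N];
        for (int i = 0; i < N; i++)                      // initialization
            alpha[0][i] = pi[i] * b[i][obs[0]];
        for (int t = 1; t < T; t++)                      // recursion
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int i = 0; i < N; i++)
                    sum += alpha[t - 1][i] * a[i][j];
                alpha[t][j] = sum * b[j][obs[t]];
            }
        return alpha;
    }

    // beta[t][i] = P(Ot+1..OT | s_t = i, lambda)
    static double[][] backward(double[] pi, double[][] a, double[][] b, int[] obs) {
        int T = obs.length, N = pi.length;
        double[][] beta = new double[T][N];
        Arrays.fill(beta[T - 1], 1.0);                   // initialization
        for (int t = T - 2; t >= 0; t--)                 // recursion
            for (int i = 0; i < N; i++) {
                double sum = 0.0;
                for (int j = 0; j < N; j++)
                    sum += a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j];
                beta[t][i] = sum;
            }
        return beta;
    }

    // termination of the Forward Algorithm: sum of the final forward variables
    static double totalForward(double[][] alpha) {
        double p = 0.0;
        for (double v : alpha[alpha.length - 1]) p += v;
        return p;
    }

    // termination of the Backward Algorithm: weighted sum at the first step
    static double totalBackward(double[] pi, double[][] b, int[] obs, double[][] beta) {
        double p = 0.0;
        for (int i = 0; i < pi.length; i++)
            p += pi[i] * b[i][obs[0]] * beta[0][i];
        return p;
    }
}
```

The model values used in any example here are arbitrary; for long sequences a practical implementation would additionally need scaling or log probabilities to avoid numeric underflow.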

While the total output probability calculates the overall probability of a given observation sequence, considering every possible combination of states of matching length for a given HMM, the optimal output probability only considers the state sequence achieving the highest possible probability for an observation sequence, therefore making it possible to “evaluate the specialization of partial models within the total model” [Fin14, p. 82]. It can be calculated as the probability of the observation sequence along the optimal path (the path resulting in the highest probability) of a given HMM [Fin14]:

P∗(O | λ) = P(O, s∗ | λ) = max_s P(O, s | λ). [Fin14] (2.14)

The optimal output probability can be calculated efficiently by an altered version of the Forward Algorithm. The Viterbi Algorithm calculates the optimal path itself and therefore solves the second problem formulated by Rabiner & Juang [RJ86]:

s∗ = arg max_s P(s | O, λ) = arg max_s P(O, s | λ). [Fin14] (2.15)

The algorithms for efficiently calculating the optimal path and the optimal output probability are specified in Fink [Fin14]. For the purpose of this work, only the total output probability will be considered in the following.

2.2.4. Learning: the Baum-Welch-Algorithm

It is not possible to build a HMM optimized for given example data from scratch; the learning procedure can only improve a given HMM in a way that the updated model statistically resembles the given example data better than the original one. Therefore, the initial model used for optimization influences the quality of the result significantly. Ideally, reasonable starting values for the HMM initialization can be estimated by experts of an application domain; either way, the initial HMM has to be optimized for a given set of training data iteratively [Fin14].

The Baum-Welch-Algorithm is a variation of the EM-Algorithm, a general method for optimizing statistical models with hidden variables [Fin14]. It optimizes a HMM with respect to the total output probability P(O | λ)¹ and therefore solves the third problem formulated by Rabiner & Juang [RJ86]. Like all HMM learning methods, it transforms the parameters of a HMM λ in such a way that the resulting HMM λ̄ achieves a score that is better than or at least equal to the score achieved by the original HMM; in the case of the Baum-Welch-Algorithm, the score to be optimized is the total output probability P(O | λ) [Fin14]:

¹ Alternative learning algorithms, considering the optimal state sequence and the resulting probability P(O, s∗ | λ) instead, would be the Viterbi Training or the segmental k-means method described by Fink [Fin14].

P(O | λ̄) ≥ P(O | λ). [Fin14] (2.16)

If the trained HMM achieves the same score as the original HMM, it can be assumed that the original HMM had already reached a local maximum with the given parameter values [Fin14].

Generally, the update of the model parameters depends on the expected frequency of certain events, which is derived from analysing the training set. State transition probabilities and output probabilities are redefined as the expected number of specific transitions from or outputs of a state relative to the expected number of total transitions from or outputs of that state. In order to estimate the expected number of events associated with a certain state i, the probability that this state was active at a given point in time t has to be determined; it will be referred to as the state probability P(St = i | O, λ), usually denoted as γt(i) [Fin14].

While the forward variable αt(i) determines the probability of reaching a state i at a given point in time t of a certain sequence, the backward variable βt(i) determines the probability of completing the given sequence by starting from state i at time t. The combination of both procedures, called the Forward-Backward Algorithm, is therefore able to predict the probability of state i being active at time t during an observation sequence. Generally, the state probability can be computed as follows:

P(St = i | O, λ) = P(St = i, O | λ) / P(O | λ). [Fin14] (2.17)

While P(O | λ) can be computed by using either the forward or the backward procedure for the whole observation sequence, P(St = i, O | λ) has to be determined by combining the probability of reaching state i at time t in the course of the process generating the observation sequence and the probability of completing the sequence in the course of the process afterwards:

P(St = i, O | λ) = P(O1, O2, ..., Ot, St = i | λ) · P(Ot+1, Ot+2, ..., OT | St = i, λ) = αt(i) βt(i). [Fin14] (2.18)

Therefore, the state probability can be calculated as:

γt(i) = P(St = i | O, λ) = αt(i) βt(i) / P(O | λ). [Fin14] (2.19)

In order to update a state transition probability aij from state i to state j, the probability of that transition given the training data O has to be calculated in the next step for every possible point in time t as ξt(i, j) [Fin14]:

ξt(i, j) = P(St = i, St+1 = j | O, λ) = P(St = i, St+1 = j, O | λ) / P(O | λ) = αt(i) aij bj(Ot+1) βt+1(j) / P(O | λ). [Fin14, notation adjusted] (2.20)

These values can only be computed for all t up to T − 1, as the state active at the last step of an observation sequence has no more outgoing transitions that could be considered. After all ξt(i, j) values have been calculated, an updated state transition probability āij can be computed for every combination of i and j by summing up the expected transition probabilities from a certain i to a certain j over all times t up to T − 1 and normalizing this sum by the expected probability of all outgoing transitions of state i, which can be calculated as the general state probability γt(i) summed up over all times t up to T − 1 [Fin14]:

āij = [Σt=1..T−1 P(St = i, St+1 = j | O, λ)] / [Σt=1..T−1 P(St = i | O, λ)] = [Σt=1..T−1 ξt(i, j)] / [Σt=1..T−1 γt(i)]. [Fin14, notation altered] (2.21)

In the next step, the starting probability of every state i has to be updated: as no ingoing state transitions can be considered here, the starting probability is simply estimated as equal to the probability of the given state being active at the very first point in time in the process, which is γ1(i) [Fin14]:

π̄i = P(S1 = i | O, λ) = γ1(i). [Fin14] (2.22)

Finally, the output probabilities have to be updated for every state j: for every discrete observation ok, the probability of state j being active, summed up over all times at which ok was observed in the given training set, is divided by the summed-up probability of state j being active over all times t regardless of the observed output, resulting in normalized updated observation probabilities²:

b̄j(ok) = [Σt=1..T P(St = j, Ot = ok | O, λ)] / [Σt=1..T P(St = j | O, λ)] = [Σt: Ot=ok P(St = j | O, λ)] / [Σt=1..T P(St = j | O, λ)] = [Σt: Ot=ok γt(j)] / [Σt=1..T γt(j)]. [Fin14] (2.23)

The model parameters of HMMs can be trained automatically from a given training set as described above, giving HMMs an advantage over symbolic or rule-based approaches - but this advantage can only be used if the training set is extensive enough. Unlike the parameters themselves, the configuration of these parameters, like the number of states of the HMM to be created, has to be determined manually. Finally, every HMM learning process requires an initial HMM as a starting point, which can severely influence the performance of the learning process depending on the initial values [Fin14].

² The updated output probability distributions are calculated differently in the case of continuous observations; this procedure is described by Fink [Fin14].

2.2.5. HMMs as classifiers

The premise of the HMM concept is that the data available for training was “generated by a natural process which obeys [...] statistical regularities” [Fin14, p. 4] that can be modelled by HMMs. The model that is learned based on this premise shall reproduce the pattern of the assumed process as closely as possible in order to be able to estimate the overall probability of a given sequence of observed outputs. Furthermore, using HMMs in classification tasks is possible by associating one separate HMM with every possible class and comparing the probabilities these HMMs calculate for given sequences: the total output probability P(O | λ) represents the probability with which a given HMM generates a certain sequence of observations, therefore indicating “how well a certain model describes given data” [Fin14, p. 77] - the model that describes a given sequence best by achieving the highest probability therefore determines the class membership.

In a classification task using Hidden Markov Models, different HMMs, each of them associated with a certain class (i. e. the HMM λi modelling properties of the class Ωi), can be compared in order to find the HMM, and therefore the class, achieving the highest posterior probability P(λj | O) for a given observation sequence O:

λj = arg max_λi P(λi | O) = arg max_λi [P(O | λi) P(λi) / P(O)] = arg max_λi P(O | λi) P(λi). [Fin14] (2.24)

The probability P(O) is not relevant for the classification, as it is a constant independent of λi; therefore it can be dropped from the equation. If the prior probabilities P(λi) cannot be evaluated for a given classification task, they might be dropped as well, leaving only the total output probability P(O | λi) as the remaining relevant variable. A classification approach that does not consider the prior probability is called Maximum Likelihood (ML) classification, while an approach considering the prior probability is called Maximum A Posteriori (MAP) classification [Fin14].
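The decision rule of equation (2.24) reduces to a simple argmax over per-class scores. A minimal sketch (the class and method names are illustrative, not part of the implementation described later; the scores P(O | λi) are assumed to have been computed by the Forward Algorithm beforehand):

```java
import java.util.Map;

// Sketch of ML vs. MAP classification over per-class HMM scores:
// without priors the rule is Maximum Likelihood, with priors it is
// Maximum A Posteriori, as in equation (2.24).
public class HmmClassifier {

    // outputProb: class label -> P(O | lambda_i)
    // prior: class label -> P(lambda_i), or null for ML classification
    static String classify(Map<String, Double> outputProb, Map<String, Double> prior) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : outputProb.entrySet()) {
            double score = e.getValue() * (prior == null ? 1.0 : prior.get(e.getKey()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Note how a strong prior can overturn the ML decision: a class whose HMM assigns a slightly lower output probability may still win under MAP if it is much more frequent in the training data.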

3. The implementation of a HMM-based classification system including the processing of parallel observations

Now that the theoretical basis of this work is set, a system has to be created that realizes the execution of classification tasks with the help of Hidden Markov Models; a special capacity of this system shall be the processing of parallel observations that appear in the course of an observation sequence. The first part of this chapter will describe the realization of a HMM-based classification system in the data mining software RapidMiner. The second part will introduce the different approaches of handling parallel observations and the necessary extensions of the RapidMiner implementation in order to integrate them.

3.1. Realizing HMM-based classification in RapidMiner

In order to evaluate HMM-based classification of facial expressions of pain, a concrete implementation of the concept described in chapter 2.2 has to be provided, including the possibility to create and train HMMs for a given training set and to measure their performance in classification tasks via a test set. RapidMiner is one of the most popular data mining software solutions. It allows the execution of machine learning processes on the basis of a wide variety of machine learning concepts. As the standard version of RapidMiner does not support the application of machine learning via HMMs yet, an extension has to be created which realizes the integration of the concept of HMM-based classification into RapidMiner. This chapter will determine the general requirements for a realization of HMM-based classification, introduce the software RapidMiner as well as the jahmm library, which implements the basic functionality of the HMM concept, and finally present a RapidMiner extension based on the jahmm library meeting all the specified requirements of a successful HMM-based classification system.

3.1.1. Requirements

An implementation of a HMM-based classification system has to meet several requirements that are specified in the following. In the implementation created for the present thesis, the following limitations shall apply:

• Classification shall be applied via first-order Hidden Markov Models only, meaning that probability values calculated for a certain state only depend on the immediate precursor state.

• Classification shall be applied via discrete Hidden Markov Models only, meaning that possible observation values are taken from a finite set of discrete symbols.

• Classification shall only consider disjoint classes, meaning that every example is assigned to exactly one class.

• Classification shall only consider the standard total output probability P(O | λ) as classification criterion.

Under these limitations, the implementation to be created has to meet the following requirements:

1. The implementation shall be able to represent a first-order discrete Hidden Markov Model consisting of an arbitrary finite positive integer number of states and specified for an arbitrary finite alphabet of discrete observation values. The HMM has to include:

a) A probability distribution of starting states (π), associated with the number of states the model consists of.

b) A matrix of state transition probabilities (A), including each possible state's probability distribution of state transitions to all other possible states (including the considered state itself).

c) A matrix of output probabilities (B), including each possible state's probability distribution associated with the alphabet of possible observations.

All probability values shall be positive decimal numbers between 0.0 and 1.0; the values within every single probability distribution shall always add up to 1.0.

2. The implementation shall make it possible to initialize a HMM as described in requirement 1. It will be assumed that no information is available for estimating the start parameters of a HMM; therefore, the initial values shall be chosen by simply defining the number of states of the HMM and assigning random values to the probability distributions. For this purpose, it shall be possible to set a random seed, so that the same pseudo-random values can always be created by using the same seed, making results repeatable and comparable.

3. The implementation shall comprise implementations of all algorithms of the HMM concept that are necessary to perform a classification task under the aforementioned limitations; the minimum set of required algorithms therefore includes:

a) The Forward Algorithm and the Backward Algorithm for calculating the total output probability P(O | λ) and the γt(i) values.

b) The Baum-Welch-Algorithm for training a given HMM in order to maximize the total output probability P(O | λ) of a given training set.

4. The implementation shall be able to associate the entities processed in the course of a classification task with labels indicating the class they belong to:

a) A HMM shall be associated with a label indicating the class it is representing.

b) Observation sequences shall be associated with labels indicating their class membership.

5. The implementation shall be able to represent a classification model comprising one HMM for every class associated with a given classification task. The model shall be able to perform the following actions:

a) Given an observation sequence, the model shall calculate the total output probability P(O | λ) for every HMM that is part of the model. Based on these values, the model shall be able to predict the class of the observation sequence by applying either Maximum Likelihood (ML) classification or Maximum A Posteriori (MAP) classification:

i. In the case of ML classification, the predicted class of the observation sequence shall be chosen as the class of the HMM achieving the highest total output probability for the given sequence. This HMM is chosen as: arg max_λi P(O | λi).

ii. In the case of MAP classification, the predicted class of the observation sequence shall be chosen as the class of the HMM achieving the highest value when multiplying its total output probability for the given sequence with its prior probability P(λi). This HMM is chosen as: arg max_λi P(O | λi) P(λi).

If predicting the sequence's class by computing the total output probability is not possible, the model shall guess the predicted class as the class associated with the HMM with the highest prior probability P(λi).

b) Given an observation sequence, the model shall calculate the confidence for every HMM that is part of the model. The confidence describes the probability with which a certain HMM outputs the given sequence relative to the probabilities achieved by all other HMMs in the model. Given a number of K HMMs, the confidence of classifying a given sequence O is calculated for every HMM λi as:

i. In the case of ML classification: P(O | λi) / Σk=1..K P(O | λk).

ii. In the case of MAP classification: P(O | λi) P(λi) / Σk=1..K P(O | λk) P(λk).

6. The implementation shall be able to create a classification model as specified in requirement 5 from a given training set by the following procedure:

a) The examples in the training set are grouped according to the classes they are assigned to, resulting in a number of subsets where each subset contains all examples associated with one class.

b) For each subset, a HMM is learned from all the examples in the subset by applying the Baum-Welch-Algorithm. Afterwards, the learned HMM is labelled with the class associated with the subset it was created from.

c) The prior probability P(λi) of each HMM is defined as the number of all examples in the associated subset divided by the number of all examples in the training set.

d) The classification model is initialized as an entity comprising all the HMMs learned from the subsets of the training set.

7. The implementation shall make it possible to evaluate the performance of a learned classification model by letting it classify a number of test sequences and measuring the percentage of correctly classified examples (accuracy), the recall and precision values of the single classes and the average confidence value for the correct class over all examples in the test set.

3.1.2. RapidMiner

RapidMiner¹ is an open source data mining² software; it is known for “its advanced ability to program execution of complex workflows, all done within a visual user interface” [JVDS14, p. 439]. RapidMiner realizes the implementation of a wide variety of machine learning concepts and provides an elaborate framework for their application, evaluation and optimization. Therefore it constitutes a promising basis for the task of realizing a HMM-based classification of facial expressions.

The GUI of RapidMiner allows the user to model data mining procedures through the combination of operators. An operator is a core unit of RapidMiner: it contains a certain kind of functionality that is executed when the operator is activated. Multiple operators are part of a process: by creating a new process, the user can arrange operators in a certain constellation and specify their input and output values. Additionally, every operator has parameters that can be set in order to configure the functionality of the process [Chi13].

In RapidMiner, a set of data to be processed is stored in a table called example set. Every single row of the table represents a processable example, while every column of the table represents an attribute of the example set [Chi13]. Attributes describe categories of features that make up the given data, while each example provides concrete values of these features in the entries of the corresponding columns [Mie14]. Attributes of example sets can have different types like integer, real or nominal, as well as different roles: one of these role types is label, representing the class to be predicted in a certain classification context [Chi13].

The following RapidMiner operators can be used in order to realize a standard classification task on the basis of an arbitrary machine learning concept:

¹ https://rapidminer.com/
² Data mining is the process of analysing data in order to find useful patterns, therefore being a core element of the research area of machine learning and pattern recognition [KD15].

Set Role This operator makes it possible to change the role of a given attribute: this is especially important when realizing a classification task, as the attribute holding the classes to be considered in the classification process has to be highlighted by assigning the label role to it. The operator allows the user to assign any possible role to any chosen attribute [Akh14a].

Apply Model This operator performs the application of a trained model on a test set of previously unseen data. The required inputs of the operator are the trained model and an example set to be used as test set. In the application process, a new attribute with the special role prediction is added to the example set. A value representing the label or class predicted by the model due to its underlying algorithm is assigned to each example of the test set in the column of the prediction attribute [Akh14a].

Performance (Classification) This operator measures the performance of a model by comparing the predicted labels of an example set created by the Apply Model operator with the true labels of these examples, which were not visible to the Apply Model operator. The required input is a set of examples equipped with their predicted labels as well as their true labels; the output of the operator is a performance vector, representing a statistical performance evaluation of a classification task by applying several performance criteria [Akh14a].

X-Validation / Cross Validation This operator is a nested operator performing multiple iterations when it is activated. A given set of examples is split into a number of k subsets (the number being customizable by a user parameter): k times, a model is trained by an operator realizing a machine learning algorithm, with all subsets except one used as training set; afterwards, the performance of the learned model is evaluated by applying it to the remaining subset as test set. This is repeated until every subset has been used as test set once [Akh14b].

Optimize Parameters (Grid) This operator makes it possible to systematically test combinations of parameter values for given operators in order to find the combination resulting in a model achieving the best performance in the course of the process. The parameters to be optimized and the ranges of values to be tested can be chosen freely; after testing every possible combination of values, the operator delivers the results for the combination achieving the highest performance in the classification task. By applying this procedure, the operator manages to “isolate the influence of one factor on overall performance” [JVDS14, p. 440].

Within an arrangement of these operators, an arbitrary operator implementing a machine learning algorithm can be inserted in order to create an executable process which applies the algorithm to a given data set, measures the performance of the model learned by the algorithm and finds optimal parameters for this model in order to maximize its effectiveness. While a wide variety of machine learning concepts is already implemented in RapidMiner, HMMs are not included yet. The implementation of a RapidMiner extension that introduces an operator able to learn a classification model based on HMMs

would extend the functionality of RapidMiner, while simultaneously making it possible to use the infrastructure provided by RapidMiner in order to realize the classification task pursued in the present thesis (thereby satisfying requirement number 7). Integrating HMMs into RapidMiner requires the implementation of additional functionality via the RapidMiner extension template³. As RapidMiner is programmed in the Java programming language, the extension will be written in Java, too.

3.1.3. The jahmm library

The RapidMiner extension enabling classification via HMMs will build on the basic functionality of HMMs as described in chapter 2.2. An adequate library implementing this functionality in the Java programming language could provide a useful foundation for the implementation to be created. For this purpose, the jahmm⁴ library is chosen. It was initially written by Jean-Marc Francois and is described as “a Java implementation of Hidden Markov Model (HMM) related algorithms [...] designed to be easy to use [...] and general purpose”⁵. It is distributed under the open source BSD license and therefore freely available and extendable for academic purposes. The following classes provided by the jahmm library are especially important for the implementation to be created:

Observation This abstract class represents any kind of single observation processed by a HMM. Each subclass specifies a certain type of observation: ObservationInteger for integer numbers, ObservationReal for decimal numbers, ObservationVector for vectors of real numbers and finally ObservationDiscrete for discrete observation values which have to be specified in a corresponding Enum.

Opdf This interface specifies an observation probability distribution function for any type of observation. Implementing classes shall be able to return probability values for any observation within the distribution, to generate a random observation according to the distribution and to update the distribution according to given weights calculated during a learning process. Every type of observation mentioned above has its own corresponding class implementing Opdf with respect to the characteristics of the observation type: for example, the class OpdfInteger simply defines the probability distribution as an array of double values, while the class OpdfDiscrete takes the values in the Enum defining the alphabet of possible observations and maps them onto an underlying OpdfInteger of adequate size.

OpdfFactory This interface specifies a class for automatically creating Opdf instances for a given type of observation, dependent on the implementing class. Classes like OpdfIntegerFactory or OpdfDiscreteFactory create distributions over a given number of integers or over observation values specified by an Enum, with uniformly distributed probability values.

³ https://github.com/rapidminer/rapidminer-extension-template
⁴ https://code.google.com/archive/p/jahmm/
⁵ https://code.google.com/archive/p/jahmm/

Hmm This class implements the basic structure of a Hidden Markov Model. It is always typecast for a certain type of observation. The distribution of starting probabilities is represented as an array of double values, while the state transition probabilities are stored in a matrix of those. For each state, a probability distribution implementing the Opdf interface for the type of observation the HMM is typecast for is available. A Hmm can be initialized with an arbitrary positive integer number of states. Using this implementation of HMMs satisfies requirement number 1. The constructor of the Hmm class allows the creation of a HMM with manually predefined values or with distributions uniformly distributed via an OpdfFactory. This option partially satisfies requirement number 2, but a procedure has to be added that automatically assigns random values that are not uniformly distributed in order to fully meet the requirement⁶.

ForwardBackwardCalculator This class implements all calculations associated with the Forward Algorithm and the Backward Algorithm, including the calculation of the total output probability P(O | λ) of an observation sequence for a given HMM as well as the calculation of the forward variable αt(i) and the backward variable βt(i) for an arbitrary state i at a certain point in time t. By using this class, requirement number 3.a) is met.

BaumWelchLearner This class implements all calculations associated with the Baum-Welch-Algorithm, therefore enabling the training of a given HMM according to a training set of observation sequences by updating the starting probabilities, state transition probability distributions and output probability distributions of every state. The training process can be executed for an arbitrary positive number of iterations. Using this class satisfies requirement number 3.b).

Apart from the classes mentioned above that will be actively used in the context of the RapidMiner extension to be implemented, the jahmm library implements a lot of additional functionality associated with HMMs, including for example the Viterbi algorithm and the processing of continuous observations through Gaussian distributions, Gaussian mixture distributions or multivariate Gaussian distributions. An overview of the full functionality of the jahmm library can be found in the user guide [Fra06].

3.1.4. Implementation

While the jahmm library implements the necessary basic functionality of the HMM concept, RapidMiner provides an infrastructure for executing, evaluating and optimizing given machine learning algorithms. The task of the RapidMiner extension to be implemented is to link these two elements and to satisfy the remaining requirements for creating a classification system based on HMMs.

⁶ This is necessary as, given a HMM with uniformly distributed probability distributions, the training of the HMM would only update the output probability distributions but never the starting state probabilities or the state transition probabilities.

A first aspect to be examined are the possible types of observations. The jahmm library supports observations given as integers, real numbers or vectors of real numbers, as well as discrete observations taken from an alphabet of symbols specified in an Enum. The implementation to be created only has to consider discrete observations (like the Action Units featured in FACS), therefore the usage of the class ObservationDiscrete based on an arbitrary Enum would be sufficient in principle - the problem is that the alphabet of possible observations would have to be determined before running the program this way, as an Enum always represents a fixed set of predefined constants⁷. Implementing an Enum of all possible AUs present in FACS manually would be acceptable with respect to the concrete task of the present thesis, but this way the RapidMiner extension would work exclusively for classification tasks involving AUs as observations. In order to generalise the functionality of the RapidMiner extension, the alphabet of possible observation values has to be extracted from the given example set dynamically. Therefore the representation as Enum has to be replaced by a more general representation that can be created dynamically; in order to achieve this, alphabets of discrete observation values shall be represented as sets of String values. This new form of representation requires the introduction of some additional classes to the jahmm library:

ObservationString This class extends the abstract class Observation and realizes an observation consisting of a String value.

OpdfString This class implements the interface Opdf and represents a probability distribution over a given alphabet of String observations. It can be initialized either with a given set of probability values or by simply distributing probability values uniformly for a given alphabet of possible String observations. Like OpdfDiscrete, it maps a number of discrete values onto an OpdfInteger of adequate size.

OpdfStringFactory This class implements the interface OpdfFactory: it realizes a factory for instances of the OpdfString class which automatically generates distributions with uniformly distributed probability values for a given alphabet of String observations.
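As an illustration, the role of these classes can be pictured roughly as follows. This is a simplified sketch, not the actual jahmm code: the internal mapping onto an OpdfInteger is replaced by a plain HashMap, and only the uniform initialization is shown.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of a discrete observation holding a String value.
class ObservationString {
    final String value;
    ObservationString(String value) { this.value = value; }
}

// Sketch of a probability distribution over a String alphabet.
// Unlike the real OpdfString, it stores the values in a HashMap
// instead of mapping them onto an OpdfInteger.
class OpdfString {
    private final Map<String, Double> probabilities = new HashMap<>();

    // Initialize with uniformly distributed probability values.
    OpdfString(Set<String> alphabet) {
        double uniform = 1.0 / alphabet.size();
        for (String symbol : alphabet) probabilities.put(symbol, uniform);
    }

    double probability(ObservationString o) {
        return probabilities.getOrDefault(o.value, 0.0);
    }
}

public class OpdfSketch {
    public static void main(String[] args) {
        OpdfString opdf = new OpdfString(Set.of("AU4", "AU7", "AU10", "AU12"));
        System.out.println(opdf.probability(new ObservationString("AU4"))); // 0.25
    }
}
```

Because the alphabet is passed in at construction time, no Enum has to be fixed before running the program - which is exactly the flexibility the String-based representation is meant to provide.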

With these classes providing the opportunity of a dynamic extraction of an alphabet of discrete observation values, the next step of the implementation can be realized. Requirement number 4 involves the labelling of HMMs and observation sequences with their associated classes - this is implemented as follows:

LabeledHmm This class wraps a single Hmm (typed for String observations, as only discrete observations will be considered in this version of the implementation) and its associated label (given as String).

LabeledObservationSequence This class wraps a list of String observations and its associated label (given as String).

7 https://docs.oracle.com/javase/tutorial/java/javaOO/enum.html


Like the set of possible observations, the set of possible classes shall be extracted from a given example set dynamically in order to ensure the universal applicability of the RapidMiner extension. As both sets will be needed in the course of the training process, they have to be stored in a variable:

Alphabet This class stores the alphabet of possible observations as well as the alphabet of possible labels for a given HMM classification task (both of them implemented as sets of String values declared as final). These sets shall be extracted from the training set before starting the learning process: this way, the training set can be separated properly and HMMs can be customized according to the given classes and observations.

Given these labelled entities, several classes can be implemented that execute the task of learning HMMs associated with certain labels from a training set of labelled observation sequences:

RandomHmmCreator This class creates a HMM (typed for String observations) with randomized probability values for a given number of states and a given alphabet of possible observation values. In order to avoid uniformly distributed probability values created by an OpdfFactory, the RandomHmmCreator uses the helper class Randomizer, which creates random probability distributions as arrays of double values that always add up to 1.0. In order to make the results of the generation of "random" values repeatable, the Randomizer is initialized with an instance of the RapidMiner class RandomGenerator, which creates pseudo-random values based on a given global or local seed.

SingleHmmLearner This class learns a HMM (typed for String observations) from a given training set of observation sequences (each of them given as a list of instances of the class ObservationString) for a given number of iterations. The HMM to start the learning process with is created by the RandomHmmCreator for a given number of states, a given alphabet of possible observations and a given RandomGenerator; afterwards, the Baum-Welch-Algorithm is executed for the given number of iterations, training the HMM according to the given training set by applying the jahmm class BaumWelchLearner.

MultiHmmLearner This class receives a training set of labelled observation sequences (given as instances of the class LabeledObservationSequence) and an alphabet of possible labels. It separates the sequences according to the labels in the alphabet; for every label, a LabeledHmm is learned from the subset of training sequences belonging to the label by applying the SingleHmmLearner with the sequences, a given number of states, a given number of iterations, a given alphabet of observations and a given RandomGenerator as parameters. This procedure results in a number of labelled HMMs, each of them being trained on the subset of training sequences belonging to a certain label.
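The separation step performed by the MultiHmmLearner can be illustrated with the following simplified sketch. The record type is a hypothetical stand-in for LabeledObservationSequence, and the actual training call to SingleHmmLearner is omitted:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the LabeledObservationSequence class:
// an observation sequence together with its associated label.
record LabeledSequence(String label, List<String> observations) {}

public class SequenceGrouping {
    // Group the labelled training sequences by their label, so that one
    // HMM can be trained per label (by SingleHmmLearner) afterwards.
    static Map<String, List<List<String>>> groupByLabel(List<LabeledSequence> trainingSet) {
        Map<String, List<List<String>>> groups = new HashMap<>();
        for (LabeledSequence seq : trainingSet) {
            groups.computeIfAbsent(seq.label(), k -> new ArrayList<>()).add(seq.observations());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<LabeledSequence> training = List.of(
            new LabeledSequence("pain", List.of("AU4", "AU4-7", "AU10")),
            new LabeledSequence("no-pain", List.of("AU12")),
            new LabeledSequence("pain", List.of("AU4-7")));
        // Each group would then be handed to SingleHmmLearner.
        System.out.println(groupByLabel(training).get("pain").size()); // 2
    }
}
```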


The possibility to learn a number of labelled HMMs from a mixed set of labelled observation sequences lays the foundation of the HMM-based classification model specified in Requirements number 5 and 6. The last step of the implementation is the integration of this learning procedure into RapidMiner by creating an operator applying the algorithm in order to produce a model capable of executing classification tasks:

HMMLearner This class extends the original RapidMiner class AbstractLearner (which itself extends the class Operator and therefore represents a RapidMiner operator): it receives an instance of the RapidMiner class ExampleSet which is used as training set and has to deliver an instance of a class implementing the RapidMiner interface Model. In order to achieve this, it extracts the Alphabet of possible observations and labels as well as the training set in the form of instances of the class LabeledObservationSequence from the ExampleSet - this procedure is realized by using the helper class InformationExtractor. The training process can be configured by the user through a number of parameters:

sequence attribute name The Attribute holding the sequence of observations to be extracted (chosen from the number of available attributes).

number of states The number of states the learned HMMs should consist of (chosen from a range between 1 and Integer.MAX_VALUE).

iterations The number of iterations the HMMs should be trained for (chosen from a range between 1 and Integer.MAX_VALUE).

MAP classification The decision whether MAP classification shall be applied (chosen as checkbox; if no MAP classification is chosen, ML classification is applied instead).

use local random seed The decision whether a local random seed should be used to initialize the RandomGenerator creating the random probability values for the HMM initialization (chosen as checkbox; if a local random seed is chosen, every HMM is instantiated with the same random probability values - otherwise, all random values are generated by the global RandomGenerator of the RapidMiner process, resulting in different probability values every time a new HMM is created).

local random seed If using a local random seed was chosen: the value of the local random seed (chosen from a range between 1 and Integer.MAX_VALUE).

All the important information for the learning process, either extracted from the training set or requested as parameters (including the alphabet of observations, the alphabet of labels, the training set of labelled observation sequences, the number of states, the number of iterations, and the RandomGenerator initialized either with a given local seed or globally), is used by the class MultiHmmLearner to learn a number of labelled HMMs which form the basis of the classification model. Additionally, the prior class probabilities are calculated by determining the number of examples belonging to a certain class relative to the number of examples in the whole example set. Finally, a model is instantiated with the given number of labelled HMMs, a HashMap containing the prior class probability for each label, and a boolean indicating whether MAP classification should be performed or not - this model is described in the following.

HMMClassificationModel This class extends the class SimplePredictionModel (which itself implements the RapidMiner interface Model): it must be able to predict a label for each single Example of a given ExampleSet used as test set. It is initialized by the HMMLearner and contains a number of labelled HMMs equivalent to the number of different labels in the context of a classification task. Given an Example from the test set, the model extracts the stored observation sequence by using the helper class InformationExtractor. Afterwards, it calculates the total output probability for that sequence for every single HMM contained in the model. Depending on whether MAP classification is activated, the procedure continues as follows:

• If MAP classification is deactivated, the label associated with the HMM achieving the highest total output probability is chosen as the predicted label for the Example.

• If MAP classification is activated, every total output probability value is multiplied with the associated model's prior class probability (stored in the HashMap created during the learning process); afterwards, the label associated with the HMM achieving the highest probability value this way is chosen as the predicted label for the Example.

In either case, additional attributes are created holding the confidence values associated with the possible labels. The confidence value associated with a label is calculated as the probability previously computed for the sequence by the label's HMM (determined by either ML or MAP classification), divided by the sum of all those probabilities over all labels. If, for any reason, every single HMM in the model calculates the probability of a given sequence as 0, the following steps are applied instead of the normal classification process:

• The label with the highest prior class probability is chosen as predicted label.

• If MAP classification is deactivated, the confidence for every label is defined as the guess probability (calculated as 1 divided by the number of possible labels).

• If MAP classification is activated, the confidence for every label is defined as the label's prior class probability (stored in the HashMap created during the learning process).
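The decision procedure described above, including the fallback for all-zero probabilities, can be sketched as follows. The likelihood and prior values are invented example numbers, and the confidence attributes created by the real model are omitted for brevity:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the prediction step of HMMClassificationModel: given the
// total output probability each labelled HMM assigns to a test sequence,
// choose a label by ML or MAP classification.
public class HmmClassification {
    static String predict(Map<String, Double> likelihoods,
                          Map<String, Double> priors, boolean map) {
        Map<String, Double> scores = new HashMap<>();
        double sum = 0.0;
        for (var e : likelihoods.entrySet()) {
            // MAP multiplies each likelihood with the label's prior probability.
            double score = map ? e.getValue() * priors.get(e.getKey()) : e.getValue();
            scores.put(e.getKey(), score);
            sum += score;
        }
        // Fallback: if every HMM returned 0, guess the label with the
        // highest prior class probability instead.
        if (sum == 0.0) {
            return priors.entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();
        }
        // Confidence would be score / sum for each label (not shown here).
        return scores.entrySet().stream()
            .max(Map.Entry.comparingByValue()).get().getKey();
    }

    public static void main(String[] args) {
        Map<String, Double> likelihoods = Map.of("pain", 1e-12, "no-pain", 4e-12);
        Map<String, Double> priors = Map.of("pain", 0.9, "no-pain", 0.1);
        System.out.println(predict(likelihoods, priors, false)); // ML chooses "no-pain"
        System.out.println(predict(likelihoods, priors, true));  // MAP chooses "pain" (higher prior)
    }
}
```

The example numbers show how ML and MAP classification can disagree: the "no-pain" HMM assigns the higher likelihood, but the prior of "pain" outweighs it under MAP.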

The implementation described above enables RapidMiner to perform classification tasks for examples containing labelled sequences of discrete observations. All requirements specified in chapter 3.1.1 are met by this approach. What is still missing is the possibility to process parallel observations occurring in the course of an observation sequence. Two different approaches to satisfying this additional requirement will be presented in the following chapter.


3.2. Approaches to handling parallel observations

The problem of parallel observations occurring during an observation sequence in the context of a HMM-based classification task can be solved in two ways: either by modifying the representation of the given data basis of observation sequences in order to fit the given implementation of the HMM-based classification system, or by modifying the implementation of the system itself in order to enable it to process parallel observations. Both approaches will be presented in the following.

3.2.1. Extending the alphabet of possible observations

The first approach follows the idea of Schmid et al. [SSS+12] and Siebers et al. [SSS+16]: whenever two or more AUs with the same onset time appeared in a sequence of AUs, they were treated as a self-contained symbol in the alphabet of possible symbols, ensuring that only one AU or AU compound exists for any given onset time found in the sequence. This approach can be generalized for any kind of observation sequence and applied to the concept of HMM-based classification. Each combination of parallel observations becomes a new symbol of the observation alphabet, therefore having its own probability value in the output probability distribution of every state within a HMM. This way, a fast and simple solution for the problem of parallel observations is provided, with the only requirement that the incoming example sets shall represent any combination of parallel observations as a single self-contained symbol in a consistent way; the implementation of the RapidMiner extension would not have to be changed in any way. The only direct impact of this procedure on the learned HMM-based classification system is an increasing size of the alphabet of possible observations that have to be considered, and thereby an increasing size of the output probability distributions belonging to the states of the single HMMs.
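The preprocessing assumed by this approach can be sketched as follows. This is a hypothetical helper, not part of the implementation described above; the thesis writes compound symbols like AU4-7, while this simplified encoding keeps the full component names:

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of the extended-alphabet approach: a set of parallel observation
// values sharing the same onset time is encoded as one self-contained
// compound symbol. Sorting makes the encoding canonical, so {AU7, AU4}
// and {AU4, AU7} map to the same alphabet symbol.
public class CompoundSymbols {
    static String encode(Set<String> parallel) {
        return String.join("-", new TreeSet<>(parallel));
    }

    public static void main(String[] args) {
        System.out.println(encode(Set.of("AU7", "AU4"))); // AU4-AU7
    }
}
```

After this encoding, the String-based classification system can be used unchanged: each compound string is simply one more symbol in the dynamically extracted alphabet.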

By applying this approach, a new observation symbol, created by combining several original symbols, becomes independent of all other symbols, even those it consists of - this means that the output probability of a combined observation AB cannot be derived in any way from the probabilities of the single observations A and B, as all three probability values develop independently: in the original implementation of the Baum-Welch-Algorithm, the occurrence of a given observation symbol during a training sequence will increase the probability of that symbol in every updated output probability distribution, which will automatically result in a decrease of all the other probability values within the distribution.

Consider the following example: a HMM is learned for the class pain from AU sequences labelled as pain sequences. Every time the combination of, for instance, AU4 and AU7 appears within a training sequence, the output probabilities of the new symbol AU4-7 will increase, while the output probabilities of all other symbols, including AU4 and AU7, will decrease automatically. Therefore, a HMM learned for the class pain from a training set containing many instances of the combined observation AU4-7, but few instances of AU4 and AU7, will assign a relatively greater probability to test sequences containing instances of the combined observation AU4-7 than to test sequences containing only instances of the single observations AU4 and AU7 - this implies that the combination of these two AUs is indicative of pain while the single AUs themselves are not.

Whether this independent development of the probability values of combined observation symbols is desirable or not depends on the premises about the underlying process of a given classification task. Assuming that a given combination of observations carries genuinely different semantics than the combined semantics of the single observations being part of the combination, it is reasonable to adopt the approach described above. On the other hand, assuming that a combination of observations being indicative of a certain class shall result in the single components being indicative of the given class as well, an alternative approach representing dependencies between combined observations and their components might be a better choice.

Every time a HMM-based classification model is faced with a test sequence containing an observation that is not covered by any HMM's output probability distributions (because that observation was not part of any of the training set's sequences), the predicted label as well as the confidence values for the given test sequence will have to be guessed instead of being calculated properly (this process is described in chapter 3.1.4). The bigger an alphabet of possible observations for a given application domain is, the more likely it is that a training set might not cover all possible observations, and that test sequences containing uncovered observations will have to be classified by guessing because of this. Treating combinations of observations as additional self-contained observation symbols will increase the size of an alphabet even more and therefore might increase the number of occurrences of this unwanted effect.

Consider a HMM trained for the class pain which is able to provide probability values for the following observations: AU4, AU7, AU10, AU4-7 and AU4-10; when confronted with a test sequence containing a new combination of parallel observations not considered during the training process, for example AU7-10, the model will not be able to calculate any probability for the sequence - a probability of 0 has to be returned and the class of the test sequence has to be guessed. Intuitively, one could think that this scenario might be avoided: as knowledge about the components of an unknown parallel observation is available, it should be possible to derive an estimation of the probability of the unknown combination based on the given information. The problem is that this is only possible in cases where the probability values of combinations of observations and the probability values of their single components are somehow interdependent.

With these considerations in mind, an alternative solution for the handling of parallel observations will be presented in the following. It will aim at creating dependencies between single observation symbols and their combinations, making it possible to calculate probability values for parallel observations dynamically - even for those combinations not covered by the training set.


3.2.2. Processing sets of observations

Every output probability distribution over a number of K possible observation values o_1, ..., o_K within a Hidden Markov Model is subject to the condition that the single probability values within the distribution always add up to 1 for any arbitrary state j:

    b_j(o_1) + b_j(o_2) + ... + b_j(o_K) = 1.    (3.1)

Using such a distribution implies that the only possible output of state j is exactly one of the given possible observation values - but when considering parallel observations, any arbitrary combination or set of the given observation values might be a possible output of the state as well and has to be included in the distribution:

    b_j(o_1) + b_j(o_2) + b_j(o_3) + ... + b_j(o_K) + b_j(o_1, o_2) + b_j(o_1, o_3) + ... + b_j(o_1, o_2, ..., o_K) = 1.    (3.2)

The approach described in the previous chapter provides a pragmatic solution to this problem by adding combinations of observations to the alphabet of possible observation values as self-contained symbols with their own probability values (therefore, a distribution as described in equation 3.1 is created again, only for a larger K; still, any possible output is only one observation value from the alphabet). This procedure is applied only for those combinations that actually appear in a given training set, assuming that considering those combinations will be sufficient for the processing of any future test sequence from the same application domain as well. Even if the approach would attempt to consider all possible combinations regardless of the training set and initialize them with certain probability values, the training procedure would result in a probability of 0 for every observation not appearing in the training set, as the numerator in equation 2.23 will always become 0 in this case.

The alternative approach described in the following attempts to consider all combinations of parallel observations for any given alphabet of observation values, regardless of their appearance in the training set. This is done by modelling dependencies between combinations of observations and their single components. The basic premise of this approach is that these dependencies really exist in the natural process a HMM is supposed to model. Whether this assumption holds might depend on the application domain and therefore will have to be evaluated on a case-by-case basis.

A probability distribution as described in equation 3.1 implies that, if one observation value is chosen as output value, the other observation values cannot be part of the output any more - but as the existence of parallel observations shows, several observation values can be outputted at the same time. Therefore, the definition of an observation O at a given point in time t will be redefined: instead of treating the observation as an entity that equals a single observation value (O_t = o_k), it will be thought of as a set that contains a given observation value (o_k ∈ O_t) - this way, any additional observation value can become part of the set, no matter how many other values are already part of it.

In order to model this situation, every possible observation value o_k of an observation alphabet is assigned a probability value for being part of an observation O_t: P(o_k ∈ O_t).


It will be assumed that this probability value is independent of whether any other observation value o_x is already part of the observation or not:

    P(o_k ∈ O_t | o_x ∈ O_t) = P(o_k ∈ O_t).    (3.3)

In order to integrate this idea into the HMM concept, the structure of the HMM representation is slightly redefined: the values stored in the matrix of output probabilities B shall indicate whether an observation value belongs to an observation instead of indicating whether that observation value is the observation - therefore, B as described in equation 2.6 is redefined as:

    B = {b_jk | b_jk = P(o_k ∈ O_t | S_t = j)}.    (3.4)

In this context, a comprehensive probability distribution over the single observation values as described in equation 3.1 is not sufficient any more in order to determine the probability of a given observation. According to the classic HMM concept, the question "Which observation value represents the given observation?" is asked for the complete alphabet of possible values; but for the new concept, the question "Is this observation value a part of the given observation set?" has to be asked separately for every single observation value within the alphabet. The alternative scenario to an observation value appearing in an observation is not another observation value appearing (as both can happen in parallel), but the very same observation value not appearing instead. In order to correctly represent this in terms of probability distributions, a separate probability distribution has to be defined for every possible observation value o_k, only considering the cases o_k ∈ O_t (termed occurrence probability in the following) and ¬(o_k ∈ O_t) (termed non-occurrence probability):

    P(o_k ∈ O_t) + P(¬(o_k ∈ O_t)) = 1.    (3.5)

With the new definition of B as in equation 3.4, the value for P(o_k ∈ O_t) can be easily accessed via b_j(o_k). As every probability distribution defined for an individual observation value consists of only two elements, the value for P(¬(o_k ∈ O_t)) can easily be calculated as well:

    P(¬(o_k ∈ O_t)) = 1 − b_j(o_k).    (3.6)

The result of the redefinition of B via equation 3.4 is that the condition stated in equation 3.1 does not hold any more, because every b_j(o_k) now represents a value in an independent probability distribution that only correlates with its associated non-occurrence probability but not with the other b_j(o_k) values. By considering all possible combinations of the values represented in the separate probability distributions, a probability distribution as described in equation 3.2 shall be established instead.

Previously, the probability of an observation, b_j(O_t), was calculated by simply reproducing the probability value that is stored for the observed observation value o_k in the matrix of output probabilities B:

    b_j(O_t) = b_j(o_k | O_t = o_k).    (3.7)

Now that every observation value o_k has its own probability distribution consisting of the occurrence probability P(o_k ∈ O_t) and the non-occurrence probability P(¬(o_k ∈ O_t)), the output probability of an observation b_j(O_t) has to be calculated by considering every possible observation value o_k one by one and integrating the probability value for its occurrence or non-occurrence into the final probability: whenever the value is part of the observation, the occurrence probability accessed as b_j(o_k) has to be considered; when it is not part of the observation, the non-occurrence probability calculated as 1 − b_j(o_k) has to be considered instead. Therefore, the probability of an observation is the product of all occurrence probabilities of the values contained in the observation and all non-occurrence probabilities of the values not contained, calculated for a given observation alphabet consisting of K possible observation values:

    b_j(O_t) = ∏_{k=1}^{K} { b_j(o_k) if o_k ∈ O_t;  1 − b_j(o_k) otherwise }.    (3.8)
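Equation 3.8 translates directly into code. The following sketch, with an invented three-symbol alphabet and invented occurrence probabilities, computes the output probability of an observation set:

```java
import java.util.Map;
import java.util.Set;

// Sketch of equation 3.8: the probability of an observation set is the
// product of the occurrence probabilities b_j(o_k) of all values contained
// in the set and the non-occurrence probabilities 1 - b_j(o_k) of all
// values of the alphabet not contained in it.
public class SetObservationProbability {
    static double probability(Set<String> observation, Map<String, Double> b) {
        double p = 1.0;
        for (var entry : b.entrySet()) {
            double bk = entry.getValue();
            p *= observation.contains(entry.getKey()) ? bk : 1.0 - bk;
        }
        return p;
    }

    public static void main(String[] args) {
        // Invented occurrence probabilities for one state j.
        Map<String, Double> b = Map.of("AU4", 0.5, "AU7", 0.5, "AU10", 0.5);
        System.out.println(probability(Set.of("AU4", "AU7"), b)); // 0.125
    }
}
```

Note that the loop always runs over the whole alphabet, not only over the values contained in the observation - the cost discussed later in this chapter.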

The Forward Algorithm and the Backward Algorithm described in chapter 2.2.3 can remain unchanged in their structure as long as they implement the new calculation of an observation's probability b_j(O_t) as described in equation 3.8. The Baum-Welch-Algorithm can calculate the π and a_ij values (equations 2.22 and 2.21) just as before, as those calculations are not affected by the new output probability distributions; the b_j(o_k) values (originally calculated as described in equation 2.23) have to be calculated by a slightly different procedure though:

    b_j(o_k) = Σ_{t=1}^{T} P(S_t = j, o_k ∈ O_t | O, λ) / Σ_{t=1}^{T} P(S_t = j | O, λ)
             = Σ_{t: o_k ∈ O_t} P(S_t = j | O, λ) / Σ_{t=1}^{T} P(S_t = j | O, λ)
             = Σ_{t: o_k ∈ O_t} γ_t(j) / Σ_{t=1}^{T} γ_t(j).    (3.9)
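The re-estimation in equation 3.9 can be sketched for a single state j as follows; the γ values and observation sets are invented example numbers:

```java
import java.util.List;
import java.util.Set;

// Sketch of the updated Baum-Welch re-estimation of an occurrence
// probability b_j(o_k) (equation 3.9): sum the gamma values of state j
// over all points in time where the observation set contains o_k, and
// divide by the sum of all gamma values of state j.
public class OccurrenceReestimation {
    static double reestimate(String ok, List<Set<String>> sequence, double[] gamma) {
        double numerator = 0.0, denominator = 0.0;
        for (int t = 0; t < sequence.size(); t++) {
            if (sequence.get(t).contains(ok)) numerator += gamma[t];
            denominator += gamma[t];
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        List<Set<String>> sequence = List.of(
            Set.of("AU4"), Set.of("AU4", "AU7"), Set.of("AU10"), Set.of("AU7"));
        double[] gamma = {0.5, 0.25, 0.125, 0.125}; // invented gamma_t(j) for one state j
        System.out.println(reestimate("AU4", sequence, gamma)); // 0.75
    }
}
```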

While previously the numerator of the equation was calculated by summing up all γ values for every point in time where an observation equalled the examined observation value, now all γ values for every point in time where an observation includes the given value are summed up instead.

By realizing this new concept, the probability of every possible combination of occurrences and non-occurrences over a given alphabet of possible observation values can be calculated. As all probability values of parallel observations can be calculated based on the probability values of single observation values, even the probabilities of combinations never appearing in the training set can be derived from the given information this way. It should be noted that, by applying this approach, observations consisting of a single observation value are treated as sets of observations as well, calculated as the occurrence probability of the single appearing value multiplied with the non-occurrence probabilities of all other possible values. For example, given the observation alphabet {AU4, AU7, AU10}, the probability of the observation AU4, which simply would have been calculated as b_j(AU4) by the classic approach, is now calculated as b_j(AU4)(1 − b_j(AU7))(1 − b_j(AU10)). This example demonstrates a major disadvantage of the approach, as the algorithms calculating probabilities for given observations have to iterate through the whole alphabet of possible observation values every time a new observation probability has to be determined - even for the simplest observations containing only one value. For large observation alphabets, this might result in a considerable increase of required computation time.
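With invented occurrence probabilities for one state j, the calculation for the single observation AU4 works out as follows:

```latex
% Assumed example values: b_j(AU4) = 0.6,  b_j(AU7) = 0.2,  b_j(AU10) = 0.1.
b_j(\{\mathrm{AU4}\})
  = b_j(\mathrm{AU4})\,\bigl(1 - b_j(\mathrm{AU7})\bigr)\,\bigl(1 - b_j(\mathrm{AU10})\bigr)
  = 0.6 \cdot 0.8 \cdot 0.9
  = 0.432
```

Under the classic approach, the same observation would simply have received the probability 0.6.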

As the values within the binary probability distribution of each single observation value add up to 1, the combined probabilities of all possible combinations of these values add up to 1 as well. One problem that arises is that the combination of all non-occurrence probabilities over the alphabet of possible observation values is a valid combination too, resulting in an empty observation set containing no values at all. Therefore, the condition stated in equation 3.2 can only be met if the probability for an empty set ∅ of observations is included as well:

    b_j(∅) + b_j(o_1) + b_j(o_2) + b_j(o_3) + ... + b_j(o_K) + b_j(o_1, o_2) + b_j(o_1, o_3) + ... + b_j(o_1, o_2, ..., o_K) = 1.    (3.10)

In order to avoid the problem of an empty set of observations being possible, it would be necessary to define a constraint on the generation of observations that does not allow the creation of an empty observation set. The values produced for all other possible combinations would have to be scaled dynamically at every point in time in order to satisfy the condition in equation 3.2 again. However, realising such a procedure is not trivial and would go beyond the scope of the present thesis.

An alternative solution to this problem would be to simply assume that the possibility of an empty observation set does not affect the concept adversely. It could even be assumed that a real process, creating a sequence of observations based on the succession of invisible internal states, might indeed go through a state at some point without producing any perceivable observation - therefore two consecutive observations only seem to be the product of two consecutive states although they are not. In this admittedly very hypothetical scenario, a HMM allowing the creation of empty observation sets might even be able to model the examined process more accurately than a HMM without this option. In the end, this theory of the existence of intermediate states creating no observations can neither be verified nor falsified, as ultimately the sequences of observations are the only thing that is perceivable in a context where HMMs are applied.

Either way, it seems acceptable to permit the theoretical possibility of empty observation sets in the present thesis, as it does not affect the basic functionality of the presented concept, especially not in the case of the given classification task: the training and test sequences to be considered do not contain any empty observation sets in their given representation. Nevertheless, the general problem described here should be kept in mind for future considerations.

As the approach described in this chapter requires some changes in the basic concept of HMMs, the implementation of the concept presented in chapter 3.1.4 will have to be extended as well in order to realize the new functionality. These extensions will affect certain classes contained in the jahmm library and the RapidMiner extension. As the original functionality of the implementation shall remain unchanged in order to still represent the classic approach of HMM-based classification, all new aspects of functionality will be implemented in separate classes, extending the original ones and adding or overwriting all necessary methods.

Extending the jahmm classes

The original jahmm classes are only able to process single instances of the Observation class (or lists of these instances representing observation sequences). The main purpose of the new additional classes will be the implementation of a similar functionality able to process sets of observations instead of single observations; lists of sets of observations will represent observation sequences containing parallel observation symbols:

OpdfParallel This interface extends the interface Opdf and adds new methods that have to be provided by the implementing classes: the required methods include the probability calculation for sets of parallel observations and the random generation of such sets.

OpdfIntegerParallel This class extends the original class OpdfInteger and implements the interface OpdfParallel: the required methods for processing and generating sets of parallel observations are realized for the type ObservationInteger. The probability values for the possible observations are still stored in an array of double values, but while they represented one single distribution of probabilities in the original OpdfInteger class (and therefore always had to add up to 1.0), this time every value represents an independent probability distribution for one possible observation: the value itself represents the occurrence probability of that observation, while the non-occurrence probability is implicitly given and can be accessed by subtracting the given value from 1.0.

OpdfStringParallel This class extends the class OpdfString and implements the interface OpdfParallel: it is based on a mapping of String values onto an OpdfIntegerParallel. Therefore it is able to process sets of instances of the class ObservationString which represent parallel discrete observations.

OpdfIntegerParallelFactory and OpdfStringParallelFactory These classes implement the original interface OpdfFactory and serve the purpose of automatically creating the two new types of probability distributions as described above. While other OpdfFactory implementations create distributions with values that are uniformly distributed, these implementations create a number of randomly chosen values between 0.0 and 1.0: the single values each represent an individual probability distribution and therefore are not directly interdependent any more.

HmmParallel This class extends the jahmm class Hmm and represents the slightly updated version of a Hidden Markov Model, now being able to process parallel observations given as sets of Observation values. The constructor of the HmmParallel only accepts those instances of Opdf that also implement the interface OpdfParallel. This makes sure that the represented HMM is able to calculate probabilities for observation sequences containing parallel observations according to the concept described in this chapter.

ForwardBackwardCalculatorParallel The class ForwardBackwardCalculator is extended by this class in order to enable it to handle parallel observations when computing the α- and β-values and the total output probability of a given sequence of observations: only instances of HmmParallel are accepted in order to ensure the updated probability calculation for parallel observations as described in equation 3.8. Single observations are wrapped in sets of observations consisting of one element in order to be calculated according to the new implementation.

BaumWelchLearnerParallel This class extends the class BaumWelchLearner in order to enable it to handle parallel observations during the learning process - the training of observation probabilities is updated according to equation 3.9. It only learns instances of HmmParallel and only takes sequences of sets of observations as training sequences, although sequences of single observations can be processed by simply wrapping each observation in a set containing only one element.

Extending the RapidMiner extension

After the jahmm library is extended with appropriate classes that process sets of observations, the RapidMiner extension described in chapter 3.1.4 can be extended as well in order to integrate this updated HMM concept into the RapidMiner infrastructure. Most of the altered classes maintain their basic functionality, with the key difference that they are able to process sets of observations instead of single observations within the training and test sequences:

LabeledParallelHmm This class works like the LabeledHmm, with the difference that it wraps a HmmParallel and its associated label instead of a normal Hmm.

LabeledParallelObservationSequence This class wraps a list of sets of observations and an associated label, therefore realizing the LabeledObservationSequence for parallel observations.

RandomHmmCreatorPlus, SingleHmmLearnerPlus, MultiHmmLearnerPlus For the purpose of processing sets of parallel observations, the classes RandomHmmCreator, SingleHmmLearner and MultiHmmLearner are replaced by these three classes that implement the same functionality, with the difference that they use lists of sets of observations as training sequences instead of normal lists of observations. Within these classes, all instances of the classes OpdfString, Hmm, BaumWelchLearner, LabeledHmm and LabeledObservationSequence are replaced by their counterparts able to process parallel observations.


HMMLearnerParallel This class realizes a RapidMiner operator for processing observation sequences containing sets of parallel observations. It extends the original operator HMMLearner. The updated helper class InformationExtractorPlus separates observation sequences in two steps: in the first step, the single elements of an observation sequence held in the specified attribute are extracted just as in the original implementation - but instead of treating each element as a self-contained symbol, the elements are split up again into their single components at every occurrence of a given character.⁸ This way, parallel observations are represented as sets of their single components. The functionality of the operator and the parameters that can be configured remain the same, except that sets of observations are processed instead of single observations and that the updated classes described above are used during the learning process instead of the original classes. The learned model is an instance of the class HMMClassificationModelParallel.

HMMClassificationModelParallel This class provides the same functionality as the original HMMClassificationModel, with the only difference that sets of observations are processed instead of single observations.

The updated implementation of the RapidMiner extension now provides two different operators: the Hidden Markov Model operator realizes the classic HMM concept and is only able to process parallel observations when they are given as self-contained symbols in the representation of a given data basis. The Hidden Markov Model (Parallel Observations) operator instead splits parallel observations into sets of their components and processes them according to the updated HMM approach described in this chapter. Both operators will be evaluated in the following chapter.

⁸ In the given implementation, this is the hyphen symbol.


4. Evaluation of the HMM-based classification system

The implementation described in chapter 3 makes it possible to integrate HMM-based classification processes into the structure of RapidMiner. Two different operators can be chosen to carry out the procedure of learning a HMM-based classification model, each of them realizing a different approach of processing parallel observations. Both of these approaches have their advantages and disadvantages; in order to measure their effectiveness, they have to be evaluated on the basis of a concrete classification task. This task will be the classification of facial expressions decoded as sequences of Action Units (containing parallel observations in the form of sets of AUs with the same onset time). The evaluation described in the following addresses three objectives:

1. The two different approaches for handling parallel observations shall be compared to each other in order to find out which one performs better when applied to a concrete classification task.

2. The general suitability of a HMM-based classifier for the task of facial expression classification shall be evaluated. This might provide insight into the relevance of sequential information for the correct classification of facial expressions.

3. Finally, the impact of different representations of the data basis on the performance of the classifier shall be evaluated: alternative representations of the given facial expressions include the highlighting of potentially pain-relevant AUs by generalizing or even removing all other AUs declared as pain-irrelevant.

The evaluation will be executed by applying HMM-based classification in its classical as well as its updated approach to a given set of AU sequences. The procedures of performance measurement and parameter optimization will be provided by the RapidMiner infrastructure. The test setup is described in detail in the following; afterwards, the results will be presented.

4.1. Test setup

The evaluation was realized in RapidMiner Studio¹ version 7.4. The extension described in chapter 3 was implemented using the RapidMiner extension template. In this part of the chapter, the given data basis used for the evaluation will be presented, followed by a description of the composition of the evaluation processes.

¹ https://rapidminer.com/products/studio/


4.1.1. The data basis and its representations

The sequences of AUs that are used as examples in the evaluation described in the present paper are extracted from a data basis that was created in the course of a study conducted by Karmann et al. [KML+16]. The data basis contains sequences of facial expressions recorded during the induction of pain, disgust or no given impression (treated as neutral sequences). The facial expressions were manually decoded via the Facial Action Coding System. For every sequence, the following details were recorded:

• Every occurrence of an AU within a sequence is represented as one event including the name of the AU, its onset time relative to the beginning of the sequence, its duration of occurrence and its intensity (for those AUs for which the intensity can be measured).

• The sentiment that was induced during a sequence is given as the sequence's induction, which can either be pain, disgust or neutral (meaning no specific sentiment was induced).

• For the subject the sequence was recorded for, personal details were captured containing the person's gender, age and height.

Representation

The representation of the sequences used for the evaluation was extracted from the data basis via SQL queries: for every sequence, the AUs were grouped by their onset time, and each of those groups was transformed into a single String, representing an observation at a given point in time. If two or more AUs were part of a group, they were ordered by their number and separated by hyphens within the String, representing a combination of parallel observations. Afterwards, the created String values representing AUs and AU compounds were again transformed into a single String for every sequence, ordered by their onset time and separated by blank spaces. The resulting String values were stored in an attribute representing a complete AU sequence for every record. Two additional attributes contain the ID of the sequence and the induction that is associated with the sequence, making every entry a labelled observation sequence that can be used in a classification context. Additional details like an AU's duration and intensity or a subject's personal data were excluded from the representation, as only the succession of AUs was supposed to be focussed on.
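The resulting format can be illustrated with a minimal parser: time steps are separated by blank spaces, parallel AUs within a time step by hyphens. The class name below is hypothetical; in the thesis, this two-step separation is part of the helper class InformationExtractorPlus.

```java
// Illustrative parser for the sequence representation described above.
// "1-2-4 6 4-7" denotes three time steps: the AU compound {1, 2, 4},
// the single AU 6, and the compound {4, 7}.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AuSequenceParser {
    public static List<List<String>> parse(String sequence) {
        List<List<String>> result = new ArrayList<>();
        // First split: blank spaces separate the observations (time steps).
        for (String step : sequence.trim().split("\\s+")) {
            // Second split: hyphens separate parallel AUs within one step.
            result.add(Arrays.asList(step.split("-")));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("1-2-4 6 4-7")); // [[1, 2, 4], [6], [4, 7]]
    }
}
```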

The resulting table includes the three attributes sequence id, au sequence and induction name. It contains 259 entries, each of them representing a labelled observation sequence of AUs and AU compounds. On this basis, several variations of the representation of the example set were created; they will be presented in the following.

Pain-relevant Action Units

The third objective of the evaluation consists of investigating the impact of different representations of the example set on the results of the evaluation process. This approach follows the idea of Siebers et al. [SSS+16]: during a classification task including sequences of AUs, two different representations of the alphabet of possible observation values were compared to each other. The first version featured all possible AUs and AU compounds detected in the original data basis without alteration; the second version determined a number of AUs that the literature concerned with pain assessment found to be relevant for the classification of pain - those AUs were highlighted in the alphabet of possible observation symbols by replacing every other AU with a generalized "wild card" Action Unit I, leaving only the pain-relevant AUs unchanged.

The present thesis will apply the same approach by comparing the original alphabet of possible observation values with a reduced alphabet containing only potentially pain-relevant AUs plus one wild card AU replacing all other possible AUs. A third alternative will remove the pain-irrelevant AUs completely instead and therefore result in a representation exclusively consisting of pain-relevant AUs.

The set of AUs that are labelled as pain-relevant will be taken from the classic study conducted by Prkachin [Prk92], who identified the following six AUs as universally indicative for pain: AU4 (Brow lower), AU6 (Cheek raise), AU7 (Lid tighten), AU9 (Nose wrinkle), AU10 (Upper lip raise) and AU43 (Eyes close). These findings were confirmed by Kunz et al. [KSH+07], who likewise identified five out of these six AUs (AU4, AU6, AU7, AU9 and AU10) as pain-relevant. For the sake of completeness, the evaluation described in the present paper will be based on the original set containing all six values.

The three alternative representations that are to be compared to each other in the course of the evaluation are stored in three different CSV files:

• The file Standard.csv contains the original 259 sequences of AUs and AU compounds, including all possible AUs without any modification, with their associated ID and label.

• The file Replace.csv contains the given 259 sequences after replacing every occurrence of an AU that does not belong to the set of pain-relevant AUs with the wild card AU I. If two or more pain-irrelevant AUs are contained in an AU compound, each of them is replaced by a separate instance of AU I - for example, the AU compound 1-2-4 would become I-I-4.

• The file Reduce.csv contains the result of removing every AU from the representation that does not belong to the set of pain-relevant AUs. This results in some sequences not containing a single AU any more - those sequences are removed from the representation, leaving a set of 155 remaining labelled sequences in the table.
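The two transformations can be sketched as follows. This is a simplified illustration operating on single AU compounds (the class and method names are ours); the pain-relevant set is the one taken from Prkachin as described above.

```java
// Hedged sketch of the two alternative representations: replacing
// pain-irrelevant AUs with the wild card "I" (Replace.csv) or dropping
// them entirely (Reduce.csv). Pain-relevant AUs per Prkachin [Prk92]:
// AU4, AU6, AU7, AU9, AU10, AU43.
import java.util.Set;
import java.util.StringJoiner;

public class AuFilterSketch {
    static final Set<String> PAIN_RELEVANT = Set.of("4", "6", "7", "9", "10", "43");

    /** Replace variant: "1-2-4" becomes "I-I-4". */
    static String replaceIrrelevant(String compound) {
        StringJoiner out = new StringJoiner("-");
        for (String au : compound.split("-")) {
            out.add(PAIN_RELEVANT.contains(au) ? au : "I");
        }
        return out.toString();
    }

    /** Reduce variant: "1-2-4" becomes "4"; "1-2" disappears entirely. */
    static String reduceIrrelevant(String compound) {
        StringJoiner out = new StringJoiner("-");
        for (String au : compound.split("-")) {
            if (PAIN_RELEVANT.contains(au)) out.add(au);
        }
        return out.toString();
    }
}
```

In the Reduce variant, compounds (and eventually whole sequences) can collapse to nothing, which is why that representation ends up with fewer entries.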

Minimal sequence size

The example sets as they are given in the three files described above cannot be processed by the RapidMiner operators learning HMM-based classification models yet: the reason is that some of the contained sequences consist of only one observation (or combination of parallel observations). When the class BaumWelchLearner (or its extension BaumWelchLearnerParallel), implementing the procedure of training a HMM according to a given training set, is confronted with a training sequence that consists of only one element (or less), it throws an IllegalArgumentException stating that the observation sequence is too short. This makes sense, as the ξ values, which are important for updating the state transition probabilities of a HMM, can only be calculated for every point in time up to T − 1. The value T represents the length of a sequence, measured in time steps; for the last point in time T, no more state transitions can be measured, as the sequence ends at this point, so it is excluded from the calculation of the ξ values. A sequence containing only one observation has a length of T = 1 - therefore, no ξ values can be calculated by the Baum-Welch algorithm at all, and the learning process cannot be executed properly.

To avoid this problem, all sequences consisting of only one AU or AU compound are removed from the given representations of the data basis. As the present thesis investigates the relevance of sequential information, which cannot be present in sequences containing only one element, these sequences are dispensable for the given classification task either way. This results in three new files:

• The file Standard MinimalSize.csv contains all sequences from the standard representation that consist of two or more AUs or AU compounds, resulting in a number of 170 sequences.

• The file Replace MinimalSize.csv contains the same 170 sequences consisting of more than one observation, with all pain-irrelevant AUs being replaced by the wild card AU I.

• The file Reduce MinimalSize.csv contains all sequences from the original example set that still contain two or more observations after all pain-irrelevant AUs have been removed from the sequences, which results in a number of 77 sequences.
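The filtering step itself is simple and can be sketched as follows (illustrative code with hypothetical names, not the thesis implementation):

```java
// Minimal-size filter: sequences with fewer than two time steps cannot be
// used by Baum-Welch training (no xi values can be computed for T = 1),
// so they are dropped beforehand. A sequence's length in time steps is the
// number of whitespace-separated AU compounds it contains.
import java.util.List;
import java.util.stream.Collectors;

public class MinimalSizeFilter {
    public static List<String> filter(List<String> auSequences) {
        return auSequences.stream()
                .filter(s -> s.trim().split("\\s+").length >= 2)
                .collect(Collectors.toList());
    }
}
```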

Three-class vs. two-class problems

The given representations of the data basis differentiate between the three classes pain, disgust and neutral. Whether this differentiation considering all three classes is necessary depends on the application domain the learned classification model is supposed to be used in: a system with the purpose of detecting any arbitrary sentiment a person is expressing would have to differentiate between a wide variety of possible classes; however, a system designed for the exclusive purpose of pain detection in a person's facial expressions would only have to differentiate between pain and anything else that is not pain. As pain detection is a fundamental issue of the present thesis, the latter case shall be considered as well and therefore be covered by the given set of representations. Three alternative representations are created that aggregate the classes disgust and neutral into the more general class no pain:

• The file Standard MinimalSize TwoClass.csv contains the same sequences as the file Standard MinimalSize.csv but aggregates the classes disgust and neutral into the class no pain.


                                 total   pain   disgust   neutral   no pain
Standard                           259     54       123        82         -
Standard MinimalSize               170     36        87        47         -
Standard MinimalSize TwoClass      170     36         -         -       134
Replace                            259     54       123        82         -
Replace MinimalSize                170     36        87        47         -
Replace MinimalSize TwoClass       170     36         -         -       134
Reduce                             155     40        92        23         -
Reduce MinimalSize                  77     26        44         7         -
Reduce MinimalSize TwoClass         77     26         -         -        51

Table 4.1.: Distribution of classes within the different representations of the data basis

• The file Replace MinimalSize TwoClass.csv realizes the same procedure for the file Replace MinimalSize.csv.

• The file Reduce MinimalSize TwoClass.csv defines a two-class problem for the original file Reduce MinimalSize.csv.

Finally, a total of nine possible representations is available for the given data basis. The distribution of classes within these representations is presented in table 4.1. As three of these representations (Standard, Replace and Reduce) contain observation sequences consisting of only one AU or AU compound and therefore cannot be processed by the RapidMiner operators realizing HMM-based classification, only the remaining six representations will be considered in the process buildup presented in the following.

4.1.2. Process buildup

Each one of the six chosen representations of the data basis will be integrated into a separate RapidMiner process. The combination of the operators that make up the processes, the chosen values for parameter optimization, and the performance criteria that shall be applied will be presented in the following.

Operator combination

Every process starts with the import of the data set to be evaluated via a Retrieve operator. As no attribute in the data set is associated with a special role at this point, the Set Role operator is used to assign the role Label to a given attribute in order to enable classification processes in the first place; for every data set, this is done for the attribute induction name. The next step is the Optimize Parameters (Grid) operator, which receives the data set as input; the combinations of parameter values chosen for this operator will be described below.

Within the operator optimizing the parameters, a Cross Validation operator is located, splitting the data set into folds used for the training and testing procedures. A Log operator stores the values that are used for the HMM parameters number of states and iterations, as well as the resulting performance value (main criterion only) and its deviation, for every complete execution of the Cross Validation operator (i.e. for every possible combination of tested parameter values).

                                 action units        parallel observations   classes
Standard                         all                 extended alphabet       3
Standard NewOperator             all                 observation sets        3
Standard TwoClass                all                 extended alphabet       2
Standard NewOperator TwoClass    all                 observation sets        2
Replace                          pain-relevant + I   extended alphabet       3
Replace NewOperator              pain-relevant + I   observation sets        3
Replace TwoClass                 pain-relevant + I   extended alphabet       2
Replace NewOperator TwoClass     pain-relevant + I   observation sets        2
Reduce                           pain-relevant       extended alphabet       3
Reduce NewOperator               pain-relevant       observation sets        3
Reduce TwoClass                  pain-relevant       extended alphabet       2
Reduce NewOperator TwoClass      pain-relevant       observation sets        2

Table 4.2.: Overview of the RapidMiner processes used for evaluation

The testing procedure within the cross validation is executed by the Apply Model operator, which predicts a label and calculates the confidence values for every example in the test set. The updated test set containing the new values is evaluated by the Performance (Classification) operator according to given criteria described below; this operator outputs a performance vector representing the results of the evaluation.

The training process is executed by a learning operator. For HMMs, this operator is integrated into the program via the RapidMiner extension described in chapter 3. According to the two approaches of handling parallel observations described in chapter 3.2, two possible operators exist: the Hidden Markov Model operator realizes the classic approach of HMM learning, while the Hidden Markov Model (Parallel Observations) operator realizes the updated concept. Both operators shall be tested for all data sets; therefore, every given data set is combined with each of the two HMM learning operators in a separate process. This results in a total of 12 possible combinations, described in table 4.2.²

For every process, the delivered output values are the optimal combination of parameters according to the given performance criterion and the associated performance vector, learned HMM, original example set and updated example set.

² The name of each process resembles the name of the data set that is used: the name component MinimalSize was removed, as it is identical for all data sets that are used for the evaluation; for those processes using the operator Hidden Markov Model (Parallel Observations), which realizes the updated HMM concept, the name component NewOperator is added - all other processes use the Hidden Markov Model operator realizing the classic HMM concept.


Chosen parameters

For the operators nested within the Parameter Optimization (Grid) operator, parameter values can be tested for a given range or set as fixed constants. The chosen parameter values are described in the following.

For the operator Cross Validation, only fixed parameter values are chosen. The number of folds into which the example set is divided is set to 10, which is a standard value for machine learning processes. The random distribution of the examples across the folds will be initialized with a local random seed; this way, the training sets that are available to the sub-processes learning new HMMs are identical for every parameter combination, removing any positive or negative effects of particularly advantageous or disadvantageous training sets on the performance of the learned model. The local random seed is chosen as the default value 1992.

For both operators realizing the learning of a HMM-based classification model, a mixture of fixed and varying parameter values is chosen. The following parameters are fixed and therefore are not part of the parameter optimization:

sequence attribute name The value that is chosen for this parameter is always the attribute au sequence.

MAP classification This option is always switched off: Maximum A Posteriori classification should only be applied in cases where the distribution of classes within a given training set resembles the distribution of classes within the application domain in which the learned system is supposed to be used. It cannot be determined whether this is the case in a scenario considering facial expressions of pain, as it is unknown how high the share of pain expressions in the entirety of facial expressions encountered in a given context will be.

use local random seed This option is always switched on: the performance of a HMM depends to a large extent on the values that are used during its initialization. As no specific values can be estimated for the given application scenario, they have to be chosen randomly. In order to remove positive or negative effects of advantageous or disadvantageous random starting values, every single HMM within a classification process shall be initialized with the same values.

local random seed For the 12 original processes, the default value (1992) is chosen.

For the following HMM parameters, varying parameter values will be tested in the course of the parameter optimization in order to find the best combination of values:

number of states The optimal size of a HMM (given as the number of states) is hard to estimate without any further information about the underlying natural process that has to be modelled; in many cases, only a trial and error approach can be applied [RJ86]. The minimum number of states to be tested shall be 1. In a HMM consisting of only 1 state, an observation has the exact same output probability at any given point in time, so no temporal information is modelled at all. If 1 were chosen as the optimal number of states, this would imply that, contrary to the hypothesis of the present paper, sequential information is irrelevant for classifying sequences of AUs, which would make HMMs an inadequate classifier in this context. The maximum number of states to be tested can be chosen according to either one of the following two criteria:

1. The first approach chooses the maximum number of states as the length of the longest observation sequence in the training set, given as the number of AUs or sets of AUs appearing at different points in time throughout the sequence. This way, no sequence has to visit the same state twice. However, just because a number of states represents the single steps of the longest sequence in an optimal way, it is not ensured that a subset of these states represents the single steps of other, shorter sequences in an optimal way as well; additionally, test sequences of a greater length may arise. Nevertheless, this approach provides a reasonable estimate of the maximum number of states which allows the processes to be executed in an acceptable processing time.

2. An alternative approach chooses the maximum number of states as the size of the observation alphabet of a given classification task. This approach is more exhaustive, as it is possible that every observation is associated with its own state; this way, the HMM might end up modelling a Markov Chain Model, where every state is associated with exactly one value, and where only the state transition probabilities model the stochastic process. For an extended observation alphabet including parallel observations, the maximum number of states can be chosen according to its size; in the case of parallel observations modelled as sets, the situation is more complex: the alphabet itself only consists of single observation values, but in order to model every possible output in a separate state, every possible combination of these values of arbitrary length would have to be considered. For N possible observation values, this leads to 2^N − 1 possible combinations, which would result in an unacceptable increase in computation time even for smaller alphabets.

The possible values to be considered as the maximum number of states can be found in table 4.3. The 12 basic processes of the evaluation will realize the first approach based on the length of the longest sequence in order to keep the computation time within acceptable limits. The alternative approach based on the size of the observation alphabet will only be considered in cases where the performance of the process constantly increases with a growing number of states and does not reach a local maximum before the highest number of states is evaluated.

Every process shall consider 8 possible numbers of states in the initial design. For the processes based on the Standard and the Replace representation of the data basis, the numbers reach from 1 to 12;³ for processes based on the Reduce representation, they reach from 1 to 8 instead.⁴

³ Scale: quadratic; values: 1, 2, 3, 4, 5, 7, 9, 12.
⁴ Scale: linear; values: 1, 2, 3, 4, 5, 6, 7, 8.


                       alphabet size   possible combinations   longest sequence
Standard                          70                       -                 12
Standard NewOperator              31              2147483647                 12
Replace                           28                       -                 12
Replace NewOperator                7                     127                 12
Reduce                            14                       -                  8
Reduce NewOperator                 6                      63                  8

Table 4.3.: Alphabet size, possible combinations and length of the longest sequence

iterations As equation 2.16 states, every iteration of the Baum-Welch algorithm leads to a HMM that resembles the given training set better, or at least not worse, than before. In this regard, a high number of iterations is desirable; on the other hand, a HMM resembling the training set too well might not be able to generalize to unseen test sequences which it is supposed to resemble as well - this is the overfitting effect described in chapter 2.2.1. In order to avoid this problem, a wide variety of possible numbers of iterations shall be tested in the evaluation processes. Every process shall consider 16 possible values between 1 and 500.⁵

Performance criterion

The performance of a learned model can be measured by applying different criteria. The main criterion, according to which the "best" model is chosen, has to be determined in the Performance (Classification) operator. The standard performance vector produced by this operator already contains the following measures:

accuracy This value describes the percentage of examples in the example set that were correctly classified by the model. It is calculated by dividing the number of correctly classified examples by the total number of examples in the example set.

precision This value is calculated for every class C and measures how many of the examples predicted as belonging to the class actually belong to that class. It is calculated as the number of examples correctly classified as belonging to class C, divided by all examples that were classified as belonging to class C (no matter whether correctly or not).

recall This value is calculated for every class C and measures how many examples of a test set belonging to the class were actually recognised as belonging to the class. It is calculated as the number of examples correctly classified as belonging to class C, divided by all examples in the example set belonging to class C.
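As an illustration of the three measures, consider the following sketch on a small three-class confusion matrix. The numbers are invented for the example and do not stem from the evaluation; rows are the true classes, columns the predicted classes.

```java
// Worked example (ours, not from the thesis) of accuracy, precision and
// recall on a 3x3 confusion matrix: cm[true][predicted].
public class MetricsSketch {
    /** Fraction of all examples on the diagonal of the confusion matrix. */
    static double accuracy(int[][] cm) {
        int correct = 0, total = 0;
        for (int i = 0; i < cm.length; i++)
            for (int j = 0; j < cm.length; j++) {
                total += cm[i][j];
                if (i == j) correct += cm[i][j];
            }
        return (double) correct / total;
    }

    /** Correct predictions for class c divided by all predictions of class c. */
    static double precision(int[][] cm, int c) {
        int predicted = 0;
        for (int[] row : cm) predicted += row[c];
        return (double) cm[c][c] / predicted;
    }

    /** Correct predictions for class c divided by all examples of class c. */
    static double recall(int[][] cm, int c) {
        int actual = 0;
        for (int count : cm[c]) actual += count;
        return (double) cm[c][c] / actual;
    }

    public static void main(String[] args) {
        int[][] cm = {
            {9, 4, 5},   // true pain:    9 of 18 recognised
            {3, 20, 2},  // true disgust
            {2, 3, 12},  // true neutral
        };
        System.out.println(accuracy(cm));   // 41 of 60 correct
        System.out.println(recall(cm, 0));  // recall for pain: 9 / 18
    }
}
```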

⁵ Scale: quadratic; values: 1, 3, 10, 21, 36, 56, 81, 110, 143, 181, 223, 269, 320, 376, 436, 500.


               confidence pain (correct label)   confidence disgust   confidence neutral
classifier 1                               0.8                  0.1                  0.1
classifier 2                               0.4                  0.3                  0.3

Table 4.4.: Example of confidence values achieved by two different classifiers

For the evaluation described in the present thesis, an additional measure will be important, which will also be the chosen main criterion: the soft margin loss. It is defined as "the average of all 1 - confidences for the correct label".

The fact that a classifier predicted the correct label for a given example does not provide any information about the degree of certainty with which the prediction was performed. The certainty of the prediction is represented in the confidence values, which are the normalized output probability values achieved for a sequence, calculated over all possible classes.

Consider the example represented in table 4.4: in this example, the true label, pain, is predicted correctly by both classifiers, but the confidence with which classifier 1 performs the prediction is twice as high as the confidence achieved by classifier 2. Both classifiers would be judged as equally successful when measured by accuracy, precision and recall, although the first one is twice as certain in its correct prediction. This quality shall be expressed via the confidence values - more precisely, via the average confidence value for the correct label (calculated over all examples that were classified); this also includes the confidence values for the correct label in cases where an example was not correctly classified because another class achieved a higher confidence value.⁶

The average confidence value for the correct label over all examples cannot be chosen directly as the main performance criterion in RapidMiner; nevertheless, it can easily be accessed indirectly via the criterion soft margin loss, which is calculated as the sum of the average confidences of the wrong labels, or simply as the average confidence of the true label subtracted from 1. Higher confidence values for the correct label mean better models and therefore have to be preferred; analogously, higher soft margin loss values mean lower confidence values for the correct label and therefore inferior models. The lower the soft margin loss, the better the model. A parameter optimization with soft margin loss as main criterion will output the results for the parameter combination achieving the lowest soft margin loss. The average confidence value for the correct label is implicitly given as 1 minus the achieved soft margin loss value.
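Under this definition, the computation is straightforward. The following sketch (ours, with a hypothetical class name) derives the soft margin loss from the confidences assigned to the true labels:

```java
// Sketch of the soft margin loss as defined above: the average of all
// (1 - confidence for the correct label) over the classified examples.
public class SoftMarginLossSketch {
    /** trueLabelConfidences[e] = confidence for the TRUE label of example e. */
    public static double softMarginLoss(double[] trueLabelConfidences) {
        double sum = 0.0;
        for (double c : trueLabelConfidences) sum += 1.0 - c;
        return sum / trueLabelConfidences.length;
    }

    public static void main(String[] args) {
        // Confidences for the true label from classifiers 1 and 2 in table 4.4,
        // treated as two classified examples: loss = ((1-0.8) + (1-0.4)) / 2.
        System.out.println(softMarginLoss(new double[] {0.8, 0.4}));
    }
}
```

The average confidence for the correct label is then simply 1 minus the returned value.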

⁶ If, in the example described in table 4.4, the true label were disgust instead of pain, classifier 2 would achieve a confidence for the correct label that is three times as high as the value achieved by classifier 1. Although both classifiers misclassify the sequence, the second one comes closer to the correct result, so it can be assumed that it models the underlying processes more accurately than the alternative classifier and therefore should still be preferred.


                                 soft margin loss   accuracy   recall pain
Standard                                    0.509     51.18%        25.00%
Standard NewOperator                        0.461     55.29%        44.44%
Standard TwoClass                           0.316     74.71%        25.00%
Standard NewOperator TwoClass               0.315     70.59%        41.67%
Replace                                     0.466     59.41%        47.22%
Replace NewOperator                         0.478     58.82%        36.11%
Replace TwoClass                            0.312     74.71%        44.44%
Replace NewOperator TwoClass                0.319     74.12%        44.44%
Reduce                                      0.520     51.43%        46.15%
Reduce NewOperator                          0.498     54.29%        53.85%
Reduce TwoClass                             0.392     60.71%        57.69%
Reduce NewOperator TwoClass                 0.379     67.50%        53.85%

Table 4.5.: Selected performance values for the local random seed 1992

4.2. Results

The results that were achieved during the execution of the evaluation processes will be described and analysed in the following. The first part will focus on the results of the 12 initial processes as described in the previous chapter, while the second part will include additional results achieved by further variations of certain parameter values.

4.2.1. Evaluating the basic processes

Selected performance values that were achieved during the execution of the 12 initial processes are presented in table 4.5: the most important criterion to be examined is of course the soft margin loss described in the previous chapter, which was chosen as main criterion; the second value to be investigated is the accuracy - usually, a lower soft margin loss value corresponds with a higher accuracy value; finally, a third indicator for the quality of a given model shall be examined more closely: the recall value for the class pain. As emphasized earlier in the present thesis, pain is a sensitive issue often involving a threat to the suffering person. Because of this, classifiers are required to perform pain detection as accurately as possible. The priority shall be to fully capture all expressions of pain occurring in a given scenario, even when this leads to some false alarms in case of doubt - therefore, a high recall for the class pain is desirable.
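For reference, the latter two criteria can be computed from predicted and true labels as follows; this is an illustrative sketch with hypothetical labels, not the implementation used inside the RapidMiner processes:

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, cls):
    """Fraction of examples of class `cls` that were recognized as `cls`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    total = sum(1 for t in y_true if t == cls)
    return hits / total

# Hypothetical labels for six sequences:
y_true = ["pain", "pain", "pain", "disgust", "neutral", "disgust"]
y_pred = ["pain", "neutral", "pain", "disgust", "neutral", "pain"]

accuracy(y_true, y_pred)          # 4/6: four of six sequences correct
recall(y_true, y_pred, "pain")    # 2/3: two of three pain sequences found
```

A high pain recall can thus coexist with false alarms (the last sequence), which is the trade-off accepted above.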

In general, the processes only lead to mediocre results. For classification tasks comprising three possible classes, the soft margin loss lies between 0.461 and 0.520, which implies an average confidence for the correct label of 0.480 to 0.539 - for a classifier simply guessing the classes to be predicted, this value would be 0.333, while a perfect classifier predicting every class correctly with full certainty would achieve a value of 1.000 as average confidence for the correct label. Processes only differentiating between two classes achieve values of 0.312 to 0.392 (soft margin loss) or 0.608 to 0.688 (average confidence for the correct label - guessing the class would correspond to a value of 0.500). Accuracy values lie below 60.00% for three-class problems (guessing probability: 33.33%) and below 75.00% for two-class problems (guessing probability: 50.00%). 9 out of 12 models do not even recognize half of the pain sequences (recall for pain < 50.00%), and not a single model reaches a value of 60.00% or higher.

                                 number of states   iterations
Standard                         12                 36
Standard NewOperator             7                  10
Standard TwoClass                12                 269
Standard NewOperator TwoClass    7                  36
Replace                          12                 500
Replace NewOperator              9                  10
Replace TwoClass                 12                 223
Replace NewOperator TwoClass     9                  1
Reduce                           8                  56
Reduce NewOperator               7                  3
Reduce TwoClass                  5                  110
Reduce NewOperator TwoClass      4                  3

Table 4.6.: Optimal parameter values for the local random seed 1992

Comparing the results of the different processes to each other leads to the following observations:

1. Replacing pain-irrelevant AUs by the wild card AU I leads to a slight increase in the overall performance7 and a considerable increase in the recall of pain - at least in cases where the classic HMM concept is implemented. The reduced alphabet of AUs and AU compounds seems to enable the model to represent sequences of pain better than before. This observation resembles the results of the study conducted by Siebers et al. [SSS+16], where the replacement of pain-irrelevant AUs led to an increased recall for the class pain, too.

2. Using the updated HMM concept treating parallel observations as sets has a similar effect when applied to the standard representation of the data basis: the overall performance increases slightly while the recall for the class pain increases notably.

3. Combining the replacement of irrelevant AUs with the application of the new HMM concept does not necessarily combine the positive effects described in observations 1 and 2: for the two-class problem, the results are similar to the results achieved by only applying one of the two options; for the three-class problem, the examined performance values are even worse than those achieved by applying only one option (although they are still better than the results achieved by applying neither option).

7In the context of this evaluation, an "increased" performance will always be understood as higher values for accuracy and recall, but a lower value for the soft margin loss.

                                 processing time
Standard                         0:16:59
Standard NewOperator             1:51:42
Standard TwoClass                0:16:48
Standard NewOperator TwoClass    1:50:53
Replace                          0:10:08
Replace NewOperator              1:20:44
Replace TwoClass                 0:10:05
Replace NewOperator TwoClass     1:20:56
Reduce                           0:02:10
Reduce NewOperator               0:18:00
Reduce TwoClass                  0:02:15
Reduce NewOperator TwoClass      0:17:02

Table 4.7.: Processing time of the evaluation processes

4. A further reduction of the observation alphabet by completely removing the pain-irrelevant AUs instead of replacing them almost always leads to the same effect: while the overall performance decreases8 compared to the other representations of the data basis, the recall for the class pain increases notably, even when it was already increased by applying one of the options described above9. A model only based on pain-relevant AUs seems to be more sensitive towards pain, but has fewer criteria for differentiating between any other sentiments encoded in a facial expression - this might explain the results achieved for this approach.

5. With respect to the optimal combinations of parameters found for the initial processes, which are represented in table 4.6, it has to be noted that processes applying the updated HMM concept always achieve their optimal results for smaller values of the number of states and the number of iterations compared to the equivalent processes applying the classic HMM concept. 5 out of 6 processes applying the classic HMM concept achieve their best results for the highest number of states that is possible during the parameter optimization, while all other processes reach a (local) maximum for lower values. It can be assumed that classification models realizing the updated HMM concept need fewer states in order to model a process creating parallel observations, as they do not have to consider an extended alphabet of possible observations because of their ability to calculate probabilities for parallel observations dynamically.

8A decreasing performance means lower accuracy and recall values, but higher soft margin loss values.

9The only exception is the Reduce process, where the accuracy slightly increases compared to the Standard process while the recall for the class pain slightly decreases compared to the Replace process.


6. Considering the processing time represented in table 4.7, it turns out clearly that processes using the updated HMM concept needed many times the amount of time required by the processes using the classic HMM concept. A major disadvantage of the updated HMM concept is that it has to consider the probability distribution of every possible observation value whenever a probability has to be calculated for a given observation (independent of the observation's size) - the larger the observation alphabet, the more this procedure slows down the whole process. As a growing number of states constituting an HMM results in a notable increase of the required processing time as well, the approach of the updated HMM concept turns out to be applicable only for a modest number of states and observations. This may not be a major issue, though, since processes using the updated concept seem to need a smaller number of states to achieve good results anyway (as shown in observation 5).
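The source of this cost can be illustrated with a small sketch. The thesis does not spell out the exact formula the updated operator uses, so the independent-per-symbol model below is only one plausible instantiation; it does, however, reproduce both described properties: evaluating one observation set touches every symbol of the alphabet, and the empty set receives a non-zero probability.

```python
from math import prod

def set_observation_prob(obs_set, symbol_probs):
    """Probability of an observation set under one HMM state, assuming each
    alphabet symbol occurs independently with its own probability.

    symbol_probs maps EVERY symbol of the alphabet to its occurrence
    probability in this state; absent symbols contribute a factor (1 - p),
    so the loop always runs over the whole alphabet, whatever the set's size.
    """
    return prod(p if s in obs_set else 1.0 - p for s, p in symbol_probs.items())

# Hypothetical per-symbol probabilities for one state:
probs = {"AU4": 0.5, "AU6": 0.25, "AU7": 0.2, "AU9": 0.1, "AU10": 0.1, "AU43": 0.05}

set_observation_prob({"AU4", "AU6"}, probs)  # multiplies all six factors
set_observation_prob(set(), probs)           # empty set: still greater than zero
```

The second call makes the empty-set problem mentioned in the conclusion concrete: under this model, probability mass is inevitably assigned to observing nothing at all.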

4.2.2. Further variations of parameter values

In the following, some additional parameter values for the initial processes will be considered in order to answer some questions that may arise from the results described in the previous chapter. For those processes that reached their best results for the highest possible number of states used for the HMM initialization, it shall be tested whether allowing a higher value will lead to a further increase in the achieved results. Additionally, the initial processes will be tested again with alternative local random seeds in order to find out how strongly the randomly chosen starting values can influence the overall performance of a process.

Increasing the maximum number of states

There were five processes that did not reach a local maximum before considering the highest number of states available via parameter optimization. For those processes, the alternative approach for choosing the maximum number of states will be applied in order to find out how this influences the performance. For processes using the classic HMM concept, the value to be tested is the size of the observation alphabet that is used in the course of the process; for processes using the updated HMM concept, this would be the number of possible combinations of all single observation symbols, which would result in an impractically long processing time. As all these processes already reached local maxima in the previous runs anyway, they will not be considered in the following.

All processes to be considered used the HMM operator realizing the classic approach. Two of them were two-class versions of the three basic processes - they will be excluded from the investigation for the moment, which leaves the three processes Standard, Replace and Reduce. Each of them has a different observation alphabet size to consider as the new maximum number of states, which can be seen in table 4.3. In each case, 6 new values shall be compared to the previous maximum number of states in a separate process:


                     soft margin loss   accuracy   recall pain
Standard Extended    0.486              55.29%     22.22%
Replace Extended     0.466              59.41%     47.22%
Reduce Extended      0.501              51.25%     50.00%

                     number of states   iterations   processing time
Standard Extended    70                 36           7:00:23
Replace Extended     12                 500          1:02:18
Reduce Extended      13                 21           0:06:56

Table 4.8.: Selected results for the extended processes

• The process Standard Extended considers values between 12 and 70 (scale: quadratic; values: 12, 14, 18, 27, 38, 52, 70).

• The process Replace Extended considers values between 12 and 28 (scale: quadratic; values: 12, 13, 15, 17, 20, 24, 28).

• The process Reduce Extended considers values between 8 and 14 (scale: linear; values: 8, 9, 10, 11, 12, 13, 14).

The results of the processes can be found in table 4.8. The Standard Extended process again achieves its best results for the maximum number of states. While the overall performance slightly increases, the recall for the class pain slightly decreases. The Replace Extended process chooses the exact same values as the original Replace process, therefore achieving the exact same results. The log shows that the performance decreases after choosing a number of states higher than 12, increases again at a certain point, reaches a local maximum for 24 states (which is nearly as high as the local maximum achieved for 12 states), and decreases again afterwards. The Reduce Extended process is the only one achieving a decreased soft margin loss and an increased recall for the class pain by applying a higher number of states; according to the log, the performance increases and decreases alternately with a growing number of states and reaches several local maxima. Ultimately, allowing a higher maximum number of states during the parameter optimization does not lead to a notably increased performance.

Testing alternative local random seeds

In order to find out how strongly the results achieved for the 12 initial processes are influenced by the initial HMMs used for the training process, which were always initialized with values created according to the local random seed 1992, the processes were executed again with alternative local random seeds.

The alternative local random seeds to be tested were 1225 and 1564. The results for the value 1225 can be found in tables B.1 and B.2, while tables B.3 and B.4 show the results for the value 1564. Comparing the results with the observations formulated in chapter 4.2.1 leads to the following insights:


1. The replacement of pain-irrelevant AUs still seems to increase the overall performance as well as the recall for the class pain in general, although there are exceptions13.

2. The usage of the updated HMM concept almost always slightly increases the overall performance and considerably increases the recall for the class pain (even leading to the highest recall for pain across all processes14) when applied to the standard representation of the data basis15.

3. The combination of the replacement of pain-irrelevant AUs and the application of the updated HMM concept still does not guarantee a significant increase in the results compared to choosing only one of those options - but it is still always better than choosing no option at all, and there are even examples where the combination of both options achieves better results than the alternative processes applying only one option or none16 (one of them even achieving the highest accuracy across all processes17).

4. As before, the complete removal of all pain-irrelevant AUs leads to a decrease in the overall performance and a substantial increase in the recall for pain compared to equivalent processes based on the Standard or Replace representations18.

5. Considering the optimal parameters, processes using the updated HMM concept never use more, but often fewer, states to initialize HMMs with; they often, but not always, need fewer iterations of the Baum-Welch algorithm in order to achieve their optimal results.

Most of the findings worked out for the 12 initial processes are confirmed by the results achieved when testing alternative local random seeds, although there are some exceptions to almost every apparent regularity in the results. The range of performance values achieved for the different processes remains more or less the same across the different local random seeds. In general, the choice of the local random seed does not seem to have a substantial impact on the performance of a process.

13For the local random seed 1564, the Replace process achieves a lower recall for pain compared to the Standard process, while the Replace TwoClass process achieves a higher soft margin loss and a lower accuracy compared to the Standard TwoClass process.

14It is 58.33% and occurs in the process Standard NewOperator for the local random seed 1225.

15Some exceptions occur, though: for the process Standard NewOperator TwoClass, the accuracy slightly decreases for the local random seeds 1225 and 1564; for the latter, the recall for pain does not increase either.

16These examples are the process Replace NewOperator TwoClass for the local random seed 1225, which achieves better results than its alternatives in every respect, and the process Replace NewOperator for the local random seed 1564, where only the recall for pain is slightly decreased compared to the process Standard NewOperator, while all other values are better than the alternatives.

17It is 80.00% and occurs in the process Replace NewOperator TwoClass for the local random seed 1225.

18With only a few exceptions, like the process Reduce NewOperator for the local random seed 1225, where the opposite effect is the case, while for the same local random seed the process Reduce NewOperator TwoClass achieves worse values in every category compared to the equivalent processes.


5. Conclusion

Automated facial expression analysis is an elaborate process consisting of many steps; the present thesis highlighted the classification step, which is supposed to correctly interpret a facial expression based on a given representation. The representation used in this context was sequences of Action Units decoded via the Facial Action Coding System - the goal of the present thesis was the creation of a classification system able to differentiate between AU sequences decoding sentiments of pain and sequences decoding other sentiments. The system was to be based on the machine learning concept of Hidden Markov Models in order to be able to process sequential data. An extension for the data mining software RapidMiner was implemented based on the jahmm library in order to realize the classification system.
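The overall classification scheme - training one HMM per class and scoring a new sequence against each of them - can be sketched as follows. The thesis does not state how exactly the RapidMiner extension derives its confidence values, so normalizing the per-class likelihoods, as done here, is an assumption; the forward algorithm itself is standard.

```python
def sequence_likelihood(obs, start, trans, emit):
    """Likelihood of a discrete observation sequence under one HMM
    (standard forward algorithm, no scaling)."""
    alpha = [start[s] * emit[s].get(obs[0], 0.0) for s in range(len(start))]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(len(alpha)))
                 * emit[j].get(o, 0.0)
                 for j in range(len(alpha))]
    return sum(alpha)

def classify(obs, models):
    """models maps a class label to an HMM (start, trans, emit); returns the
    winning label and the confidences, obtained by normalizing likelihoods."""
    scores = {lbl: sequence_likelihood(obs, *m) for lbl, m in models.items()}
    total = sum(scores.values()) or 1.0
    conf = {lbl: s / total for lbl, s in scores.items()}
    return max(conf, key=conf.get), conf

# Toy two-class setup: one state per model, differing emission probabilities
models = {
    "pain":  ([1.0], [[1.0]], [{"AU4": 0.9, "AU12": 0.1}]),
    "other": ([1.0], [[1.0]], [{"AU4": 0.1, "AU12": 0.9}]),
}
label, conf = classify(["AU4", "AU4"], models)
```

For long sequences the forward algorithm is normally computed with scaling or in log space to avoid numerical underflow; production libraries such as jahmm handle this internally.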

The first objective of the present thesis was to solve the problem of the occurrence of parallel observations, which cannot be processed by HMMs in their basic implementation. Two approaches were considered: in the first approach, the data basis is manipulated in order to represent combinations of parallel observations as self-contained symbols within the alphabet of possible observations - this way, observation sequences containing parallel observations can be processed by the original HMM concept; in the second approach, parallel observations are treated as sets of observations - an updated version of the HMM concept was designed that calculates the probabilities of parallel observations based on the probabilities of their single components. The first approach is simpler and faster, but unable to model dependencies between combinations of observations and their components, and therefore unable to process previously unseen combinations of observations. The second approach is able to process every possible combination of observations by modelling the dependencies that are missing in the first approach, but requires greater computational resources and has to deal with the theoretical possibility of empty observation sets.
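The first approach amounts to a simple preprocessing step; a minimal sketch, using the compound naming scheme from appendix A (the function name itself is ours):

```python
def to_compound_symbol(parallel_aus):
    """Approach 1: encode AUs observed in parallel as one self-contained
    alphabet symbol, e.g. {10, 4} -> "AU4-10" (cf. appendix A)."""
    return "AU" + "-".join(str(a) for a in sorted(parallel_aus))

to_compound_symbol({4})      # "AU4"
to_compound_symbol({10, 4})  # "AU4-10"
```

A combination never seen during training yields a symbol outside the learned alphabet, which makes the limitation of this approach concrete.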

The second objective of the present thesis was the evaluation of the HMM-based classification system as a classifier for facial expressions: both approaches of handling parallel observations were integrated into the RapidMiner extension enabling HMM-based classification and evaluated by applying them to the sequences of Action Units. Different representations of the data basis were used: besides the standard representation, two alternative representations considered the highlighting of pain-relevant AUs presented in the literature concerned with pain assessment: the first alternative replaced all pain-irrelevant AUs by a single surrogate AU; the second alternative removed them completely from the sequences.

The evaluation showed that an HMM-based classification system as implemented in the course of the present thesis achieves modest results in classifying facial expressions of pain and other sentiments. Replacing pain-irrelevant AUs by a surrogate AU usually leads to better performance values, especially considering the recall for the class pain; the same effect is achieved when the updated HMM concept processing parallel observations as sets is applied; combining both approaches does not necessarily increase this positive effect. Completely removing the pain-irrelevant AUs leads to a decrease of most performance values while further increasing the recall for the class pain. It can be assumed that the more sensitive a model is towards the detection of pain, the less able it is to identify and distinguish between other sentiments. Processes applying the updated HMM concept require considerably more computation time but need fewer states and iterations of the learning process in order to achieve their optimal results. For all these apparent regularities, several exceptions were observed across the evaluation processes.

Future research based on these findings might consider additional approaches of processing parallel observations in the course of HMM-based classification. The updated HMM concept presented in this work could be optimized with regard to the required computation time for probability calculation or the problem of empty observation sets being possible. The evaluation of the classification system could be extended by using a larger data basis or testing additional combinations of parameter values.

If it were possible to create a system that is able to recognize pain and other sentiments by interpreting facial expressions with a sufficient degree of certainty, several practical applications are imaginable. One example that was already mentioned is hospital treatment - systems observing the facial expressions of patients (especially those patients unable to communicate their emotional condition verbally) could detect occurrences of pain quickly and initiate action to provide help to the suffering persons. Another application field with a promising future is the interaction between humans and machines: profound knowledge about the constitution of facial expressions as sequences of Action Units might improve the realistic generation of emotions in avatars or humanoid robots [SSS+12]1. Possible applications might reach from a simple simulation of emotions in a specific context of interaction up to the implementation of a real emotional understanding resulting in adaptive behaviour: "if we want computers to be genuinely intelligent and to interact naturally with us, we must give computers the ability to recognize, understand, and even to have and express emotions" [MKR09, p. 53].

1Studies trying this were conducted by Fabri et al. [FMH04] and Paleari & Lisetti [PL06].


Bibliography

[Akh14a] M. Fareed Akhtar. K-nearest neighbor classification I. In Markus Hofmann and Ralf Klinkenberg, editors, RapidMiner, pages 33–43. CRC Press, Boca Raton, 2014.

[Akh14b] M. Fareed Akhtar. K-nearest neighbor classification II. In Markus Hofmann and Ralf Klinkenberg, editors, RapidMiner, pages 45–51. CRC Press, Boca Raton, 2014.

[ALC+09] Ahmed Bilal Ashraf, Simon Lucey, Jeffrey F. Cohn, Tsuhan Chen, Zara Ambadar, Kenneth M. Prkachin, and Patricia E. Solomon. The painful face - pain expression recognition using active appearance models. Image and Vision Computing, 27(12):1788–1796, 2009.

[Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

[Chi13] Andrew Chisholm. Exploring Data with RapidMiner. Packt Publishing, Birmingham, UK, 2013.

[Coh10] Jeffrey F. Cohn. Advances in behavioral science using automated facial image analysis and synthesis. IEEE Signal Processing Magazine, 27(6):128–133, 2010.

[CPE01] Kenneth D. Craig, Kenneth M. Prkachin, and Ruth Eckstein Grunau. The facial expression of pain. In Dennis C. Turk and Ronald Melzack, editors, Handbook of Pain Assessment, pages 153–169. New York, 2001.

[DPM04] Kathleen S. Deyo, Kenneth M. Prkachin, and Susan R. Mercer. Development of sensitivity to facial expression of pain. PAIN, 107(1-2):16–21, 2004.

[EF71] Paul Ekman and Wallace V. Friesen. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2):124–129, 1971.

[Ekm88] Paul Ekman. Gesichtsausdruck und Gefühl: 20 Jahre Forschung von Paul Ekman. Junfermann, Paderborn, 1988.

[Fin14] Gernot A. Fink. Markov Models for Pattern Recognition: From Theory to Applications. Springer, London, 2nd edition, 2014.

[FL03] Beat Fasel and Juergen Luettin. Automatic facial expression analysis: a survey. Pattern Recognition, 36(1):259–275, 2003.

[FMH04] Marc Fabri, David Moore, and Dave Hobbs. Mediating the expression of emotion in educational collaborative virtual environments: an experimental study. Virtual Reality, 7(2):66–81, 2004.

[Fra06] Jean-Marc François. jahmm-0.6.1-userguide, 2006.

[IAS86] IASP Subcommittee on Classification. Pain terms: A current list with definitions and notes on usage. PAIN, 24(Supplement 1):215–221, 1986.

[JVDS14] Milos Jovanovic, Milan Vukicevic, Boris Delibasic, and Milija Suknovic. Using RapidMiner for research: Experimental evaluation of learners. In Markus Hofmann and Ralf Klinkenberg, editors, RapidMiner, pages 439–454. CRC Press, Boca Raton, 2014.

[KD15] Vijay Kotu and Bala Deshpande. Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner. Morgan Kaufmann, Waltham, 2015.

[KML+16] Anna Julia Karmann, Christian Maihöfner, Stefan Lautenbacher, Wolfgang Sperling, Johannes Kornhuber, and Miriam Kunz. The role of prefrontal inhibition in regulating facial expressions of pain: A repetitive transcranial magnetic stimulation study. The Journal of Pain, 17(3):383–391, 2016.

[KMS+09] Miriam Kunz, Veit Mylius, Siegfried Scharmann, Karsten Schepelmann, and Stefan Lautenbacher. Influence of dementia on multiple components of pain. European Journal of Pain, 13(3):317–325, 2009.

[KSH+07] Miriam Kunz, Siegfried Scharmann, Uli Hemmeter, Karsten Schepelmann, and Stefan Lautenbacher. The facial expression of pain in patients with dementia. PAIN, 133(1-3):221–228, 2007.

[LeR82] Linda LeResche. Facial expression in pain: A study of candid photographs. Journal of Nonverbal Behavior, 7(1):46–56, 1982.

[LH12] Seyed Mehdi Lajevardi and Zahir M. Hussain. Automatic facial expression recognition: feature extraction and selection. Signal, Image and Video Processing, 6(1):159–169, 2012.

[LKCL98] James J. Lien, Takeo Kanade, Jeffrey F. Cohn, and Ching-Chung Li. Automated facial expression recognition based on FACS action units. In Third IEEE International Conference on Automatic Face and Gesture Recognition. 1998.

[LMHBG07] Amanda C. Lints-Martindale, Thomas Hadjistavropoulos, Bruce Barber, and Stephen J. Gibson. A psychophysical investigation of the facial action coding system as an index of pain variability among older adults with and without Alzheimer's disease. Pain Medicine, 8(8):678–689, 2007.

[Mie14] Ingo Mierswa. Getting used to RapidMiner. In Markus Hofmann and Ralf Klinkenberg, editors, RapidMiner, pages 19–30. CRC Press, Boca Raton, 2014.

[Mit97] Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

[MKR09] Lori Malatesta, Kostas Karpouzis, and Amaryllis Raouzaiou. Affective intelligence: The human face of AI. In Max Bramer, editor, Artificial Intelligence, pages 53–70. Springer, Berlin, 2009.

[PBM94] Kenneth M. Prkachin, Sandra Berzins, and Susan R. Mercer. Encoding and decoding of pain expressions: a judgement study. PAIN, 58(2):253–259, 1994.

[PC95] Kenneth M. Prkachin and Kenneth D. Craig. Expressing pain: The communication and interpretation of facial pain signals. Journal of Nonverbal Behavior, 19(4):191–205, 1995.

[PL06] Marco Paleari and Christine Lisetti. Psychologically grounded avatars expressions. In Dirk Reichardt, Paul Levi, and John-Jules C. Meyer, editors, Proceedings of the 1st Workshop on Emotion & Computing - Current Research and Future Impact, pages 39–42. Bremen, 2006.

[Prk92] Kenneth M. Prkachin. The consistency of facial expressions of pain: a comparison across modalities. PAIN, 51(3):297–306, 1992.

[RJ86] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.

[Sch82] Klaus R. Scherer. Emotion as a process: Function, origin and regulation. Social Science Information, 21(4-5):555–570, 1982.

[SES13] Michael Siebers, Tamara Engelbrecht, and Ute Schmid. On the relevance of sequence information for decoding facial expressions of pain and disgust: An avatar study. In Dirk Reichardt, editor, Proceedings of the 7th Workshop on Emotion & Computing - Current Research and Future Impact, pages 3–9. Koblenz, 2013.

[SKLS09] Michael Siebers, Miriam Kunz, Stefan Lautenbacher, and Ute Schmid. Classifying facial pain expressions: Individual classifiers vs. global classifiers. In Dirk Reichardt, editor, Proceedings of the 4th Workshop on Emotion & Computing - Current Research and Future Impact. Paderborn, 2009.

[SPF97] Patricia E. Solomon, Kenneth M. Prkachin, and Vern Farewell. Enhancing sensitivity to facial expression of pain. PAIN, 71(3):279–284, 1997.

[SSB08] Samuel Strupp, Norbert Schmitz, and Karsten Berns. Visual-based emotion detection for natural man-machine interaction. In Andreas R. Dengel, Karsten Berns, Thomas M. Breuel, Frank Bomarius, and Thomas R. Roth-Berghofer, editors, KI 2008: Advances in Artificial Intelligence. Springer, Berlin, 2008.

[SSS+12] Ute Schmid, Michael Siebers, Dominik Seuß, Miriam Kunz, and Stefan Lautenbacher. Applying grammar inference to identify generalized patterns of facial expressions of pain. In Jeffrey Heinz, Colin de la Higuera, and Tim Oates, editors, Proceedings of the 11th International Conference on Grammatical Inference, pages 183–188. Washington DC, 2012.

[SSS13] Christoph Stocker, Michael Siebers, and Ute Schmid. Erkennung von Sequenzen mimischer Schmerzausdrücke durch genetische Programmierung. In Andreas Henrich and Hans-Christian Sperker, editors, LWA 2013: Workshop Proceedings, pages 117–120. Bamberg, 2013.

[SSS+16] Michael Siebers, Ute Schmid, Dominik Seuß, Miriam Kunz, and Stefan Lautenbacher. Characterizing facial expressions by grammars of action unit sequences: A first investigation using ABL. Information Sciences, 329:866–875, 2016.


A. Alphabets

Alphabet of possible observations for the Standard approach:
Σ = {AU1, AU2, AU4, AU5, AU6, AU7, AU9, AU10, AU12, AU14, AU15, AU16, AU17, AU18, AU19, AU20, AU22, AU23, AU24, AU25, AU26, AU28, AU29, AU30, AU31, AU32, AU34, AU37, AU38, AU39, AU43, AU1-2, AU1-4, AU1-6, AU1-18, AU4-6, AU4-7, AU4-10, AU4-15, AU4-17, AU4-23, AU4-26, AU4-34, AU6-7, AU6-12, AU6-26, AU7-10, AU7-12, AU7-17, AU7-26, AU7-43, AU10-14, AU14-28, AU15-17, AU16-17, AU18-19, AU19-28, AU23-28, AU24-26, AU25-26, AU26-30, AU4-10-17, AU4-10-43, AU4-15-17, AU6-7-26, AU10-15-17, AU4-6-7-9-10, AU4-6-7-9-10-14, AU4-6-7-9-10-24, AU6-7-9-10-17-23-43}

Alphabet of possible observations for the Standard NewOperator approach:
Σ = {AU1, AU2, AU4, AU5, AU6, AU7, AU9, AU10, AU12, AU14, AU15, AU16, AU17, AU18, AU19, AU20, AU22, AU23, AU24, AU25, AU26, AU28, AU29, AU30, AU31, AU32, AU34, AU37, AU38, AU39, AU43}

Alphabet of possible observations for the Replace approach:
Σ = {AU4, AU6, AU7, AU9, AU10, AU43, AUI, AU4-6, AU4-7, AU4-10, AU4-I, AU6-7, AU6-I, AU7-10, AU7-43, AU7-I, AU10-I, AUI-4, AUI-6, AUI-I, AU4-10-43, AU4-10-I, AU4-I-I, AU6-7-I, AU10-I-I, AU4-6-7-9-10, AU4-6-7-9-10-I, AU6-7-9-10-I-I-43}

Alphabet of possible observations for the Replace NewOperator approach:
Σ = {AU4, AU6, AU7, AU9, AU10, AU43, AUI}

Alphabet of possible observations for the Reduce approach:
Σ = {AU4, AU6, AU7, AU9, AU10, AU43, AU4-6, AU4-7, AU4-10, AU6-7, AU7-10, AU7-43, AU4-10-43, AU4-6-7-9-10}

Alphabet of possible observations for the Reduce NewOperator approach:
Σ = {AU4, AU6, AU7, AU9, AU10, AU43}

Figure A.1.: Alphabets of possible observations for the different evaluation approaches


B. Additional evaluation results

                                 soft margin loss   accuracy   recall pain
Standard                         0.512              51.76%     22.22%
Standard NewOperator             0.462              55.29%     58.33%
Standard TwoClass                0.326              73.53%     19.44%
Standard NewOperator TwoClass    0.295              71.18%     47.22%
Replace                          0.471              59.41%     38.89%
Replace NewOperator              0.495              57.06%     50.00%
Replace TwoClass                 0.330              73.53%     36.11%
Replace NewOperator TwoClass     0.282              80.00%     52.78%
Reduce                           0.519              46.96%     50.00%
Reduce NewOperator               0.467              61.25%     46.15%
Reduce TwoClass                  0.366              65.54%     53.85%
Reduce NewOperator TwoClass      0.357              67.86%     42.31%

Table B.1.: Selected performance values for the local random seed 1225

                                 number of states   iterations
Standard                         12                 56
Standard NewOperator             12                 81
Standard TwoClass                9                  223
Standard NewOperator TwoClass    9                  376
Replace                          7                  81
Replace NewOperator              4                  36
Replace TwoClass                 12                 500
Replace NewOperator TwoClass     3                  3
Reduce                           7                  56
Reduce NewOperator               6                  3
Reduce TwoClass                  6                  110
Reduce NewOperator TwoClass      2                  21

Table B.2.: Optimal parameter values for the local random seed 1225


                                 soft margin loss   accuracy   recall pain
Standard                         0.517              53.53%     36.11%
Standard NewOperator             0.482              54.12%     44.44%
Standard TwoClass                0.308              76.47%     22.22%
Standard NewOperator TwoClass    0.292              72.94%     22.22%
Replace                          0.487              57.65%     30.56%
Replace NewOperator              0.478              58.24%     41.67%
Replace TwoClass                 0.341              70.59%     33.33%
Replace NewOperator TwoClass     0.299              75.29%     36.11%
Reduce                           0.475              51.96%     57.69%
Reduce NewOperator               0.508              52.68%     42.31%
Reduce TwoClass                  0.381              69.11%     53.85%
Reduce NewOperator TwoClass      0.358              70.36%     46.15%

Table B.3.: Selected performance values for the local random seed 1564

                                 number of states   iterations
Standard                         12                 10
Standard NewOperator             2                  3
Standard TwoClass                12                 376
Standard NewOperator TwoClass    12                 436
Replace                          12                 436
Replace NewOperator              3                  10
Replace TwoClass                 12                 110
Replace NewOperator TwoClass     2                  3
Reduce                           6                  56
Reduce NewOperator               3                  81
Reduce TwoClass                  6                  21
Reduce NewOperator TwoClass      2                  3

Table B.4.: Optimal parameter values for the local random seed 1564


C. Media content

The present thesis comes with included media comprising the following content:

• The present thesis as electronic file (Masterarbeit KogSys Gromowski.pdf).

• The implementation of the RapidMiner extension realizing HMM-based classification learning (folder Implementation).

– The classes that were implemented in the course of the present thesis can be found in the directory src/main/java/com/rapidminer/extension.

• The RapidMiner processes realizing the 12 initial evaluation processes (folder Rapidminer Processes).

• Representations of the HMM-based classification models learned for the 12 initial evaluation processes (folder Hidden Markov Models).

• The performance vectors for the 12 initial evaluation processes (folder Performance Vectors).

• The optimal parameters found for the 12 initial evaluation processes (folder Optimal Parameters).

• The log files created for the 12 initial evaluation processes (folder Logs).


Erklärung

I hereby declare, in accordance with § 17 Abs. 2 APO, that I have written the above master's thesis independently and have not used any sources or aids other than those indicated.

(Date) (Signature)