SAARLAND UNIVERSITY
Hidden Markov Model Based
Recognition of German Finger Spelling
Using the Leap Motion
Submitted by:
Tengfei Wang
Supervisor:
Dr. Alexis Heloir
A thesis submitted in partial fulfillment for the
degree of Master of Science
in the
Faculty of Natural Sciences and Technology I
Department of Computer and Communication Technology
September 2015
SAARLAND UNIVERSITY
Abstract
Faculty of Natural Sciences and Technology I
Department of Computer and Communication Technology
Master of Science
by Tengfei Wang
Recently, the appearance of novel acquisition devices like the Leap Motion Controller
drew a lot of attention in the field of gesture recognition. It is explicitly targeted
at hand gesture recognition and provides access to the positions of the fingertips and
the orientation of the hand. This new device might be an interesting opportunity for
robust gesture recognition. We would therefore like to evaluate the capabilities of the
Leap Motion for recognizing complex gestures like the ones used in German
finger spelling. In this thesis, we present the German finger spelling recognition system
(GFRS), which is capable of recognizing letter-to-letter transitions in real time. In this
system, instead of modelling static posture for each letter, letter to letter transitions
are modeled using hidden Markov models (HMMs). The models are trained using the
data recorded by the Leap Motion during the performance of transitions. In addition to
the statistical model, a bigram language model is also used to reduce the size of the model
database. Experiments are conducted on both isolated and continuous recognition. For
isolated recognition, the system achieves an accuracy of 80% using a vocabulary of
100 transitions, which improves to 89.96% when the vocabulary size is reduced to 30.
For continuous recognition, the accuracy is 68% when testing on a vocabulary of 10
commonly used German surnames.
Acknowledgements
This thesis required a significant amount of research and programming. The implementation
would not have been possible without the support of many individuals and
organizations. I would therefore like to extend my sincere gratitude to all of them.
First of all, I am thankful to my supervisor, Dr. Alexis Heloir, for providing the necessary
guidance concerning background knowledge as well as suggestions for solving the problems
encountered during the project implementation. I would also like to show my gratitude to
Prof. Dr. Antonio Krüger, who accepted to review this work.
I am grateful to the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI) for
the provision of hardware support for the implementation. Additionally, I extend my thanks
to the Jahmm developers, who provide the source code on which our project is based.
Last but not least, I would like to express my sincere thanks to my family and my
friends for their kind encouragement, which helped me to complete this project.
Contents

Abstract
Acknowledgements
List of Figures
List of Tables
Abbreviations
Symbols

1 Introduction
  1.1 German Dactylology
  1.2 Introduction to State of the Art Gesture Recognition
    1.2.1 General Process
    1.2.2 Input Methods
    1.2.3 Static Gesture Recognition
    1.2.4 Dynamic Gesture Recognition
  1.3 Our Objective

2 The Proposed Method
  2.1 The Leap Motion Controller
  2.2 Letter to Letter Transition
  2.3 Hidden Markov Model
    2.3.1 Basic Structure
    2.3.2 Parameter Set
    2.3.3 Three Basic Problems for HMM
      2.3.3.1 Problem 1
      2.3.3.2 Problem 2
      2.3.3.3 Problem 3
    2.3.4 HMM Used in Our System
      2.3.4.1 Topology - Left to Right
      2.3.4.2 Emission Probability - Mixture of Multivariate Gaussians

3 System Architecture
  3.1 Feature Extractor
  3.2 The Training
    3.2.1 HMM Initialization
      3.2.1.1 Sequence Segmentation
      3.2.1.2 Transition Probabilities (A) Initialization
      3.2.1.3 Emission Probabilities (B) Initialization
    3.2.2 Baum-Welch Algorithm
  3.3 The Recognition
  3.4 Bigram

4 System Implementation
  4.1 Jahmm
    4.1.1 Main Classes
    4.1.2 Extension to Jahmm
    4.1.3 Data Storage
  4.2 Class Diagrams
  4.3 User Interface
    4.3.1 Data Recording
    4.3.2 HMM Training
    4.3.3 Recognition
  4.4 Implementation Issues
    4.4.1 Data Acquisition
    4.4.2 Continuous Recognition

5 Experimental Results
  5.1 Preparation
  5.2 Experiments
    5.2.1 Isolated Recognition
      5.2.1.1 Experiment on Feature Vector
      5.2.1.2 Experiment on Number of States
      5.2.1.3 Experiment on Size of HMM Database
    5.2.2 Continuous Recognition
  5.3 Discussion

6 Conclusion and Further Work

A JSON Structure from the Leap
B The 100 Transitions
C Class Diagrams

Bibliography
List of Figures

1.1 German finger spelling alphabet from http://www.visuelles-denken.de/Schnupperkurs3.html.
2.1 The right-handed Cartesian coordinate system of the Leap Motion.
2.2 The skeletal tracking model of the Leap Motion enables us to access the position of each joint.
2.3 The directional information represented by arrows that the Leap Motion can provide.
2.4 The word "owl" consists of two transitions: o to w and w to l. Picture source: http://www.cafepress.com/+finger-spelling+journals
2.5 A sample of a 5-state hidden Markov model.
3.1 Block diagram of system architecture.
3.2 A to B transition changes only the hand shape.
3.3 Signs of M and N have slight differences.
3.4 An illustration of the concept of finger openness.
3.5 Signs of D and G.
3.6 Signs of I and J.
3.7 Flow chart of the training pipeline.
3.8 An illustration of entropy estimation with W=4.
3.9 Flow chart of the recognition procedure.
3.10 A screen shot of the GBC matrix.
3.11 The occurrence probabilities of transitions in a corpus of 16026 sentences.
4.1 Workspace directory.
4.2 The main frame of GFRS.
4.3 A pop-up frame that gives instructions to the user about which transition to perform.
4.4 The training panel.
4.5 A pop-up frame in which the user can configure the feature vector and the number of states of the HMM.
4.6 The isolated recognition panel.
4.7 Three threads for continuous recognition.
5.1 Sign of letter F in the Leap Motion visualizer.
5.2 Sign of letter E is misrepresented.
5.3 One extended finger recognized as two.
C.1 Class diagram of recognition.
C.2 Class diagram of training.
C.3 Class diagram of data recording.
List of Tables

3.1 Candidate features used in our system.
3.2 Top ten most frequently used letter pairs.
4.1 Classes used in our system from Jahmm.
4.2 Extra classes added to Jahmm.
5.1 The 9 feature vectors used in the experiments.
5.2 Experimental results on different feature vectors. "FV": the feature vector used. "DIM": the dimension of the feature vector. "NS": the number of states of the HMM. "T1": time (ms) needed to train the 100 models. "T2": average time (ms) needed to run an isolated recognition. "DS": size of the HMM database the experiment is based on.
5.3 Experimental results on different numbers of states when using feature vectors "E" and "G".
5.4 Experimental results on different sizes of the HMM database.
5.5 The 31 letter to letter transitions in the ten names.
5.6 Experimental results on continuous recognition.
B.1 The 100 transitions with their occurrence probabilities.
Abbreviations

GFRS German Finger Spelling Recognition System
HMM Hidden Markov Model
ASL American Sign Language
DGS Deutsche Gebärdensprache
SVM Support Vector Machines
RF Random Forest
HCI Human Computer Interaction
FSM Finite State Machines
AFR Auslan Finger-spelling Recognizer
JSON JavaScript Object Notation
API Application Programming Interface
GMM Gaussian Mixture Model
HOC Hand Orientation Change
HPC Hand Position Change
WCSS Within-Cluster Sum of Squares
PDF Probability Density Function
EM Expectation-Maximization
GBC German Bigram Counter
IDE Integrated Development Environment
Symbols
N number of states
T number of observations
St hidden state at time t
O a sequence of observations
Ot the observation at time t
aij the probability of the transition from state i to j
A the transition probability matrix
bj(Ot) the probability of observing Ot in state j at time t
B the emission probability of an HMM
λ the parameter set of an HMM
π the initial state distribution
ωi the ith component weight of a Gaussian Mixture Model
µi the mean vector of the ith component
Σi the covariance matrix of the ith component
ψ the parameter set of a GMM
OK the set of K observation sequences
O(i) the ith sequence of an observation sequence set
To My Parents.
Chapter 1
Introduction
There are tens of millions of Deaf1 and hard of hearing people all over the world. For this
special group, sign languages are mainly used for communication purposes. However, the
linguistic structure of sign languages differs from that of spoken languages in aspects
such as grammar, vocabulary and word order [2]. Furthermore, many Deaf people
suffer from literacy deficiency [2, 3]: it is hard for them to read text in a fluid manner,
because written text is often a transcription of oral language, and Deaf people have
great difficulties acquiring oral language due to the lack of acoustic feedback. All these
facts make it hard for them to learn and to communicate with the rest of society.
Therefore, a system that can make written information more accessible to Deaf people is needed.
Recently, a number of human computer interaction (HCI) devices appeared on the
market (e.g., the Kinect, the Leap Motion). These devices use field-proven methods to
capture the motion of the human body. They were initially developed for gaming
applications, but have also been widely used in interaction research, rehabilitation [4],
computer vision [5] and 3D reconstruction [6, 7]. One might however wonder if these
devices can also be applied to improve Deaf people's lives. Indeed, companies such as
"MotionSavvy"2 claimed that they could develop an application capable of translating
isolated words from American Sign Language (ASL) into voice messages at conversational
speed using the newly introduced Leap Motion Controller [8]. However, the product has
not come to market yet and we could not find any scientific publications to support this
claim. Thus, we would like to evaluate how relevant the Leap Motion could be in the
context of Sign Language recognition. To make things easier, our focus will be on German
finger spelling recognition. Finger spelling is a small subset of Sign Language and consists
of letters of the alphabet represented with one or two hands. It is mostly used to represent
names and technical terms that are not defined in the sign language vocabulary.

1 We follow the convention of writing Deaf with a capitalized "D" to refer to members of the Deaf community [1] who use sign language as their preferred language, whereas deaf refers to the audiological condition of not hearing.
2 Information about the company can be found at http://www.motionsavvy.com
1.1 German Dactylology
There are 31 signs in the German finger spelling alphabet ("Fingeralphabet" in German),
as shown in Figure 1.1. Besides the 30 basic letters, the very frequently used combination
"sch" is also defined. German Sign Language (DGS, Deutsche Gebärdensprache) uses
a one-handed alphabet. Most of the letters are represented by static postures, except
"J", "Z", "Ä", "Ö", "Ü" and "ß".
Figure 1.1: German finger spelling alphabet from http://www.visuelles-denken.de/Schnupperkurs3.html.
1.2 Introduction to State of the Art Gesture Recognition
1.2.1 General Process
Generally speaking, there are two main tasks involved in gesture recognition [9]: feature
extraction and feature classification. A feature, in the context of finger spelling recognition,
is a quantity used to describe the static and dynamic properties of the hands
performing the gesture in a specific frame. It can be global (e.g., position and motion of
the hand) or local (e.g., angle between two fingers, orientation of the hand). Usually, a
set of features is used together to characterize a frame; this set is called a feature vector.
The purpose of feature extraction is to find a feature vector (static gesture) or a set of
feature vectors (dynamic gesture) corresponding to a gesture. A mathematical
model that best describes the feature vector(s) is then built. The model takes the form
of equations which contain a set of parameters, and we call the process of optimizing these
parameters model training.
Feature classification determines which gesture class the extracted feature vector(s) belong to.
Different classification tasks call for different algorithms. Some of the common
methods are Support Vector Machines (SVM) [10], Template Matching [11], Neural
Networks [12] and Hidden Markov Models (HMM) [13].
1.2.2 Input Methods
Input method refers to the nature of the data acquired by the capture device
during the performance of a signer. The most common input methods used
for gesture recognition in the past few years are based on computer vision and on sensors.
For vision-based methods, one or several cameras are used to provide frames from the
captured video sequences [9]. To extract useful features from a specific frame, image
processing methods like segmentation need to be applied first. These algorithms might be
time consuming, which is not ideal for real-time gesture recognition. In addition, vision-based
methods have strict requirements on the environmental conditions; for example,
bad lighting conditions may affect image interpretation significantly and finally lead to
low recognition performance.
In terms of sensor-based methods, cyber gloves [14] are a commonly used device. The
signer wears a glove equipped with sensors which provide hand tracking information
on the position, rotation, movement and orientation of the hand. By using these kinds of
gloves, features can be obtained directly from the information returned by the sensors,
and image processing is no longer needed. However, wearing a glove full of wires while
performing a gesture is inconvenient and the device itself is quite expensive.
Recent consumer-range acquisition devices like the Leap Motion Controller [8] and the
Microsoft Kinect [15] have drawn a lot of attention in the field of gesture recognition. Compared
to the Kinect, the Leap Motion is explicitly targeted at hand gesture recognition and
allows us to access the positions of the fingertips and the hand orientation directly, which
might be an opportunity for robust gesture recognition.
1.2.3 Static Gesture Recognition
Static gesture recognition methods are used when the gesture is kept still during the
time window allowed for the recognition, as is the case for most letters in the German
finger spelling alphabet. To recognize these kinds of gestures, classifiers like Template
Matching and Neural Networks can be used. The most important task of static
gesture recognition is to figure out how the individual parts of the object (e.g., hands,
body) performing a gesture are arranged in relation to each other. A lot of research
has been conducted on static gesture recognition.
Schmidt et al. [16] introduce a methodology for real-time static gesture recognition
capable of dealing with the sparse data provided by the Leap Motion Controller. The
system extracts features that characterize the frame from the Leap Motion; the
whole feature vector F is composed of four vectors measuring different aspects of the
hand:
• The first vector f1 is a 5-dimensional vector containing the distance of each fingertip
to the hand center.
• The second vector f2 is a 4-dimensional vector containing the angles between the
vectors of adjacent fingers.
• The third vector f3 is a 5-dimensional vector holding the angle between each finger
vector and the hand's normal.
• The fourth vector f4 is a 2-dimensional vector holding the radius of the sphere created
by the hand's curvature and the number of fingers detected.
where the finger vector for finger i is computed by υi = pi − c, with pi the fingertip
position and c the center of the hand. They use F = {f1, f2, f3, f4} to train two
classifiers: an SVM and a Random Forest (RF) [17]. The experiments are based on a data set
with 11 different gestures performed by 6 users. The results show that their methodology
provides an accuracy of up to 94.26% when using RF with 100 trees and a depth of 25,
and an accuracy of 89.64% when using SVM. When the paper was published, the Leap
Motion could only provide information on eight 3D points per frame (the hand center, the
positions of the five fingertips, the normal of the palm and the radius of the sphere created
by the hand's curvature). The data is extremely sparse, but the way they use
sparse positional data to create a feature set is instructive.
The study of Marin et al. [18] is similar to the work of Schmidt and his colleagues [16]:
both focus on static gesture recognition using the Leap Motion. When choosing
the feature set, apart from the angle and distance information, they also introduce the
concept of fingertip elevation, which represents the distance of the fingertip from the
plane corresponding to the palm region (accounting also for the fact that the fingertips
can belong to either of the two semi-spaces defined by the palm plane). The feature
set is fed into a multi-class SVM classifier in order to recognize the performed gestures.
Combined with a set of depth features computed from the Kinect, the system can achieve
an accuracy of 75% on a database of 10 gestures.
1.2.4 Dynamic Gesture Recognition
Considering the fact that most gestures used in daily communication and HCI are
dynamic gestures, a lot of approaches have been proposed to deal with dynamic gesture
recognition. When extracting features, apart from the relation of the individual parts of the
object performing the gesture, the variation over time should also be taken into
consideration. Common methods used to model dynamic gestures include HMMs, Finite State
Machines (FSM) [19], etc. Some research related to dynamic gesture recognition is
listed below.
In Paul Goh's PhD thesis [9], he presents the Auslan Finger-spelling Recognizer (AFR),
a system capable of extracting and recognizing finger spelled letters of the Auslan
manual alphabet from monocular video sequences. In his system, each signed letter is
modeled using a single HMM, due to the fact that Auslan finger spelling uses both hands
and is inherently dynamic. The system uses a single USB camera for image recording.
In the feature extraction phase, skin regions are detected and a set of features
including geometric features and an optical flow motion descriptor is extracted from
the video frames. The features he uses are:
• Geometric Features
– Left hand angle of orientation
– Left hand area
– Left hand major axis length
– Left hand minor axis length
– Right hand angle of orientation
– Right hand area
– Right hand major axis length
– Right hand minor axis length
• Motion-based features
– X-velocity optical flow histogram (bins for ranges -4 to 4)
– Y-velocity optical flow histogram (bins for ranges -4 to 4)
The letter models can be obtained using isolated training and further refined using
embedded training. Tests are based on a vocabulary of twenty signed words; the results
show that the system performs best when using a finite state grammar
network and embedded training, with an accuracy of 97% at the letter level and 88% at
the word level. Although the results are promising, the system is not a real-time system,
which means it can only recognize gestures from pre-recorded video sequences.
A library for gesture recognition dedicated to the Leap Motion Controller is presented
in the bachelor's thesis of Nowicki et al. [20]. This library aims at helping developers
build applications using gestures as a human-computer interface. The library supports
two kinds of gestures: static gestures and dynamic gestures. For static gestures, the
recognition is based on SVM. They tested their system using different feature sets on
vocabularies of different sizes. The experimental results show that the system can achieve
an accuracy of 99% on a five-gesture vocabulary and 85% on a ten-gesture vocabulary
when using pre-processing to remove noise from the training data, together with the feature set:
• The number of fingers in a frame.
• The distances from the fingertips to the position of the hand's palm.
• The angles between the fingers and the normal of the hand's palm.
• The five greatest values of the angles between all combinations of finger pairs.
• The five greatest values of the fingertip distances between all combinations of finger pairs.
When it comes to dynamic gestures, the recognition is based on HMMs. In terms of the
feature vector, in addition to the features used in static gesture recognition, the speed of
the hand as well as the magnitude of the hand's displacement are introduced. The best
recognition rate, 80%, is achieved when testing on a 6-gesture vocabulary using a 6-state
HMM. For the first time, their work includes the recognition of dynamic gestures using
the Leap Motion. However, the recognition rate is not satisfying, since the test is based
on a very small gesture vocabulary.
1.3 Our Objective
From the related work we can see that most studies in the past are dedicated
to a small gesture vocabulary. In the literature so far, there is no system that
can recognize continuous finger spelling on the fly. Therefore, we want to evaluate the
Leap Motion by developing a German finger spelling recognition system (GFRS) which
is capable of recognizing continuous German finger spelling in real time.
Chapter 2
The Proposed Method
In this chapter, we present the methodology used to design and implement the German
finger spelling recognition system (GFRS). Firstly, the motion capture device we use, namely
the Leap Motion Controller, is introduced, together with its most interesting features
and the reasons why we chose it as our input device. Secondly, we explain the
concept of letter to letter transition, which is used to build the statistical model. Lastly,
a tutorial on the fundamentals of HMMs and on how they can be adapted to our system is
given.
2.1 The Leap Motion Controller
The Leap Motion Controller is a relatively small device with dimensions of 3 x 1.2 x 0.5
inches, designed for HCI purposes. Its first version was announced on May 21, 2012 by
Leap Motion, Inc.
The device uses optical sensors and infrared light to recognize and track hands, fingers
and finger-like tools at a frame rate of approximately 300 frames per second. It uses a
right-handed Cartesian coordinate system with the origin centered at the top of the device,
as shown in Figure 2.1. The sensors point along the y-axis and have a field of view of about 150
degrees. The effective range of the Leap Motion is between 25 and 600 millimeters
above the device. Combining the data from the sensors with a built-in hand
model, the device can deal with challenging conditions (e.g., part of one hand being covered
by the other).
One of the most appealing features of the Leap Motion is its skeletal tracking model,
which provides bone information in addition to the palm and fingertips. Combined
with the coordinate system, this lets us easily access the positions of the finger joints, the
centers of the bones and the fingertips, as marked by green balls in Figure 2.2.
Figure 2.1: The right-handed Cartesian coordinate system of the Leap Motion.
Apart from positional information, the Leap Motion also provides directional information on
fingertips and palms, as can be seen in the Leap Motion Diagnostic Visualizer in Figure 2.3.
All the information contained in a frame returned by the Leap Motion can be encoded as a
JavaScript Object Notation (JSON) object, as shown in Appendix A. The company also provides
a dedicated application programming interface (API) for different programming languages, with
which developers can acquire the information they need by simply invoking a function. In our
project, we use the Java language [21].
Figure 2.2: The skeletal tracking model of the Leap Motion enables us to access the position of each joint.
Figure 2.3: The directional information represented by arrows that the Leap Motion can provide.
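As a brief illustration of this API, the following Java sketch polls the most recent frame and prints the palm and fingertip positions. It is a minimal sketch assuming the Leap SDK v2 Java binding (classes Controller, Frame, Hand, Finger and Vector) and a running Leap service; exact class and method names may differ between SDK versions.

    import com.leapmotion.leap.Controller;
    import com.leapmotion.leap.Finger;
    import com.leapmotion.leap.Frame;
    import com.leapmotion.leap.Hand;
    import com.leapmotion.leap.Vector;

    public class FramePollingSketch {
        public static void main(String[] args) throws InterruptedException {
            Controller controller = new Controller();
            Thread.sleep(1000); // give the controller time to connect to the service

            Frame frame = controller.frame(); // the most recent tracking frame
            for (Hand hand : frame.hands()) {
                Vector palm = hand.palmPosition(); // millimeters, device coordinates
                Vector normal = hand.palmNormal(); // unit vector pointing out of the palm
                System.out.printf("palm (%.1f, %.1f, %.1f), normal (%.2f, %.2f, %.2f)%n",
                        palm.getX(), palm.getY(), palm.getZ(),
                        normal.getX(), normal.getY(), normal.getZ());
                for (Finger finger : hand.fingers()) {
                    Vector tip = finger.tipPosition(); // fingertip position, also in mm
                    System.out.printf("  %s tip (%.1f, %.1f, %.1f)%n",
                            finger.type(), tip.getX(), tip.getY(), tip.getZ());
                }
            }
        }
    }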
2.2 Letter to Letter Transition
Most letters composing the German finger spelling alphabet are represented by static
postures. The common method for recognizing a static posture is to capture a
single frame corresponding to the posture and then feed it to an SVM [10] or another static
gesture classifier. This is so-called isolated letter recognition. However, we are looking
for a system that can continuously recognize letters in real time at a relatively high
speed (2 letters per second). In a system based on isolated letter recognition, the
system would need to select which frames to classify into letters during the performance of a
signer, which is a difficult task within such a short period of time.
The study of Susanna Ricco and Carlo Tomasi [22] shows us a new perspective: words
can be seen as consisting of letter to letter transitions. For example, as shown in Figure 2.4,
the word "owl" can be seen as a composition of two transitions: "o to w" and "w to l".
Figure 2.4: The word "owl" consists of two transitions: o to w and w to l. Picture source: http://www.cafepress.com/+finger-spelling+journals
If we model transitions instead of isolated letters, the static gesture recognition problem
is converted into a dynamic gesture recognition problem. During the performance of a
signer, there is no need to find a specific frame corresponding to a letter, which is an
error-prone process at conversational speed. On the contrary, all the frames recorded
during the performance of the transition provide useful information for building the
model, which we believe is more reliable than isolated letter modelling. In addition, there
is no "gap" between two transitions: the end of one transition is exactly the start of the
next one. If we can find the end or start points of transitions properly, continuous
recognition becomes easier, reducing to a series of consecutive isolated recognitions.
2.3 Hidden Markov Model
One of the biggest challenges of activity recognition is dealing with the variation between
different signers; in our case, this means that there are always more or less pronounced
differences between two signers performing the same transition. These variations are mainly
caused by the habits of the signers and can hardly be avoided. As a matter of fact, even
a single signer cannot perform exactly the same movement twice. Therefore, a statistical
model is appropriate. Another fact we need to consider is the variation occurring along
the time dimension, that is, the dynamics of the gesture or of the individual parts of
the hand performing it. The hidden Markov model [23] is a statistical model that
can deal with these kinds of variations. Its state-to-state transition mechanism enables
it to capture the changes of a signal over time, which is ideal for activity recognition.
2.3.1 Basic Structure
A hidden Markov model (HMM) is a statistical model in which the system being modeled
is assumed to be a Markov process with unobserved states [23]. The diagram in
Figure 2.5 shows the basic structure of an HMM. Each circle represents a state; we use a
random variable St to denote the hidden state at time t (with the model in the diagram,
we have St ∈ {1, 2, 3, 4, 5}). The random variable Ot is the observation1 at time t,
generated by the current state with probability bj(Ot) (in the diagram, we have b2(Ot),
b3(Ot) and b4(Ot), corresponding to states 2, 3 and 4 respectively). Given an observation Ot,
we cannot tell by which state it was generated; this is where the name "hidden" comes
from. The arrows in the diagram denote conditional dependencies, which means a state
can only be reached from the states that have arrows pointing to it. For example, state 4
can only be reached from state 2, state 3 and itself. The variable aij above an arrow is
the probability of the transition from state i to state j, namely the transition probability.
The conditional probability distribution of the hidden variable St+1 at time t+1 (the future
state) depends only on the value of the hidden variable St at time t (the present state), i.e.,
the values before time t are irrelevant. This is the so-called Markov property. Similarly,
the value of the observed variable Ot depends only on the value of the hidden variable
St, both at time t.
Figure 2.5: A sample of a 5-state hidden Markov model.
1 Observation is a basic term of HMMs; in this thesis, "observation" has the same meaning as "feature vector", as will be discussed later. Similarly, training sequence and recognition sequence are both sequences of observations (observation sequences) in a general sense.
2.3.2 Parameter Set
The state space of the hidden variables is discrete; we use N to denote the number of states,
i.e., the number of elements in this space.
Generally speaking, there is a transition probability from each state to every state, including
itself. If there is no transition between two states, the transition probability is simply
set to 0. Therefore, for an N-state HMM, there are $N^2$ possible transition probabilities,
denoted by an $N \times N$ matrix $A$. The $i$th row of $A$ satisfies the constraint
$\sum_{j=1}^{N} a_{ij} = 1$ with $1 \leq i \leq N$.
Unlike the state space, the observation space can be either discrete or continuous, depending
on the nature of the observed variable. If the observed variables are from a finite
integer set, the observation space is discrete. On the other hand, the observation space
is continuous when the observed variables are, for example, high-dimensional real-valued
vectors. For each state, there is a probability distribution governing the distribution of the
observed variable at a certain time, given the state of the hidden variable at that time. We
use B to denote the emission probability distributions of all the states. In addition, for the
standard type of HMM, we cannot tell which state generates the first observation. Thus, an
N-dimensional vector π is used to denote the initial state distribution. Therefore, the
complete parameter set of an HMM can be indicated by the compact notation:
λ = (A,B, π). (2.1)
2.3.3 Three Basic Problems for HMM
A classic paper written by Lawrence R. Rabiner in 1989 [13] gave a concrete, methodical
review of HMMs. It posed three basic problems for HMMs, together with their solutions in
mathematical form, which form the basis of the HMM based activity recognition applications
of the past decades.
2.3.3.1 Problem 1
Given the observation sequence O = O1 O2 · · · OT and a model λ = (A, B, π), how do we
efficiently compute P(O|λ), the probability of the observation sequence given the model?
This is the evaluation problem and can be solved efficiently by using the first (forward)
part of the Forward-Backward algorithm. In his tutorial [13], Rabiner solves the problem by
induction, described as follows:
The forward variable $\alpha_t(i)$ is defined first:

$\alpha_t(i) = P(O_1 O_2 \cdots O_t, q_t = S_i \mid \lambda),$  (2.2)

which is the probability of the partial observation sequence $O_1 O_2 \cdots O_t$ (until time $t$) and state $S_i$ at time $t$, given the model $\lambda$. Then $\alpha_t(i)$ can be computed inductively:

1. Initialization:
$\alpha_1(i) = \pi_i\, b_i(O_1), \quad 1 \leq i \leq N.$  (2.3)

2. Induction:
$\alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \Big] b_j(O_{t+1}), \quad 1 \leq t \leq T-1, \; 1 \leq j \leq N.$  (2.4)

3. Termination:
$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i).$  (2.5)
The solution will be used for recognition as will be illustrated in the next chapter.
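To make the induction concrete, here is a minimal Java sketch of the forward pass for a discrete-observation HMM (our system uses Gaussian mixture emissions instead, but the control flow is identical; all names are ours):

    class ForwardAlgorithmSketch {
        /** Forward algorithm: returns P(O | lambda) for a discrete HMM.
            pi[i]   initial state distribution
            a[i][j] transition probability from state i to state j
            b[i][k] probability of emitting symbol k in state i
            obs[t]  observation symbol at time t */
        static double forward(double[] pi, double[][] a, double[][] b, int[] obs) {
            int n = pi.length;
            double[] alpha = new double[n];
            // Initialization (2.3): alpha_1(i) = pi_i * b_i(O_1)
            for (int i = 0; i < n; i++) alpha[i] = pi[i] * b[i][obs[0]];
            // Induction (2.4): alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
            for (int t = 1; t < obs.length; t++) {
                double[] next = new double[n];
                for (int j = 0; j < n; j++) {
                    double sum = 0.0;
                    for (int i = 0; i < n; i++) sum += alpha[i] * a[i][j];
                    next[j] = sum * b[j][obs[t]];
                }
                alpha = next;
            }
            // Termination (2.5): P(O | lambda) = sum_i alpha_T(i)
            double p = 0.0;
            for (double v : alpha) p += v;
            return p;
        }
    }

The repeated products underflow quickly for long sequences, so practical implementations use scaling or log probabilities; this issue resurfaces in Section 3.3.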
2.3.3.2 Problem 2
Given the observation sequence O = O1 O2 · · · OT and the model λ = (A, B, π), how
do we choose a corresponding state sequence which is optimal in some meaningful sense
(i.e., best explains the observations)?
The purpose of this problem is to uncover the hidden part of an observation sequence. To
solve it, the Viterbi algorithm [24] is applied. Although this problem does not
provide direct support to our system, it helps to solve Problem 3.
2.3.3.3 Problem 3
How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?
This is an estimation problem. Combining the ideas used to solve Problem 1 and
Problem 2, we can solve this problem by using an iterative procedure such as the
Baum-Welch algorithm [13]. The solution tells us how to build an HMM from an
observation sequence. The tutorial also gives a concrete description of how to solve
this problem.
Again, some variables should be defined first:

1. $\beta_t(i)$, the probability of the partial observation sequence from $t+1$ to the end, given state $S_i$ at time $t$ and the model $\lambda$:
$\beta_t(i) = P(O_{t+1} O_{t+2} \cdots O_T \mid q_t = S_i, \lambda).$  (2.6)

2. $\gamma_t(i)$, the probability of being in state $S_i$ at time $t$, given the observation sequence $O$ and the model $\lambda$. It can be expressed in terms of the forward and backward variables:
$\gamma_t(i) = \dfrac{\alpha_t(i)\,\beta_t(i)}{P(O \mid \lambda)} = \dfrac{\alpha_t(i)\,\beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)}.$  (2.7)
The denominator ensures that $\sum_{i=1}^{N} \gamma_t(i) = 1$.

3. $\xi_t(i, j)$, the probability of being in state $S_i$ at time $t$ and state $S_j$ at time $t+1$, given the model and the observation sequence:
$\xi_t(i, j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)} = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}.$  (2.8)

If we sum $\gamma_t(i)$ over $t$ from $1$ to $T-1$, $\sum_{t=1}^{T-1} \gamma_t(i)$, we get the expected number of transitions from $S_i$. Similarly, summing $\xi_t(i, j)$ over the same range, $\sum_{t=1}^{T-1} \xi_t(i, j)$, gives the expected number of transitions from $S_i$ to $S_j$. Using these two quantities, we can form a set of formulas to re-estimate $\lambda = (A, B, \pi)$.
The initial state distribution:
$\bar{\pi}_i = \gamma_1(i).$  (2.9)

The transition probabilities:
$\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}.$  (2.10)

The emission probabilities:
$\bar{b}_j(k) = \dfrac{\sum_{t=1,\ \text{s.t.}\ O_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}.$  (2.11)
It has been proven by Baum and his colleagues that we always obtain an equally good
or better fitted model $\bar{\lambda}$ when using the current model $\lambda$ to calculate
the right-hand sides of Equations 2.9-2.11.
2.3.4 HMM Used in Our System
2.3.4.1 Topology - Left to Right
There are many kinds of HMMs. So far we have discussed a fully connected HMM, in
which a state can be reached from itself or from any other state in a finite number of steps.
However, this kind of HMM is not appropriate for modeling time signals. In our system,
the time signal is a sequence of feature vectors recorded by the Leap Motion during
the performance of a letter to letter transition by a signer; we call it the observation
sequence or the training sequence. To build a more suitable model, a special type of
HMM, namely the left-right or Bakis model, is used. It has the property that the state
index increases as time goes on, which makes it capable of modeling signals whose properties
change over time.
For a left-right HMM, the transition probability matrix A is an upper triangular matrix,
and the probability that the first observation is generated by the first state is always
one, which means the initial state distribution π is fixed. In the later chapters, we will
use λ = (A, B) to denote the parameter set of the HMMs used in our system.
2.3.4.2 Emission Probability - Mixture of Multivariate Gaussians
The emission probability distribution for each state of the HMM is modeled by a Gaussian
Mixture Model (GMM). A GMM is a parametric probability density function represented
as a weighted sum of Gaussian component densities and is commonly used
as a parametric model of the probability distribution of continuous measurements or
features in a biometric system [25]. It has the ability to form smooth approximations
to arbitrarily shaped densities with a linear combination of different Gaussian
distributions. Because the feature vector in our system is high-dimensional with continuous
elements, each component of the GMM is represented by a multivariate Gaussian
distribution. Mathematically, a mixture of M multivariate Gaussian components
is given by:
$p(\mathbf{x}) = \sum_{i=1}^{M} \omega_i\, g(\mathbf{x} \mid \boldsymbol{\mu}_i, \Sigma_i),$  (2.12)
where $\mathbf{x}$ is a D-dimensional continuous-valued feature vector, the $\omega_i$ are the mixture weights
satisfying the constraint $\sum_{i=1}^{M} \omega_i = 1$, and $g(\mathbf{x} \mid \boldsymbol{\mu}_i, \Sigma_i)$ are the probability density functions
of the components. Each component density has the form:
$g(\mathbf{x} \mid \boldsymbol{\mu}_i, \Sigma_i) = \dfrac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)' \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right\},$  (2.13)
where $\boldsymbol{\mu}_i$ is the mean vector and $\Sigma_i$ is the covariance matrix. The complete model is
composed of the mixture weights, the mean vectors and the covariance matrices of all
components. For convenience, the mixture model is denoted by:

$\psi = \{\omega_i, \boldsymbol{\mu}_i, \Sigma_i\}, \quad i \in \{1, \ldots, M\}.$  (2.14)
Because the overall feature density is composed of a set of Gaussian components, full
covariance matrices are not needed [25]: the correlations between the elements of the
feature vector can be modeled by a linear combination of Gaussians with diagonal
covariances. All the formulas from 2.3 to 2.8 contain the calculation of emission
probabilities, which indicates that a lot of matrix computations are involved. The usage
of diagonal covariances greatly reduces the computational complexity during training
(solving Problem 3) and recognition (solving Problem 1).
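As a sketch, with diagonal covariances the determinant in Equation 2.13 becomes a product of per-dimension variances and the quadratic form becomes a sum over dimensions, so the GMM of Equation 2.12 can be evaluated without any matrix algebra. The following Java fragment (all names are ours) computes the log-density of one feature vector under such a model:

    class GmmDensitySketch {
        /** Log-density log p(x) of a diagonal-covariance GMM (Equations 2.12/2.13).
            w[i] component weights, mu[i][d] means, var[i][d] per-dimension variances. */
        static double gmmLogDensity(double[] x, double[] w, double[][] mu, double[][] var) {
            double density = 0.0;
            for (int i = 0; i < w.length; i++) {
                // With a diagonal covariance, |Sigma| is the product of the variances and
                // the quadratic form is a sum over dimensions, so log g decomposes per axis.
                double logG = 0.0;
                for (int d = 0; d < x.length; d++) {
                    double diff = x[d] - mu[i][d];
                    logG += -0.5 * Math.log(2.0 * Math.PI * var[i][d])
                            - diff * diff / (2.0 * var[i][d]);
                }
                density += w[i] * Math.exp(logG);
            }
            return Math.log(density);
        }
    }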
Chapter 3
System Architecture
This chapter discusses the system architecture of our German finger spelling recognition
system (GFRS) and is divided into four sections: the first section explains how we choose the
features that represent the shape and dynamics of a hand, the second and third sections give
details on model training and recognition respectively, and the last section introduces
a linguistic tool, the bigram, and explains how it is adapted to our system.
GFRS is composed of four modules: the feature extractor, the training module, the
recognition module and the HMM database. The block diagram in Figure 3.1 gives an
intuitive illustration of how the four modules work with each other. Two
processing pipelines are presented in the diagram. The first one, represented by red arrows,
is the training. Through the training pipeline, we obtain the HMM database, which
is a precondition of the second pipeline, namely the recognition, represented by
green arrows.
Figure 3.1: Block diagram of system architecture.
3.1 Feature Extractor
Phonetics, in the case of sign languages, is concerned with the articulation of the hands: the
various hand shapes, locations, orientations and movements. In 1960, William Stokoe
introduced the first phonetic representation system [26] to describe signs. In his system,
a sign is divided into three aspects (tab, dez, and sig) describing the position, configuration
and motion of the hands respectively. In 1989, Liddell and Johnson [27, 28] pointed out
that Stokoe's system ignores the internal segmental structure of signs and proposed the
Movement and Hold model. In their model, signs are composed of Movement (M) and
Hold (H) segments. An M is a segment during which some aspect of the articulation is in
transition, while an H is a segment in which all aspects of the sign are steady. There are also
other well known transcription models, for example SLIPA [29] (An International
Phonetics Alphabet for Signed Languages). All these models break signs into a set of
abstract parts, which brings us to the task: how to describe the abstract quantities with real
values so as to build a mathematical model?
In order to build a transition model, one first needs to record a sufficient number of
training sequences. Each training sequence contains a series of frames of tracking data
from the Leap Motion, recorded during the performance of a transition. The task of the
feature extractor is to take some of the features of a frame and form a vector that represents
the hand's static and dynamic properties so as to distinguish this frame from others:
the so-called feature vector. As can be seen from the system architecture, the input for
both the training and the recognition module is a sequence of feature vectors. We believe that
the feature vector plays a crucial role in our finger spelling recognition system and must
be selected carefully. Features are selected by analysing the properties of the signs in the
German finger spelling alphabet. There are 31 signs in the alphabet, which gives us
31 × 30 = 930 transitions in total (not including self transitions). Some of the transitions
share common properties, and the transitions can be divided into three categories.
• Category 1 - Only the hand shape changes.
The simplest kind of transition involves neither hand movement nor a change of hand
orientation. This is the case when the signs of both letters in the transition
pair are static and have the same palm direction, for instance the A to B transition
in Figure 3.2. During these kinds of transitions, only the hand shape changes.
Thus, we only need to find features which describe the static hand
arrangement, that is, how the individual parts of the hand are arranged in relation
to each other.
Figure 3.2: A to B transition changes only the hand shape.
Features like the distance between two fingertips, as well as the distance of one
fingertip to a specific point of the hand (e.g., the hand center), are usually used to
represent a hand shape [30]. Taking advantage of the skeletal tracking model of
the Leap Motion, the positions of the finger joints are also taken into consideration;
for example, the distance between the thumb fingertip and a finger joint can be
used to distinguish letters whose signs have only slight differences (e.g., M and N
in Figure 3.3). However, the features mentioned above are low level features [31].
Inspired by Sandler's phonological model of hand shape [32], the degree of openness
of a finger is also chosen as a potential feature. In Figure 3.4, the length of the
vertical red line is a measure of finger openness; the length is maximal when the
finger is fully closed and minimal when the finger is fully opened.
Figure 3.3: Signs of M and N have slight differences.
Figure 3.4: An illustration of the concept of finger openness.
• Category 2 - Hand shape and orientation change.
A number of signs differ only in the orientation of the hand (e.g., D and G
in Figure 3.5). If only the features derived from category 1 were used to form the
feature vector, we could not tell the difference between the two transitions "A
to D" and "A to G".
In order to distinguish these kinds of transitions, information on the hand orientation
should be included in the feature vector. We call it "Hand Orientation Change
(HOC)". The orientation can change around the different axes of the Leap Motion; thus
HOC is a three-dimensional vector containing the orientation change around the X-axis
(HOC-X), the Y-axis (HOC-Y) and the Z-axis (HOC-Z). We take the beginning frame
of a transition as a reference and set all the elements to 0.0; the values in the
subsequent frames are the angles of rotation (from 0 to 180 degrees) around the
rotation axes compared to the reference frame.
• Category 3 - Hand shape and position change.
The posture of letter "J" is the same as that of letter "I", except that it contains an
extra hand movement, as shown in Figure 3.6. A double-valued feature is used to
capture the hand movement between two frames. We obtain this feature with a method
very similar to the one used in category 2; it even has a similar name, namely
"Hand Position Change (HPC)". We again choose the beginning frame as a reference
and set the corresponding value to 0.0; the value of HPC in a subsequent frame
is the distance from the hand center of the current frame to the reference hand center.
Figure 3.5: Signs of D and G. Figure 3.6: Signs of I and J.
It is worth noticing that all the features are relative quantities. The absolute
positions of the hand and fingers are not relevant as features, because a gesture is always
the same no matter where it is performed, but they can be used to derive other meaningful
features. Combining the features derived from the three categories, we get the final
feature vector. Table 3.1 lists all the candidate features, from which any
combination can be chosen to form a feature vector. The features from categories 2 and 3 are
essential for recognizing transitions that involve orientation and position changes of the hand,
while the number of features used from category 1 has a big influence on the accuracy of the
recognition. On the one hand, if only a small number of features from category 1 is used, it is not
sufficient to characterize a hand shape well enough to distinguish similar transitions. On the other
hand, if we use too many features, a larger training data set is needed.
Number  Code    Feature Description                 Unit     Category
1       TC      Thumb to Hand Center                mm       1
2       IC      Index to Hand Center                mm       1
3       MC      Middle to Hand Center               mm       1
4       RC      Ring to Hand Center                 mm       1
5       PC      Pinky to Hand Center                mm       1
6       TI      Thumb to Index                      mm       1
7       IM      Index to Middle                     mm       1
8       MR      Middle to Ring                      mm       1
9       RP      Ring to Pinky                       mm       1
10      EFC     Extended Fingers Count              count    1
11      IO      Index Finger Openness               mm       1
12      HOC-X   Hand Orientation Change X-axis      degree   2
13      HOC-Y   Hand Orientation Change Y-axis      degree   2
14      HOC-Z   Hand Orientation Change Z-axis      degree   2
15      HPC     Hand Position Change                mm       3

Table 3.1: Candidate features used in our system.
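To make the table concrete, the sketch below assembles one possible feature vector (the five fingertip-to-center distances, HOC and HPC) for a single frame. It assumes the Leap SDK v2 Java binding, where Hand.rotationAngle and Hand.translation report motion relative to an earlier reference frame; the actual feature combination is configurable in our system, so this is only an illustration, not the definitive extractor.

    import com.leapmotion.leap.Finger;
    import com.leapmotion.leap.Frame;
    import com.leapmotion.leap.Hand;
    import com.leapmotion.leap.Vector;

    public class FeatureExtractorSketch {
        /** Builds one feature vector (a subset of Table 3.1) for the given frame.
            referenceFrame is the first frame of the transition, the HOC/HPC baseline. */
        static double[] extract(Frame frame, Frame referenceFrame) {
            Hand hand = frame.hands().frontmost();
            Vector center = hand.palmPosition();

            double[] fv = new double[9];
            // Category 1: fingertip-to-hand-center distances (TC, IC, MC, RC, PC),
            // indexed thumb = 0 ... pinky = 4 via the finger type
            for (Finger finger : hand.fingers()) {
                fv[finger.type().ordinal()] = finger.tipPosition().distanceTo(center); // mm
            }
            // Category 2: hand orientation change around each axis (HOC-X/Y/Z),
            // measured against the reference frame, mapped to 0..180 degrees
            fv[5] = Math.toDegrees(Math.abs(hand.rotationAngle(referenceFrame, Vector.xAxis())));
            fv[6] = Math.toDegrees(Math.abs(hand.rotationAngle(referenceFrame, Vector.yAxis())));
            fv[7] = Math.toDegrees(Math.abs(hand.rotationAngle(referenceFrame, Vector.zAxis())));
            // Category 3: hand position change (HPC), distance moved since the reference frame
            fv[8] = hand.translation(referenceFrame).magnitude(); // mm
            return fv;
        }
    }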
3.2 The Training
The purpose of the training is to build a database containing all the transition models. In
our system, each model is trained separately. Every training sequence for a transition
is composed of, on average, 60 frames returned by the Leap Motion during the performance
of a signer. To make the models more reliable, at least 10 training sequences are
used to train one model, which means the signer has to perform the same transition
several times. Once the training sequences for one transition are prepared, a first
approximated HMM is built. After that, an iterative procedure takes the training data
and the approximated HMM as arguments to the Baum-Welch algorithm, which re-estimates the
transition probabilities (A) and emission probabilities (B) and outputs a new HMM. The
iterative procedure stops when the new model no longer improves the
probability that the training sequences are generated by it, compared to the model from the
last iteration. Figure 3.7 gives an intuitive illustration of the procedure as a flow chart.
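In code, this pipeline can be summarized roughly as follows. This is only a sketch: Hmm is a placeholder model type, and segment, initTransitions, initEmissions, averageLength, baumWelchStep and logLikelihood are hypothetical helpers standing for Sections 3.2.1.1, 3.2.1.2, 3.2.1.3 and 3.2.2.

    class TrainingPipelineSketch {
        /** Trains one transition model from its recorded sequences (sketch only).
            sequences[k][t] is the t-th feature vector of the k-th training sequence. */
        static Hmm train(double[][][] sequences, int numStates, int maxIterations) {
            // First approximation (Section 3.2.1)
            int[][] segmentation = segment(sequences, numStates);                    // 3.2.1.1
            Hmm hmm = new Hmm(initTransitions(numStates, averageLength(sequences)),  // 3.2.1.2
                              initEmissions(sequences, segmentation));               // 3.2.1.3
            // Iterative re-estimation (Section 3.2.2)
            double previous = Double.NEGATIVE_INFINITY;
            for (int iteration = 0; iteration < maxIterations; iteration++) {
                Hmm candidate = baumWelchStep(hmm, sequences);
                double likelihood = logLikelihood(candidate, sequences); // log P(O | lambda)
                if (likelihood <= previous) break; // no improvement: treat as converged
                hmm = candidate;
                previous = likelihood;
            }
            return hmm;
        }
    }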
3.2.1 HMM Initialization
According to Rabiner's tutorial [13], the Baum-Welch algorithm leads to local maxima
only. In addition, when the dimension of the feature vector is high or the observation
sequences are very long, it might take a very long time to reach convergence. To
improve efficiency, only a limited number of iterations is applied during the training.
In order to obtain a more suitable model within this limited number of iterations, we pay
extra attention to the first approximation of the HMM.
Actually, there is no straightforward method to initialize an HMM. Experiments show
that both random and uniform initial estimates of A give useful re-estimation results in most applications.
[Figure 3.7 shows the training pipeline as a flow chart: Training Sequences → Sequence Segmentation → Initial HMM → Baum-Welch Algorithm → New λ = (A, B) → Convergence? If no, the Baum-Welch step is repeated; if yes, the final λ is output.]
Figure 3.7: Flow chart of the training pipeline.
However, an appropriate initialization of B is essential for HMMs
with mixtures of continuous emission probability distributions. This can be done by
using sequence segmentation.
3.2.1.1 Sequence Segmentation
Given several training sequences, the very first step in initializing an N-state HMM is to
segment each training sequence into N groups, each labeled with a number from 1 to N.
When all the training sequences have been segmented, all groups with the same label
are merged into one group. Finally, we allocate each merged group to the
corresponding state and use it to initialize the emission probability distributions. The
problem is: how do we segment the training sequences in a meaningful sense?
There are a number of methods dealing with sequence segmentation; one of the famous
methods is K-means clustering [33]. The aim of K-means clustering is to partition
a group of observations into K sets so as to minimize the within-cluster sum of squares
(WCSS). However, this method does not maintain the time-evolving property of the
observation sequence. In our system, histogram-based entropy estimation [34, 35] is
used instead, which segments a sequence into states while averaging the observations within
states. During the segmentation, entropy is used to measure the amount of similarity
between streams of feature vectors: the lower the similarity (the bigger the entropy), the more
probable it is a segmentation point. The following steps segment
one training sequence:
1. Choose a window size W and put the first W feature vectors of the sequence in the
window, as shown in Figure 3.8. On these W feature vectors, build a GMM with probability
density function (PDF) f(x) using the expectation-maximization (EM) algorithm, as will be
illustrated in Section 3.2.1.3.

2. Compute the entropy H(X) of the W feature vectors using the formula:
$H(X) = -\sum_{i=1}^{W} f(x_i) \log f(x_i),$  (3.1)
where $x_i$ is the ith feature vector in the window.

3. Move the window forward by one feature vector at a time and repeat steps 1 and 2 until
the window reaches the end of the training sequence. This yields a sequence of
entropies.

4. Find the N−1 biggest peak values in the entropy sequence; the corresponding
indices in the training sequence are the segmentation points. If there are fewer than
N−1 peaks, the training sequence is segmented into equal parts.
The result of the sequence segmentation will be used to initialize B.
Figure 3.8: An illustration of entropy estimation with W=4.
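A sketch of steps 1 to 3 in Java follows; Gmm and fitGmm are hypothetical placeholders for the EM-trained mixture of Section 3.2.1.3.

    class SegmentationSketch {
        /** Slides a window of size w over one training sequence and returns one entropy
            value per window position (steps 1 to 3); step 4 then takes the N-1 highest
            peaks of this curve as segmentation points. */
        static double[] entropyCurve(double[][] sequence, int w) {
            double[] entropies = new double[sequence.length - w + 1];
            for (int start = 0; start + w <= sequence.length; start++) {
                Gmm pdf = fitGmm(sequence, start, start + w); // step 1: EM on the window
                double h = 0.0;
                for (int i = start; i < start + w; i++) {
                    double f = pdf.density(sequence[i]);
                    h -= f * Math.log(f); // step 2: Equation 3.1
                }
                entropies[start] = h; // step 3: shift the window by one vector and repeat
            }
            return entropies;
        }
    }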
3.2.1.2 Transition Probabilities (A) Initialization
To initialize A, a proper value in the interval [0, 1] has to be placed in each $a_{ij}$. The
initialization contains the following steps:

1. Because A is an upper triangular matrix, set $a_{ij} = 0$ for all $i > j$.

2. The transition probability from each state to its next reachable state is
initialized with a value p, obtained as follows. First, compute the average number of
observations over all training sequences $\mathbf{O} = [O^{(1)}, O^{(2)}, \cdots, O^{(k)}]$ using
$\overline{O}_{num} = \dfrac{\sum_{i=1}^{k} O^{(i)}_{num}}{k},$  (3.2)
where k is the number of training sequences and $O^{(i)}_{num}$ is the number of observations
in the ith sequence. Then p is calculated from $\overline{O}_{num}$ and the number of states N:
$p = \dfrac{1}{\overline{O}_{num} / N}.$  (3.3)

3. For each state $i < N - 1$, set the forward transition $a_{i,i+1} = p$.

4. For each state $i < N - 1$, set the self transition $a_{ii} = 1 - p$.

5. Set $a_{N-1,N-1} = 1$, because there is no transition from the last state to any other state.
The initialized transition matrix then has the form:

$A_{N,N} = \begin{pmatrix} 1-p & p & 0 & \cdots & 0 \\ 0 & 1-p & p & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}.$  (3.4)
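The corresponding initialization translates into a few lines of Java (a minimal sketch; names are ours):

    class TransitionInitSketch {
        /** Builds the initial left-right transition matrix of Equation 3.4.
            averageLength is the average number of observations per sequence (Eq. 3.2). */
        static double[][] initTransitions(int numStates, double averageLength) {
            double p = 1.0 / (averageLength / numStates); // Equation 3.3
            double[][] a = new double[numStates][numStates]; // entries below the diagonal stay 0
            for (int i = 0; i < numStates - 1; i++) {
                a[i][i] = 1.0 - p; // stay in the current state
                a[i][i + 1] = p;   // advance to the next state
            }
            a[numStates - 1][numStates - 1] = 1.0; // the last state only loops on itself
            return a;
        }
    }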
3.2.1.3 Emission Probabilities (B) Initialization
After sequence segmentation, each state gets a set of training feature vectors. We want
to find a GMM, ψ, which in some sense best matches the distribution of these training
feature vectors.
Formally, the problem can be described as: given a set of T training feature vectors
O = {o1, o2, · · · , oT}, how do we find the model parameters which maximize the likelihood
of the GMM,

$P(O \mid \psi) = \prod_{t=1}^{T} P(\mathbf{o}_t \mid \psi)\,?$  (3.5)
The parameters can be obtained iteratively using the EM algorithm. The EM algorithm
works as follows: begin with an initial estimate ψ (e.g., random) and use it to estimate a new
model $\bar{\psi}$ such that $P(O \mid \bar{\psi}) \geq P(O \mid \psi)$. The new model is then treated as the initial
model in the next iteration. The iteration terminates when the convergence criterion is
met.
During each iteration, the model parameters are re-estimated by the following formulas:

Mixture weights:
$\bar{\omega}_i = \dfrac{1}{T} \sum_{t=1}^{T} \Pr(i \mid \mathbf{o}_t, \psi).$  (3.6)

Means:
$\bar{\boldsymbol{\mu}}_i = \dfrac{\sum_{t=1}^{T} \Pr(i \mid \mathbf{o}_t, \psi)\, \mathbf{o}_t}{\sum_{t=1}^{T} \Pr(i \mid \mathbf{o}_t, \psi)}.$  (3.7)

Variances:
$\bar{\sigma}_i^2 = \dfrac{\sum_{t=1}^{T} \Pr(i \mid \mathbf{o}_t, \psi)\, o_t^2}{\sum_{t=1}^{T} \Pr(i \mid \mathbf{o}_t, \psi)} - \bar{\mu}_i^2,$  (3.8)

where $o_t$, $\bar{\mu}_i$ and $\bar{\sigma}_i^2$ refer to arbitrary elements of the vectors $\mathbf{o}_t$, $\bar{\boldsymbol{\mu}}_i$ and $\bar{\boldsymbol{\sigma}}_i^2$, respectively.

The posterior probability for component i is given by:
$\Pr(i \mid \mathbf{o}_t, \psi) = \dfrac{\omega_i\, g(\mathbf{o}_t \mid \boldsymbol{\mu}_i, \Sigma_i)}{\sum_{k=1}^{M} \omega_k\, g(\mathbf{o}_t \mid \boldsymbol{\mu}_k, \Sigma_k)}.$  (3.9)
These formulas guarantee a monotonic increase of the model's likelihood value, and the
value improves significantly within the first few iterations. Therefore, in the
initialization phase, only a limited number of iterations is performed.
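One re-estimation pass over Equations 3.6 to 3.9 for a diagonal-covariance GMM could be sketched in Java as follows (all names are ours; a production implementation would add log-domain computation and guards against empty components):

    class GmmEmSketch {
        /** One EM iteration (Equations 3.6-3.9) for an M-component diagonal GMM.
            o[t][d] are the training vectors; w, mu and var are updated in place. */
        static void emStep(double[][] o, double[] w, double[][] mu, double[][] var) {
            int bigT = o.length, m = w.length, dim = o[0].length;
            double[][] post = new double[bigT][m]; // Pr(i | o_t, psi), Equation 3.9
            for (int t = 0; t < bigT; t++) {
                double norm = 0.0;
                for (int i = 0; i < m; i++) {
                    post[t][i] = w[i] * diagGaussian(o[t], mu[i], var[i]);
                    norm += post[t][i];
                }
                for (int i = 0; i < m; i++) post[t][i] /= norm;
            }
            for (int i = 0; i < m; i++) {
                double sum = 0.0;
                double[] newMu = new double[dim], newVar = new double[dim];
                for (int t = 0; t < bigT; t++) {
                    sum += post[t][i];
                    for (int d = 0; d < dim; d++) {
                        newMu[d] += post[t][i] * o[t][d];            // numerator of Eq. 3.7
                        newVar[d] += post[t][i] * o[t][d] * o[t][d]; // numerator of Eq. 3.8
                    }
                }
                w[i] = sum / bigT; // Equation 3.6
                for (int d = 0; d < dim; d++) {
                    mu[i][d] = newMu[d] / sum;
                    var[i][d] = newVar[d] / sum - mu[i][d] * mu[i][d]; // Equation 3.8
                }
            }
        }

        /** Density of one diagonal multivariate Gaussian component. */
        static double diagGaussian(double[] x, double[] mu, double[] var) {
            double logG = 0.0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - mu[d];
                logG += -0.5 * Math.log(2.0 * Math.PI * var[d]) - diff * diff / (2.0 * var[d]);
            }
            return Math.exp(logG);
        }
    }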
3.2.2 Baum-Welch Algorithm
In section 2.3.3 of Chapter 2, we discussed the general Baum-Welch algorithm, which can be used to train an HMM. Formulas 2.9-2.11 use only one training sequence to estimate the parameters of a fully connected HMM. However, in our system we use left-right HMMs and multiple training sequences, so the estimation formulas need to be refined.
Given a set of K observation sequences O = [O^{(1)}, O^{(2)}, · · · , O^{(K)}], where O^{(k)} = [O_1^{(k)} O_2^{(k)} · · · O_{T_k}^{(k)}] is the k-th observation sequence, and under the assumption that all observation sequences are independent of each other, our goal is to find the parameter set λ which maximizes

\[
P(O \mid \lambda) = \prod_{k=1}^{K} P(O^{(k)} \mid \lambda) = \prod_{k=1}^{K} P_k. \tag{3.10}
\]
Since the re-estimation procedure is based on counting the occurrences of various events, the re-estimation for multiple observation sequences simply sums up the individual occurrence frequencies of each sequence, giving:

Transition Probabilities

\[
\bar{a}_{ij} = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\, a_{ij}\, b_j(O_{t+1}^{(k)})\, \beta_{t+1}^k(j)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\, \beta_t^k(i)}. \tag{3.11}
\]
Re-estimating the emission probabilities actually means re-estimating the parameters of the GMMs. In the initialization phase, only the segmented sets of feature vectors are used to train the GMM parameters. In the Baum-Welch phase, we use the state occupation probabilities to adjust the parameters so that $P(O \mid \lambda)$ is maximized. The re-estimation formulas are given by:
Mixture Weights

\[
\bar{c}_{jn} = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \gamma_t^k(j, n)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \sum_{n=1}^{M} \gamma_t^k(j, n)}. \tag{3.12}
\]
Mean Vectors

\[
\bar{\mu}_{jn} = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \gamma_t^k(j, n)\, O_t^{(k)}}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \gamma_t^k(j, n)}. \tag{3.13}
\]
Covariance Matrices

\[
\bar{\Sigma}_{jn} = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \gamma_t^k(j, n)\, (O_t^{(k)} - \bar{\mu}_{jn})(O_t^{(k)} - \bar{\mu}_{jn})'}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \gamma_t^k(j, n)}, \tag{3.14}
\]
where $\gamma_t^k(j, n)$ is the probability of being in state j at time t with the n-th mixture component accounting for $O_t^{(k)}$, given by:

\[
\gamma_t^k(j, n) = \left[ \frac{\alpha_t^k(j)\, \beta_t^k(j)}{\sum_{j=1}^{N} \alpha_t^k(j)\, \beta_t^k(j)} \right] \left[ \frac{c_{jn}\, \psi(O_t^k, \mu_{jn}, \Sigma_{jn})}{\sum_{m=1}^{M} c_{jm}\, \psi(O_t^k, \mu_{jm}, \Sigma_{jm})} \right]. \tag{3.15}
\]
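In the implementation, these multi-sequence re-estimation formulas are delegated to Jahmm's learner classes rather than coded by hand. The following is a sketch of how a transition model could be trained, assuming Jahmm's documented learner interface; the iteration count shown is an assumption chosen for illustration.

    import be.ac.ulg.montefiore.run.jahmm.Hmm;
    import be.ac.ulg.montefiore.run.jahmm.ObservationVector;
    import be.ac.ulg.montefiore.run.jahmm.learn.BaumWelchScaledLearner;
    import java.util.List;

    // Sketch: re-estimating a transition model from K training sequences.
    // The scaled learner is used to avoid numerical underflow on long sequences.
    static Hmm<ObservationVector> train(Hmm<ObservationVector> initial,
            List<List<ObservationVector>> sequences) {
        BaumWelchScaledLearner learner = new BaumWelchScaledLearner();
        learner.setNbIterations(10);   // assumed value; tuned experimentally
        return learner.learn(initial, sequences);
    }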
3.3 The Recognition
The precondition for running the recognition is that there are trained transition models in the HMM database. The recognition procedure can be described as:

1. The user performs a transition in the effective area of the Leap Motion; an observation sequence O containing a set of feature vectors is returned for recognition and is called the recognition sequence.

2. For every HMM $\lambda_i$ in the database, use the forward-backward algorithm as described in section 2.3.3 of Chapter 2 to calculate $P_i = P(O \mid \lambda_i)$, which is the probability that O is generated by model $\lambda_i$.

3. Find the biggest probability value P; the corresponding model λ is the recognized transition. A sketch of this loop is given below.
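The following minimal Java sketch illustrates steps 2 and 3, assuming a map from transition names to trained Jahmm models; Hmm.probability is Jahmm's forward-algorithm likelihood, and the method name recognize is hypothetical.

    import be.ac.ulg.montefiore.run.jahmm.Hmm;
    import be.ac.ulg.montefiore.run.jahmm.ObservationVector;
    import java.util.List;
    import java.util.Map;

    // Score the recognition sequence against every model in the database and
    // return the name of the model with the highest probability.
    static String recognize(Map<String, Hmm<ObservationVector>> database,
            List<ObservationVector> o) {
        String best = null;
        double bestP = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Hmm<ObservationVector>> e : database.entrySet()) {
            double p = e.getValue().probability(o);   // P(O | lambda_i)
            if (p > bestP) { bestP = p; best = e.getKey(); }
        }
        return best;
    }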
During the training, we prefer to use multiple long training sequences, because a large training data set helps to build a more reliable model. However, this does not apply to the recognition sequence.
Figure 3.9: Flow chart of the recognition procedure.
In fact, the average length of a recognition sequence in our system is only 10, in order to avoid the numerical underflow problem [36]. The forward-backward algorithm requires on the order of $N^2 T$ calculations, namely $N(N+1)(T-1) + N$ multiplications (with T the length of the sequence and N the number of states of the HMM). It can be seen from this formula that the number of multiplications is proportional to the length of the sequence. Because the multipliers are probabilities with values in the interval [0, 1], the intermediate result becomes smaller and smaller as the number of multiplications increases, finally leading to numerical underflow.
3.4 Bigram
A bigram is a sequence of two adjacent elements in a string of tokens, which are typically letters, syllables, or words. In our system, a bigram refers to a letter to letter transition. As discussed in the previous section, there are 930 possible transitions if self transitions are not considered. To recognize a transition, the forward-backward algorithm would have to be applied 930 times to find the best fitting model. Clearly, this is too expensive for a real time gesture recognition system.
In fact, not all transitions occur in German word formation. To remove the unnecessary transitions and minimize the size of the HMM database, we implemented a program called the German Bigram Counter (GBC), which analyses the occurrence probabilities of transitions in text corpora. The program reads corpora in the form of text files from a local folder and shows the statistical result in a 31 × 31 matrix, as shown in Figure 3.10.
Figure 3.10: A screen shot of the GBC matrix.
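The core of such a counter reduces to a few lines. The sketch below is a simplification that treats every character as one token; the real GBC additionally has to handle the multi-character signs of the manual alphabet (e.g., SCH) to fill its 31 × 31 matrix, and the name countBigrams is hypothetical.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.stream.Stream;

    // Count adjacent letter pairs inside each word of a text corpus.
    static Map<String, Integer> countBigrams(Stream<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        lines.forEach(line -> {
            for (String word : line.toUpperCase().split("[^A-ZÄÖÜ]+"))
                for (int i = 0; i + 1 < word.length(); i++)
                    counts.merge(word.substring(i, i + 2), 1, Integer::sum);
        });
        return counts;             // divide by the total count to get probabilities
    }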
Figure 3.11 shows the statistical curve obtained when testing on a German corpus of 16,026 sentences downloaded from a corpora website [37]; all sentences are extracted from literary works written in German. We can see that the curve drops dramatically after a certain point, which indicates that the transitions beyond that point barely occur in German word formation. If these transitions are ignored, we are left with the 100 most frequently used transitions, which cover 80% of the letter to letter transitions in German.

Figure 3.11: The occurrence probability of transitions in a corpus of 16,026 sentences.
Table 3.2 lists the ten most frequently occurring letter pairs. As can be seen from the table, the letter pair "EN" tops the list, which makes sense because nearly all German verb infinitives end in "EN". The complete list containing all 100 transitions can be found in Appendix B.
Transition  Probability (%)
EN 4.057
ER 3.767
CH 2.678
DE 2.196
EI 2.110
ND 2.015
TE 1.818
IE 1.631
IN 1.614
UN 1.604
Table 3.2: Top ten most frequently used letter pairs
Chapter 4
System Implementation
The German finger spelling recognition system (GFRS) was developed using Eclipse1 on a laptop running 32-bit Windows 7. It makes use of the Java library in the Leap Motion software development kit (SDK) and of an HMM toolkit called Jahmm [38]. The system has also been tested successfully on Linux and Mac OS.
This chapter describes the details of the system implementation and is divided into four sections. The first section introduces the HMM library we use. The second section explains how the functions are implemented, in terms of class diagrams. The third section gives a description of the user interface. In the last section, we present some obstacles encountered during the implementation as well as their solutions.
4.1 Jahmm
Jahmm is a Java library that implements HMM-related algorithms. It is mainly designed for research and teaching purposes; therefore, all the algorithms are implemented in a general manner according to the theory, and the code can easily be understood and modified to fit different applications. There are six packages in Jahmm, each with a different function:
• run.distributions implements various pseudo-random distributions.
• run.jahmm is an HMM implementation.
• run.jahmm.draw helps drawing HMM-related objects.
1Eclipse is an integrated development environment (IDE). It contains a base workspace and an extensible plug-in system for customizing the environment. The Eclipse SDK is free and open source software under the terms of the Eclipse Public License. Different releases can be found at http://www.eclipse.org/downloads/; we use the Indigo release.
• run.jahmm.io holds classes that read and write HMM-related objects.
• run.jahmm.learn holds HMM-related learning algorithms.
• run.jahmm.toolbox holds HMM-related tool algorithms.
4.1.1 Main Classes
The most basic and important class of Jahmm is Hmm, it contains all the elements of
an HMM: number of states, state to state transition probabilities, emission probability
for each state, and a bunch of methods to set and return the parameters of an HMM.
The class Observation defines the observation of an HMM. An observation can be discrete or continuous: it can be an integer or a double-valued vector. Jahmm implements some commonly used observation types:

• The ObservationInteger class holds integer observations.

• The ObservationDiscrete class holds observations whose values are taken out of a finite set.

• The ObservationReal class holds real observations (implemented as a double).

• The ObservationVector class holds vectors of real values (implemented as doubles).

A sequence of observations is simply implemented as a vector of observations, and a set of observation sequences as a vector of such vectors. To be useful, each kind of observation should have at least one observation probability distribution function. For example, the ObservationVector class can be used together with the class OpdfMultiGaussian, which implements a multivariate Gaussian distribution. The Viterbi, forward-backward, and Baum-Welch algorithms are implemented in the classes ViterbiCalculator, ForwardBackwardCalculator, and BaumWelchLearner, respectively.
Table 4.1 lists the classes of Jahmm that are used in our system:

Package           Class
run.jahmm         ForwardBackwardCalculator
                  Hmm<O extends Observation>
                  ObservationVector
                  ViterbiCalculator
run.jahmm.io      HmmReader
                  HmmWriter
                  ObservationVectorReader
                  ObservationVectorWriter
                  OpdfWriter<O extends Opdf<?>>
                  OpdfReader<O extends Opdf<?>>
run.jahmm.learn   BaumWelchScaledLearner

Table 4.1: Classes used in our system from Jahmm.
4.1.2 Extension to Jahmm
Although Jahmm already implements some commonly used state emission probability distributions, the mixture of multivariate Gaussian distributions that is required by our system is not one of them. Therefore, based on the existing distributions, some classes related to the Gaussian mixture model were added to the corresponding packages, as shown in Table 4.2.
Package            Class
run.jahmm          OpdfMultiGaussianMixture
                   OpdfMultiGaussianMixtureFactory
run.jahmm.io       OpdfMultiGaussianMixtureReader
                   OpdfMultiGaussianMixtureWriter
run.distributions  MultiGaussianMixtureDistribution

Table 4.2: Extra classes added to Jahmm.
4.1.3 Data Storage
The data used by our system is stored on the local disk of the computer running it; we call this location the workspace and name it GFRworkspace. The first task of the system when it is launched is to check whether the GFRworkspace folder and its subfolders exist; if not, the system creates new ones. The directory structure of the workspace is shown in Figure 4.1.

The workspace contains five folders, and the data in the different folders are used for different purposes.
GFRworkspace
├── Training Sequences
├── HMM Database
├── Backup
│   ├── Training
│   └── HMM
├── TestLogs
└── Test Sequences

Figure 4.1: Workspace directory.
• Training Sequences
When the system is running, all newly recorded training sequences are stored in this folder. The sequences used to train one transition model are encoded in one file. The user can only use data from this folder to train the models. The folder is cleaned up and all data files are moved to the Backup folder when the system exits.
• HMM Database
When data from Training Sequences is used to train models, the trained models are put into this folder, with one file per model. Recognition can only be based on the models in this folder. Together with the Training Sequences folder, it forms what we call the current workspace. When the system exits, this folder is also copied to the Backup folder.
• Logs
When an isolated recognition is run, a textual log file recording the computation process is stored in this folder. The logs are mainly used for debugging purposes.
• Backup
Files from the folders Training Sequences and HMM Database can be stored in this folder as backups. The user can also import them from the Backup folder into the current workspace, so that he/she does not need to record training data or build the HMM database on his/her own.
• Test Sequences
This folder contains recognition sequences (test sequences) used for testing purposes. Each file in the folder contains exactly one sequence corresponding to a transition. Combined with the models in HMM Database, they allow us to evaluate the system in terms of recognition accuracy.
There are mainly two kinds of data that need to be stored in the workspace: the observation sequence data (including the training sequences and test sequences) and the trained HMMs. Both are encoded as text files.
Training sequences are written to a text file by the class ObservationWriter. One file contains the sequences necessary to train one model; for example, 3 training sequences of different lengths are stored in the form:
obs11 ; obs12 ; obs13 ;
obs21 ; obs22 ; obs23 ; obs24 ; obs25 ;
obs31 ; obs32 ; obs33 ; obs34 ;
The file is named after the transition letter pair and has the extension ".seq". For example, the file that contains the training sequences of the A to B transition is named "AB.seq". Training sequences can be read from the corresponding file using the class ObservationReader. A test sequence file differs from a training sequence file only in that it contains exactly one observation sequence.
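As a minimal sketch, writing that layout amounts to emitting one line per sequence with observations separated by " ; "; here the observations are assumed to already be rendered as strings, and the name writeSequences is hypothetical.

    import java.io.IOException;
    import java.io.Writer;
    import java.util.List;

    // Write one sequence per line, each observation followed by " ; ".
    static void writeSequences(Writer out, List<List<String>> sequences)
            throws IOException {
        for (List<String> seq : sequences) {
            for (String obs : seq) out.write(obs + " ; ");
            out.write(System.lineSeparator());
        }
        out.flush();
    }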
In terms of HMMs, the syntax is quite straightforward. For a 4-state HMM with an integer emission probability distribution for each state, the text file looks like:
Hmm v1.0
NbStates 4
State
Pi 0.3
A 0.1 0.2 0.3 0.4
IntegerOPDF [0.2 0.8 ]
State
Pi 0.3
A 0.2 0.4 0.2 0.2
IntegerOPDF [0.5 0.5 ]
State
Pi 0.2
A 0.3 0.3 0.2 0.2
IntegerOPDF [0.3 0.7 ]
State
Pi 0.2
A 0.2 0.2 0.4 0.2
IntegerOPDF [0.5 0.5 ]
It is worth pointing out that all white space between lines and words is treated as a single space; in other words, the whole file could consist of only one line. When the file is read using the class HmmReader, the parameters are recognized according to the keywords preceding them.
The first line simply gives the version number of the file syntax. The keyword NbStates gives the number of states of the described HMM. After that come the descriptions of the states; the states appear in order, and each state description is composed of:
• State indicates the start of a state description.
• Pi is the probability that this state is an initial state.
• A is a list that contains the state to state transition probabilities. If the state currently described is numbered i, then the jth probability of the list is that of going from state i to state j.

• IntegerOPDF is the state emission probability distribution. The syntax depends on the type of distribution. The example given above describes integer distributions: it begins with the IntegerOPDF keyword, followed by an ordered list of probabilities between brackets. In the example above, if the first probability relates to the integer "0" and the second to "1", then the probability that the first state emits "1" equals 0.8.
Like the sequence files, the HMM files carry the extension ".hmm" and are also named after the transitions.
4.2 Class Diagrams
The whole project is composed of 5 packages:
• userInterface contains all the classes related to graphic user interface of GFRS.
• dataCollection holds classes that are responsible for sequence data recording, including training sequences and recognition sequences.
• hmm contains classes that implement interfaces of Jahmm as well as training
related operations.
• recognitionModule contains recognition related classes.
• util implements tools (e.g., sequence segmentation) needed to support different
functions.
Classes from different packages work together to implement the three main functions: recording, training, and recognition. The class diagrams of the three functions are reproduced in Appendix C. The source code of this project is available on GitHub: https://github.com/TengfeiWang/GermanFingerspellingRecognizer.
4.3 User Interface
Figure 4.2 shows the main frame of GFRS. The frame can be divided into three parts: the menu bar on top, the button list area on the left, and the functional area on the right.
Figure 4.2: The main frame of GFRS.
In the functional area, different panels are switched based on button clicks. When the system is launched for the first time, the user can only access the recording panel, while the 2:Train and 3:Recognition buttons are disabled, due to the fact that there is no training data and there are no trained models in the current workspace. Once training data is recorded, the training is activated, and then the recognition.
4.3.1 Data Recording
The recording panel contains two parts, as shown in Figure 4.3: one part is the entrance to the recording procedure, and the other allows the user to access the recorded data.
All transitions that need to be modeled are pre-stored in a hash map called AllTransitions, with the transition name as the key and the end letter of the transition as the value. For example, the A to B transition is stored in the form (AB, B). When the recording is started, the system checks which transitions have already been recorded in the Training Sequences folder and deletes them from AllTransitions; this yields a new hash map UnrecordedTransitions, which contains all the unrecorded transitions. The transitions in UnrecordedTransitions are recorded alphabetically; the user just has to follow the instructions of the pop-up frame in Figure 4.3.
Figure 4.3: A pop-up frame that gives instructions to the user about which transition to perform.
The recording stops when any of the following three conditions is met:

• All transitions in UnrecordedTransitions have been recorded.

• No hands are detected by the Leap Motion.

• The user manually stops the recording by pressing any key on the keyboard.
4.3.2 HMM Training
The user has no access to the training panel unless there is training data in the Training Sequences folder. Training data can be obtained either by starting the recording procedure or by importing it from the Backup folder. The training panel contains three parts, as shown in Figure 4.4: the elements of the currently used feature vector are listed on top, the middle part contains buttons to start the training procedure as well as to access the HMM database, and at the bottom is a monitor which gives the user information about the training process (e.g., how many models have been successfully trained).
Figure 4.4: The training panel.
It is worth pointing out that the feature vector recorded in the data collection phase contains all the features listed in Table 3.1 and is called a full feature vector. All possible feature vectors can be obtained by extracting the corresponding elements from the full feature vector, which ensures that there is no need to collect training data again when we want to change the feature vector.
Before the training, a new frame, shown in Figure 4.5, pops up and asks the user to specify a feature vector and the number of states of the HMMs to be used in the training phase. The user can choose any combination of features via the check boxes; the red lines drawn on the skeletal model in the right part of the frame give an intuitive illustration of what the features measure. The user can also specify in this frame the number of states of the HMMs that are about to be trained; the value is 5 by default. Once the feature vector and the number of states are settled, the system starts to prepare the training sequences by extracting feature vectors from the full ones.
Figure 4.5: A pop-up frame in which the user can configure the feature vector and the number of states of the HMM.
4.3.3 Recognition
Similar to the training, the user can run a recognition only when there are trained HMMs in the HMM Database folder. The models can be obtained either by starting the training procedure described in the last section or by importing them from the Backup folder. When models are imported, the system also obtains the information on which feature vector and which number of states were used to build the models by reading a text file. This text file is generated every time the models in the HMM Database folder are copied to the Backup folder.
The system implements two kinds of recognition: isolated and continuous recognition. The panel for isolated recognition is shown in Figure 4.6. The user clicks the Run Recognition button on top to record a recognition sequence, which is then sent to the recognition pipeline. The computational process, that is, the probability that the sequence was generated by each model in the database, is listed in the middle. The final result shows up at the bottom after the computation is done. The panel for continuous recognition is very similar to that of isolated recognition, except that the computational process part is omitted.
Figure 4.6: The isolated recognition panel.
4.4 Implementation Issues
While implementing the system, we encountered two big challenges. The first appeared when we wanted to record training data: how to obtain useful data from the Leap Motion, while ignoring the redundant information, in an efficient manner. The second challenge was how to continuously recognize transitions during the performance of a signer.
4.4.1 Data Acquisition
To improve the user experience, we prefer to use as few key presses and mouse operations as possible when recording sequences. This means the system has to detect the start and the end of a transition performance automatically, without any interaction between the user and the keyboard. This can be achieved by taking advantage of the Leap Motion's high frame rate. In our system, a training sequence is recorded in the following steps:
1. The signer follows the instructions in the frame in Figure 4.3: he/she puts the dominant hand in the effective area of the Leap Motion with the posture of the start letter.

2. When the timer reaches 0, the Leap Motion returns a frame of data corresponding to the first posture (the hand shape of the start letter) and sends it to the feature extractor; this yields the first feature vector F0 of the sequence. In the meantime, the signer can start to perform the transition to the end letter.
3. Although we have obtained the first feature vector of the sequence, the recording has not started yet. The first feature vector acts as a reference, and the start of the recording is detected based on it. After the timer reaches 0, the Leap Motion starts to return feature vectors extracted from the latest frame every 10 ms. We calculate the difference between the current feature vector $F_c$ and the reference $F_0$. Assuming each feature vector $F_i$ contains D elements from $F_{i_0}$ to $F_{i_{D-1}}$, the difference is calculated by:

\[
| F_0 - F_c | = \sum_{i=0}^{D-1} | F_{0_i} - F_{c_i} |. \tag{4.1}
\]

If the difference is bigger than a given threshold $\theta_1$, the performance has started, and we put $F_c$ as the second feature vector into the sequence; otherwise, we ignore it and wait for the next feature vector.
4. Once the recording has started, we put every feature vector returned by the Leap Motion into the sequence until the end-of-transition criteria are met. The criteria are:

(a) 500 ms have passed since the recording started.

(b) The difference between two successive feature vectors is smaller than a threshold $\theta_2$.
The Leap Motion constantly checks whether hands can be detected in its effective area. If no hands are detected during the recording, the recording stops immediately and the feature vectors already recorded for the sequence are discarded. The two thresholds $\theta_1$ and $\theta_2$ used during the recording are not necessarily equal; usually we set $\theta_1 > \theta_2$, because a bigger $\theta_1$ makes sure that the detection of the start is reliable.
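The following is a minimal sketch of this start/end detection, assuming that nextFeature polls the Leap Motion every 10 ms and that both end-of-transition criteria must hold; the names recordSequence, nextFeature, and l1 are hypothetical.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Supplier;

    // Record one sequence: wait for |F0 - Fc| > theta1 to start, then stop once
    // at least 500 ms have passed and the hand is still (difference < theta2).
    static List<double[]> recordSequence(Supplier<double[]> nextFeature,
            double theta1, double theta2) {
        List<double[]> seq = new ArrayList<>();
        double[] ref = nextFeature.get();              // F0, the reference vector
        seq.add(ref);
        double[] cur = nextFeature.get();
        while (l1(ref, cur) <= theta1)                 // performance not started yet
            cur = nextFeature.get();
        long start = System.currentTimeMillis();
        seq.add(cur);
        double[] prev = cur;
        while (true) {
            cur = nextFeature.get();
            seq.add(cur);
            boolean still = l1(prev, cur) < theta2;                        // criterion (b)
            boolean longEnough = System.currentTimeMillis() - start >= 500; // criterion (a)
            if (still && longEnough) break;
            prev = cur;
        }
        return seq;
    }

    // |F0 - Fc| as in equation (4.1): the sum of element-wise absolute differences.
    static double l1(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += Math.abs(a[i] - b[i]);
        return d;
    }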
4.4.2 Continuous Recognition
One main advantage of modeling letter to letter transitions is that it makes real-time continuous recognition easier. The end of one transition is exactly the start of the next, which means we do not need to deal with the interaction between two transitions. In our system, the time needed to recognize a transition against a database of 100 models is less than the average time needed to record a transition. Therefore, there is no need to chain the transition models into a word-level HMM, as is done in other literature [9, 22].
The challenge here becomes how to detect the start or the end of a transition during the continuous performance of a signer. The problem can be solved using the same method as for training sequence segmentation in section 3.2.1 of Chapter 3, that is, entropy estimation. We say the end of a transition is detected when the entropy of the feature vectors in a window is smaller than a threshold $\theta_3$, because there is always an unintentional hesitation of the signer when a transition is completed, which results in the feature vectors returned in that short period of time being very close to each other (low entropy). We run recording, classifying, and recognition in parallel, as shown in Figure 4.7; the three threads run at the same time and keep communicating with each other.
Figure 4.7: Three threads for continuous recognition.
Thread 1 is responsible for sequence recording. All feature vectors returned by the Leap Motion are put into a list. The thread is killed when no hands are detected by the Leap Motion or when the recording is stopped by the signer.

Thread 2 keeps calculating the entropy of a set of feature vectors (within a window) from the already recorded sequence. If the end of a transition is detected, the sequence corresponding to the transition is put into a queue, and the detection of the next transition point continues.

Thread 3 keeps checking whether there are unrecognized sequences in the queue. For the first sequence in the queue, an isolated recognition based on all the models in the HMM Database folder is run. After that, each recognition is based on at most 30 models, because the start letter of the next transition is known and there are only 30 possible transitions from one letter to the others. Thread 3 terminates when thread 1 has been killed and there are no unrecognized sequences left in the queue.
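A structural sketch of this three-thread pipeline is given below, with the thread bodies elided; the class and method names are hypothetical, and a BlockingQueue is assumed to carry the segmented sequences from thread 2 to thread 3.

    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class ContinuousRecognizer {
        // Segmented sequences waiting to be recognized (thread 2 -> thread 3).
        private final BlockingQueue<List<double[]>> pending = new LinkedBlockingQueue<>();

        void start() {
            new Thread(this::recordLoop).start();     // thread 1: recording
            new Thread(this::segmentLoop).start();    // thread 2: classifying
            new Thread(this::recognizeLoop).start();  // thread 3: recognition
        }

        private void recordLoop() {
            // Append every feature vector from the Leap Motion to a shared list;
            // exit when no hands are detected or the signer stops the recording.
        }

        private void segmentLoop() {
            // Slide a window over the shared list; when its entropy drops below
            // theta3, cut out the finished transition and pending.add(sequence).
        }

        private void recognizeLoop() {
            try {
                while (true) {
                    List<double[]> seq = pending.take(); // blocks until a sequence arrives
                    // Run an isolated recognition on seq; after the first result,
                    // restrict the search to the <= 30 transitions starting with
                    // the previously recognized end letter.
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();      // terminated with thread 1
            }
        }
    }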
Chapter 5
Experimental Results
This chapter presents a partial evaluation of our German finger spelling recognition system (GFRS) and is divided into three sections: the first section describes the experimental environment and the acquisition of the experimental data, the second section gives details on how the experiments were conducted as well as their results, and the last section presents a discussion of the experimental results.
5.1 Preparation
We want to evaluate our system in terms of recognition accuracy by experimenting on the 100 most frequently used transitions in German word formation, as listed in Table B.1. The experiments are conducted on both isolated transition recognition and word-level continuous recognition. However, our focus is on isolated recognition, due to the fact that continuous recognition is implemented by simply running consecutive isolated recognitions. For each transition in the list, we run an isolated recognition; the performance of the system is determined by the number of transitions that are correctly recognized.
As discussed in section 4.3.2 of Chapter 4, a sequence of full feature vectors (containing all the features listed in Table 3.1) is returned by the Leap Motion during the performance of a transition. Furthermore, before each training, the user can specify a new feature vector whose elements are a subset of the full feature vector, as well as the number of states of the HMMs that will be used in the training. The elements of the new feature vector can be extracted from the full ones, which ensures that the user does not have to record the training data again when a change of the feature vector or of the number of states is needed. Therefore, in our experiments, the training data for the 100 transitions is recorded once, before the experiments.
All training data for the 100 transitions was recorded by one person1 who is familiar with the system and with German finger spelling. To eliminate the influence of other applications running on the computer, all irrelevant processes were killed before the recording. The training data for each transition contains 10 training sequences. During the recording, the temperature of the Leap Motion can become quite high after running for a while, which can affect its tracking performance. To acquire more reliable data, we let the signer take a break after the training data for every 5 transitions has been recorded, so that the Leap Motion can cool down. In addition, we calibrate the device and clean its surface glass regularly. The training data is stored in the Training Sequences folder of the workspace.
5.2 Experiments
There are many potential factors that may affect the performance of the system:

• the dimension and elements of the feature vector,

• the number of states of the HMM,

• the size of the HMM database (number of transitions),

• the size of the training data (number of training sequences) for each model.

Since the training sequences are pre-recorded, we only experiment on the first three factors to find an optimal configuration for our system.
5.2.1 Isolated Recognition
For convenience, instead of running one isolated recognition right after each recognition sequence is recorded, we pre-record exactly one recognition sequence for each transition in the transition list, so that all 100 isolated recognitions can be run by clicking a single button. The recognition sequences are recorded by the same person who recorded the training data and are stored in the Test Sequences folder. The evaluation can be run once the recognition sequences and the HMM database are ready. The recognition sequences are stored in the Test Sequences folder and the HMM database in the HMM Database folder; the files in the two folders should match in number and names, except for the extension.

The files in the HMM database are obtained through the training pipeline using the pre-recorded training data in the Training Sequences folder. Because the user can specify the feature vector and the number of states of the HMM before training, all three factors mentioned above are embedded in the HMM database.

1We are dealing with German finger spelling; the person only needs to learn how to perform the 31 signs of the German manual alphabet. In our case, the ability to use the system is more important. Therefore, the person here is not a Deaf signer.
When the evaluation starts, the system takes each sequence file in the Test Sequences folder and runs an isolated recognition based on the HMM database until all sequences are recognized. For example, a file named "AB.seq" is put into the recognizer; if the probability that this sequence was generated by the model "AB.hmm" in the database is the highest, we say that this sequence has been successfully recognized. The final recognition accuracy is the number of successfully recognized sequences divided by the total number of sequences.
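A minimal sketch of this batch evaluation, reusing the hypothetical recognize method sketched in section 3.3; the sequence and model names are assumed to match as described above.

    import be.ac.ulg.montefiore.run.jahmm.Hmm;
    import be.ac.ulg.montefiore.run.jahmm.ObservationVector;
    import java.util.List;
    import java.util.Map;

    // A test sequence counts as correct when the best-scoring model carries
    // the same transition name ("AB" for "AB.seq" vs. "AB.hmm").
    static double evaluate(Map<String, Hmm<ObservationVector>> database,
            Map<String, List<ObservationVector>> testSequences) {
        int correct = 0;
        for (Map.Entry<String, List<ObservationVector>> t : testSequences.entrySet())
            if (t.getKey().equals(recognize(database, t.getValue())))
                correct++;
        return (double) correct / testSequences.size();
    }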
5.2.1.1 Experiment on Feature Vector
We ran several tests on different feature vectors. As discussed in Chapter 2, we have 15 candidate features that can be used to form the feature vector, and the user can choose any subset of them before training. Features 12, 13, 14, and 15 from categories 2 and 3 are essential for recognizing transitions with hand rotation or movement, and should therefore be included in all feature vectors. Consequently, the feature vector is mainly determined by the features chosen from category 1. We choose different combinations of category 1 features that we believe can represent the hand shape. If each element of the feature vector is represented by its corresponding number in Table 3.1, the feature vectors can be represented by:
Feature Vector Features
A 1, 2, 3, 4, 5, 12, 13, 14, 15
B 1, 2, 3, 4, 5, 10, 12, 13, 14, 15
C 6, 7, 8, 9, 10, 12, 13, 14, 15
D 1, 2, 3, 8, 9, 10, 12, 13, 14, 15
E 1, 2, 6, 7, 8, 9, 10, 12, 13, 14, 15
F 1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
G 1, 2, 3, 6, 7, 8, 9, 10, 12, 13, 14, 15
H 1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15
I 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15
Table 5.1: 9 feature vectors used in the experiments.
To make the experimental results convincing, for the different feature vectors we use HMMs with the same topology (i.e., 3-state left-right HMMs) to train each transition model, based on the same training data recorded in the preparation phase, and we also use the same recognition sequences to run the tests. During each test, we measure the recognition accuracy, the time needed to train the models, and the time needed to run an isolated recognition. The results are shown in Table 5.2.
No. FV DIM NS T1 T2 Accuracy DS
1 A 9 3 279955 130.6 58% 100
2 B 10 3 322573 144.4 68% 100
3 C 9 3 280055 129.6 71% 100
4 D 10 3 306726 153.3 73% 100
5 E 11 3 338727 180.2 78% 100
6 F 11 3 337732 182.2 66% 100
7 G 12 3 369972 217.0 80% 100
8 H 13 3 401978 248.9 76% 100
9 I 14 3 465998 300.2 75% 100
Table 5.2: Experimental results on different feature vectors. "FV", the feature vector used. "DIM", the dimension of the feature vector. "NS", the number of states of the HMM. "T1", the time (ms) needed to train the 100 models. "T2", the average time (ms) needed to run an isolated recognition. "DS", the size of the HMM database the experiment is based on.
Both feature vectors "A" and "B" use the distances of the finger tips to the hand center to represent a static hand shape; the only difference between them is that "B" has an extra element that captures the number of extended fingers in a frame. The use of the feature "Extended Fingers Count" improves the recognition accuracy by 10% in experiment 2, which makes us believe that it is an important feature that should be used in the subsequent tests. In experiment 3, instead of using the distances to the hand center, we use the distances between adjacent finger tips. Although the dimension decreases by 1, the accuracy improves by 3%, which means this kind of representation contains more useful information. In the following experiments, we use different combinations of the two kinds of representations; the highest accuracy, 80%, is reached with feature vector "G". We also test the feature "Finger Openness" in experiment 6 by replacing the feature "Index to Hand" from experiment 5 with "Index Finger Openness"; it turns out that the accuracy drops by 12%, which indicates that this so-called high level feature does not work in our system.
In theory, the higher the dimension of the feature vector, the higher the recognition accuracy. This holds when the dimension of the feature vector is low, as can be seen by comparing experiments 5 and 7. However, when the dimension exceeds some threshold, the accuracy begins to drop, as shown in the results of experiments 8 and 9. In section 3.3 of Chapter 3, we discussed the numerical underflow problem and pointed out that the forward-backward algorithm requires $N(N+1)(T-1)+N$ multiplications, where T is the length of the recognition sequence and N the number of states of the HMM. It would seem that the complexity has nothing to do with the dimension of the feature vector. However, this formula holds only when the emission probability distribution is discrete. In our system, the calculations involved in finding the probability that a mixture of multivariate Gaussian distributions generates a specific feature vector also need to be taken into account. This explains why we encounter the numerical underflow problem and observe a drop in accuracy when the dimension of the feature vector increases.
In terms of computational time, we are not surprised to see that both the time needed for training and the time needed for an isolated recognition increase as the dimension of the feature vector increases. As a practical application, we care more about the isolated recognition time because it affects the speed of continuous recognition. It can be seen from the results that the average time needed to run an isolated recognition when using feature vector "G" is 217 ms, which is much smaller than the time (600 ms on average) needed to record a recognition sequence.
5.2.1.2 Experiment on Number of States
In the previous section, we discussed the tests on feature vectors and found an optimal feature vector with low dimension and relatively high recognition accuracy. Considering that all those tests were based on 3-state HMMs, a natural question arises: how does the number of states of the HMM affect the recognition accuracy? To answer this question, we carried out several experiments with the number of states ranging from 2 to 4, using the two best-performing feature vectors from the last experiment, "E" and "G". The results are shown in Table 5.3.
No. FV DIM NS T1 T2 Accuracy DS
10 E 11 2 190216 123.5 82% 100
5 E 11 3 338727 180.2 78% 100
11 E 11 4 524297 242.3 75% 100
12 G 12 2 208978 146.7 78% 100
7 G 12 3 369972 217.0 80% 100
13 G 12 4 612685 290.8 69% 100
Table 5.3: Experimental results on different numbers of states when using feature vectors "E" and "G".
From the results we can see that the accuracy for feature vector "G" reaches its peak at 3 states and then starts to drop, and the same happens to feature vector "E" at an even smaller number of states, 2. The reason the accuracy does not increase further is probably that the training data is not sufficient to train a model with a large number of states. In our experiments, the training data for a model is fixed; if we build a model with a bigger state space, the number of feature vectors used to train each state decreases correspondingly. The model is compromised when there is not enough training data for each state.
5.2.1.3 Experiment on Size of HMM Database
When the size of the HMM database is 100, the highest recognition accuracy is 82%, obtained when using feature vector "E" and a 2-state HMM. This accuracy is still not high enough to support continuous recognition. We believe this is mainly because of the large size of the HMM database: a large database increases the probability that a transition has one or more similar transitions, which significantly increases the risk that a recognition sequence is recognized as one of its similar transitions. We try to narrow down the size of the HMM database to figure out whether the recognition accuracy can be improved to a relatively high level (e.g., 95%).

In this experiment, we first reduce the size of the HMM database to 50 and then to 30. For each database size, we randomly select 5 subsets from the list of 100 transitions. The final accuracy is the average over the results of the 5 subsets. Table 5.4 shows the results when testing on different database sizes using different HMM configurations.
No. FV DIM NS T1 T2 Accuracy DS
14 E 11 2 98489 32.21 86.34% 50
15 E 11 2 58708 12.68 89.96% 30
16 G 12 3 187020 55.01 85.52% 50
17 G 12 3 113406 21.53 89.91% 30
Table 5.4: Experimental results on different sizes of the HMM database.
Experiments 10 and 14 both use feature vector "E" and 2-state HMMs; by comparing the results we can see that the recognition accuracy improves by more than 4% when the size of the database is reduced to 50, and the accuracy is further boosted to nearly 90% when a 30-model database is used. We obtain a similar result when using feature vector "G" and 3-state HMMs.
5.2.2 Continuous Recognition
Since the isolated recognition accuracy is promising when the size of the HMM database is relatively small, we decided to test continuous recognition on a small number of transitions. In real-world scenarios, finger spelling is mainly used to represent names that are not defined in German Sign Language. Therefore, we use the transitions contained in the 10 most common surnames in Germany to build our HMM database:

Müller, Schmidt, Schneider, Fischer, Weber,
Meyer, Wagner, Becker, Schulz, Hoffmann
The transitions extracted from the names are shown in Table 5.5.
MU UL LE ER SCHN NE ID DE
FI ISCH SCHE WE EB BE ME EY
YE WA AG GN EC CK KE SCHU
UL LZ HO OF FM MA AN -
Table 5.5: 31 letter to letter transitions in the ten names.
For the transitions that are not in the list of 100 transitions (e.g., the E to Y transition), we record 10 training sequences, as in the preparation phase. Each transition model in the table is trained using feature vector "G" and a 3-state HMM. We run 20 tests for each name; the number of correct recognitions is listed in Table 5.6.
Name Correct NO. Total NO. Accuracy
Müller 13 20 65%
Schmidt 16 20 80%
Schneider 12 20 60%
Fischer 15 20 75%
Weber 12 20 60%
Meyer 13 20 65%
Wagner 14 20 70%
Becker 11 20 55%
Schulz 17 20 85%
Hoffmann 13 20 65%
Table 5.6: Experimental results on continuous recognition.
The name with the highest recognition accuracy is "Schulz", because the transitions composing it are unique and cannot be confused with other transitions. On the contrary, names containing similar transitions are easily misrecognized; for example, "Schneider" is usually recognized as "Schmeider", because the postures for the letters M and N are very similar from the Leap Motion's perspective. The overall accuracy for continuous recognition is 68%, which is much lower than for isolated recognition; however, we believe the accuracy can be improved when Bayesian inference [39] (taking the probability that one letter appears right after another into consideration) is applied.
5.3 Discussion
Through the experiments we have discovered that all the factors mentioned above, such as the dimension and elements of the feature vector, the number of states of the HMM, and the size of the HMM database, have a big influence on the performance of the system. Another factor that cannot be neglected is the input device. During the experiments we monitored the skeletal hand models in the Leap Motion Diagnostic Visualizer and found that the device's performance varies strongly under different circumstances.
The Leap Motion has high precision on a static hand with separated fingers, but the data for a static hand with closed fingers are not satisfying. For example, the posture for the letter F can be perfectly captured, while the posture for the letter E is usually misrepresented by the Leap Motion, as shown in Figures 5.1 and 5.2.
The data for moving hands and fingers are always noisy, which is bad news because all the transitions contain finger movement. The problems are caused by lost or wrong finger tracking (e.g., one fully extended finger recognized as two fingers in Figure 5.3), temporary finger occlusions, and an inconsistent sampling frequency [7].
Figure 5.1: Sign of letter F in the Leap Motion visualizer.
Figure 5.2: Sign of letter E is misrepresented.
Figure 5.3: One extended finger recognized as two.
Many letters in the manual alphabet have very similar postures (e.g., the postures of A, S, M, and N in Figure 1.1), and the Leap Motion has difficulty in finding the differences among them. Things become even worse when the transition itself is ambiguous; for example, the M to N transition only needs a slight change in thumb position. Therefore, it is recommended to reduce the size of the database and to include transitions whose fingers are visible and separated from the sensor's perspective.
Chapter 6
Conclusion and Further Work
In this thesis, we introduced the German finger spelling recognition system (GFRS), a system which is capable of recognizing continuous German finger spelling in real time. The system uses the Leap Motion Controller to collect frames of data representing the evolution of the user's hand pose over time. In terms of letter representation, instead of modeling a static posture for each letter, letter to letter transitions are modeled. A transition can be seen as a dynamic gesture and is modeled using a hidden Markov model, a statistical model that can handle the variation between different signers.
By modelling letter to letter transitions, we can in theory recognize any finger-spelled word without being limited to a dictionary, because the transition is the smallest unit composing a word. However, this also extends the number of models from 31 to 930. We ignore the transitions that are not frequently used in a German corpus and decrease the number to 100, which covers 80% of the letter to letter transitions in German word formation.
The evaluation is conducted on both isolated recognition and word-level continuous recognition. For isolated recognition, the tests cover three aspects: the feature vector, the number of states of the HMM, and the size of the HMM database. When testing on the 100 transitions, the results show that the highest recognition accuracy, 82%, is obtained using a 2-state left-right HMM (Table 5.3) and feature vector "E" (Table 5.1). This result does not seem promising enough for continuous recognition, most probably due to the big database size. After narrowing down the size of the database to 30 transitions, the system can achieve an accuracy of 89.96%. When it comes to the task of continuous recognition, we focus on the fact that finger spelling is mainly used to spell names, and run the tests on the transitions extracted from some of the most popular surnames in Germany. It turns out that the system can recognize the 10 names with an accuracy of 68%, limited by the noisy data returned from the Leap Motion.
Although the Leap Motion might not be the best choice for dynamic German finger spelling recognition, it has shown its value in other HCI applications, including isolated hand shape recognition [16, 18, 20] and computer games [40]. In fact, our system is not restricted to finger spelling recognition; it can be used or extended to recognize many kinds of dynamic gestures in general, as long as a proper feature vector is selected.
As further development, the system could be improved in many aspects. Since the data returned by the Leap Motion is noisy, proper pre-processing of the training data might help to improve the system's performance. We could also combine two or more devices, processing the data returned from the different devices together using data fusion methods to obtain more reliable information. Further experiments can be conducted on a larger training data set to avoid the lack-of-training-data problem when using HMMs with a big state space. Training data can also be recorded from different signers to build a user-independent system, which is more practical for the real world.
Appendix A
JSON Structure from the Leap
JSON Frame data:
{
  "currentFrameRate": 105.122,
  "gestures": [],
  "hands": [ {
    "direction": [ -0.0224476, 0.0632427, -0.997746 ],
    "id": 173,
    "palmNormal": [ -0.19773, -0.978564, -0.0575783 ],
    "palmPosition": [ -6.47867, 124.943, 12.5812 ],
    "palmVelocity": [ -1.86812, 1.93451, -0.248464 ],
    "r": [ [ 0.94694, -0.124589, -0.29628 ],
           [ 0.005127, 0.927551, -0.373661 ],
           [ 0.321369, 0.352315, 0.878974 ] ],
    "s": 1.21237,
    "sphereCenter": [ 5.55477, 211.896, -32.5654 ],
    "sphereRadius": 102.235,
    "stabilizedPalmPosition": [ -3.9295, 130.376, 11.5911 ],
    "t": [ -120.243, 15.6378, -37.449 ],
    "timeVisible": 8.41104
  } ],
  "id": 299775,
  "interactionBox": { "center": [ 0, 106.428, 0 ], "size": [ 125.185, 125.185, 78.6246 ] },
  "pointables": [ {
    "direction": [ -0.551342, -0.157666, -0.819246 ], "handId": 173, "id": 1730,
    "length": 46.3759, "stabilizedTipPosition": [ -76.1444, 116.216, -8.54233 ],
    "timeVisible": 8.41104, "tipPosition": [ -79.3754, 109.754, -6.69816 ],
    "tipVelocity": [ -1.17475, -1.49724, -0.461121 ], "tool": false,
    "touchDistance": 0.285742, "touchZone": "hovering", "width": 18.0195
  }, {
    "direction": [ -0.167748, -0.0766816, -0.982843 ], "handId": 173, "id": 1731,
    "length": 52.33, "stabilizedTipPosition": [ -35.9563, 135.028, -78.7522 ],
    "timeVisible": 8.41104, "tipPosition": [ -39.333, 131.466, -78.8971 ],
    "tipVelocity": [ -3.68745, -2.67284, -0.0929271 ], "tool": false,
    "touchDistance": 0.260135, "touchZone": "hovering", "width": 17.2122
  }, {
    "direction": [ 0.0997289, -0.100053, -0.989971 ], "handId": 173, "id": 1732,
    "length": 59.6259, "stabilizedTipPosition": [ 3.4735, 131.115, -88.7222 ],
    "timeVisible": 8.41104, "tipPosition": [ 0.964441, 127.355, -89.2757 ],
    "tipVelocity": [ -2.40379, 1.80018, -0.512016 ], "tool": false,
    "touchDistance": 0.257204, "touchZone": "hovering", "width": 16.9047
  }, {
    "direction": [ 0.160035, -0.0907604, -0.98293 ], "handId": 173, "id": 1733,
    "length": 57.3319, "stabilizedTipPosition": [ 26.6008, 128.212, -79.3807 ],
    "timeVisible": 8.41104, "tipPosition": [ 24.5168, 123.728, -79.6556 ],
    "tipVelocity": [ -1.59937, -0.517898, -0.349728 ], "tool": false,
    "touchDistance": 0.263316, "touchZone": "hovering", "width": 16.0859
  }, {
    "direction": [ 0.331746, -0.186307, -0.924789 ], "handId": 173, "id": 1734,
    "length": 44.9471, "stabilizedTipPosition": [ 49.5229, 117.765, -54.3087 ],
    "timeVisible": 8.41104, "tipPosition": [ 48.5701, 110.975, -53.3266 ],
    "tipVelocity": [ -1.2279, -2.38346, 0.488928 ], "tool": false,
    "touchDistance": 0.272498, "touchZone": "hovering", "width": 14.2888
  } ],
  "r": [ [ 0.94694, -0.124589, -0.29628 ],
         [ 0.005127, 0.927551, -0.373661 ],
         [ 0.321369, 0.352315, 0.878974 ] ],
  "s": 1.21237,
  "t": [ -120.243, 15.6378, -37.449 ],
  "timestamp": 98926509
}
Appendix B
The 100 Transitions
Token Prob. Token Prob. Token Prob. Token Prob.
EN 4.0568% LE 0.7881% RT 0.4706% MA 0.3277%
ER 3.7674% SS 0.7817% ZU 0.4706% UF 0.3148%
CH 2.6782% NS 0.7635% LL 0.4658% TR 0.3140%
DE 2.1959% IS 0.7591% AR 0.4615% EU 0.3098%
EI 2.1104% EL 0.7050% OR 0.4609% ZE 0.3082%
ND 2.0152% RA 0.6956% IG 0.4538% TU 0.3064%
TE 1.8183% LI 0.6548% WI 0.4477% LT 0.3047%
IE 1.6310% SI 0.6475% HR 0.4429% SO 0.3029%
IN 1.6138% RD 0.6184% ED 0.4326% TD 0.2947%
UN 1.6036% AL 0.5959% ET 0.4321% SA 0.2921%
GE 1.5880% TI 0.5645% NN 0.4190% NK 0.2759%
ES 1.3555% NA 0.5530% VE 0.4188% AB 0.2742%
ST 1.2160% WE 0.5496% LA 0.4108% OL 0.2708%
BE 1.1873% NT 0.5368% TS 0.4098% NZ 0.2686%
NE 1.1544% NI 0.5329% EH 0.4077% RB 0.2667%
RE 1.1250% DA 0.5281% MI 0.3995% HI 0.2633%
NG 1.1226% RS 0.5234% TA 0.3978% AC 0.2626%
HE 1.1026% AS 0.5233% RU 0.3870% RN 0.2624%
SE 1.0152% HA 0.5194% AT 0.3707% FE 0.2598%
IC 0.9892% HT 0.5194% EB 0.3681% TZ 0.2555%
AN 0.9604% ME 0.5166% NU 0.3662% NW 0.2552%
SC 0.9070% ON 0.5159% VO 0.3580% RG 0.2540%
DI 0.9024% RI 0.5003% EG 0.3559% IR 0.2518%
AU 0.8383% US 0.4977% UR 0.3559% RK 0.2510%
IT 0.8306% EM 0.4804% KE 0.3327% IL 0.2491%
Table B.1: The 100 transitions with their occurrence probabilities.
Appendix C
Class Diagrams
Figure C.1: Class diagram of recognition.
Figure C.2: Class diagram of training.
Figure C.3: Class diagram of data recording.
Bibliography
[1] Carol Padden and Tom Humphries. Deaf in America: voices from a culture. Har-
vard University Press, Cambridge, Mass., 1988. ISBN 0674194233 9780674194236
0674194241 9780674194243.
[2] Matt Huenerfauth and Vicki L. Hanson. Sign Language in the Interface: Access for Deaf Signers.
[3] Carol Padden and Claire Ramsey. American sign language and reading ability in
deaf children. Language acquisition by eye, 1:65–89, 2000.
[4] Maryam Khademi, Hossein Mousavi Hondori, Alison McKenzie, Lucy Dodakian,
Cristina Videira Lopes, and Steven C Cramer. Free-hand interaction with Leap Mo-
tion Controller for stroke rehabilitation. In CHI’14 Extended Abstracts on Human
Factors in Computing Systems, pages 1663–1668. ACM, 2014.
[5] Leap Motion and Kinect used in computer vision. URL http://artandtech.aalto.fi/?page_id=1323.
[6] Tilak Dutta. Evaluation of the Kinect sensor for 3-d kinematic measurement in the
workplace. Applied ergonomics, 43(4):645–649, 2012.
[7] Jože Guna, Grega Jakus, Matevž Pogačnik, Sašo Tomažič, and Jaka Sodnik. An Analysis of the Precision and Reliability of the Leap Motion Sensor and Its Suitability for Static and Dynamic Tracking. Sensors, 14(2):3702–3720, 2014.
[8] The official website of the Leap Motion. https://www.leapmotion.com/.
[9] P. Goh. Automatic recognition of Auslan finger spelling using hidden Markov models. Undergraduate thesis, 2005.
[10] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning,
20(3):273–297, 1995.
[11] Roberto Brunelli. Template matching techniques in computer vision: theory and
practice. John Wiley & Sons, 2009.
[12] Brian D. Ripley. Pattern recognition and neural networks. Cambridge university
press, 1996.
[13] Lawrence R Rabiner. A tutorial on hidden markov models and selected applications
in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[14] Robert Y Wang and Jovan Popovic. Real-time hand-tracking with a color glove.
ACM Transactions on Graphics (TOG), 28(3):63, 2009.
[15] Official website of the Microsoft Kinect. https://www.microsoft.com/en-us/kinectforwindows/.
[16] Tatiana Schmidt, Felipe P Araujo, Gisele L Pappa, and Erickson R Nascimento.
Real-time hand gesture recognition based on sparse positional data.
[17] Andy Liaw and Matthew Wiener. Classification and regression by randomforest. R
news, 2(3):18–22, 2002.
[18] G. Marin, F. Dominio, and P. Zanuttigh. Hand gesture recognition with Leap Motion and Kinect devices. In IEEE International Conference on Image Processing (ICIP), pages 1565–1569, 2014.
[19] James Davis and Mubarak Shah. Visual gesture recognition. In Vision, Image and
Signal Processing, IEE Proceedings-, volume 141, pages 101–106. IET, 1994.
[20] Michal Nowicki, Olgierd Pilarczyk, Jakub Wasikowski, and Katarzyna Zjawin. Gesture recognition library for Leap Motion Controller.
[21] James Gosling. The Java language specification. Addison-Wesley Professional,
2000.
[22] Susanna Ricco and Carlo Tomasi. Fingerspelling Recognition through Classification of Letter-to-Letter Transitions. In Computer Vision – ACCV 2009, pages 214–225, 2010.
[23] Hidden Markov model introduction from Wikipedia. http://en.wikipedia.org/wiki/Hidden_Markov_model.
[24] G David Forney Jr. The viterbi algorithm. Proceedings of the IEEE, 61(3):268–278,
1973.
[25] Douglas Reynolds. Gaussian mixture models. In Encyclopedia of Biometrics, pages
659–663. Springer, 2009.
[26] W. C. Stokoe. Sign language structure: An outline of the visual communication
system of the American deaf. Studies in linguistics, Occasional papers, 8, 1960.
[27] S. Liddell and R. Johnson. American Sign Language: The Phonological Base. Sign
Language Studies, 64:195–277, 1989.
[28] S. Liddell. Structures for representing handshape and local movement at the phonetic level. In Theoretical Issues in Sign Language Research. University of Chicago Press, 1990.
[29] Sign Language IPA. URL http://dedalvs.free.fr/slipa.html#handshape.
[30] Christian Vogler and Dimitris Metaxas. A Framework for Recognizing the Simultaneous Aspects of American Sign Language. Computer Vision and Image Understanding, 81:358–384, 2001.
[31] Christian Vogler and Dimitris Metaxas. Handshapes and movements: Multiple-
channel asl recognition. In Lecture Notes in Computer Science, pages 247–258.
Springer, 2004.
[32] W Sandler. Representing handshapes. International Review of Sign Linguistics, 1:
115–158, 1996.
[33] Paul S Bradley and Usama M Fayyad. Refining initial points for k-means clustering.
In ICML, volume 98, pages 91–99. Citeseer, 1998.
[34] Jan Beirlant, Edward J Dudewicz, Laszlo Gyorfi, and Edward C Van der
Meulen. Nonparametric entropy estimation: An overview. International Journal of
Mathematical and Statistical Sciences, 6(1):17–39, 1997.
[35] An introduction to entropy estimation from Wikipedia. URL https://en.wikipedia.org/wiki/Entropy_estimation.
[36] D. B. Paul. Speech Recognition Using Hidden Markov Models. The Lincoln Laboratory Journal, Volume 3, Number 1, 1990.
[37] The website where we obtained the German corpus. http://korpora.zim.uni-due.de/Leitseite/.
[38] Java implementation of Hidden Markov Model (HMM) related algorithms. https://code.google.com/p/jahmm/.
[39] David D Lewis. Naive (bayes) at forty: The independence assumption in informa-
tion retrieval. In Machine learning: ECML-98, pages 4–15. Springer, 1998.
[40] The official app store, where many game applications controlled by the Leap Motion can be found. https://apps.leapmotion.com/.