SAARLAND UNIVERSITY
Hidden Markov Model Based
Recognition of German Finger Spelling
Using the Leap Motion
Submitted by:
Tengfei Wang
Supervisor:
Dr. Alexis Heloir
A thesis submitted in partial fulfillment for the
degree of Master of Science
in the
Faculty of Natural Sciences and Technology I
Department of Computer and Communication Technology
September 2015
SAARLAND UNIVERSITY
Abstract
Faculty of Natural Sciences and Technology I
Department of Computer and Communication Technology
Master of Science
by Tengfei Wang
Recently, the appearance of novel acquisition devices like the Leap Motion Controller
drew a lot of attention in the field of gesture recognition. It is explicitly targeted
at hand gesture recognition and provides access to the positions of the fingertips and
the orientation of the hand. This new device might be an interesting opportunity for
robust gesture recognition. We would therefore like to evaluate the capabilities of the
Leap Motion for recognizing complex gestures like the ones used in German
finger spelling. In this thesis, we present the German finger spelling recognition system
(GFRS), which is capable of recognizing letter-to-letter transitions in real time. In this
system, instead of modelling static posture for each letter, letter to letter transitions
are modeled using hidden Markov models (HMMs). The models are trained using the
data recorded by the Leap Motion during the performance of transitions. In addition to
the statistical model, a bigram language model is also used to reduce the size of the model
database. Experiments are conducted on both isolated and continuous recognition. For
isolated recognition, the system achieves an accuracy of 80% using a vocabulary of
100 transitions, which improves to 89.96% when the vocabulary size is reduced to 30.
For continuous recognition, the accuracy is 68% when testing on a vocabulary of 10
commonly used German surnames.
Acknowledgements
This thesis required a significant amount of research and programming. The implementation
would not have been possible without the support of many individuals and
organizations. I would therefore like to extend my sincere gratitude to all of them.
First of all, I am thankful to my supervisor, Dr. Alexis Heloir, for providing the necessary
guidance concerning background knowledge as well as suggestions for solving the problems
encountered during the project implementation. I would also like to show my gratitude to
Prof. Dr. Antonio Krüger, who accepted to review this work.
I am grateful to the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI) for
the provision of hardware support for the implementation. Additionally, I extend my thanks
to the Jahmm developers, who provide the source code on which our project is based.
Last but not least, I would like to express my sincere thanks to my family and my
friends for their kind encouragement, which helped me to complete this project.
Contents

Abstract
Acknowledgements
List of Figures
List of Tables
Abbreviations
Symbols

1 Introduction
  1.1 German Dactylology
  1.2 Introduction to State of the Art Gesture Recognition
    1.2.1 General Process
    1.2.2 Input Methods
    1.2.3 Static Gesture Recognition
    1.2.4 Dynamic Gesture Recognition
  1.3 Our Objective

2 The Proposed Method
  2.1 The Leap Motion Controller
  2.2 Letter to Letter Transition
  2.3 Hidden Markov Model
    2.3.1 Basic Structure
    2.3.2 Parameter Set
    2.3.3 Three Basic Problems for HMM
      2.3.3.1 Problem 1
      2.3.3.2 Problem 2
      2.3.3.3 Problem 3
    2.3.4 HMM Used in Our System
      2.3.4.1 Topology - Left to Right
      2.3.4.2 Emission Probability - Mixture of Multivariate Gaussians

3 System Architecture
  3.1 Feature Extractor
  3.2 The Training
    3.2.1 HMM Initialization
      3.2.1.1 Sequence Segmentation
      3.2.1.2 Transition Probabilities (A) Initialization
      3.2.1.3 Emission Probabilities (B) Initialization
    3.2.2 Baum-Welch Algorithm
  3.3 The Recognition
  3.4 Bigram

4 System Implementation
  4.1 Jahmm
    4.1.1 Main Classes
    4.1.2 Extension to Jahmm
    4.1.3 Data Storage
  4.2 Class Diagrams
  4.3 User Interface
    4.3.1 Data Recording
    4.3.2 HMM Training
    4.3.3 Recognition
  4.4 Implementation Issues
    4.4.1 Data Acquisition
    4.4.2 Continuous Recognition

5 Experimental Results
  5.1 Preparation
  5.2 Experiments
    5.2.1 Isolated Recognition
      5.2.1.1 Experiment on Feature Vector
      5.2.1.2 Experiment on Number of States
      5.2.1.3 Experiment on Size of HMM Database
    5.2.2 Continuous Recognition
  5.3 Discussion

6 Conclusion and Further Work

A JSON Structure from the Leap
B The 100 Transitions
C Class Diagrams

Bibliography
List of Figures

1.1 German finger spelling alphabet from http://www.visuelles-denken.de/Schnupperkurs3.html.
2.1 The right-handed Cartesian coordinate system of the Leap Motion.
2.2 The skeletal tracking model of the Leap Motion enables us to access the position of each joint.
2.3 The directional information represented by arrows that the Leap Motion can provide.
2.4 The word "owl" consists of two transitions: o to w and w to l. Picture source: http://www.cafepress.com/+finger-spelling+journals
2.5 A sample of a 5-state hidden Markov model.
3.1 Block diagram of system architecture.
3.2 A to B transition changes only the hand shape.
3.3 Signs of M and N have slight differences.
3.4 An illustration of the concept of finger openness.
3.5 Signs of D and G.
3.6 Signs of I and J.
3.7 Flow chart of the training pipeline.
3.8 An illustration of entropy estimation with W=4.
3.9 Flow chart of the recognition procedure.
3.10 A screen shot of the GBC matrix.
3.11 The occurrence probabilities of transitions in a corpus of 16026 sentences.
4.1 Workspace directory.
4.2 The main frame of GFRS.
4.3 A pop-up frame that gives instructions to the user about which transition to perform.
4.4 The training panel.
4.5 A pop-up frame in which the user can configure the feature vector and the number of states of the HMM.
4.6 The isolated recognition panel.
4.7 Three threads for continuous recognition.
5.1 Sign of letter F in the Leap Motion visualizer.
5.2 Sign of letter E is misrepresented.
5.3 One extended finger recognized as two.
C.1 Class diagram of recognition.
C.2 Class diagram of training.
C.3 Class diagram of data recording.
List of Tables

3.1 Candidate features used in our system.
3.2 Top ten most frequently used letter pairs.
4.1 Classes used in our system from Jahmm.
4.2 Extra classes added to Jahmm.
5.1 The 9 feature vectors used in the experiments.
5.2 Experimental results on different feature vectors. "FV": the feature vector used. "DIM": the dimension of the feature vector. "NS": the number of states of the HMM. "T1": time (ms) needed to train the 100 models. "T2": average time (ms) needed to run an isolated recognition. "DS": size of the HMM database the experiment is based on.
5.3 Experimental results on different numbers of states when using feature vectors "E" and "G".
5.4 Experimental results on different sizes of the HMM database.
5.5 The 31 letter to letter transitions in the ten names.
5.6 Experimental results on continuous recognition.
B.1 The 100 transitions with their occurrence probabilities.
Abbreviations

GFRS German Finger Spelling Recognition System
HMM Hidden Markov Model
ASL American Sign Language
DGS Deutsche Gebärdensprache
SVM Support Vector Machines
RF Random Forest
HCI Human Computer Interaction
FSM Finite State Machines
AFR Auslan Finger-spelling Recognizer
JSON JavaScript Object Notation
API Application Programming Interface
GMM Gaussian Mixture Model
HOC Hand Orientation Change
HPC Hand Position Change
WCSS Within-Cluster Sum of Squares
PDF Probability Density Function
EM Expectation-Maximization
GBC German Bigram Counter
IDE Integrated Development Environment
Symbols
N number of states
T number of observations
St hidden state at time t
O a sequence of observations
Ot the observation at time t
aij the probability of the transition from state i to j
A the transition probability matrix
bj(Ot) the probability of observing Ot in state j at time t
B the emission probability of an HMM
λ the parameter set of an HMM
π the initial state distribution
ωi the ith component weight of a Gaussian Mixture Model
µi the mean vector of the ith component
Σi the covariance matrix of the ith component
ψ the parameter set of a GMM
OK the set of K observation sequences
O(i) the ith sequence of an observation sequence set
To My Parents.
Chapter 1
Introduction
There are tens of millions of Deaf1 and hard of hearing people all over the world. For this
special group, sign languages are mainly used for communication purposes. However, the
linguistic structure of sign languages differs from that of spoken languages in aspects
such as grammar, vocabulary and word order [2]. Furthermore, many Deaf people
suffer from literacy deficiency [2, 3]: it is hard for them to read text in a fluid manner,
because written text is often a transcription of oral language, and Deaf people have
great difficulties acquiring oral language due to the lack of acoustic feedback. All these
facts make it hard for them to learn and to communicate with the rest of society.
Therefore, a system that can make written information more accessible to Deaf people is needed.
Recently, a number of human computer interaction (HCI) devices appeared on the
market (e.g., the Kinect, the Leap Motion). These devices use field-proven methods to
capture the motion of the human body. They were initially developed for gaming
applications, but have also been widely used in interaction research, rehabilitation [4],
computer vision [5] and 3D reconstruction [6, 7]. One might however wonder if these
devices can also be applied to improve Deaf people's lives. Indeed, companies such as
"MotionSavvy"2 claimed that they could develop an application capable of translating
isolated words from American Sign Language (ASL) into voice messages at conversational
speed using the newly introduced Leap Motion Controller [8]. However, the product has
not come to market yet and we could not find any scientific publications to support this
claim. Thus, we would like to evaluate how relevant the Leap Motion could be in the
context of Sign Language recognition. To make things easier, our focus will be on German
finger spelling recognition. Finger spelling is a small subset of Sign Language and consists
of letters of the alphabet represented with one or two hands. It is mostly used to represent
names and technical terms that are not defined in the sign language vocabulary.

1 We follow the convention of writing Deaf with a capitalized "D" to refer to members of the Deaf community [1] who use sign language as their preferred language, whereas deaf refers to the audiological condition of not hearing.
2 Information about the company can be found at http://www.motionsavvy.com
1.1 German Dactylology
There are 31 signs in the German finger spelling alphabet ("Fingeralphabet" in German),
as shown in Figure 1.1. Besides the 30 basic letters, the very frequently used combination
"sch" is also defined. German Sign Language (DGS, Deutsche Gebärdensprache) uses
a one-handed alphabet. Most of the letters are represented by static postures, except
"J", "Z", "Ä", "Ö", "Ü" and "ß".
Figure 1.1: German finger spelling alphabet from http://www.visuelles-denken.de/Schnupperkurs3.html.
1.2 Introduction to State of the Art Gesture Recognition
1.2.1 General Process
Generally speaking, there are two main tasks involved in gesture recognition [9]: feature
extraction and feature classification. A feature, in the context of finger spelling recognition,
is a quantity used to describe the static and dynamic properties of the hands
performing the gesture in a specific frame. It can be global (e.g., position and motion of
the hand) or local (e.g., angle between two fingers, orientation of the hand). Usually, a
set of features is used together to characterize a frame; this set is called a feature vector.
The purpose of feature extraction is to find a feature vector (static gesture) or a set of
feature vectors (dynamic gesture) corresponding to a gesture. A mathematical
model that best describes the feature vector(s) is then built. The model takes the form
of equations which contain a set of parameters, and we call the process of optimizing these
parameters model training.
Feature classification determines which gesture class the extracted feature vector(s) belong to.
Different classification tasks call for different algorithms. Some of the common
methods are Support Vector Machines (SVM) [10], Template Matching [11], Neural
Networks [12] and Hidden Markov Models (HMM) [13].
1.2.2 Input Methods
Input method refers to the nature of the data acquired by the capture device
during the performance of a signer. The most common input methods used
for gesture recognition in the past few years are based on computer vision and on sensors.
For vision-based methods, one or several cameras are used to provide frames from the
captured video sequences [9]. To extract useful features from a specific frame, image
processing methods like segmentation need to be applied first. These algorithms might be
time consuming, which is not ideal for real-time gesture recognition. In addition, vision-based
methods have strict requirements on the environmental conditions; for example,
bad lighting conditions may affect image interpretation significantly and finally lead to
low recognition performance.
In terms of sensor-based methods, cyber gloves [14] are a commonly used device. The
signer wears a glove equipped with sensors which provide hand tracking information
on the position, rotation, movement and orientation of the hand. By using these kinds of
gloves, features can be obtained directly from the information returned by the sensors,
and image processing is no longer needed. However, wearing a glove full of wires while
performing a gesture is inconvenient and the device itself is quite expensive.
Recent consumer-range acquisition devices like the Leap Motion Controller [8] and the
Microsoft Kinect [15] have drawn a lot of attention in the field of gesture recognition. Compared
to the Kinect, the Leap Motion is explicitly targeted at hand gesture recognition and
allows us to access the positions of the fingertips and the hand orientation directly, which
might be an opportunity for robust gesture recognition.
1.2.3 Static Gesture Recognition
Static gesture recognition methods are used when the gesture is kept still during the
time window allowed for the recognition, as is the case for most letters in the German
finger spelling alphabet. To recognize these kinds of gestures, classifiers like Template
Matching and Neural Networks can be used. The most important task of static
gesture recognition is to figure out how the individual parts of the object (e.g., hands,
body) performing a gesture are arranged in relation to each other. A lot of research
has been conducted on static gesture recognition.
Schmidt et al. [16] introduce a methodology for real-time static gesture recognition
capable of dealing with the sparse data provided by the Leap Motion Controller. The
system extracts features that characterize the frame from the Leap Motion; the
whole feature vector F is composed of four vectors measuring different aspects of the
hand:
• The first vector f1 is a 5-dimensional vector containing the distance of each fingertip
to the hand center.
• The second vector f2 is a 4-dimensional vector containing the angles between the
vectors of adjacent fingers.
• The third vector f3 is a 5-dimensional vector holding the angle between each finger
vector and the hand's normal.
• The fourth vector f4 is a 2-dimensional vector holding the radius of the sphere created
by the hand's curvature and the number of fingers detected.
where the finger vector for finger i is computed by υi = pi − c, with pi the fingertip
position and c the center of the hand. They use F = {f1, f2, f3, f4} to train two
classifiers: an SVM and a Random Forest (RF) [17]. The experiments are based on a data set
with 11 different gestures performed by 6 users. The results show that their methodology
provides an accuracy of up to 94.26% when using RF with 100 trees and a depth of 25,
and an accuracy of 89.64% when using SVM. When the paper was published, the Leap
Motion could only provide information on eight 3D points per frame (the hand center, the
positions of the five fingertips, the normal of the palm and the radius of the sphere created
by the hand's curvature). The data is extremely sparse, but the way they use
sparse positional data to create a feature set is instructive.
The study of Marin et al. [18] is similar to the work of Schmidt and his colleagues [16]:
both focus on static gesture recognition using the Leap Motion. When choosing
the feature set, apart from the angle and distance information, they also introduce the
concept of fingertip elevation, which represents the distance of the fingertip from the
plane corresponding to the palm region (accounting also for the fact that the fingertips
can belong to either of the two semi-spaces defined by the palm plane). The feature
set is fed into a multi-class SVM classifier in order to recognize the performed gestures.
Combined with a set of depth features computed from the Kinect, the system can achieve
an accuracy of 75% on a database of 10 gestures.
1.2.4 Dynamic Gesture Recognition
Considering the fact that most gestures used in daily communication and HCI are
dynamic gestures, a lot of approaches have been proposed to deal with dynamic gesture
recognition. When extracting features, apart from the relation of the individual parts of the
object performing the gesture, the variation over time should also be taken into
consideration. Common methods used to model dynamic gestures include HMMs, Finite State
Machines (FSM) [19], etc. Some research related to dynamic gesture recognition is
listed below.
In Paul Goh's PhD thesis [9], he presents the Auslan Finger-spelling Recognizer (AFR),
a system capable of extracting and recognizing finger spelled letters of the Auslan
manual alphabet from monocular video sequences. In his system, each signed letter is
modeled using a single HMM, due to the fact that Auslan finger spelling uses both hands
and is inherently dynamic. The system uses a single USB camera for image recording.
In the feature extraction phase, skin regions are detected and a set of features
including geometric features and an optical flow motion descriptor is extracted from
the video frames. The features he uses are:
• Geometric Features
– Left hand angle of orientation
– Left hand area
– Left hand major axis length
– Left hand minor axis length
– Right hand angle of orientation
– Right hand area
– Right hand major axis length
– Right hand minor axis length
• Motion-based features
– X-velocity optical flow histogram (bins for ranges -4 to 4)
– Y-velocity optical flow histogram (bins for ranges -4 to 4)
The letter models can be obtained using isolated training and further refined using
embedded training. Tests are based on a vocabulary of twenty signed words; the results
show that the system performs best when using a finite state grammar
network and embedded training, with an accuracy of 97% at the letter level and 88% at
the word level. Although the results are promising, the system is not a real-time system,
which means it can only recognize gestures from pre-recorded video sequences.
A library for gesture recognition dedicated to the Leap Motion Controller is presented
in the bachelor's thesis of Nowicki et al. [20]. This library aims at helping developers
build applications using gestures as a human-computer interface. The library supports
two kinds of gestures: static gestures and dynamic gestures. For static gestures, the
recognition is based on SVM. They tested their system using different feature sets on
vocabularies of different sizes. The experimental results show that the system can achieve
an accuracy of 99% on a five-gesture vocabulary and 85% on a ten-gesture vocabulary
when using pre-processing to remove noise from the training data, together with the feature set:
• The number of fingers in a frame.
• The distances from the fingertips to the position of the hand's palm.
• The angles between the fingers and the normal of the hand's palm.
• The five greatest values of the angles between all combinations of finger pairs.
• The five greatest values of the fingertip distances between all combinations of finger pairs.
When it comes to dynamic gestures, the recognition is based on HMMs. In terms of the
feature vector, in addition to the features used in static gesture recognition, the speed of
the hand as well as the magnitude of the hand's displacement are introduced. The best
recognition rate, 80%, is achieved when testing on a 6-gesture vocabulary using a 6-state
HMM. For the first time, their work includes the recognition of dynamic gestures using
the Leap Motion. However, the recognition rate is not satisfying, since the test is based
on a very small gesture vocabulary.
1.3 Our Objective
From the related work we can see that most studies in the past are dedicated
to a small gesture vocabulary. In the literature so far, there is no system that
can recognize continuous finger spelling on the fly. Therefore, we want to evaluate the
Leap Motion by developing a German finger spelling recognition system (GFRS) which
is capable of recognizing continuous German finger spelling in real time.
Chapter 2
The Proposed Method
In this chapter, we present the methodology used to design and implement the German
finger spelling recognition system (GFRS). Firstly, the motion capture device we use, namely
the Leap Motion Controller, is introduced, together with its most interesting features
and the reasons why we chose it as our input device. Secondly, we explain the
concept of letter to letter transition, which is used to build the statistical model. Lastly,
a tutorial on the fundamentals of HMMs and on how they can be adapted to our system is
given.
2.1 The Leap Motion Controller
The Leap Motion Controller is a relatively small device with dimensions of 3 x 1.2 x 0.5
inches, designed for HCI purposes. Its first version was announced on May 21, 2012 by
Leap Motion, Inc.
The device uses optical sensors and infrared light to recognize and track hands, fingers
and finger-like tools at a frame rate of approximately 300 frames per second. It uses a
right-handed Cartesian coordinate system with the origin centered at the top of the device,
as shown in Figure 2.1. The sensors point along the y-axis and have a field of view of about 150
degrees. The effective range of the Leap Motion is between 25 and 600 millimeters
above the device. Combining the data from the sensors with a built-in hand
model, the device can deal with challenging conditions (e.g., part of one hand being covered
by the other).
One of the most appealing features of the Leap Motion is its skeletal tracking model,
which provides bone information in addition to the palm and fingertips. Combined
with the coordinate system, this lets us easily access the positions of the finger joints, the
centers of the bones and the fingertips, as marked by green balls in Figure 2.2.
Figure 2.1: The right-handed Cartesian coordinate system of the Leap Motion.
Apart from positional information, the Leap Motion also provides directional information on
fingertips and palms, as can be seen in the Leap Motion Diagnostic Visualizer in Figure 2.3.
All the information contained in a frame returned by the Leap Motion can be encoded as a
JavaScript Object Notation (JSON) object, as shown in Appendix A. The company also provides
a dedicated application programming interface (API) for different programming languages, with
which developers can acquire the information they need by simply invoking a function. In our
project, we use the Java language [21].
Figure 2.2: The skeletal tracking model of the Leap Motion enables us to access the position of each joint.
Figure 2.3: The directional information represented by arrows that the Leap Motion can provide.
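As a brief illustration of this API, the following Java sketch polls the most recent frame and prints the palm and fingertip positions. It is a minimal sketch assuming the Leap SDK v2 Java binding (classes Controller, Frame, Hand, Finger and Vector) and a running Leap service; exact class and method names may differ between SDK versions.

    import com.leapmotion.leap.Controller;
    import com.leapmotion.leap.Finger;
    import com.leapmotion.leap.Frame;
    import com.leapmotion.leap.Hand;
    import com.leapmotion.leap.Vector;

    public class FramePollingSketch {
        public static void main(String[] args) throws InterruptedException {
            Controller controller = new Controller();
            Thread.sleep(1000); // give the controller time to connect to the service

            Frame frame = controller.frame(); // the most recent tracking frame
            for (Hand hand : frame.hands()) {
                Vector palm = hand.palmPosition(); // millimeters, device coordinates
                Vector normal = hand.palmNormal(); // unit vector pointing out of the palm
                System.out.printf("palm (%.1f, %.1f, %.1f), normal (%.2f, %.2f, %.2f)%n",
                        palm.getX(), palm.getY(), palm.getZ(),
                        normal.getX(), normal.getY(), normal.getZ());
                for (Finger finger : hand.fingers()) {
                    Vector tip = finger.tipPosition(); // fingertip position, also in mm
                    System.out.printf("  %s tip (%.1f, %.1f, %.1f)%n",
                            finger.type(), tip.getX(), tip.getY(), tip.getZ());
                }
            }
        }
    }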
2.2 Letter to Letter Transition
Most letters composing the German finger spelling alphabet are represented by static
postures. The common method for recognizing a static posture is to capture a
single frame corresponding to the posture and then feed it to an SVM [10] or another static
gesture classifier. This is so-called isolated letter recognition. However, we are looking
for a system that can continuously recognize letters in real time at a relatively high
speed (2 letters per second). In a system based on isolated letter recognition, the
system would need to select which frames to classify into letters during the performance of a
signer, which is a difficult task within such a short period of time.
The study of Susanna Ricco and Carlo Tomasi [22] shows us a new perspective: words
can be seen as consisting of letter to letter transitions. For example, as shown in Figure 2.4,
the word "owl" can be seen as a composition of two transitions: "o to w" and "w to l".
Figure 2.4: The word "owl" consists of two transitions: o to w and w to l. Picture source: http://www.cafepress.com/+finger-spelling+journals
If we model transitions instead of isolated letters, the static gesture recognition problem
is converted into a dynamic gesture recognition problem. During the performance of a
signer, there is no need to find a specific frame corresponding to a letter, which is an
error-prone process at conversational speed. On the contrary, all the frames recorded
during the performance of the transition provide useful information for building the
model, which we believe is more reliable than isolated letter modelling. In addition, there
is no "gap" between two transitions: the end of one transition is exactly the start of the
next one. If we can find the end or start points of transitions properly, continuous
recognition becomes easier, reducing to a series of consecutive isolated recognitions.
2.3 Hidden Markov Model
One of the biggest challenges of activity recognition is dealing with the variation between
different signers; in our case, this means that there are always more or less pronounced
differences between two signers performing the same transition. These variations are mainly
caused by the habits of the signers and can hardly be avoided. As a matter of fact, even
a single signer cannot perform exactly the same movement twice. Therefore, a statistical
model is appropriate. Another fact we need to consider is the variation occurring along
the time dimension, that is, the dynamics of the gesture or of the individual parts of
the hand performing it. The hidden Markov model [23] is a statistical model that
can deal with these kinds of variations. Its state-to-state transition mechanism enables
it to capture the changes of a signal over time, which is ideal for activity recognition.
2.3.1 Basic Structure
A hidden Markov model (HMM) is a statistical model in which the system being modeled
is assumed to be a Markov process with unobserved states [23]. The diagram in
Figure 2.5 shows the basic structure of an HMM. Each circle represents a state; we use a
random variable St to denote the hidden state at time t (with the model in the diagram,
we have St ∈ {1, 2, 3, 4, 5}). The random variable Ot is the observation1 at time t,
generated by the current state with probability bj(Ot) (in the diagram, we have b2(Ot),
b3(Ot) and b4(Ot), corresponding to states 2, 3 and 4 respectively). Given an observation Ot,
we cannot tell by which state it was generated; this is where the name "hidden" comes
from. The arrows in the diagram denote conditional dependencies, which means a state
can only be reached from the states that have arrows pointing to it. For example, state 4
can only be reached from state 2, state 3 and itself. The variable aij above an arrow is
the probability of the transition from state i to state j, namely the transition probability.
The conditional probability distribution of the hidden variable St+1 at time t+1 (the future
state) depends only on the value of the hidden variable St at time t (the present state), i.e.,
the values before time t are irrelevant. This is the so-called Markov property. Similarly,
the value of the observed variable Ot depends only on the value of the hidden variable
St, both at time t.
Figure 2.5: A sample of a 5-state hidden Markov model.
1 Observation is a basic term of HMMs; in this thesis, "observation" has the same meaning as "feature vector", as will be discussed later. Similarly, training sequence and recognition sequence are both sequences of observations (observation sequences) in a general sense.
2.3.2 Parameter Set
The state space of the hidden variables is discrete; we use N to denote the number of states,
i.e., the number of elements in this space.
Generally speaking, there is a transition probability from each state to every state, including
itself. If there is no transition between two states, the transition probability is simply
set to 0. Therefore, for an N-state HMM, there are $N^2$ possible transition probabilities,
denoted by an $N \times N$ matrix $A$. The $i$th row of $A$ satisfies the constraint
$\sum_{j=1}^{N} a_{ij} = 1$ with $1 \leq i \leq N$.
Unlike the state space, the observation space can be either discrete or continuous, depending
on the nature of the observed variable. If the observed variables are from a finite
integer set, the observation space is discrete. On the other hand, the observation space
is continuous when the observed variables are, for example, high-dimensional real-valued
vectors. For each state, there is a probability distribution governing the distribution of the
observed variable at a certain time, given the state of the hidden variable at that time. We
use B to denote the emission probability distributions of all the states. In addition, for the
standard type of HMM, we cannot tell which state generates the first observation. Thus, an
N-dimensional vector π is used to denote the initial state distribution. Therefore, the
complete parameter set of an HMM can be indicated by the compact notation:
λ = (A,B, π). (2.1)
2.3.3 Three Basic Problems for HMM
A classic paper written by Lawrence R. Rabiner in 1989 [13] gave a concrete, methodical
review of HMMs. It posed three basic problems for HMMs, together with their solutions in
mathematical form, which form the basis of the HMM based activity recognition applications
of the past decades.
2.3.3.1 Problem 1
Given the observation sequence O = O1 O2 · · · OT and a model λ = (A, B, π), how do we
efficiently compute P(O|λ), the probability of the observation sequence given the model?
This is the evaluation problem and can be solved efficiently by using the first (forward)
part of the Forward-Backward algorithm. In his tutorial [13], Rabiner solves the problem by
induction, described as follows:
The forward variable $\alpha_t(i)$ is defined first:

$\alpha_t(i) = P(O_1 O_2 \cdots O_t, q_t = S_i \mid \lambda),$  (2.2)

which is the probability of the partial observation sequence $O_1 O_2 \cdots O_t$ (until time $t$) and state $S_i$ at time $t$, given the model $\lambda$. Then $\alpha_t(i)$ can be computed inductively:

1. Initialization:
$\alpha_1(i) = \pi_i\, b_i(O_1), \quad 1 \leq i \leq N.$  (2.3)

2. Induction:
$\alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \Big] b_j(O_{t+1}), \quad 1 \leq t \leq T-1, \; 1 \leq j \leq N.$  (2.4)

3. Termination:
$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i).$  (2.5)
The solution will be used for recognition as will be illustrated in the next chapter.
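To make the induction concrete, here is a minimal Java sketch of the forward pass for a discrete-observation HMM (our system uses Gaussian mixture emissions instead, but the control flow is identical; all names are ours):

    class ForwardAlgorithmSketch {
        /** Forward algorithm: returns P(O | lambda) for a discrete HMM.
            pi[i]   initial state distribution
            a[i][j] transition probability from state i to state j
            b[i][k] probability of emitting symbol k in state i
            obs[t]  observation symbol at time t */
        static double forward(double[] pi, double[][] a, double[][] b, int[] obs) {
            int n = pi.length;
            double[] alpha = new double[n];
            // Initialization (2.3): alpha_1(i) = pi_i * b_i(O_1)
            for (int i = 0; i < n; i++) alpha[i] = pi[i] * b[i][obs[0]];
            // Induction (2.4): alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
            for (int t = 1; t < obs.length; t++) {
                double[] next = new double[n];
                for (int j = 0; j < n; j++) {
                    double sum = 0.0;
                    for (int i = 0; i < n; i++) sum += alpha[i] * a[i][j];
                    next[j] = sum * b[j][obs[t]];
                }
                alpha = next;
            }
            // Termination (2.5): P(O | lambda) = sum_i alpha_T(i)
            double p = 0.0;
            for (double v : alpha) p += v;
            return p;
        }
    }

The repeated products underflow quickly for long sequences, so practical implementations use scaling or log probabilities; this issue resurfaces in Section 3.3.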
2.3.3.2 Problem 2
Given the observation sequence O = O1 O2 · · · OT and the model λ = (A, B, π), how
do we choose a corresponding state sequence which is optimal in some meaningful sense
(i.e., best explains the observations)?
The purpose of this problem is to uncover the hidden part of an observation sequence. To
solve it, the Viterbi algorithm [24] is applied. Although this problem does not
provide direct support to our system, it helps to solve Problem 3.
2.3.3.3 Problem 3
How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?
This is an estimation problem. Combining the ideas used to solve Problem 1 and
Problem 2, we can solve this problem by using an iterative procedure such as the
Baum-Welch algorithm [13]. The solution tells us how to build an HMM from an
observation sequence. The tutorial also gives a concrete description of how to solve
this problem.
Again, some variables should be defined first:

1. $\beta_t(i)$, the probability of the partial observation sequence from $t+1$ to the end, given state $S_i$ at time $t$ and the model $\lambda$:
$\beta_t(i) = P(O_{t+1} O_{t+2} \cdots O_T \mid q_t = S_i, \lambda).$  (2.6)

2. $\gamma_t(i)$, the probability of being in state $S_i$ at time $t$, given the observation sequence $O$ and the model $\lambda$. It can be expressed in terms of the forward and backward variables:
$\gamma_t(i) = \dfrac{\alpha_t(i)\,\beta_t(i)}{P(O \mid \lambda)} = \dfrac{\alpha_t(i)\,\beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)}.$  (2.7)
The denominator ensures that $\sum_{i=1}^{N} \gamma_t(i) = 1$.

3. $\xi_t(i, j)$, the probability of being in state $S_i$ at time $t$ and state $S_j$ at time $t+1$, given the model and the observation sequence:
$\xi_t(i, j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)} = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}.$  (2.8)

If we sum $\gamma_t(i)$ over $t$ from $1$ to $T-1$, $\sum_{t=1}^{T-1} \gamma_t(i)$, we get the expected number of transitions from $S_i$. Similarly, summing $\xi_t(i, j)$ over the same range, $\sum_{t=1}^{T-1} \xi_t(i, j)$, gives the expected number of transitions from $S_i$ to $S_j$. Using these two quantities, we can form a set of formulas to re-estimate $\lambda = (A, B, \pi)$.
The initial state distribution:
$\bar{\pi}_i = \gamma_1(i).$  (2.9)

The transition probabilities:
$\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}.$  (2.10)

The emission probabilities:
$\bar{b}_j(k) = \dfrac{\sum_{t=1,\ \text{s.t.}\ O_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}.$  (2.11)
It has been proven by Baum and his colleagues that we always obtain an equally good
or better fitted model $\bar{\lambda}$ when using the current model $\lambda$ to calculate
the right-hand sides of Equations 2.9-2.11.
2.3.4 HMM Used in Our System
2.3.4.1 Topology - Left to Right
There are many kinds of HMMs. So far we have discussed a fully connected HMM, in
which a state can be reached from itself or from any other state in a finite number of steps.
However, this kind of HMM is not appropriate for modeling time signals. In our system,
the time signal is a sequence of feature vectors recorded by the Leap Motion during
the performance of a letter to letter transition by a signer; we call it the observation
sequence or the training sequence. To build a more suitable model, a special type of
HMM, namely the left-right or Bakis model, is used. It has the property that the state
index increases as time goes on, which makes it capable of modeling signals whose properties
change over time.
For a left-right HMM, the transition probability matrix A is an upper triangular matrix,
and the probability that the first observation is generated by the first state is always
one, which means the initial state distribution π is fixed. In the later chapters, we will
use λ = (A, B) to denote the parameter set of the HMMs used in our system.
2.3.4.2 Emission Probability - Mixture of Multivariate Gaussians
The emission probability distribution for each state of the HMM is modeled by a Gaussian
Mixture Model (GMM). A GMM is a parametric probability density function represented
as a weighted sum of Gaussian component densities and is commonly used
as a parametric model of the probability distribution of continuous measurements or
features in a biometric system [25]. It has the ability to form smooth approximations
to arbitrarily shaped densities with a linear combination of different Gaussian
distributions. Because the feature vector in our system is high-dimensional with continuous
elements, each component of the GMM is represented by a multivariate Gaussian
distribution. Mathematically, a mixture of M multivariate Gaussian components
is given by:
$p(\mathbf{x}) = \sum_{i=1}^{M} \omega_i\, g(\mathbf{x} \mid \boldsymbol{\mu}_i, \Sigma_i),$  (2.12)
where $\mathbf{x}$ is a D-dimensional continuous-valued feature vector, the $\omega_i$ are the mixture weights
satisfying the constraint $\sum_{i=1}^{M} \omega_i = 1$, and $g(\mathbf{x} \mid \boldsymbol{\mu}_i, \Sigma_i)$ are the probability density functions
of the components. Each component density has the form:
$g(\mathbf{x} \mid \boldsymbol{\mu}_i, \Sigma_i) = \dfrac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)' \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right\},$  (2.13)
where $\boldsymbol{\mu}_i$ is the mean vector and $\Sigma_i$ is the covariance matrix. The complete model is
composed of the mixture weights, the mean vectors and the covariance matrices of all
components. For convenience, the mixture model is denoted by:

$\psi = \{\omega_i, \boldsymbol{\mu}_i, \Sigma_i\}, \quad i \in \{1, \ldots, M\}.$  (2.14)
Because the overall feature density is composed of a set of Gaussian components, full
covariance matrices are not needed [25]: the correlations between the elements of the
feature vector can be modeled by a linear combination of Gaussians with diagonal
covariances. All the formulas from 2.3 to 2.8 contain the calculation of emission
probabilities, which indicates that a lot of matrix computations are involved. The usage
of diagonal covariances greatly reduces the computational complexity during training
(solving Problem 3) and recognition (solving Problem 1).
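As a sketch, with diagonal covariances the determinant in Equation 2.13 becomes a product of per-dimension variances and the quadratic form becomes a sum over dimensions, so the GMM of Equation 2.12 can be evaluated without any matrix algebra. The following Java fragment (all names are ours) computes the log-density of one feature vector under such a model:

    class GmmDensitySketch {
        /** Log-density log p(x) of a diagonal-covariance GMM (Equations 2.12/2.13).
            w[i] component weights, mu[i][d] means, var[i][d] per-dimension variances. */
        static double gmmLogDensity(double[] x, double[] w, double[][] mu, double[][] var) {
            double density = 0.0;
            for (int i = 0; i < w.length; i++) {
                // With a diagonal covariance, |Sigma| is the product of the variances and
                // the quadratic form is a sum over dimensions, so log g decomposes per axis.
                double logG = 0.0;
                for (int d = 0; d < x.length; d++) {
                    double diff = x[d] - mu[i][d];
                    logG += -0.5 * Math.log(2.0 * Math.PI * var[i][d])
                            - diff * diff / (2.0 * var[i][d]);
                }
                density += w[i] * Math.exp(logG);
            }
            return Math.log(density);
        }
    }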
Chapter 3
System Architecture
This chapter discusses the system architecture of our German finger spelling recognition
system (GFRS) and is divided into four sections: the first section explains how we choose the
features that represent the shape and dynamics of a hand, the second and third sections give
details on model training and recognition respectively, and the last section introduces
a linguistic tool, the bigram, and explains how it is adapted to our system.
GFRS is composed of four modules: the feature extractor, the training module, the
recognition module and the HMM database. The block diagram in Figure 3.1 gives an
intuitive illustration of how the four modules work with each other. Two
processing pipelines are presented in the diagram. The first one, represented by red arrows,
is the training. Through the training pipeline, we obtain the HMM database, which
is a precondition of the second pipeline, namely the recognition, represented by
green arrows.
Figure 3.1: Block diagram of system architecture.
3.1 Feature Extractor
Phonetics, in the case of sign languages, is concerned with the articulation of the hands: the
various hand shapes, locations, orientations and movements. In 1960, William Stokoe
introduced the first phonetic representation system [26] to describe signs. In his system,
a sign is divided into three aspects (tab, dez, and sig) describing the position, configuration
and motion of the hands respectively. In 1989, Liddell and Johnson [27, 28] pointed out
that Stokoe's system ignores the internal segmental structure of signs and proposed the
Movement and Hold model. In their model, signs are composed of Movement (M) and
Hold (H) segments. An M is a segment during which some aspect of the articulation is in
transition, while an H is a segment in which all aspects of the sign are steady. There are also
other well known transcription models, for example SLIPA [29] (An International
Phonetics Alphabet for Signed Languages). All these models break signs into a set of
abstract parts, which brings us to the task: how to describe the abstract quantities with real
values so as to build a mathematical model?
In order to build a transition model, one first needs to record a sufficient number of
training sequences. Each training sequence contains a series of frames of tracking data
from the Leap Motion, recorded during the performance of a transition. The task of the
feature extractor is to take some of the features of a frame and form a vector that represents
the hand's static and dynamic properties so as to distinguish this frame from others:
the so-called feature vector. As can be seen from the system architecture, the input for
both the training and the recognition module is a sequence of feature vectors. We believe that
the feature vector plays a crucial role in our finger spelling recognition system and must
be selected carefully. Features are selected by analysing the properties of the signs in the
German finger spelling alphabet. There are 31 signs in the alphabet, which gives us
31 × 30 = 930 transitions in total (not including self transitions). Some of the transitions
share common properties, and the transitions can be divided into three categories.
• Category 1 - Only the hand shape changes.
The simplest kind of transition involves neither hand movement nor a change of hand
orientation. This is the case when the signs of both letters in the transition
pair are static and have the same palm direction, for instance the A to B transition
in Figure 3.2. During these kinds of transitions, only the hand shape changes.
Thus, we only need to find features which describe the static hand
arrangement, that is, how the individual parts of the hand are arranged in relation
to each other.
Figure 3.2: A to B transition changes only the hand shape.
Features like the distance between two fingertips, as well as the distance of one
fingertip to a specific point of the hand (e.g., the hand center), are usually used to
represent a hand shape [30]. Taking advantage of the skeletal tracking model of
the Leap Motion, the positions of the finger joints are also taken into consideration;
for example, the distance between the thumb fingertip and a finger joint can be
used to distinguish letters whose signs have only slight differences (e.g., M and N
in Figure 3.3). However, the features mentioned above are low level features [31].
Inspired by Sandler's phonological model of hand shape [32], the degree of openness
of a finger is also chosen as a potential feature. In Figure 3.4, the length of the
vertical red line is a measure of finger openness; the length is maximal when the
finger is fully closed and minimal when the finger is fully opened.
Figure 3.3: Signs of M and N have slight differences.
Figure 3.4: An illustration of the concept of finger openness.
• Category 2 - Hand shape and orientation change.
A number of signs differ only in the orientation of the hand (e.g., D and G
in Figure 3.5). If only the features derived from category 1 were used to form the
feature vector, we could not tell the difference between the two transitions "A
to D" and "A to G".
In order to distinguish these kinds of transitions, information on the hand orientation
should be included in the feature vector. We call it "Hand Orientation Change
(HOC)". The orientation can change around the different axes of the Leap Motion; thus
HOC is a three-dimensional vector containing the orientation change around the X-axis
(HOC-X), the Y-axis (HOC-Y) and the Z-axis (HOC-Z). We take the beginning frame
of a transition as a reference and set all the elements to 0.0; the values in the
subsequent frames are the angles of rotation (from 0 to 180 degrees) around the
rotation axes compared to the reference frame.
• Category 3 - Hand shape and position change.
The posture of letter "J" is the same as that of letter "I", except that it contains an
extra hand movement, as shown in Figure 3.6. A double-valued feature is used to
capture the hand movement between two frames. We obtain this feature with a method
very similar to the one used in category 2; it even has a similar name, namely
"Hand Position Change (HPC)". We again choose the beginning frame as a reference
and set the corresponding value to 0.0; the value of HPC in a subsequent frame
is the distance from the hand center of the current frame to the reference hand center.
Figure 3.5: Signs of D and G. Figure 3.6: Signs of I and J.
It is worth noticing that all the features are relative quantities. The absolute
positions of the hand and fingers are not relevant as features, because a gesture is always
the same no matter where it is performed, but they can be used to derive other meaningful
features. Combining the features derived from the three categories, we get the final
feature vector. Table 3.1 lists all the candidate features, from which any
combination can be chosen to form a feature vector. The features from categories 2 and 3 are
essential for recognizing transitions that involve orientation and position changes of the hand,
while the number of features used from category 1 has a big influence on the accuracy of the
recognition. On the one hand, if only a small number of features from category 1 is used, it is not
sufficient to characterize a hand shape well enough to distinguish similar transitions. On the other
hand, if we use too many features, a larger training data set is needed.
Number  Code    Feature Description                 Unit     Category
1       TC      Thumb to Hand Center                mm       1
2       IC      Index to Hand Center                mm       1
3       MC      Middle to Hand Center               mm       1
4       RC      Ring to Hand Center                 mm       1
5       PC      Pinky to Hand Center                mm       1
6       TI      Thumb to Index                      mm       1
7       IM      Index to Middle                     mm       1
8       MR      Middle to Ring                      mm       1
9       RP      Ring to Pinky                       mm       1
10      EFC     Extended Fingers Count              count    1
11      IO      Index Finger Openness               mm       1
12      HOC-X   Hand Orientation Change X-axis      degree   2
13      HOC-Y   Hand Orientation Change Y-axis      degree   2
14      HOC-Z   Hand Orientation Change Z-axis      degree   2
15      HPC     Hand Position Change                mm       3

Table 3.1: Candidate features used in our system.
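To make the table concrete, the sketch below assembles one possible feature vector (the five fingertip-to-center distances, HOC and HPC) for a single frame. It assumes the Leap SDK v2 Java binding, where Hand.rotationAngle and Hand.translation report motion relative to an earlier reference frame; the actual feature combination is configurable in our system, so this is only an illustration, not the definitive extractor.

    import com.leapmotion.leap.Finger;
    import com.leapmotion.leap.Frame;
    import com.leapmotion.leap.Hand;
    import com.leapmotion.leap.Vector;

    public class FeatureExtractorSketch {
        /** Builds one feature vector (a subset of Table 3.1) for the given frame.
            referenceFrame is the first frame of the transition, the HOC/HPC baseline. */
        static double[] extract(Frame frame, Frame referenceFrame) {
            Hand hand = frame.hands().frontmost();
            Vector center = hand.palmPosition();

            double[] fv = new double[9];
            // Category 1: fingertip-to-hand-center distances (TC, IC, MC, RC, PC),
            // indexed thumb = 0 ... pinky = 4 via the finger type
            for (Finger finger : hand.fingers()) {
                fv[finger.type().ordinal()] = finger.tipPosition().distanceTo(center); // mm
            }
            // Category 2: hand orientation change around each axis (HOC-X/Y/Z),
            // measured against the reference frame, mapped to 0..180 degrees
            fv[5] = Math.toDegrees(Math.abs(hand.rotationAngle(referenceFrame, Vector.xAxis())));
            fv[6] = Math.toDegrees(Math.abs(hand.rotationAngle(referenceFrame, Vector.yAxis())));
            fv[7] = Math.toDegrees(Math.abs(hand.rotationAngle(referenceFrame, Vector.zAxis())));
            // Category 3: hand position change (HPC), distance moved since the reference frame
            fv[8] = hand.translation(referenceFrame).magnitude(); // mm
            return fv;
        }
    }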
3.2 The Training
The purpose of the training is to build a database containing all the transition models. In
our system, each model is trained separately. Every training sequence for a transition
is composed of, on average, 60 frames returned by the Leap Motion during the performance
of a signer. To make the models more reliable, at least 10 training sequences are
used to train one model, which means the signer has to perform the same transition
several times. Once the training sequences for one transition are prepared, a first
approximated HMM is built. After that, an iterative procedure takes the training data
and the approximated HMM as arguments to the Baum-Welch algorithm, which re-estimates the
transition probabilities (A) and emission probabilities (B) and outputs a new HMM. The
iterative procedure stops when the new model no longer improves the
probability that the training sequences are generated by it, compared to the model from the
last iteration. Figure 3.7 gives an intuitive illustration of the procedure as a flow chart.
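In code, this pipeline can be summarized roughly as follows. This is only a sketch: Hmm is a placeholder model type, and segment, initTransitions, initEmissions, averageLength, baumWelchStep and logLikelihood are hypothetical helpers standing for Sections 3.2.1.1, 3.2.1.2, 3.2.1.3 and 3.2.2.

    class TrainingPipelineSketch {
        /** Trains one transition model from its recorded sequences (sketch only).
            sequences[k][t] is the t-th feature vector of the k-th training sequence. */
        static Hmm train(double[][][] sequences, int numStates, int maxIterations) {
            // First approximation (Section 3.2.1)
            int[][] segmentation = segment(sequences, numStates);                    // 3.2.1.1
            Hmm hmm = new Hmm(initTransitions(numStates, averageLength(sequences)),  // 3.2.1.2
                              initEmissions(sequences, segmentation));               // 3.2.1.3
            // Iterative re-estimation (Section 3.2.2)
            double previous = Double.NEGATIVE_INFINITY;
            for (int iteration = 0; iteration < maxIterations; iteration++) {
                Hmm candidate = baumWelchStep(hmm, sequences);
                double likelihood = logLikelihood(candidate, sequences); // log P(O | lambda)
                if (likelihood <= previous) break; // no improvement: treat as converged
                hmm = candidate;
                previous = likelihood;
            }
            return hmm;
        }
    }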
3.2.1 HMM Initialization
According to Rabiner's tutorial [13], the Baum-Welch algorithm leads to local maxima
only. In addition, when the dimension of the feature vector is high or the observation
sequences are very long, it might take a very long time to reach convergence. To
improve efficiency, only a limited number of iterations is applied during the training.
In order to obtain a more suitable model within this limited number of iterations, we pay
extra attention to the first approximation of the HMM.
Actually, there is no straightforward method to initialize an HMM. Experiments show
that both random and uniform initial estimates of A give useful re-estimation results in most applications.
[Figure 3.7 shows the training pipeline as a flow chart: Training Sequences → Sequence Segmentation → Initial HMM → Baum-Welch Algorithm → New λ = (A, B) → Convergence? If no, the Baum-Welch step is repeated; if yes, the final λ is output.]
Figure 3.7: Flow chart of the training pipeline.
However, an appropriate initialization of B is essential for HMMs
with mixtures of continuous emission probability distributions. This can be done by
using sequence segmentation.
3.2.1.1 Sequence Segmentation
Given several training sequences, the very first step in initializing an N-state HMM is to
segment each training sequence into N groups, each labeled with a number from 1 to N.
When all the training sequences have been segmented, all groups with the same label
are merged into one group. Finally, we allocate each merged group to the
corresponding state and use it to initialize the emission probability distributions. The
problem is: how do we segment the training sequences in a meaningful sense?
There are a number of methods dealing with sequence segmentation; one of the famous
methods is K-means clustering [33]. The aim of K-means clustering is to partition
a group of observations into K sets so as to minimize the within-cluster sum of squares
(WCSS). However, this method does not maintain the time-evolving property of the
observation sequence. In our system, histogram-based entropy estimation [34, 35] is
used instead, which segments a sequence into states while averaging the observations within
states. During the segmentation, entropy is used to measure the amount of similarity
between streams of feature vectors: the lower the similarity (the bigger the entropy), the more
probable it is a segmentation point. The following steps segment
one training sequence:
1. Choose a window size W and put the first W feature vectors of the sequence in the
window, as shown in Figure 3.8. On these W feature vectors, build a GMM with probability
density function (PDF) f(x) using the expectation-maximization (EM) algorithm, as will be
illustrated in Section 3.2.1.3.

2. Compute the entropy H(X) of the W feature vectors using the formula:
$H(X) = -\sum_{i=1}^{W} f(x_i) \log f(x_i),$  (3.1)
where $x_i$ is the ith feature vector in the window.

3. Move the window forward by one feature vector at a time and repeat steps 1 and 2 until
the window reaches the end of the training sequence. This yields a sequence of
entropies.

4. Find the N−1 biggest peak values in the entropy sequence; the corresponding
indices in the training sequence are the segmentation points. If there are fewer than
N−1 peaks, the training sequence is segmented into equal parts.
The result of the sequence segmentation will be used to initialize B.
Figure 3.8: An illustration of entropy estimation with W=4.
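A sketch of steps 1 to 3 in Java follows; Gmm and fitGmm are hypothetical placeholders for the EM-trained mixture of Section 3.2.1.3.

    class SegmentationSketch {
        /** Slides a window of size w over one training sequence and returns one entropy
            value per window position (steps 1 to 3); step 4 then takes the N-1 highest
            peaks of this curve as segmentation points. */
        static double[] entropyCurve(double[][] sequence, int w) {
            double[] entropies = new double[sequence.length - w + 1];
            for (int start = 0; start + w <= sequence.length; start++) {
                Gmm pdf = fitGmm(sequence, start, start + w); // step 1: EM on the window
                double h = 0.0;
                for (int i = start; i < start + w; i++) {
                    double f = pdf.density(sequence[i]);
                    h -= f * Math.log(f); // step 2: Equation 3.1
                }
                entropies[start] = h; // step 3: shift the window by one vector and repeat
            }
            return entropies;
        }
    }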
3.2.1.2 Transition Probabilities (A) Initialization
To initialize A, a proper value in the interval [0, 1] has to be placed in each $a_{ij}$. The
initialization contains the following steps:

1. Because A is an upper triangular matrix, set $a_{ij} = 0$ for all $i > j$.

2. The transition probability from each state to its next reachable state is
initialized with a value p, obtained as follows. First, compute the average number of
observations over all training sequences $\mathbf{O} = [O^{(1)}, O^{(2)}, \cdots, O^{(k)}]$ using
$\overline{O}_{num} = \dfrac{\sum_{i=1}^{k} O^{(i)}_{num}}{k},$  (3.2)
where k is the number of training sequences and $O^{(i)}_{num}$ is the number of observations
in the ith sequence. Then p is calculated from $\overline{O}_{num}$ and the number of states N:
$p = \dfrac{1}{\overline{O}_{num} / N}.$  (3.3)

3. For each state $i < N - 1$, set the forward transition $a_{i,i+1} = p$.

4. For each state $i < N - 1$, set the self transition $a_{ii} = 1 - p$.

5. Set $a_{N-1,N-1} = 1$, because there is no transition from the last state to any other state.
The initialized transition matrix then has the form:

$A_{N,N} = \begin{pmatrix} 1-p & p & 0 & \cdots & 0 \\ 0 & 1-p & p & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}.$  (3.4)
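The corresponding initialization translates into a few lines of Java (a minimal sketch; names are ours):

    class TransitionInitSketch {
        /** Builds the initial left-right transition matrix of Equation 3.4.
            averageLength is the average number of observations per sequence (Eq. 3.2). */
        static double[][] initTransitions(int numStates, double averageLength) {
            double p = 1.0 / (averageLength / numStates); // Equation 3.3
            double[][] a = new double[numStates][numStates]; // entries below the diagonal stay 0
            for (int i = 0; i < numStates - 1; i++) {
                a[i][i] = 1.0 - p; // stay in the current state
                a[i][i + 1] = p;   // advance to the next state
            }
            a[numStates - 1][numStates - 1] = 1.0; // the last state only loops on itself
            return a;
        }
    }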
3.2.1.3 Emission Probabilities (B) Initialization
After sequence segmentation, each state gets a set of training feature vectors. We want
to find a GMM, ψ, which in some sense best matches the distribution of these training
feature vectors.
Formally, the problem can be described as: given a set of T training feature vectors
O = {o1, o2, · · · , oT}, how do we find the model parameters which maximize the likelihood
of the GMM,

$P(O \mid \psi) = \prod_{t=1}^{T} P(\mathbf{o}_t \mid \psi)\,?$  (3.5)
The parameters can be obtained iteratively using the EM algorithm. The EM algorithm
works as follows: begin with an initial estimate ψ (e.g., random) and use it to estimate a new
model $\bar{\psi}$ such that $P(O \mid \bar{\psi}) \geq P(O \mid \psi)$. The new model is then treated as the initial
model in the next iteration. The iteration terminates when the convergence criterion is
met.
During each iteration, the model parameters are re-estimated by the following formulas:

Mixture weights:
$\bar{\omega}_i = \dfrac{1}{T} \sum_{t=1}^{T} \Pr(i \mid \mathbf{o}_t, \psi).$  (3.6)

Means:
$\bar{\boldsymbol{\mu}}_i = \dfrac{\sum_{t=1}^{T} \Pr(i \mid \mathbf{o}_t, \psi)\, \mathbf{o}_t}{\sum_{t=1}^{T} \Pr(i \mid \mathbf{o}_t, \psi)}.$  (3.7)

Variances:
$\bar{\sigma}_i^2 = \dfrac{\sum_{t=1}^{T} \Pr(i \mid \mathbf{o}_t, \psi)\, o_t^2}{\sum_{t=1}^{T} \Pr(i \mid \mathbf{o}_t, \psi)} - \bar{\mu}_i^2,$  (3.8)

where $o_t$, $\bar{\mu}_i$ and $\bar{\sigma}_i^2$ refer to arbitrary elements of the vectors $\mathbf{o}_t$, $\bar{\boldsymbol{\mu}}_i$ and $\bar{\boldsymbol{\sigma}}_i^2$, respectively.

The posterior probability for component i is given by:
$\Pr(i \mid \mathbf{o}_t, \psi) = \dfrac{\omega_i\, g(\mathbf{o}_t \mid \boldsymbol{\mu}_i, \Sigma_i)}{\sum_{k=1}^{M} \omega_k\, g(\mathbf{o}_t \mid \boldsymbol{\mu}_k, \Sigma_k)}.$  (3.9)
These formulas guarantee a monotonic increase of the model's likelihood value, and the
value improves significantly within the first few iterations. Therefore, in the
initialization phase, only a limited number of iterations is performed.
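One re-estimation pass over Equations 3.6 to 3.9 for a diagonal-covariance GMM could be sketched in Java as follows (all names are ours; a production implementation would add log-domain computation and guards against empty components):

    class GmmEmSketch {
        /** One EM iteration (Equations 3.6-3.9) for an M-component diagonal GMM.
            o[t][d] are the training vectors; w, mu and var are updated in place. */
        static void emStep(double[][] o, double[] w, double[][] mu, double[][] var) {
            int bigT = o.length, m = w.length, dim = o[0].length;
            double[][] post = new double[bigT][m]; // Pr(i | o_t, psi), Equation 3.9
            for (int t = 0; t < bigT; t++) {
                double norm = 0.0;
                for (int i = 0; i < m; i++) {
                    post[t][i] = w[i] * diagGaussian(o[t], mu[i], var[i]);
                    norm += post[t][i];
                }
                for (int i = 0; i < m; i++) post[t][i] /= norm;
            }
            for (int i = 0; i < m; i++) {
                double sum = 0.0;
                double[] newMu = new double[dim], newVar = new double[dim];
                for (int t = 0; t < bigT; t++) {
                    sum += post[t][i];
                    for (int d = 0; d < dim; d++) {
                        newMu[d] += post[t][i] * o[t][d];            // numerator of Eq. 3.7
                        newVar[d] += post[t][i] * o[t][d] * o[t][d]; // numerator of Eq. 3.8
                    }
                }
                w[i] = sum / bigT; // Equation 3.6
                for (int d = 0; d < dim; d++) {
                    mu[i][d] = newMu[d] / sum;
                    var[i][d] = newVar[d] / sum - mu[i][d] * mu[i][d]; // Equation 3.8
                }
            }
        }

        /** Density of one diagonal multivariate Gaussian component. */
        static double diagGaussian(double[] x, double[] mu, double[] var) {
            double logG = 0.0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - mu[d];
                logG += -0.5 * Math.log(2.0 * Math.PI * var[d]) - diff * diff / (2.0 * var[d]);
            }
            return Math.exp(logG);
        }
    }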
3.2.2 Baum-Welch Algorithm
In section 2.3.3 of Chapter 2, we discussed the general Baum-Welch algorithm, which can be used to train an HMM. Formulas 2.9-2.11 use only one training sequence to estimate the parameters of a fully connected HMM. However, in our system we use left-right HMMs and multiple training sequences, so the estimation formulas need to be refined.
Given a set of K observation sequences O = [O^{(1)}, O^{(2)}, · · · , O^{(K)}], where O^{(k)} = [O_1^{(k)} O_2^{(k)} · · · O_{T_k}^{(k)}] is the k-th observation sequence, and under the assumption that all observation sequences are independent of each other, our goal is to find the parameter set λ which maximizes

\[
P(O \mid \lambda) = \prod_{k=1}^{K} P(O^{(k)} \mid \lambda) = \prod_{k=1}^{K} P_k. \tag{3.10}
\]
Since the re-estimation procedure is based on counting the occurrences of various events, the re-estimation for multiple observation sequences simply sums up the individual occurrence frequencies of each sequence, giving:

Transition Probabilities

\[
\bar{a}_{ij} = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\, a_{ij}\, b_j(O_{t+1}^{(k)})\, \beta_{t+1}^k(j)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\, \beta_t^k(i)}. \tag{3.11}
\]
Re-estimating the emission probabilities actually means re-estimating the parameters of the GMMs. In the initialization phase, only the segmented sets of feature vectors are used to train the GMM parameters. In the Baum-Welch phase, we use the state occupation probabilities to adjust the parameters so that $P(O \mid \lambda)$ is maximized. The re-estimation formulas are given by:
Mixture Weights

\[
\bar{c}_{jn} = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \gamma_t^k(j, n)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \sum_{n=1}^{M} \gamma_t^k(j, n)}. \tag{3.12}
\]
Mean Vectors

\[
\bar{\mu}_{jn} = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \gamma_t^k(j, n)\, O_t^{(k)}}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \gamma_t^k(j, n)}. \tag{3.13}
\]
Covariance Matrices

\[
\bar{\Sigma}_{jn} = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \gamma_t^k(j, n)\, (O_t^{(k)} - \bar{\mu}_{jn})(O_t^{(k)} - \bar{\mu}_{jn})'}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \gamma_t^k(j, n)}, \tag{3.14}
\]
where $\gamma_t^k(j, n)$ is the probability of being in state j at time t with the n-th mixture component accounting for $O_t^{(k)}$, given by:

\[
\gamma_t^k(j, n) = \left[ \frac{\alpha_t^k(j)\, \beta_t^k(j)}{\sum_{j=1}^{N} \alpha_t^k(j)\, \beta_t^k(j)} \right] \left[ \frac{c_{jn}\, \psi(O_t^k, \mu_{jn}, \Sigma_{jn})}{\sum_{m=1}^{M} c_{jm}\, \psi(O_t^k, \mu_{jm}, \Sigma_{jm})} \right]. \tag{3.15}
\]
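In the implementation, these multi-sequence re-estimation formulas are delegated to Jahmm's learner classes rather than coded by hand. The following is a sketch of how a transition model could be trained, assuming Jahmm's documented learner interface; the iteration count shown is an assumption chosen for illustration.

    import be.ac.ulg.montefiore.run.jahmm.Hmm;
    import be.ac.ulg.montefiore.run.jahmm.ObservationVector;
    import be.ac.ulg.montefiore.run.jahmm.learn.BaumWelchScaledLearner;
    import java.util.List;

    // Sketch: re-estimating a transition model from K training sequences.
    // The scaled learner is used to avoid numerical underflow on long sequences.
    static Hmm<ObservationVector> train(Hmm<ObservationVector> initial,
            List<List<ObservationVector>> sequences) {
        BaumWelchScaledLearner learner = new BaumWelchScaledLearner();
        learner.setNbIterations(10);   // assumed value; tuned experimentally
        return learner.learn(initial, sequences);
    }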
3.3 The Recognition
The precondition for running the recognition is that there are trained transition models in the HMM database. The recognition procedure can be described as:

1. The user performs a transition in the effective area of the Leap Motion; an observation sequence O containing a set of feature vectors is returned for recognition and is called the recognition sequence.

2. For every HMM $\lambda_i$ in the database, use the forward-backward algorithm as described in section 2.3.3 of Chapter 2 to calculate $P_i = P(O \mid \lambda_i)$, which is the probability that O is generated by model $\lambda_i$.

3. Find the biggest probability value P; the corresponding model λ is the recognized transition. A sketch of this loop is given below.
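The following minimal Java sketch illustrates steps 2 and 3, assuming a map from transition names to trained Jahmm models; Hmm.probability is Jahmm's forward-algorithm likelihood, and the method name recognize is hypothetical.

    import be.ac.ulg.montefiore.run.jahmm.Hmm;
    import be.ac.ulg.montefiore.run.jahmm.ObservationVector;
    import java.util.List;
    import java.util.Map;

    // Score the recognition sequence against every model in the database and
    // return the name of the model with the highest probability.
    static String recognize(Map<String, Hmm<ObservationVector>> database,
            List<ObservationVector> o) {
        String best = null;
        double bestP = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Hmm<ObservationVector>> e : database.entrySet()) {
            double p = e.getValue().probability(o);   // P(O | lambda_i)
            if (p > bestP) { bestP = p; best = e.getKey(); }
        }
        return best;
    }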
During the training, we prefer to use multiple long training sequences, because a large training data set helps to build a more reliable model. However, this does not apply to the recognition sequence.
Figure 3.9: Flow chart of the recognition procedure.
In fact, the average length of a recognition sequence in our system is only 10, in order to avoid the numerical underflow problem [36]. The forward-backward algorithm requires on the order of $N^2 T$ calculations, namely $N(N+1)(T-1) + N$ multiplications (with T the length of the sequence and N the number of states of the HMM). It can be seen from this formula that the number of multiplications is proportional to the length of the sequence. Because the multipliers are probabilities with values in the interval [0, 1], the intermediate result becomes smaller and smaller as the number of multiplications increases, finally leading to numerical underflow.
3.4 Bigram
A bigram is a sequence of two adjacent elements in a string of tokens, which are typically letters, syllables, or words. In our system, a bigram refers to a letter to letter transition. As discussed in the previous section, there are 930 possible transitions if self transitions are not considered. To recognize a transition, the forward-backward algorithm would have to be applied 930 times to find the best fitting model. Clearly, this is too expensive for a real time gesture recognition system.
In fact, not all transitions occur in German word formation. To remove the unnecessary transitions and minimize the size of the HMM database, we implemented a program called the German Bigram Counter (GBC), which analyses the occurrence probabilities of transitions in text corpora. The program reads corpora in the form of text files from a local folder and shows the statistical result in a 31 × 31 matrix, as shown in Figure 3.10.
Figure 3.10: A screen shot of the GBC matrix.
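The core of such a counter reduces to a few lines. The sketch below is a simplification that treats every character as one token; the real GBC additionally has to handle the multi-character signs of the manual alphabet (e.g., SCH) to fill its 31 × 31 matrix, and the name countBigrams is hypothetical.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.stream.Stream;

    // Count adjacent letter pairs inside each word of a text corpus.
    static Map<String, Integer> countBigrams(Stream<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        lines.forEach(line -> {
            for (String word : line.toUpperCase().split("[^A-ZÄÖÜ]+"))
                for (int i = 0; i + 1 < word.length(); i++)
                    counts.merge(word.substring(i, i + 2), 1, Integer::sum);
        });
        return counts;             // divide by the total count to get probabilities
    }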
Figure 3.11 shows the statistical curve obtained when testing on a German corpus of 16,026 sentences downloaded from a corpora website [37]; all sentences are extracted from literary works written in German. We can see that the curve drops dramatically after a certain point, which indicates that the transitions beyond that point barely occur in German word formation. If these transitions are ignored, we are left with the 100 most frequently used transitions, which cover 80% of the letter to letter transitions in German.

Figure 3.11: The occurrence probability of transitions in a corpus of 16,026 sentences.
Table 3.2 lists the ten most frequently occurring letter pairs. As can be seen from the table, the letter pair "EN" tops the list, which makes sense because nearly all German verb infinitives end in "EN". The complete list containing all 100 transitions can be found in Appendix B.
Transition  Probability (%)
EN 4.057
ER 3.767
CH 2.678
DE 2.196
EI 2.110
ND 2.015
TE 1.818
IE 1.631
IN 1.614
UN 1.604
Table 3.2: Top ten most frequently used letter pairs
Chapter 4
System Implementation
The German finger spelling recognition system (GFRS) was developed using Eclipse1 on a laptop running 32-bit Windows 7. It makes use of the Java library in the Leap Motion software development kit (SDK) and of an HMM toolkit called Jahmm [38]. The system has also been tested successfully on Linux and Mac OS.
This chapter describes the details of the system implementation and is divided into four sections. The first section introduces the HMM library we use. The second section explains how the functions are implemented, in terms of class diagrams. The third section gives a description of the user interface. In the last section, we present some obstacles encountered during the implementation as well as their solutions.
4.1 Jahmm
Jahmm is a Java library that implements HMM-related algorithms. It is mainly designed for research and teaching purposes; therefore, all the algorithms are implemented in a general manner according to the theory, and the code can easily be understood and modified to fit different applications. There are six packages in Jahmm, each with a different function:
• run.distributions implements various pseudo-random distributions.
• run.jahmm is an HMM implementation.
• run.jahmm.draw helps drawing HMM-related objects.
1Eclipse is an integrated development environment (IDE). It contains a base workspace and an extensible plug-in system for customizing the environment. The Eclipse SDK is free and open source software under the terms of the Eclipse Public License. Different releases can be found at http://www.eclipse.org/downloads/; we use the Indigo release.
• run.jahmm.io holds classes that read and write HMM-related objects.
• run.jahmm.learn holds HMM-related learning algorithms.
• run.jahmm.toolbox holds HMM-related tool algorithms.
4.1.1 Main Classes
The most basic and important class of Jahmm is Hmm, it contains all the elements of
an HMM: number of states, state to state transition probabilities, emission probability
for each state, and a bunch of methods to set and return the parameters of an HMM.
The class Observation defines the observation of an HMM. An observation can be discrete or continuous: it can be an integer or a double-valued vector. Jahmm implements some commonly used observation types:

• The ObservationInteger class holds integer observations.

• The ObservationDiscrete class holds observations whose values are taken out of a finite set.

• The ObservationReal class holds real observations (implemented as a double).

• The ObservationVector class holds vectors of real values (implemented as doubles).

A sequence of observations is simply implemented as a vector of observations, and a set of observation sequences as a vector of such vectors. To be useful, each kind of observation should have at least one observation probability distribution function. For example, the ObservationVector class can be used together with the class OpdfMultiGaussian, which implements a multivariate Gaussian distribution. The Viterbi, forward-backward, and Baum-Welch algorithms are implemented in the classes ViterbiCalculator, ForwardBackwardCalculator, and BaumWelchLearner, respectively.
Table 4.1 lists the classes of Jahmm that are used in our system:

Package           Class
run.jahmm         ForwardBackwardCalculator
                  Hmm<O extends Observation>
                  ObservationVector
                  ViterbiCalculator
run.jahmm.io      HmmReader
                  HmmWriter
                  ObservationVectorReader
                  ObservationVectorWriter
                  OpdfWriter<O extends Opdf<?>>
                  OpdfReader<O extends Opdf<?>>
run.jahmm.learn   BaumWelchScaledLearner

Table 4.1: Classes used in our system from Jahmm.
4.1.2 Extension to Jahmm
Although Jahmm already implements some commonly used state emission probability distributions, the mixture of multivariate Gaussian distributions that is required by our system is not one of them. Therefore, based on the existing distributions, some classes related to the Gaussian mixture model were added to the corresponding packages, as shown in Table 4.2.
Package            Class
run.jahmm          OpdfMultiGaussianMixture
                   OpdfMultiGaussianMixtureFactory
run.jahmm.io       OpdfMultiGaussianMixtureReader
                   OpdfMultiGaussianMixtureWriter
run.distributions  MultiGaussianMixtureDistribution

Table 4.2: Extra classes added to Jahmm.
4.1.3 Data Storage
The data used by our system is stored on the local disk of the computer running it; we call this location the workspace and name it GFRworkspace. The first task of the system when it is launched is to check whether the GFRworkspace folder and its subfolders exist; if not, the system creates new ones. The directory structure of the workspace is shown in Figure 4.1.

The workspace contains five folders, and the data in the different folders are used for different purposes.
GFRworkspace
├── Training Sequences
├── HMM Database
├── Backup
│   ├── Training
│   └── HMM
├── TestLogs
└── Test Sequences

Figure 4.1: Workspace directory.
• Training Sequences
When the system is running, all newly recorded training sequences are stored in this folder. The sequences used to train one transition model are encoded in one file. The user can only use data from this folder to train the models. The folder is cleaned up and all data files are moved to the Backup folder when the system exits.
• HMM Database
When data from Training Sequences is used to train models, the trained models are put into this folder, with one file per model. Recognition can only be based on the models in this folder. Together with the Training Sequences folder, it forms what we call the current workspace. When the system exits, this folder is also copied to the Backup folder.
• Logs
When an isolated recognition is run, a textual log file recording the computation process is stored in this folder. The logs are mainly used for debugging purposes.
• Backup
Files from the folders Training Sequences and HMM Database can be stored in this folder as backups. The user can also import them from the Backup folder into the current workspace, so that he/she does not need to record training data or build the HMM database on his/her own.
• Test Sequences
This folder contains recognition sequences (test sequences) used for testing purposes. Each file in the folder contains exactly one sequence corresponding to a transition. Combined with the models in HMM Database, they allow us to evaluate the system in terms of recognition accuracy.
There are mainly two kinds of data that need to be stored in the workspace: the observation sequence data (including the training sequences and test sequences) and the trained HMMs. Both are encoded as text files.
Training sequences are written to a text file by the class ObservationWriter. One file contains the sequences necessary to train one model; for example, 3 training sequences of different lengths are stored in the form:
obs11 ; obs12 ; obs13 ;
obs21 ; obs22 ; obs23 ; obs24 ; obs25 ;
obs31 ; obs32 ; obs33 ; obs34 ;
The file is named after the transition letter pair and has the extension ".seq". For example, the file that contains the training sequences of the A to B transition is named "AB.seq". Training sequences can be read from the corresponding file using the class ObservationReader. A test sequence file differs from a training sequence file only in that it contains exactly one observation sequence.
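As a minimal sketch, writing that layout amounts to emitting one line per sequence with observations separated by " ; "; here the observations are assumed to already be rendered as strings, and the name writeSequences is hypothetical.

    import java.io.IOException;
    import java.io.Writer;
    import java.util.List;

    // Write one sequence per line, each observation followed by " ; ".
    static void writeSequences(Writer out, List<List<String>> sequences)
            throws IOException {
        for (List<String> seq : sequences) {
            for (String obs : seq) out.write(obs + " ; ");
            out.write(System.lineSeparator());
        }
        out.flush();
    }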
In terms of HMMs, the syntax is quite straightforward. For a 4-state HMM with an integer emission probability distribution for each state, the text file looks like:
Hmm v1.0
NbStates 4
State
Pi 0.3
A 0.1 0.2 0.3 0.4
IntegerOPDF [0.2 0.8 ]
State
Pi 0.3
A 0.2 0.4 0.2 0.2
IntegerOPDF [0.5 0.5 ]
State
Pi 0.2
A 0.3 0.3 0.2 0.2
IntegerOPDF [0.3 0.7 ]
State
Pi 0.2
A 0.2 0.2 0.4 0.2
IntegerOPDF [0.5 0.5 ]
It is worth pointing out that all white space between lines and words is treated as a single space; in other words, the whole file could consist of only one line. When the file is read using the class HmmReader, the parameters are recognized according to the keywords preceding them.
The first line simply gives the version number of the file syntax. The keyword NbStates gives the number of states of the described HMM. After that come the descriptions of the states; the states appear in order, and each state description is composed of:
• State indicates the start of a state description.
• Pi is the probability that this state is an initial state.
• A is a list that contains the state to state transition probabilities. If the state currently described is numbered i, then the jth probability of the list is that of going from state i to state j.

• IntegerOPDF is the state emission probability distribution. The syntax depends on the type of distribution. The example given above describes integer distributions: it begins with the IntegerOPDF keyword, followed by an ordered list of probabilities between brackets. In the example above, if the first probability relates to the integer "0" and the second to "1", then the probability that the first state emits "1" equals 0.8.
Like the sequence files, the HMM files carry the extension ".hmm" and are also named after the transitions.
4.2 Class Diagrams
The whole project is composed of 5 packages:
• userInterface contains all the classes related to graphic user interface of GFRS.
• dataCollection holds classes that are responsible for sequence data recording, including training sequences and recognition sequences.
• hmm contains classes that implement interfaces of Jahmm as well as training
related operations.
• recognitionModule contains recognition related classes.
• util implements tools (e.g., sequence segmentation) needed to support different
functions.
Classes from different packages work together to implement the three main functions: recording, training, and recognition. The class diagrams of the three functions are reproduced in Appendix C. The source code of this project is available on GitHub: https://github.com/TengfeiWang/GermanFingerspellingRecognizer.
4.3 User Interface
Figure 4.2 shows the main frame of GFRS. The frame can be divided into three parts: the menu bar on top, the button list area on the left, and the functional area on the right.
Figure 4.2: The main frame of GFRS.
In the functional area, different panels are switched based on button clicks. When the system is launched for the first time, the user can only access the recording panel, while the 2:Train and 3:Recognition buttons are disabled, due to the fact that there is no training data and there are no trained models in the current workspace. Once training data is recorded, the training is activated, and then the recognition.
4.3.1 Data Recording
The recording panel contains two parts, as shown in Figure 4.3: one part is the entrance to the recording procedure, and the other allows the user to access the recorded data.
All transitions that need to be modeled are pre-stored in a hash map called AllTransitions, with the transition name as the key and the end letter of the transition as the value. For example, the A to B transition is stored in the form (AB, B). When the recording is started, the system checks which transitions have already been recorded in the Training Sequences folder and deletes them from AllTransitions; this yields a new hash map UnrecordedTransitions, which contains all the unrecorded transitions. The transitions in UnrecordedTransitions are recorded alphabetically; the user just has to follow the instructions of the pop-up frame in Figure 4.3.
Figure 4.3: A pop-up frame that gives instructions to the user about which transition to perform.
The recording stops when any of the following three conditions is met:

• All transitions in UnrecordedTransitions have been recorded.

• No hands are detected by the Leap Motion.

• The user manually stops the recording by pressing any key on the keyboard.
4.3.2 HMM Training
The user has no access to the training panel unless there is training data in the Training Sequences folder. Training data can be obtained either by starting the recording procedure or by importing it from the Backup folder. The training panel contains three parts, as shown in Figure 4.4: the elements of the currently used feature vector are listed on top, the middle part contains buttons to start the training procedure as well as to access the HMM database, and at the bottom is a monitor which gives the user information about the training process (e.g., how many models have been successfully trained).
Figure 4.4: The training panel.
It is worth pointing out that the feature vector recorded in the data collection phase contains all the features listed in Table 3.1 and is called a full feature vector. All possible feature vectors can be obtained by extracting the corresponding elements from the full feature vector, which ensures that there is no need to collect training data again when we want to change the feature vector.
Before the training, a new frame, shown in Figure 4.5, pops up and asks the user to specify a feature vector and the number of states of the HMMs to be used in the training phase. The user can choose any combination of features via the check boxes; the red lines drawn on the skeletal model in the right part of the frame give an intuitive illustration of what the features measure. The user can also specify in this frame the number of states of the HMMs that are about to be trained; the value is 5 by default. Once the feature vector and the number of states are settled, the system starts to prepare the training sequences by extracting feature vectors from the full ones.
Figure 4.5: A pop-up frame in which the user can configure the feature vector and the number of states of the HMM.
4.3.3 Recognition
Similar to the training, the user can run a recognition only when there are trained HMMs in the HMM Database folder. The models can be obtained either by starting the training procedure described in the last section or by importing them from the Backup folder. When models are imported, the system also obtains the information on which feature vector and which number of states were used to build the models by reading a text file. This text file is generated every time the models in the HMM Database folder are copied to the Backup folder.
The system implements two kinds of recognition: isolated and continuous recognition. The panel for isolated recognition is shown in Figure 4.6. The user clicks the Run Recognition button on top to record a recognition sequence, which is then sent to the recognition pipeline. The computational process, that is, the probability that the sequence was generated by each model in the database, is listed in the middle. The final result shows up at the bottom after the computation is done. The panel for continuous recognition is very similar to that of isolated recognition, except that the computational process part is omitted.
Figure 4.6: The isolated recognition panel.
4.4 Implementation Issues
While implementing the system, we encountered two big challenges. The first appeared when we wanted to record training data: how to obtain useful data from the Leap Motion, while ignoring the redundant information, in an efficient manner. The second challenge was how to continuously recognize transitions during the performance of a signer.
4.4.1 Data Acquisition
To improve the user experience, we prefer to use as few key presses and mouse operations as possible when recording sequences. This means the system has to detect the start and the end of a transition performance automatically, without any interaction between the user and the keyboard. This can be achieved by taking advantage of the Leap Motion's high frame rate. In our system, a training sequence is recorded in the following steps:
1. The signer follows the instructions in the frame in Figure 4.3: he/she puts the dominant hand in the effective area of the Leap Motion with the posture of the start letter.

2. When the timer reaches 0, the Leap Motion returns a frame of data corresponding to the first posture (the hand shape of the start letter) and sends it to the feature extractor; this yields the first feature vector F0 of the sequence. In the meantime, the signer can start to perform the transition to the end letter.
3. Although we have obtained the first feature vector of the sequence, the recording has not started yet. The first feature vector acts as a reference, and the start of the recording is detected based on it. After the timer reaches 0, the Leap Motion starts to return feature vectors extracted from the latest frame every 10 ms. We calculate the difference between the current feature vector $F_c$ and the reference $F_0$. Assuming each feature vector $F_i$ contains D elements from $F_{i_0}$ to $F_{i_{D-1}}$, the difference is calculated by:

\[
| F_0 - F_c | = \sum_{i=0}^{D-1} | F_{0_i} - F_{c_i} |. \tag{4.1}
\]

If the difference is bigger than a given threshold $\theta_1$, the performance has started, and we put $F_c$ as the second feature vector into the sequence; otherwise, we ignore it and wait for the next feature vector.
4. Once the recording has started, we put every feature vector returned by the Leap Motion into the sequence until the end-of-transition criteria are met. The criteria are:

(a) 500 ms have passed since the recording started.

(b) The difference between two successive feature vectors is smaller than a threshold $\theta_2$.
The Leap Motion constantly checks whether hands can be detected in its effective area. If no hands are detected during the recording, the recording stops immediately and the feature vectors already recorded for the sequence are discarded. The two thresholds $\theta_1$ and $\theta_2$ used during the recording are not necessarily equal; usually we set $\theta_1 > \theta_2$, because a bigger $\theta_1$ makes sure that the detection of the start is reliable.
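The following is a minimal sketch of this start/end detection, assuming that nextFeature polls the Leap Motion every 10 ms and that both end-of-transition criteria must hold; the names recordSequence, nextFeature, and l1 are hypothetical.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Supplier;

    // Record one sequence: wait for |F0 - Fc| > theta1 to start, then stop once
    // at least 500 ms have passed and the hand is still (difference < theta2).
    static List<double[]> recordSequence(Supplier<double[]> nextFeature,
            double theta1, double theta2) {
        List<double[]> seq = new ArrayList<>();
        double[] ref = nextFeature.get();              // F0, the reference vector
        seq.add(ref);
        double[] cur = nextFeature.get();
        while (l1(ref, cur) <= theta1)                 // performance not started yet
            cur = nextFeature.get();
        long start = System.currentTimeMillis();
        seq.add(cur);
        double[] prev = cur;
        while (true) {
            cur = nextFeature.get();
            seq.add(cur);
            boolean still = l1(prev, cur) < theta2;                        // criterion (b)
            boolean longEnough = System.currentTimeMillis() - start >= 500; // criterion (a)
            if (still && longEnough) break;
            prev = cur;
        }
        return seq;
    }

    // |F0 - Fc| as in equation (4.1): the sum of element-wise absolute differences.
    static double l1(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += Math.abs(a[i] - b[i]);
        return d;
    }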
4.4.2 Continuous Recognition
One main advantage of modeling letter to letter transitions is that it makes real-time continuous recognition easier. The end of one transition is exactly the start of the next, which means we do not need to deal with the interaction between two transitions. In our system, the time needed to recognize a transition against a database of 100 models is less than the average time needed to record a transition. Therefore, there is no need to chain the transition models into a word-level HMM, as is done in other literature [9, 22].
The challenge here becomes how to detect the start or the end of a transition during the continuous performance of a signer. The problem can be solved using the same method as for training sequence segmentation in section 3.2.1 of Chapter 3, that is, entropy estimation. We say the end of a transition is detected when the entropy of the feature vectors in a window is smaller than a threshold $\theta_3$, because there is always an unintentional hesitation of the signer when a transition is completed, which results in the feature vectors returned in that short period of time being very close to each other (low entropy). We run recording, classifying, and recognition in parallel, as shown in Figure 4.7; the three threads run at the same time and keep communicating with each other.
Figure 4.7: Three threads for continuous recognition.
Thread 1 is responsible for sequence recording. All feature vectors returned by the Leap Motion are put into a list. The thread is killed when no hands are detected by the Leap Motion or when the recording is stopped by the signer.

Thread 2 keeps calculating the entropy of a set of feature vectors (within a window) from the already recorded sequence. If the end of a transition is detected, the sequence corresponding to the transition is put into a queue, and the detection of the next transition point continues.

Thread 3 keeps checking whether there are unrecognized sequences in the queue. For the first sequence in the queue, an isolated recognition based on all the models in the HMM Database folder is run. After that, each recognition is based on at most 30 models, because the start letter of the next transition is known and there are only 30 possible transitions from one letter to the others. Thread 3 terminates when thread 1 has been killed and there are no unrecognized sequences left in the queue.
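A structural sketch of this three-thread pipeline is given below, with the thread bodies elided; the class and method names are hypothetical, and a BlockingQueue is assumed to carry the segmented sequences from thread 2 to thread 3.

    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class ContinuousRecognizer {
        // Segmented sequences waiting to be recognized (thread 2 -> thread 3).
        private final BlockingQueue<List<double[]>> pending = new LinkedBlockingQueue<>();

        void start() {
            new Thread(this::recordLoop).start();     // thread 1: recording
            new Thread(this::segmentLoop).start();    // thread 2: classifying
            new Thread(this::recognizeLoop).start();  // thread 3: recognition
        }

        private void recordLoop() {
            // Append every feature vector from the Leap Motion to a shared list;
            // exit when no hands are detected or the signer stops the recording.
        }

        private void segmentLoop() {
            // Slide a window over the shared list; when its entropy drops below
            // theta3, cut out the finished transition and pending.add(sequence).
        }

        private void recognizeLoop() {
            try {
                while (true) {
                    List<double[]> seq = pending.take(); // blocks until a sequence arrives
                    // Run an isolated recognition on seq; after the first result,
                    // restrict the search to the <= 30 transitions starting with
                    // the previously recognized end letter.
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();      // terminated with thread 1
            }
        }
    }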
Chapter 5
Experimental Results
This chapter presents a partial evaluation of our German finger spelling recognition system (GFRS) and is divided into three sections: the first section describes the experimental environment and the acquisition of the experimental data, the second section gives details on how the experiments were conducted as well as their results, and the last section presents a discussion of the experimental results.
5.1 Preparation
We want to evaluate our system in terms of recognition accuracy by experimenting on the 100 most frequently used transitions in German word formation, as listed in Table B.1. The experiments are conducted on both isolated transition recognition and word-level continuous recognition. However, our focus is on isolated recognition, due to the fact that continuous recognition is implemented by simply running consecutive isolated recognitions. For each transition in the list, we run an isolated recognition; the performance of the system is determined by the number of transitions that are correctly recognized.
As discussed in section 4.3.2 of Chapter 4, a sequence of full feature vectors (containing all the features listed in Table 3.1) is returned by the Leap Motion during the performance of a transition. Furthermore, before each training, the user can specify a new feature vector whose elements are a subset of the full feature vector, as well as the number of states of the HMMs that will be used in the training. The elements of the new feature vector can be extracted from the full ones, which ensures that the user does not have to record the training data again when a change of the feature vector or of the number of states is needed. Therefore, in our experiments, the training data for the 100 transitions is recorded once, before the experiments.
All training data for the 100 transitions was recorded by one person1 who is familiar with the system and with German finger spelling. To eliminate the influence of other applications running on the computer, all irrelevant processes were killed before the recording. The training data for each transition contains 10 training sequences. During the recording, the temperature of the Leap Motion can become quite high after running for a while, which can affect its tracking performance. To acquire more reliable data, we let the signer take a break after the training data for every 5 transitions has been recorded, so that the Leap Motion can cool down. In addition, we calibrate the device and clean its surface glass regularly. The training data is stored in the Training Sequences folder of the workspace.
5.2 Experiments
There are many potential factors that may affect the performance of the system:

• the dimension and elements of the feature vector,

• the number of states of the HMM,

• the size of the HMM database (number of transitions),

• the size of the training data (number of training sequences) for each model.

Since the training sequences are pre-recorded, we only experiment on the first three factors to find an optimal configuration for our system.
5.2.1 Isolated Recognition
For convenience, instead of running one isolated recognition right after each recognition sequence is recorded, we pre-record exactly one recognition sequence for each transition in the transition list, so that all 100 isolated recognitions can be run by clicking a single button. The recognition sequences are recorded by the same person who recorded the training data and are stored in the Test Sequences folder. The evaluation can be run once the recognition sequences and the HMM database are ready. The recognition sequences are stored in the Test Sequences folder and the HMM database in the HMM Database folder; the files in the two folders should match in number and names, except for the extension.

The files in the HMM database are obtained through the training pipeline using the pre-recorded training data in the Training Sequences folder. Because the user can specify the feature vector and the number of states of the HMM before training, all three factors mentioned above are embedded in the HMM database.

1We are dealing with German finger spelling; the person only needs to learn how to perform the 31 signs of the German manual alphabet. In our case, the ability to use the system is more important. Therefore, the person here is not a Deaf signer.
When the evaluation starts, the system takes each sequence file in the Test Sequences folder and runs an isolated recognition based on the HMM database until all sequences are recognized. For example, a file named "AB.seq" is put into the recognizer; if the probability that this sequence was generated by the model "AB.hmm" in the database is the highest, we say that this sequence has been successfully recognized. The final recognition accuracy is the number of successfully recognized sequences divided by the total number of sequences.
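A minimal sketch of this batch evaluation, reusing the hypothetical recognize method sketched in section 3.3; the sequence and model names are assumed to match as described above.

    import be.ac.ulg.montefiore.run.jahmm.Hmm;
    import be.ac.ulg.montefiore.run.jahmm.ObservationVector;
    import java.util.List;
    import java.util.Map;

    // A test sequence counts as correct when the best-scoring model carries
    // the same transition name ("AB" for "AB.seq" vs. "AB.hmm").
    static double evaluate(Map<String, Hmm<ObservationVector>> database,
            Map<String, List<ObservationVector>> testSequences) {
        int correct = 0;
        for (Map.Entry<String, List<ObservationVector>> t : testSequences.entrySet())
            if (t.getKey().equals(recognize(database, t.getValue())))
                correct++;
        return (double) correct / testSequences.size();
    }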
5.2.1.1 Experiment on Feature Vector
We ran several tests on different feature vectors. As discussed in Chapter 2, we have 15 candidate features that can be used to form the feature vector, and the user can choose any subset of them before training. Features 12, 13, 14, and 15 from categories 2 and 3 are essential for recognizing transitions with hand rotation or movement, and should therefore be included in all feature vectors. Consequently, the feature vector is mainly determined by the features chosen from category 1. We choose different combinations of category 1 features that we believe can represent the hand shape. If each element of the feature vector is represented by its corresponding number in Table 3.1, the feature vectors can be represented by:
Feature Vector Features
A 1, 2, 3, 4, 5, 12, 13, 14, 15
B 1, 2, 3, 4, 5, 10, 12, 13, 14, 15
C 6, 7, 8, 9, 10, 12, 13, 14, 15
D 1, 2, 3, 8, 9, 10, 12, 13, 14, 15
E 1, 2, 6, 7, 8, 9, 10, 12, 13, 14, 15
F 1, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
G 1, 2, 3, 6, 7, 8, 9, 10, 12, 13, 14, 15
H 1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15
I 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15
Table 5.1: 9 feature vectors used in the experiments.
To make the experimental results convincing, for the different feature vectors we use HMMs with the same topology (i.e., 3-state left-right HMMs) to train each transition model, based on the same training data recorded in the preparation phase, and we also use the same recognition sequences to run the tests. During each test, we measure the recognition accuracy, the time needed to train the models, and the time needed to run an isolated recognition. The results are shown in Table 5.2.
No. FV DIM NS T1 T2 Accuracy DS
1 A 9 3 279955 130.6 58% 100
2 B 10 3 322573 144.4 68% 100
3 C 9 3 280055 129.6 71% 100
4 D 10 3 306726 153.3 73% 100
5 E 11 3 338727 180.2 78% 100
6 F 11 3 337732 182.2 66% 100
7 G 12 3 369972 217.0 80% 100
8 H 13 3 401978 248.9 76% 100
9 I 14 3 465998 300.2 75% 100
Table 5.2: Experimental results on different feature vectors. "FV", the feature vector used. "DIM", the dimension of the feature vector. "NS", the number of states of the HMM. "T1", the time (ms) needed to train the 100 models. "T2", the average time (ms) needed to run an isolated recognition. "DS", the size of the HMM database the experiment is based on.
Both feature vectors "A" and "B" use the distances of the finger tips to the hand center to represent a static hand shape; the only difference between them is that "B" has an extra element that captures the number of extended fingers in a frame. The use of the feature "Extended Fingers Count" improves the recognition accuracy by 10% in experiment 2, which makes us believe that it is an important feature that should be used in the subsequent tests. In experiment 3, instead of using the distances to the hand center, we use the distances between adjacent finger tips. Although the dimension decreases by 1, the accuracy improves by 3%, which means this kind of representation contains more useful information. In the following experiments, we use different combinations of the two kinds of representations; the highest accuracy, 80%, is reached with feature vector "G". We also test the feature "Finger Openness" in experiment 6 by replacing the feature "Index to Hand" from experiment 5 with "Index Finger Openness"; it turns out that the accuracy drops by 12%, which indicates that this so-called high level feature does not work in our system.
In theory, the higher the dimension of the feature vector, the higher the recognition accuracy. This holds when the dimension of the feature vector is low, as can be seen by comparing experiments 5 and 7. However, when the dimension exceeds some threshold, the accuracy begins to drop, as shown in the results of experiments 8 and 9. In section 3.3 of Chapter 3, we discussed the numerical underflow problem and pointed out that the forward-backward algorithm requires $N(N+1)(T-1)+N$ multiplications, where T is the length of the recognition sequence and N the number of states of the HMM. It would seem that the complexity has nothing to do with the dimension of the feature vector. However, this formula holds only when the emission probability distribution is discrete. In our system, the calculations involved in finding the probability that a mixture of multivariate Gaussian distributions generates a specific feature vector also need to be taken into account. This explains why we encounter the numerical underflow problem and observe a drop in accuracy when the dimension of the feature vector increases.
In terms of computational time, we are not surprised to see that both the time needed for training and the time needed for an isolated recognition increase as the dimension of the feature vector increases. As a practical application, we care more about the isolated recognition time because it affects the speed of continuous recognition. It can be seen from the results that the average time needed to run an isolated recognition when using feature vector "G" is 217 ms, which is much smaller than the time (600 ms on average) needed to record a recognition sequence.
5.2.1.2 Experiment on Number of States
In the previous section, we discussed the tests on feature vectors and found an optimal feature vector with low dimension and relatively high recognition accuracy. Considering that all those tests were based on 3-state HMMs, a natural question arises: how does the number of states of the HMM affect the recognition accuracy? To answer this question, we carried out several experiments with the number of states ranging from 2 to 4, using the two best-performing feature vectors from the last experiment, "E" and "G". The results are shown in Table 5.3.
No. FV DIM NS T1 T2 Accuracy DS
10 E 11 2 190216 123.5 82% 100
5 E 11 3 338727 180.2 78% 100
11 E 11 4 524297 242.3 75% 100
12 G 12 2 208978 146.7 78% 100
7 G 12 3 369972 217.0 80% 100
13 G 12 4 612685 290.8 69% 100
Table 5.3: Experimental results on different numbers of states when using feature vectors "E" and "G".
From the results we can see that the accuracy for feature vector "G" reaches its peak at 3 states and then starts to drop, and the same happens to feature vector "E" at an even smaller number of states, 2. The reason the accuracy does not increase further is probably that the training data is not sufficient to train a model with a large number of states. In our experiments, the training data for a model is fixed; if we build a model with a bigger state space, the number of feature vectors used to train each state decreases correspondingly. The model is compromised when there is not enough training data for each state.
5.2.1.3 Experiment on Size of HMM Database
When the size of the HMM database is 100, the highest recognition accuracy is 82%, obtained when using feature vector "E" and a 2-state HMM. This accuracy is still not high enough to support continuous recognition. We believe this is mainly because of the large size of the HMM database: a large database increases the probability that a transition has one or more similar transitions, which significantly increases the risk that a recognition sequence is recognized as one of its similar transitions. We try to narrow down the size of the HMM database to figure out whether the recognition accuracy can be improved to a relatively high level (e.g., 95%).

In this experiment, we first reduce the size of the HMM database to 50 and then to 30. For each database size, we randomly select 5 subsets from the list of 100 transitions. The final accuracy is the average over the results of the 5 subsets. Table 5.4 shows the results when testing on different database sizes using different HMM configurations.
No. FV DIM NS T1 T2 Accuracy DS
14 E 11 2 98489 32.21 86.34% 50
15 E 11 2 58708 12.68 89.96% 30
16 G 12 3 187020 55.01 85.52% 50
17 G 12 3 113406 21.53 89.91% 30
Table 5.4: Experimental results on different sizes of the HMM database.
Experiments 10 and 14 both use feature vector "E" and 2-state HMMs; by comparing the results we can see that the recognition accuracy improves by more than 4% when the size of the database is reduced to 50, and the accuracy is further boosted to nearly 90% when a 30-model database is used. We obtain a similar result when using feature vector "G" and 3-state HMMs.
5.2.2 Continuous Recognition
Since the isolated recognition accuracy is promising when the size of the HMM database is relatively small, we decided to test continuous recognition on a small number of transitions. In real-world scenarios, finger spelling is mainly used to represent names that are not defined in German Sign Language. Therefore, we use the transitions contained in the 10 most common surnames in Germany to build our HMM database:

Müller, Schmidt, Schneider, Fischer, Weber,
Meyer, Wagner, Becker, Schulz, Hoffmann
The transitions extracted from the names are shown in Table 5.5.
MU UL LE ER SCHN NE ID DE
FI ISCH SCHE WE EB BE ME EY
YE WA AG GN EC CK KE SCHU
UL LZ HO OF FM MA AN -
Table 5.5: 31 letter to letter transitions in the ten names.
For the transitions that are not in the list of 100 transitions (e.g., the E to Y transition), we record 10 training sequences, as in the preparation phase. Each transition model in the table is trained using feature vector "G" and a 3-state HMM. We run 20 tests for each name; the number of correct recognitions is listed in Table 5.6.
Name Correct NO. Total NO. Accuracy
Müller 13 20 65%
Schmidt 16 20 80%
Schneider 12 20 60%
Fischer 15 20 75%
Weber 12 20 60%
Meyer 13 20 65%
Wagner 14 20 70%
Becker 11 20 55%
Schulz 17 20 85%
Hoffmann 13 20 65%
Table 5.6: Experimental results on continuous recognition.
The name with the highest recognition accuracy is "Schulz", because the transitions composing it are unique and cannot be confused with other transitions. On the contrary, names containing similar transitions are easily misrecognized; for example, "Schneider" is usually recognized as "Schmeider", because the postures for the letters M and N are very similar from the Leap Motion's perspective. The overall accuracy for continuous recognition is 68%, which is much lower than for isolated recognition; however, we believe the accuracy can be improved when Bayesian inference [39] (taking the probability that one letter appears right after another into consideration) is applied.
5.3 Discussion
Through the experiments we have discovered that all the factors mentioned above, such as the dimension and elements of the feature vector, the number of states of the HMM, and the size of the HMM database, have a big influence on the performance of the system. Another factor that cannot be neglected is the input device. During the experiments we monitored the skeletal hand models in the Leap Motion Diagnostic Visualizer and found that the device's performance varies strongly under different circumstances.
The Leap Motion has high precision on a static hand with separated fingers, but the data for a static hand with closed fingers are not satisfying. For example, the posture for the letter F can be perfectly captured, while the posture for the letter E is usually misrepresented by the Leap Motion, as shown in Figures 5.1 and 5.2.
The data for moving hands and fingers are always noisy, which is bad news because all the transitions contain finger movement. The problems are caused by lost or wrong finger tracking (e.g., one fully extended finger recognized as two fingers in Figure 5.3), temporary finger occlusions, and an inconsistent sampling frequency [7].
Figure 5.1: Sign of letter F in the Leap Motion visualizer.
Figure 5.2: Sign of letter E is misrepresented.
Figure 5.3: One extended finger recognized as two.
Many letters in the manual alphabet have very similar postures (e.g., the postures of A, S, M, and N in Figure 1.1), and the Leap Motion has difficulty in finding the differences among them. Things become even worse when the transition itself is ambiguous; for example, the M to N transition only needs a slight change in thumb position. Therefore, it is recommended to reduce the size of the database and to include transitions whose fingers are visible and separated from the sensor's perspective.
Chapter 6
Conclusion and Further Work
In this thesis, we introduced the German finger spelling recognition system (GFRS), a system which is capable of recognizing continuous German finger spelling in real time. The system uses the Leap Motion Controller to collect frames of data representing the evolution of the user's hand pose over time. In terms of letter representation, instead of modeling a static posture for each letter, letter to letter transitions are modeled. A transition can be seen as a dynamic gesture and is modeled using a hidden Markov model, a statistical model that can handle the variation between different signers.
By modelling letter to letter transitions, we can in theory recognize any finger-spelled word without being limited to a dictionary, because the transition is the smallest unit composing a word. However, this also extends the number of models from 31 to 930. We ignore the transitions that are not frequently used in a German corpus and decrease the number to 100, which covers 80% of the letter to letter transitions in German word formation.
The evaluation is conducted on both isolated recognition and word-level continuous recognition. For isolated recognition, the tests cover three aspects: the feature vector, the number of states of the HMM, and the size of the HMM database. When testing on the 100 transitions, the results show that the highest recognition accuracy, 82%, is obtained using a 2-state left-right HMM (Table 5.3) and feature vector "E" (Table 5.1). This result does not seem promising enough for continuous recognition, most probably due to the big database size. After narrowing down the size of the database to 30 transitions, the system can achieve an accuracy of 89.96%. When it comes to the task of continuous recognition, we focus on the fact that finger spelling is mainly used to spell names, and run the tests on the transitions extracted from some of the most popular surnames in Germany. It turns out that the system can recognize the 10 names with an accuracy of 68%, limited by the noisy data returned from the Leap Motion.
Although the Leap Motion might not be the best choice for dynamic German finger spelling recognition, it has shown its value in other HCI applications, including isolated hand shape recognition [16, 18, 20] and computer games [40]. In fact, our system is not restricted to finger spelling recognition; it can be used or extended to recognize many kinds of dynamic gestures in general, as long as a proper feature vector is selected.
As further development, the system could be improved in many aspects. Since the data returned by the Leap Motion is noisy, proper pre-processing of the training data might help to improve the system's performance. We could also combine two or more devices, processing the data returned from the different devices together using data fusion methods to obtain more reliable information. Further experiments can be conducted on a larger training data set to avoid the lack-of-training-data problem when using HMMs with a big state space. Training data can also be recorded from different signers to build a user-independent system, which is more practical for the real world.
Appendix A
JSON Structure from the Leap
JSON Frame data:
{
  "currentFrameRate": 105.122,
  "gestures": [],
  "hands": [ {
    "direction": [ -0.0224476, 0.0632427, -0.997746 ],
    "id": 173,
    "palmNormal": [ -0.19773, -0.978564, -0.0575783 ],
    "palmPosition": [ -6.47867, 124.943, 12.5812 ],
    "palmVelocity": [ -1.86812, 1.93451, -0.248464 ],
    "r": [ [ 0.94694, -0.124589, -0.29628 ],
           [ 0.005127, 0.927551, -0.373661 ],
           [ 0.321369, 0.352315, 0.878974 ] ],
    "s": 1.21237,
    "sphereCenter": [ 5.55477, 211.896, -32.5654 ],
    "sphereRadius": 102.235,
    "stabilizedPalmPosition": [ -3.9295, 130.376, 11.5911 ],
    "t": [ -120.243, 15.6378, -37.449 ],
    "timeVisible": 8.41104
  } ],
  "id": 299775,
  "interactionBox": { "center": [ 0, 106.428, 0 ], "size": [ 125.185, 125.185, 78.6246 ] },
  "pointables": [ {
    "direction": [ -0.551342, -0.157666, -0.819246 ], "handId": 173, "id": 1730,
    "length": 46.3759, "stabilizedTipPosition": [ -76.1444, 116.216, -8.54233 ],
    "timeVisible": 8.41104, "tipPosition": [ -79.3754, 109.754, -6.69816 ],
    "tipVelocity": [ -1.17475, -1.49724, -0.461121 ], "tool": false,
    "touchDistance": 0.285742, "touchZone": "hovering", "width": 18.0195
  }, {
    "direction": [ -0.167748, -0.0766816, -0.982843 ], "handId": 173, "id": 1731,
    "length": 52.33, "stabilizedTipPosition": [ -35.9563, 135.028, -78.7522 ],
    "timeVisible": 8.41104, "tipPosition": [ -39.333, 131.466, -78.8971 ],
    "tipVelocity": [ -3.68745, -2.67284, -0.0929271 ], "tool": false,
    "touchDistance": 0.260135, "touchZone": "hovering", "width": 17.2122
  }, {
    "direction": [ 0.0997289, -0.100053, -0.989971 ], "handId": 173, "id": 1732,
    "length": 59.6259, "stabilizedTipPosition": [ 3.4735, 131.115, -88.7222 ],
    "timeVisible": 8.41104, "tipPosition": [ 0.964441, 127.355, -89.2757 ],
    "tipVelocity": [ -2.40379, 1.80018, -0.512016 ], "tool": false,
    "touchDistance": 0.257204, "touchZone": "hovering", "width": 16.9047
  }, {
    "direction": [ 0.160035, -0.0907604, -0.98293 ], "handId": 173, "id": 1733,
    "length": 57.3319, "stabilizedTipPosition": [ 26.6008, 128.212, -79.3807 ],
    "timeVisible": 8.41104, "tipPosition": [ 24.5168, 123.728, -79.6556 ],
    "tipVelocity": [ -1.59937, -0.517898, -0.349728 ], "tool": false,
    "touchDistance": 0.263316, "touchZone": "hovering", "width": 16.0859
  }, {
    "direction": [ 0.331746, -0.186307, -0.924789 ], "handId": 173, "id": 1734,
    "length": 44.9471, "stabilizedTipPosition": [ 49.5229, 117.765, -54.3087 ],
    "timeVisible": 8.41104, "tipPosition": [ 48.5701, 110.975, -53.3266 ],
    "tipVelocity": [ -1.2279, -2.38346, 0.488928 ], "tool": false,
    "touchDistance": 0.272498, "touchZone": "hovering", "width": 14.2888
  } ],
  "r": [ [ 0.94694, -0.124589, -0.29628 ],
         [ 0.005127, 0.927551, -0.373661 ],
         [ 0.321369, 0.352315, 0.878974 ] ],
  "s": 1.21237,
  "t": [ -120.243, 15.6378, -37.449 ],
  "timestamp": 98926509
}
Appendix B
The 100 Transitions
Token Prob. Token Prob. Token Prob. Token Prob.
EN 4.0568% LE 0.7881% RT 0.4706% MA 0.3277%
ER 3.7674% SS 0.7817% ZU 0.4706% UF 0.3148%
CH 2.6782% NS 0.7635% LL 0.4658% TR 0.3140%
DE 2.1959% IS 0.7591% AR 0.4615% EU 0.3098%
EI 2.1104% EL 0.7050% OR 0.4609% ZE 0.3082%
ND 2.0152% RA 0.6956% IG 0.4538% TU 0.3064%
TE 1.8183% LI 0.6548% WI 0.4477% LT 0.3047%
IE 1.6310% SI 0.6475% HR 0.4429% SO 0.3029%
IN 1.6138% RD 0.6184% ED 0.4326% TD 0.2947%
UN 1.6036% AL 0.5959% ET 0.4321% SA 0.2921%
GE 1.5880% TI 0.5645% NN 0.4190% NK 0.2759%
ES 1.3555% NA 0.5530% VE 0.4188% AB 0.2742%
ST 1.2160% WE 0.5496% LA 0.4108% OL 0.2708%
BE 1.1873% NT 0.5368% TS 0.4098% NZ 0.2686%
NE 1.1544% NI 0.5329% EH 0.4077% RB 0.2667%
RE 1.1250% DA 0.5281% MI 0.3995% HI 0.2633%
NG 1.1226% RS 0.5234% TA 0.3978% AC 0.2626%
HE 1.1026% AS 0.5233% RU 0.3870% RN 0.2624%
SE 1.0152% HA 0.5194% AT 0.3707% FE 0.2598%
IC 0.9892% HT 0.5194% EB 0.3681% TZ 0.2555%
AN 0.9604% ME 0.5166% NU 0.3662% NW 0.2552%
SC 0.9070% ON 0.5159% VO 0.3580% RG 0.2540%
DI 0.9024% RI 0.5003% EG 0.3559% IR 0.2518%
AU 0.8383% US 0.4977% UR 0.3559% RK 0.2510%
IT 0.8306% EM 0.4804% KE 0.3327% IL 0.2491%
Table B.1: The 100 transitions with their occurrence probabilities.
Appendix C
Class Diagrams
Figure C.1: Class diagram of recognition.
Figure C.2: Class diagram of training.
Figure C.3: Class diagram of data recording.
Bibliography
[1] Carol Padden and Tom Humphries. Deaf in America: voices from a culture. Har-
vard University Press, Cambridge, Mass., 1988. ISBN 0674194233 9780674194236
0674194241 9780674194243.
[2] Matt Huenerfauth and Vicki L. Hanson. Sign Language in the Interface: Access for Deaf Signers.
[3] Carol Padden and Claire Ramsey. American sign language and reading ability in
deaf children. Language acquisition by eye, 1:65–89, 2000.
[4] Maryam Khademi, Hossein Mousavi Hondori, Alison McKenzie, Lucy Dodakian,
Cristina Videira Lopes, and Steven C Cramer. Free-hand interaction with Leap Mo-
tion Controller for stroke rehabilitation. In CHI’14 Extended Abstracts on Human
Factors in Computing Systems, pages 1663–1668. ACM, 2014.
[5] Leap Motion and Kinect used in computer vision. URL http://artandtech.aalto.fi/?page_id=1323.
[6] Tilak Dutta. Evaluation of the Kinect sensor for 3-d kinematic measurement in the
workplace. Applied ergonomics, 43(4):645–649, 2012.
[7] Jože Guna, Grega Jakus, Matevž Pogačnik, Sašo Tomažič, and Jaka Sodnik. An Analysis of the Precision and Reliability of the Leap Motion Sensor and Its Suitability for Static and Dynamic Tracking. Sensors, 14(2):3702–3720, 2014.
[8] The official website of the Leap Motion. https://www.leapmotion.com/.
[9] P. Goh. Automatic recognition of Auslan finger spelling using hidden Markov models. Undergraduate thesis, 2005.
[10] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning,
20(3):273–297, 1995.
[11] Roberto Brunelli. Template matching techniques in computer vision: theory and
practice. John Wiley & Sons, 2009.
[12] Brian D. Ripley. Pattern recognition and neural networks. Cambridge university
press, 1996.
[13] Lawrence R Rabiner. A tutorial on hidden markov models and selected applications
in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[14] Robert Y Wang and Jovan Popovic. Real-time hand-tracking with a color glove.
ACM Transactions on Graphics (TOG), 28(3):63, 2009.
[15] Official website of the Microsoft Kinect. https://www.microsoft.com/en-us/kinectforwindows/.
[16] Tatiana Schmidt, Felipe P Araujo, Gisele L Pappa, and Erickson R Nascimento.
Real-time hand gesture recognition based on sparse positional data.
[17] Andy Liaw and Matthew Wiener. Classification and regression by randomforest. R
news, 2(3):18–22, 2002.
[18] G. Marin, F. Dominio, and P. Zanuttigh. Hand gesture recognition with Leap Motion and Kinect devices. In IEEE International Conference on Image Processing (ICIP), pages 1565–1569, 2014.
[19] James Davis and Mubarak Shah. Visual gesture recognition. In Vision, Image and
Signal Processing, IEE Proceedings-, volume 141, pages 101–106. IET, 1994.
[20] Michal Nowicki, Olgierd Pilarczyk, Jakub Wasikowski, and Katarzyna Zjawin. Gesture recognition library for Leap Motion Controller.
[21] James Gosling. The Java language specification. Addison-Wesley Professional,
2000.
[22] Susanna Ricco and Carlo Tomasi. Fingerspelling Recognition through Classification of Letter-to-Letter Transitions. In Computer Vision – ACCV 2009, pages 214–225, 2010.
[23] Hidden Markov model introduction from Wikipedia. http://en.wikipedia.org/wiki/Hidden_Markov_model.
[24] G David Forney Jr. The viterbi algorithm. Proceedings of the IEEE, 61(3):268–278,
1973.
[25] Douglas Reynolds. Gaussian mixture models. In Encyclopedia of Biometrics, pages
659–663. Springer, 2009.
[26] W. C. Stokoe. Sign language structure: An outline of the visual communication
system of the American deaf. Studies in linguistics, Occasional papers, 8, 1960.
[27] S. Liddell and R. Johnson. American Sign Language: The Phonological Base. Sign
Language Studies, 64:195–277, 1989.
[28] S. Liddell. Structures for representing handshape and local movement at the phonetic level. In Theoretical Issues in Sign Language Research. University of Chicago Press, 1990.
[29] Sign Language IPA. URL http://dedalvs.free.fr/slipa.html#handshape.
[30] Christian Vogler and Dimitris Metaxas. A Framework for Recognizing the Simultaneous Aspects of American Sign Language. Computer Vision and Image Understanding, 81:358–384, 2001.
[31] Christian Vogler and Dimitris Metaxas. Handshapes and movements: Multiple-
channel asl recognition. In Lecture Notes in Computer Science, pages 247–258.
Springer, 2004.
[32] W Sandler. Representing handshapes. International Review of Sign Linguistics, 1:
115–158, 1996.
[33] Paul S Bradley and Usama M Fayyad. Refining initial points for k-means clustering.
In ICML, volume 98, pages 91–99. Citeseer, 1998.
[34] Jan Beirlant, Edward J Dudewicz, Laszlo Gyorfi, and Edward C Van der
Meulen. Nonparametric entropy estimation: An overview. International Journal of
Mathematical and Statistical Sciences, 6(1):17–39, 1997.
[35] An introduction to entropy estimation from Wikipedia. URL https://en.wikipedia.org/wiki/Entropy_estimation.
[36] D. B. Paul. Speech Recognition Using Hidden Markov Models. The Lincoln Laboratory Journal, Volume 3, Number 1, 1990.
[37] The website where we obtained the German corpus. http://korpora.zim.uni-due.de/Leitseite/.
[38] Java implementation of Hidden Markov Model (HMM) related algorithms. https://code.google.com/p/jahmm/.
[39] David D Lewis. Naive (bayes) at forty: The independence assumption in informa-
tion retrieval. In Machine learning: ECML-98, pages 4–15. Springer, 1998.
[40] The official app store, where many game applications controlled by the Leap Motion can be found. https://apps.leapmotion.com/.