SHORT PAPER
Signer independent isolated Italian sign recognition based on hidden Markov models

Marco Fagiani · Emanuele Principi · Stefano Squartini · Francesco Piazza

Department of Information Engineering, Università Politecnica delle Marche, Via Brecce Bianche 1, 60131 Ancona, Italy
e-mail: [email protected]

Received: 5 February 2013 / Accepted: 22 July 2014
© Springer-Verlag London 2014
Pattern Anal Applic, DOI 10.1007/s10044-014-0400-z
Abstract Sign languages represent the most natural way for deaf and hard of hearing people to communicate. However, there are often barriers between people using these languages and hearing people, who typically express themselves by means of oral languages. To facilitate the social inclusion of deaf minorities in everyday life, technology can play an important role. Indeed, many attempts have recently been made by the scientific community to develop automatic translation tools. Unfortunately, not many solutions are actually available for the Italian Sign Language (Lingua Italiana dei Segni—LIS) case study, especially for what concerns the recognition task. In this paper, the authors address this lack, in particular for the signer-independent case study, i.e., when the signers in the testing set are not included in the training set. From this perspective, the proposed algorithm represents the first real attempt in the LIS case. The automatic recognizer is based on Hidden Markov Models (HMMs), and video features have been extracted using the OpenCV open source library. The effectiveness of the HMM system is validated by a comparative evaluation with a Support Vector Machine approach. The video material used to train the recognizer and test its performance consists of a database that the authors have deliberately created by involving 10 signers and 147 isolated-sign videos for each signer. The database is publicly available. Computer simulations have shown the effectiveness of the adopted methodology, with recognition accuracies comparable to those obtained by the automatic tools developed for other sign languages.
Keywords Italian sign language (LIS) · Automatic sign recognition · Video feature extraction · Hidden Markov models
1 Introduction
Deaf and hard of hearing people widely use sign languages to communicate [32]. These languages have arisen spontaneously and evolved rapidly, all over the globe. As widely accepted by the scientific community, they are characterized by a relevant relationship with the geographic location where they are used and by a strong independence from the oral languages spoken in the same region. It follows that different sign languages have arisen and have been officially recognized in most countries, like the United States, the United Kingdom, Germany, and China.
As in the oral case, a large variety of dialects can be found within the area of influence of the "official" sign language, and this can give rise to misunderstandings among people living in different places in that area. Besides this aspect, the communication between deaf and hearing people is most of the time troublesome, because people using oral languages are typically not well disposed to learn a sign language, even while recognizing that this could represent an effective way to enrich their communication capabilities. This inevitably induces the establishment of social inclusion barriers, which the entire population is called to face and hopefully overcome. The impact of these problems increases considering that interpreters, with their indispensable work, cannot always be present and that less than 0.1 % of the total population belongs to the deaf community.
Scientists and technicians from all over the world have recently addressed the issue, and several solutions have been proposed to facilitate the use of sign languages by deaf people and also the interaction between them and hearing people in different communication scenarios. Many automatic translation systems (Sign-to-Sign, Sign-to-Oral, or Oral-to-Sign) have been developed, taking the specificity of the examined languages into account. Typically, these systems comprise two separate tasks: recognition and synthesis. In this work, the focus is specifically on the former, and a brief state of the art of existing methodologies for automatic recognition of sign languages and related databases is provided in the following, first from an international perspective and then looking closer at the Italian context.
In the international scenario many research projects can
be found. Starner and Pentland [35] were the first to propose a system based on Hidden Markov Models (HMMs) for sign language recognition. The approach is capable of performing real-time sentence recognition of American Sign Language using only hand information such as position, angle of the axis of least inertia, and eccentricity of the bounding ellipse. In a later study [36], the authors presented a real-time continuous sentence-level recognizer based on two different camera viewpoints: desk-based and cap-mounted. In [2], a sign recognition system in signer-independent and signer-dependent conditions has been investigated. Both hand-related features, based on a multiple hypotheses tracking approach, and face-related features, based on an active appearance model, are robustly extracted. The classification stage is based on HMMs and is designed for the recognition of both isolated and continuous signs. The work in [37] proposed a so-called product HMM for efficient multi-stream fusion, while [27] discussed the optimal number of states and mixtures to obtain the best accuracy.
SignSpeak [14] is one of the first European funded
projects that tackles the problem of automatic recognition
and translation of continuous sign languages. Among the
works of the project, Dreuw and colleagues in [11–13]
introduced important innovations: they used a tracking
adaptation method to obtain a hand tracking path with
optimized tracking position, and they trained robust models
through visual signer alignments and virtual training
samples.
In [42], Zaki and Shaheen presented a new combination of appearance-based features obtained using Principal Component Analysis (PCA), the kurtosis position, and the Motion Chain Code (MCC). First, they detect the dominant hand using a skin colour detection algorithm and then track it via connected component labelling. After that, PCA provides a measure of the hand configuration and orientation, the kurtosis position is used for measuring the edge which denotes the place of articulation, and the MCC represents the hand movement. Several combinations of the proposed features are tested and evaluated, and an HMM classifier is used to recognize signs. Experiments are performed using the RWTH-BOSTON-50 database of isolated signs [41]. In [25, 28], two approaches based on Support Vector Machines (SVMs) for the recognition of hand gestures alone have been proposed. In the first, a multi-feature classifier combined with spatio-temporal characteristics has been used to realize an automatic recognizer of sign language letters. In the second, a combination of an ad hoc weighted eigenspace Size Function and Hu moments has been used.
Taking a closer look at the databases for international sign languages, in [41], as mentioned above, and in [15], two free automatic sign language databases are presented, both created as subsets of the general database recorded at Boston University¹. The video sequences have been recorded at 30 frames per second (fps) with a resolution of 312 × 242 pixels. Sentences have been executed by three different people, two men and one woman, all dressed differently and with different brightness of the clothes. The first database, RWTH-BOSTON-50, is made of 483 utterances of 50 signs, with each utterance stored in a separate video file to create a video collection of isolated signs. The second corpus, RWTH-BOSTON-104, is made of 201 sentences (161 for training and 40 for testing), and the dictionary contains 104 different signs. Unlike the first one, each video contains a sequence of signs. The SIGNUM corpus [1] is composed both of isolated signs and of sign sequences executed by 25 persons, and recordings have been conducted in controlled conditions to facilitate the video feature extraction. The videos have been recorded with a resolution of 780 × 580 pixels at 30 fps. The dictionary is made of 450 DGS (German Sign Language) signs composing a total of 780 sentences with sign sequences.
Regarding the Italian scenario, Infantino et al. [24] presented a signer-dependent recognition framework applied to the Italian sign language (in Italian, Lingua Italiana dei Segni, LIS). Features are composed of the centroid coordinates of the hands, and classification is based on a self-organizing map neural network. In addition, the framework integrates a common-sense engine that selects the right meaning depending on the context of the whole sentence. To the authors' knowledge, this is the only work that has appeared in the literature so far dealing with Italian sign language recognition. As for LIS databases, an interesting work has been recently presented within the ATLAS² project [4], which describes the development of a trilingual corpus of the Italian sign language. The database will be used to realize a virtual interpreter, which automatically translates an Italian text into LIS, and it is publicly available to the community. However, no information about the number of signers, recording conditions, video properties, and sign dictionary is provided. As stated in [4], it includes many recordings with thousands of signs, but they are executed by single signers, thus representing a limitation for the development of automatic sign recognition systems.

¹ http://www.bu.edu/asllrp/ncslgr.html
² http://www.atlas.polito.it/index.php/en/home
The objective of this work is to propose an automatic recognition system for isolated LIS signs. Differently from [24], the system operates independently of the signer, i.e., it is able to deal with signers not included in the training phase. An HMM approach has been purposely employed, inspired by existing tools for oral languages and also for some international sign languages, as mentioned above. The isolated sign case study has been addressed. Due to the lack of a suitable database to train and evaluate the system, a specific LIS database has been created in collaboration with the local office of the Deaf People National Institute (Ente Nazionale dei Sordi—ENS). The database, namely A3LIS-147, is composed of more than 1,400 signs performed by 10 different signers, and it is publicly available. The proposed recognizer has been partially presented in [17]; here, a more detailed description of the related algorithms and insights on the attainable performance are provided. In particular, the algorithm parameters have been optimized to further improve the recognition performance, and for an additional validation the approach has also been compared to an SVM-based recognizer. In addition, each sign of the corpus has been transcribed into the HamNoSys transcription system, and a thorough discussion on the correlation between the edit distance of the sign transcriptions and the error rates is provided. As pointed out in the following sections, the obtained recognition accuracy (above 48 %) is consistent with the results obtained by alternative techniques for other sign languages, thus proving the effectiveness of the approach.
The paper is organized as follows. Section 2 reviews the main characteristics of the A3LIS-147 database. Section 3 reports the algorithms in the two stages composing the automatic LIS recognition system, i.e., the feature extractor and the HMM-based recognizer, whereas in Sect. 4 the experimental tests are described and the related results are commented. Section 5 presents a comparison between the proposed approach and the SVM method. Section 6 concludes the paper, also providing some highlights on future works.
2 A3LIS-147: Italian sign language database
The creation of a signer-independent automatic sign recognition system requires a suitable corpus for training its parameters and assessing its performance. For the isolated sign recognition task, the corpus should contain videos with many different signs executed by multiple signers. The lack of a well-suited corpus for the Italian sign language led the authors to create a new one from scratch.
The database, A3LIS-147, is freely available³ and has been presented in [16]. It consists of 147 distinct isolated signs executed by 10 different persons (7 males and 3 females). Each signer executed all the 147 signs, for a total of 1,470 recorded videos. The ENS⁴ supported the authors of this work in suitably pre-training the subjects and in choosing a meaningful and unambiguous set of signs to be executed.

³ http://www.a3lab.dii.univpm.it/projects/a3lis
⁴ http://www.ensancona.it
In A3LIS-147, signs are divided into six categories, each related to a different real-life scenario (Table 1): public institute (e.g., "employee"), railway station (e.g., "train"), hospital (e.g., "emergency"), highway (e.g., "traffic"), common life (e.g., "water"), and education (e.g., "school").
Table 1 The proposed real-life scenarios, number of signs per scenario, and vocabulary examples

Scenario          Signs per scenario   Vocabulary examples
Public institute  39                   Employee, municipality, timetable
Railway station   35                   Train, ticket, departure
Hospital          19                   Emergency, doctor, pain
Highway           8                    Traffic, toll booth, delays
Common life       16                   Water, telephone, rent
Education         30                   School, teacher, exam
Fig. 1 The "silence" position
In each video, the person executes a single sign, preceded and followed by the "silence" sign shown in Fig. 1. This sign has been chosen in agreement with the ENS, since it represents a common "rest" position in everyday conversations. Recordings have been supported by a sequence of slides projected in front of the person executing the sign. Each slide shows the text and the picture of the current sign as well as a timer to discriminate the three phases of the execution: silence, sign execution, silence.
Recordings have been performed in controlled conditions: as Fig. 1 shows, the signers wear dark long-sleeved shirts, a green chroma-key background is placed behind them, and uniform lighting is provided by two diffuse lights. Videos have been captured at 25 fps with a resolution of 720 × 576 pixels and stored in Digital Video format. For additional details on A3LIS-147, please refer to [16].
3 Description of the proposed system
The structure of the proposed system is shown in Fig. 2. Similarly to the "isolated word recognition" task in speech recognition, the system performs isolated sign recognition, i.e., it recognizes the single sign present in an input video. The main processing blocks of the system are: Face and Hands Detection (FHD), Feature Extraction (FE), and Sign Recognition (SR).
The FHD block discriminates the regions of interest in
each video frame. The regions are represented by the hands
and the face of a person, since they convey the information
needed to discriminate signs. Features are then extracted in
the FE stage from the pixels belonging to these areas, and
for each input frame a single feature vector is created. The
classification stage is based on HMM. For each sign, an
HMM is created in the training phase using signs executed
by different persons. The SR phase operates by calculating the output probabilities for each model and then selecting the model with the highest score.
3.1 FHD
The detection of the hand and face areas is performed on each input frame using a skin detection algorithm. This operates in the YCbCr colour space, since it gives better performance with respect to the Red Green Blue colour space [19]. A pixel whose coordinates are $(Y, Cb, Cr)$ belongs to the skin region if $Y > 80$, $85 < Cb < 135$, and $135 < Cr < 180$, with $Y, Cb, Cr \in [0, 255]$ [26]. Pixels that satisfy this condition are included in the skin mask image.
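For illustration, a minimal OpenCV sketch of this thresholding follows (assuming 8-bit frames in BGR order, OpenCV's default; note that OpenCV's colour conversion returns the channels in Y, Cr, Cb order):

```python
import cv2
import numpy as np

def skin_mask(frame_bgr):
    """Binary skin mask from the thresholds Y > 80, 85 < Cb < 135,
    135 < Cr < 180, all channels in [0, 255]."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)  # note the Y, Cr, Cb channel order
    mask = (y > 80) & (cb > 85) & (cb < 135) & (cr > 135) & (cr < 180)
    return mask.astype(np.uint8) * 255
```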
Face and hands are then determined by choosing the regions with the largest areas (Fig. 3b). These areas can be inhomogeneous, so they are merged by means of "morphological transformations". Notice that here "morphological transformations" refer to image processing operations, and they must not be confused with sign language morphology. These operations consist of a closing operation, produced by the combination of the dilation and erosion operators [7, 34]. Dilation consists of the convolution of an image with a kernel (e.g., a small solid square or disk with the anchor point at its center) which acts as a local maximum operator. The effect is that bright regions of an image widen. Erosion is the converse operation and has the effect of reducing the area of bright regions. The closing operation employed here is produced by first dilating and then eroding the areas detected as skin. The final effect is the removal of areas that arise from noise, and the union of nearby areas that belong to the same region of interest.
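In OpenCV, the closing step reduces to a single call; the sketch below assumes the binary mask produced by the skin detector, and the 9 × 9 square kernel shape is an illustrative choice (Sect. 4.5 reports a structuring element of size 9):

```python
import cv2

# Closing = dilation followed by erosion: removes small noisy areas and
# merges nearby skin regions belonging to the same region of interest.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 9))
closed = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
```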
The overlap of two or more regions of interest results in a loss of information. If one hand overlaps with the other or with the face, it can be difficult to determine its position. This is why such situations have to be carefully detected and handled. In this work, the distance between the camera and the signer is supposed to vary by a small percentage from frame to frame. Considering also that the skin mask contains three areas when no overlap occurs, an overlap can be detected following these rules:
Fig. 2 The proposed isolated sign recognition system
1. If only one skin region is present in the skin mask, a
face–hands overlap is detected;
2. Otherwise, if two regions are present and the face area increased by a certain percentage with respect to the previous frame, a face–hand overlap is detected;
3. Otherwise, if two regions are present and the face area
did not increase, a hand–hand overlap is detected;
4. Otherwise, if three regions are present, no overlap is
detected.
When a hand–hand overlap is detected, the system assigns
the same area to both hands, whereas, in the event of a
face–hand overlap, it assigns the face area to the over-
lapped hand. Whenever a face–hands overlap is detected,
both hands areas coincide with the face one.
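A compact sketch of these rules follows; the 20 % face-area growth threshold is a hypothetical value, since the text only speaks of "a certain percentage":

```python
def detect_overlap(num_regions, face_area, prev_face_area, growth=1.2):
    """Classify the overlap state from the number of regions in the skin
    mask and the face-area growth (rules 1-4 above)."""
    if num_regions == 1:
        return "face-hands"      # rule 1: both hands overlap the face
    if num_regions == 2:
        if face_area > growth * prev_face_area:
            return "face-hand"   # rule 2: one hand overlaps the face
        return "hand-hand"       # rule 3: the two hands overlap
    return None                  # rule 4: three regions, no overlap
```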
3.2 Feature extraction
Table 2 shows the entire feature set calculated from the regions of interest. The set can be divided into two different subsets: the first comprises features 1–8 and includes the centroid coordinates of the hands with respect to the face, their derivatives, and the distance between the two hands. The second subset comprises features 9–17 and represents the general characteristics of the regions, such as the area, the shape type, and the 2-D spatial orientation of the hands.
Features 1–3 and features 4–5, as well as features 9 and 10, differ in the region of interest employed in their computation: features 1–3 and feature 9 exploit the Canny algorithm [8] applied within the bounding rectangle of the hands, whereas features 4–8 and feature 10 directly use the contours segmented from the skin mask. Figure 4 shows an example extracted from the A3LIS-147 database, where the detected area rectangles and the obtained contours are displayed. The contours are then used to calculate the region centroids and the central moments. The centroid is calculated as follows:
$\bar{x} = \frac{m_{10}}{m_{00}}$,  (1)

$\bar{y} = \frac{m_{01}}{m_{00}}$,  (2)

where $m_{ij}$ are the spatial moments defined from the pixel intensities $I(x, y)$ as:

$m_{ij} = \sum_{x} \sum_{y} I(x, y)\, x^i y^j$.  (3)

In the equation, $x$ and $y$ are the pixel abscissa and ordinate, and $I(\cdot, \cdot) \in [0, 255]$. The central moments $\mu_{ij}$ are defined as:

$\mu_{ij} = \sum_{x} \sum_{y} I(x, y) (x - \bar{x})^i (y - \bar{y})^j$.  (4)
Regarding the face centroid $(x_f, y_f)$, its position is almost stationary, but an overlap with the hands can cause incongruous deviations. This effect has been reduced by smoothing the centroid coordinates at frame $t$ as follows:

$(x'_f, y'_f)_t = \beta (x_f, y_f)_t + (1 - \beta)(x'_f, y'_f)_{t-1}$,  (5)
Fig. 3 A frame of the sign "casa". (a) The original video frame. (b) The detected skin regions. (c) The skin regions as modified after the morphological transformation
Table 2 The entire feature set

1  Hands centroid normalized with respect to face region width (Canny-Filter)
2  Derivatives of feature 1
3  Normalized hands centroid distance (Canny-Filter)
4  Hands centroid normalized with respect to face region width
5  Derivatives of feature 4
6  Normalized hands centroid distance
7  Hands centroid referred to the face centroid
8  Derivatives of feature 7
9  Face and hands Hu moments (Canny-Filter)
10 Face and hands Hu moments
11 Hands area normalized with respect to the face area
12 Derivatives of feature 11
13 Face and hands compactness 1
14 Face and hands eccentricity 1
15 Face and hands compactness 2
16 Face and hands eccentricity 2
17 Hands orientation

"Canny-Filter" indicates that features have been calculated after applying the Canny algorithm to the detected regions
where $\beta \in [0, 1]$ has been set to 0.1 in the experiments. In features 1–8, centroid coordinates are normalized with respect to the face region width $w_f$, so that they are independent of the distance from the camera and of the signer's height.
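A possible OpenCV realization of the centroid of Eqs. (1)–(3) and the smoothing of Eq. (5) is sketched below (the function name is illustrative):

```python
import cv2

def smoothed_face_centroid(face_contour, prev, beta=0.1):
    """Face centroid from the spatial moments of Eqs. (1)-(3),
    exponentially smoothed as in Eq. (5) with beta = 0.1."""
    m = cv2.moments(face_contour)
    x, y = m["m10"] / m["m00"], m["m01"] / m["m00"]
    if prev is None:          # first frame: nothing to smooth against
        return (x, y)
    return (beta * x + (1 - beta) * prev[0],
            beta * y + (1 - beta) * prev[1])
```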
Derivatives in features 2 and 5 are calculated as the difference between the coordinates at frame $t$ and at frame $t - 1$:

$\dot{x}_t = x_t - x_{t-1}$,  (6)

$\dot{y}_t = y_t - y_{t-1}$.  (7)
The normalized Euclidean distance between the hands (features 3 and 6) is computed as follows:

$d = \frac{\sqrt{|(x_l, y_l) - (x_r, y_r)|^2}}{w_f}$,  (8)

where the $l$ and $r$ subscripts denote the centroid coordinates of the left and right hand, respectively.
In the computation of feature 7, the origin of the coordinate system is represented by the face area centroid $(x_f, y_f)$, and feature 8 is obtained by calculating its derivative as in Eq. (6).
Regarding the second subset, features 9 and 10 are the Hu moments [23]: these are invariant to translation, changes in scale, and rotation. For both the face and the hands regions, seven different values are computed:

$I_1 = \mu_{20} + \mu_{02}$,  (9)

$I_2 = (\mu_{20} - \mu_{02})^2 + 4\mu_{11}^2$,  (10)

$I_3 = (\mu_{30} - 3\mu_{12})^2 + (3\mu_{21} - \mu_{03})^2$,  (11)

$I_4 = (\mu_{30} + \mu_{12})^2 + (\mu_{21} + \mu_{03})^2$,  (12)

$I_5 = (\mu_{30} - 3\mu_{12})(\mu_{30} + \mu_{12})\left[(\mu_{30} + \mu_{12})^2 - 3(\mu_{21} + \mu_{03})^2\right] + (3\mu_{21} - \mu_{03})(\mu_{21} + \mu_{03})\left[3(\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2\right]$,  (13)

$I_6 = (\mu_{20} - \mu_{02})\left[(\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2\right] + 4\mu_{11}(\mu_{30} + \mu_{12})(\mu_{21} + \mu_{03})$,  (14)

$I_7 = (3\mu_{21} - \mu_{03})(\mu_{30} + \mu_{12})\left[(\mu_{30} + \mu_{12})^2 - 3(\mu_{21} + \mu_{03})^2\right] - (\mu_{30} - 3\mu_{12})(\mu_{21} + \mu_{03})\left[3(\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2\right]$.  (15)

In the equations, $\mu_{ij}$ is the respective central moment defined in Eq. (4), and $I_7$ is useful in distinguishing mirror images.
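In practice, Eqs. (9)–(15) need not be coded by hand: OpenCV computes them directly from the moments of a contour, as the following sketch shows:

```python
import cv2

# Hu's seven invariant moments (Eqs. 9-15) for a region contour.
# cv2.moments returns the spatial and central moments of Eqs. (3)-(4).
hu = cv2.HuMoments(cv2.moments(contour)).flatten()  # array of 7 values
```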
Feature 11 is obtained by calculating the areas of the recognized skin zones as the total number of detected pixels, and the derivatives (feature 12) are calculated similarly to Eq. (6):

$\dot{A}_t = A_t - A_{t-1}$,  (16)

where $A_t$ denotes the area of a region of interest at time frame $t$.
Compactness $c_1$ (feature 13) and eccentricity $e_1$ (feature 14) of face and hands have been proposed in [37]. The first is computed as the ratio between the minor ($\mathrm{Axis}_{min}$) and major ($\mathrm{Axis}_{max}$) axis lengths of the rotated rectangle that inscribes the detected region:

$c_1 = \frac{\mathrm{Axis}_{min}}{\mathrm{Axis}_{max}}$.  (17)

Eccentricity is calculated from the area $A$ and the perimeter $P$ of the region as follows:

$e_1 = \frac{A_{norm}}{P^2}$,  (18)

where $A_{norm}$ is obtained by dividing the area $A$ by the face area, and the perimeter $P$ is the length of the region contour used for feature 4.
Fig. 4 Contour detection from the skin regions (a). Bounding rectangles of the hand skin zones (b). The hand edges obtained by the Canny algorithm (c)

Features 15–17 have been introduced in [2] and are defined as follows:

$c_2 = \frac{4\pi A_{norm}}{P^2}$,  (19)

$e_2 = \frac{(\mu_{20} - \mu_{02})^2 + 4\mu_{11}}{A_{norm}}$,  (20)

$o_1 = \sin(2\alpha), \quad o_2 = \cos(\alpha)$.  (21)

Equations (19) and (20) represent alternative definitions of compactness and eccentricity. Equation (21) defines the hand orientation, obtained from the angle between the image vertical axis and the main axis of the rotated rectangle that inscribes the hand region. In the equation, the angle $\alpha \in [-90°, 90°]$, and the orientation is split into $o_1$ and $o_2$ to ensure stability at the interval borders.
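A short sketch of how the orientation features of Eq. (21) can be obtained from OpenCV's rotated bounding rectangle is given below; the exact angle convention of cv2.minAreaRect varies between OpenCV versions, so the mapping to $\alpha$ is an assumption:

```python
import cv2
import math

# Rotated rectangle inscribing the hand region; its angle is taken as
# the orientation alpha of Eq. (21), split into sine and cosine to stay
# stable at the borders of the [-90, 90] degree interval.
(cx, cy), (w, h), angle = cv2.minAreaRect(hand_contour)
alpha = math.radians(angle)
o1, o2 = math.sin(2.0 * alpha), math.cos(alpha)
```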
3.3 Sign recognition
Let $S = \{s_1, s_2, \ldots, s_N\}$ be the set of signs and $N$ the total number of signs. Each sign $s_i$ is modelled with a different HMM $\lambda_i$. To capture the sequential nature of a sign execution, a left–right structure has been purposely chosen (Fig. 5). State emission probability density functions are represented by mixtures of Gaussians with diagonal covariance matrices. In defining the sign model, the number of states and the number of Gaussians in the mixture must be set a priori. These parameters have been determined experimentally, as described in Sect. 4.
Denoting with $O$ the observation sequence, sign recognition is performed by calculating the probability $P(O|\lambda_i)\ \forall i \in \{1, 2, \ldots, N\}$ through the Viterbi algorithm and then selecting the model whose probability is maximum.
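The following sketch illustrates this training and recognition scheme using the hmmlearn library as a stand-in for the HTK toolkit actually employed in Sect. 4; the left-right topology is enforced through the initial transition matrix:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_sign_models(train_data, n_states=18):
    """One left-right Gaussian HMM (diagonal covariances) per sign.
    train_data maps each sign label to a list of (T_i, L) feature arrays."""
    models = {}
    for sign, seqs in train_data.items():
        m = GaussianHMM(n_components=n_states, covariance_type="diag",
                        init_params="mc", params="stmc")
        # Left-right topology: start in state 0, allow only self-loops
        # and transitions to the next state (zeros stay zero under EM).
        m.startprob_ = np.eye(n_states)[0]
        m.transmat_ = 0.5 * (np.eye(n_states) + np.eye(n_states, k=1))
        m.transmat_[-1, -1] = 1.0
        m.fit(np.vstack(seqs), [len(s) for s in seqs])
        models[sign] = m
    return models

def recognize(models, obs):
    """Return the sign whose model gives the highest Viterbi log-likelihood."""
    return max(models, key=lambda s: models[s].decode(obs)[0])
```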
4 Experiments
Video processing has been performed by means of the open source computer vision library OpenCV [6]. The aforementioned A3LIS-147 database has been used both for training the recognizer and for evaluating its performance. The HMM-based sign recognition stage has been implemented using the HMM ToolKit (HTK) [40].
The first experiment studies how the number of states and of Gaussians per state in the sign models affects the recognition performance. The second experiment analyses the performance of the system for different combinations of features. The last experiment investigates the optimal number of states and mixtures for the selected features and performs the final test.
All the experiments have been performed following the three-way data split approach [31] combined with the K-fold cross-validation method [38]. The dataset is split into $K = 10$ folds, each one composed of all the signs executed by the same signer. For each parameter combination, two different folds are chosen iteratively as validation and test sets. The remaining $K - 2$ folds form the training set, and the algorithm parameters are determined on the validation fold. In the feature selection, each combination is tested with the same method. The final system performance is evaluated using one fold as test set and the remaining $K - 1$ folds as training set. Note that, differently from [24], the independence between training, validation, and test sets is always guaranteed.
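The signer-based fold iteration can be summarized as follows (the signer identifiers are those of Table 3):

```python
from itertools import permutations

signers = ["FAL", "FEF", "FSF", "MDP", "MDQ",
           "MIC", "MMR", "MRLA", "MRLB", "MSF"]  # one fold per signer

# Three-way data split: every ordered (validation, test) signer pair;
# the remaining eight signers always form the training set.
for val, test in permutations(signers, 2):
    train = [s for s in signers if s not in (val, test)]
    # ... tune parameters on `val`, then measure accuracy on `test`
```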
The performance has been evaluated using the sign recognition accuracy, defined on the basis of the number of substitution (S), deletion (D), and insertion (I) errors, as in speech recognition tasks [40]:

$\mathrm{Accuracy} = \frac{N - D - S - I}{N}$,  (22)

where $N$ is the total number of labels in the reference transcriptions. It is worth pointing out that in the isolated sign recognition task addressed in this work only substitution errors are present.
4.1 Model parameters selection
To find an estimate of the number of states and of the number of Gaussians per state of the sign models, the system has been tested with different combinations of these parameters. In particular, the number of states has been varied from 1 to 20 and the number of Gaussians from 1 to 8. Observations are represented by features 4, 5, and 6.
Figure 6 shows the obtained results: as expected, increasing the number of states and the number of Gaussians per state does not always produce better performance.
Fig. 5 A five-state left-right HMM. States I and F are non-emitting
Fig. 6 Sign recognition accuracy for different numbers of states and mixtures, using the three-way data split and the K-fold cross-validation methods. The circle indicates the selected operating point
In particular, as long as the model has few parameters to train, i.e., the number of states is less than 5, the system performance increases with the number of states or mixtures. Beyond this value, the training data is not sufficient to accurately estimate the model parameters. As shown in Fig. 6, increasing the number of mixtures does not produce a further performance improvement when the number of states is higher than 7. As a starting point for the feature selection, an HMM model with 18 states and 1 mixture has been chosen. This combination gives a recognition accuracy of 36.43 % (highlighted with a circle in Fig. 6).
4.2 Feature selection
To study which combination of features gives the overall best performance in the proposed system, the Sequential Forward Selection (SFS) technique has been followed [38]. SFS is suboptimal, but requires a lower computational burden than full-search approaches. SFS consists of the following steps:
– Definitions
  – M: total number of features.
– Initialization
  – Create a set of M feature vectors, each composed of a single feature.
– Algorithm
  – for i = 1 to M − 1
    1. Compute the criterion value for each feature vector of the set.
    2. Select the winner vector, i.e., the one whose criterion value is greatest.
    3. Create a new set of M − i feature vectors, adding each of the excluded features to the winning vector.
  – end
– Select the feature vector whose criterion value is greatest among all the created feature vectors.
In the experiments, the sign recognition accuracy has been used as the criterion value.
The results for each combination of features have been
evaluated by means of the three-way data split combined
with the K-fold cross-validation method, as illustrated in
Sect. 4.
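A generic sketch of the procedure, with `evaluate` standing for the cross-validated accuracy of a candidate feature subset, could read:

```python
def sfs(all_features, evaluate):
    """Sequential Forward Selection: greedily grow the feature subset,
    keeping the one with the best criterion value seen so far."""
    selected, remaining = [], list(all_features)
    best_subset, best_score = [], float("-inf")
    while remaining:
        # Try adding each excluded feature to the current winner vector.
        scores = {f: evaluate(selected + [f]) for f in remaining}
        winner = max(scores, key=scores.get)
        selected.append(winner)
        remaining.remove(winner)
        if scores[winner] > best_score:
            best_subset, best_score = list(selected), scores[winner]
    return best_subset, best_score
```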
Figure 7 shows the results obtained for the 17 iterations of the SFS technique. Each column of the chart reports the accuracy averaged over the three-way data split with K-fold cross-validation steps. The figure shows that adding more features consistently improves the performance until a maximum is reached at iteration 7. Then, the accuracy degrades systematically, demonstrating that the additional features do not improve the recognition capabilities of the system. The feature vector that gives the overall best accuracy is composed of features 5, 6, 7, 8, 12, 13, and 14, for a total of 21 coefficients.
Table 3 shows the detailed results for each signer. As the table shows, the best sign recognition accuracy is obtained by signer MRLB using features 5, 7, and 13. As aforementioned, the overall best accuracy is obtained at iteration number seven. It is interesting to note the high degree of variability across the signers: the recognition accuracy standard deviation at iteration seven is 11.01 %, suggesting that further normalization techniques could be employed to improve signer independence.
4.3 Model parameters refinement
To find the overall best sign recognition accuracy obtainable with the proposed system, an additional experiment has been performed. Here, the feature set determined in the previous section is employed, and the procedure described in Sect. 4 has been followed to determine the number of states and the number of Gaussians per state of the HMM sign model. The results in Table 4 show that the model with 18 states and 1 mixture per state still produces the best performance for the selected features.
Figure 8 shows the final results for each signer using the
determined model. The final average recognition accuracy
of the system is 48.06 %.
4.4 Discussion
Fig. 7 Sign recognition accuracy for the 17 SFS iterations

Better insights into the performance of the HMM-based system can be gained by analyzing the correlation between the edit distances of the sign transcriptions and the related error rates.
Every sign has been transcribed using the HamNoSys transcription system [22] (the complete transcriptions are shown in the Appendix), and calculating the edit distance between a pair of signs gives valuable information about their confusability. Since HamNoSys describes the movements of both hands in a single string, each transcription has been decomposed into two substrings, each related to a different hand. If the sign is executed with one hand, an empty string has been assigned to the corresponding unused hand. The distance between signs has been computed by comparing only the strings that belong to the same hand, and the resulting edit distance is given as the sum of the values obtained for each hand. The edit distance has been calculated assigning an equal cost of 1 to character insertions and deletions, and a cost of 2 to substitutions.
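A compact implementation of this weighted edit distance could read as follows:

```python
def edit_distance(a, b, ins=1, dele=1, sub=2):
    """Weighted Levenshtein distance: cost 1 for insertions and
    deletions, cost 2 for substitutions, applied per-hand to the
    HamNoSys strings and summed over the two hands."""
    dp = list(range(0, (len(b) + 1) * ins, ins))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i * dele
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + dele,      # delete ca
                                     dp[j - 1] + ins,   # insert cb
                                     prev + (0 if ca == cb else sub))
    return dp[-1]
```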
The relationship between the error rate and the edit distance has been obtained as follows: denoting with $D(s_i, s_j)$ the edit distance between signs $s_i$ and $s_j$, and with

$S_d = \{(s_i, s_j)\,|\,D(s_i, s_j) = d,\ i = 1, \ldots, N,\ j = 1, \ldots, N,\ i \neq j\}$,  (23)

the set of sign pairs having edit distance $d$, the average error rate for the edit distance $d$ can be defined as:

$E(d) = \frac{1}{|S_d|} \sum_{\forall (s_i, s_j) \in S_d} e(s_i, s_j)$,  (24)

where $|S_d|$ indicates the number of elements in $S_d$, and

$e(s_i, s_j) = \frac{A(i, j)}{\sum_{j=1}^{N} A(i, j)},\quad i \neq j$.  (25)

In the equation, the term $A(i, j)$ indicates the number of times sign $s_i$ has been classified as sign $s_j$. The matrix $A = \{A(i, j)\}_{i,j=1}^{N}$ is the confusion matrix. Figure 9 shows how $E(d)$ varies for the different values of the edit distance. As expected, there is an inverse relationship between edit distances and error rates. Although this is particularly evident for values greater than 10, the behaviour is more irregular when the distance is below this value. For example, the error rate is 10 % when the edit distance is 2, while it reaches 20 % when the edit distance increases to 5. The authors presume that these errors can be ascribed to the overlap between face and hands and between the two hands, and to the inability to discriminate the 3-D position of the subject.
Table 3 Sign recognition accuracy (%) for each A3LIS-147 signer

Signer    Iteration
          1      2      3      4      5      6      7
FAL       42.76  45.83  47.59  47.44  49.35  51.42  52.03
FEF       15.86  23.83  25.06  25.37  26.52  23.30  23.07
FSF       26.90  31.34  34.02  35.32  33.71  34.48  34.79
MDP       34.03  40.28  44.83  45.99  49.15  52.55  52.39
MDQ       41.38  49.35  51.34  51.49  51.65  52.79  53.94
MIC       38.39  39.69  40.00  41.76  42.84  45.37  45.98
MMR       47.82  52.41  54.94  56.93  54.17  58.70  58.39
MRLA      33.57  41.96  45.22  50.35  54.16  50.19  50.35
MRLB      43.30  56.93  57.47  57.32  57.09  56.93  57.32
MSF       42.68  46.44  45.75  48.58  50.88  51.49  52.34
Average   36.67  42.81  44.62  46.06  46.95  47.72  48.06
Std Dev   9.50   9.80   9.68   9.79   9.80   10.88  11.01
Table 4 Average sign recognition accuracy (%) for different numbers of states and Gaussians, using features 5, 6, 7, 8, 12, 13, and 14

States   Gaussians
         1      2      3      4
15       48.02  47.32  44.49  41.34
16       47.69  47.11  44.25  41.00
17       47.97  47.20  44.26  40.35
18       48.06  47.02  43.81  40.06
19       47.98  46.63  43.10  39.69
20       46.97  46.31  42.92  38.92
21       46.82  44.65  41.06  37.95
Fig. 8 Overall sign recognition accuracy using 18 states and 1 mixture

Fig. 9 Correlation between edit distance and error rate. The "X" indicates that at positions 1, 3, and 4 there are no sign pairs having these distance values
As an example, consider the pairs "Lunedì–Sabato" and "Lunedì–Venerdì". The first has edit distance 2 and an error rate of 10 %, while the second has edit distance 5, but the error rate increases to 20 %. Observing their transcriptions in Table 5, it can be seen that all the signs of the days of the week have similar arm movements, and the information is entirely contained in the hand shapes. However, for the signs "Lunedì" and "Venerdì", the position of the right hand and its overlap with the face are very similar. This can be noticed by observing the frame sequence excerpts in Fig. 10.
Analysing the results reported in Fig. 8, it is evident that the worst accuracy values are always scored by signer FEF. The authors suppose that the cause is again related to the overlap between head and hands. In fact, this signer often executes the signs with the hands closer to the head region with respect to the other signers of the corpus. The consequence is an increase in the number of overlaps between head and hands, i.e., a loss of useful information.
4.4.1 Related background
In this subsection, the obtained results are discussed in the light of the state-of-the-art contributions currently present in the literature. Table 6 shows the accuracy obtained by each method, the characteristics of the corpus employed in the experiments, whether the recognition is performed at the word or sentence level, and whether the method operates in a signer-dependent or signer-independent scenario. First of all, consider the first Italian sign language recognition work, proposed by [24]. As mentioned above, that framework deals with the continuous sign case study. It extracts geometric features, employs a neural network for the classification stage, and uses a database with a vocabulary of 40 signs. No specific distinction is made among the signers involved in training and testing of the system, in contrast to the proposed approach, which is purely signer-independent. The presented results report the overall accuracy of correctly translated sentences using two different data sets.
using two different data set. In the first, the test set is
Fig. 10 Frame sequence excerpts of words ‘‘Lunedı’’ (left) and
‘‘Venerdı’’ (right)
Table 5 Recall values for each day-of-week sign and the HamNoSys transcriptions
In the first, the test set is composed of 30 sentences with 20 distinct signs, and the obtained accuracy value is 83.30 %. In the second, the test set is composed of 80 sentences and 40 distinct signs, and the accuracy value is 82.50 %.
Looking at the international projects dealing with the signer-independent isolated sign recognition case study, the work described in [2] operates using manual features in controlled environments. Experiments conducted on a database involving four signers give an average recognition accuracy of about 45.23 %, which is consistent with the performance obtained by the HMM system proposed in this paper, whose accuracy is 48.06 %.
Hidden Markov Model-based recognition applied to sign language was proposed for the first time in [35]. To obtain real-time execution, the signer wore distinctly coloured gloves on each hand, and frames were discarded when the tracking of one or both hands was lost. The authors performed an experiment using 395 training sentences and 99 independent test sentences with a vocabulary of 40 words. The system achieved an accuracy of 99.2 % using grammar modelling and 91.3 % with a free grammar. In a subsequent study [36], the authors introduced two different datasets, changed the camera mount point, and recorded the same sentences. For testing purposes, 384 sentences were used for training and 94 were reserved for testing; the testing sentences were not used in any portion of the training process. For the second-person view (camera mounted on the desk), the performance reaches an accuracy of 91.9 % with grammatical rules and 79.5 % unrestricted. For the first-person view (hat-mounted camera), the obtained accuracy is 97.8 % with grammatical rules and 96.8 % unrestricted.
In [42], a simple feature combination has been studied to improve the performance of the recognition system. The authors obtained relevant results, with an overall accuracy of 89.09 %, with the combination of PCA, the kurtosis position feature vector, and the MCC feature. They also used an HMM classifier, but a comparison between these results and the ones described in this paper is not feasible because of the wide difference between the databases used. Moreover, it must be observed that, as clearly stated in [42], "the system vocabulary is restricted to avoid hand over face occlusion, location interchange of dominant hand and right hand" in their work, in contrast to the present contribution, where all signs have been considered for training and testing.
Noteworthy are the studies proposed in [18] and [3], where alternative acquisition techniques have been investigated. In [18], two CyberGloves and three Polhemus 3SPACE position trackers have been used as input devices. The gloves collect the information about hand shapes, and the trackers collect the orientation, position, and movement trajectory information. The classification stage is based on a novel combination of HMMs and self-organizing feature maps to implement a signer-independent recognizer of Chinese sign language. All the experiments have been conducted on a database where 7 signers have performed 208 signs 3 times each. The tests presented an overall accuracy of 88.20 % for the unregistered test set.
A system for continuous sign language recognition has been introduced in [3]. The main aspect of the proposed approach lies in the acquisition level, where an electromyographic (EMG) data-glove has been used. The EMG allows a good gesture segmentation, based on muscle activity. The classification phase is divided into two distinct levels: gesture recognition and language modelling. The former is based on a four-channel HMM classifier, where each channel identifies one gesture articulation: movement, hand shape, orientation, and rotation.
Table 6 Summary of the existing contributions in the state of the art

Contr.  Accuracy (%)   V/Se/Si      Rec. type   Signer        Device
[2]     45.23          ≈220/–/4     Word        Independent   Camera
[18]    88.20          208/4368/7   Word        Independent   Gloves & trackers
[24]    83.30–82.50    40/30–80/2   Sentence    Dependent     Camera
[42]    89.09          30/110/3     Sentence    Dependent     Camera
[35]    99.20–91.30    40/494/1     Sentence    Dependent     Camera
[36]    91.90–79.50    40/478/1     Sentence    Dependent     Camera*
[36]    97.80–96.80    40/478/1     Sentence    Dependent     Camera**

V/Se/Si indicates the vocabulary dimension, the number of sentences, and the number of signers
* Second-person view. ** First-person view
The latter uses a single-channel HMM classifier to match a recognized gesture with a model sentence based on language grammar and semantics. Unfortunately, the system has been tested only on the authors' own database, and no results are reported.
4.5 Computational complexity
Here, a detailed computational complexity analysis of the algorithms of the automatic LIS recognition system is discussed, and the real-time factor related to the framework implementation on a common hardware/software platform is provided.
The FHD stage is characterized by two main algorithms: skin detection and morphological smoothing. The Feature Extraction stage is characterized by the Canny edge detector and the operations required to calculate the features of Table 2, which are directly performed on the image regions determined by the FHD stage on each video frame. Table 7 reports the number of FLoating-point Operations Per Second (FLOPS) of each stage, as well as their time occupancy with respect to the total execution time. Consider that, in calculating the number of FLOPS, multiplications/divisions and additions/subtractions have been equally weighted. Note also that for those operations it is assumed that the skin regions are composed of 30 % of the overall amount of image pixels; this value coincides with the worst case in the A3LIS-147 database. The equivalent time consumed for each operation is based on the performance achieved on a Toshiba Satellite L500D-159 laptop equipped with an AMD Athlon II Dual-Core M300 running at 2 GHz and with 4 GB of RAM. It is worth reminding that the frame rate of the database videos is 25 fps.
Looking at the values reported in Table 7, it is evident that the Canny filter represents the bottleneck of the FHD and Feature Extraction stages from the computational complexity perspective, and that there is thus room to calculate other features based on the Canny outcome and/or to evaluate better-performing skin detection procedures. Note that also the morphological smoothing can induce a certain burden, depending on the size of the structuring element: in the present case studies, its size is equal to 9, and bigger values were not beneficial to the overall system performance.
As for the recognition stage, the Viterbi algorithm [30] requires most of the computational resources. Indeed, the algorithm is executed for each sign to calculate the probability that the input feature vectors are generated by the corresponding sign model. The computational cost of these operations depends on the number of HMM states $N$, the number of signs in the vocabulary $V$, and the number of frames per second $T$ of the video input to recognize. Moreover, it is linearly dependent on the operations needed to calculate the observation density, i.e., a mixture of $G$ Gaussians whose observation vector has size $L$. As pointed out in [30], the computational cost is given by:

$\mathrm{FLOPS} \approx V T N^2 (6L + 2) G$.  (26)

Therefore, assuming $N = 18$, $G = 1$, $L = 21$ (as in the best results reported above) and $V = 147$, $T = 25$ (due to the database characteristics), the computational cost is approximately 152.4 MFLOPS, thus greater than the one attained in the two former stages.
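A quick numerical check of Eq. (26) with these values confirms the figure:

```python
# Eq. (26) with the best-performing configuration reported above.
V, T, N, G, L = 147, 25, 18, 1, 21
flops = V * T * N**2 * (6 * L + 2) * G
print(flops / 1e6)  # -> 152.4096, i.e., ~152.4 MFLOPS
```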
The computational burden of the recognition process has
been evaluated also in terms of real-time factor, defined as
the ratio between the processing time and the length of the
video sequence. On average, the whole process, including
the features calculation and the Viterbi algorithm for sign
recognition, requires about 8 s per video segment. This
results in a real-time factor of about 2.6, which represents
an encouraging result for future real-time developments
also on embedded platforms. Indeed, considering that no specific optimization strategies have been carried out so far, there is room for improvement from this perspective.
5 Comparative evaluation with SVMs
To gain better insights into the performance of the HMM-based system, an SVM approach has been considered [10]. SVMs are binary classifiers that discriminate whether an input vector $x$ belongs to class $+1$ or to class $-1$ based on the following discriminant function:
Table 7 Computational cost in MFLOPS (Mega-FLOPS) and time consumed by all algorithms in the automatic LIS recognition tool

                          MFLOPS   Time occupancy (%)
Skin detector             31.1     10.00
Morphological smoothing   112.0    25.00
Canny filter              134.5    30.00
Feature operations        6.2      1.67
Recognition               152.4    33.33

"Feature operations" includes all formulas appearing in Table 2, the calculation of the moments used therein, and the face smoothing described in Eq. (5)
$f(x) = \sum_{i=1}^{N} a_i t_i K(x, x_i) + d$,  (27)

where $t_i \in \{+1, -1\}$, $a_i > 0$, and $\sum_{i=1}^{N} a_i t_i = 0$. The terms $x_i$ are the "support vectors", and $d$ is a bias term that, together with the $a_i$, is determined during the training process of the SVM. The input vector $x$ is classified as $+1$ if $f(x) \geq 0$ and $-1$ if $f(x) < 0$. The kernel function $K(\cdot, \cdot)$ can assume different forms [5].
Regardless of the kernel, the basic assumption is that all the input vectors have the same dimension. However, in the sign recognition scenario, the video sequences are composed of a varying number of frames, thus the number of feature vectors changes from video to video. A popular approach to deal with this problem is the use of so-called alignment kernels.
In this work, the Gaussian Dynamic Time Warping (GDTW) kernel [21] has been employed. The kernel assumes the form:

$K(x, x_j) = \exp(-\gamma\, \mathrm{DTW}(x, x_j)),\quad \gamma > 0$.  (28)

Notice that GDTW is a variant of the radial basis function kernel $K(x, x_i) = \exp(-\gamma \|x - x_i\|^2)$, where the term $\|x - x_i\|^2$ is replaced with the Dynamic Time Warping distance [29], and is thus able to deal with variable-length sequences.
Since SVMs are binary classifiers, the multiclass prob-
lem has been addressed using the ‘‘one versus all’’ strategy.
All the experiments have been performed using LIBSVM
(a library for Support Vector Machines) [9].
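A sketch of the GDTW kernel follows; the quadratic-time DTW implementation is a plain textbook version, and the default $\gamma = 2^9$ anticipates the value selected on the validation folds below:

```python
import numpy as np

def dtw(seq_a, seq_b):
    """Dynamic Time Warping distance between two (T, L) feature
    sequences, with the Euclidean distance as local cost."""
    ta, tb = len(seq_a), len(seq_b)
    D = np.full((ta + 1, tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[ta, tb]

def gdtw_kernel(seq_a, seq_b, gamma=2.0**9):
    """Gaussian DTW kernel of Eq. (28)."""
    return np.exp(-gamma * dtw(seq_a, seq_b))
```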
As for the HMM model parameter selection illustrated in Sect. 4.1, the first test aims at evaluating the best $\gamma$ value to use for the feature selection tests. Figure 11 shows the results obtained varying $\gamma$ from $2^{-15}$ to $2^{17}$. The best accuracy, 24.93 %, is reached for $\gamma = 2^9$. All the tests have been executed following the three-way data split method combined with the 10-fold cross-validation, as for the HMM-based system.
Figure 12 shows the final performance obtained using the same features as the HMM approach: the resulting accuracy is 21.49 %, below the one obtained with the HMM system.
Since the adopted SFS procedure depends on the classifier, it has been repeated for the SVM approach to find the best performing features. The resulting features for the SVM are the normalized hands centroid distance and the hands centroid referred to the face centroid. Figure 13 shows the obtained results: the average accuracy of the SVM improves to 34.14 %; however, it is still below the one obtained with the HMM approach, 48.06 %.
6 Conclusion
In this work, a system for the automatic recognition of
isolated signs of the Italian sign language has been
Fig. 11 Sign recognition accuracy for different values of $\gamma$, using the three-way data split and the K-fold cross-validation methods
Fig. 12 Overall sign recognition accuracy using the same features as the HMM system
Fig. 13 Overall sign recognition accuracy using the optimal features for the SVM approach (i.e., features 6 and 7)
described and evaluated. The approach operates independently of the signer, and it is based on three distinct algorithmic stages operating in cascade: the first for face and hands detection, the second for feature extraction, and the third for sign recognition, based on the HMM paradigm. The authors, with the support of the local office of the Deaf People National Institute (ENS), have created a suitable database made of 1,470 signs to be used for the training and testing phases of the recognizer. In the experiments, several features have been evaluated for classification, and the overall best performing set has been determined using the sequential feature selection technique.
The computer simulations performed have shown that it is possible to reach an average recognition accuracy of almost 50 %, which is in line with the results obtained by recognition systems proposed in the literature for other sign languages in similar experimental case studies. A detailed discussion of the obtained results has been provided by analysing the correlation between the edit distance of the HamNoSys transcriptions of the signs and the misclassification rates. The study demonstrated that they are inversely related, and that when the edit distance is low the main sources of errors are the overlap between the hands, and between hands and face.
To prove the effectiveness of the HMM-based system, a comparative evaluation with an SVM-based approach has been conducted. Two different tests have been executed to evaluate the SVM performance. In the first, the same features used in the model refinement experiment of the HMM approach have been used. In the second, a new set of features has been chosen to achieve the best SVM performance. In both cases, the accuracy values are below those obtained with the HMM system.
As future work, different issues will be addressed to achieve an overall improvement of the recognition accuracy. As reported in [39], better skin detection techniques than the one available in the OpenCV library could be used to improve the discrimination of the regions of interest. Another substantial improvement concerns the problem of overlaps and the need to enhance the shape detection. These aspects will be handled by investigating the approaches proposed in [28] and [25]. The introduction of the PCA technique has proved to be a good approach to reduce the dimensionality of appearance-based frame features. In fact, the technique has been widely used in different frameworks [2, 20, 42] and seems to have a beneficial influence. Another aspect to investigate could be the introduction of a different hands tracker, like those proposed in [3, 18].
Moreover, the development of a new sign database including audio and spatial information (e.g., using multiple cameras, or Microsoft Kinect™) to augment the feature set and reduce errors due to overlapping and wrong spatial detection of the subject is currently under investigation. Clearly, in the new database it will be necessary to increase the number of vocabulary signs and to introduce more realizations of each sign per signer. The proposed approach will also be assessed on other sign corpora to compare the results with alternative techniques, and to verify the performance in different scenarios.
Finally, the authors have also been directing their efforts towards extending the usability of the tool to the continuous sign case study, which requires the introduction of a suitable language model within the recognizer stage. The HMM-based approach used in the proposed framework surely represents a good choice from this perspective, as effectively demonstrated by many continuous speech recognition applications [33, 35, 36].
Appendix: HamNoSys Sign Transcription
[The complete HamNoSys transcriptions of the 147 signs are reproduced as figures in the original publication.]
References
1. von Agris U, Kraiss KF (2007) Towards a video corpus for signer-independent continuous sign language recognition. In: Proceedings of the International Workshop on Gesture in Human–Computer Interaction and Simulation, Lisbon
2. von Agris U, Zieren J, Canzler U, Bauer B, Kraiss KF (2008) Recent developments in visual sign language recognition. Univers Access Inf Soc 6(4):323–362
3. Al-Ahdal M, Tahir N (2012) Review in sign language recognition
systems. In: Proceedings of IEEE Symposium on Computers
Informatics, pp 52–57
4. Bertoldi N, Tiotto G, Prinetto P, Piccolo E, Nunnari F, Lombardo
V, Mazzei A, Damiano R, Lesmo L, Del Principe A (2010) On
the creation and the annotation of a large-scale Italian-LIS par-
allel corpus. In: Proceedings of the International Conference on
Language Resources and Evaluation, pp 19–22
5. Bishop C (2006) Pattern recognition and machine learning. Springer Science+Business Media, LLC, New York
6. Bradski G (2000) The OpenCV library. Dr Dobb’s J Softw Tools
25(11):120 (122–125)
7. Bradski G, Kaehler A (2008) Learning OpenCV: computer vision
with the OpenCV library. O’Reilly Media Inc., Sebastopol
8. Canny J (1986) A computational approach to edge detection.
IEEE Trans Pattern Anal Mach Intell 8(6):679–698
9. Chang CC, Lin CJ (2011) LIBSVM: A library for support vector
machines. ACM Trans Intell Syst Technol 2:1–27
10. Cortes C, Vapnik V (1995) Support-vector networks. In: Machine
Learning, pp 273–297
11. Dreuw P, Forster J, Deselaers T, Ney H (2008) Efficient
approximations to model-based joint tracking and recognition of
continuous sign language. In: Proceedings of the IEEE Interna-
tional Conference on Automatic Face Gesture Recognition, Los
Alamitos, pp 1–6
12. Dreuw P, Neidle C, Athitsos V, Sclaroff S, Ney H (2008) Benchmark databases for video-based automatic sign language recognition. In: Proceedings of the International Conference on Language Resources and Evaluation, European Language Resources Association, Marrakech
13. Dreuw P, Ney H (2008) Visual modeling and feature adaptation
in sign language recognition. In: ITG Conference on Voice
Communication (SprachKommunikation), pp 1–4
14. Dreuw P, Ney H, Martinez G, Crasborn O, Piater J, Moya JM,
Wheatley M (2010) The signspeak project—bridging the gap
between signers and speakers. In: Proceedings of the Interna-
tional Conference on Language Resources and Evaluation.
Valletta, Malta
15. Dreuw P, Rybach D, Deselaers T, Zahedi M, Ney H (2007)
Speech recognition techniques for a sign language recognition
system. In: Proceedings of Interspeech, pp 2513–2516
16. Fagiani M, Principi E, Squartini S, Piazza F (2012) A new
Italian sign language database. In: Zhang H, Hussain A, Liu D,
Wang Z (eds) Advances in brain inspired cognitive systems,
Lecture Notes in Computer Science, vol 7366. Springer,
pp 164–173
17. Fagiani M, Principi E, Squartini S, Piazza F (2013) A new system
for automatic recognition of italian sign language. In: Apolloni B,
Bassis S, Esposito A, Morabito FC (eds) Neural nets and sur-
roundings, smart innovation, systems and technologies, vol 19.
Springer, Berlin, pp 69–79
18. Fang G, Gao W, Ma J (2001) Signer-independent sign language
recognition based on SOFM/HMM. In: Proceedings of the IEEE
ICCV workshop on recognition, analysis, and tracking of faces
and gestures in real-time systems, pp 90–95
19. Gonzalez RC, Woods RE (2007) Digital image processing, 3rd
edn. Prentice Hall, USA
20. Gweth Y, Plahl C, Ney H (2012) Enhanced continuous sign
language recognition using PCA and neural network features. In:
Proceedings of the computer vision and pattern recognition
workshops, pp 55–60
21. Haasdonk B, Bahlmann C (2004) Learning with distance substitution kernels. In: Rasmussen CE, Bülthoff HH, Schölkopf B, Giese MA (eds) Pattern Recognition, Lecture Notes in Computer Science. Springer, Berlin, pp 220–227
22. Hanke T (2004) HamNoSys-representing sign language data in
language resources and language processing contexts. In: Pro-
ceedings of LREC, pp 1–6
23. Hu MK (1962) Visual pattern recognition by moment invariants.
IRE Trans Inf Theory 8(2):179–187
24. Infantino I, Rizzo R, Gaglio S (2007) A framework for sign
language sentence recognition by commonsense context. IEEE
Trans Syst Man Cybern C Appl Rev 37(5):1034–1039
25. Kelly D, McDonald J, Markham C (2010) A person independent
system for recognition of hand postures used in sign language.
Pattern Recognit Lett 31(11):1359–1368
26. Kukharev G, Nowosielski A (2004) Visitor identification—
elaborating real time face recognition system. In: Proceedings of
the international conference on computer graphics, visualization
and computer vision. UNION Agency, Plzen, pp 157–164
27. Maebatake M, Suzuki I, Nishida M, Horiuchi Y, Kuroiwa S
(2008) Sign language recognition based on position and move-
ment using multi-stream HMM. In: Proceedings of the interna-
tional symposium on universal communication. Los Alamitos,
pp 478–481
28. Quan Y (2010) Chinese sign language recognition based on video
sequence appearance modeling. In: Proceedings of the IEEE
Conference on Industrial Electronics and Applications,
pp 1537–1542
29. Rabiner L, Juang BH (1993) Fundamentals of speech recognition.
Prentice-Hall Inc., Upper Saddle River
30. Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77(2):257–286
31. Ripley BD (1996) Pattern recognition and neural networks.
Cambridge University Press, Cambridge
32. Sandler W, Lillo-Martin D (2006) Sign language and linguistic
universals. Cambridge University Press, Cambridge
33. Saon G, Chien JT (2012) Large-vocabulary continuous speech
recognition systems: a look at some recent advances. IEEE Signal
Process Mag 29(6):18–33
34. Serra J (1983) Image analysis and mathematical morphology.
Academic Press Inc., Orlando
35. Starner T, Pentland A (1995) Real-time American sign language
recognition from video using hidden Markov models. In: Pro-
ceedings of the international symposium on computer vision,
pp 265–270
36. Starner T, Weaver J, Pentland A (1998) Real-time American sign
language recognition using desk and wearable computer based
video. IEEE Trans Pattern Anal Mach Intell 20(12):1371–1375
37. Theodorakis S, Katsamanis A, Maragos P (2009) Product-HMMs
for automatic sign language recognition. In: Proceedings of the
international conference on acoustics, speech and signal pro-
cessing. Taipei, Taiwan, pp. 1601–1604
38. Theodoridis S, Koutroumbas K (2008) Pattern recognition, 4th
edn. Academic Press, Burlington
39. Vezhnevets V, Sazonov V, Andreeva A (2003) A survey on
pixel-based skin color detection techniques. In: Proceedings of
GraphiCon, pp 85–92
40. Young SJ, Evermann G, Gales MJF, Hain T, Kershaw D, Moore
G, Odell J, Ollason D, Povey D, Valtchev V, Woodland PC
(2006) The HTK Book, version 3.4. Cambridge University Press,
Cambridge
41. Zahedi M, Keysers D, Deselaers T, Ney H (2005) Combination of tangent distance and an image distortion model for appearance-based sign language recognition. In: Kropatsch W, Sablatnig R, Hanbury A (eds) Pattern Recognition, Lecture Notes in Computer Science. Springer, Berlin, pp 401–408
42. Zaki MM, Shaheen SI (2011) Sign language recognition using a combination of new vision based features. Pattern Recognit Lett 32(4):572–577