Signer independent isolated Italian sign recognition based on hidden Markov models



SHORT PAPER

Signer independent isolated Italian sign recognition based on hidden Markov models

Marco Fagiani · Emanuele Principi · Stefano Squartini · Francesco Piazza

Received: 5 February 2013 / Accepted: 22 July 2014

© Springer-Verlag London 2014

Abstract Sign languages represent the most natural way for deaf and hard of hearing people to communicate. However, there are often barriers between people using this kind of language and hearing people, who typically express themselves by means of oral languages. To facilitate social inclusiveness in everyday life for deaf minorities, technology can play an important role. Indeed, many attempts have been recently made by the scientific community to develop automatic translation tools. Unfortunately, not many solutions are actually available for the Italian Sign Language (Lingua Italiana dei Segni, LIS) case study, especially concerning the recognition task. In this paper, the authors address this lack, in particular for the signer-independent case study, i.e., when the signers in the testing set are not included in the training set. From this perspective, the proposed algorithm represents the first real attempt in the LIS case. The automatic recognizer is based on Hidden Markov Models (HMMs), and video features have been extracted using the OpenCV open source library. The effectiveness of the HMM system is validated by a comparative evaluation with a Support Vector Machine approach. The video material used to train the recognizer and to test its performance consists of a database that the authors have deliberately created by involving 10 signers, with 147 isolated-sign videos for each signer. The database is publicly available. Computer simulations have shown the effectiveness of the adopted methodology, with recognition accuracies comparable to those obtained by the automatic tools developed for other sign languages.

Keywords Italian sign language (LIS) · Automatic sign recognition · Video feature extraction · Hidden Markov models

1 Introduction

Deaf and hard of hearing people widely use sign languages to communicate [32]. These languages have arisen spontaneously and evolved rapidly all over the globe. As widely accepted by the scientific community, they are characterized by a relevant relationship with the geographic location where they are used and by a strong independence from the oral languages diffused in the same region. It follows that different sign languages have arisen and have been officially recognized in most countries, like the United States, the United Kingdom, Germany, and China.

As in the oral case, a large variety of dialects can be found within the area of influence of the "official" sign language, and this can give rise to misunderstandings among people living in different places in that area. Besides this aspect, the communication between deaf and hearing people is most of the time troublesome, because people using oral languages are typically not well disposed to learn a sign language, even recognizing that this could represent an effective way to enrich their communication capabilities. This inevitably induces the establishment of social inclusion barriers, which the entire population is called to face and likely overcome. The impact of these problems increases considering that interpreters, with their indispensable work, cannot always be present, and that less than 0.1 % of the total population belongs to the deaf community.

Scientists and technicians from all over the world have recently addressed the issue and several solutions have

M. Fagiani · E. Principi (✉) · S. Squartini · F. Piazza
Department of Information Engineering, Università Politecnica delle Marche, Via Brecce Bianche 1, 60131 Ancona, Italy
e-mail: [email protected]


Pattern Anal Applic

DOI 10.1007/s10044-014-0400-z


been proposed to facilitate the usage of sign languages by deaf people and also the interaction between them and hearing people in different communication scenarios. Many automatic translation systems (Sign-to-Sign, Sign-to-Oral, or Oral-to-Sign) have been developed, taking the specificity of the examined languages into account. Typically, these systems require two separate tasks: recognition and synthesis. In this work, the focus is specifically on the former, and a brief state of the art of existing methodologies for automatic recognition of sign languages and related databases is provided in the following, first from an international perspective and then looking closer at the Italian context.

In the international scenario, many research projects can be found. Starner and Pentland [35] were the first to propose a system based on Hidden Markov Models (HMMs) for sign language recognition. The approach is capable of performing real-time sentence recognition of American Sign Language using only hand information such as position, angle of the axis of least inertia, and eccentricity of the bounding ellipse. In a later study [36], the authors presented real-time continuous sentence-level recognition based on two different camera points of view: desk-based and cap-mounted. In [2], a sign recognition system in signer-independent and signer-dependent conditions has been investigated. Both hand-related features, based on a multiple hypotheses tracking approach, and face-related features, based on an active appearance model, are robustly extracted. The classification stage is based on HMMs and is designed for the recognition of both isolated and continuous signs. The work in [37] proposed a so-called product HMM for efficient multi-stream fusion, while [27] discussed the optimal number of states and mixtures to obtain the best accuracy.

SignSpeak [14] is one of the first European funded projects that tackles the problem of automatic recognition and translation of continuous sign languages. Among the works of the project, Dreuw and colleagues [11–13] introduced important innovations: they used a tracking adaptation method to obtain a hand tracking path with optimized tracking position, and they trained robust models through visual signer alignments and virtual training samples.

In [42], Zaki and Shaheen presented a new combination of appearance-based features obtained using Principal Component Analysis (PCA), the kurtosis position, and the Motion Chain Code (MCC). First, they detect the dominant hand using a skin colour detection algorithm; the hand is then tracked by connected component labelling. After that, PCA provides a measure of the hand configuration and orientation, the kurtosis position is used for measuring the edge which denotes the place of articulation, and the MCC represents the hand movement. Several combinations of the proposed features are tested and evaluated, and an HMM classifier is used to recognize signs. Experiments are performed using the RWTH-BOSTON-50 database of isolated signs [41]. In [25, 28], two approaches based on Support Vector Machines (SVMs) for the recognition of the sole hand gesture have been proposed. In the first, a multi-feature classifier combined with spatio-temporal characteristics has been used to realize an automatic recognizer of sign language letters. In the second, a combination of ad hoc weight eigenspace Size Functions and Hu moments has been used.

Taking a closer look at the databases for international sign languages, in [41], as mentioned above, and [15], two free automatic sign language databases are presented, both created as subsets of the general database recorded at Boston University¹. The video sequences have been recorded at 30 frames per second (fps) with a resolution of 312 × 242 pixels. Sentences have been executed by three different people, two men and one woman, all dressed differently and with different brightness of the clothes. The first database, RWTH-BOSTON-50, is made of 483 utterances of 50 signs, with each utterance stored in a separate video file to create a video collection of isolated signs. The second corpus, RWTH-BOSTON-104, is made of 201 sentences (161 for training and 40 for testing), and the dictionary contains 104 different signs. Unlike the first one, each video contains a sequence of signs. The SIGNUM [1] corpus is composed both of isolated signs and of sign sequences executed by 25 persons, and recordings have been conducted in controlled conditions to facilitate the video feature extraction. The videos have been recorded with a resolution of 780 × 580 pixels at 30 fps. The dictionary is made of 450 DGS signs composing a total of 780 sentences with sign sequences.

Regarding the Italian scenario, Infantino et al. [24] presented a signer-dependent recognition framework applied to the Italian sign language (in Italian, Lingua Italiana dei Segni, LIS). Features are composed of the centroid coordinates of the hands, and classification is based on a self-organizing map neural network. In addition, the framework integrates a common-sense engine that selects the right meaning depending on the context of the whole sentence. To the authors' knowledge, this is the only work that has appeared in the literature so far dealing with Italian sign language recognition. Concerning LIS databases, an interesting work has been recently presented within the ATLAS² project [4], which describes the development of a trilingual corpus of the Italian sign language. The database will be used to realize a virtual interpreter, which automatically translates an Italian text into LIS, and it is

¹ http://www.bu.edu/asllrp/ncslgr.html
² http://www.atlas.polito.it/index.php/en/home


publicly available to the community. At present, no information about the number of signers, recording conditions, video properties, and sign dictionary is provided. As stated by [4], it includes many recordings with thousands of signs, but they are executed by single signers, thus representing a limitation for the development of automatic sign recognition systems.

The objective of this work is to propose an automatic recognition system for isolated LIS signs. Differently from [24], the system operates independently of the signer, i.e., it is able to deal with signers not included in the training phase. An HMM approach has been employed on purpose, inspired by other existing tools for oral languages and also for some international sign languages, as mentioned above. The isolated sign case study has been addressed. Due to the lack of a suitable database to train and evaluate the system, a specific LIS database has been created in collaboration with the local office of the Italian National Institute for the Deaf (Ente Nazionale Sordi, ENS). The database, namely A3LIS-147, is composed of more than 1,400 signs performed by 10 different signers, and it is publicly available. The proposed recognizer has been partially presented in [17]; here, a more detailed description of the related algorithms and insights on the attainable performance are provided. In particular, the algorithm parameters have been optimized to further improve the recognition performance, and for an additional validation the approach has also been compared to an SVM recognizer. In addition, each sign of the corpus has been transcribed using the HamNoSys transcription system, and a thorough discussion on the correlation between the edit distance of the sign transcriptions and the error rates is provided. As pointed out in the following sections, the obtained recognition accuracy (above 48 %) is consistent with the results obtained by alternative techniques for other sign languages, thus proving the effectiveness of the approach.

The paper is outlined as follows. In Sect. 2, the main characteristics of the A3LIS-147 database are reviewed. Section 3 reports the algorithms in the two stages composing the automatic LIS recognition system, i.e., the feature extractor and the HMM-based recognizer, whereas in Sect. 4 the experimental tests are described and the related results are commented. Section 5 presents a comparison between the proposed approach and the SVM method. Section 6 concludes the paper, also providing some highlights of future works.

2 A3LIS-147: Italian sign language database

The creation of a signer-independent automatic sign recognition system requires a suitable corpus for training its parameters and assessing its performance. For the isolated sign recognition task, the corpus should contain videos with many different signs executed by multiple signers. The lack of a well-suited corpus for the Italian sign language led the authors to create a new one from scratch.

The database, A3LIS-147, is freely available³ and has been presented in [16]. It consists of 147 distinct isolated signs executed by 10 different persons (7 males and 3 females). Each signer executed all the 147 signs, for a total of 1,470 recorded videos. The ENS⁴ supported the authors of this work in suitably pre-training the subjects and in choosing a meaningful and unambiguous set of signs to be executed. In A3LIS-147, signs are divided in six categories, each related to a different real-life scenario (Table 1): public institute (e.g., "employee"), railway station (e.g., "train"),

³ http://www.a3lab.dii.univpm.it/projects/a3lis
⁴ http://www.ensancona.it

Table 1 The proposed real-life scenarios, number of signs per scenario, and vocabulary examples

Scenario          Signs per scenario   Vocabulary examples
Public institute  39                   Employee, municipality, timetable
Railway station   35                   Train, ticket, departure
Hospital          19                   Emergency, doctor, pain
Highway           8                    Traffic, toll booth, delays
Common life       16                   Water, telephone, rent
Education         30                   School, teacher, exam

Fig. 1 The "silence" position


hospital (e.g., "emergency"), highway (e.g., "traffic"), common life (e.g., "water"), and education (e.g., "school"). In each video, the person executes a single sign, preceded and followed by the "silence" sign shown in Fig. 1. This sign has been chosen in agreement with the ENS, since it represents a common "rest" position in everyday conversations. Recordings have been supported by a sequence of slides projected in front of the person executing the sign. Each slide shows the text and the picture of the current sign as well as a timer to discriminate the three phases of the execution: silence, sign execution, silence.

Recordings have been performed in controlled conditions: as Fig. 1 shows, the signers wear dark long-sleeved shirts, a green chroma-key background is placed behind them, and uniform lighting is provided by two diffuse lights. Videos have been captured at 25 fps with a resolution of 720 × 576 pixels and stored in Digital Video format. For additional details on A3LIS-147, please refer to [16].

3 Description of the proposed system

The structure of the proposed system is shown in Fig. 2. Similarly to the "isolated word recognition" task in speech recognition, the system performs isolated sign recognition, i.e., it recognizes a single sign present in an input video. The main processing blocks of the system are: Face and Hands Detection (FHD), Feature Extraction (FE), and Sign Recognition (SR).

The FHD block discriminates the regions of interest in each video frame. The regions are represented by the hands and the face of a person, since they convey the information needed to discriminate signs. Features are then extracted in the FE stage from the pixels belonging to these areas, and for each input frame a single feature vector is created. The classification stage is based on HMMs. For each sign, an HMM is created in the training phase using signs executed by different persons. The SR phase operates by calculating the output probabilities for each model, and then selecting the model which scored best.

3.1 FHD

The detection of hand and face areas is performed on each input frame using a skin detection algorithm. This operates in the YCbCr colour space, since it gives better performance with respect to the Red Green Blue colour space [19]. A pixel whose coordinates are (Y, Cb, Cr) belongs to the skin region if Y > 80, 85 < Cb < 135, and 135 < Cr < 180, with Y, Cb, Cr ∈ [0, 255] [26]. Pixels that satisfy this condition are included in the skin mask image.
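The thresholding rule above can be sketched in a few lines of NumPy (an illustration only; the paper works on real video frames via OpenCV, while the frame here is a tiny synthetic array):

```python
import numpy as np

def skin_mask(ycbcr):
    """Binary skin mask from an H x W x 3 YCbCr frame (values in [0, 255]).
    Implements the paper's rule: Y > 80, 85 < Cb < 135, 135 < Cr < 180."""
    Y = ycbcr[..., 0].astype(int)
    Cb = ycbcr[..., 1].astype(int)
    Cr = ycbcr[..., 2].astype(int)
    return (Y > 80) & (85 < Cb) & (Cb < 135) & (135 < Cr) & (Cr < 180)

# A 1x2 synthetic frame: one skin-like pixel, one background pixel
frame = np.array([[[120, 100, 150], [50, 100, 150]]], dtype=np.uint8)
mask = skin_mask(frame)
```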

Face and hands are then determined by choosing the regions with the largest areas (Fig. 3b). These areas can be inhomogeneous, so they are merged by means of "morphological transformations". Notice that here "morphological transformations" refer to image processing operations, and they must not be confused with sign language morphology. These operations consist in a closing operation, produced by the combination of the dilation and erosion operators [7, 34]. Dilation consists of the convolution of an image with a kernel (e.g., a small solid square or disk with the anchor point at its center) which acts as a local maximum operator. The effect is that bright regions of an image widen. Erosion is the converse operation and has the effect of reducing the area of bright regions. The closing operation employed here is produced by first dilating and then eroding the areas detected as skin. The final effect is the removal of areas that arise from noise, and the union of nearby areas that belong to the same region of interest.
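The closing step can be sketched with plain NumPy on a binary mask (a hedged illustration; on real masks OpenCV's `cv2.morphologyEx` with `cv2.MORPH_CLOSE` performs the same operation):

```python
import numpy as np

def dilate(mask, k=1):
    # Local-maximum filter: a pixel becomes True if any pixel in its
    # (2k+1)x(2k+1) neighbourhood is True, so bright regions widen.
    p = np.pad(mask.astype(bool), k)
    H, W = mask.shape
    out = np.zeros((H, W), dtype=bool)
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            out |= p[dy:dy + H, dx:dx + W]
    return out

def erode(mask, k=1):
    # Erosion is the converse of dilation: dilate the complement, then invert.
    return ~dilate(~mask.astype(bool), k)

def close_mask(mask, k=1):
    # Closing = dilation followed by erosion; joins nearby skin areas.
    return erode(dilate(mask, k), k)

# Two skin pixels separated by a one-pixel gap are merged by closing
m = np.zeros((5, 7), dtype=bool)
m[2, 2] = m[2, 4] = True
closed = close_mask(m)
```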

The overlap of two or more regions of interest results in a loss of information. If one hand overlaps with the other or with the face, it can be difficult to determine its position. This is why such a situation has to be carefully detected and handled. In this work, the distance between the camera and the signer is supposed to vary by a small percentage from frame to frame. Considering also that the skin mask contains three areas when no overlap occurs, an overlap can be detected following these rules:

Fig. 2 The proposed isolated sign recognition system


1. If only one skin region is present in the skin mask, a face–hands overlap is detected;
2. Otherwise, if two regions are present and the face area increased by a certain percentage with respect to the previous frame, a face–hand overlap is detected;
3. Otherwise, if two regions are present and the face area did not increase, a hand–hand overlap is detected;
4. Otherwise, if three regions are present, no overlap is detected.

When a hand–hand overlap is detected, the system assigns the same area to both hands, whereas, in the event of a face–hand overlap, it assigns the face area to the overlapped hand. Whenever a face–hands overlap is detected, both hand areas coincide with the face one.
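The four rules can be condensed into a small decision function (a sketch; `growth`, the face-area increase threshold, is a hypothetical value, since the paper only speaks of "a certain percentage"):

```python
def overlap_state(num_regions, face_area, prev_face_area, growth=1.2):
    """Classify the overlap condition of a frame following the paper's rules.
    `growth` is a hypothetical threshold on the relative face-area increase."""
    if num_regions == 1:
        return "face-hands"        # rule 1: a single skin region
    if num_regions == 2:
        if face_area > growth * prev_face_area:
            return "face-hand"     # rule 2: face area grew noticeably
        return "hand-hand"         # rule 3: face area did not increase
    return "none"                  # rule 4: three separate regions
```

The function would be called once per frame, feeding the face area measured on the previous frame as `prev_face_area`.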

3.2 Feature extraction

Table 2 shows the entire feature set calculated from the regions of interest. The set can be divided into two different subsets: the first comprises features 1–8 and includes the centroid coordinates of the hands with respect to the face, their derivatives, and the distance between the two hands. The second subset comprises features 9–17 and represents the general characteristics of the regions, such as the area, the shape type, and the 2-D spatial orientation of the hands.

Features 1–3, features 4–5, and features 9 and 10 differ for the region of interest employed in their computation: features 1–3 and feature 9 exploit the Canny algorithm [8] applied within the bounding rectangle of the hands; features 4–8 and feature 10 directly use the contours segmented from the skin mask. Figure 4 shows an example extracted from the A3LIS-147 database, where the detected area rectangles and the obtained contours are displayed. The contours are then used to calculate the region centroids and the central moments. The centroid is calculated as follows:

\bar{x} = \frac{m_{10}}{m_{00}}, \qquad (1)

\bar{y} = \frac{m_{01}}{m_{00}}, \qquad (2)

where m_{ij} are the spatial moments defined from the pixel intensities I(x, y) as:

m_{ij} = \sum_{x} \sum_{y} I(x, y)\, x^{i} y^{j}. \qquad (3)

In the equation, x and y are the pixel abscissa and ordinate, and I(\cdot, \cdot) \in [0, 255]. The central moments \mu_{ij} are defined as:

\mu_{ij} = \sum_{x} \sum_{y} I(x, y)\, (x - \bar{x})^{i} (y - \bar{y})^{j}. \qquad (4)
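Equations (1)–(4) translate directly into NumPy (a sketch; in practice OpenCV's `cv2.moments` returns the same quantities from a contour or mask):

```python
import numpy as np

def spatial_moment(I, i, j):
    # Eq. (3): m_ij = sum_x sum_y I(x, y) x^i y^j  (x: column, y: row)
    ys, xs = np.mgrid[0:I.shape[0], 0:I.shape[1]]
    return float(np.sum(I * xs ** i * ys ** j))

def centroid(I):
    # Eqs. (1)-(2): centroid as the ratio of first- to zero-order moments
    m00 = spatial_moment(I, 0, 0)
    return spatial_moment(I, 1, 0) / m00, spatial_moment(I, 0, 1) / m00

def central_moment(I, i, j):
    # Eq. (4): moments taken about the centroid
    xb, yb = centroid(I)
    ys, xs = np.mgrid[0:I.shape[0], 0:I.shape[1]]
    return float(np.sum(I * (xs - xb) ** i * (ys - yb) ** j))

# A uniform 3x3 patch has its centroid at the central pixel
patch = np.ones((3, 3))
```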

Regarding the face centroid (x_f, y_f), its position is almost stationary, but an overlap with the hands can cause incongruous deviations. This effect has been reduced by smoothing the centroid coordinates at frame t as follows:

(x'_f, y'_f)_t = \beta (x_f, y_f)_t + (1 - \beta)(x'_f, y'_f)_{t-1}, \qquad (5)

Fig. 3 A frame of the sign "casa". a The original video frame. b The detected skin regions. c The skin regions as modified after the morphological transformation

Table 2 The entire feature set

1  Hands centroid normalized with respect to face region width (Canny-Filter)
2  Derivatives of feature 1
3  Normalized hands centroid distance (Canny-Filter)
4  Hands centroid normalized with respect to face region width
5  Derivatives of feature 4
6  Normalized hands centroid distance
7  Hands centroid referred to the face centroid
8  Derivatives of feature 7
9  Face and hands Hu-moments (Canny-Filter)
10 Face and hands Hu-moments
11 Hands area normalized with respect to the face area
12 Derivatives of feature 11
13 Face and hands compactness 1
14 Face and hands eccentricity 1
15 Face and hands compactness 2
16 Face and hands eccentricity 2
17 Hands orientation

"Canny-Filter" indicates that features have been calculated after applying the Canny algorithm to the detected regions


where \beta \in [0, 1] and has been set to 0.1 in the experiments. In features 1–8, centroid coordinates are normalized with respect to the face region width w_f, so that they are independent of the distance from the camera and of the signer's height.
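Equation (5) is a simple exponential smoothing of the face centroid; a minimal sketch with the paper's \beta = 0.1:

```python
def smooth_centroid(curr, prev_smoothed, beta=0.1):
    # Eq. (5): the smoothed face centroid blends the current measurement
    # with the previous smoothed value, damping overlap-induced jumps.
    x, y = curr
    xs, ys = prev_smoothed
    return (beta * x + (1 - beta) * xs, beta * y + (1 - beta) * ys)

# A sudden jump of the measured centroid is strongly attenuated
smoothed = smooth_centroid((10.0, 10.0), (0.0, 0.0))
```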

Derivatives in features 2 and 5 are calculated as the difference between the coordinates at frame t and at frame t − 1:

\dot{x}_t = x_t - x_{t-1}, \qquad (6)

\dot{y}_t = y_t - y_{t-1}. \qquad (7)

The normalized Euclidean distance between the hands (features 3 and 6) is computed as follows:

d = \frac{\sqrt{|(x_l, y_l) - (x_r, y_r)|^2}}{w_f}, \qquad (8)

where the l and r subscripts denote the centroid coordinates of the left and right hand, respectively.
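Equations (6)–(8) amount to frame differences and a scaled Euclidean norm; a minimal sketch:

```python
import math

def delta(curr, prev):
    # Eqs. (6)-(7): frame-to-frame differences of the centroid coordinates
    return curr[0] - prev[0], curr[1] - prev[1]

def normalized_hand_distance(left, right, face_width):
    # Eq. (8): Euclidean distance between the hand centroids,
    # normalized by the face region width w_f
    return math.hypot(left[0] - right[0], left[1] - right[1]) / face_width

d = normalized_hand_distance((3.0, 4.0), (0.0, 0.0), 5.0)
```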

In the computation of feature 7, the origin of the coordinate system is represented by the face area centroid (x_f, y_f), and feature 8 is obtained by calculating its derivative as in Eq. (6).

Regarding the second subset, features 9 and 10 are the Hu moments [23]: these are invariant to translation, changes in scale, and rotation. Both for the face and the hands regions, seven different values are computed:

I_1 = \mu_{20} + \mu_{02}, \qquad (9)

I_2 = (\mu_{20} - \mu_{02})^2 + 4\mu_{11}^2, \qquad (10)

I_3 = (\mu_{30} - 3\mu_{12})^2 + (3\mu_{21} - \mu_{03})^2, \qquad (11)

I_4 = (\mu_{30} + \mu_{12})^2 + (\mu_{21} + \mu_{03})^2, \qquad (12)

I_5 = (\mu_{30} - 3\mu_{12})(\mu_{30} + \mu_{12})\left[(\mu_{30} + \mu_{12})^2 - 3(\mu_{21} + \mu_{03})^2\right] + (3\mu_{21} - \mu_{03})(\mu_{21} + \mu_{03})\left[3(\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2\right], \qquad (13)

I_6 = (\mu_{20} - \mu_{02})\left[(\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2\right] + 4\mu_{11}(\mu_{30} + \mu_{12})(\mu_{21} + \mu_{03}), \qquad (14)

I_7 = (3\mu_{21} - \mu_{03})(\mu_{30} + \mu_{12})\left[(\mu_{30} + \mu_{12})^2 - 3(\mu_{21} + \mu_{03})^2\right] - (\mu_{30} - 3\mu_{12})(\mu_{21} + \mu_{03})\left[3(\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2\right]. \qquad (15)

In the equations, \mu_{ij} is the respective central moment defined in Eq. (4), and I_7 is useful in distinguishing mirror images.
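Given the central moments of Eq. (4), the seven invariants can be computed directly (a sketch; OpenCV provides the same values via `cv2.HuMoments`):

```python
def hu_invariants(mu):
    """Hu invariants of Eqs. (9)-(15) from a dict of central moments mu[(i, j)]."""
    m = lambda i, j: mu[(i, j)]
    a = m(3, 0) + m(1, 2)          # recurring sum mu30 + mu12
    b = m(2, 1) + m(0, 3)          # recurring sum mu21 + mu03
    I1 = m(2, 0) + m(0, 2)
    I2 = (m(2, 0) - m(0, 2)) ** 2 + 4 * m(1, 1) ** 2
    I3 = (m(3, 0) - 3 * m(1, 2)) ** 2 + (3 * m(2, 1) - m(0, 3)) ** 2
    I4 = a ** 2 + b ** 2
    I5 = ((m(3, 0) - 3 * m(1, 2)) * a * (a ** 2 - 3 * b ** 2)
          + (3 * m(2, 1) - m(0, 3)) * b * (3 * a ** 2 - b ** 2))
    I6 = (m(2, 0) - m(0, 2)) * (a ** 2 - b ** 2) + 4 * m(1, 1) * a * b
    I7 = ((3 * m(2, 1) - m(0, 3)) * a * (a ** 2 - 3 * b ** 2)
          - (m(3, 0) - 3 * m(1, 2)) * b * (3 * a ** 2 - b ** 2))
    return [I1, I2, I3, I4, I5, I6, I7]

# A region symmetric in x and y: only I1 is non-zero
mu_sym = {(2, 0): 2.0, (0, 2): 2.0, (1, 1): 0.0,
          (3, 0): 0.0, (2, 1): 0.0, (1, 2): 0.0, (0, 3): 0.0}
inv = hu_invariants(mu_sym)
```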

Feature 11 is obtained by calculating the areas of the recognized skin zones as the total number of detected pixels, and the derivatives (feature 12) are calculated similarly to Eq. (6):

\dot{A}_t = A_t - A_{t-1}, \qquad (16)

where A_t denotes the area of a region of interest at time frame t.

Compactness c_1 (feature 13) and eccentricity e_1 (feature 14) of face and hands have been proposed in [37]. The first is computed as the ratio between the minor (Axis_min) and major (Axis_max) axis lengths of the rotated rectangle that inscribes the detected region:

c_1 = \frac{\mathrm{Axis}_{\min}}{\mathrm{Axis}_{\max}}. \qquad (17)

Eccentricity is calculated from the area A and the perimeter P of the region as follows:

e_1 = \frac{A_{\mathrm{norm}}}{P^2}, \qquad (18)

where A_norm is obtained by dividing the area A by the face area, and the perimeter P is the length of the region contour used in feature 4.

Features 15–17 have been introduced in [2] and are defined as follows:

c_2 = \frac{4\pi A_{\mathrm{norm}}}{P^2}, \qquad (19)

Fig. 4 Contours detection from the skin regions (a). Bounding rectangle of the hands skin zones (b). The hands edge obtained by the Canny algorithm (c)


e_2 = \frac{(\mu_{20} - \mu_{02})^2 + 4\mu_{11}}{A_{\mathrm{norm}}}, \qquad (20)

o_1 = \sin(2\alpha), \quad o_2 = \cos(\alpha). \qquad (21)

Equations (19) and (20) represent alternative definitions of compactness and eccentricity. Equation (21) defines the hand orientation, obtained from the difference of the angle between the image vertical axis and the main axis of the rotated rectangle that inscribes the hand region. In the equation, the angle \alpha \in [-90°, 90°], and the orientation is split into o_1 and o_2 to ensure stability at the interval borders.
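Equations (17)–(21) reduce to a few scalar formulas (a sketch; the axis lengths and the angle would come, for instance, from OpenCV's `cv2.minAreaRect` on the region contour):

```python
import math

def compactness1(axis_min, axis_max):
    # Eq. (17): ratio of the rotated bounding rectangle's axis lengths
    return axis_min / axis_max

def eccentricity1(area_norm, perimeter):
    # Eq. (18): normalized area over squared perimeter
    return area_norm / perimeter ** 2

def compactness2(area_norm, perimeter):
    # Eq. (19): equals 1 for a perfect disc
    return 4 * math.pi * area_norm / perimeter ** 2

def orientation(alpha_deg):
    # Eq. (21): the angle is split into sin(2a) and cos(a) so the
    # features stay stable at the +/-90 degree borders
    a = math.radians(alpha_deg)
    return math.sin(2 * a), math.cos(a)

# A disc of radius r: area = pi r^2, perimeter = 2 pi r -> compactness2 = 1
c2 = compactness2(math.pi, 2 * math.pi)
o1, o2 = orientation(0.0)
```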

3.3 Sign recognition

Let S = \{s_1, s_2, \ldots, s_N\} be the set of signs and N be the total number of signs. Each sign s_i is modelled with a different HMM \lambda_i. To capture the sequential nature of a sign execution, a left–right structure has been chosen on purpose (Fig. 5). State emission probability density functions are represented by mixtures of Gaussians with diagonal covariance matrices. In defining the sign model, the number of states and the number of Gaussians in the mixture must be set beforehand. These parameters have been determined experimentally, as described in Sect. 4.

Denoting with O the observation sequence, sign recognition is performed by calculating the probability P(O | \lambda_i) \; \forall i \in \{1, 2, \ldots, N\} through the Viterbi algorithm and then selecting the model whose probability is maximum.
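The decision rule can be sketched with a generic log-domain Viterbi pass (an illustration with precomputed per-frame log emission scores, not the actual HTK-trained Gaussian-mixture models used in the paper):

```python
import numpy as np

def viterbi_loglik(log_pi, log_A, log_B):
    """Log-likelihood of the best state path.
    log_pi: (S,) initial log probabilities; log_A: (S, S) log transitions;
    log_B: (T, S) per-frame log emission scores of the observation sequence."""
    d = log_pi + log_B[0]
    for t in range(1, log_B.shape[0]):
        d = np.max(d[:, None] + log_A, axis=0) + log_B[t]
    return float(np.max(d))

def recognize(sign_models, log_B_per_sign):
    # Score the observation sequence against every sign HMM lambda_i
    # and return the index of the best-scoring model.
    scores = [viterbi_loglik(pi, A, B)
              for (pi, A), B in zip(sign_models, log_B_per_sign)]
    return int(np.argmax(scores))

# Toy example: two single-state models; the first explains the frames better
pi = np.array([0.0])
A = np.array([[0.0]])
good = np.log(np.full((3, 1), 0.9))
bad = np.log(np.full((3, 1), 0.1))
best = recognize([(pi, A), (pi, A)], [good, bad])
```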

4 Experiments

Video processing has been performed by means of the open source computer vision library OpenCV [6]. The aforementioned A3LIS-147 database has been used both for training the recognizer and for evaluating its performance. The HMM-based sign recognition stage has been implemented using the HMM toolkit [40].

The first experiment studies how the number of states and of Gaussians per state in the sign models affects the recognition performance. The second experiment analyses the performance of the system for different combinations of features. The last experiment investigates the optimal number of states and mixtures for the selected features and performs the final test.

All the experiments have been performed following the three-way data split approach [31] combined with the K-fold cross-validation method [38]. The dataset is split into $K = 10$ folds, each one composed of all the signs executed by the same signer. For each parameter combination, two different folds are chosen iteratively as validation and test sets. The remaining $K - 2$ folds form the training set, and the algorithm parameters are determined on the validation fold. In the feature selection, each combination is tested with the same method. The final system performance is evaluated using one fold as test set and the remaining $K - 1$ folds as training set. Note that, differently from [24], the independence between training, validation, and test sets is always guaranteed.
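The signer-wise split can be sketched as below. This is one plausible reading of the procedure (ordered validation/test pairings); the signer IDs match Table 3.

```python
from itertools import permutations

def three_way_splits(signers):
    """Yield (train, validation, test) partitions where each fold holds all
    the signs of one signer, so train/validation/test signers never mix."""
    for val, test in permutations(signers, 2):
        train = [s for s in signers if s not in (val, test)]
        yield train, val, test

# The 10 A3LIS-147 signer IDs from Table 3.
signers = ["FAL", "FEF", "FSF", "MDP", "MDQ",
           "MIC", "MMR", "MRLA", "MRLB", "MSF"]
```

With K = 10 signers this yields 90 (validation, test) pairings, each training on the remaining 8 signers.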

The performance has been evaluated using the sign recognition accuracy, defined on the basis of the number of substitution (S), deletion (D), and insertion (I) errors, as in speech recognition tasks [40]:

$$\text{Accuracy} = \frac{N - D - S - I}{N}, \qquad (22)$$

where N is the total number of labels in the reference transcriptions. It is worth pointing out that in the isolated sign recognition task addressed in this work, only substitution errors are present.
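Eq. (22) amounts to the following; in the isolated task, D = I = 0, so accuracy reduces to the fraction of correctly recognized signs.

```python
def recognition_accuracy(n_labels, substitutions, deletions=0, insertions=0):
    """Sign recognition accuracy as in Eq. (22): (N - D - S - I) / N,
    with N the total number of reference labels."""
    return (n_labels - deletions - substitutions - insertions) / n_labels
```

For instance, with N = 147 and 76 substitutions (illustrative numbers, not taken from the paper's tables), the accuracy is 71/147, about 48.3 %.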

4.1 Model parameters selection

To find an estimate of the number of states and of the number of gaussians per state of the sign models, the system has been tested with different combinations of these parameters. In particular, the number of states has been varied from 1 to 20 and the number of gaussians from 1 to 8. Observations are represented by features 4, 5, and 6.

Figure 6 shows the obtained results: as expected, increasing the number of states and the number of gaussians per state does not always produce better performance.

Fig. 5 A five-state left–right HMM. States I and F are non-emitting

Fig. 6 Sign recognition accuracy for different numbers of states and mixtures using the three-way data split and the K-fold cross-validation methods. The circle indicates the selected operating point

Pattern Anal Applic

123

Page 8: Signer independent isolated Italian sign recognition based on hidden Markov models

In particular, as long as the model has few parameters to train, i.e., the number of states is less than 5, the system performance increases with the number of states or mixtures. Beyond this value, the training data are not sufficient to accurately estimate the model parameters. As shown in Fig. 6, increasing the number of mixtures does not produce a further performance improvement when the number of states is higher than 7. As a starting point for the feature selection, an HMM with 18 states and 1 mixture has been chosen. This combination gives a recognition accuracy of 36.43 % (highlighted with a circle in Fig. 6).

4.2 Feature selection

To study which combination of features gives the overall best performance in the proposed system, the Sequential Forward Selection (SFS) technique has been followed [38]. SFS is suboptimal, but requires a lower computational burden with respect to full-search approaches. SFS consists of the following steps:

– Definitions
  – M: total number of features.
– Initialization
  – Create a set of M feature vectors, each composed of a single feature.
– Algorithm
  – for i = 1 to M - 1
    1. Compute the criterion value for each feature vector of the set.
    2. Select the winner vector, i.e., the one whose criterion value is greatest.
    3. Create a new set of M - i feature vectors by adding each of the excluded features to the winning vector.
  – end
– Select the feature vector whose criterion value is greatest among all the created feature vectors.

In the experiments, the sign recognition accuracy has been used as the criterion value.
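The SFS loop can be sketched as follows; here `evaluate` stands for a hypothetical routine that trains and scores the recognizer on a candidate feature subset, returning the criterion value (here, the recognition accuracy).

```python
def sequential_forward_selection(all_features, evaluate):
    """Greedy SFS: grow the selected subset one feature at a time and
    return the subset whose criterion value was highest overall."""
    best_overall, best_overall_score = None, float("-inf")
    selected, remaining = [], list(all_features)
    while remaining:
        # Score each candidate obtained by adding one excluded feature.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, winner = max(scored)
        selected.append(winner)
        remaining.remove(winner)
        if score > best_overall_score:
            best_overall, best_overall_score = list(selected), score
    return best_overall, best_overall_score
```

Each outer iteration corresponds to one column of Fig. 7, and the returned subset is the one picked in the final step of the algorithm above.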

The results for each combination of features have been

evaluated by means of the three-way data split combined

with the K-fold cross-validation method, as illustrated in

Sect. 4.

Figure 7 shows the results obtained for the 17 iterations of the SFS technique. Each column of the chart reports the accuracy averaged over the three-way data split with K-fold cross-validation steps. The figure shows that adding more features consistently improves the performance until a maximum is reached at iteration 7. Then, the accuracy degrades systematically, demonstrating that the additional features do not improve the recognition capabilities of the system. The feature vector that gives the overall best accuracy is composed of features 5, 6, 7, 8, 12, 13, and 14, for a total of 21 coefficients.

Table 3 shows the detailed results for each signer. As the table shows, the best sign recognition accuracy is obtained by signer MRLB using features 5, 7, and 13. As aforementioned, the overall best accuracy is obtained at iteration number seven. It is interesting to note the high degree of variability across the signers: the recognition accuracy standard deviation at iteration seven is 11.01 %, suggesting that further normalization techniques could be employed to improve signer independence.

4.3 Model parameters refinement

To find the overall best sign recognition accuracy obtainable with the proposed system, an additional experiment has been performed. Here, the feature set determined in the previous section is employed, and the same procedure described in Sect. 4 has been followed to determine the number of states and the number of gaussians per state of the HMM sign model. The results in Table 4 show that the model with 18 states and 1 mixture per state still produces the best performance for the selected features.

Figure 8 shows the final results for each signer using the

determined model. The final average recognition accuracy

of the system is 48.06 %.

4.4 Discussion

Better insights on the performance of the HMM-based system can be inferred by analyzing the correlation between the edit distances of the sign transcriptions and the related error rates. Every sign has been transcribed using the

Fig. 7 Sign recognition accuracy for the 17 SFS iterations


HamNoSys transcription system [22] (the complete transcriptions are shown in the Appendix), and calculating the edit distance between a pair of signs gives valuable information about their confusability. Since HamNoSys describes the movements of both hands in a single string, each transcription has been decomposed into two substrings, each related to a different hand. If the sign is executed with one hand, an empty string has been assigned to the corresponding unused hand. The distance between signs has been computed comparing only the strings that belong to the same hand, and the resulting edit distance is given as the sum of the values obtained for each hand. The edit distance has been calculated giving an equal cost of 1 to character insertions and deletions and a cost of 2 to substitutions.
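The weighted edit distance described above can be sketched with a standard dynamic-programming routine; `sign_distance` sums the per-hand values, using an empty string for an unused hand. The transcription strings themselves are hypothetical placeholders for HamNoSys symbol sequences.

```python
def edit_distance(a, b, ins=1, dele=1, sub=2):
    """Weighted Levenshtein distance: insertions/deletions cost 1,
    substitutions cost 2, as in the experiments."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,      # deletion
                          d[i][j - 1] + ins,       # insertion
                          d[i - 1][j - 1] + cost)  # match/substitution
    return d[m][n]

def sign_distance(sign_a, sign_b):
    """Total distance between two signs given as (right_hand, left_hand)
    transcription pairs; only strings of the same hand are compared."""
    return sum(edit_distance(ha, hb) for ha, hb in zip(sign_a, sign_b))
```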

The relationship between the error rate and the edit distance has been obtained as follows: denoting with $D(s_i, s_j)$ the edit distance between signs $s_i$ and $s_j$, and with

$$S_d = \{(s_i, s_j) \mid D(s_i, s_j) = d,\ i = 1, \ldots, N,\ j = 1, \ldots, N,\ i \neq j\}, \qquad (23)$$

the set of sign pairs having edit distance $d$, the average error rate for the edit distance $d$ can be defined as:

$$E(d) = \frac{1}{|S_d|} \sum_{\forall (s_i, s_j) \in S_d} e(s_i, s_j), \qquad (24)$$

where $|S_d|$ indicates the number of elements in $S_d$, and

$$e(s_i, s_j) = \frac{A(i, j)}{\sum_{j=1}^{N} A(i, j)}, \quad i \neq j. \qquad (25)$$

In the equation, the term $A(i, j)$ indicates the number of times sign $s_i$ has been classified as sign $s_j$. The matrix $A = \{A(i, j)\}_{i,j=1}^{N}$ is the confusion matrix. Figure 9 shows how $E(d)$ varies for the different values of the edit distance. As expected, there is an inverse relationship between the edit distance and the error rate. Although this is particularly evident for values greater than 10, the behaviour is more irregular when the distance is below this value. For example, the error rate is 10 % when the edit distance is 2, while it reaches 20 % when the edit distance increases to 5.
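Eqs. (23)–(25) can be computed directly from the confusion matrix A and a matrix of pairwise edit distances D; a minimal sketch:

```python
import numpy as np

def average_error_rate_per_distance(A, D):
    """E(d) of Eq. (24): mean of e(s_i, s_j) = A[i, j] / sum_j A[i, j]
    over all ordered sign pairs (i != j) with edit distance D[i][j] == d."""
    A = np.asarray(A, dtype=float)
    row_sums = A.sum(axis=1)
    rates = {}
    n = A.shape[0]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            rates.setdefault(D[i][j], []).append(A[i, j] / row_sums[i])
    return {d: float(np.mean(v)) for d, v in rates.items()}
```

Distances with no sign pair (the "X" positions of Fig. 9) simply do not appear as keys of the returned dictionary.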

The authors presume that these errors can be ascribed to the overlap between face and hands and between the two

Table 3 Sign recognition accuracy (%) for each A3LIS-147 signer

Signer Iteration

1 2 3 4 5 6 7

FAL 42.76 45.83 47.59 47.44 49.35 51.42 52.03

FEF 15.86 23.83 25.06 25.37 26.52 23.30 23.07

FSF 26.90 31.34 34.02 35.32 33.71 34.48 34.79

MDP 34.03 40.28 44.83 45.99 49.15 52.55 52.39

MDQ 41.38 49.35 51.34 51.49 51.65 52.79 53.94

MIC 38.39 39.69 40.00 41.76 42.84 45.37 45.98

MMR 47.82 52.41 54.94 56.93 54.17 58.70 58.39

MRLA 33.57 41.96 45.22 50.35 54.16 50.19 50.35

MRLB 43.30 56.93 57.47 57.32 57.09 56.93 57.32

MSF 42.68 46.44 45.75 48.58 50.88 51.49 52.34

Average 36.67 42.81 44.62 46.06 46.95 47.72 48.06

Std Dev 9.50 9.80 9.68 9.79 9.80 10.88 11.01

Table 4 Average sign recognition accuracy (%) for different number

of states and gaussians using features 5, 6, 7, 8, 12, 13 and 14

States Gaussians

1 2 3 4

15 48.02 47.32 44.49 41.34

16 47.69 47.11 44.25 41.00

17 47.97 47.20 44.26 40.35

18 48.06 47.02 43.81 40.06

19 47.98 46.63 43.10 39.69

20 46.97 46.31 42.92 38.92

21 46.82 44.65 41.06 37.95

Fig. 8 Overall sign recognition accuracy using 18 states and 1

mixture

Fig. 9 Correlation between edit distance and error rate. The "X" indicates that at positions 1, 3, and 4 there are no sign pairs having these distance values


hands, and to the inability to discriminate the 3-D position

of the subject.

As an example, consider the pairs "Lunedì–Sabato" and "Lunedì–Venerdì". The first has edit distance 2 and an error rate of 10 %, while the second has edit distance 5, but the error rate increases to 20 %. Observing their transcriptions in Table 5, it can be observed that all the signs of the days of the week have similar arm movements and the information is entirely contained in the hand shapes. However, considering the signs "Lunedì" and "Venerdì", the position of the right hand and its overlap with the face are very similar. This can be noticed by observing the frame sequence excerpts in Fig. 10.

Analysing the results reported in Fig. 8, it is evident that the worst accuracy values are always scored by signer FEF. The authors suppose that the cause is again related to the overlap between head and hands. In fact, this signer often executes the signs with the hands closer to the head region with respect to the other signers of the corpus. The consequence is an increase in the number of overlaps between head and hands, i.e., a loss of useful information.

4.4.1 Related background

In this subsection, the obtained results are discussed in the light of the state-of-the-art contributions currently present in the literature. Table 6 shows the accuracy obtained by each method, the characteristics of the corpus employed in the experiments, whether the recognition is performed at the word or sentence level, and whether the method operates in a signer-dependent or signer-independent scenario. First of all, let us consider the first Italian sign language recognition work, proposed by [24]. As mentioned above, such a framework deals with the continuous sign case study. It extracts geometric features, employs a neural network for the classification stage, and uses a database with a vocabulary of 40 signs. No specific distinction is made among the signers involved in the training and testing of the system, in contrast to the proposed approach, which is purely signer-independent. The presented results show an overall accuracy in terms of correctly translated sentences using two different data sets. In the first, the test set is

Fig. 10 Frame sequence excerpts of the words "Lunedì" (left) and "Venerdì" (right)

Table 5 Recall values for each day-of-week sign and the HamNoSys transcriptions


composed of 30 sentences with 20 distinct signs, and the obtained accuracy value is 83.30 %. In the second, the test set is composed of 80 sentences and 40 distinct signs, and the accuracy value is 82.50 %.

Looking at the international projects dealing with the signer-independent isolated sign recognition case study, the work described in [2] operates using manual features in controlled environments. Experiments conducted on a database involving four signers give an average recognition accuracy of about 45.23 %, which is consistent with the performance obtained by the HMM system proposed in this paper, whose accuracy is 48.06 %.

Hidden Markov model-based recognition applied to sign language was first proposed by [35]. To obtain real-time execution, the signer wore distinctly coloured gloves on each hand, and frames were deleted when the tracking of one or both hands was lost. They executed an experiment using 395 training sentences and 99 independent test sentences with a vocabulary of 40 words. The system achieved an accuracy of 99.2 % using grammar modelling and 91.3 % with free grammar. In a successive study [36], the authors introduced two different datasets, changed the camera mount point, and recorded the same sentences: 384 sentences were used for training and 94 were reserved for testing. The testing sentences were not used in any portion of the training process. For the second-person view (camera mounted on the desk), the performance reaches an accuracy of 91.9 % with grammatical rules and 79.5 % unrestricted. As for the first-person view (hat-mounted camera), the obtained accuracy is 97.8 % with grammatical rules and 96.8 % unrestricted.

In [42], a simple feature combination has been studied to improve the performance of the recognition system. The authors have obtained relevant results, with an overall accuracy of 89.09 % using the combination of PCA, the kurtosis position feature vector, and the MCC feature. They also used an HMM classifier, but a comparison between these results and the ones described in this paper is not feasible because of the wide difference between the used databases. Moreover, it must be observed that, as clearly stated in [42], "the system vocabulary is restricted to avoid hand over face occlusion, location interchange of dominant hand and right hand", in contrast to what has been done in the present contribution, where all signs have been considered for training and testing.

Noteworthy are the studies proposed by [18] and [3], where alternative acquisition techniques have been investigated. In [18], two CyberGloves and three Polhemus 3SPACE position trackers have been used as input devices. The gloves collect the information about hand shapes, and the trackers collect the orientation, position, and movement trajectory information. The classification stage is based on a novel combination of HMMs and self-organizing feature maps to implement a signer-independent recognizer of Chinese sign language. All the experiments have been conducted on a database where 7 signers performed 208 signs 3 times each. The tests presented an overall accuracy of 88.20 % for the unregistered test set.

A system for continuous sign language recognition has been introduced in [3]. The main aspect of the proposed approach lies in the acquisition level, where an electromyographic (EMG) data-glove has been used. The EMG allows a good gesture segmentation, based on muscle activity. The classification phase is divided into two distinct levels: gesture recognition and language modelling. The former is based on a four-channel HMM classifier, where each channel identifies one gesture articulation:

Table 6 Summary of the existing contributions in the state of the art

Contr.  Accuracy (%)   V/Se/Si       Rec. type   Signer       Device
[2]     45.23          ~220/–/4      Word        Independent  Camera
[18]    88.20          208/4368/7    Word        Independent  Gloves & trackers
[24]    83.30–82.50    40/30–80/2    Sentence    Dependent    Camera
[42]    89.09          30/110/3      Sentence    Dependent    Camera
[35]    99.20–91.30    40/494/1      Sentence    Dependent    Camera
[36]    91.90–79.50    40/478/1      Sentence    Dependent    Camera*
[36]    97.80–96.80    40/478/1      Sentence    Dependent    Camera**

V/Se/Si indicates the vocabulary dimension, the number of sentences, and the number of signers

* second-person view. ** first-person view


movement, hand shape, orientation, and rotation. The latter uses a single-channel HMM classifier to match a recognized gesture with a model sentence based on language grammar and semantics. Unfortunately, the system has been tested only on the authors' own database, and no results are reported.

4.5 Computational complexity

Here a detailed computational complexity analysis of the

algorithms of the automatic LIS recognition system is

discussed and the real-time factor related to the framework

implementation on a common hardware/software platform

is provided.

The FHD stage is characterized by two main algorithms: skin detection and morphological smoothing. The feature extraction stage is characterized by the Canny edge detector and the operations required to calculate the features of Table 2, which are directly performed on the image regions determined by the FHD stage on each video frame. Table 7 reports the number of FLoating-point Operations Per Second (FLOPS) of each stage, as well as their time occupancy with respect to the total execution time. Consider that in calculating the number of FLOPS, multiplications/divisions and additions/subtractions have been equally weighted. Note also that for those operations it is assumed that the skin regions are composed of 30 % of the overall amount of image pixels. This value coincides with the worst case in the A3LIS-147 database. The equivalent time consumed for each operation is based on the performance achieved on a Toshiba Satellite L500D-159 laptop equipped with an AMD Athlon II Dual-Core M300x2 running at 2 GHz and with 4 GB of RAM. It is worth reminding that the number of frames per second of the database videos is 25.

Looking at the values reported in Table 7, it is evident that the Canny filter represents the bottleneck of the FHD and feature extraction stages from the computational complexity perspective, and that there is thus space to calculate other features based on the Canny outcome and/or to evaluate better performing skin detection procedures. Note also that the morphological smoothing can induce a certain burden, depending on the size of the structuring element: in the present case study, its size is equal to 9, and bigger values were not beneficial to the overall system performance.

As far as the recognition stage is concerned, the Viterbi algorithm [30] requires most of the computational resources. Indeed, such an algorithm is executed for each sign to calculate the probability that the input feature vectors are generated by the sign model. The computational cost of these operations depends on the number of HMM states N, the number of signs in the vocabulary V, and the number of frames per second T of the video input to recognize. Moreover, it is linearly dependent on the operations needed to calculate the observation density, i.e., a mixture of G gaussians whose observation vector has size L. As pointed out in [30], the computational cost is given by:

$$\text{FLOPS} \approx VTN^2(6L + 2)G. \qquad (26)$$

Therefore, assuming $N = 18$, $G = 1$, $L = 21$ (as in the best results reported above), and $V = 147$, $T = 25$ (due to database characteristics), the computational cost is approximately 152.4 MFLOPS, thus greater than the one attained in the two former stages.
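Plugging the values above into Eq. (26) can be checked directly:

```python
def viterbi_flops(V, T, N, L, G):
    """Computational cost of Eq. (26): V * T * N^2 * (6L + 2) * G."""
    return V * T * N**2 * (6 * L + 2) * G

# Values from the best configuration and the A3LIS-147 database.
cost = viterbi_flops(V=147, T=25, N=18, L=21, G=1)
print(cost / 1e6)  # -> 152.4096 MFLOPS
```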

The computational burden of the recognition process has also been evaluated in terms of the real-time factor, defined as the ratio between the processing time and the length of the video sequence. On average, the whole process, including the feature calculation and the Viterbi algorithm for sign recognition, requires about 8 s per video segment. This results in a real-time factor of about 2.6, which represents an encouraging result for future real-time developments, also on embedded platforms. Indeed, considering that no specific optimization strategies have been carried out so far, space for improvements can be foreseen from this perspective.

5 Comparative evaluation with SVMs

To have better insights on the performance of the HMM-based system, an SVM approach has been considered [10]. SVMs are binary classifiers that discriminate whether an input vector $\mathbf{x}$ belongs to class $+1$ or to class $-1$ based on the following discriminant function:

Table 7 Computational cost in MFLOPS (Mega-FLOPS) and time consumed for all the algorithms of the automatic LIS recognition tool

                          MFLOPS   Time occupancy (%)
Skin detector              31.1        10.00
Morphological smoothing   112.0        25.00
Canny filter              134.5        30.00
Feature operations          6.2         1.67
Recognition               152.4        33.33

"Feature operations" includes all the formulas appearing in Table 2, the calculation of the moments used therein, and the face smoothing described in 5


$$f(\mathbf{x}) = \sum_{i=1}^{N} a_i t_i K(\mathbf{x}, \mathbf{x}_i) + d, \qquad (27)$$

where $t_i \in \{+1, -1\}$, $a_i > 0$ and $\sum_{i=1}^{N} a_i t_i = 0$. The terms $\mathbf{x}_i$ are the "support vectors" and $d$ is a bias term that, together with the $a_i$, is determined during the training process of the SVM. The input vector $\mathbf{x}$ is classified as $+1$ if $f(\mathbf{x}) \geq 0$ and $-1$ if $f(\mathbf{x}) < 0$. The kernel function $K(\cdot, \cdot)$ can assume different forms [5].

Regardless of the kernel, the basic assumption is that all the input vectors are of the same dimension. However, in the sign recognition scenario, the video sequences are composed of a varying number of frames, thus the number of feature vectors changes from video to video. A popular approach to deal with the problem is using the so-called alignment kernels.

In this work, the Gaussian Dynamic Time Warping (GDTW) kernel [21] has been employed. The kernel assumes the form:

$$K(\mathbf{x}, \mathbf{x}_j) = \exp(-\gamma\,\mathrm{DTW}(\mathbf{x}, \mathbf{x}_j)), \quad \gamma > 0. \qquad (28)$$

Notice that GDTW is a variant of the radial basis function kernel $K(\mathbf{x}, \mathbf{x}_i) = \exp(-\gamma \|\mathbf{x} - \mathbf{x}_i\|^2)$, where the term $\|\mathbf{x} - \mathbf{x}_i\|^2$ is replaced with the Dynamic Time Warping distance [29], thus making the kernel able to deal with variable-length sequences.
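A minimal sketch of the GDTW kernel of Eq. (28), using a basic DTW with Euclidean frame distances (the exact DTW variant and local constraints of [21, 29] are not reproduced here):

```python
import numpy as np

def dtw_distance(x, y):
    """Basic DTW alignment cost between two sequences of feature vectors."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def gdtw_kernel(x, y, gamma=2.0**9):
    """Gaussian DTW kernel of Eq. (28): exp(-gamma * DTW(x, y))."""
    return np.exp(-gamma * dtw_distance(x, y))
```

A kernel matrix built this way can be passed to LIBSVM as a precomputed kernel, which is how variable-length sign sequences can be fed to the SVM.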

Since SVMs are binary classifiers, the multiclass problem has been addressed using the "one versus all" strategy. All the experiments have been performed using LIBSVM (a library for Support Vector Machines) [9].

As for the HMM model parameter selection illustrated in Sect. 4.1, the first test aims at evaluating the best $\gamma$ value to use for the feature selection tests. Figure 11 shows the results obtained varying $\gamma$ from $2^{-15}$ to $2^{17}$. The best accuracy, 24.93 %, is reached for $\gamma = 2^9$. All the tests have been executed following the three-way data split method combined with the 10-fold cross-validation, as for the HMM-based system.

Figure 12 shows the final performance obtained using the same features as the HMM approach: the resulting accuracy is 21.49 % below the one obtained with the HMM system.

Since the adopted SFS procedure depends on the classifier, it has been repeated for the SVM approach to find the best performing features. The resulting features for the SVM are the normalized hands centroid distance and the hands centroids referred to the face centroid. Figure 13 shows the obtained results: the average accuracy of the SVM improves to 34.14 %; however, it is still below the one obtained with the HMM approach, 48.06 %.

6 Conclusion

In this work, a system for the automatic recognition of

isolated signs of the Italian sign language has been

Fig. 11 Sign recognition accuracy for different values of $\gamma$ using the three-way data split and the K-fold cross-validation methods

Fig. 12 Overall sign recognition accuracy using the same features of the HMM system

Fig. 13 Overall sign recognition accuracy using the optimal features for the SVM approach (i.e., features 6 and 7)


described and evaluated. The approach operates independently of the signer, and it is based on three distinct algorithmic stages operating in cascade: the first for face and hands detection, the second for feature extraction, and the third for sign recognition, based on the HMM paradigm. The authors, with the support of the local office of the Deaf People National Institute (ENS), have created a suitable database made of 1,470 signs to be used for the training and testing phases of the recognizer. In the experiments, several features have been evaluated for classification, and the overall best performing set has been determined using the sequential forward selection technique.

Performed computer simulations have shown that it is possible to reach an average recognition accuracy of almost 50 %, which is in line with the results obtained by recognition systems proposed in the literature for other sign languages and in similar experimental case studies. A detailed discussion of the obtained results has been provided by analysing the correlation between the edit distance of the HamNoSys transcriptions of the signs and the misclassification rates. The study demonstrated that they are inversely related, and that when the edit distance is low, the main sources of errors are the overlap between hands, and between hands and face.

To prove the effectiveness of the HMM-based system, a comparative evaluation with an SVM-based approach has been conducted. Two different tests have been executed to evaluate the SVM performance. In the first, the same features used in the model refinement experiment of the HMM approach have been used. In the second, a new set of features has been chosen to achieve the best SVM performance. In both cases, the accuracy values are below the one obtained with the HMM system.

As future work, different issues will be addressed to achieve an overall improvement of the recognition accuracy. As reported in [39], better skin detection techniques than the one available in the OpenCV library could be used to improve the discrimination of the regions of interest. Another substantial adjustment concerns the problem of the overlaps and the need to improve the shape detection. These aspects will be handled by investigating the approaches proposed in [28] and [25]. The introduction of the PCA technique proved to be a good approach to reduce the dimensionality of the frame in vision-based appearance features. In fact, the technique has been widely used in different frameworks [2, 20, 42] and seems to have a beneficial influence. Another aspect to investigate could be the introduction of different hand trackers, like those proposed by [3, 18].

Moreover, the development of a new sign database including audio and spatial information (e.g., using multiple cameras, or Microsoft Kinect™) to augment the feature set and reduce the errors due to overlapping and wrong spatial detection of the subject is actually under investigation. Clearly, in the new database it will be necessary to increase the number of vocabulary signs and to introduce more realizations of each sign per signer. The proposed approach will also be assessed on other sign corpora to compare the results with alternative techniques and to verify the performance in different scenarios.

Finally, the authors have also been directing their efforts towards extending the usability of the tool to the continuous sign case study, which requires the introduction of a suitable language model within the recognizer stage. The HMM-based approach used in the proposed framework surely represents a good choice from this perspective, as effectively demonstrated by many continuous speech recognition applications [33, 35, 36].

Appendix: HamNoSys Sign Transcription


References

1. von Agris U, Kraiss KF (2007) Towards a video corpus for signer-independent continuous sign language recognition. In: Proceedings of the International Workshop on Gesture in Human–Computer Interaction and Simulation, Lisbon
2. von Agris U, Zieren J, Canzler U, Bauer B, Kraiss KF (2008) Recent developments in visual sign language recognition. Univers Access Inf Soc 6(4):323–362
3. Al-Ahdal M, Tahir N (2012) Review in sign language recognition systems. In: Proceedings of the IEEE Symposium on Computers & Informatics, pp 52–57
4. Bertoldi N, Tiotto G, Prinetto P, Piccolo E, Nunnari F, Lombardo V, Mazzei A, Damiano R, Lesmo L, Del Principe A (2010) On the creation and the annotation of a large-scale Italian-LIS parallel corpus. In: Proceedings of the International Conference on Language Resources and Evaluation, pp 19–22
5. Bishop C (2006) Pattern recognition and machine learning. Springer Science+Business Media, New York
6. Bradski G (2000) The OpenCV library. Dr Dobb's J Softw Tools 25(11):120, 122–125
7. Bradski G, Kaehler A (2008) Learning OpenCV: computer vision with the OpenCV library. O'Reilly Media Inc., Sebastopol
8. Canny J (1986) A computational approach to edge detection. IEEE Trans Pattern Anal Mach Intell 8(6):679–698
9. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:1–27
10. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
11. Dreuw P, Forster J, Deselaers T, Ney H (2008) Efficient approximations to model-based joint tracking and recognition of continuous sign language. In: Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition, Los Alamitos, pp 1–6
12. Dreuw P, Neidle C, Athitsos V, Sclaroff S, Ney H (2008) Benchmark databases for video-based automatic sign language recognition. In: Proceedings of the International Conference on Language Resources and Evaluation. European Language Resources Association, Marrakech
13. Dreuw P, Ney H (2008) Visual modeling and feature adaptation in sign language recognition. In: ITG Conference on Voice Communication (SprachKommunikation), pp 1–4
14. Dreuw P, Ney H, Martinez G, Crasborn O, Piater J, Moya JM, Wheatley M (2010) The SignSpeak project – bridging the gap between signers and speakers. In: Proceedings of the International Conference on Language Resources and Evaluation, Valletta
15. Dreuw P, Rybach D, Deselaers T, Zahedi M, Ney H (2007) Speech recognition techniques for a sign language recognition system. In: Proceedings of Interspeech, pp 2513–2516
16. Fagiani M, Principi E, Squartini S, Piazza F (2012) A new Italian sign language database. In: Zhang H, Hussain A, Liu D, Wang Z (eds) Advances in brain inspired cognitive systems. Lecture notes in computer science, vol 7366. Springer, pp 164–173
17. Fagiani M, Principi E, Squartini S, Piazza F (2013) A new system for automatic recognition of Italian sign language. In: Apolloni B, Bassis S, Esposito A, Morabito FC (eds) Neural nets and surroundings. Smart innovation, systems and technologies, vol 19. Springer, Berlin, pp 69–79
18. Fang G, Gao W, Ma J (2001) Signer-independent sign language recognition based on SOFM/HMM. In: Proceedings of the IEEE ICCV Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, pp 90–95
19. Gonzalez RC, Woods RE (2007) Digital image processing, 3rd edn. Prentice Hall, USA
20. Gweth Y, Plahl C, Ney H (2012) Enhanced continuous sign language recognition using PCA and neural network features. In: Proceedings of the Computer Vision and Pattern Recognition Workshops, pp 55–60
21. Haasdonk B, Bahlmann C (2004) Learning with distance substitution kernels. In: Rasmussen C, Bülthoff HH, Schölkopf B, Giese MA (eds) Pattern recognition. Lecture notes in computer science. Springer, Berlin, pp 220–227
22. Hanke T (2004) HamNoSys – representing sign language data in language resources and language processing contexts. In: Proceedings of LREC, pp 1–6
23. Hu MK (1962) Visual pattern recognition by moment invariants. IRE Trans Inf Theory 8(2):179–187
24. Infantino I, Rizzo R, Gaglio S (2007) A framework for sign language sentence recognition by commonsense context. IEEE Trans Syst Man Cybern C Appl Rev 37(5):1034–1039
25. Kelly D, McDonald J, Markham C (2010) A person independent system for recognition of hand postures used in sign language. Pattern Recognit Lett 31(11):1359–1368
26. Kukharev G, Nowosielski A (2004) Visitor identification – elaborating real time face recognition system. In: Proceedings of the International Conference on Computer Graphics, Visualization and Computer Vision. UNION Agency, Plzen, pp 157–164
27. Maebatake M, Suzuki I, Nishida M, Horiuchi Y, Kuroiwa S (2008) Sign language recognition based on position and movement using multi-stream HMM. In: Proceedings of the International Symposium on Universal Communication, Los Alamitos, pp 478–481
28. Quan Y (2010) Chinese sign language recognition based on video sequence appearance modeling. In: Proceedings of the IEEE Conference on Industrial Electronics and Applications, pp 1537–1542
29. Rabiner L, Juang BH (1993) Fundamentals of speech recognition. Prentice-Hall Inc., Upper Saddle River
30. Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77(2):257–286
31. Ripley BD (1996) Pattern recognition and neural networks. Cambridge University Press, Cambridge
32. Sandler W, Lillo-Martin D (2006) Sign language and linguistic universals. Cambridge University Press, Cambridge
33. Saon G, Chien JT (2012) Large-vocabulary continuous speech recognition systems: a look at some recent advances. IEEE Signal Process Mag 29(6):18–33
34. Serra J (1983) Image analysis and mathematical morphology. Academic Press Inc., Orlando
35. Starner T, Pentland A (1995) Real-time American sign language recognition from video using hidden Markov models. In: Proceedings of the International Symposium on Computer Vision,

pp 265–270

36. Starner T, Weaver J, Pentland A (1998) Real-time American sign

language recognition using desk and wearable computer based

video. IEEE Trans Pattern Anal Mach Intell 20(12):1371–1375

37. Theodorakis S, Katsamanis A, Maragos P (2009) Product-HMMs

for automatic sign language recognition. In: Proceedings of the

international conference on acoustics, speech and signal pro-

cessing. Taipei, Taiwan, pp. 1601–1604

38. Theodoridis S, Koutroumbas K (2008) Pattern recognition, 4th

edn. Academic Press, Burlington

39. Vezhnevets V, Sazonov V, Andreeva A (2003) A survey on

pixel-based skin color detection techniques. In: Proceedings of

GraphiCon, pp 85–92

40. Young SJ, Evermann G, Gales MJF, Hain T, Kershaw D, Moore

G, Odell J, Ollason D, Povey D, Valtchev V, Woodland PC

(2006) The HTK Book, version 3.4. Cambridge University Press,

Cambridge

Pattern Anal Applic


41. Zahedi M, Keysers D, Deselaers T, Ney H (2005) Combination of tangent distance and an image distortion model for appearance-based sign language recognition. In: Kropatsch W, Sablatnig R, Hanbury A (eds) Pattern recognition. Lecture Notes in Computer Science. Springer, Berlin, pp 401–408
42. Zaki MM, Shaheen SI (2011) Sign language recognition using a combination of new vision based features. Pattern Recognit Lett 32(4):572–577
