Applied Soft Computing 9 (2009) 990–999

Video-based signer-independent Arabic sign language recognition using hidden Markov models

M. AL-Rousan a,*, K. Assaleh b, A. Tala’a c

a Computer Engineering Department, Jordan University of Science and Technology, P.O. Box 3030, Irbid 22110, Jordan
b American University of Sharjah, Electrical Engineering Department, P.O. Box 26666, Sharjah, United Arab Emirates
c Ministry of Education, Private Education Sector, Irbid, Jordan

ARTICLE INFO

Article history:

Received 22 October 2007

Received in revised form 29 July 2008

Accepted 18 January 2009

Available online 31 January 2009

Keywords:

Arabic sign language (ArSL)

Hand gestures

HMM

Deaf people

Feature extraction

DCT

ABSTRACT

Sign language in the Arab world has been recently recognized and documented. There have been no serious attempts to develop a recognition system that can be used as a communication means between hearing-impaired and other people. This paper introduces the first automatic Arabic sign language (ArSL) recognition system based on hidden Markov models (HMMs). A large set of samples has been used to recognize 30 isolated words from the Standard Arabic sign language. The system operates in different modes including offline, online, signer-dependent, and signer-independent modes. Experimental results on real ArSL data collected from deaf people demonstrate that the proposed system has a high recognition rate for all modes. For the signer-dependent case, the system obtains a word recognition rate of 98.13%, 96.74%, and 93.8% on the training data in offline mode, on the test data in offline mode, and on the test data in online mode, respectively. For the signer-independent case, the system obtains a word recognition rate of 94.2% and 90.6% for the offline and online modes, respectively. The system does not rely on the use of data gloves or other means as input devices, and it allows deaf signers to perform gestures freely and naturally.

© 2009 Elsevier B.V. All rights reserved.


1. Introduction

Sign language, as a form of gesture, is one of the most natural ways of communication for most people in the deaf community. There has been a resurging interest in recognizing human hand gestures. The aim of sign language recognition is to provide an accurate and convenient mechanism to transcribe sign gestures into meaningful text or speech so that communication between the deaf and the hearing society can easily be made. Hand gestures are spatiotemporally varying, and hence automatic gesture recognition turns out to be very challenging.

In this section we focus our discussion on the efforts made by researchers on sign language gesture recognition in general, and on Arabic sign language (ArSL) in particular. Sign language recognition (SLR) can be categorized into isolated SLR and continuous SLR, and each can be further classified into signer-dependent and signer-independent according to its sensitivity to the signer. One may also classify SLR systems as either glove-based (if the system relies on electromechanical devices for data collection) or non-glove-based (if bare hands are used). The learning and recognition methods used in previous studies to automatically recognize sign language include neural networks, rule-based matching, and hidden Markov models (HMMs).

Cyber gloves have been extensively used in most previous works on SLR, including [1–3,5,10–15]. Kadous [15] reported a system based on power gloves to recognize a set of 95 isolated Australian Sign Language signs with 80% accuracy. Instance-based learning and decision tree learning were adopted by the system to produce the pattern rules. Kim et al. [16] used a fuzzy min–max neural network and a fuzzy logic method to recognize 31 manual alphabets and 131 Korean signs based on data gloves. An accuracy of 96.7% for the manual alphabets and 94.3% for the sign words was reported. Grobel and Assan [12] used HMMs to recognize isolated signs with 91.3% accuracy out of a 262-sign vocabulary. They extracted 2D features from video recordings of signers wearing colored gloves. Recently, Gao et al. [10] used a dynamic programming method to build context-dependent models for recognizing continuous Chinese sign language. Data gloves were used as input devices and state-tying HMMs as the recognition method. Their system can recognize 5177 isolated signs with 94.8% accuracy in real time and recognize 200 sentences with 91.4% word accuracy. More detailed work on Chinese sign language is reported in [11], again using data gloves as input devices. Colored gloves were used in [13], where HMMs were employed to recognize 52 signs of German sign language with a single color video camera as input. In similar work [6], an accuracy of 80.8% was reached on a corpus of 12 different signs and 10 subunits, using the K-means clustering algorithm to obtain the subunits for continuous SLR. Liang and Ouhyoung [17] employed a time-varying parameter threshold of hand posture to determine end-points in a stream of gesture input for continuous Taiwanese SLR, with an average recognition rate of 80.4% over 250 signs. In their system HMMs were employed, and data gloves were taken as input devices.

The use of cyber gloves or other input devices conflicts with recognizing gestures in a natural context and is very difficult to run in real time. Therefore, researchers have recently presented several SLR systems based on computer vision techniques [19,21,24]. Starner et al. [20] used a view-based approach for continuous American SLR. They used a single camera to extract two-dimensional features as the input of an HMM. A word accuracy of 92% or 98% was obtained when the camera was mounted on the desk or in a user's cap, respectively, in recognizing sentences with 40 different signs. However, the user must wear two colored gloves (a yellow glove for the right hand and an orange glove for the left) and sit in a chair before the camera. Vogler and Metaxas [23] used computer vision methods and, interchangeably, an AT Flock of Birds tracker for 3D data extraction of 53 signs of American Sign Language. They, respectively, built context-dependent HMMs and modeled transient movements to alleviate the effects of movement epenthesis. Experiments over 64 phonemes extracted from 53 signs showed that modeling the movement epenthesis gives better performance than context-dependent HMMs. The system provided an overall accuracy of 95.83%.

Fig. 1. Sample gestures of Arabic sign language (ArSL).

As one can see from previous research on sign language recognition, the majority of the proposed systems have relied on the use of some sort of input devices (such as colored and cyber gloves). Furthermore, most of the above systems are signer-dependent. A more convenient and efficient system is one that allows deaf users to perform gestures naturally, with no prior knowledge about the user. To the best of our knowledge, there are only a few published works on signer-independent systems. Gao et al. [11] developed a system for recognizing Chinese sign language vocabulary using self-organizing feature maps (SOFM) and HMMs as recognition methods. The system obtained a word recognition rate of 82.9% over a 5113-sign vocabulary and an accuracy of 86.3% for signer-independent continuous SLR. They also introduced a signer-independent system for continuous sign language recognition, in which an SRN/HMM model was applied to continuous Chinese sign language [9]. However, the main drawback of the two systems is the use of cyber gloves for data collection. Vamplew and Adams [22] proposed a signer-independent system to recognize a set of 52 signs. The system used a modular architecture consisting of multiple feature-recognition neural networks and a nearest neighbor classifier to recognize isolated signs. They reported a recognition rate of 85% on the test set. Again, the signer must wear cyber gloves while performing gestures. Another attempt was made by Fang et al. [8], in which they used the SOFM/HMM model to recognize signer-independent CSL over 4368 samples from 7 signers with 208 isolated signs.

ArSLs are still in their development stages. According to the unified ArSL approved by the Arab League of States [2], the ArSL dictionary contains more than 1400 signs. Fig. 1 shows samples of ArSL. Only in recent years has there been awareness of the existence of the deaf community. Therefore, and as far as we are aware, no serious progress has been made in ArSL recognition. To the best of our knowledge, there have been only three research studies on alphabet ArSL [1,3,5] and only one study dealt with ArSL isolated word recognition [18]. In our work in [3] we developed a system to recognize isolated signs for 28 alphabets using colored gloves for data collection and adaptive neuro-fuzzy inference systems (ANFIS) as the recognition method. A recognition rate of 88% was reached. Later, in similar work, the recognition rate was increased to 93.41% using polynomial networks [5]. Aljarrah [1] used spatial features and a single camera to recognize 28 ArSL alphabets without using gloves. They reported a recognition rate of 93.55% using the ANFIS method. The sole research work to recognize isolated words in ArSL has been presented by Mohandes et al. [18]. In that preliminary study, several sets of words were used and tested using power gloves for data collection and a support vector machine (SVM) learning method. This work is discussed further in Section 5. Halawani [14] has developed a wireless system that translates input text to its associated sign language using a huge database. The system operates on mobile devices with Internet access.

This paper proposes a recognition system for Arabic sign language. We claim that it is the first system that presents the following features. First, it is a vision-based system that uses no gloves or any other instrumented devices. Second, it recognizes isolated signs. Third, recognition can be made offline or online (in real time). Finally, the system is signer-independent.

The remainder of this paper is organized as follows. In Section 2 we discuss the structure of the ArSL recognition system and the database used in building the system. Section 3 explains the method used for feature extraction (the DCT), followed by a brief discussion of basic HMM modeling in Section 4. Results obtained from our system and system evaluations are discussed in Section 5. Finally, we conclude in Section 6.

Fig. 2. The ArSL recognition system.

2. ArSL database collection

In this section we briefly describe our ArSL recognition system and the database used to train and test it. The recognition system is composed of several stages, as shown in Fig. 2. These stages are video capturing, segmentation, background removal and feature extraction, and, finally, recognition.

Because there has been no serious attention to Arabic sign language recognition, there are no common databases available for researchers in this field. Therefore, we had to build our own database of reasonable size. As depicted in Fig. 1, in the video capturing stage a single digital camera was used to acquire the gestures from signers in video format. At this stage the video is saved in the AVI format in order to be analyzed in later stages. When it comes to recognition, the video is streamed directly to the recognition engine. Unlike the systems discussed above, the signers use bare hands, as our system is machine-vision based; i.e., there is no need for gloves or any other type of assisting device. The signers must use the unified version of the Arabic sign language, which is approved by the Arab League of States [2].

Eighteen adult individuals from the deaf community volunteered to perform the signs to generate samples for our study. A portion of the video images was recorded from volunteers at Sharjah Humanitarian City in the United Arab Emirates; the remaining samples were recorded at the Sadaqa Association for Deaf People in Jordan. The number of repetitions (gestures) collected from the participants is shown in Table 1.

Table 1
Number of patterns per gesture for training and testing data.

No.   Class (gesture)    Offline training   Offline test   Online training   Online test
1     Friend (n)         105                45             60                80
2     Neighbor (n)       105                45             60                80
3     Guest (n)          105                45             60                80
4     Gift (n)           105                45             60                80
5     Enemy (n)          105                45             60                80
6     Hello (v)          105                45             60                80
7     Welcome (v)        105                45             60                80
8     Thank you (v)      105                45             60                80
9     Get in (v)         105                45             60                80
10    Shame              105                45             60                80
11    Home (n)           105                45             60                80
12    I (adv)            105                45             60                80
13    Ate (v)            105                45             60                80
14    Sleep (v)          105                45             60                80
15    Drink (v)          105                45             60                80
16    Wake up (v)        105                45             60                80
17    Hear (v)           105                45             60                80
18    Stop talking (v)   105                45             60                80
19    Sniff (v)          105                45             60                80
20    Help (v)           105                45             60                80
21    Yesterday (n)      105                45             60                80
22    Go (v)             105                45             60                80
23    Come (v)           105                45             60                80
24    Food (n)           45                 40             55                30
25    Water (n)          45                 40             55                30
26    How many?          45                 40             55                30
27    Where?             45                 40             55                30
28    Reason             45                 40             55                30
29    Money (n)          45                 40             55                30
30    Love (v)           45                 40             55                30
Total                    2730               1315           1765              2050

The total number of gestures collected for training and testing, corresponding to a total of 30 signs (classes), is 7860 samples. For offline recognition, we used a total of 4045 gestures partitioned into 2730 for training and 1315 for testing. For online recognition, we used a total of 3815 gestures partitioned into 1765 for training and 2050 for testing. In Table 1, one can notice that the number of samples is not the same for all gestures because the data were collected over a few months and in different places; not all participants were available all the time.

In recording the video images, there is no limitation on the lighting or the background of the scene, nor on the clothing of the person. The participant starts from silence, performs the required gesture, and ends in silence. In this phase, we digitized the previously recorded video stream and converted it into a manageable format acquired using Windows Movie Maker at a rate of 25 frames per second. Only the image component of the video stream was considered; the audio component was ignored.

After the video has been saved in Windows Media Video format, it is segmented at a rate of 5 fps using a third-party multimedia program (video converter) in the segmentation stage. This step of segmenting the video into consecutive image frames takes a lot of processing time. For the offline mode, segmentation is done manually, while for the online mode it is done automatically.
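The following minimal sketch illustrates this down-sampling step (a 25 fps recording kept at roughly 5 fps and converted to grayscale for the later stages); it uses OpenCV rather than the third-party converter mentioned above, and the file name and function name are hypothetical.

```python
import cv2

def extract_frames(video_path, target_fps=5):
    """Grab grayscale frames from a recorded clip at roughly target_fps."""
    cap = cv2.VideoCapture(video_path)
    source_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0       # fall back to the paper's 25 fps
    step = max(1, int(round(source_fps / target_fps)))   # keep every 5th frame for 25 -> 5 fps
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            # later stages only need intensity, so drop the color channels
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        index += 1
    cap.release()
    return frames

# hypothetical usage
# frames = extract_frames("gesture_friend_signer01.avi")
```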

As for background removal, we initially performed consecutive image subtraction for each sign of n frames. But that meant subtracting (n − 1) × 2 images, each of 76 800 (320 × 240) elements, and then taking the DCT [25] of n − 1 images in the feature extraction stage. Later we minimized this job and combined it into the feature extraction stage. We did that because the DCT is a linear transform, i.e., the difference of the DCTs equals the DCT of the differences. Furthermore, the zonal coding used in feature extraction is spatial, i.e., it depends only on the location of the original number in the original matrix. This means that applying vector subtraction on the resultant vectors is the same as applying it on the original images and then generating the vectors. Therefore, if we take the DCT of n images, we only need to subtract (n − 1) × 2 vectors, each of 50 elements, as compared to subtracting 76 800 elements. This effectively shifted the background removal process to after the feature extraction is completed. The feature extraction and recognition stages are discussed in more detail in the following sections.

Fig. 3. Using DCT for feature extraction.

3. DCT for feature extraction

As mentioned earlier, we use the discrete cosine transform (DCT) to extract features from the input gestures [25]. The DCT is a Fourier-related transform similar to the discrete Fourier transform (DFT), but it uses only real numbers. It is equivalent to a DFT of roughly twice the length operating on real data with even symmetry (since the Fourier transform of a real and even function is real and even), where in some variants the input and/or output data are shifted by half a sample. The DCT represents an image as a sum of sinusoids of varying magnitudes and frequencies.

The DCT is often used in signal and image processing, especially for lossy data compression, because it has a strong "energy compaction" property: most of the signal information tends to be concentrated in a few components of the DCT. For this reason, the DCT is widely used in image compression applications; for example, it is at the heart of the international standard lossy image compression algorithm known as JPEG (Joint Photographic Experts Group). In other words, the DCT has the property that, for a typical image, most of the visually significant information about the image is concentrated in just a few DCT coefficients. More precisely, most of the information of the image is stored in the top left corner of the matrix, as shown in Fig. 3.

Formally, the discrete cosine transform is a linear, invertible function F: R^n → R^n (where R denotes the set of real numbers), or equivalently an n × n square matrix. This means that DCT(Image 2 + Image 1) = DCT(Image 2) + DCT(Image 1), a property that will be used later on. It also means that inverse-DCT(DCT(image)) = image. There are several variants of the DCT with slightly modified definitions. The general formula for the 1D DCT is defined as [25]

F(u) = \left(\frac{2}{N}\right)^{1/2} v(u) \sum_{i=0}^{N-1} f(i)\,\cos\!\left[\frac{\pi u (2i+1)}{2N}\right]   (1)

where N is the number of items in the signal, π ≈ 3.14, and the coefficient v is given as

v(u) = \begin{cases} 1/\sqrt{2} & \text{for } u = 0 \\ 1 & \text{otherwise} \end{cases}   (2)

The above equation transforms a signal consisting of N items in the spatial domain, f(i), into its frequency-domain representation, F(u). However, images are two-dimensional signals. The general equation for the 2D DCT of an N × M image is defined as

F(u,v) = \left(\frac{2}{N}\right)^{1/2}\left(\frac{2}{M}\right)^{1/2} v(u)\,v(v) \sum_{i=0}^{N-1}\sum_{j=0}^{M-1} f(i,j)\,\cos\!\left[\frac{\pi u (2i+1)}{2N}\right]\cos\!\left[\frac{\pi v (2j+1)}{2M}\right]   (3)

where the coefficient v is as defined in Eq. (2). To apply the DCT, the input image is N × M, in which the pixel in row i and column j is denoted f(i,j). The output of the DCT is called the DCT matrix, in which F(u,v) is the DCT coefficient in row u and column v.

Zonal coding is a method of encoding data in which each resultant element depends only on its location in the original set, i.e., it depends on the zone, hence the name zonal coding. In this work, the zigzag version of zonal coding was implemented, in which we begin at the top of the matrix and then take elements in a zigzag order, as illustrated in Fig. 3.

The reason for choosing the zigzag form of zonal coding is that, due to the energy compaction property of the DCT explained above, most of the data are concentrated in the top left part of the resultant matrix, which means that such an approach ensures that most of the relevant data are encoded. It also means that the source of these coefficients is fixed, as opposed to other encoding schemes such as threshold coding, which chooses the highest values and might take the coefficients from different places, further complicating the recognition phase. The output of the feature extraction stage of Fig. 3, using the DCT and zonal coding, is a feature vector O_i (observation) containing 50 elements. All feature vectors (also called the observation sequence) belonging to one gesture are used to build the corresponding HMM model during the training phase, or are fed to the recognition stage during the testing phase.
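To make this pipeline concrete, the sketch below implements Eqs. (1)–(3) as a separable 2D DCT plus a zigzag zonal scan that keeps the first 50 coefficients, and checks the linearity property exploited in Section 2. The function names, the exact zigzag ordering, and the use of NumPy are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis matrix matching Eqs. (1)-(2):
    # C[u, i] = sqrt(2/n) * v(u) * cos(pi * u * (2i + 1) / (2n)), with v(0) = 1/sqrt(2).
    u = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * u * (2 * i + 1) / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def dct2(image):
    # Separable form of the 2D DCT in Eq. (3): F = C_N f C_M^T for an N x M image f.
    n, m = image.shape
    return dct_matrix(n) @ image @ dct_matrix(m).T

def zigzag_features(image, k=50):
    # Zigzag (zonal) scan of the DCT matrix, keeping the first k coefficients,
    # which lie in the energy-rich top-left corner (Fig. 3).
    F = dct2(image.astype(float))
    n, m = F.shape
    order = sorted(((i, j) for i in range(n) for j in range(m)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return np.array([F[i, j] for i, j in order[:k]])

# Toy usage on two synthetic 240 x 320 grayscale frames.
frame1 = np.random.rand(240, 320)
frame2 = np.random.rand(240, 320)
o1, o2 = zigzag_features(frame1), zigzag_features(frame2)   # 50-element observations

# Feature-domain background removal (Section 2): since the DCT and the zonal scan
# are linear/positional, the difference of feature vectors equals the features of
# the difference image.
assert np.allclose(o2 - o1, zigzag_features(frame2 - frame1))
```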

4. Basics of HMM modeling

Different recognition methods have been used in areas such as speech recognition [4,7] and handwriting recognition, but for sign language recognition HMMs have been used prominently and successfully. Although there are substantial discussions of HMMs elsewhere [20], this section briefly outlines the algorithm. The fundamental assumption of an HMM is that the process to be modeled is governed by a finite number of states and that these states change once per time step in a random but statistically predictable way. To be more precise, the state at any given time depends only on the state at the previous time step. This is known as the Markovian assumption.

As shown in Fig. 4, in sign language recognition each gesture is modeled as a single HMM with N states per gesture (s_1, s_2, ..., s_N). An HMM is often described compactly as λ = (π, A, B), where A = {a_ij} is the state-transition probability distribution with a_ij = P(q_t = s_j | q_{t−1} = s_i), 1 ≤ i, j ≤ N; B = {b_j(k)} is the observation symbol probability distribution in state s_j, with b_j(k) = P(v_k at t | q_t = s_j), 1 ≤ j ≤ N, 1 ≤ k ≤ M (M is the number of distinct observation symbols per state, denoted by V = {v_1, v_2, ..., v_M}); and π = {π_i} is the initial state distribution, where π_i = P(q_1 = s_i), 1 ≤ i ≤ N.

Fig. 4. State HMM model for the gesture "gift".

Once we have an HMM, there are three problems of interest. The first two are concerned with recognition: evaluation—finding the probability of an observed sequence given an HMM—and decoding—finding the sequence of hidden states that most probably generated an observed sequence. The third problem is learning (or estimation)—generating an HMM given a sequence of observations. Evaluation and decoding are used in the recognition stage, while learning is used in the training stage of the HMM. The evaluation problem is described as follows. Consider the problem where we have a number of HMMs (that is, a set of (π, A, B) triples) describing different systems, and a sequence of observations. We may want to know which HMM most probably generated the given sequence. For example, we may have the "guest" model and the "gift" model; since the behavior is likely to differ from gesture to gesture, we may then hope to determine the gesture on the basis of a sequence of posture (hand movement) observations. We use the forward algorithm to calculate the probability of an observation sequence given a particular HMM, and hence choose the most probable HMM. More precisely, given the observation sequence O = O_1, O_2, ..., O_T and the model λ = (π, A, B), the problem is how to compute P(O|λ), the probability that this observed sequence was produced by the model. This type of problem occurs in sign language recognition, where a large number of Markov models is used, each one modeling a particular isolated sign or phrase. An observation sequence is formed from a gesture, and the gesture is recognized by identifying the most probable HMM for the observations. Practically, this problem can be solved with the forward–backward algorithm [11,20].
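As a hedged illustration of this evaluation step, the sketch below computes log P(O|λ) with the scaled forward recursion for a discrete-observation HMM; the array shapes, the toy parameter values, and the assumption that observations are codebook indices are ours rather than the paper's.

```python
import numpy as np

def forward_log_prob(pi, A, B, obs):
    # pi: (N,) initial state distribution; A: (N, N) transition matrix a_ij;
    # B: (N, M) emission matrix b_j(k); obs: sequence of observation-symbol indices.
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b_i(O_1)
    s = alpha.sum()
    log_prob = np.log(s)                 # accumulate logs of the scaling factors
    alpha = alpha / s
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction: sum_i alpha_{t-1}(i) a_ij, times b_j(O_t)
        s = alpha.sum()
        log_prob += np.log(s)
        alpha = alpha / s                # rescale to avoid underflow on long sequences
    return log_prob                      # log P(O | lambda)

# Toy 2-state, 3-symbol model and a short observation sequence.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(forward_log_prob(pi, A, B, [0, 1, 2, 2]))
```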

The learning problem associated with HMMs is to take a sequence of observations (called the training set), known to represent a set of hidden states, and fit the most probable HMM; that is, to determine the (π, A, B) triple that most probably describes what is seen. This problem covers the estimation of the model parameters λ = (π, A, B) given one or more observation sequences O. No analytical calculation method is known to date; Viterbi training represents a solution that iteratively adjusts the parameters π, A and B [11,20]. In every iteration, the most likely path through the HMM is calculated. This path gives the new assignment of observation vectors O_t to the states s_j.

In the decoding problem, our interest is to find the hidden states that generated the observed output. In many cases we are interested in the hidden states of the model because they represent something of value that is not directly observable. In other words, given the observation sequence O, what is the most likely sequence of states S = s_1, s_2, ..., s_T according to some optimality criterion? A formal technique for finding this best state sequence is the Viterbi algorithm [13,20].
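A minimal sketch of Viterbi decoding in log space is given below, reusing the λ = (π, A, B) representation and toy parameters of the previous sketch; it returns the best state path together with its joint log-probability, which is also the P(O, Q|λ) score used for recognition in the next section.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    # Most likely state sequence for obs under lambda = (pi, A, B), computed in log space.
    N, T = len(pi), len(obs)
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta = log_pi + log_B[:, obs[0]]          # best log score ending in each state at t = 1
    psi = np.zeros((T, N), dtype=int)          # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A        # scores[i, j]: best path into state j via state i
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):             # backtrack through the stored pointers
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta.max()                   # state indices and log P(O, Q* | lambda)

# Reuses the toy pi, A, B from the forward-algorithm sketch.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(pi, A, B, [0, 1, 2, 2]))
```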

5. System evaluation

Fig. 5. The structure of the ArSL recognition system based on HMM.

We applied the HMM training method described above by creating one HMM per class (gesture), resulting in a total of 30 models. As shown in Fig. 5, the set of HMM models is called the codebook. The recognition procedure is to select the model from the codebook database that best represents the observation sequence. Given the observation sequence O = O_1, O_2, ..., O_T, the probability P(O|λ_v) is computed for each model λ_v in the codebook database. P(O|λ_v) is approximated as P(O, Q|λ_v), where Q is the best state sequence among the state spaces for the given O, which can be obtained through the Viterbi algorithm search [5]. The recognized gesture is obtained using the following formula:

v^* = \arg\max_{1 \le v \le V} P(O \mid \lambda_v)   (5)
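The sketch below mirrors this codebook procedure and the decision rule of Eq. (5); it assumes continuous 50-dimensional DCT feature sequences and uses hmmlearn's GaussianHMM purely as a stand-in for the authors' own HMM implementation, with hypothetical names such as train_codebook and an illustrative 5-state topology.

```python
import numpy as np
from hmmlearn import hmm   # third-party HMM library, used here only as a stand-in

def train_codebook(training_data, n_states=5):
    # training_data: dict mapping gesture label -> list of (T_i x 50) observation sequences.
    codebook = {}
    for label, sequences in training_data.items():
        X = np.vstack(sequences)                    # stack the frames of all repetitions
        lengths = [len(seq) for seq in sequences]   # sequence boundaries for hmmlearn
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        model.fit(X, lengths)                       # estimate lambda_v from the training set
        codebook[label] = model
    return codebook

def recognize(codebook, observation_sequence):
    # Eq. (5): pick the gesture whose model maximizes the (log-)likelihood of O.
    return max(codebook, key=lambda label: codebook[label].score(observation_sequence))
```

Note that the paper approximates P(O|λ_v) by the Viterbi-path score P(O, Q|λ_v), whereas GaussianHMM.score returns the full forward log-likelihood; the structure of the argmax decision rule is the same.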

We have conducted several experiments to evaluate our ArSL recognition system. The first experiment was an offline evaluation using the previously collected training data. In this setting the system must try to recognize all samples of every word, where the total number of samples considered is 2730. The overall system performance was 98.13%, which is reasonably high. The number of misclassifications is 51 out of 2730 signs, which corresponds to a 1.86% error rate.

A more indicative way of measuring the system performance is to test the system on a different set from the one it was trained on. In the second experiment we presented a total of 1315 gestures distributed among the signs shown in Table 1. The recognition accuracy went down slightly compared to the result obtained on the training set. However, the system showed excellent performance with a low error rate of 2.51%, corresponding to a recognition rate of 97.4%. The resulting number of misclassifications is 33 out of 1315 signs. The detailed per-sign misclassifications are shown in Fig. 6.

Fig. 6. Confusion matrix for test data in signer-dependent/offline mode.

We now move to testing the system in the online mode. In the online experiments the training process is similar to that in the offline mode, except that we used a different set of training data. The total number of samples dedicated to this phase was 1765 gestures collected from all signers, as shown in Table 1. We then asked the signers to perform gestures while the system was ready to recognize them immediately by outputting the corresponding signs, text, and voice. The system was able to recognize the input gestures in real time after processing the data, as described before. A total of 2050 gestures were tested online with an overall accuracy of 93.8%. The total number of recognized gestures is 1923 out of 2050, which corresponds to a 6.19% error rate. Naturally, the system performance went down when used in the online mode.

In all of the above experiments, the test data are obtained from the same signers who provided the training data. That is, the performance presented above represents the signer-dependent mode.


Fig. 7. Similarities between pairs of signs in ArSL.

Table 2
Recognition rates of the ArSL system for the test data.

                    Signer-dependent          Signer-independent
                    Offline      Online       Offline      Online
Recognition rate    97.4%        93.8%        94.2%        90.6%


Our final experiments aimed to evaluate the system in the signer-independent mode. We first tested the system performance using the offline mode. We trained the system using samples collected from signers 1 to 8, with a total of 1500 samples. We then used the samples collected from the other 10 signers in the testing phase, with a total of 3545 samples. The system showed relatively high performance, with an overall recognition rate of 94.2%. The system was not able to recognize 205 gestures out of 3545, which corresponds to a 5.7% error rate. Next we moved to testing the online (real-time) mode using samples taken from all signers. Here we trained the system using 1800 samples collected from 8 signers. Then, for testing purposes, we set up the system to recognize the gestures performed by the other 10 signers in real time. A total of 1500 samples (30 gestures, 5 repetitions, 10 signers) were tested using the online mode. When the signer-independent mode was used in configuring the system, it gave a lower accuracy, as shown in Table 2. The overall system accuracy is 90.6%, with a total of 72 misrecognized gestures. In general, for the signer-independent mode, this overall performance is acceptable if we keep in mind the complexity of the recognition task.

When we analyzed the misclassification rates for each gesture in our experiments, we found a set of gestures that showed high misclassification rates. This set consists of the gestures {home, ate, wake up, sniff}, as depicted in Fig. 7. One can see that the location, movement and orientation of the dominant hand are very similar in these gestures. Therefore, and due to the nature of the DCT, the observation (feature) vectors O_1, ..., O_t produced by the DCT algorithm are most likely very close to each other. Thus, the system gets confused between these gestures and exhibits a relatively higher error rate for them.

The results of our HMM-based recognition system are considered superior to previously published results in the field of ArSL. A proper comparison can be made with the work presented by Mohandes et al. [18], though they used a different approach, setup, and database. Their system is based on a support vector machine (SVM) classifier, which was tested using 120 isolated words collected from signers wearing cyber gloves. To provide a fair comparison, we perform the comparison based on recognizing 30 isolated signs using the offline signer-dependent mode. We choose these results because both our system and their system are evaluated on 30 signs. As shown in Table 3, our HMM-based system (referred to as ArSL–HMM in Table 3) performs much better than the SVM-based system (referred to as ArSL–SVM in Table 3). The achieved recognition rate is 97.4%, which corresponds to an 86% reduction in misclassifications and hence in error rate. It is worth mentioning that we considered the best recognition rate achieved by the SVM-based system in our comparison; the recognition rate of the SVM-based system ranges from 75% to 81% for 30 signs [5].

Table 3
Comparison with similar offline, signer-dependent systems.

System          Instruments used   Mode                        No. of signers   Recognition rate   Reduction in misclassification rate   Recognition improvement for ArSL–HMM
ArSL–SVM [18]   Cyber gloves       Offline, signer-dependent   3                75–81%             86%                                   20.2%
ArSL–HMM        None: free hands   Offline, signer-dependent   18               97.4%

A further indirect comparison can be made with the recognition system developed for Chinese Sign Language (CSL) in [11], which was built on two models: HMM and self-organizing feature maps (SOFM). In Table 4 we show a general comparison between our system (ArSL–HMM) and their HMM-based system (CSL–HMM). Here we show the results obtained for the signer-independent, online mode. Although the CSL–HMM system uses gloves for data collection, the ArSL system still showed superior performance over the CSL–HMM system. Our system demonstrated a significant increase in recognition rate of 13.25%, corresponding to a 53% reduction in misclassification rate. Furthermore, our system does not need any type of calibration or special handling as in the CSL–HMM-based system.

Table 4
Comparison with similar online, signer-independent systems.

System         Instruments used                                  Mode                         No. of signers   Recognition rate   Reduction in misclassification rate   Recognition improvement for ArSL–HMM
CSL–HMM [11]   18-sensor data gloves and two position trackers   Online, signer-independent   6                80%                53%                                   13.25%
ArSL–HMM       None: free hands                                  Online, signer-independent   18               90.6%

Measuring the time response of the system is very beneficial for real-time applications. In Table 5 we show the time taken in each stage of the system in the online mode. These time measurements were taken when running our recognition system on a machine with a 2.4 GHz Celeron processor, 254 MB of RAM, and Windows XP Professional. The table shows the time taken for each gesture from the moment it is performed until the system provides the recognition. This includes the time required for segmentation, feature extraction, and classification. It can be seen that the time taken in the feature extraction and classification stages contributes little compared to the time spent in image processing. Image processing takes most of the response time because it involves image segmentation, where hand and facial regions must be detected. The time taken to recognize a gesture depends on its complexity and the number of frames in it. For example, the system took less time to recognize 3-frame gestures (e.g., gestures 23 "go" and 24 "come") compared to 5-frame gestures (e.g., gestures 15 "drink" and 16 "wake up"). In all cases, the average response time of our system is around 7 s, which is an acceptable response for our application.

Table 5
System response time in seconds.

Gesture ID   Image processing   Feature extraction using DCT   Classification   Total (s)
1            6.5128             1.109                          0.265            7.887
2            8.6353             1.5                            0.313            10.448
3            6.4309             1.172                          0.406            8.009
4            4.4634             0.766                          0.266            5.495
5            8.5718             1.438                          0.281            10.291
6            6.5208             1.045                          0.27             7.836
7            6.2258             1.123                          0.271            7.62
8            5.3609             0.906                          0.266            6.533
9            4.4176             0.766                          0.265            5.449
10           0
11           4.8486             0.765                          0.281            5.895
12           5.6512             0.953                          0.281            6.885
13           4.4718             0.796                          0.25             5.518
14           6.5308             1                              0.27             7.801
15           5.4019             0.938                          0.266            6.606
16           5.5194             0.906                          0.25             6.675
17           5.5104             0.906                          0.282            6.698
18           5.356              0.906                          0.282            6.544
19           5.4938             0.922                          0.25             6.666
20           5.4499             0.984                          0.266            6.7
21           3.5718             0.696                          0.26             4.528
22           3.5481             0.609                          0.25             4.407
23           4.4296             0.766                          0.281            5.477
24           6.5348             1.094                          0.297            7.926
25           6.299              1.14                           0.313            7.752
26           5.4033             0.969                          0.25             6.622
27           7.5532             1.406                          0.281            9.24
28           6.9327             1.688                          0.297            8.918
29           6.5637             1.141                          0.281            7.986
30           6.3925             1.125                          0.266            7.784
Average      5.62               1.018                          0.278            7.11


6. Conclusion

In this paper we have shown an unencumbered way of recognizing ArSL using a single video camera. The system does not use any type of data or power gloves. Through the use of HMM models, low error rates were achieved on both the training set and the test set for isolated words and gestures. The system also showed low error rates in the online operation mode for both the signer-dependent and signer-independent cases. The recognition accuracy of the system varied in the range of 90.6–98.1%, depending on the operation mode under consideration. This achievement is of importance to the problem of Arabic sign language recognition, which has seen very limited research on its automation. This system presents the first comprehensive study on isolated signs in ArSL. We have also compared the performance of our system to previously published work using HMM-based classification. We showed the significant improvements achieved by our system, which reached an 86% reduction in misclassification rate when compared with the best case of other existing systems.

Although we have achieved a signer-independent Arabic sign language recognition system, there are still many issues to be investigated further. First, how HMMs can be used for recognizing a set of sentences; this is ongoing research and under investigation. We have achieved reasonable results by imposing pauses between the words in a sentence; handling sentences without pauses is our goal and remains under investigation. Second, the system accuracy can be improved by considering different feature sets. Features can be extracted in the spatial or frequency domains. Our ongoing research shows that many feature extraction methods in both domains are promising.

Acknowledgments

The authors acknowledge the support of Sharjah Humanitarian City, Sharjah, UAE, and the Sadaqa Association for Deaf People, Jordan. Special thanks to Mr. Saleh Odeh of Sharjah Humanitarian City for his time and the valuable assistance he provided.

References

[1] O. Al-Jarrah, A. Halawani, Recognition of gestures in Arabic sign language using neuro-fuzzy systems, Artificial Intelligence 133 (2001) 117–138.

[2] The Arab Sign Language Dictionary, Arab League of States, 2001.


[3] M. AL-Rousan, M. Hussain, Automatic recognition of Arabic sign language finger spelling, International Journal of Computers and Their Applications (IJCA)—Special Issue on Fuzzy Systems 8 (2) (2001) 80–88.

[4] Y.A. Alotaibi, Investigating spoken Arabic digits in speech recognition setting, Information Sciences 173 (2005) 115–139.

[5] K. Assaleh, M. AL-Rousan, Recognition of Arabic sign language alphabet using polynomial classifiers, EURASIP Journal on Applied Signal Processing 13 (2005) 2136–2145.

[6] B. Bauer, K.F. Kraiss, Towards an automatic sign language recognition system using subunits, in: Proceedings of the International Gesture Workshop, 2001, pp. 64–75.

[7] W.M. Campbell, K.T. Assaleh, C.C. Broun, Speaker recognition with polynomial classifiers, IEEE Transactions on Speech and Audio Processing 10 (4) (2002) 205–212.

[8] G.L. Fang, W. Gao, J.Y. Ma, Signer-independent sign language recognition based on SOFM/HMM, in: Proceedings of the IEEE ICCV Workshop on Recognition, Analysis and Tracking of Faces and Gestures in Real-Time Systems, 2001, pp. 90–95.

[9] G.L. Fang, W. Gao, A SRN/HMM system for signer-independent continuous sign language recognition, in: Proceedings of the Fifth International Conference on Automatic Face and Gesture Recognition, 2002, pp. 312–317.

[10] W. Gao, J.Y. Ma, J.Q. Wu, C.L. Wang, Sign language recognition based on HMM/ANN/DP, International Journal of Pattern Recognition and Artificial Intelligence 14 (5) (2000) 587–602.

[11] W. Gao, G. Fang, D. Zhao, Y. Chen, A Chinese sign language recognition system based on SOFM/SRN/HMM, Pattern Recognition 37 (2004) 2389–2402.

[12] K. Grobel, M. Assan, Isolated sign language recognition using hidden Markov models, in: Proceedings of the International Conference on Systems, Man and Cybernetics, 1997, pp. 162–167.

[13] H. Hienz, B. Bauer, K.F. Kraiss, HMM-based continuous sign language recognition using stochastic grammar, in: Proceedings of GW'99, LNAI 1739, 1999, pp. 185–196.

[14] M. Halawani, Arabic sign language translation system on mobile devices, IJCSNS International Journal of Computer Science and Network Security 8 (1) (2008) 251–256.

[15] M.W. Kadous, Machine recognition of Auslan signs using PowerGloves: towards large-lexicon recognition of sign language, in: Proceedings of the Workshop on the Integration of Gestures in Language and Speech, 1996, pp. 165–174.

[16] J.S. Kim, W. Jang, Z. Bien, A dynamic gesture recognition system for the Korean sign language (KSL), IEEE Transactions on Systems, Man, and Cybernetics 26 (2) (1996) 354–359.

[17] R.H. Liang, M. Ouhyoung, A real-time continuous gesture recognition system for sign language, in: Proceedings of the Third International Conference on Automatic Face and Gesture Recognition, 1998, pp. 558–565.

[18] M. Mohandes, S.A. Buraiky, T. Halawani, S. Al-Buayat, Automation of the Arabic sign language recognition, in: Proceedings of the 2004 International Conference on Information and Communication Technology (ICT04), April 2004, pp. 479–480.

[19] V.I. Pavlovic, R. Sharma, T.S. Huang, Visual interpretation of hand gestures for human–computer interaction: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7) (1997) 677–695.

[20] T. Starner, J. Weaver, A. Pentland, Real-time American sign language recognition using desk and wearable computer-based video, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (12) (1998) 1371–1375.

[21] J.J. Triesch, C. von der Malsburg, A system for person-independent hand posture recognition against complex backgrounds, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (12) (2001) 1449–1453.

[22] P. Vamplew, A. Adams, Recognition of sign language gestures using neural networks, Australian Journal of Intelligent Information Processing Systems 5 (2) (1998) 94–102.

[23] C. Vogler, D. Metaxas, A framework for recognizing the simultaneous aspects of American sign language, Computer Vision and Image Understanding 81 (3) (2001) 358–384.

[24] Y. Wu, T.S. Huang, Vision-based gesture recognition: a review, in: Proceedings of the International Gesture Workshop, 1999, pp. 103–115.

[25] C. Loeffler, A. Ligtenberg, G. Moschytz, Practical fast 1-D DCT algorithms with 11 multiplications, in: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP'89), 1989, pp. 988–991.