hand gesture recognition using haar-wavelet
HAND GESTURE RECOGNITION
SYSTEM USING HAAR WAVELET
A PROJECT REPORT
Submitted by
J.JENKIN WINSTON (Reg. No: 96207106036)
M.MARIA GNANAM (Reg. No: 96207106056)
R.RAMASAMY (Reg. No: 96207106306)
in partial fulfillment for the award of the degree
of
BACHELOR OF ENGINEERING
in
ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL ENGINEERING COLLEGE, KOVILPATTI
ANNA UNIVERSITY OF TECHNOLOGY,
TIRUNELVELI - 627 007.
March 2011
ANNA UNIVERSITY OF TECHNOLOGY, TIRUNELVELI
BONAFIDE CERTIFICATE
Certified that this project report titled “HAND GESTURE RECOGNITION
SYSTEM USING HAAR WAVELET” is the bonafide work of J.JENKIN
WINSTON (96207106036), M.MARIA GNANAM (96207106056),
R.RAMASAMY (96207106306) who carried out the project work under my
supervision.
Signature
Head of the Department
Dr.V.Vijayarangan, B.E., M.Sc(Engg)., Ph.D.,
Department of ECE,
National Engineering College,
Kovilpatti - 628 503

Signature
Supervisor
Mr.M.Sundaram, M.E.,
Sr.Lecturer/ECE,
National Engineering College,
Kovilpatti - 628 503

Submitted for Viva-Voce Examination held at NATIONAL
ENGINEERING COLLEGE, Kovilpatti on ____________

Internal Examiner
External Examiner
ACKNOWLEDGEMENT
First and foremost we express our wholehearted gratitude to the Almighty
for having given us the wisdom and courage to take up this project.
We wish to express our sincere thanks to the Director
Dr.Kn.K.S.K.Chockalingam, B.E., M.Sc(Engg)., Ph.D., who helped us in
carrying out our project successfully.
We would like to express our sincere thanks to our former Principal
Dr.N.S.Marimuthu, B.E., M.Sc(Engg)., Ph.D., for providing us this
opportunity to do this project.
Our heartfelt acknowledgement goes to the Professor and Head of the
Department of Electronics and Communication Engineering,
Dr.V.Vijayarangan, B.E., M.Sc(Engg)., Ph.D., for his valuable and consistent
encouragement for carrying out the project.
Our gratitude is no less for our project coordinators Mr.N.Arumugam,
M.E., Assistant Professor and Mrs.S.D.Jayavathi, M.E., Senior Lecturer, in the
Department of Electronics and Communication for their encouragement.
We express our deepest gratitude to our guide Mr.M.Sundaram, M.E.,
Senior lecturer, in the Department of Electronics and Communication
Engineering, for rendering excellent guidance and for being extremely kind and
approachable in nature, being a great source of support and encouragement
throughout the course of the project work.
We hereby acknowledge the efforts of all staff members and technicians of the
Electronics and Communication Engineering Department, whose help was
instrumental in the completion of our project.
We would also like to express our hearty thanks to our beloved parents and
dear friends for their valuable suggestions and cooperation for the project.
ABSTRACT
With the rapid growth of technology in the digital age, the difficulties faced
by people with disabilities can be approached in new ways. The sign language
that speech-impaired people use for communication is not understood by
everyone, which isolates them from the speaking community. We therefore aim to
provide an effective means of communication for speech-impaired people by
programming a gesture recognition system based on image processing. Our
algorithm is developed in Matlab to recognize static hand gestures, namely a
subset of American Sign Language (ASL). It is fairly robust to background
clutter and uses skin color for hand gesture tracking and recognition. In this
project we have reduced the database size by normalizing the orientation of
hands using the idea of the principal axis. We use a correlation factor to
improve the degree of recognition. Every human hand has a geometry different
from every other. To compensate for this, we use a transform that converts an
image into a feature vector, which is then compared with the feature vectors of
a training set of gestures. Together, these features decrease the computation
time. This method is compact and handy compared to other hand gesture
recognition systems.
TABLE OF CONTENTS
CHAPTER NO    TITLE
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
1    INTRODUCTION
     1.1 Prelude
     1.2 Need for sign language
     1.3 American sign language
     1.4 Gesture recognition
         1.4.1 Gesture recognition and pen computing
         1.4.2 Gesture types
         1.4.3 Uses
2    LITERATURE SURVEY
3    BACKGROUND
     3.1 Existing system
     3.2 Problem statement
4    METHODOLOGY
     4.1 Image capturing devices
         4.1.1 Challenges
     4.2 Significance of grayscale images
         4.2.1 Grayscale as single channel of multichannel color images
     4.3 Hand segmentation
         4.3.1 Threshold selection
         4.3.2 Adaptive thresholding
         4.3.3 Multiband thresholding
     4.4 Morphological operation
         4.4.1 Structuring element
         4.4.2 Image closing
         4.4.3 Effect of image closing
     4.5 Image registration
         4.5.1 Algorithm classifications
             4.5.1.1 Intensity based vs feature based
             4.5.1.2 Spatial vs Frequency domain methods
             4.5.1.3 Single vs Multi-modality methods
             4.5.1.4 Automatic vs Interactive methods
         4.5.2 Uncertainty
         4.5.3 Transformation methods
         4.5.4 Radon transform
     4.6 Feature Extraction
         4.6.1 Wavelets
         4.6.2 Wavelet transform
         4.6.3 The discrete wavelet transform
         4.6.4 2D-Discrete wavelet transform
5    OVERVIEW OF THE PROJECT
     5.1 An overlay of our algorithm
     5.2 Proposed work
6    SOFTWARE DESCRIPTION
     6.1 Introduction
     6.2 Features of Matlab
         6.2.1 Command window
         6.2.2 Graphics window
         6.2.3 Edit window
         6.2.4 Input output
         6.2.5 Data type
         6.2.6 Dimensioning
         6.2.7 Case sensitivity
     6.3 Images in Matlab
     6.4 File types
         6.4.1 M-files
         6.4.2 Script files
         6.4.3 Function files
         6.4.4 MAT-files
7    SIMULATION RESULT
8    CONCLUSION
9    REFERENCES
LIST OF FIGURES
FIGURE NO.    TITLE
1.1    ASL examples
4.1    A model gray image
4.2    Three channels of a RGB image
4.3    Original image
4.4    Example of a threshold effect used on an image
4.5    Structuring element
4.6    Effect of closing using 3x3 square structuring element
4.7    Multi-resolution expansion using Haar wavelet
5.1    Overlay of our algorithm
7.1    Resized image
7.2    Gray scale image
7.3    Segmented hand
7.4    Morphologically operated image
7.5    Normalized image
7.6    Horizontal vector of DWT
LIST OF ABBREVIATIONS
ASL       American Sign Language
CAD       Computer Aided Design
PUI       Perceptual User Interface
GUI       Graphical User Interface
HMI       Human Machine Interface
MRI       Magnetic Resonance Imaging
CT        Computed Tomography
PET       Positron Emission Tomography
DWT       Discrete Wavelet Transform
STFT      Short Time Fourier Transform
MATLAB    Matrix Laboratory
CHAPTER 1
INTRODUCTION
1.1. PRELUDE
Existing common computer input devices are adequate for many tasks. However,
computers are now so tightly integrated with everyday life that new
applications and hardware are constantly being introduced. The means of
communicating with computers at the moment are limited to keyboards, mice,
light pens, trackballs, keypads, etc. These devices have grown familiar but
inherently limit the speed and naturalness with which we interact with the
computer.
Recently, there has been a surge of interest in recognizing human hand
gestures. Hand gesture recognition has various applications, such as computer
games, machinery control (e.g. crane), and complete mouse replacement. One of
the most structured sets of gestures belongs to sign language. In sign language,
each gesture has an assigned meaning (or meanings).
Computer recognition of hand gestures may provide a more natural
human-computer interface, allowing people to point, or rotate a CAD model by
rotating their hands. Hand gestures can be classified into two categories:
static and dynamic. A static gesture is a particular hand configuration and
pose, represented by a single image. A dynamic gesture is a moving gesture,
represented by a sequence of images. We will focus on the recognition of static
gestures.
The reliance of speech-impaired communities on sign language results in
linguistic isolation from the general community. The overwhelming majority of
hearing people do not understand sign language. Many approaches for effective
man-machine communication have been proposed, such as voice, face, iris, retinal
scan and gesture recognition systems. Gesture recognition, along with facial
recognition, voice recognition, eye tracking and lip movement recognition, is a
component of what developers refer to as a perceptual user interface (PUI). The
goal of PUI is to enhance the efficiency and ease of use of the underlying
logical design of a stored program, a design discipline known as usability. In
personal computing, gestures are most often used for input commands.
Compared with face and voice features, hands require less complexity in
terms of imaging conditions. Consequently, hand-based recognition is friendlier,
less prone to disturbances and more robust to environmental conditions.
Our goal is to offer a sign recognition system as another means of
augmenting communication between speech-impaired people and the speaking
community. This wearable system would capture and recognize the user’s signing.
The user could then cue the system to generate text or speech. Recognizing
gestures as input makes computers more accessible to the physically impaired and
makes interaction more natural in a 3D virtual world environment. Hand and
body gestures can be amplified by a controller that contains accelerometers and
gyroscopes to sense tilting, rotation and acceleration of movement, or the
computing device can be outfitted with a camera so that software in the device
can recognize and interpret specific gestures. Conventional methods used in hand
gesture recognition systems are glove-based techniques, with embedded
accelerometers and multiple sensors, and computer vision based techniques.
The use of accelerometers demands additional hardware components and a power
supply. One hand gesture recognition system, using a sensing glove with 6
embedded accelerometers, recognizes 28 static hand gestures at a rate of
1 character/second. However, this algorithm is not efficient enough to be
applied in real time. Another recognition system, using colored gloves and a
neural network algorithm, was also introduced, but its success rate ranges from
70% to 93%. Although these systems can recognize hand gestures, wearing a
sensory glove is not convenient for daily use.
For computer vision based techniques, one or more cameras are used
to capture hand images for recognition. One such method works
without restricting backgrounds or using any markers. It first separates
the region of the hand gesture from complex background images by measuring
entropy from adjacent frames. A hand gesture is then recognized by the
approach of improved centroidal profile. However, mis-recognitions can be
caused by hand gestures with similar spatial features. Therefore the number of
hand gestures that can be recognized by that algorithm is limited. To be
effective, a vision system should be glove-free, fast and accurate, and should
use a small database. Moreover, the use of computer vision based techniques
increases the complexity of image recognition.
In addition to the technical challenges of implementing gesture
recognition, there are also social challenges. Gestures must be simple, intuitive
and universally acceptable. The study of gestures and other nonverbal types of
communication is known as kinesics.
The key problem in gesture interaction is how to make hand gestures
understood by computers. The present approaches can be mainly divided into
“Data-Glove based” and “Vision Based” approaches. The Data-Glove based
methods use sensor devices for digitizing hand and finger motions into
multi-parametric data. The extra sensors make it easy to collect hand
configuration and movement. However, the devices are quite expensive and
cumbersome for the users. In contrast, the Vision Based methods
require only a camera, thus realizing a natural interaction between humans and
computers without the use of any extra devices. These systems tend to
complement biological vision by describing artificial vision systems that are
implemented in software and hardware. This poses a challenging problem, as
these systems need to be background invariant, lighting insensitive, and person
and camera independent to achieve real-time performance. Moreover, such systems
must be optimized to meet the requirements, including accuracy and robustness.
In this project, a new approach to real-time hand gesture recognition is
developed. It is a recognition algorithm based on the Haar wavelet
representation. Hands are extracted by a skin color approach rather than by user
input. The problem of hand orientation in the image is solved by using the idea
of the axis of elongation, which helps keep the database small by standardizing
the hand gestures in fixed orientations. We then introduce a new approach to
disperse hand features in the image, which improves the success rate.
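The idea of the axis of elongation can be illustrated with image moments. The following is a minimal Python sketch and not the project's Matlab code: the function name `principal_axis_angle` and the second-order moment formula are assumptions chosen for illustration, since the text only states that the axis of elongation is used to standardize orientation.

```python
import math


def principal_axis_angle(mask):
    """Return the angle (radians) of the axis of elongation of a binary mask.

    mask is a list of rows; nonzero entries are treated as hand pixels.
    The angle is measured from the x axis towards the y axis (y grows
    downwards, as in image coordinates).
    """
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not points:
        raise ValueError("mask contains no object pixels")
    n = len(points)
    # Centroid of the object pixels.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Second-order central moments.
    mu20 = sum((x - cx) ** 2 for x, _ in points) / n
    mu02 = sum((y - cy) ** 2 for _, y in points) / n
    mu11 = sum((x - cx) * (y - cy) for x, y in points) / n
    # Orientation of the principal axis of the pixel distribution.
    return 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)


diagonal = [[1, 0, 0],
            [0, 1, 0],
            [0, 0, 1]]
print(math.degrees(principal_axis_angle(diagonal)))
```

Rotating the segmented hand by the negative of this angle would bring every gesture to a common orientation, which is why fewer reference images need to be stored.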
1.2. NEED FOR SIGN LANGUAGE
Creating a proper sign language (ASL – American Sign Language in this
case) dictionary is not the desired result at this point. This would require
the system to understand advanced grammar and syntax structure, which is
outside the scope of this project. American Sign Language will be used as
the database since it is a tightly structured set. From that point, further
applications can be developed. In the distant (or near) future, computer
interfaces could combine the usual input devices with gesture recognition, so
that some of the user’s feelings would be perceived as well.
Taking ASL recognition further, a full real-time dictionary could be created
with the use of video. As mentioned before, this would require some Artificial
Intelligence for grammar and syntax purposes.
Another application is the annotation of large databases, which is far more
efficient when properly executed by a computer than by a human.
1.3. AMERICAN SIGN LANGUAGE
American Sign Language is the language of choice for most deaf people in
the United States. It is part of the “deaf culture” and includes its own system of
puns, inside jokes, etc. However, ASL is one of the many sign languages of the
world. As an English speaker would have trouble understanding someone
speaking Japanese, a speaker of ASL would have trouble understanding the Sign
Language of Sweden. ASL also has its own grammar that is different from
English. ASL consists of approximately 6000 gestures of common words with
finger spelling used to communicate obscure words or proper nouns. Finger
spelling uses one hand and 26 gestures to communicate the 26 letters of the
alphabet. Some of the signs can be seen in Fig 1.1.
Fig 1.1 ASL examples
Another interesting characteristic that will be ignored by this project is the
ability that ASL offers to describe a person, place or thing and then point to a
place in space to temporarily store for later reference.
ASL uses facial expressions to distinguish between statements, questions
and directives. The eyebrows are raised for a question, held normal for a
statement, and furrowed for a directive. Although there has been considerable
work and research in facial feature recognition, facial features will not be
used to aid recognition in the task addressed. This would be feasible in a full
real-time ASL dictionary.
1.4. GESTURE RECOGNITION
Gesture recognition is a language technology with the goal of interpreting
human gestures via mathematical algorithms. Gestures can originate from any
bodily motion or state but commonly originate from the face or hand. Current
focuses in the field include emotion recognition from the face and hand gesture
recognition. Many approaches have been made using cameras and computer
vision algorithms to interpret sign language. However, the identification and
recognition of posture, gait, proxemics, and human behaviors is also the subject
of gesture recognition techniques.
Gesture recognition can be seen as a way for computers to begin to
understand human body language, thus building a richer bridge between
machines and humans than primitive text user interfaces or even GUIs (graphical
user interfaces), which still limit the majority of input to keyboard and mouse.
Gesture recognition enables humans to interface with the machine (HMI)
and interact naturally without any mechanical devices. Using the concept of
gesture recognition, it is possible to point a finger at the computer screen so that
the cursor will move accordingly. This could potentially make conventional input
devices such as mouse, keyboards and even touch-screens redundant.
Gesture recognition can be conducted with techniques from computer
vision and image processing.
The literature includes ongoing work in the computer vision field on
capturing gestures or more general human pose and movements by cameras
connected to a computer.
1.4.1. GESTURE RECOGNITION AND PEN COMPUTING
In some literature, the term gesture recognition has been used to refer more
narrowly to non-text-input handwriting symbols, such as inking on a graphics
tablet, multi-touch gestures, and mouse gesture recognition. This is computer
interaction through the drawing of symbols with a pointing device cursor.
1.4.2. GESTURE TYPES
In computer interfaces, two types of gestures are distinguished:
• Offline gestures:
Those gestures that are processed after the user interaction with the
object. An example is the gesture to activate a menu.
• Online gestures:
Direct manipulation gestures. They are used to scale or rotate a
tangible object.
1.4.3. USES
Gesture recognition is useful for processing information from humans
which is not conveyed through speech or type. As well, there are various types of
gestures which can be identified by computers.
• Sign language recognition:
Just as speech recognition can transcribe speech to text, certain
types of gesture recognition software can transcribe the symbols
represented through sign language into text.
• For socially assistive robotics:
By using proper sensors (accelerometers and gyros) worn on the
body of a patient and by reading the values from those sensors, robots can
assist in patient rehabilitation. The best example can be stroke
rehabilitation.
• Directional indication through pointing:
Pointing has a very specific purpose in our society, to reference an
object or location based on its position relative to ourselves. The use of
gesture recognition to determine where a person is pointing is useful for
identifying the context of statements or instructions. This application is of
particular interest in the field of robotics.
• Control through facial gestures:
Controlling a computer through facial gestures is a useful
application of gesture recognition for users who may not physically be able
to use a mouse or keyboard. Eye tracking in particular may be of use for
controlling cursor motion or focusing on elements of a display.
• Alternative computer interfaces:
Foregoing the traditional keyboard and mouse setup to interact with
a computer, strong gesture recognition could allow users to accomplish
frequent or common tasks using hand or face gestures to a camera.
• Immersive game technology:
Gestures can be used to control interactions within video games to
try and make the game player's experience more interactive or immersive.
• Virtual controllers:
For systems where the act of finding or acquiring a physical
controller could require too much time, gestures can be used as an
alternative control mechanism. Controlling secondary devices in a car, or
controlling a television set are examples of such usage.
• Affective computing:
In affective computing, gesture recognition is used in the process of
identifying emotional expression through computer systems.
• Remote control:
Through the use of gesture recognition, "remote control with the
wave of a hand" of various devices is possible. The signal must not only
indicate the desired response, but also which device is to be controlled.
CHAPTER 2
LITERATURE SURVEY
A hand gesture analysis system based on a three-dimensional hand
skeleton model with 27 degrees of freedom was developed by Lee and Kunii.
They incorporated five major constraints based on the human hand kinematics to
reduce the model parameter space search. To simplify the model matching,
specially marked gloves were used.
Full ASL recognition systems (words, phrases) incorporate data gloves.
Takahashi and Kishino discuss a data glove-based system that could recognize 34
of the 46 Japanese gestures (user dependent) using a joint angle and hand
orientation coding technique. From their paper, it seems the test user made each
of the 46 gestures 10 times to provide data for principal component and cluster
analysis. A separate test was created from five iterations of the alphabet by the
user, with each gesture well separated in time. While these systems are
technically interesting, they suffer from a lack of training.
Excellent work has been done in support of machine sign language
recognition by Sperling and Parish, who have done careful studies on the
bandwidth necessary for a sign conversation using spatially and temporally sub-
sampled images. Point light experiments (where “lights” are attached to
significant locations on the body and just these points are used for recognition),
have been carried out by Poizner.
CHAPTER 3
BACKGROUND
3.1. EXISTING SYSTEM
The key problem in gesture interaction is how to make hand gestures
understood by computers. The approaches present can be mainly divided into
“Data-Glove based”, “Vision Based” and “Analysis of Drawing Gestures”
approaches.
Research on hand gestures can be classified into three categories. The first
category, glove based analysis, employs sensors (mechanical or optical) attached
to a glove that transduces finger flexions into electrical signals for determining
the hand posture. The relative position of the hand is determined by an additional
sensor. This sensor is normally a magnetic or an acoustic sensor attached to the
glove. These methods use sensor devices for digitizing hand and finger motions
into multi-parametric data. The extra sensors make it easy to collect hand
configuration and movement. However, the devices are quite expensive and
cumbersome for the users. For some data glove applications,
look-up table software toolkits are provided with the glove to be used for hand
posture recognition.
The second category, vision based analysis, is based on the way human
beings perceive information about their surroundings, yet it is probably the most
difficult to implement in a satisfactory way. Several different approaches have
been tested so far. One is to build a three-dimensional model of the human hand.
The model is matched to images of the hand by one or more cameras, and
parameters corresponding to palm orientation and joint angles are estimated.
These parameters are then used to perform gesture classification. Another Vision
Based method requires only a camera, thus realizing a natural interaction between
humans and computers without the use of any extra devices. These systems tend
to complement biological vision by describing artificial vision systems that are
implemented in software and hardware. This poses a challenging problem as
these systems need to be background invariant, lighting insensitive, person and
camera independent to achieve real time performance. Moreover, such systems
must be optimized to meet the requirements, including accuracy and robustness.
The third category, analysis of drawing gestures, usually involves the use
of a stylus as an input device. Analysis of drawing gestures can also lead to
recognition of written text. The vast majority of hand gesture recognition work
has used mechanical sensing, most often for direct manipulation of a virtual
environment and occasionally for symbolic communication. Sensing the hand
posture mechanically has a range of problems, however, including reliability,
accuracy and electromagnetic noise. Visual sensing has the potential to make
gestural interaction more practical, but potentially embodies some of the most
difficult problems in machine vision. The hand is a non-rigid object and, even
worse, self-occlusion is very common.
3.2. PROBLEM STATEMENT
Existing systems make use of sensors, accelerometers and sensing
gloves. All these methods require a large number of hardware components.
Sweat produced by the hand reduces the efficiency of the tactile sensors,
and the efficiency of the sensors also diminishes through wear caused by
aging. Moreover, we cannot expect people to move around wearing a sensing glove.
CHAPTER 4
METHODOLOGY
4.1. IMAGE CAPTURING DEVICES
The ability to track a person's movements and determine what gestures
they may be performing can be achieved through various tools. Although there is
a large amount of research done in image/video based gesture recognition, there
is some variation within the tools and environments used between
implementations.
• Depth-aware cameras:
Using specialized cameras such as time-of-flight cameras, one can
generate a depth map of what is being seen through the camera at a short
range, and use this data to approximate a 3-D representation of what is
being seen. These can be effective for detection of hand gestures due to
their short range capabilities.
• Stereo cameras:
Using two cameras whose relations to one another are known, a 3-D
representation can be approximated by the output of the cameras. To get
the cameras' relations, one can use a positioning reference such as a
lexian-stripe or infrared emitters. In combination with direct motion
measurement (6D-Vision) gestures can directly be detected.
• Controller-based gestures:
These controllers act as an extension of the body so that when
gestures are performed, some of their motion can be conveniently captured
by software. Mouse gestures are one such example, where the motion of
the mouse is correlated to a symbol being drawn by a person's hand, as is
the Wii Remote, which can study changes in acceleration over time to
represent gestures.
• Single camera:
A normal camera can be used for gesture recognition where the
resources/environment would not be convenient for other forms of image-based
recognition. Although not necessarily as effective as stereo or depth
aware cameras, using a single camera allows a greater possibility of
accessibility to a wider audience.
4.1.1. CHALLENGES
There are many challenges associated with the accuracy and usefulness of
gesture recognition software. For image-based gesture recognition there are
limitations on the equipment used and image noise. Images or video may not be
under consistent lighting, or in the same location. Items in the background or
distinct features of the users may make recognition more difficult.
The variety of implementations for image-based gesture recognition may
also limit the viability of the technology for general use. For example, an
algorithm calibrated for one camera may not work for a different camera. The
amount of background noise also causes tracking and recognition difficulties,
especially when occlusions (partial and full) occur. Furthermore, the distance
from the camera, and the camera's resolution and quality, also cause variations in
recognition accuracy.
In order to capture human gestures by visual sensors, robust computer
vision methods are also required, for example for hand tracking and hand posture
recognition or for capturing movements of the head, facial expressions or gaze
direction.
The recognition problem is approached through a matching process in
which the segmented hand is compared with all the postures in the system’s
memory using the Hausdorff distance. The system’s visual memory stores all the
recognizable postures, their distance transforms, their edge maps and
morphological information. This data makes the comparison faster and more
robust, properly classifying postures, even those which are similar, and saving
valuable time needed for real-time processing. The postures included in the
visual memory may be initialized by the human user, learned or trained from
previously tracked hand motion, or generated during the recognition process.
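The matching step described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions: postures are represented as lists of (x, y) edge points, the names `hausdorff` and `classify` and the `memory` dictionary are hypothetical, and the brute-force search shown here does not use the distance transforms that the text says are stored to speed up the comparison.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point of a to its nearest point of b."""
    return max(min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def classify(query, memory):
    """Return the label of the stored posture closest to the query."""
    return min(memory, key=lambda label: hausdorff(query, memory[label]))


# Hypothetical visual memory with two stored postures.
memory = {
    "A": [(0, 0), (1, 0)],
    "B": [(0, 0), (0, 5)],
}
print(classify([(0, 0), (1, 1)], memory))
```

A small Hausdorff distance means that every point of one shape lies close to some point of the other, so the stored posture with the smallest distance is taken as the match.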
4.2. SIGNIFICANCE OF GRAYSCALE IMAGES
The image captured by the camera is in RGB form. In order to reduce
complexity in hand segmentation, we convert the RGB image to a grayscale image.
A grayscale (or gray level) image is simply one in which the only colors
are shades of gray. The reason for differentiating such images from any other sort
of color image is that less information needs to be provided for each pixel. In fact,
a “gray” color is one in which the red, green and blue components all have equal
intensity in RGB space, and so it is only necessary to specify a single intensity
value for each pixel, as opposed to the three intensities needed to specify each
pixel in a full color image.
Fig 4.1 A model gray image
Often, the grayscale intensity is stored as an 8-bit integer giving 256
possible different shades of gray from black to white. If the levels are evenly
spaced then the difference between successive gray levels is significantly better
than the gray level resolving power of the human eye.
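The RGB to grayscale conversion can be sketched in a few lines. This is a minimal Python illustration, not the report's Matlab code: the weights 0.299, 0.587 and 0.114 are the widely used ITU-R BT.601 luminance coefficients and are an assumption here, because the text does not state which conversion was applied. The function name `rgb_to_gray` is likewise hypothetical.

```python
def rgb_to_gray(image):
    """Convert rows of (r, g, b) tuples in 0..255 to rows of 8-bit gray levels."""
    gray = []
    for row in image:
        gray.append([
            # Weighted sum of the three channels, rounded and clipped to 8 bits.
            min(255, int(round(0.299 * r + 0.587 * g + 0.114 * b)))
            for r, g, b in row
        ])
    return gray


image = [[(0, 0, 0), (255, 255, 255)],
         [(255, 0, 0), (0, 255, 0)]]
print(rgb_to_gray(image))
```

Each output pixel needs one 8-bit value instead of three, which is the saving the paragraph above refers to.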
4.2.1. GRAYSCALE AS SINGLE CHANNEL OF MULTICHANNEL
COLOUR IMAGES
Color images are often built of several stacked color channels, each of
them representing value levels of the given channel. For example, RGB images
are composed of three independent channels for red, green and blue primary
color components.
Here is an example of color channel splitting of a full RGB color image.
The column at left shows the isolated color channels in natural colors, while at
right there are their grayscale equivalents:
Fig 4.2 Three channels of a RGB image
The reverse is also possible: a full color image can be built from its separate
grayscale channels. By mangling channels, using offsets, rotating and other
manipulations, artistic effects can be achieved instead of accurately reproducing
the original image.
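Channel splitting and its reverse can be shown with a short Python sketch. The helper names `split_channels` and `merge_channels` are hypothetical and the image is assumed to be a list of rows of (r, g, b) tuples.

```python
def split_channels(image):
    """Return the red, green and blue planes as three grayscale images."""
    red = [[pixel[0] for pixel in row] for row in image]
    green = [[pixel[1] for pixel in row] for row in image]
    blue = [[pixel[2] for pixel in row] for row in image]
    return red, green, blue


def merge_channels(red, green, blue):
    """Rebuild a color image from three grayscale planes of equal size."""
    return [list(zip(r, g, b)) for r, g, b in zip(red, green, blue)]


image = [[(10, 20, 30), (40, 50, 60)]]
red, green, blue = split_channels(image)
print(red, green, blue)
print(merge_channels(red, green, blue) == image)
```

Passing the planes to `merge_channels` in a different order, for example `merge_channels(blue, green, red)`, is one simple instance of the channel manipulation mentioned above.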
4.3. HAND SEGMENTATION
Thresholding is the simplest method of image segmentation. From a
grayscale image, thresholding can be used to create binary images. The key
parameter in the thresholding process is the choice of the threshold value. During
the thresholding process, individual pixels in an image are marked as “object”
pixels if their value is greater than some threshold value (assuming an object to
be brighter than the background) and as “background” pixels otherwise. This
convention is known as threshold above. Variants include threshold below, which
is the opposite of threshold above; threshold inside, where a pixel is labeled
“object” if its value is between two thresholds; and threshold outside, which is
the opposite of threshold inside. Typically, an object pixel is given a value of
“1” while a background pixel is given a value of “0”. Finally, a binary image is
created by coloring each pixel white or black, depending on the pixel's label.
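The four thresholding conventions can be sketched as one small Python function. The function name `threshold`, the `mode` strings and the treatment of pixels exactly equal to a threshold (they count as background for "above" and "inside") are assumptions made for this illustration.

```python
def threshold(gray, low, high=None, mode="above"):
    """Return a binary image (1 = object, 0 = background) from a gray image."""
    if mode not in ("above", "below", "inside", "outside"):
        raise ValueError("unknown mode: " + mode)
    if mode in ("inside", "outside") and high is None:
        raise ValueError("inside/outside need two thresholds")

    def is_object(value):
        if mode == "above":
            return value > low
        if mode == "below":
            return not value > low
        inside = low < value < high
        return inside if mode == "inside" else not inside

    return [[1 if is_object(v) else 0 for v in row] for row in gray]


gray = [[10, 100, 200]]
print(threshold(gray, 100))                        # threshold above
print(threshold(gray, 100, mode="below"))          # threshold below
print(threshold(gray, 50, 150, mode="inside"))     # threshold inside
print(threshold(gray, 50, 150, mode="outside"))    # threshold outside
```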
Fig 4.3 Original Image
Fig 4.4 Example of a threshold effect used on an image
4.3.1. THRESHOLDING SELECTION
The key parameter in the thresholding process is the choice of the
threshold value (or values, as mentioned earlier). Several different methods for
choosing a threshold exist; users can manually choose a threshold value, or a
thresholding algorithm can compute a value automatically, which is known as
automatic thresholding. A simple method would be to choose the mean or median
value, the rational being that if the object pixels are brighter than the background,
they should also be brighter than the average. In a noiseless image with uniform
background and object values, the mean or median will work well as threshold,
however, this will generally not be the case. A more sophisticated approach
might be to create a histogram of the image pixel intensities and use the valley
point as the threshold. The histogram approach assumes that there is some
average value for the background and object pixels, but that the actual pixel
values have some variation around these average values. However, this may be
computationally expensive, and image histograms may not have clearly defined
valley points, often making the selection of an accurate threshold difficult.
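The two automatic selection rules discussed above, the mean value and the histogram valley point, can be sketched as below. The valley search is one simple heuristic chosen for illustration (the two tallest, well separated histogram bins are taken as the background and object peaks); the names `mean_threshold`, `valley_threshold` and `min_separation` are assumptions, and pixel values are assumed to be integers in the range 0 to levels − 1.

```python
def mean_threshold(pixels):
    """Threshold at the average gray level."""
    return sum(pixels) / len(pixels)


def valley_threshold(pixels, levels=256, min_separation=32):
    """Threshold at the histogram valley between the two main peaks."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Tallest bin, then the tallest bin at least min_separation away from it.
    first = max(range(levels), key=lambda i: hist[i])
    others = [i for i in range(levels) if abs(i - first) >= min_separation]
    second = max(others, key=lambda i: hist[i])

    lo, hi = sorted((first, second))
    middle = (lo + hi) / 2
    # Lowest bin between the peaks; ties go to the bin nearest the midpoint.
    return min(range(lo, hi + 1), key=lambda i: (hist[i], abs(i - middle)))


# Dark background around 40-45, bright object around 200-210, a little noise.
pixels = [40] * 50 + [45] * 30 + [120] * 2 + [200] * 40 + [210] * 20
t_mean = mean_threshold(pixels)
t_valley = valley_threshold(pixels)
```

Both values fall between the two clusters here because the sample histogram is cleanly bimodal; as noted above, real histograms often lack such a clear valley.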
4.3.2. ADAPTIVE THRESHOLDING
Thresholding is called adaptive thresholding when a different
threshold is used for different regions in the image. This may also be known as
local or dynamic thresholding.
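A minimal sketch of adaptive thresholding is given below, assuming the simplest possible scheme: the image is cut into square blocks and each block is thresholded at its own mean. The function name `adaptive_threshold` and the parameters `block` and `offset` are illustrative assumptions, not part of the project code.

```python
def adaptive_threshold(image, block=2, offset=0):
    """Threshold each block x block region at its own local mean."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cells = [(r, c)
                     for r in range(r0, min(r0 + block, rows))
                     for c in range(c0, min(c0 + block, cols))]
            local_mean = sum(image[r][c] for r, c in cells) / len(cells)
            for r, c in cells:
                out[r][c] = 1 if image[r][c] > local_mean + offset else 0
    return out


# Left half is dimly lit, right half brightly lit; in each half the
# second column is slightly brighter than the first.
image = [[10, 20, 200, 220],
         [10, 20, 200, 220]]
local_result = adaptive_threshold(image, block=2)
```

A single global threshold at the image mean would label the whole bright half as object, whereas the block-wise threshold recovers the brighter column in each half.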
4.3.3. MULTIBAND THRESHOLDING
Color images can also be thresholded. One approach is to
designate a separate threshold for each of the RGB components of the image and
then combine them with an AND operation. This reflects the way the camera
works and how the data is stored in the computer, but it does not correspond to
the way that people recognize color.
Therefore, it is easier to choose a threshold value for a grayscale
image than for a color image.
4.4. MORPHOLOGICAL OPERATION
While point and neighborhood operations are generally designed to alter
the look or appearance of an image for visual considerations, morphological
operations are used to understand the structure or form of an image. This usually
means identifying objects or boundaries within an image. Morphological
operations play a key role in applications such as machine vision and automatic
object detection.
4.4.1. STRUCTURING ELEMENT
In mathematical morphology, a structuring element is a shape, used to
probe or interact with a given image, with the purpose of drawing conclusions on
how this shape fits or misses the shapes in the image. It is typically used in
morphological operations, such as dilation, erosion, opening, and closing, as well
as the hit-or-miss transform.
According to Georges Matheron, knowledge about an object depends on
the manner in which we probe (observe) it. In particular, the choice of a certain
structuring element for a particular morphological operation influences the
information one can obtain. There are two main characteristics that are directly
related to structuring elements.
• Shape
For example, the structuring element can be a "ball" or a line;
convex or a ring, etc. By choosing a particular structuring element, one sets
a way of differentiating some objects from others, according to their shape
or spatial orientation.
• Size
For example, one structuring element can be a small square and another
a large square. Setting the size of the structuring element is similar to
setting the observation scale, and setting the criterion to differentiate
image objects or features according to size.
SE = strel ('disk', R, N) creates a flat, disk-shaped structuring element,
where R specifies the radius. R must be a non-negative integer. N must be 0, 4, 6,
or 8.
When N is greater than 0, the disk-shaped structuring element is
approximated by a sequence of N periodic-line structuring elements. When N
equals 0, no approximation is used, and the structuring element members consist
of all pixels whose centers are no greater than R away from the origin. If N is not
specified, the default value is 4.
Fig 4.5 Structuring Element
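The unapproximated case (N equal to 0) can be illustrated with a short sketch that marks every pixel whose center lies within R of the origin. This is plain Python written for illustration under that stated rule; it is not the MATLAB strel implementation, and the function name `disk` is an assumption.

```python
def disk(radius):
    """Flat disk-shaped structuring element of size (2R+1) x (2R+1).

    A cell is 1 when its center is no farther than `radius` from the
    central (origin) cell, and 0 otherwise.
    """
    size = 2 * radius + 1
    return [[1 if (x - radius) ** 2 + (y - radius) ** 2 <= radius ** 2 else 0
             for x in range(size)]
            for y in range(size)]


for row in disk(2):
    print(row)
```

For a radius of 2 this prints a 5×5 neighborhood whose rows contain 1, 3, 5, 3 and 1 ones respectively.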
4.4.2. IMAGE CLOSING
Closing is an important operator from the field of mathematical
morphology. Like its dual operator opening, it can be derived from the
fundamental operations of erosion and dilation. Like those operators it is
normally applied to binary images, although there are gray level versions.
Closing is similar in some ways to dilation in that it tends to enlarge the
boundaries of foreground (bright) regions in an image (and shrink background
color holes in such regions), but it is less destructive of the original boundary
shape. As with other morphological operators, the exact operation is determined
by a structuring element. The effect of the operator is to preserve background
regions that have a similar shape to this structuring element, or that can
completely contain the structuring element, while eliminating all other regions of
background pixels.
Closing is opening performed in reverse. It is defined simply as dilation
followed by erosion using the same structuring element for both operations. See
the sections on erosion and dilation for details of the individual steps. The closing
operator therefore requires two inputs: an image to be closed and a structuring
element.
Gray level closing consists straightforwardly of a gray level dilation
followed by gray level erosion.
Closing is the dual of opening, i.e. closing the foreground pixels with a
particular structuring element is equivalent to opening the background with the
same element.
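Binary closing as dilation followed by erosion can be sketched as below. This is an illustrative plain-Python version, not the routine used in the project: the structuring element is given as a list of (row, column) offsets, it is assumed to be symmetric about its origin, and pixels outside the image are treated as background. All function names are assumptions.

```python
def dilate(img, se):
    """Set every pixel reachable from a foreground pixel by an SE offset."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if img[r][c]:
                for dr, dc in se:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        out[rr][cc] = 1
    return out


def erode(img, se):
    """Keep a pixel only if the SE placed on it fits inside the foreground."""
    rows, cols = len(img), len(img[0])
    return [[int(all(0 <= r + dr < rows and 0 <= c + dc < cols
                     and img[r + dr][c + dc] for dr, dc in se))
             for c in range(cols)]
            for r in range(rows)]


def close(img, se):
    """Closing: dilation followed by erosion with the same SE."""
    return erode(dilate(img, se), se)


square3 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

# A 3x3 blob with a one-pixel hole in its center, inside a 7x7 image.
img = [[0] * 7 for _ in range(7)]
for r in range(2, 5):
    for c in range(2, 5):
        img[r][c] = 1
img[3][3] = 0

closed = close(img, square3)
```

In this example the closing fills the central hole and leaves the outer boundary of the blob where it was; closing the result a second time changes nothing.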
4.4.3. EFFECT OF IMAGE CLOSING
One of the uses of dilation is to fill in small background color holes in
images, e.g. `pepper noise'. One of the problems with doing this, however, is that
the dilation will also distort all regions of pixels indiscriminately. By performing
an erosion on the image after the dilation, i.e. a closing, we reduce some of this
effect. The effect of closing can be quite easily visualized. Imagine taking the
structuring element and sliding it around outside each foreground region, without
changing its orientation. For any background boundary point, if the structuring
element can be made to touch that point, without any part of the element being
inside a foreground region, then that point remains background. If this is not
possible, then the pixel is set to foreground. After the closing has been carried out
the background region will be such that the structuring element can be made to
cover any point in the background without any part of it also covering a
foreground point, and so further closings will have no effect. This property is
known as idempotence. The effect of a closing on a binary image using a 3×3
square structuring element is illustrated in Fig 4.6.
Fig 4.6 Effect of closing using a 3×3 square structuring element
As with erosion and dilation, this particular 3×3 structuring element is
the most commonly used, and in fact many implementations will have it
hardwired into their code, in which case it is obviously not necessary to specify a
separate structuring element. To achieve the effect of a closing with a larger
structuring element, it is possible to perform multiple dilations followed by the
same number of erosions.
Closing can sometimes be used to selectively fill in particular background
regions of an image. Whether or not this can be done depends upon whether a
suitable structuring element can be found that fits well inside regions that are to
be preserved, but doesn't fit inside regions that are to be removed.
4.5. IMAGE REGISTRATION
Image registration is the process of overlaying two or more images of the
same scene taken at different times, from different viewpoints, and/or by
different sensors. It geometrically aligns two images—the reference and sensed
images. The differences between the images arise from the different
imaging conditions. Image registration is a crucial step in all image analysis tasks
in which the final information is gained from the combination of various data
sources like in image fusion, change detection, and multichannel image
restoration. Typically, registration is required in remote sensing (multispectral
classification, environmental monitoring, change detection, image mosaicing,
weather forecasting, creating super-resolution images, integrating information
into geographic information systems (GIS)), in medicine (combining computer
tomography (CT) and NMR data to obtain more complete information about the
patient, monitoring tumor growth, treatment verification, comparison of the
patient’s data with anatomical atlases), in cartography (map updating), and in
computer vision (target localization, automatic quality control), to name a few.
4.5.1. ALGORITHM CLASSIFICATIONS
4.5.1.1. INTENSITY BASED VS FEATURE BASED
Image registration or image alignment algorithms can be classified into
intensity-based and feature-based. One of the images is referred to as the
reference or source and the second image is referred to as the target or sensed.
Image registration involves spatially transforming the target image to align with
the reference image. Intensity-based methods compare intensity patterns in
images via correlation metrics, while feature-based methods find correspondence
between image features such as points, lines, and contours. Intensity-based
methods register entire images or sub images. If sub images are registered,
centers of corresponding sub images are treated as corresponding feature points.
Feature-based methods establish correspondence between a number of points in
images. Knowing the correspondence between a number of points in images, a
transformation is then determined to map the target image to the reference
image, thereby establishing point-by-point correspondence between the
reference and target images.
4.5.1.2. SPATIAL VS FREQUENCY DOMAIN METHODS
Spatial methods operate in the image domain, matching intensity patterns
or features in images. Some of the feature matching algorithms are outgrowths of
traditional techniques for performing manual image registration, in which an
operator chooses corresponding control points (CPs) in images. When the
number of control points exceeds the minimum required to define the appropriate
transformation model, iterative algorithms like RANSAC can be used to robustly
estimate the parameters of a particular transformation type (e.g. affine) for
registration of the images.
Frequency-domain methods find the transformation parameters for
registration of the images while working in the transform domain. Such methods
work for simple transformations, such as translation, rotation, and scaling.
Applying the Phase correlation method to a pair of images produces a third image
which contains a single peak. The location of this peak corresponds to the relative
translation between the images. Unlike many spatial-domain algorithms, the
phase correlation method is resilient to noise, occlusions, and other defects
typical of medical or satellite images. Additionally, the phase correlation uses the
fast Fourier transform to compute the cross-correlation between the two images,
generally resulting in large performance gains. The method can be extended to
determine rotation and scaling differences between two images by first
converting the images to log-polar coordinates. Due to properties of the Fourier
transform, the rotation and scaling parameters can be determined in a manner
invariant to translation.
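The idea behind phase correlation can be shown on a one-dimensional signal, which keeps the example short; images work the same way with a 2-D transform. The sketch below is an illustration only: it uses a naive O(n²) discrete Fourier transform instead of a fast Fourier transform, assumes a circular shift, and the function names are assumptions.

```python
import cmath


def dft(values, inverse=False):
    """Naive discrete Fourier transform (adequate for short signals)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for m in range(n):
        total = sum(values[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                    for k in range(n))
        out.append(total / n if inverse else total)
    return out


def phase_correlation_shift(reference, target):
    """Estimate the circular shift that maps reference onto target."""
    ref_f = dft(reference)
    tgt_f = dft(target)
    cross = []
    for f, g in zip(ref_f, tgt_f):
        product = g * f.conjugate()
        magnitude = abs(product)
        # Keep only the phase of the cross-power spectrum.
        cross.append(product / magnitude if magnitude > 1e-12 else 0j)
    surface = dft(cross, inverse=True)
    # The correlation surface has a single peak at the relative shift.
    return max(range(len(surface)), key=lambda i: surface[i].real)


reference = [0, 1, 3, 7, 2, 0, 0, 1]
target = reference[-3:] + reference[:-3]   # reference delayed by 3 samples
shift = phase_correlation_shift(reference, target)
```

Because only the phase of each frequency component is kept, the inverse transform collapses to a sharp peak whose index is the translation between the two signals.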
4.5.1.3. SINGLE VS MULTI-MODALITY METHODS
Another classification can be made between single-modality and
multi-modality methods. Single-modality methods tend to register images in the
same modality acquired by the same scanner/sensor type, while multi-modality
registration methods tend to register images acquired by different
scanner/sensor types.
Multi-modality registration methods are often used in medical
imaging as images of a subject are frequently obtained from different scanners.
Examples include registration of brain CT/MRI images or whole body PET/CT
images for tumor localization, registration of contrast-enhanced CT images
against non-contrast-enhanced CT images for segmentation of specific parts of
the anatomy, and registration of ultrasound and CT images for prostate
localization in radiotherapy.
4.5.1.4. AUTOMATIC VS INTERACTIVE METHODS
Registration methods may be classified based on the level of
automation they provide. Manual, interactive, semi-automatic, and automatic
methods have been developed. Manual methods provide tools to align the images
manually. Interactive methods reduce user bias by performing certain key
operations automatically while still relying on the user to guide the registration.
Semi-automatic methods perform more of the registration steps automatically but
depend on the user to verify the correctness of a registration. Automatic methods
do not allow any user interaction and perform all registration steps automatically.
4.5.2. UNCERTAINTY
There is a level of uncertainty associated with registering images that
have any spatio-temporal differences. A confident registration with a measure of
uncertainty is critical for many change detection applications such as medical
diagnostics.
In remote sensing applications where a digital image pixel may
represent several kilometers of spatial distance (such as NASA's LANDSAT
imagery), an uncertain image registration can mean that a solution could be
several kilometers from ground truth. Several notable papers have attempted to
quantify uncertainty in image registration in order to compare results. However,
many approaches to quantifying uncertainty or estimating deformations are
computationally intensive or are only applicable to limited sets of spatial
transformations.
4.5.3. TRANSFORMATION METHODS
Image registration algorithms can also be classified according to the
transformation models they use to relate the target image space to the reference
image space.
The first broad category of transformation models includes linear
transformations, which include translation, rotation, scaling, and other affine
transforms. Linear transformations are global in nature, thus, they cannot model
local geometric differences between images.
The second category of transformations allow 'elastic' or 'nonrigid'
transformations. These transformations are capable of locally warping the target
image to align with the reference image. Nonrigid transformations include radial
basis functions (thin-plate or surface splines, multiquadrics, and compactly-
supported transformations), physical continuum models (viscous fluids), and
large deformation models (diffeomorphisms).
4.5.4. RADON TRANSFORM
The Radon transform of a 2-D function f (x, y) is defined as:
\[ R(r, \theta) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, \delta(r - x\cos\theta - y\sin\theta)\, dx\, dy \]
where r is the perpendicular distance of a line from the origin and θ is the angle
between the line and the y-axis. According to the Fourier slice theorem, this
transformation is invertible and the 1-D Fourier transforms of the Radon
transform along r are the 1-D radial samples of the 2-D Fourier transform of
f (x, y) at the corresponding angles. In this project we use the Radon
transform. The radon function computes the Radon transform, which is the
projection of the image intensity along a radial line oriented at a specific angle.
R = radon(I, theta) returns the Radon transform of the intensity
image I for the angle theta degrees. If theta is a scalar, the result R is a
column vector containing the Radon transform for theta degrees. If theta is
a vector, then R is a matrix in which each column is the Radon transform for one
of the angles in theta. If you omit theta, it defaults to 0:179.
[R, xp] = radon(...) returns a vector xp containing the radial coordinates
corresponding to each row of R.
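A crude discrete version of the Radon projection can be sketched as follows. It is not the MATLAB radon algorithm: each pixel is simply dropped into the nearest radial bin r = x cos θ + y sin θ, with the origin at the image center. The helper `dominant_angle` shows one plausible way to use the projections for orientation estimation (pick the angle whose projection has the tallest peak); that is an assumption for illustration, not a confirmed description of the normalization step used in this project.

```python
import math


def radon_projection(image, theta_deg):
    """Project the image onto the radial axis at angle theta (degrees)."""
    rows, cols = len(image), len(image[0])
    cy, cx = (rows - 1) / 2, (cols - 1) / 2
    half = int(math.ceil(math.hypot(rows, cols) / 2)) + 1
    proj = [0.0] * (2 * half + 1)
    t = math.radians(theta_deg)
    for row in range(rows):
        for col in range(cols):
            x, y = col - cx, cy - row          # y axis points upward
            r = x * math.cos(t) + y * math.sin(t)
            proj[int(math.floor(r + 0.5)) + half] += image[row][col]
    return proj


def dominant_angle(image, angles):
    """Angle whose projection has the highest single peak (hypothetical)."""
    return max(angles, key=lambda a: max(radon_projection(image, a)))


vertical_line = [[0, 1, 0],
                 [0, 1, 0],
                 [0, 1, 0]]
p0 = radon_projection(vertical_line, 0)
p90 = radon_projection(vertical_line, 90)
```

At 0 degrees all three pixels of the vertical line fall into the central bin, while at 90 degrees they spread over three neighboring bins, so the peak height reveals the orientation of the line.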
4.6. FEATURE EXTRACTION
In pattern recognition and in image processing, feature extraction is a
special form of dimensionality reduction.
When the input data to an algorithm is too large to be processed and it is
suspected to be notoriously redundant (much data, but not much information)
then the input data will be transformed into a reduced representation set of
features (also named features vector). Transforming the input data into the set of
features is called feature extraction. If the features extracted are carefully chosen
it is expected that the features set will extract the relevant information from the
input data in order to perform the desired task using this reduced representation
instead of the full size input.
Feature extraction involves simplifying the amount of resources required to
describe a large set of data accurately. When performing analysis of complex
data one of the major problems stems from the number of variables involved.
Analysis with a large number of variables generally requires a large amount of
memory and computation power or a classification algorithm which overfits the
training sample and generalizes poorly to new samples. Feature extraction is a
general term for methods of constructing combinations of the variables to get
around these problems while still describing the data with sufficient accuracy.
The discrete wavelet transform (DWT) decomposes an input signal into
low and high frequency components using a filter bank. The Haar wavelet, which
characterizes the filter bank, has the important properties of orthogonality,
linearity, and completeness. We can repeat the DWT multiple times to obtain a
multiple-level resolution over different octaves. For each level, wavelets can be
separated into different basis functions for image compression and recognition.
Fig 4.7 Multi-resolution expansion using Haar wavelet
The wavelet transform can be used to represent a two-dimensional (2D)
signal by the 2D resolution decomposition procedure, where an image is
repeatedly decomposed into an approximation and several detail components at
each level.
In order to construct the wavelet pyramid, we decide the number of Haar
coefficients and approximation levels. We would like to extract salient points
from any part of the image where “something” happens at any
resolution. A high wavelet coefficient (in absolute value) at a coarse resolution
corresponds to a region with high global variations. A properly chosen length
of the Haar wavelet and number of approximation levels provide the
optimum local key points or features.
4.6.1. WAVELETS
Wavelet analysis is performed using a prototype function called a
wavelet, which has the effect of a band pass filter. Wavelets are functions defined
over a finite interval and having an average value of zero. The basic idea of
wavelet transform is to represent any arbitrary functions f (t) as a superposition of
a set of such wavelets or basis function. These basis functions are derived from a
single prototype mother wavelet.
The term wavelet means a small wave. The smallness refers to the
condition that this window function is of finite length (compactly supported). The
‘wave’ refers to the condition that this function is oscillatory. The term ‘mother’
implies that the functions with different regions of support that are used in the
transformation process are derived from one main function, the mother wavelet,
by dilations (scaling) and translations (shifts).
4.6.2. WAVELET TRANSFORM
The wavelet transform is a mathematical tool that decomposes a signal
into a representation that shows signal details and trends as a function of time.
It is used to characterize transients, reduce noise, compress data, and perform
many other operations.
Wavelet analysis is a windowing technique, similar to the STFT, but with
variable-sized windows. Wavelet analysis is capable of revealing aspects of data
that other signal analysis techniques miss, including aspects such as trends,
breakdown points, discontinuities, and self-similarity. It is also often used to
compress or denoise a signal without any appreciable degradation.
4.6.3. THE DISCRETE WAVELET TRANSFORM
The discrete wavelet transform (DWT) transforms a discrete signal from the
time domain into the time-frequency domain. The transformation product is a set
of coefficients organized in a way that enables not only spectrum analysis of the
signal, but also analysis of the spectral behavior of the signal in time. This is
achieved by decomposing the signal into two components, each carrying
information about the source signal. Filters from the filter bank used for
decomposition come in pairs: low pass and high pass. The filtering is followed by
down sampling (the filtering result is “re-sampled” so that only every second
coefficient is kept). The low pass filtered signal contains information about the
slowly changing component of the signal; it looks very similar to the original
signal, only two times shorter in terms of samples. The high pass filtered signal
contains information about the fast changing component of the signal. In most
cases the high pass component is not very rich in data, which makes it well
suited to compression. In some cases, such as audio or video signals, it is
possible to discard some of the samples of the high pass component without any
noticeable change in the signal. The filters of the filter bank are called
“wavelets”.
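One level of this decomposition with the Haar filter pair can be sketched as below. The sketch assumes the orthonormal Haar filters (scaling by 1/√2) and an even-length input; filtering and down sampling are merged into one step by processing the samples in non-overlapping pairs. The function name `haar_dwt_1d` is an assumption.

```python
import math


def haar_dwt_1d(signal):
    """One-level Haar DWT: returns (approximation, detail) coefficients."""
    s = 1 / math.sqrt(2)
    # Low pass + down sampling: scaled sums of neighboring samples.
    approx = [(signal[i] + signal[i + 1]) * s
              for i in range(0, len(signal), 2)]
    # High pass + down sampling: scaled differences of neighboring samples.
    detail = [(signal[i] - signal[i + 1]) * s
              for i in range(0, len(signal), 2)]
    return approx, detail


signal = [4, 6, 10, 12, 8, 8, 0, 2]
approx, detail = haar_dwt_1d(signal)
```

Each output is half as long as the input: the approximation follows the slow trend of the signal, the detail holds the small sample-to-sample changes, and with the orthonormal scaling the total energy of the coefficients equals that of the signal.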
4.6.4. 2D-DISCRETE WAVELET TRANSFORM
The two-dimensional DWT can be implemented using digital filters and
down samplers. With separable two-dimensional scaling and wavelet functions,
we get one set of approximation coefficients and three sets of detail
coefficients: the horizontal, vertical and diagonal coefficients.
The concepts of one-dimensional DWT and its implementation through
sub band coding can be easily extended to two-dimensional signals for digital
images. In case of sub band analysis of images, we require extraction of its
approximate forms in both horizontal and vertical directions, details in horizontal
direction alone (detection of horizontal edges), details in vertical direction alone
(detection of vertical edges) and details in both horizontal and vertical directions
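(detection of diagonal edges). This analysis of 2-D signals requires the
following two-dimensional filter functions, formed by multiplying the
separable scaling and wavelet functions in the x (horizontal) and y (vertical)
directions, as defined below:
\[ \varphi(x, y) = \varphi(x)\,\varphi(y) \]
\[ \psi^{H}(x, y) = \psi(x)\,\varphi(y) \]
\[ \psi^{V}(x, y) = \varphi(x)\,\psi(y) \]
\[ \psi^{D}(x, y) = \psi(x)\,\psi(y) \]
Here φ(x, y), ψ^H(x, y), ψ^V(x, y) and ψ^D(x, y) represent the approximated
signal, signal with horizontal details, signal with vertical details and signal with
diagonal details respectively.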
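A one-level separable 2-D Haar decomposition can be sketched as follows. The sketch uses simple averages and half-differences rather than orthonormal scaling, assumes even image dimensions, and labels the four sub-bands by filter type (first letter: filter applied along each row, second letter: filter applied down each column) because the convention for which detail band is called "horizontal" differs between texts and software. All names are assumptions.

```python
def haar_pairs(values):
    """Low-pass (average) and high-pass (half-difference) of sample pairs."""
    low = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    high = [(values[i] - values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return low, high


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def filter_columns(matrix):
    """Apply haar_pairs down every column; return (low, high) matrices."""
    lows, highs = [], []
    for col in transpose(matrix):
        lo, hi = haar_pairs(col)
        lows.append(lo)
        highs.append(hi)
    return transpose(lows), transpose(highs)


def haar_dwt_2d(image):
    """One-level 2-D Haar DWT returning the LL, LH, HL and HH sub-bands."""
    low_rows, high_rows = [], []
    for row in image:                      # filter along each row first
        lo, hi = haar_pairs(row)
        low_rows.append(lo)
        high_rows.append(hi)
    ll, lh = filter_columns(low_rows)      # then down each column
    hl, hh = filter_columns(high_rows)
    return {"LL": ll, "LH": lh, "HL": hl, "HH": hh}


# Horizontal stripes: intensity changes only from one row to the next.
stripes = [[0, 0, 0, 0],
           [8, 8, 8, 8],
           [0, 0, 0, 0],
           [8, 8, 8, 8]]
bands = haar_dwt_2d(stripes)
```

For the striped image only the LL and LH bands are non-zero, because the intensity is constant along every row and varies only down the columns.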
CHAPTER 5
OVERVIEW OF THE PROJECT
5.1. AN OVERLAY OF OUR ALGORITHM
Fig 5.1 Overlay of our algorithm (blocks of the flow chart, in processing order)
CAPTURED HAND GESTURE
RESIZING IMAGE
GRAY SCALED IMAGE
HAND SEGMENTATION
MORPHOLOGICAL OPERATION
IMAGE NORMALIZATION
HAAR WAVELET TRANSFORM
FEATURE EXTRACTION
RECOGNISED
SPEECH SIGNAL
5.2. PROPOSED WORK
Here we approach gesture recognition through image processing. The hand
gesture is captured under the constraints of a constant background and a
constant zoom level. The captured image is resized to a common size so that
the system always processes a frame of constant dimensions. Thresholding the
RGB image directly would require a separate threshold value for each color
channel, so to reduce this complexity we convert it into a gray scale image. The
hand is then extracted using the skin color approach; we assume that the forearm
of the user is covered by clothing. A pixel is defined as a skin pixel if it
satisfies the following condition
Gray scale < Threshold
where gray scale denotes the intensity value of the input hand image.
By the skin color approach, a hand map image can then be defined. In the
hand map image, a white pixel (pixel value = 1) and a black pixel (pixel value =
0) indicate the skin and non skin pixels respectively.
The segmented hand image is binary and undergoes further
preprocessing. The binary image obtained is noisy and does not represent the
exact hand geometry. Noise produces holes in the hand region, which are
minimized using morphological operations. A structuring element of suitable
size is designed and image closing is performed to recover the geometry of the
hand.
In morphology, dilation expands an image and erosion shrinks it. Closing
tends to smooth contours: it generally fuses narrow breaks and long thin gulfs,
eliminates small holes, and fills gaps in the contour.
The closing of set A by structuring element B, denoted by A • B, is defined as
\[ A \bullet B = (A \oplus B) \ominus B \]
which, in words, says that the closing of A by B is simply the dilation of A
by B, followed by the erosion of the result by B.
The gesture captured at different times may be at a different angular
position. Our next step is to normalize the segmented hand to a common axis.
Image registration is performed to rotate the image to a constant image axis,
thereby reducing the number of training images in the database. This
normalization is done with the help of the Radon transform.
There are several choices for the selection of features in order to
discriminate between hands in a hand gesture recognition system.
In numerical analysis and functional analysis, a discrete wavelet
transform (DWT) is any wavelet transform for which the wavelets are discretely
sampled. As with other wavelet transforms, a key advantage it has over Fourier
transforms is temporal resolution: it captures both frequency and location
information.
For an input represented by a list of 2^n numbers, the Haar wavelet
transform may be considered to simply pair up input values, storing the
difference and passing the sum. This process is repeated recursively, pairing up
the sums to provide the next scale, finally resulting in 2^n − 1 differences and
one final sum.
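The pairing procedure just described can be written out directly. The sketch below follows the unnormalized sum and difference form of the text (no scaling by 1/√2) and requires the input length to be a power of two; the function name `haar_sum_difference` is an assumption.

```python
def haar_sum_difference(values):
    """Recursive Haar pairing: returns (final_sum, differences_per_level)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")

    sums = list(values)
    levels = []
    while len(sums) > 1:
        # Store the difference of each pair and pass on its sum.
        diffs = [sums[i] - sums[i + 1] for i in range(0, len(sums), 2)]
        sums = [sums[i] + sums[i + 1] for i in range(0, len(sums), 2)]
        levels.append(diffs)
    return sums[0], levels


total, levels = haar_sum_difference([1, 2, 3, 4, 5, 6, 7, 8])
count = sum(len(level) for level in levels)
```

For the eight input values (n = 3) the procedure runs for three levels and stores 4 + 2 + 1 = 7 differences, that is 2^3 − 1, together with the final sum 36.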
The feature used here is the horizontal component of the discrete wavelet
transformed image. This feature vector, unlike the statistical features used in
other methods, gives a better recognition rate.
This feature vector is correlated with the corresponding feature vector of
each training image. Correlation is a measure of how well two sets of values fit
one another, here the features of the test gesture and those of a stored gesture.
The magnitude of the correlation coefficient is a number between 0 and 1.
If there is no relationship between the two sets of values, the correlation
coefficient is 0 or very low (the match is no better than chance). As the strength
of the relationship increases, so does the correlation coefficient. A perfect fit
gives a coefficient of 1.0. Thus the higher the correlation coefficient, the better
the recognition.
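Matching by correlation can be sketched as below, using the Pearson correlation coefficient between a test feature vector and each stored template and selecting the template with the highest value. The template vectors are made-up numbers for illustration, and the names `correlation`, `recognise` and `templates` are assumptions; the project itself works on wavelet coefficients of real gesture images.

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0                      # a constant vector carries no pattern
    return cov / math.sqrt(var_a * var_b)


def recognise(test_vector, templates):
    """Label of the template that correlates best with the test vector."""
    return max(templates,
               key=lambda label: correlation(test_vector, templates[label]))


# Hypothetical feature vectors for three gestures.
templates = {"A": [1, 2, 3, 4], "B": [4, 3, 2, 1], "C": [1, 3, 1, 3]}
best = recognise([2, 4, 6, 9], templates)
```

The test vector rises steadily like template "A", so "A" gives the highest coefficient; template "B" falls steadily and yields a strongly negative value.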
Then comes the speech signal unit, which plays the sound file
corresponding to the recognized gesture, so that the recognized image is
announced audibly. This makes it easy for the speaking community to understand
what a speech-impaired person means to say. We have saved a .wav file for each
letter of the alphabet. The .wav file for the recognized gesture is read and
played using the wavread and wavplay commands.
Because this project only computes a wavelet transform and a correlation
for each gesture, it is computationally simpler than many other gesture
recognition systems, which is an added advantage. Moreover, the hardware
requirement is reduced.
CHAPTER 6
SOFTWARE DESCRIPTION
6.1. INTRODUCTION
The simulation tool used for the development of the software is MATLAB.
MATLAB stands for Matrix Laboratory. It is a technical computing environment
for high performance numeric computation and visualization. It integrates
numerical analysis, matrix computation, signal processing and graphics in an
easy to use environment, where problems and solutions are expressed just as they
are written mathematically, without traditional programming. MATLAB allows
us to express the entire algorithm in a few dozen lines, to compute the solutions
with great accuracy in a few minutes on a computer, and to readily manipulate a
three-dimensional display of the result in color.
The basic building block of MATLAB is the matrix. The fundamental
data type is the array; vectors, scalars, real matrices and complex matrices are
all automatically handled as special cases of this fundamental data type.
MATLAB also features a family of application-specific solutions called
‘toolboxes’. Areas in which toolboxes are available include signal processing,
image processing, control system design, dynamic system simulation, system
identification, neural networks, wavelets, communications and others. These
toolboxes are collections of functions written for special applications such as
symbolic computing, image processing, neural networks etc.
6.2. FEATURES OF MATLAB
Some of the special features of MATLAB are
6.2.1. COMMAND WINDOW
This is the main window. It is characterized by the MATLAB command
prompt ‘>>’. When the application is launched, the user is taken to this prompt.
All commands, including those for running user-written programs, are typed at
the MATLAB prompt.
6.2.2. GRAPHICS WINDOW
The outputs of all the graphics are flushed to the graphics (or) figure
window, a separate gray window with (default) white background. The user can
create as many figure windows, as the memory will allow.
6.2.3. EDIT WINDOW
This is where we write, edit, create and save our own programs in M-files.
Any text editor can be used to carry out these tasks. On most systems, such as
PCs and Macs, MATLAB provides its own built-in editor. On other systems, a
standard file editing program is invoked by typing a command at the command
prompt.
6.2.4. INPUT OUTPUT
MATLAB supports interactive computation, taking the input from the
screen and flushing the output to the screen. In addition, it reads input files and
writes output files. The following features hold for all forms of input-output.
6.2.5. DATA TYPE
The fundamental data type in MATLAB is the array. It encompasses
several distinct data objects: integers, matrices, doubles, character strings,
structures and cells. In most cases, however, data type (or) data object
declaration is not needed.
6.2.6. DIMENSIONING
Dimensioning is automatic in MATLAB. No dimension statement is
required for vectors (or) arrays. The commands ‘size’ and ‘length’ yield the
dimensions of an existing matrix (or) vector.
6.2.7. CASE SENSITIVITY
MATLAB is case sensitive, i.e. it differentiates lowercase and uppercase
letters. Thus ’a’ and ‘A’ are different variables. Most MATLAB commands and
built-in function calls are typed in lowercase letters.
6.3. IMAGES IN MATLAB
The basic data structure in MATLAB is the array, an ordered set of real or
complex elements. This object is naturally suited to the representation of images,
which are real-valued, ordered sets of color or intensity data. MATLAB stores
most images as two-dimensional arrays, in which each element of the matrix
corresponds to a single pixel in the displayed image. For example, an image
composed of 200 rows and 300 columns of different colored dots is stored in
MATLAB as a 200 by 300 matrix.
By default, MATLAB stores most data in arrays of class double. The data
in these arrays is stored as double precision (64-bit) floating-point numbers. All
of MATLAB’s function and capabilities work with these arrays.
The number of pixels in an image may be large; for example a 1000 by
1000 image has a million pixels. Since each pixel is represented by at least one
array element, this image would require about 8 megabytes of memory when
stored as class double.
In order to reduce memory requirements, MATLAB supports storing
image data in arrays of class uint8. The data in these arrays requires one eighth
as much memory as data in double arrays. Because the types of values that can be
stored in uint8 arrays and double arrays differ, the image processing toolbox uses
different conventions for interpreting the values in these arrays.
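The memory figures quoted above follow from simple arithmetic, shown here as a short sketch. The helper name `image_memory_bytes` is an assumption; a double occupies 8 bytes (64 bits) and a uint8 occupies 1 byte.

```python
def image_memory_bytes(rows, cols, bytes_per_element):
    """Memory needed for a single-channel image stored as a plain array."""
    return rows * cols * bytes_per_element


double_bytes = image_memory_bytes(1000, 1000, 8)   # class double
uint8_bytes = image_memory_bytes(1000, 1000, 1)    # class uint8
ratio = double_bytes // uint8_bytes
```

A 1000 by 1000 image therefore needs 8,000,000 bytes as double and 1,000,000 bytes as uint8, which is the factor of eight mentioned in the text.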
6.4. FILE TYPES
MATLAB has four types of files for storing information. They are
• M-files.
• Script files
• Function files.
• MAT files
6.4.1. M-FILES
M-files are standard ASCII text files with an .m extension to the file name.
There are two types of these files, namely script files and function files. Most
programs written in MATLAB are saved as M-files. Many of the functions
supplied with MATLAB and its toolboxes are themselves M-files, provided as
readable source code so that they can be copied and modified.
6.4.2. SCRIPT FILES
A script file is an M-file with a valid set of MATLAB commands in it. A
script file is executed by typing the name of the file (without the ‘.m’ extension)
on the command line. This is equivalent to typing all the commands stored in the
script file, one by one, at the MATLAB prompt. Naturally, script files work on
the global variables, i.e. variables currently present in the work space.
A script file may contain any number of commands, including those that
call built-in functions or functions written by the user. Script files are useful
when a certain set of commands has to be repeated several times.
6.4.3. FUNCTION FILES
A function file is also an M-file, like a script file, except that the variables
in a function file are all local. Function files are like programs or subroutines in
FORTRAN, procedures in PASCAL and functions in C.
A function file begins with a function definition line, which has a well-
defined list of inputs and outputs. Without this line the file becomes a script file.
The syntax of the function definition line is as follows:
function [output variables] = function_name(input variables)
where the function name should be the same as the name of the file in which the
function is written.
6.4.4. MAT-FILES
MAT-files are binary data files with a ‘.mat’ extension. MAT-files are created by
MATLAB when data is saved with the ‘save’ command. The data is written in a
MATLAB-specific binary format. MAT-files can be loaded into MATLAB using the
‘load’ command.
CHAPTER 7
SIMULATION RESULT
STEP: 1
Input images from different cameras may have varying dimensions
(M x N). To standardize the size of the image that is to be processed,
we resize it.
Fig 7.1 Resized image
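The resizing in Step 1 can be illustrated with a minimal sketch in Python (not the project's MATLAB code). The report does not state the target size or the interpolation method, so nearest-neighbour sampling and the 4 x 4 target used here are assumptions chosen for simplicity, and resize_nearest is a hypothetical name.

```python
def resize_nearest(image, new_rows, new_cols):
    """Resize a 2-D list of pixels to new_rows x new_cols by nearest-neighbour sampling."""
    rows, cols = len(image), len(image[0])
    return [
        # Map each output coordinate back to the source pixel it falls on.
        [image[r * rows // new_rows][c * cols // new_cols] for c in range(new_cols)]
        for r in range(new_rows)
    ]


small = [[1, 2],
         [3, 4]]
resized = resize_nearest(small, 4, 4)     # every source pixel becomes a 2 x 2 block
restored = resize_nearest(resized, 2, 2)  # shrinking again recovers the original
```

Whatever the input dimensions, the output always has the requested number of rows and columns, which is what lets later steps assume a fixed image size.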
STEP: 2
The RGB image obtained is converted to gray scale to simplify the thresholding
process. Each pixel of the gray scale image is a single intensity value formed by
combining the red, green and blue components.
Fig 7.2 Gray scale image
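The conversion in Step 2 can be sketched as a weighted sum of the three colour components. The weights below are the ones documented for MATLAB's rgb2gray function; that the project used rgb2gray is an assumption, and the Python function rgb_to_gray is a hypothetical illustration only.

```python
# Luminance weights documented for MATLAB's rgb2gray (assumed to be what the project used).
WEIGHT_R, WEIGHT_G, WEIGHT_B = 0.2989, 0.5870, 0.1140


def rgb_to_gray(rgb_image):
    """Convert a 2-D list of (R, G, B) tuples (0-255) to a 2-D list of gray levels."""
    return [
        [int(round(WEIGHT_R * r + WEIGHT_G * g + WEIGHT_B * b)) for (r, g, b) in row]
        for row in rgb_image
    ]


colours = [[(255, 255, 255), (0, 0, 0)],
           [(255, 0, 0), (0, 255, 0)]]
gray = rgb_to_gray(colours)
```

Green contributes most to the gray level and blue least, which roughly follows the sensitivity of the human eye.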
STEP: 3
Here we segment the hand region from the background by choosing an
appropriate threshold value. This process gives an outline of the hand region that
needs to be processed.
Fig 7.3 Segmented hand
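The segmentation in Step 3 amounts to comparing every gray level with a fixed threshold. The sketch below is in Python and is not the project's code. It assumes the hand is brighter than the background, and the level of 128 is a hypothetical value, since the report does not give the threshold that was used.

```python
def threshold(gray_image, level):
    """Return a binary image: 1 where the gray level exceeds `level`, otherwise 0."""
    return [[1 if pixel > level else 0 for pixel in row] for row in gray_image]


gray = [[10, 20, 200],
        [15, 220, 210],
        [12, 18, 25]]
mask = threshold(gray, 128)  # 1 marks pixels assumed to belong to the hand
```

If the hand were darker than the background, the comparison would simply be reversed.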
STEP: 4
In this step we create a structuring element and perform the image closing
operation. This process removes noise components, such as small holes and gaps
inside the hand region, and gives a clean geometry of the hand gesture.
Fig 7.4 Morphologically operated image
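Morphological closing is a dilation followed by an erosion with the same structuring element. The Python sketch below illustrates this on a binary image. It assumes a 3 x 3 square structuring element, which the report does not specify, and the function names are hypothetical. Pixels outside the image are simply ignored, so objects touching the border may behave differently from a toolbox implementation.

```python
def _window(image, r, c):
    """Yield the in-bounds pixels of the 3 x 3 window centred on (r, c)."""
    rows, cols = len(image), len(image[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                yield image[rr][cc]


def dilate(image):
    """A pixel becomes 1 if any pixel in its 3 x 3 window is 1."""
    return [[max(_window(image, r, c)) for c in range(len(image[0]))]
            for r in range(len(image))]


def erode(image):
    """A pixel stays 1 only if every pixel in its 3 x 3 window is 1."""
    return [[min(_window(image, r, c)) for c in range(len(image[0]))]
            for r in range(len(image))]


def closing(image):
    """Closing = dilation followed by erosion."""
    return erode(dilate(image))


# A 5 x 5 square "hand" inside a 9 x 9 image, with a one-pixel hole caused by noise.
solid = [[1 if 2 <= r <= 6 and 2 <= c <= 6 else 0 for c in range(9)] for r in range(9)]
noisy = [row[:] for row in solid]
noisy[4][4] = 0
closed = closing(noisy)  # the hole is filled and the outline is unchanged
```

Closing fills holes and narrow gaps in the foreground. It does not remove isolated foreground specks, for which an opening (erosion followed by dilation) would be the usual choice.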
STEP: 5
Here we rotate the image so that the hand is aligned to a common axis. This
greatly reduces the number of images needed in the database, since a gesture
need not be stored at several orientations.
Fig 7.5 Normalized image
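The report does not say how the rotation angle is found. One common approach, assumed here purely for illustration, is to take the principal axis of the segmented hand from its second-order central moments and then rotate the image by the negative of that angle. The Python function orientation_degrees below is a hypothetical sketch of the angle computation only.

```python
import math


def orientation_degrees(mask):
    """Angle of the principal axis of the foreground pixels, in degrees.

    x runs along columns and y along rows (downwards), so the angle is
    measured from the column axis towards the row axis.
    """
    points = [(c, r) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not points:
        return 0.0  # no foreground: nothing to align
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Second-order central moments of the foreground region.
    mu20 = sum((x - mean_x) ** 2 for x, _ in points)
    mu02 = sum((y - mean_y) ** 2 for _, y in points)
    mu11 = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return math.degrees(0.5 * math.atan2(2 * mu11, mu20 - mu02))


diagonal = [[1, 0, 0],
            [0, 1, 0],
            [0, 0, 1]]
angle = orientation_degrees(diagonal)  # the diagonal lies at 45 degrees
```

Rotating each input by the negative of this angle brings every hand to the same axis before matching, which is what allows a single stored image per gesture.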
STEP: 6
This step performs the feature extraction. The horizontal detail component of the
wavelet-transformed test image is used for recognition. This step also reduces the
database size considerably.
Fig 7.6 Horizontal vector of DWT
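A single-level 2-D Haar transform splits an image into an approximation and three detail sub-bands, each with half the rows and half the columns of the input. The Python sketch below is an illustration and not the project's MATLAB code. It assumes one decomposition level, an image with even height and width, and orthonormal scaling. The sign of the detail coefficients may differ from that returned by MATLAB's dwt2, and haar_dwt2 is a hypothetical name.

```python
def haar_dwt2(image):
    """Single-level 2-D Haar transform of an image with even height and width.

    Returns (approx, horizontal, vertical, diagonal) as 2-D lists.
    """
    approx, horizontal, vertical, diagonal = [], [], [], []
    for r in range(0, len(image), 2):
        row_a, row_h, row_v, row_d = [], [], [], []
        for c in range(0, len(image[0]), 2):
            # The 2 x 2 block: a b on the top row, d e on the bottom row.
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            row_a.append((a + b + d + e) / 2)  # local average
            row_h.append((a + b - d - e) / 2)  # top row minus bottom row
            row_v.append((a - b + d - e) / 2)  # left column minus right column
            row_d.append((a - b - d + e) / 2)  # diagonal difference
        approx.append(row_a)
        horizontal.append(row_h)
        vertical.append(row_v)
        diagonal.append(row_d)
    return approx, horizontal, vertical, diagonal


# The left block holds a horizontal edge, the right block a vertical edge.
image = [[1, 1, 1, 0],
         [0, 0, 1, 0]]
cA, cH, cV, cD = haar_dwt2(image)
feature_vector = [value for row in cH for value in row]  # flattened horizontal detail
```

The horizontal sub-band responds to the horizontal edge and ignores the vertical one. Because it holds a quarter as many values as the input image, storing it in place of the full image is consistent with the reduction in database size described above.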
CHAPTER 8
CONCLUSION
The inspiration behind this project came from the thought of helping to
alleviate the language barrier which stands between the speech-impaired and
hearing communities. Attempting to translate finger spelling to the spoken English
alphabet was just a small step towards achieving this ultimate goal. The resulting
gesture recognition approach achieved this desired step. The discrete wavelet
transform and the normalization of the image axis help to reduce the database
size. The wavelet transform is time efficient because it is computationally
simple. The normalization also provides a uniform pattern to correlate with the
training images, so that the speech signal corresponding to the matching image
can be output. Our method appears promising, as it reduces both the error rate
and the processing time.